\documentclass{article}
\usepackage{arxiv}
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc} % use 8-bit T1 fonts
\usepackage{hyperref} % hyperlinks
\usepackage{url} % simple URL typesetting
\usepackage{booktabs} % professional-quality tables
\usepackage{amsfonts} % blackboard math symbols
\usepackage{nicefrac} % compact symbols for 1/2, etc.
\usepackage{microtype} % microtypography
\usepackage{lipsum}
\usepackage{graphicx}
\title{Airbnb Listing Price Prediction for Seattle}
\author{
Aashirwad Kumar \\
Integrated MSc in Mathematics and Computing\\
Department of Mathematics\\
Birla Institute of Technology, Mesra\\
Ranchi, Jharkhand-835215 \\
\texttt{imh10004.18@bitmesra.ac.in} \\
\And
Anmol Sharma \\
Integrated MSc in Mathematics and Computing\\
Department of Mathematics\\
Birla Institute of Technology, Mesra\\
Ranchi, Jharkhand-835215 \\
\texttt{imh10057.18@bitmesra.ac.in} \\
\AND
Mentor: Ankit Tewari \\
Artificial Intelligence Engineer \\
Knowledge Engineering and Machine Learning Group \\
\texttt{ankit.tewari@estudiant.upc.edu} \\
}
\begin{document}
\maketitle
\begin{abstract}
Finding suitable accommodation while travelling has been a persistent problem in recent years. Airbnb has changed the way people find places to stay: it lets hosts open their homes to visitors and lets guests stay in more interesting places than conventional hotels.
Using data analysis and data visualization tools, we draw insights from the Seattle Airbnb data set, and we use these findings to build machine learning models that predict listing prices.
\end{abstract}
\section{Introduction}
The input to this project is the open-source Seattle Airbnb data set, which consists of three CSV files: (i) Listings, (ii) Calendar, and (iii) Reviews.
We first clean the data set using Python's data-cleaning tools. We then apply data analysis and visualization techniques to study the features that affect listing prices. A major aim of the project is to predict prices with a KNN regression model and a linear regression model.

The work can be broken down into four tasks:
\begin{enumerate}
\item Data cleaning
\item Data processing
\item Analysis of the features of the data set
\item Answering questions such as: How does the neighbourhood affect price? Which is the best time of year to visit Seattle? What are good reviews for a listing? How are price and availability distributed throughout the year? Which kinds of property are most common, and are specific locations favourable for a particular type of listing?
\end{enumerate}
\section{Data Set}
The data consist of three files:
\begin{itemize}
\item \textbf{Listings} contains complete information on each listing (host): location, coordinates, reviews, price, and availability. It has 3,818 rows (indexed 0 to 3817) and 92 columns.
\item \textbf{Calendar} contains day-by-day values and daily statistics for the listings. It has 1,393,570 rows (indexed 0 to 1393569) and 4 columns.
\item \textbf{Reviews} contains the reviews left by customers. It has 84,849 rows (indexed 0 to 84848) and 6 columns.
\end{itemize}
Using Pandas and NumPy, we visualize the number of null values in all three data sets and then remove the rows with missing values.
\subsection{Null Values in the Data Sets}
We visualized the number of null values in the Listings data set, and likewise in the Calendar and Reviews data sets.
\includegraphics[height=12cm,width=17cm]{tt12.png}
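As a minimal sketch of this cleaning step, the following Python snippet counts the null values in each column and then drops incomplete rows with pandas. The small \texttt{listings} table and its column names are hypothetical stand-ins for the real Listings file, and the snippet is an illustration rather than the project's actual code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Listings data; the real file has 92 columns.
listings = pd.DataFrame({
    "price": [85.0, 150.0, np.nan, 120.0],
    "bedrooms": [1.0, np.nan, 2.0, 3.0],
    "beds": [1.0, 2.0, 2.0, 3.0],
})

# Count missing values per column (the quantity a null-value bar chart plots).
null_counts = listings.isnull().sum()

# Keep only the rows that have no missing value in any column.
cleaned = listings.dropna(axis=0, how="any")

print(null_counts)
print(len(listings), "rows before,", len(cleaned), "rows after")
```

Dropping every row with any missing value is the simplest policy; on a wide table such as Listings it can discard many rows, so dropping sparse columns first may be worth considering.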
\section{Features and Analysis of the Data Set}
We analyzed the Listings data set and tried to answer a number of questions that help tourists select the best time to visit and the kind of listing that suits them. Here we present a few important inferences from the project.
The figure below shows the distribution of listings across Seattle as a heat map drawn with folium.
\includegraphics[height=5cm,width=15cm]{tt13.png}
\subsection{Relation of Neighbourhood to Price, Plotted on a Map}
We map the listing coordinates and use colour to show how prices vary across the different areas of Seattle. We also present a scatter plot showing the relation between neighbourhood, price, property type, and room type.
\includegraphics[height=10cm,width=15cm]{tt3.png}
\subsection{Analyzing Reviews of Listings}
We plot the review scores that customers gave to listings. The plot reveals a trend in the Airbnb data: very few listings are rated below 8.
\includegraphics[height=7cm,width=16cm]{tt4.png}
\subsection{Correlation of Availability with Other Variables Using Heat Maps}
The heat maps show the correlation of availability over (i) one month, (ii) two months, (iii) three months, and (iv) one year with:
\begin{enumerate}
\item Price
\item Property type
\item Room type and room features like beds, bathrooms and bedrooms.
\item Reviews.
\end{enumerate}
We see very little correlation between them, which is surprising; it suggests that the availability of a listing is not strongly affected by these features.
\includegraphics[height=5cm,width=18cm]{tt5.png}
\subsection{Analysis of Changes in Variables over Time}
We plot the change in the average number of listings, the average price, and the average availability over time.
The price of listings drops in January, June, July, and August of 2015, which coincides with a larger number of listings in those months.
\includegraphics[height=5cm,width=18cm]{tt6.png}
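A monthly summary of this kind could be computed roughly as follows. This is only a sketch: the column names (\texttt{listing\_id}, \texttt{date}, \texttt{available}, \texttt{price}), the \texttt{t}/\texttt{f} availability flags, the dollar-formatted price strings, and the six sample rows are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows in the shape of the Calendar file; column names are assumed.
calendar = pd.DataFrame({
    "listing_id": [1, 1, 2, 2, 1, 2],
    "date": ["2015-01-04", "2015-01-05", "2015-01-04",
             "2015-07-01", "2015-07-01", "2015-07-02"],
    "available": ["t", "f", "t", "t", "t", "t"],
    "price": ["$85.00", None, "$100.00", "$1,200.00", "$95.00", "$130.00"],
})

calendar["date"] = pd.to_datetime(calendar["date"])
# Strip "$" and thousands separators, then convert to float (missing stays NaN).
calendar["price"] = (
    calendar["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)
calendar["is_available"] = calendar["available"] == "t"

# One row per calendar month: mean price, share of available days, listing count.
monthly = calendar.groupby(calendar["date"].dt.month).agg(
    avg_price=("price", "mean"),
    availability=("is_available", "mean"),
    listings=("listing_id", "nunique"),
)
print(monthly)
```

The mean price skips missing values, so days on which a listing is unavailable and has no price do not pull the monthly average down.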
\section{Methods and Experiments}
\label{sec:headings}
We use KNN regression and a linear regression model to predict the price of listings from features that are suitable for regression and are significantly correlated with price.
This correlation is checked with the \texttt{corr} method in Python and with heat maps.
\subsection{Heat Map}
A heat map is a two-dimensional representation of data in which values are represented by colours. A simple heat map provides an immediate visual summary of information. More elaborate heat maps allow the viewer to understand complex data sets.
There are many ways to display heat maps, but they all share one thing: they use colour to communicate relationships between data values that would be much harder to understand if presented numerically in a spreadsheet.
\includegraphics[height=5cm,width=18cm]{tt7.png}
We see that the following features are very strongly correlated with price:
\begin{itemize}
\item Bedrooms
\item Bathrooms
\item Beds
\item Accommodates
\end{itemize}
We therefore use these features to predict price, both with a KNN method built from scratch and with linear regression, whose coefficients show how the predicted price depends on each variable.
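The feature selection described above can be sketched as follows. The toy table, the column names, and the threshold of 0.5 are assumptions chosen for illustration; they are not values taken from the project.

```python
import pandas as pd

# Hypothetical toy listings; column names are assumptions for illustration.
listings = pd.DataFrame({
    "price":          [60, 85, 120, 150, 200, 320],
    "accommodates":   [1, 2, 3, 4, 6, 8],
    "bedrooms":       [1, 1, 2, 2, 3, 4],
    "minimum_nights": [2, 1, 3, 1, 3, 2],
})

# Pearson correlation of every numeric column with price.
corr_with_price = listings.corr()["price"].drop("price")

# Keep features whose absolute correlation exceeds a chosen threshold.
threshold = 0.5  # illustrative cut-off, not taken from the project
selected = corr_with_price[corr_with_price.abs() > threshold].index.tolist()

print(corr_with_price.sort_values(ascending=False))
print(selected)
```

In this toy data \texttt{accommodates} and \texttt{bedrooms} pass the threshold while \texttt{minimum\_nights} does not, which mirrors the kind of selection read off the heat map.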
\subsection{KNN Regression}
\[\Pr(Y=j \mid X=x_0)=\frac{1}{K}\sum_{i\in N_0} I(y_i=j)\]
K-nearest neighbours (KNN) is a simple algorithm that stores all available cases and uses them to predict new, unseen cases. KNN is a type of instance-based learning, or lazy learning, in which the function is only approximated locally and all computation is deferred until prediction. It is among the simplest of all machine learning algorithms.
We divide the data set into training and test sets, with the test set containing 40 percent of the data.
We then compute the RMSE on the test data for a range of values of K and take the value with the minimum RMSE as the optimal K.
We define a predict function for this purpose.
The function computes the distances from a new point to the training points, finds the K nearest neighbours, and returns the mean of their \texttt{y\_train} values as the predicted value.
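The procedure can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the project's code: the function names \texttt{knn\_predict} and \texttt{rmse}, the synthetic data, the Euclidean distance, and the range of K from 1 to 20 are all hypothetical choices.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k):
    """Predict each row of X_new as the mean target of its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    # Pairwise Euclidean distances, shape (n_new, n_train).
    diffs = X_new[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Indices of the k smallest distances for every new point.
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic data standing in for (bedrooms, bathrooms, beds, accommodates) -> price.
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(200, 4)).astype(float)
y = 40.0 * X.sum(axis=1) + rng.normal(0.0, 10.0, size=200)

# Shuffle, then hold out 40 percent of the rows for testing.
order = rng.permutation(len(X))
n_test = int(0.4 * len(X))
test_idx, train_idx = order[:n_test], order[n_test:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Try a range of K values and keep the one with the smallest test RMSE.
errors = {k: rmse(y_test, knn_predict(X_train, y_train, X_test, k))
          for k in range(1, 21)}
best_k = min(errors, key=errors.get)
print(best_k, round(errors[best_k], 2))
```

Choosing K on the test set follows the description above; a separate validation split would give a less optimistic estimate of the final error.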
\subsubsection{Root Mean Square Error}
\includegraphics[height=1cm,width=3cm]{tt8.png}
Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit. Root mean square error is commonly used in climatology, forecasting, and regression analysis to verify experimental results.
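For reference, the standard definition can be written out as follows, where the notation is assumed here for illustration: $n$ is the number of test observations, $y_i$ is the actual value, and $\hat{y}_i$ is the predicted value.
\[\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}\]
As a small worked example, two predictions that miss by 3 and 4 give $\mathrm{RMSE}=\sqrt{(3^2+4^2)/2}=\sqrt{12.5}\approx 3.54$.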
Graph of RMSE against the value of K:
\includegraphics[height=3cm,width=3cm]{tt9.png}
\subsection{Linear Regression}
Linear regression is a linear approach to modelling the relationship between the target (dependent) variable and the other (independent) variables.
Mathematically, the model is
\[f(X)=b_0+b_1X_1+b_2X_2+\dots+b_pX_p\]
where $b_0$ is the intercept, $b_1,b_2,\dots,b_p$ are the coefficients, and $X_1,X_2,\dots,X_p$ are the independent variables.
Each coefficient describes how strongly its independent variable is related to the dependent variable: $b_j$ is the amount by which $f(X)$ changes for a one-unit increase in $X_j$ when the other variables are held fixed.
After predicting the values, we fit a linear regression to obtain the coefficients of the independent variables.
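The meaning of the coefficients can be illustrated with an ordinary least squares fit in NumPy. This is a sketch on synthetic data with a known relationship; the variable names and the numbers 30, 25, and 10 are assumptions for illustration, not results from the project.

```python
import numpy as np

# Synthetic data: two independent variables and a known linear relationship.
rng = np.random.default_rng(1)
X = rng.uniform(1.0, 5.0, size=(100, 2))
y = 30.0 + 25.0 * X[:, 0] + 10.0 * X[:, 1] + rng.normal(0.0, 1.0, size=100)

# Prepend a column of ones so the first fitted value is the intercept b0.
design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimise ||design @ b - y||^2.
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
b0, b1, b2 = coeffs

# A one-unit increase in X1 changes the prediction by b1, all else held fixed.
x = np.array([1.0, 2.0, 3.0])       # [1, X1, X2]
x_plus = np.array([1.0, 3.0, 3.0])  # same point with X1 increased by one unit
print(b0, b1, b2)
print(x_plus @ coeffs - x @ coeffs)  # equals b1 up to rounding
```

The fitted coefficients should land close to the values used to generate the data, and the final line shows the one-unit interpretation directly.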
The figure below shows regression plots of the most important variables against price.
\includegraphics[height=6cm,width=8cm]{tt11.png}
\subsection{Results}
\label{sec:others}
We predicted the price of listings using the KNN regression algorithm and checked the accuracy of our predictions using the RMSE and a distribution plot of the test values (\texttt{ytest}) against the predicted values. For randomly varied values of K, the predictions are fairly accurate.
The figure below shows the distribution plot of the actual and predicted values.
\includegraphics[height=6cm,width=8cm]{tt10.png}
\section{Conclusion and Future Work}
This project shows the important correlations between various amenities and features of listings in Seattle.
After applying analysis, visualization, and correlation methods, we identified the variables needed for price prediction.
We predicted listing prices using KNN and linear regression. The RMSE helps select the best value of K for the KNN algorithm, and the accuracy of the predicted price is judged using distribution plots.
Although KNN regression appears fairly accurate when using a few regression variables, more advanced machine learning algorithms could also be employed for price prediction, which could increase accuracy and further reduce variance.
In future, we could also use algorithms that take qualitative (categorical) data into account to obtain more precise analysis and prediction.
\section{Acknowledgment}
We sincerely thank our mentor, Mr. Ankit Tewari, for supporting us throughout this project with his insightful views and for helping us understand the key concepts of KNN and linear regression.
\section{References}
\begin{itemize}
\item \url{https://jakevdp.github.io/PythonDataScienceHandbook/04.13-geographic-data-with-basemap.html}
\item \url{https://stackoverflow.com/questions/17682216/scatter-plot-and-color-mapping-in-python}
\item \url{https://www.statisticshowto.datasciencecentral.com}
\item \url{https://statisticsbyjim.com/regression/}
\item \url{https://seaborn.pydata.org/generated/}
\item \url{https://stackoverflow.com}
\end{itemize}
\subsection{GitHub Link for the Project}
\url{https://github.com/aashirwad01/Project_knn.git}
\end{document}