Introduced by Tuhin Sarkar, Vatsal Rathod, Victor Okoro, Pierre-Louis Ehret, and Aayusi Biswas

The real estate market has been an important part of society ever since civilizations came into existence, and analyzing it has long been an integral part of decision making. Our application uses machine learning algorithms to predict the price of a property up to two years into the future. Currently, it works on data specific to Brooklyn.

Real estate consultants have long overcharged buyers in the name of 'expert predictions', and the lack of knowledge on the buyer's part only fuels their confidence. Another issue is that builders and architects, when left on their own, often end up investing in the wrong properties. This can eventually add to the loan losses a country has to bear.

We built a mobile application that predicts the price of a specified property based on the neighborhood, building class, build year, and area of the plot. The intended interface is very simple: a user downloads the app, fills in the required fields, and hits the graph button. The result is a graph of the predicted future prices of the specified property.
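For concreteness, here is a minimal Python sketch of the inputs such a query would carry. The field names and the predict_prices helper are hypothetical, not the app's actual code.

```python
# Minimal sketch of the inputs a user fills in before hitting the graph button.
# Field names and predict_prices() are illustrative, not the app's real API.
from dataclasses import dataclass

@dataclass
class PropertyQuery:
    neighborhood: str      # e.g. "Park Slope"
    building_class: str    # NYC building class code, e.g. "A1"
    year_built: int
    land_area_sqft: float

def predict_prices(query: PropertyQuery, horizon_years: int = 2) -> list[float]:
    """Return one predicted sale price per future year (the trained model would be called here)."""
    raise NotImplementedError

# query = PropertyQuery("Park Slope", "A1", 1931, 2000.0)
# prices = predict_prices(query)  # plotted as the graph shown in the app
```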

After web scraping enough data about the Brooklyn real estate market, we began by cleaning it. We removed all rows and columns containing NULL values. We then condensed the data using knowledge of historical events such as the 2008 recession and the 2016-17 price surge, which let us restrict the training data to 2010-2015 and the testing data to 2016-17. This left us with 74,000 rows and 26 columns, down from the 400,000 rows and 109 columns we started with. Finally, we ran a correlation matrix to find the features that affect the sale price the most over time.
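The following pandas sketch illustrates the kind of cleaning and correlation pass described above; the file name and column names are assumptions, not our actual pipeline.

```python
# Rough sketch of the cleaning step, assuming the scraped Brooklyn sales data sits in
# a CSV with columns such as "sale_date" and "sale_price" (names are illustrative).
import pandas as pd

df = pd.read_csv("brooklyn_sales.csv", parse_dates=["sale_date"])

# Remove columns that are entirely NULL, then any remaining row with a NULL value.
df = df.dropna(axis=1, how="all").dropna(axis=0, how="any")

# Restrict to the window between the 2008 recession and the 2016-17 price surge.
year = df["sale_date"].dt.year
train = df[(year >= 2010) & (year <= 2015)]
test = df[year.isin([2016, 2017])]

# Correlate numeric features with the sale price to pick the most relevant ones.
corr = train.corr(numeric_only=True)["sale_price"].sort_values(ascending=False)
print(corr.head(15))
```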

We label encoded all the string/object features and normalized the sale price to reduce the skewness of the data, which concluded the data processing. For training, we split the training data into chunks to speed up the process and improve accuracy. Because the data was large and we needed to perform feature selection, we wrote functions for model training and evaluation across several methods: Ridge, Lasso, gradient boosting, XGBoost, and Random Forest. The best testing accuracy, measured by closeness to the actual price, came from the XGBoost model: 66% for 2016 and 76% for 2017, with a training accuracy of 99% and a validation accuracy of 80%.
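The sketch below, assuming the cleaned train/test frames from the previous step, shows roughly how the label encoding, log-transform of the sale price, and model comparison could be wired up with scikit-learn and XGBoost; hyperparameters and column names are illustrative.

```python
# Condensed sketch of the preprocessing and model comparison, assuming the cleaned
# `train` and `test` frames from the previous step (column names are illustrative).
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

# Label encode every string/object column, fitting the encoder on train and test
# together so both splits share the same integer codes.
combined = pd.concat([train, test], keys=["train", "test"])
for col in combined.select_dtypes(include="object").columns:
    combined[col] = LabelEncoder().fit_transform(combined[col].astype(str))
train_enc, test_enc = combined.loc["train"].copy(), combined.loc["test"].copy()

# Log-transform the sale price to reduce its skewness; keep only numeric features.
drop_cols = ["sale_price", "sale_date"]
X_train, y_train = train_enc.drop(columns=drop_cols), np.log1p(train_enc["sale_price"])
X_test, y_test = test_enc.drop(columns=drop_cols), np.log1p(test_enc["sale_price"])

models = {
    "ridge": Ridge(),
    "lasso": Lasso(),
    "gradient_boost": GradientBoostingRegressor(),
    "random_forest": RandomForestRegressor(n_estimators=200, n_jobs=-1),
    "xgboost": XGBRegressor(n_estimators=500, learning_rate=0.05),
}

# Fit each model and report R^2 on the 2016-17 hold-out data.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {r2_score(y_test, model.predict(X_test)):.3f}")
```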

We had actually started by looking for data on Delhi, India and failed miserably; that's when we realized that Indian government sites aren't updated regularly. We then shifted to Melbourne, which had few historical records to offer. Finally, we hit the jackpot with Brooklyn. Along the way we learned about understanding data correlation and selecting features by relevance while cleaning and processing the data, and, finally, that every prediction model has its own specific features and structure. It's all about what data you have and how you shape your code.