Overview:
- dataset: LA with 500 results
- 428 datapoints after removing NaN rows (see the data-prep sketch after this list)
- no train/test split
- graphs use inverse-transformed (unscaled) data
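A minimal data-prep sketch of the steps above, assuming a hypothetical file name (la_listings.csv) and the column names mentioned in these notes; adjust both to the actual dataset:

```python
# Hypothetical data-prep sketch: file name and column names are assumptions.
import pandas as pd

df = pd.read_csv("la_listings.csv")   # ~500 scraped LA results (assumed path)
df = df.dropna()                      # remove rows with NaN -> ~428 datapoints remain

# Scale 'price' and 'LivingArea' by 1/1000 so the network trains on smaller numbers;
# graphs later use the inverse-transformed (unscaled) values.
df["price"] = df["price"] / 1000
df["LivingArea"] = df["LivingArea"] / 1000

print(len(df), "datapoints after removing NaN rows")
```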
How did your model fare?
- The MSE is 131769005358912.25, computed on the scaled data
- scaling: ‘price’, ‘LivingArea’, and the predicted price are multiplied by 1/1000
- error estimate (y-y_pred): the lowest is 2985.374634
- the y and y_pred used here are the scaled values
- Although I do not have any other model to compare against, judging by the
error estimate alone there is still a long way to go before reaching the
ideal error at or near 0.
- this model has limitations and hence yielded such a suboptimal result
- because it is a neural network with only 1 layer
- and 500 results/datapoints are not enough to build a good model
- therefore, ideally, we would need more than 500 and at least thousands if not
tens of thousands of datapoints, in addition to adding a couple more
layers to make the model better. Also, splitting the dataset into train and test
groups may help, but only once we have more datapoints; 500 datapoints
render splitting not very effective (see the sketch after this list).
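A sketch of the suggested improvements (a train/test split plus a couple more layers). This is not the exact model used above; the file name, column names, and layer sizes are assumptions:

```python
# Sketch only: deeper network + train/test split, with assumed file/column names.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from tensorflow import keras

df = pd.read_csv("la_listings.csv").dropna()
df[["price", "LivingArea"]] = df[["price", "LivingArea"]] / 1000  # same 1/1000 scaling

features = ["bedrooms", "bathrooms", "latitude", "longitude", "LivingArea", "yearBuilt"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features].values, df["price"].values, test_size=0.2, random_state=42
)

# A couple of hidden layers instead of a single-layer network
model = keras.Sequential([
    keras.layers.Input(shape=(len(features),)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=100, verbose=0)

# Evaluate on the held-out test set instead of the training data
y_pred = model.predict(X_test).ravel()
print("test MSE:", mean_squared_error(y_test, y_pred))
```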
In your estimation is there a particular variable that may improve model performance?
Which of the predictions were the most accurate? In which percentile do these most accurate predictions reside? Did your model trend towards over or under predicting home values?
- row 23:
  - actual price: 2800; bedrooms: 2; bathrooms: 2; LivingArea: 1200
- row 252:
  - actual price: 89888; bedrooms: 1; bathrooms: 1; LivingArea: 520
- row 23 and row 252 of the dataframe are the most accurate predictions
- because their y-y_pred values are the two smallest (2985 and 91160)
- from the third smallest on, the % change between successive differences is not
as large as between these first two
- the percentile in which these two most accurate predictions reside is the top 0.467%
- 2/428
- because there are 428 total datapoints or rows
- the model trends toward overestimating because, when I count the number of
positive vs. negative y-y_pred values, all of them are positive (see the sketch after this list)
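A small sketch of how the accuracy ranking, the percentile, and the sign count can be reproduced. It assumes y (actual, scaled) and y_pred (predicted, scaled) are the aligned arrays produced by the model; the function name is hypothetical:

```python
# Residual summary sketch: y and y_pred are assumed to be aligned 1-D NumPy arrays.
import numpy as np

def residual_summary(y, y_pred, n_best=2):
    errors = y - y_pred                                # per-row error estimate
    best_idx = np.argsort(np.abs(errors))[:n_best]     # rows with the smallest errors

    print("most accurate rows:", best_idx, "errors:", errors[best_idx])
    print(f"these rows sit in the top {n_best / len(errors):.3%} of predictions")

    # Sign of the errors shows whether the model leans to one side overall
    print("positive errors:", int(np.sum(errors > 0)),
          "| negative errors:", int(np.sum(errors < 0)))

# Example call: residual_summary(y, y_pred)
```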
Which feature appears to be the most significant predictor?
- I ran the model 6 times, each time with only one of the following features and with the target ‘price’
- features: bathrooms, bedrooms, latitude, longitude, livingArea, yearBuilt
- then I computed the MSE for each of the 6 runs
- ‘bedrooms’ came out to have the lowest MSE (3497)
- thus, it would be the most significant predictor compared to the other 5
- because MSE tells how close predicted values are to actual values
- and the lower the MSE, the better (or more accurate) the model; see the sketch after this list
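A sketch of the single-feature comparison. The notes do not say exactly how each one-feature model was built, so a plain LinearRegression stands in here (an assumption), and the file/column names are also assumptions:

```python
# Single-feature MSE comparison sketch (LinearRegression stands in for the
# one-layer network; file and column names are assumptions).
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = pd.read_csv("la_listings.csv").dropna()
df[["price", "LivingArea"]] = df[["price", "LivingArea"]] / 1000

candidates = ["bathrooms", "bedrooms", "latitude", "longitude", "LivingArea", "yearBuilt"]
mse_by_feature = {}
for feature in candidates:
    model = LinearRegression().fit(df[[feature]], df["price"])
    mse_by_feature[feature] = mean_squared_error(df["price"], model.predict(df[[feature]]))

# Lower MSE = predictions closer to actual values
for feature, mse in sorted(mse_by_feature.items(), key=lambda kv: kv[1]):
    print(f"{feature}: MSE = {mse:,.2f}")
```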
- REVISION:
- ‘neighborhood’ is the most significant predictor of ‘price’ because,
looking along the ‘price’ column/row of the heatmap, ‘neighborhood’ has the highest
correlation coefficient (0.83) compared to the other variables
- heatmap (see the reproduction sketch below)
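A sketch of how such a heatmap can be produced. ‘neighborhood’ is text, so it must be numerically encoded before it can appear in a correlation matrix; the encoding below, like the file and column names, is an assumption about how that was done:

```python
# Correlation heatmap sketch; file/column names and the neighborhood encoding
# are assumptions.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("la_listings.csv").dropna()
df["neighborhood"] = df["neighborhood"].astype("category").cat.codes  # assumed encoding

corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature correlations (look along the 'price' row/column)")
plt.tight_layout()
plt.show()

print(corr["price"].sort_values(ascending=False))  # highest correlation with price first
```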