You may need to install Cairo on your operating system to run this notebook. To learn more, see the Cairo documentation at https://www.cairographics.org/download/.
if(!require(Cairo)) install.packages("Cairo", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(car)) install.packages("car", repos = "http://cran.us.r-project.org")
if(!require(nortest)) install.packages("nortest", repos = "http://cran.us.r-project.org")
library(readr)
library(ggplot2)
library(knitr)
library(tidyverse)
library(caret)
library(leaps)
library(car)
library(mice)
library(scales)
library(RColorBrewer)
library(plotly)
library(nortest)
library(lmtest)
The purpose of this project is to predict the price of houses in California in 1990 based on a number of possible location-based predictors, including latitude, longitude, and information about other houses within a particular block.
While this project focuses on prediction, we are fully aware that housing prices have increased dramatically since 1990, when the data was collected. This model should not be used to predict today’s housing prices in California. This is purely an academic endeavor to explore statistical prediction.
The goal of the project is to create the model that best predicts home prices in California, given a reasonable train/test split of the data.
We’re using the California Housing Prices dataset (housing.csv) from the following Kaggle site: https://www.kaggle.com/camnugent/california-housing-prices. This data pertains to the houses found in a given California district and includes some summary statistics about them based on the 1990 census data.
housing_data = read_csv("housing.csv")
## Parsed with column specification:
## cols(
## longitude = col_double(),
## latitude = col_double(),
## housing_median_age = col_double(),
## total_rooms = col_double(),
## total_bedrooms = col_double(),
## population = col_double(),
## households = col_double(),
## median_income = col_double(),
## median_house_value = col_double(),
## ocean_proximity = col_character()
## )
The dataset contains \(20640\) observations and 10 attributes (9 predictors and 1 response). Below is a list of the variables with descriptions taken from the original Kaggle site given above.
longitude: A measure of how far west a house is; a higher value is farther west.
latitude: A measure of how far north a house is; a higher value is farther north.
housing_median_age: Median age of a house within a block; a lower number is a newer building.
total_rooms: Total number of rooms within a block.
total_bedrooms: Total number of bedrooms within a block.
population: Total number of people residing within a block.
households: Total number of households (a group of people residing within a home unit) for a block.
median_income: Median income for households within a block of houses (measured in tens of thousands of US dollars).
ocean_proximity: Location of the house w.r.t. the ocean/sea.
median_house_value: Median house value for households within a block (measured in US dollars).
This dataset meets all of the stated criteria for this project, including a numeric response variable (median_house_value) and a categorical predictor (ocean_proximity).
Let’s look at a summary of each variable:
summary(housing_data)
## longitude latitude housing_median_age total_rooms
## Min. :-124 Min. :33 Min. : 1 Min. : 2
## 1st Qu.:-122 1st Qu.:34 1st Qu.:18 1st Qu.: 1448
## Median :-118 Median :34 Median :29 Median : 2127
## Mean :-120 Mean :36 Mean :29 Mean : 2636
## 3rd Qu.:-118 3rd Qu.:38 3rd Qu.:37 3rd Qu.: 3148
## Max. :-114 Max. :42 Max. :52 Max. :39320
##
## total_bedrooms population households median_income
## Min. : 1 Min. : 3 Min. : 1 Min. : 0.5
## 1st Qu.: 296 1st Qu.: 787 1st Qu.: 280 1st Qu.: 2.6
## Median : 435 Median : 1166 Median : 409 Median : 3.5
## Mean : 538 Mean : 1425 Mean : 500 Mean : 3.9
## 3rd Qu.: 647 3rd Qu.: 1725 3rd Qu.: 605 3rd Qu.: 4.7
## Max. :6445 Max. :35682 Max. :6082 Max. :15.0
## NA's :207
## median_house_value ocean_proximity
## Min. : 14999 Length:20640
## 1st Qu.:119600 Class :character
## Median :179700 Mode :character
## Mean :206856
## 3rd Qu.:264725
## Max. :500001
##
Note that the total_bedrooms variable has 207 NA values. We will address this issue in the Data Cleaning section in Methods.
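A quick per-column check (a sketch; this line is not part of the original output) confirms that the NAs are confined to total_bedrooms:
colSums(is.na(housing_data))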
Below is a visual representation of all data points in the dataset, with longitude on the x-axis, latitude on the y-axis, and median_house_value encoded as color.
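# Note: hma, tr, tb, hh, and mi below are not standard ggplot2 aesthetics; they
# are presumably carried along for interactive tooltips (e.g., plotly's
# ggplotly()), an assumption since the original does not say.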
plot_map = ggplot(housing_data,
aes(x = longitude, y = latitude, color = median_house_value,
hma = housing_median_age, tr = total_rooms, tb = total_bedrooms,
hh = households, mi = median_income)) +
geom_point(aes(size = population), alpha = 0.4) +
xlab("Longitude") +
ylab("Latitude") +
ggtitle("Data Map - Longtitude vs Latitude and Associated Variables") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_distiller(palette = "Paired", labels = comma) +
labs(color = "Median House Value (in $USD)", size = "Population")
plot_map
We see that, on average, the houses nearest to the ocean tend to have higher median house values, whereas those inland have lower median values. This difference is quite substantial and tells us that the variable ocean_proximity will likely play a large role in predicting median house value.
It’s worth noting that some ocean-adjacent locations, like the more isolated towns along California’s northernmost coast, have lower median house prices than other coastal housing, while some inland locations, like the towns near destinations such as Lake Tahoe or California’s inland capital Sacramento, have higher median house prices than other inland housing. It is also interesting that the houses closest to large metropolitan areas such as San Francisco and Los Angeles have among the highest median house values, whereas the largely agricultural Central Valley has among the lowest. This tells us that while ocean_proximity is very important, we may not be able to rely on it alone to tell the whole story.
Initial exploration of the data showed us that there were a few steps we needed to take to make the data more usable. First, we converted the categorical variable ocean_proximity from text to a factor variable.
housing_data$ocean_proximity = as.factor(housing_data$ocean_proximity)
levels(housing_data$ocean_proximity)
## [1] "<1H OCEAN" "INLAND" "ISLAND" "NEAR BAY" "NEAR OCEAN"
Taking a deeper dive into ocean_proximity, we see that the level ISLAND has a very low count compared to the other levels.
ggplot(housing_data, aes(x = factor(ocean_proximity))) +
geom_bar(stat = "count", color = "black", fill = "blue")
We looked at the total count of each level to get a better idea of the skew in ocean_proximity.
summary(housing_data$ocean_proximity)
## <1H OCEAN INLAND ISLAND NEAR BAY NEAR OCEAN
## 9136 6551 5 2290 2658
In fact, ISLAND has only five rows, whereas every other level has over 2,000. Due to possible issues with model fitting, we decided to remove this level from ocean_proximity, accepting the risk that our model will less accurately predict the value of houses on islands.
housing_data = housing_data[housing_data$ocean_proximity != "ISLAND", ]
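Note that subsetting removes the rows but leaves the now-empty ISLAND factor level in place (str() below still shows five levels, and createDataPartition() later warns about the empty class). A minimal sketch of the optional cleanup, which we did not apply:
# Optional (not applied): drop the empty ISLAND level to silence the
# createDataPartition() warning about classes with no records.
# housing_data$ocean_proximity = droplevels(housing_data$ocean_proximity)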
The other thing we considered was missing data.
sum(is.na(housing_data))
## [1] 207
total_bedrooms = housing_data$total_bedrooms
sum(is.na(total_bedrooms))
## [1] 207
There are \(207\) observations with missing data for total_bedrooms. One way to solve this issue of NA values in total_bedrooms is to impute data points. We first want to look at the distribution of this variable to see which imputation method will work best.
bedroom_mean = mean(housing_data$total_bedrooms, na.rm=TRUE)
bedroom_median = median(housing_data$total_bedrooms, na.rm=TRUE)
ggplot(housing_data, aes(x = total_bedrooms)) +
geom_histogram(bins = 40, color = "black", fill = "blue") +
geom_vline(aes(xintercept = bedroom_mean, color = "Mean"), lwd = 1.5) +
geom_vline(aes(xintercept = bedroom_median, color = "Median"), lwd = 1.5) +
xlab("Total Bedrooms") +
ylab("Frequency") +
ggtitle("Histogram of Total Bedrooms (noncontinuous variable)") +
scale_color_manual(name = "Summary Stats", labels = c("Mean", "Median"), values = c("red", "green"))
## Warning: Removed 207 rows containing non-finite values (stat_bin).
Looking at the distribution of the data in the histogram above, which is strongly right-skewed, we decided to use the median of the total_bedrooms variable for our imputation method, since the median is more robust to the skew than the mean.
housing_data$total_bedrooms[is.na(housing_data$total_bedrooms)] = bedroom_median
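Since the mice package is already loaded, model-based imputation would be an alternative to the simple median fill; a sketch under mice's defaults (predictive mean matching for numeric columns), which we did not apply:
# Alternative (not applied): impute with mice, then take one completed dataset.
# housing_data = complete(mice(housing_data, m = 1, seed = 420, printFlag = FALSE))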
Looking at the structure of the dataset after cleaning, we see that besides the one factor variable ocean_proximity, we have nine numeric variables, three of which are continuous (longitude, latitude, and median_income) and six of which are discrete (housing_median_age, total_rooms, total_bedrooms, population, households, and median_house_value).
str(housing_data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 20635 obs. of 10 variables:
## $ longitude : num -122 -122 -122 -122 -122 ...
## $ latitude : num 37.9 37.9 37.9 37.9 37.9 ...
## $ housing_median_age: num 41 21 52 52 52 52 52 52 42 52 ...
## $ total_rooms : num 880 7099 1467 1274 1627 ...
## $ total_bedrooms : num 129 1106 190 235 280 ...
## $ population : num 322 2401 496 558 565 ...
## $ households : num 126 1138 177 219 259 ...
## $ median_income : num 8.33 8.3 7.26 5.64 3.85 ...
## $ median_house_value: num 452600 358500 352100 341300 342200 ...
## $ ocean_proximity : Factor w/ 5 levels "<1H OCEAN","INLAND",..: 4 4 4 4 4 4 4 4 4 4 ...
For a better sense of the distribution of the nine numeric variables, we looked at histograms for each of them.
par(mfrow = c(3, 3))
hist(housing_data$longitude, breaks = 20, main = "longitude", border="darkorange", col="dodgerblue")
hist(housing_data$latitude, breaks = 20, main = "latitude", border="darkorange", col="dodgerblue")
hist(housing_data$housing_median_age, breaks = 20, main = "housing_median_age", border="darkorange", col="dodgerblue")
hist(housing_data$total_rooms, breaks = 20, main = "total_rooms", border="darkorange", col="dodgerblue")
hist(housing_data$total_bedrooms, breaks = 20, main = "total_bedrooms", border="darkorange", col="dodgerblue")
hist(housing_data$population, breaks = 20, main = "population", border="darkorange", col="dodgerblue")
hist(housing_data$households, breaks = 20, main = "households", border="darkorange", col="dodgerblue")
hist(housing_data$median_income, breaks = 20, main = "median_income", border="darkorange", col="dodgerblue")
hist(housing_data$median_house_value, breaks = 20, main = "median_house_value", border="darkorange", col="dodgerblue")
We also looked at a pairs plot to better understand the relationships between all the variables.
pairs(housing_data, col = "dodgerblue")
It looks like some variables may be redundant due to collinearity. We looked further into the correlations between the numeric variables, shown in the table below:
housing_data_nc = housing_data[, -10]
corrmatrix = cor(housing_data_nc)
kable(t(corrmatrix))
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| longitude | 1.00 | -0.92 | -0.11 | 0.04 | 0.07 | 0.10 | 0.06 | -0.02 | -0.05 |
| latitude | -0.92 | 1.00 | 0.01 | -0.04 | -0.07 | -0.11 | -0.07 | -0.08 | -0.14 |
| housing_median_age | -0.11 | 0.01 | 1.00 | -0.36 | -0.32 | -0.30 | -0.30 | -0.12 | 0.11 |
| total_rooms | 0.04 | -0.04 | -0.36 | 1.00 | 0.93 | 0.86 | 0.92 | 0.20 | 0.13 |
| total_bedrooms | 0.07 | -0.07 | -0.32 | 0.93 | 1.00 | 0.87 | 0.97 | -0.01 | 0.05 |
| population | 0.10 | -0.11 | -0.30 | 0.86 | 0.87 | 1.00 | 0.91 | 0.00 | -0.02 |
| households | 0.06 | -0.07 | -0.30 | 0.92 | 0.97 | 0.91 | 1.00 | 0.01 | 0.07 |
| median_income | -0.02 | -0.08 | -0.12 | 0.20 | -0.01 | 0.00 | 0.01 | 1.00 | 0.69 |
| median_house_value | -0.05 | -0.14 | 0.11 | 0.13 | 0.05 | -0.02 | 0.07 | 0.69 | 1.00 |
One thing that sticks out is the high collinearity between households and total_bedrooms, as well as between households and total_rooms. While these variables are highly correlated, we believe it is best to leave them in, as they typically influence home prices. Overall, we do not believe that collinearity will play a substantial role in our models, so we will keep all variables.
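As a numeric check on this collinearity, variance inflation factors could be computed with the already-loaded car package; a sketch, not part of the original analysis (vif() reports generalized VIFs when a factor such as ocean_proximity is present):
# Sketch: VIFs for a full additive fit as a collinearity check.
vif(lm(median_house_value ~ ., data = housing_data))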
First we want to split the data into training and testing sets. We will use a 70/30 split of a randomized sample, with a set seed for reproducibility.
Note that we decided to partition on ocean_proximity, since this is a predictor that we believe will play a large role in predicting housing prices.
set.seed(420)
housing_trn_idx = createDataPartition(housing_data$ocean_proximity, p = .70, list = FALSE)
## Warning in createDataPartition(housing_data$ocean_proximity, p = 0.7, list
## = FALSE): Some classes have no records ( ISLAND ) and these will be ignored
housing_trn_data = housing_data[housing_trn_idx, ]
housing_tst_data = housing_data[-housing_trn_idx, ]
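A quick sanity check (illustrative; not part of the original output) confirms the partition proportion:
# Fraction of observations in the training set; should be approximately 0.70.
nrow(housing_trn_data) / nrow(housing_data)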
As a starting point, we want to get an idea of which predictors and their interactions are good candidates for a potential model. We will create three initial models fitted on our training data: an additive model that uses all variables, a two-way model that uses all variables plus their two-way interactions, and a three-way model that uses all variables plus their two- and three-way interactions. We create these models below.
full_additive_model = lm(median_house_value ~ ., data = housing_trn_data)
full_additive_adjr2 = summary(full_additive_model)$adj.r.squared
full_twoway_model = lm(median_house_value ~ (.)^2, data = housing_trn_data)
full_twoway_adjr2 = summary(full_twoway_model)$adj.r.squared
full_threeway_model = lm(median_house_value ~ (.)^3, data = housing_trn_data)
full_threeway_adjr2 = summary(full_threeway_model)$adj.r.squared
In order to take an initial look at these models, we will use the model selection criteria \(AIC\) and \(Adj. R^2\). We will also include the total number of predictors in each model to get a feel for the total complexity of each.
beginning_mods_results = data.frame(
"Total Predictors" =
c("Additive Model" = extractAIC(full_additive_model)[1],
"Two-Way Int. Model" = extractAIC(full_twoway_model)[1],
"Three-Way Int. Model" = extractAIC(full_threeway_model)[1]),
"AIC" =
c("Additive Model" = extractAIC(full_additive_model)[2],
"Two-Way Int. Model" = extractAIC(full_twoway_model)[2],
"Three-Way Int. Model" = extractAIC(full_threeway_model)[2]),
"Adj R-Squared" =
c("Additive Model" = full_additive_adjr2,
"Two-Way Int. Model" = full_twoway_adjr2,
"Three-Way Int. Model" = full_threeway_adjr2))
kable(beginning_mods_results, align = c("c", "r"))
| | Total Predictors | AIC | Adj. R-Squared |
|---|---|---|---|
| Additive Model | 12 | 321749 | 0.65 |
| Two-Way Int. Model | 64 | 319188 | 0.70 |
| Three-Way Int. Model | 204 | 317717 | 0.74 |
We see that the model with the best (i.e., lowest) AIC is full_threeway_model, with a score of \(3.18\times 10^{5}\). However, although this model has the lowest AIC, it also has far more predictors (and is therefore more complex) than the other two models: \(204\) predictors, compared to \(64\) for full_twoway_model and \(12\) for full_additive_model. This is something to keep in mind as we move forward, as model complexity should be taken into account.
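Since each successive model adds interaction terms to the previous one, the three models are nested and could also be compared with an F-test; a sketch, not part of the original analysis:
# Nested-model F-tests: additive vs two-way vs three-way interactions.
anova(full_additive_model, full_twoway_model, full_threeway_model)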
All three of these models are good candidates for backward and stepwise AIC and BIC searches; however, we will only run these searches on the additive and two-way models, because of the compute power and time the step function would need on the three-way model. Since the three-way model was the best of our initial three, we will leave it as-is and compare it against the results of the stepwise and backward searches on the additive and two-way models.
back_additive_mod_finish_aic = step(full_additive_model, direction = "backward", trace = 0)
both_additive_mod_finish_aic = step(full_additive_model, direction = "both", trace = 0)
n = length(resid(full_additive_model))
back_additive_mod_finish_bic = step(full_additive_model, direction = "backward", k = log(n), trace = 0)
both_additive_mod_finish_bic = step(full_additive_model, direction = "both", k = log(n), trace = 0)
back_twoway_mod_finish_aic = step(full_twoway_model, direction = "backward", trace = 0)
both_twoway_mod_finish_aic = step(full_twoway_model, direction = "both", trace = 0)
n = length(resid(full_twoway_model))
back_twoway_mod_finish_bic = step(full_twoway_model, direction = "backward", k = log(n), trace = 0)
both_twoway_mod_finish_bic = step(full_twoway_model, direction = "both", k = log(n), trace = 0)
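To inspect which terms survived a given search, the selected formula can be printed (illustrative):
# Formula of the model chosen by backward AIC search on the two-way model.
formula(back_twoway_mod_finish_aic)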
aic_and_bic_results = data.frame(
"AIC" =
c("Backward" =
c("Additive" = extractAIC(back_additive_mod_finish_aic)[2],
"Two-Way" = extractAIC(back_twoway_mod_finish_aic)[2],
"Three-way" = extractAIC(full_threeway_model)[2]),
"Both" =
c("Additive" = extractAIC(both_additive_mod_finish_aic)[2],
"Two-Way" = extractAIC(both_twoway_mod_finish_aic)[2],
"Three-way" = extractAIC(full_threeway_model)[2])),
"BIC" =
c("Backward" =
c("Additive" = extractAIC(back_additive_mod_finish_bic)[2],
"Two-Way" = extractAIC(back_twoway_mod_finish_bic)[2],
"Three-way" = extractAIC(full_threeway_model)[2]),
"Both" =
c("Additive" = extractAIC(both_additive_mod_finish_bic)[2],
"Two-Way" = extractAIC(both_twoway_mod_finish_bic)[2],
"Three-way" = extractAIC(full_threeway_model)[2])))
kable(aic_and_bic_results)
| | AIC | BIC |
|---|---|---|
| Backward Additive | 321749 | 321749 |
| Backward Two-Way | 319179 | 319194 |
| Backward Three-way | 317717 | 317717 |
| Both Additive | 321749 | 321749 |
| Both Two-Way | 319179 | 319194 |
| Both Three-way | 317717 | 317717 |
From the table above, we can see that the initial three-way model beats all of the models selected by stepwise and backward search under both AIC and BIC. This leads us to believe it could be our best model. However, we should note that the three-way model includes a large number of variables, which introduces a lot of complexity.
With complexity in mind, we will take a look at the AIC of the other models. From now on, we will use AIC as our main selection criterion. The second-lowest AIC is a tie between the models selected by backward and stepwise search on the two-way model. We suspected these models are exactly the same; checking with R's identical function returns TRUE, so we will use back_twoway_mod_finish_aic as our best model from this chart.
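For reference, the check takes this form (a sketch of the comparison described above):
identical(back_twoway_mod_finish_aic, both_twoway_mod_finish_aic)
## [1] TRUE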
We now want to run diagnostics on the models we created, to check the model assumptions of normality and homoskedasticity; we cannot rely solely on AIC for model selection and must take assumptions into account. The models we check are back_additive_mod_finish_aic, back_twoway_mod_finish_aic, and full_threeway_model, each selected for having the lowest AIC in its class of initial models (additive, two-way, and three-way).
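# diagnostics(): plots Fitted vs Residuals, a normal Q-Q plot, and a histogram
# of the residuals (if plots = TRUE), then runs the Anderson-Darling and
# Breusch-Pagan tests and returns the results as a kable table (if tests = TRUE).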
diagnostics = function(model, alpha = .05, pointcol = "orange", linecol = "blue", plots = TRUE, tests = TRUE, pointtype = 16) {
if (plots == TRUE) {
par(mfrow = c(1, 3))
plot(
fitted(model),
resid(model),
pch = pointtype,
xlab = "Fitted Values",
ylab = "Residuals",
main = "Fitted vs Residuals",
col = pointcol
)
abline(h = 0, lwd = 2, col = linecol)
qqnorm(
resid(model),
pch = pointtype,
main = "QQNorm Plot",
col = pointcol
)
qqline(
resid(model),
lwd = 2,
col = linecol
)
hist(
resid(model),
main = "Histogram of Residuals",
col = pointcol,
xlab = "Residuals",
ylab = "Frequency"
)
}
if (tests == TRUE) {
ad_test = ad.test(resid(model))
bp_test = bptest(model)
test_results = data.frame(
"Anderson-Darling Normality Test" =
c("Test Statistic" = round(ad_test$statistic, 5),
"P-Value" = ad_test$p.value,
"Result" = ifelse(ad_test$p.value < alpha, "Reject", "Fail To Reject")),
"Breusch-Pagan Test" =
c("Test Statistic" = round(bp_test$statistic, 5),
"P-Value" = bp_test$p.value,
"Result" = ifelse(bp_test$p.value < alpha, "Reject", "Fail To Reject")))
kable(t(test_results), col.names = c("Test Statistic", "P-Value", "Decision"))
}
}
diagnostics(back_additive_mod_finish_aic)
| | Test Statistic | P-Value | Decision |
|---|---|---|---|
| Anderson-Darling Normality Test | 250.58939 | 3.7e-24 | Reject |
| Breusch-Pagan Test | 597.23911 | 5.45661256594302e-121 | Reject |
diagnostics(back_twoway_mod_finish_aic)
| | Test Statistic | P-Value | Decision |
|---|---|---|---|
| Anderson-Darling Normality Test | 277.03195 | 3.7e-24 | Reject |
| Breusch-Pagan Test | 1217.97917 | 2.27700592382994e-217 | Reject |
diagnostics(full_threeway_model)
| | Test Statistic | P-Value | Decision |
|---|---|---|---|
| Anderson-Darling Normality Test | 288.85194 | 3.7e-24 | Reject |
| Breusch-Pagan Test | 1577.09254 | 5.71910719516228e-211 | Reject |
Our diagnostics function produces a scatter plot of fitted values against residuals, a normal Q-Q plot, and a histogram of the residuals, so we can visualize how well the model assumptions hold. For the tabular results, we run two tests.
The first is the Anderson-Darling test, which tests our normality assumption; its null hypothesis is that the data are sampled from a normal distribution. Note that we use Anderson-Darling instead of Shapiro-Wilk because our dataset is large (\(20635\) observations); indeed, base R's shapiro.test() only accepts up to 5000 observations, and Anderson-Darling performs better on datasets of this size.
The second is the Breusch-Pagan test, which tests whether our model is homoskedastic (that is, whether the residuals have constant variance). Its null hypothesis is that the model is homoskedastic.
Looking at the plots, we are concerned that our models do not satisfy the assumptions of normality and homoskedasticity. The test results confirm this: for each model we reject both null hypotheses. Given these rejections, we have decided that these models are better suited for prediction than for inference, since the assumptions do not hold.
Recall that our best model is back_twoway_mod_finish_aic, the model found by backward search on the full two-way model using AIC as the criterion. To combat the rejections of the null hypotheses above, we will attempt to transform the response variable, median_house_value, in the hope of getting closer to satisfying our model assumptions.
back_twoway_mod_finish_aic_log = lm(formula = log(median_house_value) ~ longitude + latitude + housing_median_age +
total_rooms + total_bedrooms + population + households +
median_income + ocean_proximity + longitude:latitude + longitude:housing_median_age +
longitude:total_rooms + longitude:total_bedrooms + longitude:households +
longitude:median_income + longitude:ocean_proximity + latitude:housing_median_age +
latitude:total_rooms + latitude:total_bedrooms + latitude:median_income +
latitude:ocean_proximity + housing_median_age:total_rooms +
housing_median_age:population + housing_median_age:households +
housing_median_age:median_income + housing_median_age:ocean_proximity +
total_rooms:population + total_rooms:households + total_rooms:median_income +
total_rooms:ocean_proximity + total_bedrooms:households +
total_bedrooms:median_income + total_bedrooms:ocean_proximity +
population:households + population:median_income + population:ocean_proximity +
households:median_income + households:ocean_proximity + median_income:ocean_proximity,
data = housing_trn_data)
diagnostics(back_twoway_mod_finish_aic_log)
| | Test Statistic | P-Value | Decision |
|---|---|---|---|
| Anderson-Darling Normality Test | 50.79596 | 3.7e-24 | Reject |
| Breusch-Pagan Test | 1089.85284 | 7.15443046217138e-191 | Reject |
Doing so, we can see that we are closer to a normal distribution when taking the log of median_house_value. However, we still reject both null hypotheses. Since the transformation did not improve our assumptions much, we will stick with the non-logged model and accept that our assumptions do not hold. While this decreases the validity of our model for inference, the model can still be useful for prediction, and since our aim is to predict home values, we are okay with this.
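Had we kept the logged model, its fitted values and predictions would live on the log scale and require back-transformation; a sketch (note that a plain exp() of a log-model prediction slightly underestimates the conditional mean):
# Back-transform log-scale fitted values to dollars (illustrative only).
head(exp(fitted(back_twoway_mod_finish_aic_log)))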
For our last model selection criterion, we will be utilizing leave-one-out cross-validated RMSE (hereafter referred to as LOOCV RMSE). This can be written as:
\[ \text{RMSE}_{\text{LOOCV}} = \sqrt{\frac{1}{n}\sum_{i=1}^n \left(\frac{e_{i}}{1-h_{i}}\right)^2} \]
where \(e_i\) are the residuals and \(h_i\) are the leverages.
LOOCV RMSE corrects for the fact that RMSE alone always prefers the larger model by instead assessing how the model fits “left out” data that wasn’t used to perform the regression.
# From the text: https://daviddalpiaz.github.io/appliedstats/variable-selection-and-model-building.html
calc_loocv_rmse = function(model) {
sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
calc_rmse = function(actual, predicted) {
sqrt(sum((actual - predicted)^2) / length(actual))
}
calc_avg_per_error = function(actual, predicted) {
inter_abs = abs(predicted - actual)
100 * (sum(inter_abs / actual)) / length(actual)
}
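A tiny usage example for these helpers (illustrative numbers, not from the dataset):
# RMSE: sqrt((10^2 + 10^2) / 2) = 10
calc_rmse(actual = c(100, 200), predicted = c(110, 190))
# Average percent error: 100 * (10/100 + 10/200) / 2 = 7.5
calc_avg_per_error(actual = c(100, 200), predicted = c(110, 190))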
additive_loocv_rmse = calc_loocv_rmse(back_additive_mod_finish_aic)
twoway_loocv_rmse = calc_loocv_rmse(back_twoway_mod_finish_aic)
threeway_loocv_rmse = calc_loocv_rmse(full_threeway_model)
loocv_rmse_results = data.frame(
"LOOCV-RMSE" =
c("Backwards Additive" = additive_loocv_rmse,
"Backwards Two-Way" = twoway_loocv_rmse,
"Initial Three-way" = threeway_loocv_rmse))
kable(loocv_rmse_results)
| | LOOCV-RMSE |
|---|---|
| Backwards Additive | 68766 |
| Backwards Two-Way | 63098 |
| Initial Three-way | 67982 |
As we can see from the table above, the model created by backward search of the two-way model, using AIC as the selection criterion, obtains the lowest LOOCV-RMSE. Since LOOCV-RMSE is our final model selection criterion, this gives us the final result that our Backwards Two-Way Search model, back_twoway_mod_finish_aic, is our best model. We will now evaluate it on the test set we held back, to see how it performs on unseen data.
# the actual median house values from the test set
test_actual = housing_tst_data$median_house_value
# the predicted house values for the test set
test_predictions = predict(back_twoway_mod_finish_aic, housing_tst_data)
# the RMSE
test_rmse = calc_rmse(test_actual, test_predictions)
# the percentage error
test_perc_error = calc_avg_per_error(test_actual, test_predictions)
test_rmse
## [1] 65411
test_perc_error
## [1] 26
We evaluated our final model, the Backwards Two-Way Search model, on unseen test data using RMSE as well as average percent error. We got an RMSE of \(65410.62\), which is similar to the LOOCV-RMSE we obtained for this model on our training data (\(63098.08\)). Since these RMSEs are close in value, our model does not appear to be overfitting. For average percent error, we got a result of \(25.5\)%.
Our final model, when evaluated on unseen test data, had an RMSE of \(65410.62\). Roughly speaking, this means a typical prediction differs from the actual price by about \(\pm\) \(65410.62\) dollars (keeping in mind that RMSE weights large errors more heavily than a simple average would). This corresponds to an average percent error of \(\pm\) \(25.5\)% between the actual and predicted prices.
While this is higher than we might have hoped, we believe this error isn't too bad considering that the majority of the predictors were location-based. As we noted in the Introduction, location, and more specifically ocean_proximity, gives us quite a lot of information about housing prices, but it doesn't tell the whole story.
Below we show the range of our model's errors by ocean_proximity. (Note that there is no data for the \(ISLAND\) level because we removed it from the dataset to avoid skew from its very low representation.)
op0 = par()
op1 = op0$mar
op1[2] = 7
par(mar = op1)
plot(test_predictions - test_actual, housing_tst_data$ocean_proximity,
xlab = "Predicted Price Over / Under Actual Price",
ylab = "",
main = "Predicted Price Over / Under Actual Price vs Ocean Proximity",
col = "dodgerblue", yaxt = "none")
axis(2, at = 1:5, labels = levels(housing_tst_data$ocean_proximity), las = 1)
A general trend worth noting is that we tend to underestimate the price somewhat more frequently than we overestimate it. Dividing the data by ocean_proximity, we see that, for the most part, the range of the errors is consistent across the levels. However, there are some extreme outliers when ocean_proximity is \(INLAND\). In particular, our model underestimated the price of two houses by more than $800,000, which is quite a jump from the rest of the errors and well beyond our test RMSE of \(65410.62\).
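One way (a sketch, not part of the original analysis) to pull out those two rows from the test data:
# The two largest underestimates are the most negative prediction errors.
test_errors = test_predictions - test_actual
housing_tst_data[order(test_errors)[1:2], c("longitude", "latitude", "median_house_value", "ocean_proximity")]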
In order to see if these outlier data points align with the issues we saw in the Introduction, we plotted each of our model’s errors in the test data.
plot_map = ggplot(housing_tst_data,
aes(x = longitude, y = latitude,
color = test_predictions - test_actual,
hma = housing_median_age, tr = total_rooms, tb = total_bedrooms,
hh = households, mi = median_income)) +
geom_point(aes(size = abs(test_predictions - test_actual)), alpha = 0.4) +
xlab("Longitude") +
ylab("Latitude") +
ggtitle("Predicted Price Over / Under Actual Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_distiller(palette = "Paired", labels = comma) +
labs(color = "Predicted Price Over / Under (in $USD)",
size = "Magnitude of Price Difference")
plot_map
Those two troublesome data points are indeed located in one of the inland areas with more expensive prices - namely, near the Sacramento area. It’s possible that removing these two values could have helped us more accurately estimate the prices of inland houses.
Overall, our model was able to predict the prices of houses in California in 1990 within a \(\pm\) \(25.5\)% error rate based on the California Housing Prices dataset.