Technical Requirements

You may need to install Cairo on your operating system to run this notebook. To learn more, see the Cairo documentation at https://www.cairographics.org/download/.

if(!require(Cairo)) install.packages("Cairo", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(car)) install.packages("car", repos = "http://cran.us.r-project.org")
if(!require(nortest)) install.packages("nortest", repos = "http://cran.us.r-project.org")
library(readr)
library(ggplot2)
library(knitr)
library(tidyverse)
library(caret)
library(leaps)
library(car)
library(mice)
library(scales)
library(RColorBrewer)
library(plotly)
library(nortest)
library(lmtest)

Introduction

Predicting Housing Prices

The purpose of this project is to predict the price of houses in California in 1990 based on a number of possible location-based predictors, including latitude, longitude, and information about other houses within a particular block.

While this project focuses on prediction, we are fully aware that housing prices have increased dramatically since 1990, when the data was collected. This model should not be used to predict today’s housing prices in California. This is purely an academic endeavor to explore statistical prediction.

The goal of the project is to create the model that can best predict home prices in California given reasonable test/train splits in the data.

California Housing Prices Dataset

We’re using the California Housing Prices dataset (housing.csv) from the following Kaggle site: https://www.kaggle.com/camnugent/california-housing-prices. The data pertain to the houses found in a given California district, with summary statistics about them drawn from the 1990 census.

housing_data = read_csv("housing.csv")
## Parsed with column specification:
## cols(
##   longitude = col_double(),
##   latitude = col_double(),
##   housing_median_age = col_double(),
##   total_rooms = col_double(),
##   total_bedrooms = col_double(),
##   population = col_double(),
##   households = col_double(),
##   median_income = col_double(),
##   median_house_value = col_double(),
##   ocean_proximity = col_character()
## )

The dataset contains \(20640\) observations and 10 attributes (9 predictors and 1 response). Below is a list of the variables with descriptions taken from the original Kaggle site given above.

  • longitude: A measure of how far west a house is; a higher value is farther west
  • latitude: A measure of how far north a house is; a higher value is farther north
  • housing_median_age: Median age of a house within a block; a lower number is a newer building
  • total_rooms: Total number of rooms within a block
  • total_bedrooms: Total number of bedrooms within a block
  • population: Total number of people residing within a block
  • households: Total number of households, a group of people residing within a home unit, for a block
  • median_income: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
  • ocean_proximity: Location of the house w.r.t ocean/sea
  • median_house_value: Median house value for households within a block (measured in US Dollars)

This dataset meets all of the stated criteria for this project including:

  • A minimum of 200 observations
  • A numeric response variable: median_house_value
  • At least one categorical predictor: ocean_proximity
  • At least two numeric predictors: the remaining attributes

Let’s look at a summary of each variable:

summary(housing_data)
##    longitude       latitude  housing_median_age  total_rooms   
##  Min.   :-124   Min.   :33   Min.   : 1         Min.   :    2  
##  1st Qu.:-122   1st Qu.:34   1st Qu.:18         1st Qu.: 1448  
##  Median :-118   Median :34   Median :29         Median : 2127  
##  Mean   :-120   Mean   :36   Mean   :29         Mean   : 2636  
##  3rd Qu.:-118   3rd Qu.:38   3rd Qu.:37         3rd Qu.: 3148  
##  Max.   :-114   Max.   :42   Max.   :52         Max.   :39320  
##                                                                
##  total_bedrooms   population      households   median_income 
##  Min.   :   1   Min.   :    3   Min.   :   1   Min.   : 0.5  
##  1st Qu.: 296   1st Qu.:  787   1st Qu.: 280   1st Qu.: 2.6  
##  Median : 435   Median : 1166   Median : 409   Median : 3.5  
##  Mean   : 538   Mean   : 1425   Mean   : 500   Mean   : 3.9  
##  3rd Qu.: 647   3rd Qu.: 1725   3rd Qu.: 605   3rd Qu.: 4.7  
##  Max.   :6445   Max.   :35682   Max.   :6082   Max.   :15.0  
##  NA's   :207                                                 
##  median_house_value ocean_proximity   
##  Min.   : 14999     Length:20640      
##  1st Qu.:119600     Class :character  
##  Median :179700     Mode  :character  
##  Mean   :206856                       
##  3rd Qu.:264725                       
##  Max.   :500001                       
## 

Note that the total_bedrooms variable has 207 NA values. We will address this issue in the Data Cleaning section in Methods.

Below is a visual representation of all data points in the dataset, with longitude on the x-axis, latitude on the y-axis, and median_house_value shown using a color scale.

plot_map = ggplot(housing_data, 
                  aes(x = longitude, y = latitude, color = median_house_value, 
                      hma = housing_median_age, tr = total_rooms, tb = total_bedrooms,
                      hh = households, mi = median_income)) +
              geom_point(aes(size = population), alpha = 0.4) +
              xlab("Longitude") +
              ylab("Latitude") +
              ggtitle("Data Map - Longtitude vs Latitude and Associated Variables") +
              theme(plot.title = element_text(hjust = 0.5)) +
              scale_color_distiller(palette = "Paired", labels = comma) +
              labs(color = "Median House Value (in $USD)", size = "Population")
plot_map

We see that on average the houses nearest to the ocean tend to have higher median house values, whereas those inland have lower median values. This difference is quite substantial and tells us that the variable ocean_proximity will likely play a large role in predicting median house value.

It’s worth noting that some ocean-adjacent locations, like the more isolated towns along California’s northernmost coast, have lower median house prices than other coastal housing, and some inland locations, like the towns near destinations such as Lake Tahoe or California’s inland capital of Sacramento, have higher median house prices than other inland housing. It is also interesting that the houses closest to large metropolitan areas such as San Francisco and Los Angeles have among the highest median house values, whereas the largely agricultural Central Valley has among the lowest. This tells us that while ocean_proximity is very important, we may not be able to rely on it alone to tell the whole story.

Methods

Data Cleaning

Categorical Variables

Initial exploration of the data showed us that there were a few steps we needed to take to make the data more usable. First, we changed the categorical variable ocean_proximity from text-based to a factor variable.

housing_data$ocean_proximity = as.factor(housing_data$ocean_proximity)
levels(housing_data$ocean_proximity)
## [1] "<1H OCEAN"  "INLAND"     "ISLAND"     "NEAR BAY"   "NEAR OCEAN"

Taking a deeper dive into ocean_proximity, we see that the level ISLAND has a very low count compared to the other levels.

ggplot(housing_data, aes(x = factor(ocean_proximity))) +
  geom_bar(stat = "count", color = "black", fill = "blue")

We looked at the total count of each level to get a better idea of the skew in ocean_proximity.

summary(housing_data$ocean_proximity)
##  <1H OCEAN     INLAND     ISLAND   NEAR BAY NEAR OCEAN 
##       9136       6551          5       2290       2658

In fact, ISLAND has only five rows, whereas every other level has over 2,000. Due to possible issues with model fitting, we decided to remove this level from ocean_proximity, accepting the risk that our model will predict the value of houses on islands less accurately.

housing_data = housing_data[housing_data$ocean_proximity != "ISLAND", ]
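
Note that this subsetting drops the ISLAND rows but leaves the now-empty factor level in place, which is why the structure output below still reports five factor levels and why createDataPartition later warns about a class with no records. The unused level could be dropped explicitly, as sketched below; we leave it in place here so that the outputs that follow are unchanged.

# optional: drop the empty ISLAND level left behind by the subset above
# housing_data$ocean_proximity = droplevels(housing_data$ocean_proximity)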

Missing Data

The other thing we considered was missing data.

sum(is.na(housing_data))
## [1] 207
total_bedrooms = housing_data$total_bedrooms
sum(is.na(total_bedrooms))
## [1] 207

There are \(207\) observations with missing data for total_bedrooms. One way to solve this issue of NA values in total_bedrooms is to impute the missing data points. We first want to take a look at the distribution of this variable to see which imputation method will work best.

bedroom_mean = mean(housing_data$total_bedrooms, na.rm=TRUE)
bedroom_median = median(housing_data$total_bedrooms, na.rm=TRUE)
ggplot(housing_data, aes(x = total_bedrooms)) +
  geom_histogram(bins = 40, color = "black", fill = "blue") +
  geom_vline(aes(xintercept = bedroom_mean, color = "Mean"), lwd = 1.5) +
  geom_vline(aes(xintercept = bedroom_median, color = "Median"), lwd = 1.5) +
  xlab("Total Bedrooms") +
  ylab("Frequency") +
  ggtitle("Histogram of Total Bedrooms (noncontinuous variable)") +
  scale_color_manual(name = "Summary Stats", labels = c("Mean", "Median"), values = c("red", "green"))
## Warning: Removed 207 rows containing non-finite values (stat_bin).

The distribution in the histogram above is strongly right-skewed (the mean sits well above the median), so we decided to use the median of the total_bedrooms variable for our imputation method.

housing_data$total_bedrooms[is.na(housing_data$total_bedrooms)] = bedroom_median
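
Median imputation is simple, but it ignores how strongly total_bedrooms correlates with households and total_rooms. A model-based alternative is predictive mean matching via the already-loaded mice package; a sketch is shown below, commented out because it would have to run in place of the median fill above (on data that still contains the NA values) and it is not the method we used.

# sketch only: run instead of the median fill above, while NAs are still present
# imputed = mice(housing_data, m = 5, method = "pmm", seed = 420, printFlag = FALSE)
# housing_data = complete(imputed, 1)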

Post-Cleaning

Looking at the structure of the dataset after cleaning the data, we see that besides the one factor variable ocean_proximity, we have nine numeric variables, three of which are continuous (longitude, latitude, and median_income) and six of which are discrete (housing_median_age, total_rooms, total_bedrooms, population, households, and median_house_value).

str(housing_data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    20635 obs. of  10 variables:
##  $ longitude         : num  -122 -122 -122 -122 -122 ...
##  $ latitude          : num  37.9 37.9 37.9 37.9 37.9 ...
##  $ housing_median_age: num  41 21 52 52 52 52 52 52 42 52 ...
##  $ total_rooms       : num  880 7099 1467 1274 1627 ...
##  $ total_bedrooms    : num  129 1106 190 235 280 ...
##  $ population        : num  322 2401 496 558 565 ...
##  $ households        : num  126 1138 177 219 259 ...
##  $ median_income     : num  8.33 8.3 7.26 5.64 3.85 ...
##  $ median_house_value: num  452600 358500 352100 341300 342200 ...
##  $ ocean_proximity   : Factor w/ 5 levels "<1H OCEAN","INLAND",..: 4 4 4 4 4 4 4 4 4 4 ...

For a better sense of the distribution of the nine numeric variables, we looked at histograms for each of them.

par(mfrow = c(3, 3))
hist(housing_data$longitude, breaks = 20, main = "longitude", border="darkorange", col="dodgerblue")
hist(housing_data$latitude, breaks = 20, main = "latitude", border="darkorange", col="dodgerblue")
hist(housing_data$housing_median_age, breaks = 20, main = "housing_median_age", border="darkorange", col="dodgerblue")
hist(housing_data$total_rooms, breaks = 20, main = "total_rooms", border="darkorange", col="dodgerblue")
hist(housing_data$total_bedrooms, breaks = 20, main = "total_bedrooms", border="darkorange", col="dodgerblue")
hist(housing_data$population, breaks = 20, main = "population", border="darkorange", col="dodgerblue")
hist(housing_data$households, breaks = 20, main = "households", border="darkorange", col="dodgerblue")
hist(housing_data$median_income, breaks = 20, main = "median_income", border="darkorange", col="dodgerblue")
hist(housing_data$median_house_value, breaks = 20, main = "median_house_value", border="darkorange", col="dodgerblue")

We also looked at the pairs plot to better understand the relationships between all the variables.

pairs(housing_data, col = "dodgerblue")

It looks like some variables may be redundant due to collinearity. We looked further into the correlations between the numeric variables, shown in the table below:

# drop the factor variable ocean_proximity (column 10) before computing correlations
housing_data_nc = housing_data[, -10]

corrmatrix = cor(housing_data_nc)

kable(t(corrmatrix))
                    longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
longitude                1.00     -0.92               -0.11         0.04            0.07        0.10        0.06          -0.02               -0.05
latitude                -0.92      1.00                0.01        -0.04           -0.07       -0.11       -0.07          -0.08               -0.14
housing_median_age      -0.11      0.01                1.00        -0.36           -0.32       -0.30       -0.30          -0.12                0.11
total_rooms              0.04     -0.04               -0.36         1.00            0.93        0.86        0.92           0.20                0.13
total_bedrooms           0.07     -0.07               -0.32         0.93            1.00        0.87        0.97          -0.01                0.05
population               0.10     -0.11               -0.30         0.86            0.87        1.00        0.91           0.00               -0.02
households               0.06     -0.07               -0.30         0.92            0.97        0.91        1.00           0.01                0.07
median_income           -0.02     -0.08               -0.12         0.20           -0.01        0.00        0.01           1.00                0.69
median_house_value      -0.05     -0.14                0.11         0.13            0.05       -0.02        0.07           0.69                1.00

One thing that sticks out is the high collinearity between households and total_bedrooms, as well as between households and total_rooms. While these variables are highly correlated, we believe it is best to leave them in, as they typically influence home prices. Overall, we do not believe that collinearity will play a substantial role in our models, so we will keep all variables.
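
As a quick check on this belief, variance inflation factors from the already-loaded car package can be computed on a simple additive fit. This is a sketch rather than part of our modeling pipeline; values well above 5-10 for total_rooms, total_bedrooms, and households would echo the correlation table above. The droplevels call matters here, since vif errors on the aliased coefficient created by the empty ISLAND level.

# sketch: (generalized) variance inflation factors for an additive fit
vif(lm(median_house_value ~ ., data = droplevels(housing_data)))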

Training and Test Data

First we want to split the data into training and test sets. We will use a 70/30 split of a randomized sample, with a set seed for repeatability.

Note that we decided to partition using ocean_proximity since this is a predictor that we believe will play a large role in predicting housing prices.

set.seed(420)
housing_trn_idx = createDataPartition(housing_data$ocean_proximity, p = .70, list = FALSE)
## Warning in createDataPartition(housing_data$ocean_proximity, p = 0.7, list
## = FALSE): Some classes have no records ( ISLAND ) and these will be ignored
housing_trn_data = housing_data[housing_trn_idx, ]
housing_tst_data = housing_data[-housing_trn_idx, ]

Model Selection

Using all Model Subsets

As a starting point, we want to get an idea of which predictors and their interactions are good candidates for a potential model. To do this, we will create three initial models fitted on our training data. The first is an additive model that uses all variables. The second is a two-way model that uses all variables as well as their two-way interactions. The third and final initial model is a three-way model that uses all variables as well as their two- and three-way interactions. We create these models below.

full_additive_model = lm(median_house_value ~ ., data = housing_trn_data)
full_additive_adjr2 = summary(full_additive_model)$adj.r.squared

full_twoway_model = lm(median_house_value ~ (.)^2, data = housing_trn_data)
full_twoway_adjr2 = summary(full_twoway_model)$adj.r.squared

full_threeway_model = lm(median_house_value ~ (.)^3, data = housing_trn_data)
full_threeway_adjr2 = summary(full_threeway_model)$adj.r.squared

In order to take a look at the initial results of these models, we will use the model selection criteria \(AIC\) and \(Adj. R^2\). We will also include the total number of predictors in each model to get a feel for the total complexity of each.
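
For reference, extractAIC returns a pair of values: the equivalent degrees of freedom (the number of estimated parameters, including the intercept) and the AIC value, which is why we index with [1] and [2] below. For a least-squares fit, it computes AIC, up to an additive constant, as

\[ \text{AIC} = n \log\left(\frac{\text{RSS}}{n}\right) + 2p \]

where \(\text{RSS}\) is the residual sum of squares and \(p\) is the number of estimated parameters; the \(2p\) term is the complexity penalty.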

beginning_mods_results = data.frame(
  "Total Predictors" =
    c("Additive Model" = extractAIC(full_additive_model)[1],
      "Two-Way Int. Model" = extractAIC(full_twoway_model)[1],
      "Three-Way Int. Model" = extractAIC(full_threeway_model)[1]),
  "AIC" =
    c("Additive Model" = extractAIC(full_additive_model)[2],
      "Two-Way Int. Model" = extractAIC(full_twoway_model)[2],
      "Three-Way Int. Model" = extractAIC(full_threeway_model)[2]),
  "Adj R-Squared" =
    c("Additive Model" = full_additive_adjr2,
      "Two-Way Int. Model" = full_twoway_adjr2,
      "Three-Way Int. Model" = full_threeway_adjr2))

kable(beginning_mods_results, align = c("c", "r"))
                      Total Predictors     AIC  Adj. R-Squared
Additive Model                      12  321749            0.65
Two-Way Int. Model                  64  319188            0.70
Three-Way Int. Model               204  317717            0.74

We see that the model with the best (i.e., lowest) AIC is full_threeway_model, with a score of \(3.18\times 10^{5}\). However, although this model has the lowest AIC, it also has far more predictors (and is therefore more complex) than the other two models - \(204\), compared to \(64\) for full_twoway_model and \(12\) for full_additive_model. This is something to keep in mind as we move forward, as model complexity should be taken into account.

Using AIC and BIC

All three of these models are candidates for stepwise and backward search under AIC and BIC; however, we will not run these searches on the three-way model, due to the compute power needed and the time the step function would take to search a model of that size. Since the three-way model was the best of our initial three, we will leave it as-is and compare it against the results of stepwise and backward search on the additive and two-way models.

# backward and stepwise ("both") search under AIC (the default penalty, k = 2)
back_additive_mod_finish_aic = step(full_additive_model, direction = "backward", trace = 0)
both_additive_mod_finish_aic = step(full_additive_model, direction = "both", trace = 0)

# setting k = log(n) switches the step penalty from AIC to BIC
n = length(resid(full_additive_model))
back_additive_mod_finish_bic = step(full_additive_model, direction = "backward", k = log(n), trace = 0)
both_additive_mod_finish_bic = step(full_additive_model, direction = "both", k = log(n), trace = 0)

back_twoway_mod_finish_aic = step(full_twoway_model, direction = "backward", trace = 0)
both_twoway_mod_finish_aic = step(full_twoway_model, direction = "both", trace = 0)

n = length(resid(full_twoway_model))
back_twoway_mod_finish_bic = step(full_twoway_model, direction = "backward", k = log(n), trace = 0)
both_twoway_mod_finish_bic = step(full_twoway_model, direction = "both", k = log(n), trace = 0)

aic_and_bic_results = data.frame(
  "AIC" =
    c("Backward" =
        c("Additive" = extractAIC(back_additive_mod_finish_aic)[2],
          "Two-Way" = extractAIC(back_twoway_mod_finish_aic)[2],
          "Three-way" = extractAIC(full_threeway_model)[2]),
      "Both" =
        c("Additive" = extractAIC(both_additive_mod_finish_aic)[2],
          "Two-Way" = extractAIC(both_twoway_mod_finish_aic)[2],
          "Three-way" = extractAIC(full_threeway_model)[2])),
  "BIC" =
    c("Backward" =
        c("Additive" = extractAIC(back_additive_mod_finish_bic)[2],
          "Two-Way" = extractAIC(back_twoway_mod_finish_bic)[2],
          "Three-way" = extractAIC(full_threeway_model)[2]),
      "Both" =
        c("Additive" = extractAIC(both_additive_mod_finish_bic)[2],
          "Two-Way" = extractAIC(both_twoway_mod_finish_bic)[2],
          "Three-way" = extractAIC(full_threeway_model)[2])))

kable(aic_and_bic_results)
                        AIC     BIC
Backward, Additive   321749  321749
Backward, Two-Way    319179  319194
Backward, Three-way  317717  317717
Both, Additive       321749  321749
Both, Two-Way        319179  319194
Both, Three-way      317717  317717

From the table above, we can see that the initial three-way model beats all of the models selected by stepwise and backward search under both the AIC and BIC criteria. This leads us to believe it could be our best model. However, we should note that the three-way model includes a large number of variables, which introduces a lot of complexity.

With complexity in mind, we will take a look at the AIC of the other models. From now on, we will use AIC as our main selection criterion. The second-lowest AIC is a tie between the models selected by backward and stepwise search on the two-way model. We suspected these were the exact same model, and checking with R's identical function confirms it (the result is TRUE). We will therefore use back_twoway_mod_finish_aic as our best model from this chart.
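
A minimal sketch of this check is shown below (the original chunk is not echoed above); comparing the selected formulas and the fitted coefficients avoids spurious differences from the stored calls.

# sketch: TRUE if backward and stepwise search ended at the same two-way model
identical(formula(back_twoway_mod_finish_aic), formula(both_twoway_mod_finish_aic))
all.equal(coef(back_twoway_mod_finish_aic), coef(both_twoway_mod_finish_aic))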

We will now run diagnostics on some of the models we created, to check the model assumptions of normality and homoskedasticity (constant error variance), since we cannot rely solely on AIC for model selection and must take assumptions into account. These models will be back_additive_mod_finish_aic, back_twoway_mod_finish_aic, and full_threeway_model, selected for having the lowest AIC in each class of initial models (additive, two-way, and three-way).

diagnostics = function(model, alpha = 0.05, pointcol = "orange", linecol = "blue", plots = TRUE, tests = TRUE, pointtype = 16) {
  if (plots) {
    par(mfrow = c(1, 3))
    # fitted values vs residuals, to assess linearity and constant variance
    plot(fitted(model), resid(model), pch = pointtype,
         xlab = "Fitted Values", ylab = "Residuals",
         main = "Fitted vs Residuals", col = pointcol)
    abline(h = 0, lwd = 2, col = linecol)
    # Q-Q plot and histogram of residuals, to assess normality
    qqnorm(resid(model), pch = pointtype, main = "QQNorm Plot", col = pointcol)
    qqline(resid(model), lwd = 2, col = linecol)
    hist(resid(model), main = "Histogram of Residuals", col = pointcol,
         xlab = "Residuals", ylab = "Frequency")
  }
  if (tests) {
    ad_test = ad.test(resid(model))  # Anderson-Darling, from nortest
    bp_test = bptest(model)          # Breusch-Pagan, from lmtest
    test_results = data.frame(
      "Anderson-Darling Normality Test" =
        c("Test Statistic" = round(ad_test$statistic, 5),
          "P-Value" = ad_test$p.value,
          "Result" = ifelse(ad_test$p.value < alpha, "Reject", "Fail To Reject")),
      "Breusch-Pagan Test" =
        c("Test Statistic" = round(bp_test$statistic, 5),
          "P-Value" = bp_test$p.value,
          "Result" = ifelse(bp_test$p.value < alpha, "Reject", "Fail To Reject")))

    kable(t(test_results), col.names = c("Test Statistic", "P-Value", "Decision"))
  }
}
diagnostics(back_additive_mod_finish_aic)

                                 Test Statistic  P-Value                Decision
Anderson-Darling Normality Test       250.58939  3.7e-24                Reject
Breusch-Pagan Test                    597.23911  5.45661256594302e-121  Reject
diagnostics(back_twoway_mod_finish_aic)

                                 Test Statistic  P-Value                Decision
Anderson-Darling Normality Test       277.03195  3.7e-24                Reject
Breusch-Pagan Test                   1217.97917  2.27700592382994e-217  Reject
diagnostics(full_threeway_model)

                                 Test Statistic  P-Value                Decision
Anderson-Darling Normality Test       288.85194  3.7e-24                Reject
Breusch-Pagan Test                   1577.09254  5.71910719516228e-211  Reject

For our diagnostics function, we include a scatter plot of fitted values against residuals, a QQNorm plot, and a histogram of the residuals in order to visualize how well our model assumptions hold. For our tabular results, we run two tests.

The first test is the Anderson-Darling Test, which tests our normality assumption. The null hypothesis for this test is that our data is sampled from a normal distribution. Note that we’re using Anderson-Darling instead of Shapiro-Wilk because our dataset is large (\(20635\) observations) and R’s shapiro.test is limited to sample sizes of at most 5,000; Anderson-Darling handles datasets of this size without issue.

The second test is the Breusch-Pagan Test, which tests whether our model is homoskedastic (or, whether the residuals have a constant variance). The null hypothesis for this test is that our model is homoskedastic.

Looking at the plots, we are concerned that our models do not hold to the model assumptions of normality and homoskedasticity. Looking further at our test results, we see that for each model we reject both null hypotheses. With these rejections, we have decided that these models are better suited for prediction than for inference, since the assumptions do not hold.

Recall that our best model is back_twoway_mod_finish_aic, which was the model found by a backward search on the full two-way model using AIC as the criterion. In order to combat the rejections of the null hypotheses above, we will attempt to transform the response variable, median_house_value, to hopefully come closer to meeting our model assumptions.

back_twoway_mod_finish_aic_log = lm(formula = log(median_house_value) ~ longitude + latitude + housing_median_age + 
    total_rooms + total_bedrooms + population + households + 
    median_income + ocean_proximity + longitude:latitude + longitude:housing_median_age + 
    longitude:total_rooms + longitude:total_bedrooms + longitude:households + 
    longitude:median_income + longitude:ocean_proximity + latitude:housing_median_age + 
    latitude:total_rooms + latitude:total_bedrooms + latitude:median_income + 
    latitude:ocean_proximity + housing_median_age:total_rooms + 
    housing_median_age:population + housing_median_age:households + 
    housing_median_age:median_income + housing_median_age:ocean_proximity + 
    total_rooms:population + total_rooms:households + total_rooms:median_income + 
    total_rooms:ocean_proximity + total_bedrooms:households + 
    total_bedrooms:median_income + total_bedrooms:ocean_proximity + 
    population:households + population:median_income + population:ocean_proximity + 
    households:median_income + households:ocean_proximity + median_income:ocean_proximity, 
    data = housing_trn_data)
diagnostics(back_twoway_mod_finish_aic_log)

                                 Test Statistic  P-Value                Decision
Anderson-Darling Normality Test        50.79596  3.7e-24                Reject
Breusch-Pagan Test                   1089.85284  7.15443046217138e-191  Reject

Doing so, we can see that we get closer to a normal distribution when taking the log of median_house_value (the Anderson-Darling statistic drops from about \(277\) to \(51\)). However, we still reject both null hypotheses. Since this did not give us much improvement on our assumptions, we will stick with the non-logged model and accept that our assumptions do not hold. While this decreases the validity of our model for inference, it can still be useful for prediction, and since we are attempting to predict the values of homes, we are okay with this.

Detecting Overfitting

For our last model selection criterion, we will be utilizing leave-one-out cross-validated RMSE (hereafter referred to as LOOCV RMSE). This can be written as:

\[ \text{RMSE}_{\text{LOOCV}} = \sqrt{\frac{1}{n}\sum_{i=1}^n \left(\frac{e_{i}}{1-h_{i}}\right)^2} \]

where \(e_i\) are the residuals and \(h_i\) are the leverages.

LOOCV RMSE corrects for the fact that RMSE alone always prefers the larger model by instead assessing how the model fits “left out” data that wasn’t used to perform the regression.

# From the text: https://daviddalpiaz.github.io/appliedstats/variable-selection-and-model-building.html
calc_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
calc_rmse = function(actual, predicted) {
  # root mean squared error on the original (dollar) scale
  sqrt(sum((actual - predicted) ^ 2) / length(actual))
}
calc_avg_per_error = function(actual, predicted) {
  # mean absolute percentage error, expressed in percent
  inter_abs = abs(predicted - actual)
  100 * (sum(inter_abs / actual)) / length(actual)
}
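
For reference, calc_rmse and calc_avg_per_error compute

\[ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2} \qquad \text{Avg. \% Error} = \frac{100}{n}\sum_{i=1}^n \frac{|\hat{y}_i - y_i|}{y_i} \]

where \(y_i\) are the actual median house values and \(\hat{y}_i\) the predicted ones.
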
additive_loocv_rmse = calc_loocv_rmse(back_additive_mod_finish_aic)
twoway_loocv_rmse = calc_loocv_rmse(back_twoway_mod_finish_aic)
threeway_loocv_rmse = calc_loocv_rmse(full_threeway_model)

loocv_rmse_results = data.frame(
  "LOOCV-RMSE" =
    c("Backwards Additive" = additive_loocv_rmse,
      "Backwards Two-Way" = twoway_loocv_rmse,
      "Initial Three-way" = threeway_loocv_rmse))

kable(loocv_rmse_results)
                    LOOCV-RMSE
Backwards Additive       68766
Backwards Two-Way        63098
Initial Three-way        67982

As we can see from the table above, the model created from backward search of the two-way model using AIC as the selection criterion obtains the lowest LOOCV-RMSE. Since we are using LOOCV-RMSE as our final model selection criterion, this gives us our final result: the Backwards Two-Way Search model, back_twoway_mod_finish_aic, is our best model. We will now evaluate it on the test set that we held back, in order to see how this model performs on unseen data.

Results

Predicting Housing Prices

# the actual median house values from the test set
test_actual = housing_tst_data$median_house_value
# the predicted house values for the test set
test_predictions = predict(back_twoway_mod_finish_aic, housing_tst_data)
# the RMSE
test_rmse = calc_rmse(test_actual, test_predictions)
# the percentage error
test_perc_error = calc_avg_per_error(test_actual, test_predictions)

test_rmse
## [1] 65411
test_perc_error
## [1] 26

We evaluated our final model, the Backwards Two-Way Search model, on unseen test data using RMSE as well as average percent error. We got an RMSE of \(65410.62\), which is similar to the LOOCV-RMSE we obtained for this model on the training data (\(63098.08\)). Since these RMSEs are close in value, our model does not appear to be overfitting. For average percent error, we got a result of \(25.5\)%.

Discussion

Our final model, when evaluated on unseen test data, had an RMSE of \(65410.62\). This means that a typical prediction from our model is off by roughly \(\pm\) \(65410.62\) dollars relative to the actual price, which corresponds to an average percent error of \(\pm\) \(25.5\)% between the actual and predicted prices.

While this is higher than we might have hoped for, we believe this error isn’t too bad considering that the majority of the predictors were location-based. As we noted in the Introduction, while location - and more specifically ocean_proximity - gives us quite a lot of information about housing prices, it doesn’t tell the whole story.

Below we show the range of our model’s error by ocean_proximity. (Note that there is no data for the \(ISLAND\) level, because we removed it from the dataset due to its very low representation.)

# widen the left margin so the ocean_proximity labels fit
op0 = par()
op1 = op0$mar
op1[2] = 7
par(mar = op1)
plot(test_predictions - test_actual, housing_tst_data$ocean_proximity,
     xlab = "Predicted Price Over / Under Actual Price",
     ylab = "",
     main = "Predicted Price Over / Under Actual Price vs Ocean Proximity",
     col = "dodgerblue", yaxt = "n")
axis(2, at = 1:5, labels = levels(housing_tst_data$ocean_proximity), las = 1)

A general trend worth noting is that we tend to underestimate the price a bit more frequently than we overestimate it. Dividing the data by ocean_proximity, we see that for the most part the range of the errors is consistent across the levels. However, there are some extreme outliers when ocean_proximity is \(INLAND\). In particular, our model underestimated the price of two houses by more than $800,000, which is quite a jump from the rest of the errors and well beyond our average error of \(65410.62\) (a sketch for locating these two points is shown below).
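
As a quick way to locate those two points, one can filter the test set for predictions that fall short of the actual value by more than $800,000 (a sketch using the objects defined above):

# sketch: pull the test rows our model underestimated by more than $800,000
big_misses = which(test_actual - test_predictions > 800000)
housing_tst_data[big_misses, c("longitude", "latitude", "median_house_value", "ocean_proximity")]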

In order to see if these outlier data points align with the issues we saw in the Introduction, we plotted each of our model’s errors in the test data.

plot_map = ggplot(housing_tst_data, 
                  aes(x = longitude, y = latitude, 
                      color = test_predictions - test_actual, 
                      hma = housing_median_age, tr = total_rooms, tb = total_bedrooms,
                      hh = households, mi = median_income)) +
              geom_point(aes(size = abs(test_predictions - test_actual)), alpha = 0.4) +
              xlab("Longitude") +
              ylab("Latitude") +
              ggtitle("Predicted Price Over / Under Actual Price") +
              theme(plot.title = element_text(hjust = 0.5)) +
              scale_color_distiller(palette = "Paired", labels = comma) +
              labs(color = "Predicted Price Over / Under (in $USD)", 
                   size = "Magnitude of Price Difference")
plot_map

Those two troublesome data points are indeed located in one of the inland areas with more expensive prices - namely, near the Sacramento area. It’s possible that removing these two values could have helped us more accurately estimate the prices of inland houses.

Overall, our model was able to predict the prices of houses in California in 1990 within a \(\pm\) \(25.5\)% error rate based on the California Housing Prices dataset.