Technical Requirements

You may need to install Cairo on your operating system to run this notebook. To learn more, see the Cairo download page at https://www.cairographics.org/download/.

if(!require(Cairo)) install.packages("Cairo", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(car)) install.packages("car", repos = "http://cran.us.r-project.org")
if(!require(nortest)) install.packages("nortest", repos = "http://cran.us.r-project.org")
library(readr)
library(ggplot2)
library(knitr)
library(tidyverse)
library(caret)
library(leaps)
library(car)
library(mice)
library(scales)
library(RColorBrewer)
library(plotly)
library(nortest)
library(lmtest)
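The install checks above cover only four of the packages that are loaded afterwards. As a minimal sketch, assuming a hypothetical helper named missing_packages() (it is not part of the original notebook), the same check could be applied uniformly to every package without attaching any of them:

```r
# Hypothetical helper (not from the original notebook): return the subset of
# `pkgs` that is not currently installed in any library on this machine.
missing_packages = function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

needed = c("Cairo", "caret", "car", "nortest", "readr", "ggplot2", "knitr",
           "tidyverse", "leaps", "mice", "scales", "RColorBrewer", "plotly",
           "lmtest")
to_install = missing_packages(needed)
to_install

# Installation is left commented out so the sketch has no side effects.
# if (length(to_install) > 0) {
#   install.packages(to_install, repos = "http://cran.us.r-project.org")
# }
```

Keeping the install step separate from the check makes it easy to see what would be installed before anything changes on the system.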

Introduction

Predicting Housing Prices

The purpose of this project is to predict 1990 California house prices from several location-based predictors, including latitude, longitude, and information about the other houses within the same block.

While this project focuses on prediction, we are fully aware that housing prices have increased dramatically since 1990, when the data was collected. This model should not be used to predict today’s housing prices in California. This is purely an academic endeavor to explore statistical prediction.

The goal of the project is to find the model that best predicts home prices in California on held-out data, given reasonable train/test splits.
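A train/test split of the kind mentioned above could be sketched as follows. The 80/20 proportion, the seed, and the helper name split_train_test() are assumptions for illustration only; the split actually used later in the project may differ. The built-in mtcars data stands in for the housing data so that the sketch runs on its own.

```r
# Hypothetical helper: randomly assign rows to a training set and a test set.
split_train_test = function(data, train_fraction = 0.8, seed = 42) {
  set.seed(seed)                                    # make the split reproducible
  n_train = floor(train_fraction * nrow(data))      # size of the training set
  train_idx = sample(seq_len(nrow(data)), size = n_train)
  list(train = data[train_idx, , drop = FALSE],
       test  = data[-train_idx, , drop = FALSE])
}

parts = split_train_test(mtcars)   # mtcars is a stand-in dataset with 32 rows
nrow(parts$train)                  # 25 rows, i.e. floor(0.8 * 32)
nrow(parts$test)                   # the remaining 7 rows
```

Fixing the seed means that every candidate model is evaluated on the same held-out rows, which keeps the comparison between models fair.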

California Housing Prices Dataset

We’re using the California Housing Prices dataset (housing.csv) from the following Kaggle page: https://www.kaggle.com/camnugent/california-housing-prices. Each row describes the houses in one California district, with summary statistics drawn from the 1990 census.

housing_data = read_csv("housing.csv")
## Parsed with column specification:
## cols(
##   longitude = col_double(),
##   latitude = col_double(),
##   housing_median_age = col_double(),
##   total_rooms = col_double(),
##   total_bedrooms = col_double(),
##   population = col_double(),
##   households = col_double(),
##   median_income = col_double(),
##   median_house_value = col_double(),
##   ocean_proximity = col_character()
## )
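The parse message shows that ocean_proximity is read as a character column. One possible alternative, sketched here under the assumption that a factor is more convenient for modeling, is to declare the column types when reading. The tiny temporary file and its two rows are made up for illustration and are not records from housing.csv.

```r
library(readr)

# Write a small stand-in file so the sketch runs without housing.csv.
# The two data rows are illustrative values only.
tmp = tempfile(fileext = ".csv")
writeLines(c("median_income,median_house_value,ocean_proximity",
             "3.5,180000,INLAND",
             "4.2,250000,NEAR BAY"), tmp)

# Declare types up front: double by default, factor for the categorical column.
demo = read_csv(tmp,
                col_types = cols(.default = col_double(),
                                 ocean_proximity = col_factor()))
levels(demo$ocean_proximity)
```

Supplying col_types also suppresses the column specification message, since nothing has to be guessed.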

The dataset contains 20,640 observations and 10 attributes (9 predictors and 1 response). Below is a list of the variables, with descriptions adapted from the Kaggle page given above.

  • longitude: A measure of how far west a house is; California longitudes are negative, so a lower (more negative) value is farther west
  • latitude: A measure of how far north a house is; a higher value is farther north
  • housing_median_age: Median age of a house within a block; a lower number is a newer building
  • total_rooms: Total number of rooms within a block
  • total_bedrooms: Total number of bedrooms within a block
  • population: Total number of people residing within a block
  • households: Total number of households (groups of people residing within a home unit) within a block
  • median_income: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
  • ocean_proximity: Location of the house relative to the ocean
  • median_house_value: Median house value for households within a block (measured in US Dollars)

This dataset meets all of the stated criteria for this project including:

  • A minimum of 200 observations
  • A numeric response variable: median_house_value
  • At least one categorical predictor: ocean_proximity
  • At least two numeric predictors: the remaining attributes
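The four criteria above can also be checked programmatically. The following is a minimal sketch, assuming a hypothetical helper named meets_criteria() and a simulated stand-in data frame; neither comes from the original notebook.

```r
# Hypothetical helper: check the four project criteria for a data frame.
meets_criteria = function(data, response) {
  predictors = data[, setdiff(names(data), response), drop = FALSE]
  is_num = vapply(predictors, is.numeric, logical(1))
  is_cat = vapply(predictors,
                  function(x) is.character(x) || is.factor(x), logical(1))
  c(min_200_rows           = nrow(data) >= 200,
    numeric_response       = is.numeric(data[[response]]),
    categorical_predictor  = sum(is_cat) >= 1,
    two_numeric_predictors = sum(is_num) >= 2)
}

# Stand-in data with the same kinds of columns (all values are simulated).
set.seed(1)
demo = data.frame(median_house_value = runif(250, 15000, 500000),
                  median_income      = runif(250, 0.5, 15),
                  population         = sample(3:35000, 250),
                  ocean_proximity    = sample(c("A", "B"), 250, replace = TRUE))
meets_criteria(demo, "median_house_value")
```

Calling meets_criteria(housing_data, "median_house_value") on the real data would be expected to return TRUE for all four entries, going by the column specification shown earlier.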

Let’s look at a summary of each variable:

summary(housing_data)
##    longitude       latitude  housing_median_age  total_rooms   
##  Min.   :-124   Min.   :33   Min.   : 1         Min.   :    2  
##  1st Qu.:-122   1st Qu.:34   1st Qu.:18         1st Qu.: 1448  
##  Median :-118   Median :34   Median :29         Median : 2127  
##  Mean   :-120   Mean   :36   Mean   :29         Mean   : 2636  
##  3rd Qu.:-118   3rd Qu.:38   3rd Qu.:37         3rd Qu.: 3148  
##  Max.   :-114   Max.   :42   Max.   :52         Max.   :39320  
##                                                                
##  total_bedrooms   population      households   median_income 
##  Min.   :   1   Min.   :    3   Min.   :   1   Min.   : 0.5  
##  1st Qu.: 296   1st Qu.:  787   1st Qu.: 280   1st Qu.: 2.6  
##  Median : 435   Median : 1166   Median : 409   Median : 3.5  
##  Mean   : 538   Mean   : 1425   Mean   : 500   Mean   : 3.9  
##  3rd Qu.: 647   3rd Qu.: 1725   3rd Qu.: 605   3rd Qu.: 4.7  
##  Max.   :6445   Max.   :35682   Max.   :6082   Max.   :15.0  
##  NA's   :207                                                 
##  median_house_value ocean_proximity   
##  Min.   : 14999     Length:20640      
##  1st Qu.:119600     Class :character  
##  Median :179700     Mode  :character  
##  Mean   :206856                       
##  3rd Qu.:264725                       
##  Max.   :500001                       
## 

Note that the total_bedrooms variable has 207 NA values. We will address this issue in the Data Cleaning section in Methods.
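Missing values can also be counted directly instead of being read off the summary. A small sketch on a made-up stand-in data frame (the values are illustrative, not taken from housing.csv):

```r
# Stand-in data frame; the values are made up for illustration.
demo = data.frame(total_rooms    = c(880, 7099, 1467, 1274),
                  total_bedrooms = c(129, NA, 190, NA))

na_counts = colSums(is.na(demo))   # number of missing values per column
na_counts[na_counts > 0]           # keep only the columns that have any
```

Applied to housing_data, the same two lines should report 207 for total_bedrooms and nothing else, going by the summary above.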

Below is a visual representation of all data points in the dataset, with longitude on the x-axis, latitude on the y-axis, median_house_value encoded by color, and population encoded by point size.

plot_map = ggplot(housing_data, 
                  aes(x = longitude, y = latitude, color = median_house_value, 
                      hma = housing_median_age, tr = total_rooms, tb = total_bedrooms,
                      hh = households, mi = median_income)) +
              geom_point(aes(size = population), alpha = 0.4) +
              xlab("Longitude") +
              ylab("Latitude") +
              ggtitle("Data Map - Longitude vs Latitude and Associated Variables") +
              theme(plot.title = element_text(hjust = 0.5)) +
              scale_color_distiller(palette = "Paired", labels = comma) +
              labs(color = "Median House Value (in $USD)", size = "Population")
plot_map