You may need to install Cairo on your operating system to run this notebook. To learn more, see the Cairo documentation at https://www.cairographics.org/download/.
if(!require(Cairo)) install.packages("Cairo", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(car)) install.packages("car", repos = "http://cran.us.r-project.org")
if(!require(nortest)) install.packages("nortest", repos = "http://cran.us.r-project.org")
library(readr)
library(ggplot2)
library(knitr)
library(tidyverse)
library(caret)
library(leaps)
library(car)
library(mice)
library(scales)
library(RColorBrewer)
library(plotly)
library(nortest)
library(lmtest)
The purpose of this project is to predict the price of houses in California in 1990 based on a number of possible location-based predictors, including latitude, longitude, and information about other houses within a particular block.
While this project focuses on prediction, we are fully aware that housing prices have increased dramatically since 1990, when the data was collected. This model should not be used to predict today’s housing prices in California. This is purely an academic endeavor to explore statistical prediction.
The goal of the project is to create the model that can best predict home prices in California given reasonable test/train splits in the data.
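As a rough illustration of what a test/train split involves, the sketch below holds out part of a small synthetic data frame, fits a linear model on the remainder, and scores it with RMSE on the held-out rows. The `toy` data, the 70/30 ratio, the choice of predictors, and the use of RMSE are all assumptions made for illustration; they are not the splits or models used in this project.

```r
# Hypothetical stand-in for the housing data; all values are simulated.
set.seed(42)
toy = data.frame(
  median_income      = runif(200, min = 0.5, max = 15),
  housing_median_age = sample(1:52, 200, replace = TRUE)
)
toy$median_house_value = 40000 * toy$median_income + rnorm(200, sd = 20000)

# Hold out 30% of the rows for testing (the 70/30 ratio is an assumption).
train_idx = sample(nrow(toy), size = round(0.7 * nrow(toy)))
train = toy[train_idx, ]
test  = toy[-train_idx, ]

# Fit on the training rows only, then score on the held-out rows.
fit  = lm(median_house_value ~ median_income + housing_median_age, data = train)
pred = predict(fit, newdata = test)
test_rmse = sqrt(mean((test$median_house_value - pred)^2))
test_rmse
```

Scoring only on rows the model never saw is what makes the error estimate a fair proxy for predictive performance.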
We’re using the California Housing Prices dataset (housing.csv) from the following Kaggle site: https://www.kaggle.com/camnugent/california-housing-prices. The data describe the houses found in a given California district, along with summary statistics about them based on the 1990 census.
housing_data = read_csv("housing.csv")
## Parsed with column specification:
## cols(
## longitude = col_double(),
## latitude = col_double(),
## housing_median_age = col_double(),
## total_rooms = col_double(),
## total_bedrooms = col_double(),
## population = col_double(),
## households = col_double(),
## median_income = col_double(),
## median_house_value = col_double(),
## ocean_proximity = col_character()
## )
The dataset contains \(20640\) observations and 10 attributes (9 predictors and 1 response). Below is a list of the variables with descriptions taken from the original Kaggle site given above.
longitude
: A measure of how far west a house is; a more negative value is farther west
latitude
: A measure of how far north a house is; a higher value is farther north
housing_median_age
: Median age of a house within a block; a lower number is a newer building
total_rooms
: Total number of rooms within a block
total_bedrooms
: Total number of bedrooms within a block
population
: Total number of people residing within a block
households
: Total number of households, a group of people residing within a home unit, for a block
median_income
: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
ocean_proximity
: Location of the house with respect to the ocean/sea
median_house_value
: Median house value for households within a block (measured in US Dollars)

This dataset meets all of the stated criteria for this project, including:
median_house_value
ocean_proximity
Let’s look at a summary of each variable:
summary(housing_data)
##    longitude       latitude   housing_median_age  total_rooms
##  Min.   :-124   Min.   :33   Min.   : 1          Min.   :    2
##  1st Qu.:-122   1st Qu.:34   1st Qu.:18          1st Qu.: 1448
##  Median :-118   Median :34   Median :29          Median : 2127
##  Mean   :-120   Mean   :36   Mean   :29          Mean   : 2636
##  3rd Qu.:-118   3rd Qu.:38   3rd Qu.:37          3rd Qu.: 3148
##  Max.   :-114   Max.   :42   Max.   :52          Max.   :39320
##
##  total_bedrooms   population      households    median_income
##  Min.   :   1    Min.   :    3   Min.   :   1   Min.   : 0.5
##  1st Qu.: 296    1st Qu.:  787   1st Qu.: 280   1st Qu.: 2.6
##  Median : 435    Median : 1166   Median : 409   Median : 3.5
##  Mean   : 538    Mean   : 1425   Mean   : 500   Mean   : 3.9
##  3rd Qu.: 647    3rd Qu.: 1725   3rd Qu.: 605   3rd Qu.: 4.7
##  Max.   :6445    Max.   :35682   Max.   :6082   Max.   :15.0
##  NA's   :207
##  median_house_value  ocean_proximity
##  Min.   : 14999      Length:20640
##  1st Qu.:119600      Class :character
##  Median :179700      Mode  :character
##  Mean   :206856
##  3rd Qu.:264725
##  Max.   :500001
##
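The summary reports ocean_proximity only as a character column (length, class, mode), which says nothing about its categories. A minimal sketch of how a factor conversion changes that is shown below; the `toy` data frame and its category labels are hypothetical stand-ins, and whether the project converts the column this way is an assumption.

```r
# Hypothetical five-row stand-in for housing_data; the labels are illustrative.
toy = data.frame(
  median_house_value = c(452600, 358500, 90100, 187500, 241400),
  ocean_proximity    = c("NEAR BAY", "NEAR BAY", "INLAND", "<1H OCEAN", "INLAND"),
  stringsAsFactors   = FALSE
)

# As a character column, summary() reports only length, class, and mode.
summary(toy$ocean_proximity)

# After conversion to a factor, summary() reports a count per level instead.
toy$ocean_proximity = as.factor(toy$ocean_proximity)
summary(toy$ocean_proximity)
levels(toy$ocean_proximity)
```

Modeling functions such as `lm()` treat a factor as a categorical predictor, so the per-level counts are also a quick check for sparsely populated categories.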
Note that the total_bedrooms variable has 207 NA values. We will address this issue in the Data Cleaning section in Methods.
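Missing values can also be counted directly rather than read off the summary table. The sketch below does this on a small hypothetical data frame (`toy`, with made-up values); the same two calls should work unchanged on the full dataset.

```r
# Hypothetical stand-in for housing_data with two missing total_bedrooms values.
toy = data.frame(
  total_rooms    = c(880, 7099, 1467, 1274, 1627),
  total_bedrooms = c(129, NA, 190, NA, 280),
  households     = c(126, 1138, 177, 219, 259)
)

# Count missing values per column; only total_bedrooms is non-zero here.
na_counts = colSums(is.na(toy))
na_counts

# Share of rows with no missing value in any column.
complete_share = mean(complete.cases(toy))
complete_share
```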
Below is a visual representation of all data points in the dataset, with longitude on the x-axis, latitude on the y-axis, median_house_value encoded by color, and population encoded by point size.
# Map every block by location: color encodes median house value, point size encodes population.
# The extra mappings (hma, tr, tb, hh, mi) are not standard ggplot2 aesthetics and do not change the static plot.
plot_map = ggplot(housing_data,
                  aes(x = longitude, y = latitude, color = median_house_value,
                      hma = housing_median_age, tr = total_rooms, tb = total_bedrooms,
                      hh = households, mi = median_income)) +
  geom_point(aes(size = population), alpha = 0.4) +
  xlab("Longitude") +
  ylab("Latitude") +
  ggtitle("Data Map - Longitude vs Latitude and Associated Variables") +
  theme(plot.title = element_text(hjust = 0.5)) +  # center the title
  scale_color_distiller(palette = "Paired", labels = comma) +
  labs(color = "Median House Value (in $USD)", size = "Population")
plot_map
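The non-standard mappings in the plot call (hma, tr, tb, hh, mi) have no effect on a static ggplot. One plausible use for them, given that plotly is loaded, is as hover-text fields in an interactive version of the map; this is our assumption rather than something the code above confirms. A minimal sketch on a hypothetical four-row data frame (`toy`, made-up values) follows.

```r
library(ggplot2)
library(plotly)

# Hypothetical stand-in for housing_data; the values are made up for illustration.
toy = data.frame(
  longitude          = c(-122.2, -118.3, -119.8, -117.1),
  latitude           = c(37.9, 34.1, 36.7, 32.7),
  median_house_value = c(452600, 187500, 90100, 241400),
  median_income      = c(8.3, 3.1, 2.0, 4.5)
)

# mi is not a ggplot2 aesthetic; it is simply carried in the plot-level mapping.
p = ggplot(toy, aes(x = longitude, y = latitude,
                    color = median_house_value, mi = median_income)) +
  geom_point()

# Convert to an interactive widget. tooltip = "all" (the default) asks plotly
# to include the mapped aesthetics in the hover text.
interactive_map = ggplotly(p, tooltip = "all")
interactive_map
```

Passing a character vector such as `tooltip = c("x", "y")` instead restricts the hover text to the named mappings.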