## Exercise 1 (longley Macroeconomic Data)

The built-in dataset longley contains macroeconomic data for predicting employment. We will attempt to model the Employed variable.
View(longley)
?longley
(a) What is the largest correlation between any pair of predictors in the dataset?
round(cor(longley), 3)
## GNP.deflator GNP Unemployed Armed.Forces Population Year
## GNP.deflator 1.000 0.992 0.621 0.465 0.979 0.991
## GNP 0.992 1.000 0.604 0.446 0.991 0.995
## Unemployed 0.621 0.604 1.000 -0.177 0.687 0.668
## Armed.Forces 0.465 0.446 -0.177 1.000 0.364 0.417
## Population 0.979 0.991 0.687 0.364 1.000 0.994
## Year 0.991 0.995 0.668 0.417 0.994 1.000
## Employed 0.971 0.984 0.502 0.457 0.960 0.971
## Employed
## GNP.deflator 0.971
## GNP 0.984
## Unemployed 0.502
## Armed.Forces 0.457
## Population 0.960
## Year 0.971
## Employed 1.000
\(\color{blue}{\text{The largest correlation between any pair of predictors is between $Year$ and $GNP$, which is $0.995$.}}\)
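A programmatic way to confirm this (a quick sketch: drop the response, zero the diagonal, and take the largest remaining entry):

# Sketch: largest pairwise correlation among the predictors only.
pred_cor = cor(longley[, -which(names(longley) == "Employed")])
diag(pred_cor) = 0
max(pred_cor)
which(pred_cor == max(pred_cor), arr.ind = TRUE) # the pair that attains it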
(b) Fit a model with Employed as the response and the remaining variables as predictors. Calculate and report the variance inflation factor (VIF) for each of the predictors. Which variable has the largest VIF? Do any of the VIFs suggest multicollinearity?
employed_model_add <- lm(Employed ~ ., data=longley)
summary(employed_model_add)
##
## Call:
## lm(formula = Employed ~ ., data = longley)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4101 -0.1577 -0.0282 0.1016 0.4554
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.48e+03 8.90e+02 -3.91 0.00356 **
## GNP.deflator 1.51e-02 8.49e-02 0.18 0.86314
## GNP -3.58e-02 3.35e-02 -1.07 0.31268
## Unemployed -2.02e-02 4.88e-03 -4.14 0.00254 **
## Armed.Forces -1.03e-02 2.14e-03 -4.82 0.00094 ***
## Population -5.11e-02 2.26e-01 -0.23 0.82621
## Year 1.83e+00 4.55e-01 4.02 0.00304 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.305 on 9 degrees of freedom
## Multiple R-squared: 0.995, Adjusted R-squared: 0.992
## F-statistic: 330 on 6 and 9 DF, p-value: 4.98e-10
library(faraway)
vif(employed_model_add)
## GNP.deflator GNP Unemployed Armed.Forces Population Year
## 135.532 1788.513 33.619 3.589 399.151 758.981
\(\color{blue}{\text{The VIFs for all of the predictors except $Armed.Forces$ are very large (well above the common rule-of-thumb cutoff of 5), so there is likely multicollinearity.}}\)
vif(employed_model_add)[which.max(vif(employed_model_add))]
## GNP
## 1789
\(\color{blue}{\text{$GNP$ has the largest VIF, approximately $1788.5$.}}\)
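As a cross-check, a VIF can be computed from first principles: regress a predictor on the other predictors and use \(\text{VIF}_j = 1 / (1 - R_j^2)\). A minimal sketch for GNP:

# Sketch: GNP's VIF from the R^2 of regressing GNP on the other predictors.
gnp_on_rest = lm(GNP ~ . - Employed, data = longley)
1 / (1 - summary(gnp_on_rest)$r.squared) # should match vif() above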
(c) What proportion of the observed variation in Population is explained by a linear relationship with the other predictors?
population_model_no_emp <- lm(Population ~ . - Employed, data=longley)
summary(population_model_no_emp)$r.squared
## [1] 0.9975
\(\color{blue}{\text{About $99.75\%$ of the observed variation in $Population$ is explained by a linear relationship with the other predictors.}}\)
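Note the connection to part (b): by the identity above, this \(R^2\) recovers the VIF reported for Population.

# Quick check: 1 / (1 - R^2) reproduces Population's VIF (~399.15).
1 / (1 - summary(population_model_no_emp)$r.squared)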
(d) Calculate the partial correlation coefficient for Population and Employed with the effects of the other predictors removed.
employed_model_no_pop <- lm(Employed ~ . - Population, data = longley)
cor(resid(population_model_no_emp), resid(employed_model_no_pop))
## [1] -0.07514
\(\color{blue}{\text{The partial correlation coefficient for $Population$ and $Employed$ is $-0.0751$.}}\)
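As a consistency check (a sketch using the standard identity \(t = r\sqrt{\nu} / \sqrt{1 - r^2}\), where \(\nu = 9\) is the residual degrees of freedom of the full model), this partial correlation recovers the \(t\) value reported for Population in part (b):

# Sketch: the partial correlation reproduces Population's t value (~ -0.23).
r = cor(resid(population_model_no_emp), resid(employed_model_no_pop))
r * sqrt(9) / sqrt(1 - r^2)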
(e) Fit a new model with Employed as the response and the predictors from the model in (b) that were significant. (Use \(\alpha = 0.05\).) Calculate and report the variance inflation factor for each of the predictors. Which variable has the largest VIF? Do any of the VIFs suggest multicollinearity?
employed_model_small <- lm(Employed ~ Unemployed + Armed.Forces + Year, data = longley)
vif(employed_model_small)
## Unemployed Armed.Forces Year
## 3.318 2.223 3.891
\(\color{blue}{\text{$Year$ has the largest VIF, $3.891$. None of the VIFs are large enough to suggest multicollinearity.}}\)
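This is consistent with the correlation matrix from part (a); a quick sketch restricted to the three remaining predictors:

# Sketch: pairwise correlations among the remaining predictors are modest.
round(cor(longley[, c("Unemployed", "Armed.Forces", "Year")]), 3)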
(f) Use an \(F\)-test to compare the models in parts (b) and (e). Report the following: the null hypothesis, the test statistic, the p-value of the test, a statistical decision using \(\alpha = 0.05\), and which model you prefer.
anova(employed_model_small, employed_model_add)
## Analysis of Variance Table
##
## Model 1: Employed ~ Unemployed + Armed.Forces + Year
## Model 2: Employed ~ GNP.deflator + GNP + Unemployed + Armed.Forces + Population +
## Year
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 12 1.323
## 2 9 0.836 3 0.487 1.75 0.23
\(\color{blue}{\text{The null hypothesis is $H_0: \beta_{GNP.deflator} = \beta_{GNP} = \beta_{Population} = 0$, i.e., the three dropped predictors contribute nothing beyond the smaller model.}}\) \(\color{blue}{\text{The alternative is $H_1:$ at least one of $\beta_{GNP.deflator}, \beta_{GNP}, \beta_{Population} \ne 0$.}}\) \(\color{blue}{\text{From the ANOVA table the test statistic is $F = 1.75$ with a p-value of $0.23$; since $0.23 > \alpha = 0.05$, we fail to reject the null hypothesis and prefer the smaller model from part (e).}}\)
(g) Check the assumptions of the model chosen in part (f). Do any assumptions appear to be violated?
plot_fitted_resid = function(model, pointcol = "dodgerblue", linecol = "darkorange") {
plot(fitted(model), resid(model),
col = pointcol, pch = 20, cex = 1.5,
xlab = "Fitted", ylab = "Residuals", main = "Fitted vs. Residuals")
abline(h = 0, col = linecol, lwd = 2)
}
plot_qq = function(model, pointcol = "dodgerblue", linecol = "darkorange") {
qqnorm(resid(model), col = pointcol, pch = 20, cex = 1.5)
qqline(resid(model), col = linecol, lwd = 2)
}
par(mfrow = c(1, 2))
plot_fitted_resid(employed_model_small)
plot_qq(employed_model_small)
\(\color{blue}{\text{The assumptions do not appear to be clearly violated, although with such a small dataset it is hard to tell.}}\) \(\color{blue}{\text{The Fitted vs. Residuals plot shows a couple of points that should probably be scrutinized as potential outliers.}}\) \(\color{blue}{\text{The tails on the Normal Q-Q plot drift slightly off the line, so normality could be questioned, but only a few points deviate and not by much.}}\)
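A numeric complement to the plots (a sketch using bptest() from the lmtest package for constant variance and shapiro.test() for normality of the residuals):

# Sketch: formal tests to accompany the diagnostic plots.
library(lmtest)
bptest(employed_model_small) # Breusch-Pagan: constant variance
shapiro.test(resid(employed_model_small)) # Shapiro-Wilk: normality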
## Exercise 2 (Credit Data)

For this exercise, use the Credit data from the ISLR package. Use the following code to remove the ID variable, which is not useful for modeling.
library(ISLR)
data(Credit)
Credit = subset(Credit, select = -c(ID))
Use ?Credit to learn about this dataset.
#?Credit
(a) Find a “good” model for balance using the available predictors. Use any methods seen in class except transformations of the response. The model should:

- reach a LOOCV-RMSE below 135,
- obtain an adjusted \(R^2\) above 0.90,
- fail to reject the Breusch-Pagan test with \(\alpha = 0.01\),
- use fewer than 10 \(\beta\) parameters.

Store your model in a variable called mod_a. Run the two given chunks to verify your model meets the requested criteria. If you cannot find a model that meets all criteria, partial credit will be given for meeting at least some of the criteria.
library(lmtest)
get_bp_decision = function(model, alpha) {
decide = unname(bptest(model)$p.value < alpha)
ifelse(decide, "Reject", "Fail to Reject")
}
get_sw_decision = function(model, alpha) {
decide = unname(shapiro.test(resid(model))$p.value < alpha)
ifelse(decide, "Reject", "Fail to Reject")
}
get_num_params = function(model) {
length(coef(model))
}
get_loocv_rmse = function(model) {
sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
get_adj_r2 = function(model) {
summary(model)$adj.r.squared
}
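The get_loocv_rmse() helper uses the standard leave-one-out shortcut for linear models, which avoids refitting the model \(n\) times:

\[ \text{RMSE}_{\text{LOOCV}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_i} \right)^2 } \]

where \(e_i\) are the residuals and \(h_i\) the leverages (hat values) of the fitted model.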
\(\color{blue}{\text{Helper method for checking all the conditions in the problem.}}\)
check_all_conditions = function(model, model_name, test_type = "bp", loocv_limit = 135, adj_r2_limit = 0.90, max_param = 10) {
df_conditions = data.frame(loocv = double(), loocv_accept = logical(),
adj_r2 = double(), adj_r2_accept = logical(),
test_decision=character(), test_decision_accept = logical(),
num_params = integer(), num_params_accept = logical(),
overall_accept = logical())
df_conditions[model_name, "loocv"] = get_loocv_rmse(model)
df_conditions[model_name, "loocv_accept"] = df_conditions[model_name, "loocv"] < loocv_limit
df_conditions[model_name, "adj_r2"] = get_adj_r2(model)
df_conditions[model_name, "adj_r2_accept"] = df_conditions[model_name, "adj_r2"] > adj_r2_limit
if (test_type == "sw") {
df_conditions[model_name, "test_decision"] = get_sw_decision(model, alpha = 0.01)
} else {
df_conditions[model_name, "test_decision"] = get_bp_decision(model, alpha = 0.01)
}
df_conditions[model_name, "test_decision_accept"] = grepl("Fail", df_conditions[model_name, "test_decision"], fixed = TRUE)
df_conditions[model_name, "num_params"] = get_num_params(model)
df_conditions[model_name, "num_params_accept"] = get_num_params(model) < max_param
df_conditions[model_name, "overall_accept"] = all(df_conditions[model_name, "loocv_accept"], df_conditions[model_name, "adj_r2_accept"], df_conditions[model_name, "test_decision_accept"], df_conditions[model_name, "num_params_accept"], na.rm = FALSE)
return(df_conditions)
}
balance_add = lm(Balance ~ . , data = Credit)
check_all_conditions(balance_add, "balance_add", "bp")
## loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_add 100.4 TRUE 0.9538 TRUE Reject
## test_decision_accept num_params num_params_accept overall_accept
## balance_add FALSE 12 FALSE FALSE
\(\color{blue}{\text{Obviously, there are too many parameters, so we cannot use the full additive model.}}\) \(\color{blue}{\text{The additive model also fails the Breusch-Pagan test at $\alpha = 0.01$. Let's look for collinearity.}}\)
pairs(Credit, col = "dodgerblue")
\(\color{blue}{\text{It looks like Limit and Rating have a strong linear relationship, so we will likely only need one of them.}}\) \(\color{blue}{\text{Gender, Student, and Married appear to be binary, and Ethnicity appears to have 3 groups based on the visual.}}\) \(\color{blue}{\text{For the first pass we will leave these out as well.}}\)
balance_5pred = lm(Balance ~ Income + Limit + Cards + Age + Education, data = Credit)
check_all_conditions(balance_5pred, "balance_5pred", "bp")
## loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_5pred 164 FALSE 0.8746 FALSE Fail to Reject
## test_decision_accept num_params num_params_accept overall_accept
## balance_5pred TRUE 6 TRUE FALSE
\(\color{blue}{\text{This meets the condition that we get a "Fail to Reject", but fails the LOOCV-RMSE and adjusted $R^2$ requirements.}}\) \(\color{blue}{\text{So let's add the binary predictors back in and try to stay under the limit of 10 parameters.}}\)
balance_8pred = lm(Balance ~ Income + Limit + Cards + Age + Education + Gender + Student + Married, data = Credit)
check_all_conditions(balance_8pred, "balance_8pred", "bp")
## loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_8pred 100.4 TRUE 0.9534 TRUE Reject
## test_decision_accept num_params num_params_accept overall_accept
## balance_8pred FALSE 9 TRUE FALSE
\(\color{blue}{\text{That puts us back to failing the Breusch-Pagan criterion.}}\) \(\color{blue}{\text{Let's look at our assumptions with this model.}}\)
par(mfrow = c(1, 2))
plot_fitted_resid(balance_8pred)
plot_qq(balance_8pred)
\(\color{blue}{\text{Constant variance seems to be off, and the tails on the Normal Q-Q plot are running away a little.}}\) \(\color{blue}{\text{Let's try taking the log of one of the predictors,}}\) \(\color{blue}{\text{stepping back to our model with 5 predictors that was able to get a Fail to Reject.}}\)
balance_5logincome = lm(Balance ~ log(Income) + Limit + Cards + Age + Education, data = Credit)
check_all_conditions(balance_5logincome, "balance_5logincome", "bp")
## loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_5logincome 182.3 FALSE 0.8451 FALSE Fail to Reject
## test_decision_accept num_params num_params_accept
## balance_5logincome TRUE 6 TRUE
## overall_accept
## balance_5logincome FALSE
\(\color{blue}{\text{OK we can still get Fail to Reject with this. Do our assumptions look any better?}}\)
par(mfrow = c(1, 2))
plot_fitted_resid(balance_5logincome)
plot_qq(balance_5logincome)
\(\color{blue}{\text{The Q-Q Plot appears to be closer to the line on the low end at least. The Fitted vs. Residuals is far from perfect, but does look more balanced.}}\) \(\color{blue}{\text{Let's try another predictor.}}\)
balance_5loglimit = lm(Balance ~ Income + log(Limit) + Cards + Age + Education, data = Credit)
check_all_conditions(balance_5loglimit, "balance_5loglimit", "bp")
## loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_5loglimit 259.1 FALSE 0.6895 FALSE Reject
## test_decision_accept num_params num_params_accept
## balance_5loglimit FALSE 6 TRUE
## overall_accept
## balance_5loglimit FALSE
\(\color{blue}{\text{That did not get a "Fail to Reject."}}\) \(\color{blue}{\text{How about taking the log of two predictors.}}\)
balance_5logincomelimit = lm(Balance ~ log(Income) + log(Limit) + Cards + Age + Education, data = Credit)
check_all_conditions(balance_5logincomelimit, "balance_5logincomelimit", "bp")
## loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_5logincomelimit 252.3 FALSE 0.7046 FALSE Reject
## test_decision_accept num_params num_params_accept
## balance_5logincomelimit FALSE 6 TRUE
## overall_accept
## balance_5logincomelimit FALSE
\(\color{blue}{\text{That did not get a Fail to Reject.}}\) \(\color{blue}{\text{How about log on income from the larger additive model?}}\)
balance_8logincome = lm(Balance ~ log(Income) + Limit + Cards + Age + Education + Gender + Student + Married, data = Credit)
check_all_conditions(balance_8logincome, "balance_8logincome", "bp")
## loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_8logincome 131.5 TRUE 0.9206 TRUE Fail to Reject
## test_decision_accept num_params num_params_accept
## balance_8logincome TRUE 9 TRUE
## overall_accept
## balance_8logincome TRUE
\(\color{blue}{\text{That model meets all of the criteria.}}\) \(\color{blue}{\text{NOTE: there was a LOT more guessing and checking that was left out of this discussion.}}\)
mod_a = balance_8logincome
get_loocv_rmse(mod_a)
## [1] 131.5
get_adj_r2(mod_a)
## [1] 0.9206
get_bp_decision(mod_a, alpha = 0.01)
## [1] "Fail to Reject"
get_num_params(mod_a)
## [1] 9
(b) Find another “good” model for balance using the available predictors. Use any methods seen in class except transformations of the response. The model should:

- reach a LOOCV-RMSE below 125,
- obtain an adjusted \(R^2\) above 0.91,
- fail to reject the Shapiro-Wilk test with \(\alpha = 0.01\),
- use fewer than 25 \(\beta\) parameters.

Store your model in a variable called mod_b. Run the two given chunks to verify your model meets the requested criteria. If you cannot find a model that meets all criteria, partial credit will be given for meeting at least some of the criteria.
library(lmtest)
get_bp_decision = function(model, alpha) {
decide = unname(bptest(model)$p.value < alpha)
ifelse(decide, "Reject", "Fail to Reject")
}
get_sw_decision = function(model, alpha) {
decide = unname(shapiro.test(resid(model))$p.value < alpha)
ifelse(decide, "Reject", "Fail to Reject")
}
get_num_params = function(model) {
length(coef(model))
}
get_loocv_rmse = function(model) {
sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
get_adj_r2 = function(model) {
summary(model)$adj.r.squared
}
balance_add = lm(Balance ~ . , data = Credit)
check_all_conditions(balance_add, "balance_add", "sw", 125, 0.91, 25)
## loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_add 100.4 TRUE 0.9538 TRUE Reject
## test_decision_accept num_params num_params_accept overall_accept
## balance_add FALSE 12 TRUE FALSE
\(\color{blue}{\text{The full additive model does not meet the Shapiro-Wilk criterion with an $\alpha$ of $0.01$.}}\) \(\color{blue}{\text{Let's try the model that worked for part (a).}}\)
balance_8logincome = lm(Balance ~ log(Income) + Limit + Cards + Age + Education + Gender + Student + Married, data = Credit)
check_all_conditions(balance_8logincome, "balance_8logincome", "sw", 125, 0.91, 25)
## loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_8logincome 131.5 TRUE 0.9206 TRUE Reject
## test_decision_accept num_params num_params_accept
## balance_8logincome FALSE 9 TRUE
## overall_accept
## balance_8logincome FALSE
\(\color{blue}{\text{That didn't work either, so we guess and check a number of other things.}}\)
balance_5logincome = lm(Balance ~ log(Income) + Limit + Cards + Age + Education, data = Credit)
check_all_conditions(balance_5logincome, "balance_5logincome", "sw", 125, 0.91, 25)
## loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_5logincome 182.3 FALSE 0.8451 FALSE Reject
## test_decision_accept num_params num_params_accept
## balance_5logincome FALSE 6 TRUE
## overall_accept
## balance_5logincome FALSE
balance_5logincome_3 = lm(Balance ~ (log(Income) + Limit + Cards + Age + Education)^3, data = Credit)
check_all_conditions(balance_5logincome_3, "balance_5logincome_3", "sw", 125, 0.91, 25)
## loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_5logincome_3 181.7 FALSE 0.8547 FALSE Reject
## test_decision_accept num_params num_params_accept
## balance_5logincome_3 FALSE 26 FALSE
## overall_accept
## balance_5logincome_3 FALSE
balance_8logincome_3 = lm(Balance ~ (log(Income) + Limit + Cards + Age + Education + Gender + Student + Married)^3, data = Credit)
#summary(balance_8logincome_3)$coef
check_all_conditions(balance_8logincome_3, "balance_8logincome_3", "sw", 125, 0.91, 25)
## loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_8logincome_3 146.8 FALSE 0.935 TRUE Fail to Reject
## test_decision_accept num_params num_params_accept
## balance_8logincome_3 TRUE 93 FALSE
## overall_accept
## balance_8logincome_3 FALSE
# k = log(length(Credit)) uses the column count (11); log(nrow(Credit))
# would give the usual BIC penalty of log(n).
balance_8_back_bic = step(balance_8logincome_3, scope = Balance ~ (log(Income) + Limit + Cards + Age + Education + Gender + Student + Married) ^ 3, direction = "backward", k = log(length(Credit)), trace = 0)
#summary(balance_8_back_bic)$coef
check_all_conditions(balance_8_back_bic, "balance_8_back_bic", "bp")
## loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_8_back_bic 115.7 TRUE 0.941 TRUE Reject
## test_decision_accept num_params num_params_accept
## balance_8_back_bic FALSE 30 FALSE
## overall_accept
## balance_8_back_bic FALSE
balance_5_back_bic = step(balance_5logincome_3, scope = Balance ~ (log(Income) + Limit + Cards + Age + Education) ^ 3, direction = "backward", k = log(length(Credit)), trace = 0) # same k caveat as above
check_all_conditions(balance_5_back_bic, "balance_5_back_bic", "bp")
## loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_5_back_bic 175.5 FALSE 0.8574 FALSE Fail to Reject
## test_decision_accept num_params num_params_accept
## balance_5_back_bic TRUE 9 TRUE
## overall_accept
## balance_5_back_bic FALSE
balance_5_back_aic = step(balance_5logincome_3, scope = Balance ~ (log(Income) + Limit + Cards + Age + Education) ^ 3, direction = "backward", trace = 0)
check_all_conditions(balance_5_back_aic, "balance_5_back_aic", "bp")
## loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_5_back_aic 175.5 FALSE 0.8574 FALSE Fail to Reject
## test_decision_accept num_params num_params_accept
## balance_5_back_aic TRUE 9 TRUE
## overall_accept
## balance_5_back_aic FALSE
balance_6_both_aic = step(balance_5logincome_3, scope = Balance ~ (log(Income) + Limit + Cards + Age + Education + Student) ^ 3, direction = "backward", trace = 0)
summary(balance_6_both_aic)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 121.21277 256.967166 0.4717 6.374e-01
## log(Income) -215.90646 69.891341 -3.0892 2.151e-03
## Limit 0.34381 0.020735 16.5813 3.848e-47
## Cards 17.92309 6.390699 2.8046 5.290e-03
## Age 3.50986 2.702251 1.2989 1.948e-01
## Education -29.91032 14.343951 -2.0852 3.770e-02
## log(Income):Limit -0.02626 0.004987 -5.2660 2.306e-07
## log(Income):Age -1.23437 0.732579 -1.6850 9.279e-02
## log(Income):Education 8.67205 3.922167 2.2110 2.761e-02
check_all_conditions(balance_6_both_aic, "balance_6_both_aic", "bp")
## loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_6_both_aic 175.5 FALSE 0.8574 FALSE Fail to Reject
## test_decision_accept num_params num_params_accept
## balance_6_both_aic TRUE 9 TRUE
## overall_accept
## balance_6_both_aic FALSE
\(\color{blue}{\text{A lot of brute-force trial and error eventually led here.}}\)
balance_4logincome_3 = lm(Balance ~ (log(Income) + Limit + Cards + Student)^3, data = Credit)
check_all_conditions(balance_4logincome_3, "balance_4logincome_3", "sw", 125, 0.91, 25)
## loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_4logincome_3 119.5 TRUE 0.9354 TRUE Fail to Reject
## test_decision_accept num_params num_params_accept
## balance_4logincome_3 TRUE 15 TRUE
## overall_accept
## balance_4logincome_3 TRUE
mod_b = balance_4logincome_3
get_loocv_rmse(mod_b)
## [1] 119.5
get_adj_r2(mod_b)
## [1] 0.9354
get_sw_decision(mod_b, alpha = 0.01)
## [1] "Fail to Reject"
get_num_params(mod_b)
## [1] 15
## Exercise 3 (Sacramento Housing Data)

For this exercise, use the Sacramento data from the caret package. Use the following code to perform some preprocessing of the data.
# NOTE for Mac, I needed to run `brew install libomp` for caret to install
library(caret)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:faraway':
##
## melanoma
## Loading required package: ggplot2
library(ggplot2)
data(Sacramento)
sac_data = Sacramento
sac_data$limits = factor(ifelse(sac_data$city == "SACRAMENTO", "in", "out"))
sac_data = subset(sac_data, select = -c(city, zip))
Instead of using the city or zip variables that exist in the dataset, we will simply create a variable (limits) indicating whether or not a house is technically within the city limits of Sacramento. (We do this because they would both be factor variables with a large number of levels. This is a choice that is made due to laziness, not necessarily because it is justified. Think about what issues these variables might cause.)
Use ?Sacramento to learn more about this dataset.
#?Sacramento
A plot of longitude versus latitude gives us a sense of where the city limits are.
qplot(y = longitude, x = latitude, data = sac_data,
col = limits, main = "Sacramento City Limits ")
After these modifications, we test-train split the data.
set.seed(420)
sac_trn_idx = sample(nrow(sac_data), size = trunc(0.80 * nrow(sac_data)))
sac_trn_data = sac_data[sac_trn_idx, ]
sac_tst_data = sac_data[-sac_trn_idx, ]
The training data should be used for all model fitting. Our goal is to find a model that is useful for predicting home prices.
(a) Find a “good” model for price. Use any methods seen in class. The model should reach a LOOCV-RMSE below 77,500 in the training data. Do not use any transformations of the response variable.
\(\color{blue}{\text{Helper function to check if the criteria are met or not.}}\)
check_loocv = function(model, model_name, loocv_limit = 77500) {
df_conditions = data.frame(loocv = double(), loocv_accept = logical())
df_conditions[model_name, "loocv"] = get_loocv_rmse(model)
df_conditions[model_name, "loocv_accept"] = df_conditions[model_name, "loocv"] < loocv_limit
return(df_conditions)
}
\(\color{blue}{\text{First we will try the full additive model.}}\)
sac_add = lm(price ~ ., data = sac_trn_data)
#summary(sac_add)
check_loocv(sac_add, "sac_add")
## loocv loocv_accept
## sac_add 78671 FALSE
\(\color{blue}{\text{This is just above our criterion of a leave-one-out cross-validation root mean square error below 77,500.}}\) \(\color{blue}{\text{Let's try a smaller model first to see what happens.}}\)
sac_5p = lm(price ~ beds + sqft + type + latitude + longitude, data = sac_trn_data)
#summary(sac_5p)
check_loocv(sac_5p, "sac_5p")
## loocv loocv_accept
## sac_5p 78623 FALSE
\(\color{blue}{\text{Interestingly, this is similarly just above our criterion.}}\) \(\color{blue}{\text{Let's try a bigger model with all two-way interactions of these variables.}}\)
sac_2full = lm(price ~ (beds + sqft + type + latitude + longitude)^2, data = sac_trn_data)
#summary(sac_2full)
check_loocv(sac_2full, "sac_2full")
## loocv loocv_accept
## sac_2full 78811 FALSE
\(\color{blue}{\text{Again, this is just above our criterion.}}\) \(\color{blue}{\text{Let's try backward AIC.}}\)
sac_backward_aic = step(sac_5p, scope = price ~ (beds + sqft + type + latitude + longitude)^2, direction = "backward", trace = 0)
summary(sac_backward_aic)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16480589.3 3.180e+06 5.182 2.837e-07
## beds -22918.8 4.963e+03 -4.617 4.582e-06
## sqft 150.2 6.006e+00 25.011 1.801e-100
## typeMulti_Family -13834.3 2.676e+04 -0.517 6.053e-01
## typeResidential 36838.4 1.259e+04 2.927 3.530e-03
## latitude 49050.1 2.330e+04 2.105 3.560e-02
## longitude 151108.9 2.248e+04 6.723 3.575e-11
vif(sac_backward_aic)
## beds sqft typeMulti_Family typeResidential
## 2.423 2.218 1.386 1.454
## latitude longitude
## 1.194 1.246
check_loocv(sac_backward_aic, "sac_backward_aic")
## loocv loocv_accept
## sac_backward_aic 78623 FALSE
\(\color{blue}{\text{Still just above our criterion of 77,500.}}\) \(\color{blue}{\text{Let's try a both-directions AIC search.}}\)
sac_both_aic = step(sac_5p, scope = price ~ (beds + sqft + type + latitude + longitude)^2, direction = "both", trace = 0)
summary(sac_both_aic)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.300e+09 8.772e+08 -1.4818 1.388e-01
## beds 1.525e+07 3.017e+06 5.0535 5.484e-07
## sqft 2.102e+02 1.691e+01 12.4319 2.475e-32
## typeMulti_Family -1.548e+07 7.844e+06 -1.9733 4.884e-02
## typeResidential 8.450e+05 3.776e+06 0.2238 8.230e-01
## latitude 3.280e+07 2.264e+07 1.4491 1.477e-01
## longitude -1.069e+07 7.228e+06 -1.4791 1.395e-01
## beds:longitude 1.256e+05 2.484e+04 5.0547 5.449e-07
## beds:sqft -1.586e+01 3.979e+00 -3.9861 7.389e-05
## typeMulti_Family:latitude 4.006e+05 2.031e+05 1.9718 4.901e-02
## typeResidential:latitude -2.138e+04 9.771e+04 -0.2189 8.268e-01
## latitude:longitude 2.698e+05 1.865e+05 1.4464 1.485e-01
vif(sac_both_aic)
## beds sqft typeMulti_Family
## 936540.59 18.39 124544.49
## typeResidential latitude longitude
## 136888.83 1179479.77 134787.23
## beds:longitude beds:sqft typeMulti_Family:latitude
## 934784.96 32.77 124455.81
## typeResidential:latitude latitude:longitude
## 136527.36 1007863.54
check_loocv(sac_both_aic, "sac_both_aic")
## loocv loocv_accept
## sac_both_aic 77077 TRUE
\(\color{blue}{\text{We now have a LOOCV-RMSE of 77,076.5, which is below 77,500.}}\) \(\color{blue}{\text{(The enormous VIFs are expected here: interaction terms are strongly correlated with their main effects, which is not a concern when the goal is prediction.)}}\) \(\color{blue}{\text{We will use this model in part (b).}}\)
sac_model = sac_both_aic
(b) Is a model that achieves a LOOCV-RMSE below 77,500 useful in this case? That is, is an average error of 77,500 low enough when predicting home prices? To further investigate, use the held-out test data and your model from part (a) to do two things:

- calculate the average percent error of the predictions, and
- plot the predicted versus the actual values, adding the line \(y = x\).

Based on all of this information, argue whether or not this model is useful.
actual = sac_tst_data$price
predicted = predict(sac_model, newdata = sac_tst_data)
# As discussed at: https://piazza.com/class/jviiq7m0zx16ar?cid=962
# Assignment asks for dividing by "predicted"
100 * mean((abs(predicted - actual)/ predicted))
## [1] 22.94
# However this is actually probably more correct:
100 * mean((abs(actual - predicted)) / actual)
## [1] 24.72
plot(actual ~ predicted, col = "dodgerblue", xlab = "Predicted Home Prices", ylab = "Actual Home Prices")
abline(a = 0, b = 1, lwd = 3, col = "darkorange")
\(\color{blue}{\text{The usefulness of a \$77,500 root mean square error depends heavily on the price of the house.}}\) \(\color{blue}{\text{For a house on the low end, around \$100,000, that error would be far too large to be acceptable, since the prediction could be off by almost 80\%; for a \$750,000 home it would only be off by about 10\%.}}\) \(\color{blue}{\text{Looking at the percent error, the average is around 25\%, which still seems somewhat high. However, the plot shows a few outliers predicted well above their actual prices that may be skewing the average; most of the points sit close to the line.}}\)
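For completeness, a sketch of the corresponding test-set RMSE using the actual and predicted values computed above:

# Sketch: RMSE on the held-out test data.
sqrt(mean((actual - predicted) ^ 2))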
In this exercise, we will investigate how well backwards AIC and BIC actually perform. For either to be “working” correctly, they should result in a low number of both false positives and false negatives. In model selection, a false positive means selecting a variable that is not in the true model, while a false negative means failing to select a variable that is in the true model.
Consider the true model
\[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 + \beta_8 x_8 + \beta_9 x_9 + \beta_{10} x_{10} + \epsilon \]
where \(\epsilon \sim N(0, \sigma^2 = 4)\). The true values of the \(\beta\) parameters are given in the R code below.
beta_0 = 1
beta_1 = -1
beta_2 = 2
beta_3 = -2
beta_4 = 1
beta_5 = 1
beta_6 = 0
beta_7 = 0
beta_8 = 0
beta_9 = 0
beta_10 = 0
sigma = 2
Then, as we have specified them, some variables are significant, and some are not. We store their names in R variables for use later.
not_sig = c("x_6", "x_7", "x_8", "x_9", "x_10")
signif = c("x_1", "x_2", "x_3", "x_4", "x_5")
We now simulate values for these x variables, which we will use throughout part (a).
#set.seed(420)
set.seed(19781105)
n = 100
x_1 = runif(n, 0, 10)
x_2 = runif(n, 0, 10)
x_3 = runif(n, 0, 10)
x_4 = runif(n, 0, 10)
x_5 = runif(n, 0, 10)
x_6 = runif(n, 0, 10)
x_7 = runif(n, 0, 10)
x_8 = runif(n, 0, 10)
x_9 = runif(n, 0, 10)
x_10 = runif(n, 0, 10)
We then combine these into a data frame and simulate y according to the true model.
sim_data_1 = data.frame(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10,
y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 +
beta_5 * x_5 + rnorm(n, 0 , sigma)
)
We do a quick check to make sure everything looks correct.
head(sim_data_1)
## x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 y
## 1 1.4992 2.0825 2.077 1.57392 4.559 1.599 5.9728 9.0543 0.5824 1.784 7.769
## 2 9.4861 0.8646 1.022 9.95006 3.992 7.784 3.9387 4.7519 7.2986 3.345 4.192
## 3 8.2667 7.3802 2.841 0.06523 6.536 8.788 5.8062 7.1948 8.9837 8.792 9.889
## 4 7.3608 6.8598 5.306 7.31392 1.649 3.635 0.5872 5.3727 0.6993 6.302 7.978
## 5 9.3510 4.9396 2.094 4.77580 4.253 9.805 5.6786 0.5231 7.6863 2.535 6.300
## 6 0.2112 3.5737 7.223 9.55127 4.775 4.650 9.4180 9.6812 5.2140 8.724 6.618
Now, we fit an incorrect model.
fit = lm(y ~ x_1 + x_2 + x_6 + x_7, data = sim_data_1)
coef(fit)
## (Intercept) x_1 x_2 x_6 x_7
## 4.6966 -1.3232 1.3133 0.4112 -0.1123
Notice, we have coefficients for x_1, x_2, x_6, and x_7. This means that x_6 and x_7 are false positives, while x_3, x_4, and x_5 are false negatives.
To detect the false negatives, use:
# which are false negatives?
!(signif %in% names(coef(fit)))
## [1] FALSE FALSE TRUE TRUE TRUE
To detect the false positives, use:
# which are false positives?
names(coef(fit)) %in% not_sig
## [1] FALSE FALSE FALSE TRUE TRUE
Note that in both cases, you could sum() the result to obtain the number of false negatives or positives.
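For instance, with the fit above:

sum(!(signif %in% names(coef(fit)))) # 3 false negatives: x_3, x_4, x_5
sum(names(coef(fit)) %in% not_sig) # 2 false positives: x_6, x_7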
(a) Set a seed equal to your birthday; then, using the given data for each x variable above in sim_data_1, simulate the response variable y 300 times. Each time, fit an additive model using all of the x variables, then perform backwards variable selection using both AIC and BIC. Calculate the rate of false positives and negatives for both AIC and BIC. Compare the rates between the two methods. Arrange your results in a well formatted table.
birthday = 19781105
set.seed(birthday)
num_sims = 300
false_negative_aic_sim1 = rep(0,num_sims)
false_negative_bic_sim1 = rep(0,num_sims)
false_positive_aic_sim1 = rep(0,num_sims)
false_positive_bic_sim1 = rep(0,num_sims)
for(i in 1:num_sims){
sim_data_1$y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 + beta_5 * x_5 + rnorm(n, 0, sigma)
model = lm(y ~ ., data = sim_data_1)
len = length(resid(model))
model_aic = step(model, direction = "backward", trace = 0)
model_bic = step(model, direction = "backward", k=log(len),trace = 0)
false_negative_aic_sim1[i] = sum(!(signif %in% names(coef(model_aic))))
false_positive_aic_sim1[i] = sum(names(coef(model_aic)) %in% not_sig)
false_negative_bic_sim1[i] = sum(!(signif %in% names(coef(model_bic))))
false_positive_bic_sim1[i] = sum(names(coef(model_bic)) %in% not_sig)
}
library(knitr)
table <- data.frame(Method = c("AIC","BIC"),
False_Negative = c(sum(false_negative_aic_sim1)/num_sims,sum(false_negative_bic_sim1)/num_sims),
False_Positive = c(sum(false_positive_aic_sim1)/num_sims,sum(false_positive_bic_sim1)/num_sims ))
kable(table,caption="Model Selection Table")
| Method | False_Negative | False_Positive |
|--------|----------------|----------------|
| AIC    | 0              | 0.92           |
| BIC    | 0              | 0.23           |
\(\color{blue}{\text{For both AIC and BIC the false negative rate is 0, so neither method ever drops a truly significant variable; when they err, it is by keeping extra variables, meaning the selected models tend to be somewhat too large.}}\) \(\color{blue}{\text{BIC returns fewer false positives, which is what we would expect, since BIC penalizes model size more heavily and so produces smaller models.}}\)
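The difference is driven by the penalty terms. For a linear model with \(p\) parameters fit to \(n\) observations (here \(n = 100\)), step() minimizes, up to an additive constant,

\[ \text{AIC} = n \log\left(\frac{\text{RSS}}{n}\right) + 2p, \qquad \text{BIC} = n \log\left(\frac{\text{RSS}}{n}\right) + \log(n)\,p, \]

and since \(\log(100) \approx 4.6 > 2\), BIC charges more for each additional parameter, which is why it admits fewer false positives.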
(b) Set a seed equal to your birthday; then, using the given data for each x variable below in sim_data_2, simulate the response variable y 300 times. Each time, fit an additive model using all of the x variables, then perform backwards variable selection using both AIC and BIC. Calculate the rate of false positives and negatives for both AIC and BIC. Compare the rates between the two methods. Arrange your results in a well formatted table. Also compare to your answers in part (a) and suggest a reason for any differences.
#set.seed(420)
birthday = 19781105
set.seed(birthday)
x_1 = runif(n, 0, 10)
x_2 = runif(n, 0, 10)
x_3 = runif(n, 0, 10)
x_4 = runif(n, 0, 10)
x_5 = runif(n, 0, 10)
x_6 = runif(n, 0, 10)
x_7 = runif(n, 0, 10)
x_8 = x_1 + rnorm(n, 0, 0.1)
x_9 = x_1 + rnorm(n, 0, 0.1)
x_10 = x_2 + rnorm(n, 0, 0.1)
sim_data_2 = data.frame(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10,
y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 +
beta_5 * x_5 + rnorm(n, 0 , sigma)
)
num_sims = 300
false_negative_aic_sim2 = rep(0,num_sims)
false_negative_bic_sim2 = rep(0,num_sims)
false_positive_aic_sim2 = rep(0,num_sims)
false_positive_bic_sim2 = rep(0,num_sims)
for(i in 1:num_sims){
sim_data_2$y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 + beta_5 * x_5 + rnorm(n, 0, sigma)
model <- lm(y ~ ., data = sim_data_2)
len <- length(resid(model))
model_aic <- step(model, direction = "backward", trace=0)
model_bic <- step(model, direction = "backward", k=log(len), trace=0)
false_negative_aic_sim2[i] <- sum(!(signif %in% names(coef(model_aic))))
false_positive_aic_sim2[i] <- sum(names(coef(model_aic)) %in% not_sig)
false_negative_bic_sim2[i] <- sum(!(signif %in% names(coef(model_bic))))
false_positive_bic_sim2[i] <- sum(names(coef(model_bic)) %in% not_sig)
}
table <- data.frame(Method = c("AIC","BIC"),
False_Negative = c(sum(false_negative_aic_sim2)/num_sims, sum(false_negative_bic_sim2)/num_sims),
False_Positive = c(sum(false_positive_aic_sim2)/num_sims, sum(false_positive_bic_sim2)/num_sims))
kable(table,caption="Model Selection Table")
| Method | False_Negative | False_Positive |
|--------|----------------|----------------|
| AIC    | 0.73           | 1.49           |
| BIC    | 0.77           | 0.94           |
\(\color{blue}{\text{This time there are false negatives for both methods, and again BIC returns fewer false positives.}}\) \(\color{blue}{\text{The difference from part (a) is collinearity: $x_8$ and $x_9$ are nearly identical to $x_1$, and $x_{10}$ to $x_2$, so the selection procedures sometimes keep one of these noise copies in place of the true variable, producing false negatives along with additional false positives.}}\)