Exercise 1 (longley Macroeconomic Data)

The built-in dataset longley contains macroeconomic data for predicting employment. We will attempt to model the Employed variable.

View(longley)
?longley

(a) What is the largest correlation between any pair of predictors in the dataset?

round(cor(longley), 3)
##              GNP.deflator   GNP Unemployed Armed.Forces Population  Year
## GNP.deflator        1.000 0.992      0.621        0.465      0.979 0.991
## GNP                 0.992 1.000      0.604        0.446      0.991 0.995
## Unemployed          0.621 0.604      1.000       -0.177      0.687 0.668
## Armed.Forces        0.465 0.446     -0.177        1.000      0.364 0.417
## Population          0.979 0.991      0.687        0.364      1.000 0.994
## Year                0.991 0.995      0.668        0.417      0.994 1.000
## Employed            0.971 0.984      0.502        0.457      0.960 0.971
##              Employed
## GNP.deflator    0.971
## GNP             0.984
## Unemployed      0.502
## Armed.Forces    0.457
## Population      0.960
## Year            0.971
## Employed        1.000

\(\color{blue}{\text{The largest correlation between any pair of predictors is between $Year$ and $GNP$, which is $0.995$.}}\)
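
\(\color{blue}{\text{A programmatic check (a small sketch): drop the response $Employed$ and zero the diagonal so self-correlations are ignored.}}\)

pred_cor = cor(subset(longley, select = -Employed))
diag(pred_cor) = 0
# largest pairwise correlation among the predictors, and where it occurs
max(pred_cor)
which(pred_cor == max(pred_cor), arr.ind = TRUE)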

(b) Fit a model with Employed as the response and the remaining variables as predictors. Calculate and report the variance inflation factor (VIF) for each of the predictors. Which variable has the largest VIF? Do any of the VIFs suggest multicollinearity?

employed_model_add <- lm(Employed ~ ., data=longley)
summary(employed_model_add)
## 
## Call:
## lm(formula = Employed ~ ., data = longley)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4101 -0.1577 -0.0282  0.1016  0.4554 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.48e+03   8.90e+02   -3.91  0.00356 ** 
## GNP.deflator  1.51e-02   8.49e-02    0.18  0.86314    
## GNP          -3.58e-02   3.35e-02   -1.07  0.31268    
## Unemployed   -2.02e-02   4.88e-03   -4.14  0.00254 ** 
## Armed.Forces -1.03e-02   2.14e-03   -4.82  0.00094 ***
## Population   -5.11e-02   2.26e-01   -0.23  0.82621    
## Year          1.83e+00   4.55e-01    4.02  0.00304 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.305 on 9 degrees of freedom
## Multiple R-squared:  0.995,  Adjusted R-squared:  0.992 
## F-statistic:  330 on 6 and 9 DF,  p-value: 4.98e-10
library(faraway)
vif(employed_model_add)
## GNP.deflator          GNP   Unemployed Armed.Forces   Population         Year 
##      135.532     1788.513       33.619        3.589      399.151      758.981

\(\color{blue}{\text{The VIFs for every predictor other than $Armed.Forces$ are very large (well above the common threshold of 5), so multicollinearity is very likely present.}}\)

vif(employed_model_add)[which.max(vif(employed_model_add))]
##  GNP 
## 1789

\(\color{blue}{\text{$GNP$ has the largest VIF, at $1788.51$.}}\)

(c) What proportion of the observed variation in Population is explained by a linear relationship with the other predictors?

population_model_no_emp <- lm(Population ~ . - Employed, data=longley)
summary(population_model_no_emp)$r.squared
## [1] 0.9975

\(\color{blue}{\text{About $99.75\%$ of the observed variation in $Population$ is explained by a linear relationship with the other predictors.}}\)
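
\(\color{blue}{\text{This $R^2$ is exactly what drives the VIF from part (b), since $VIF_{Population} = 1 / (1 - R^2)$. A quick check:}}\)

# recover Population's VIF from the R^2 above; matches the vif() output in part (b) (~399.15)
1 / (1 - summary(population_model_no_emp)$r.squared)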

(d) Calculate the partial correlation coefficient for Population and Employed with the effects of the other predictors removed.

employed_model_no_pop <- lm(Employed ~ . - Population, data = longley)
cor(resid(population_model_no_emp), resid(employed_model_no_pop))
## [1] -0.07514

\(\color{blue}{\text{The partial correlation coefficient for $Population$ and $Employed$, with the effects of the other predictors removed, is $-0.0751$.}}\)
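
\(\color{blue}{\text{As a sanity check, the same value can be recovered from the $t$ statistic of $Population$ in the full additive model via the identity $r = t / \sqrt{t^2 + df}$:}}\)

# identity: partial correlation = t / sqrt(t^2 + residual df)
t_pop = summary(employed_model_add)$coefficients["Population", "t value"]
t_pop / sqrt(t_pop ^ 2 + df.residual(employed_model_add))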

(e) Fit a new model with Employed as the response and the predictors from the model in (b) that were significant. (Use \(\alpha = 0.05\).) Calculate and report the variance inflation factor for each of the predictors. Which variable has the largest VIF? Do any of the VIFs suggest multicollinearity?

employed_model_small <- lm(Employed ~  Unemployed + Armed.Forces + Year, data = longley)
vif(employed_model_small)
##   Unemployed Armed.Forces         Year 
##        3.318        2.223        3.891

\(\color{blue}{\text{$Year$ has the largest VIF. None of these VIFs are large enough to suggest multicollinearity.}}\)

(f) Use an \(F\)-test to compare the models in parts (b) and (e). Report the test statistic, the p-value, a statistical decision using \(\alpha = 0.05\), and which model you prefer.

anova(employed_model_small, employed_model_add)
## Analysis of Variance Table
## 
## Model 1: Employed ~ Unemployed + Armed.Forces + Year
## Model 2: Employed ~ GNP.deflator + GNP + Unemployed + Armed.Forces + Population + 
##     Year
##   Res.Df   RSS Df Sum of Sq    F Pr(>F)
## 1     12 1.323                         
## 2      9 0.836  3     0.487 1.75   0.23
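
\(\color{blue}{\text{With $F = 1.75$ and a p-value of $0.23 > \alpha = 0.05$, we fail to reject the null hypothesis that the coefficients of $GNP.deflator$, $GNP$, and $Population$ are all zero, so we prefer the smaller model from part (e).}}\)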

(g) Check the assumptions of the model chosen in part (f). Do any assumptions appear to be violated?

plot_fitted_resid = function(model, pointcol = "dodgerblue", linecol = "darkorange") {
  plot(fitted(model), resid(model),
       col = pointcol, pch = 20, cex = 1.5,
       xlab = "Fitted", ylab = "Residuals", main = "Fitted vs. Residuals")
  abline(h = 0, col = linecol, lwd = 2)
}

plot_qq = function(model, pointcol = "dodgerblue", linecol = "darkorange") {
  qqnorm(resid(model), col = pointcol, pch = 20, cex = 1.5)
  qqline(resid(model), col = linecol, lwd = 2)
}
par(mfrow = c(1, 2))
plot_fitted_resid(employed_model_small)
plot_qq(employed_model_small)

\(\color{blue}{\text{It does not appear that the assumptions are clearly violated, though given the small size of the data set it is hard to tell.}}\) \(\color{blue}{\text{There are a couple of outliers in the Fitted vs. Residuals plot that should be scrutinized.}}\) \(\color{blue}{\text{The tails on the Normal Q-Q plot are slightly off the line, so normality could be questioned; however, the tails are not that far off, and it is only a few points.}}\)


Exercise 2 (Credit Data)

For this exercise, use the Credit data from the ISLR package. Use the following code to remove the ID variable which is not useful for modeling.

library(ISLR)
data(Credit)
Credit = subset(Credit, select = -c(ID))

Use ?Credit to learn about this dataset.

#?Credit

(a) Find a “good” model for balance using the available predictors. Use any methods seen in class except transformations of the response. The model should reach a LOOCV-RMSE below 135, attain an adjusted \(R^2\) above 0.90, fail to reject the Breusch-Pagan test with \(\alpha = 0.01\), and use fewer than 10 \(\beta\) parameters (the limits encoded in the helper below).

Store your model in a variable called mod_a. Run the two given chunks to verify your model meets the requested criteria. If you cannot find a model that meets all criteria, partial credit will be given for meeting at least some of the criteria.

library(lmtest)

get_bp_decision = function(model, alpha) {
  decide = unname(bptest(model)$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}

get_sw_decision = function(model, alpha) {
  decide = unname(shapiro.test(resid(model))$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}

get_num_params = function(model) {
  length(coef(model))
}

get_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}

get_adj_r2 = function(model) {
  summary(model)$adj.r.squared
}
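
\(\color{blue}{\text{For reference, the LOOCV helper uses the leverage shortcut, which computes the leave-one-out residuals without refitting the model $n$ times:}}\)

\[
\text{RMSE}_{\text{LOOCV}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_i} \right)^2}
\]

\(\color{blue}{\text{where $e_i$ are the residuals and $h_i$ the leverages (hat values) of the fitted model.}}\)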

\(\color{blue}{\text{Helper method for checking all the conditions in the problem.}}\)

check_all_conditions = function(model, model_name, test_type = "bp", loocv_limit = 135, adj_r2_limit = 0.90, max_param = 10) {
  df_conditions = data.frame(loocv = double(), loocv_accept = logical(),
                             adj_r2 = double(), adj_r2_accept = logical(),
                             test_decision = character(), test_decision_accept = logical(),
                             num_params = integer(), num_params_accept = logical(),
                             overall_accept = logical())
  df_conditions[model_name, "loocv"] = get_loocv_rmse(model)
  df_conditions[model_name, "loocv_accept"] = df_conditions[model_name, "loocv"] < loocv_limit
  df_conditions[model_name, "adj_r2"] = get_adj_r2(model)
  df_conditions[model_name, "adj_r2_accept"] = df_conditions[model_name, "adj_r2"] > adj_r2_limit
  # use the Shapiro-Wilk decision when requested, otherwise default to Breusch-Pagan
  if (test_type == "sw") {
    df_conditions[model_name, "test_decision"] = get_sw_decision(model, alpha = 0.01)
  } else {
    df_conditions[model_name, "test_decision"] = get_bp_decision(model, alpha = 0.01)
  }
  df_conditions[model_name, "test_decision_accept"] = df_conditions[model_name, "test_decision"] == "Fail to Reject"
  df_conditions[model_name, "num_params"] = get_num_params(model)
  df_conditions[model_name, "num_params_accept"] = df_conditions[model_name, "num_params"] < max_param
  df_conditions[model_name, "overall_accept"] = all(df_conditions[model_name, "loocv_accept"],
                                                    df_conditions[model_name, "adj_r2_accept"],
                                                    df_conditions[model_name, "test_decision_accept"],
                                                    df_conditions[model_name, "num_params_accept"])
  return(df_conditions)
}
balance_add = lm(Balance ~ . , data = Credit)
check_all_conditions(balance_add, "balance_add", "bp")
##             loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_add 100.4         TRUE 0.9538          TRUE        Reject
##             test_decision_accept num_params num_params_accept overall_accept
## balance_add                FALSE         12             FALSE          FALSE

\(\color{blue}{\text{Obviously, there are too many parameters, so we cannot use the full additive model.}}\) \(\color{blue}{\text{The additive model also fails the Breusch-Pagan test with an $\alpha$ of $0.01$. Let's look for collinearity.}}\)

pairs(Credit, col = "dodgerblue")

\(\color{blue}{\text{It looks like $Limit$ and $Rating$ have a strong linear relationship, so we will likely only need one of them.}}\) \(\color{blue}{\text{$Gender$, $Student$, and $Married$ appear to be binary, and $Ethnicity$ appears to have three groups based on the visual.}}\) \(\color{blue}{\text{For the first pass we will leave these out as well.}}\)
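
\(\color{blue}{\text{A quick numeric check of that visual impression (a side calculation, not part of the original search):}}\)

# quantify the Limit / Rating relationship suggested by the pairs plot
cor(Credit$Limit, Credit$Rating)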

balance_5pred = lm(Balance ~ Income + Limit + Cards + Age + Education, data = Credit)
check_all_conditions(balance_5pred, "balance_5pred", "bp")
##               loocv loocv_accept adj_r2 adj_r2_accept  test_decision
## balance_5pred   164        FALSE 0.8746         FALSE Fail to Reject
##               test_decision_accept num_params num_params_accept overall_accept
## balance_5pred                 TRUE          6              TRUE          FALSE

\(\color{blue}{\text{This meets the condition of a "Fail to Reject," but fails the LOOCV-RMSE and adjusted $R^2$ requirements.}}\) \(\color{blue}{\text{So let's add the binary predictors back in and try to stay under the limit of 10 parameters.}}\)

balance_8pred = lm(Balance ~ Income + Limit + Cards + Age + Education + Gender + Student + Married, data = Credit)
check_all_conditions(balance_8pred, "balance_8pred", "bp")
##               loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_8pred 100.4         TRUE 0.9534          TRUE        Reject
##               test_decision_accept num_params num_params_accept overall_accept
## balance_8pred                FALSE          9              TRUE          FALSE

\(\color{blue}{\text{That puts us back to failing the criterion on the Breusch-Pagan test.}}\) \(\color{blue}{\text{Let's look at our assumptions with this model.}}\)

par(mfrow = c(1, 2))
plot_fitted_resid(balance_8pred)
plot_qq(balance_8pred)

\(\color{blue}{\text{Constant variance seems to be off, and the tails on the Normal Q-Q plot are running away a little.}}\) \(\color{blue}{\text{Let's try taking the log of one of the predictors, stepping back to our model with 5 predictors that was able to get a Fail to Reject.}}\)

balance_5logincome = lm(Balance ~ log(Income) + Limit + Cards + Age + Education, data = Credit)
check_all_conditions(balance_5logincome, "balance_5logincome", "bp")
##                    loocv loocv_accept adj_r2 adj_r2_accept  test_decision
## balance_5logincome 182.3        FALSE 0.8451         FALSE Fail to Reject
##                    test_decision_accept num_params num_params_accept
## balance_5logincome                 TRUE          6              TRUE
##                    overall_accept
## balance_5logincome          FALSE

\(\color{blue}{\text{OK we can still get Fail to Reject with this. Do our assumptions look any better?}}\)

par(mfrow = c(1, 2))
plot_fitted_resid(balance_5logincome)
plot_qq(balance_5logincome)

\(\color{blue}{\text{The Q-Q Plot appears to be closer to the line on the low end at least. The Fitted vs. Residuals is far from perfect, but does look more balanced.}}\) \(\color{blue}{\text{Let's try another predictor.}}\)

balance_5loglimit = lm(Balance ~ Income + log(Limit) + Cards + Age + Education, data = Credit)
check_all_conditions(balance_5loglimit, "balance_5loglimit", "bp")
##                   loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_5loglimit 259.1        FALSE 0.6895         FALSE        Reject
##                   test_decision_accept num_params num_params_accept
## balance_5loglimit                FALSE          6              TRUE
##                   overall_accept
## balance_5loglimit          FALSE

\(\color{blue}{\text{That did not get a "Fail to Reject."}}\) \(\color{blue}{\text{What about taking the log of both predictors?}}\)

balance_5logincomelimit = lm(Balance ~ log(Income) + log(Limit) + Cards + Age + Education, data = Credit)
check_all_conditions(balance_5logincomelimit, "balance_5logincomelimit", "bp")
##                         loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_5logincomelimit 252.3        FALSE 0.7046         FALSE        Reject
##                         test_decision_accept num_params num_params_accept
## balance_5logincomelimit                FALSE          6              TRUE
##                         overall_accept
## balance_5logincomelimit          FALSE

\(\color{blue}{\text{That did not get a Fail to Reject either.}}\) \(\color{blue}{\text{What about the log of $Income$ in the larger additive model?}}\)

balance_8logincome = lm(Balance ~ log(Income) + Limit + Cards + Age + Education + Gender + Student + Married, data = Credit)
check_all_conditions(balance_8logincome, "balance_8logincome", "bp")
##                    loocv loocv_accept adj_r2 adj_r2_accept  test_decision
## balance_8logincome 131.5         TRUE 0.9206          TRUE Fail to Reject
##                    test_decision_accept num_params num_params_accept
## balance_8logincome                 TRUE          9              TRUE
##                    overall_accept
## balance_8logincome           TRUE

\(\color{blue}{\text{That model meets all of the criteria.}}\) \(\color{blue}{\text{NOTE: there was a LOT more guessing and checking that was left out of this discussion.}}\)

mod_a = balance_8logincome
get_loocv_rmse(mod_a)
## [1] 131.5
get_adj_r2(mod_a)
## [1] 0.9206
get_bp_decision(mod_a, alpha = 0.01)
## [1] "Fail to Reject"
get_num_params(mod_a)
## [1] 9

(b) Find another “good” model for balance using the available predictors. Use any methods seen in class except transformations of the response. The model should reach a LOOCV-RMSE below 135, attain an adjusted \(R^2\) above 0.91, fail to reject the Shapiro-Wilk test with \(\alpha = 0.01\), and use fewer than 25 \(\beta\) parameters (the limits used in the check calls below).

Store your model in a variable called mod_b. Run the two given chunks to verify your model meets the requested criteria. If you cannot find a model that meets all criteria, partial credit will be given for meeting at least some of the criteria.

library(lmtest)

get_bp_decision = function(model, alpha) {
  decide = unname(bptest(model)$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}

get_sw_decision = function(model, alpha) {
  decide = unname(shapiro.test(resid(model))$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}

get_num_params = function(model) {
  length(coef(model))
}

get_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}

get_adj_r2 = function(model) {
  summary(model)$adj.r.squared
}
balance_add = lm(Balance ~ . , data = Credit)
check_all_conditions(balance_add, "balance_add", "sw", 135, 0.91, 25)
##             loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_add 100.4         TRUE 0.9538          TRUE        Reject
##             test_decision_accept num_params num_params_accept overall_accept
## balance_add                FALSE         12              TRUE          FALSE

\(\color{blue}{\text{The full additive model does not meet the Shapiro-Wilk criterion with an $\alpha$ of $0.01$.}}\) \(\color{blue}{\text{Let's try the model that worked for part (a).}}\)

balance_8logincome = lm(Balance ~ log(Income) + Limit + Cards + Age + Education + Gender + Student + Married, data = Credit)
check_all_conditions(balance_8logincome, "balance_8logincome", "sw", 135, 0.91, 25)
##                    loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_8logincome 131.5         TRUE 0.9206          TRUE        Reject
##                    test_decision_accept num_params num_params_accept
## balance_8logincome                FALSE          9              TRUE
##                    overall_accept
## balance_8logincome          FALSE

\(\color{blue}{\text{That didn't work either, so we guess and check a number of other candidates.}}\)

balance_5logincome = lm(Balance ~ log(Income) + Limit + Cards + Age + Education, data = Credit)
check_all_conditions(balance_5logincome, "balance_5logincome", "sw", 135, 0.91, 25)
##                    loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_5logincome 182.3        FALSE 0.8451         FALSE        Reject
##                    test_decision_accept num_params num_params_accept
## balance_5logincome                FALSE          6              TRUE
##                    overall_accept
## balance_5logincome          FALSE
balance_5logincome_3 = lm(Balance ~ (log(Income) + Limit + Cards + Age + Education)^3, data = Credit)
check_all_conditions(balance_5logincome_3, "balance_5logincome_3", "sw", 135, 0.91, 25)
##                      loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_5logincome_3 181.7        FALSE 0.8547         FALSE        Reject
##                      test_decision_accept num_params num_params_accept
## balance_5logincome_3                FALSE         26             FALSE
##                      overall_accept
## balance_5logincome_3          FALSE
balance_8logincome_3 = lm(Balance ~ (log(Income) + Limit + Cards + Age + Education + Gender + Student + Married)^3, data = Credit)
#summary(balance_8logincome_3)$coef
check_all_conditions(balance_8logincome_3, "balance_8logincome_3", "sw", 135, 0.91, 25)
##                      loocv loocv_accept adj_r2 adj_r2_accept  test_decision
## balance_8logincome_3 146.8        FALSE  0.935          TRUE Fail to Reject
##                      test_decision_accept num_params num_params_accept
## balance_8logincome_3                 TRUE         93             FALSE
##                      overall_accept
## balance_8logincome_3          FALSE
# BIC penalty is k = log(n), with n = nrow(Credit)
balance_8_back_bic = step(balance_8logincome_3, scope = Balance ~ (log(Income) + Limit + Cards + Age + Education + Gender + Student + Married) ^ 3, direction = "backward", k = log(nrow(Credit)), trace = 0)
#summary(balance_8_back_bic)$coef
check_all_conditions(balance_8_back_bic, "balance_8_back_bic", "bp")
##                    loocv loocv_accept adj_r2 adj_r2_accept test_decision
## balance_8_back_bic 115.7         TRUE  0.941          TRUE        Reject
##                    test_decision_accept num_params num_params_accept
## balance_8_back_bic                FALSE         30             FALSE
##                    overall_accept
## balance_8_back_bic          FALSE
balance_5_back_bic = step(balance_5logincome_3, scope = Balance ~ (log(Income) + Limit + Cards + Age + Education) ^ 3, direction = "backward", k = log(nrow(Credit)), trace = 0)
check_all_conditions(balance_5_back_bic, "balance_5_back_bic", "bp")
##                    loocv loocv_accept adj_r2 adj_r2_accept  test_decision
## balance_5_back_bic 175.5        FALSE 0.8574         FALSE Fail to Reject
##                    test_decision_accept num_params num_params_accept
## balance_5_back_bic                 TRUE          9              TRUE
##                    overall_accept
## balance_5_back_bic          FALSE
balance_5_back_aic = step(balance_5logincome_3, scope = Balance ~ (log(Income) + Limit + Cards + Age + Education) ^ 3, direction = "backward", trace = 0)
check_all_conditions(balance_5_back_aic, "balance_5_back_aic", "bp")
##                    loocv loocv_accept adj_r2 adj_r2_accept  test_decision
## balance_5_back_aic 175.5        FALSE 0.8574         FALSE Fail to Reject
##                    test_decision_accept num_params num_params_accept
## balance_5_back_aic                 TRUE          9              TRUE
##                    overall_accept
## balance_5_back_aic          FALSE
balance_6_both_aic = step(balance_5logincome_3, scope = Balance ~ (log(Income) + Limit + Cards + Age + Education  + Student) ^ 3, direction = "backward", trace = 0)
summary(balance_6_both_aic)$coef
##                         Estimate Std. Error t value  Pr(>|t|)
## (Intercept)            121.21277 256.967166  0.4717 6.374e-01
## log(Income)           -215.90646  69.891341 -3.0892 2.151e-03
## Limit                    0.34381   0.020735 16.5813 3.848e-47
## Cards                   17.92309   6.390699  2.8046 5.290e-03
## Age                      3.50986   2.702251  1.2989 1.948e-01
## Education              -29.91032  14.343951 -2.0852 3.770e-02
## log(Income):Limit       -0.02626   0.004987 -5.2660 2.306e-07
## log(Income):Age         -1.23437   0.732579 -1.6850 9.279e-02
## log(Income):Education    8.67205   3.922167  2.2110 2.761e-02
check_all_conditions(balance_6_both_aic, "balance_6_both_aic", "bp")
##                    loocv loocv_accept adj_r2 adj_r2_accept  test_decision
## balance_6_both_aic 175.5        FALSE 0.8574         FALSE Fail to Reject
##                    test_decision_accept num_params num_params_accept
## balance_6_both_aic                 TRUE          9              TRUE
##                    overall_accept
## balance_6_both_aic          FALSE

\(\color{blue}{\text{Note that a backward-only search can never add $Student$ to a starting model that lacks it, which is why that last attempt matched the earlier result. More brute-force guessing and checking got us to the model below.}}\)

balance_4logincome_3 = lm(Balance ~ (log(Income) + Limit + Cards + Student)^3, data = Credit)
check_all_conditions(balance_4logincome_3, "balance_4logincome_3", "sw", 135, 0.91, 25)
##                      loocv loocv_accept adj_r2 adj_r2_accept  test_decision
## balance_4logincome_3 119.5         TRUE 0.9354          TRUE Fail to Reject
##                      test_decision_accept num_params num_params_accept
## balance_4logincome_3                 TRUE         15              TRUE
##                      overall_accept
## balance_4logincome_3           TRUE
mod_b = balance_4logincome_3
get_loocv_rmse(mod_b)
## [1] 119.5
get_adj_r2(mod_b)
## [1] 0.9354
get_sw_decision(mod_b, alpha = 0.01)
## [1] "Fail to Reject"
get_num_params(mod_b)
## [1] 15

Exercise 3 (Sacramento Housing Data)

For this exercise, use the Sacramento data from the caret package. Use the following code to perform some preprocessing of the data.

# NOTE for Mac, I needed to run `brew install libomp` for caret to install
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:faraway':
## 
##     melanoma
## Loading required package: ggplot2
library(ggplot2)
data(Sacramento)
sac_data = Sacramento
sac_data$limits = factor(ifelse(sac_data$city == "SACRAMENTO", "in", "out"))
sac_data = subset(sac_data, select = -c(city, zip))

Instead of using the city or zip variables that exist in the dataset, we will simply create a variable (limits) indicating whether or not a house is technically within the city limits of Sacramento. (We do this because they would both be factor variables with a large number of levels. This is a choice that is made due to laziness, not necessarily because it is justified. Think about what issues these variables might cause.)
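
\(\color{blue}{\text{For a sense of the scale of the problem, a quick count of the factor levels being dropped (a side note, not required by the exercise):}}\)

# number of levels in the factors we are discarding
sapply(Sacramento[, c("city", "zip")], nlevels)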

Use ?Sacramento to learn more about this dataset.

#?Sacramento

A plot of longitude versus latitude gives us a sense of where the city limits are.

qplot(y = longitude, x = latitude, data = sac_data,
      col = limits, main = "Sacramento City Limits ")

After these modifications, we test-train split the data.

set.seed(420)
sac_trn_idx  = sample(nrow(sac_data), size = trunc(0.80 * nrow(sac_data)))
sac_trn_data = sac_data[sac_trn_idx, ]
sac_tst_data = sac_data[-sac_trn_idx, ]

The training data should be used for all model fitting. Our goal is to find a model that is useful for predicting home prices.

(a) Find a “good” model for price. Use any methods seen in class. The model should reach a LOOCV-RMSE below 77,500 in the training data. Do not use any transformations of the response variable.

\(\color{blue}{\text{Helper function to check if the criteria are met or not.}}\)

check_loocv = function(model, model_name, loocv_limit = 77500) {
  df_conditions = data.frame(loocv = double(), loocv_accept = logical())
  df_conditions[model_name, "loocv"] = get_loocv_rmse(model)
  df_conditions[model_name, "loocv_accept"] = df_conditions[model_name, "loocv"] < loocv_limit
  return(df_conditions)
}

\(\color{blue}{\text{First we will try the full additive model.}}\)

sac_add = lm(price ~ ., data = sac_trn_data)
#summary(sac_add)
check_loocv(sac_add, "sac_add")
##         loocv loocv_accept
## sac_add 78671        FALSE

\(\color{blue}{\text{This is just above our criterion of a leave-one-out cross-validation RMSE below 77,500.}}\) \(\color{blue}{\text{Let's try a smaller model first to see what happens.}}\)

sac_5p = lm(price ~ beds + sqft + type + latitude + longitude, data = sac_trn_data)
#summary(sac_5p)
check_loocv(sac_5p, "sac_5p")
##        loocv loocv_accept
## sac_5p 78623        FALSE

\(\color{blue}{\text{Interestingly, this is also just barely above the criterion.}}\) \(\color{blue}{\text{Let's try a bigger model that crosses all of these variables.}}\)

sac_2full = lm(price ~ (beds + sqft + type + latitude + longitude)^2,  data = sac_trn_data)
#summary(sac_2full)
check_loocv(sac_2full, "sac_2full")
##           loocv loocv_accept
## sac_2full 78811        FALSE

\(\color{blue}{\text{Again, this is just above the criterion.}}\) \(\color{blue}{\text{Let's try backward AIC.}}\)

sac_backward_aic = step(sac_5p, scope = price ~ (beds + sqft + type + latitude + longitude)^2, direction = "backward", trace = 0)
summary(sac_backward_aic)$coefficients
##                    Estimate Std. Error t value   Pr(>|t|)
## (Intercept)      16480589.3  3.180e+06   5.182  2.837e-07
## beds               -22918.8  4.963e+03  -4.617  4.582e-06
## sqft                  150.2  6.006e+00  25.011 1.801e-100
## typeMulti_Family   -13834.3  2.676e+04  -0.517  6.053e-01
## typeResidential     36838.4  1.259e+04   2.927  3.530e-03
## latitude            49050.1  2.330e+04   2.105  3.560e-02
## longitude          151108.9  2.248e+04   6.723  3.575e-11
vif(sac_backward_aic)
##             beds             sqft typeMulti_Family  typeResidential 
##            2.423            2.218            1.386            1.454 
##         latitude        longitude 
##            1.194            1.246
check_loocv(sac_backward_aic, "sac_backward_aic")
##                  loocv loocv_accept
## sac_backward_aic 78623        FALSE

\(\color{blue}{\text{This returns exactly the five-predictor additive model: a backward search can only drop terms, so the interactions in the scope can never enter.}}\) \(\color{blue}{\text{Let's try a both-direction AIC search, which can add them.}}\)

sac_both_aic = step(sac_5p, scope = price ~ (beds + sqft + type + latitude + longitude)^2, direction = "both", trace = 0)
summary(sac_both_aic)$coefficients
##                             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)               -1.300e+09  8.772e+08 -1.4818 1.388e-01
## beds                       1.525e+07  3.017e+06  5.0535 5.484e-07
## sqft                       2.102e+02  1.691e+01 12.4319 2.475e-32
## typeMulti_Family          -1.548e+07  7.844e+06 -1.9733 4.884e-02
## typeResidential            8.450e+05  3.776e+06  0.2238 8.230e-01
## latitude                   3.280e+07  2.264e+07  1.4491 1.477e-01
## longitude                 -1.069e+07  7.228e+06 -1.4791 1.395e-01
## beds:longitude             1.256e+05  2.484e+04  5.0547 5.449e-07
## beds:sqft                 -1.586e+01  3.979e+00 -3.9861 7.389e-05
## typeMulti_Family:latitude  4.006e+05  2.031e+05  1.9718 4.901e-02
## typeResidential:latitude  -2.138e+04  9.771e+04 -0.2189 8.268e-01
## latitude:longitude         2.698e+05  1.865e+05  1.4464 1.485e-01
vif(sac_both_aic)
##                      beds                      sqft          typeMulti_Family 
##                 936540.59                     18.39                 124544.49 
##           typeResidential                  latitude                 longitude 
##                 136888.83                1179479.77                 134787.23 
##            beds:longitude                 beds:sqft typeMulti_Family:latitude 
##                 934784.96                     32.77                 124455.81 
##  typeResidential:latitude        latitude:longitude 
##                 136527.36                1007863.54
check_loocv(sac_both_aic, "sac_both_aic")
##              loocv loocv_accept
## sac_both_aic 77077         TRUE
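
\(\color{blue}{\text{The enormous VIFs here are expected, since the interaction terms share components with the main effects; because our goal is prediction rather than inference, this collinearity is not a major concern.}}\)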

\(\color{blue}{\text{We now have a LOOCV-RMSE of 77,076.51, which is below 77,500.}}\) \(\color{blue}{\text{We will use this model below in part (b).}}\)

sac_model = sac_both_aic

(b) Is a model that achieves a LOOCV-RMSE below 77,500 useful in this case? That is, is an average error of 77,500 low enough when predicting home prices? To further investigate, use the held-out test data and your model from part (a) to do two things: calculate the average percent error of the predictions, and plot the predicted versus actual home prices.

Based on all of this information, argue whether or not this model is useful.

actual = sac_tst_data$price
predicted = predict(sac_model, newdata = sac_tst_data)
# As discussed at: https://piazza.com/class/jviiq7m0zx16ar?cid=962
# Assignment asks for dividing by "predicted"
100 * mean((abs(predicted - actual)/ predicted))
## [1] 22.94
# However this is actually probably more correct:
100 * mean((abs(actual - predicted)) / actual)
## [1] 24.72
plot(actual ~ predicted, col = "dodgerblue", xlab = "Predicted Home Prices", ylab = "Actual Home Prices")
abline(a = 0, b = 1, lwd = 3, col = "darkorange")

\(\color{blue}{\text{The usefulness of a \$77,500 root mean square error is highly dependent on the price of the house.}}\) \(\color{blue}{\text{For a house on the low end, around \$100,000, that error would be far too large to be acceptable, since the prediction could be off by almost 80\%; for a \$750,000 home, it would only be off by about 10\%.}}\) \(\color{blue}{\text{Looking at the percent error, we see it averages around 25\%, which still seems somewhat high. However, the plot shows a few outliers predicted well above their actual prices that may be skewing the average; most of the points appear close to the line.}}\)


Exercise 4 (Does It Work?)

In this exercise, we will investigate how well backwards AIC and BIC actually perform. For either to be “working” correctly, they should result in a low number of both false positives and false negatives. In model selection, a false positive is a variable whose true coefficient is zero but which is selected by the procedure, while a false negative is a variable whose true coefficient is nonzero but which is dropped.

Consider the true model

\[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 + \beta_8 x_8 + \beta_9 x_9 + \beta_{10} x_{10} + \epsilon \]

where \(\epsilon \sim N(0, \sigma^2 = 4)\). The true values of the \(\beta\) parameters are given in the R code below.

beta_0  = 1
beta_1  = -1
beta_2  = 2
beta_3  = -2
beta_4  = 1
beta_5  = 1
beta_6  = 0
beta_7  = 0
beta_8  = 0
beta_9  = 0
beta_10 = 0
sigma = 2

Then, as we have specified them, some variables are significant, and some are not. We store their names in R variables for use later.

not_sig  = c("x_6", "x_7", "x_8", "x_9", "x_10")
signif = c("x_1", "x_2", "x_3", "x_4", "x_5")

We now simulate values for these x variables, which we will use throughout part (a).

#set.seed(420)
set.seed(19781105)
n = 100
x_1  = runif(n, 0, 10)
x_2  = runif(n, 0, 10)
x_3  = runif(n, 0, 10)
x_4  = runif(n, 0, 10)
x_5  = runif(n, 0, 10)
x_6  = runif(n, 0, 10)
x_7  = runif(n, 0, 10)
x_8  = runif(n, 0, 10)
x_9  = runif(n, 0, 10)
x_10 = runif(n, 0, 10)

We then combine these into a data frame and simulate y according to the true model.

sim_data_1 = data.frame(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10,
  y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 + 
      beta_5 * x_5 + rnorm(n, 0 , sigma)
)

We do a quick check to make sure everything looks correct.

head(sim_data_1)
##      x_1    x_2   x_3     x_4   x_5   x_6    x_7    x_8    x_9  x_10     y
## 1 1.4992 2.0825 2.077 1.57392 4.559 1.599 5.9728 9.0543 0.5824 1.784 7.769
## 2 9.4861 0.8646 1.022 9.95006 3.992 7.784 3.9387 4.7519 7.2986 3.345 4.192
## 3 8.2667 7.3802 2.841 0.06523 6.536 8.788 5.8062 7.1948 8.9837 8.792 9.889
## 4 7.3608 6.8598 5.306 7.31392 1.649 3.635 0.5872 5.3727 0.6993 6.302 7.978
## 5 9.3510 4.9396 2.094 4.77580 4.253 9.805 5.6786 0.5231 7.6863 2.535 6.300
## 6 0.2112 3.5737 7.223 9.55127 4.775 4.650 9.4180 9.6812 5.2140 8.724 6.618

Now, we fit an incorrect model.

fit = lm(y ~ x_1 + x_2 + x_6 + x_7, data = sim_data_1)
coef(fit)
## (Intercept)         x_1         x_2         x_6         x_7 
##      4.6966     -1.3232      1.3133      0.4112     -0.1123

Notice, we have coefficients for x_1, x_2, x_6, and x_7. This means that x_6 and x_7 are false positives, while x_3, x_4, and x_5 are false negatives.

To detect the false negatives, use:

# which are false negatives?
!(signif %in% names(coef(fit)))
## [1] FALSE FALSE  TRUE  TRUE  TRUE

To detect the false positives, use:

# which are false positives?
names(coef(fit)) %in% not_sig
## [1] FALSE FALSE FALSE  TRUE  TRUE

Note that in both cases, you could sum() the result to obtain the number of false negatives or positives.
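
For example, counting them for the incorrect fit above:

# count of false negatives and false positives for the incorrect fit
sum(!(signif %in% names(coef(fit))))
## [1] 3
sum(names(coef(fit)) %in% not_sig)
## [1] 2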

(a) Set a seed equal to your birthday; then, using the given data for each x variable above in sim_data_1, simulate the response variable y 300 times. Each time, fit an additive model using all ten predictors, perform backward selection with both AIC and BIC, and record the number of false negatives and false positives for each method.

Calculate the rate of false positives and negatives for both AIC and BIC. Compare the rates between the two methods. Arrange your results in a well formatted table.

birthday = 19781105
set.seed(birthday)
num_sims = 300

false_negative_aic_sim1 = rep(0,num_sims)
false_negative_bic_sim1 = rep(0,num_sims)
false_positive_aic_sim1 = rep(0,num_sims)
false_positive_bic_sim1 = rep(0,num_sims)

for(i in 1:num_sims){
  sim_data_1$y = beta_0 +  beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 + beta_5 * x_5 + rnorm(n, 0, sigma)
  model = lm(y ~ ., data = sim_data_1)

  len = length(resid(model))
  model_aic = step(model, direction = "backward", trace = 0)
  model_bic = step(model, direction = "backward", k=log(len),trace = 0)

  false_negative_aic_sim1[i] = sum(!(signif %in% names(coef(model_aic))))
  false_positive_aic_sim1[i] = sum(names(coef(model_aic)) %in% not_sig)
  false_negative_bic_sim1[i] = sum(!(signif %in% names(coef(model_bic))))
  false_positive_bic_sim1[i] = sum(names(coef(model_bic)) %in% not_sig)
}
library(knitr)
table <- data.frame(Method = c("AIC","BIC"),
         False_Negative = c(sum(false_negative_aic_sim1)/num_sims,sum(false_negative_bic_sim1)/num_sims),
         False_Positive = c(sum(false_positive_aic_sim1)/num_sims,sum(false_positive_bic_sim1)/num_sims ))
kable(table,caption="Model Selection Table")
Model Selection Table

| Method | False_Negative | False_Positive |
|:-------|---------------:|---------------:|
| AIC    |              0 |           0.92 |
| BIC    |              0 |           0.23 |

\(\color{blue}{\text{For both AIC and BIC the false negative rate is 0: the truly significant predictors are always retained, so the selected models err only by being somewhat too big.}}\) \(\color{blue}{\text{BIC returns fewer false positives, which is what we would expect, since its larger penalty ($k = \log n$) favors smaller models.}}\)

(b) Set a seed equal to your birthday; then, using the given data for each x variable below in sim_data_2, simulate the response variable y 300 times. Each time, fit an additive model using all ten predictors, perform backward selection with both AIC and BIC, and record the number of false negatives and false positives for each method.

Calculate the rate of false positives and negatives for both AIC and BIC. Compare the rates between the two methods. Arrange your results in a well formatted table. Also compare to your answers in part (a) and suggest a reason for any differences.

#set.seed(420)
birthday = 19781105
set.seed(birthday)
x_1  = runif(n, 0, 10)
x_2  = runif(n, 0, 10)
x_3  = runif(n, 0, 10)
x_4  = runif(n, 0, 10)
x_5  = runif(n, 0, 10)
x_6  = runif(n, 0, 10)
x_7  = runif(n, 0, 10)
x_8  = x_1 + rnorm(n, 0, 0.1)
x_9  = x_1 + rnorm(n, 0, 0.1)
x_10 = x_2 + rnorm(n, 0, 0.1)

sim_data_2 = data.frame(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10,
  y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 + 
      beta_5 * x_5 + rnorm(n, 0 , sigma)
)
num_sims = 300
false_negative_aic_sim2 = rep(0,num_sims)
false_negative_bic_sim2 = rep(0,num_sims)
false_positive_aic_sim2 = rep(0,num_sims)
false_positive_bic_sim2 = rep(0,num_sims)

for(i in 1:num_sims){
  sim_data_2$y = beta_0 +  beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 + beta_5 * x_5 + rnorm(n, 0, sigma)
  model <- lm(y ~ ., data = sim_data_2)
  
  len <- length(resid(model))
  model_aic <- step(model, direction = "backward", trace=0)
  model_bic <- step(model, direction = "backward", k=log(len), trace=0)
  
  false_negative_aic_sim2[i] <- sum(!(signif %in% names(coef(model_aic))))
  false_positive_aic_sim2[i] <- sum(names(coef(model_aic)) %in% not_sig)
  false_negative_bic_sim2[i] <- sum(!(signif %in% names(coef(model_bic))))
  false_positive_bic_sim2[i] <- sum(names(coef(model_bic)) %in% not_sig)
}
table <- data.frame(Method = c("AIC", "BIC"),
        False_Negative = c(sum(false_negative_aic_sim2) / num_sims, sum(false_negative_bic_sim2) / num_sims),
        False_Positive = c(sum(false_positive_aic_sim2) / num_sims, sum(false_positive_bic_sim2) / num_sims))
kable(table, caption = "Model Selection Table")
Model Selection Table

| Method | False_Negative | False_Positive |
|:-------|---------------:|---------------:|
| AIC    |           0.73 |           1.49 |
| BIC    |           0.77 |           0.94 |

\(\color{blue}{\text{This time there are false negatives for both methods, and again BIC returns fewer false positives.}}\) \(\color{blue}{\text{The difference from part (a) comes from the collinearity we built in: $x_8$ and $x_9$ are near copies of $x_1$, and $x_{10}$ of $x_2$, so the search sometimes keeps a noisy duplicate in place of the true predictor, producing a false positive and a false negative at the same time.}}\)
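
\(\color{blue}{\text{The near-duplication is easy to confirm: the correlation within each pair is very close to 1.}}\)

# confirm the built-in collinearity: x_8 and x_9 track x_1, and x_10 tracks x_2
round(cor(cbind(x_1, x_8, x_9)), 3)
round(cor(x_2, x_10), 3)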