Exercise 1 (Simulating Wald and Likelihood Ratio Tests)

In this exercise we will investigate the distributions of hypothesis tests for logistic regression. For this exercise, we will use the following predictors.

sample_size = 150
set.seed(420)
x1 = rnorm(n = sample_size)
x2 = rnorm(n = sample_size)
x3 = rnorm(n = sample_size)

Recall that

\[ p({\bf x}) = P[Y = 1 \mid {\bf X} = {\bf x}] \]

Consider the true model

\[ \log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = \beta_0 + \beta_1 x_1 \]

where \(\beta_0 = 0.4\) and \(\beta_1 = -0.35\).

(a) To investigate the distributions, simulate from this model 2500 times. To do so, calculate

\[ P[Y = 1 \mid {\bf X} = {\bf x}] \]

for an observation, and then make a random draw from a Bernoulli distribution with that success probability. (Note that a Bernoulli distribution is a Binomial distribution with parameter \(n = 1\). There is no direct function in R for a Bernoulli distribution.)
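Since there is no dedicated Bernoulli function, a Bernoulli draw can be made with rbinom() using size = 1. For example (the success probability 0.7 here is arbitrary, just to illustrate the call):

rbinom(n = 5, size = 1, prob = 0.7)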

Each time, fit the model:

\[ \log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \]

Store the test statistics for two tests: the Wald test for \(H_0: \beta_2 = 0\), which for large samples is compared to a standard normal distribution, and the likelihood ratio test for \(H_0: \beta_2 = \beta_3 = 0\), which for large samples is compared to a \(\chi^2\) distribution.

beta_0 = 0.4
beta_1 = -0.35
num_sims = 2500
eta = beta_0 + beta_1 * x1
p = 1 / (1 + exp(-eta))

y = rep(0, sample_size)
sim_data = data.frame(y, x1, x2, x3)

wald_test = rep(0, num_sims)
lrt_test = rep(0, num_sims)

for (i in 1:num_sims){
  sim_data$y = rbinom(n = sample_size, size = 1, prob = p)
  
  mod_full = glm(y ~ x1 + x2 + x3, data = sim_data, family = binomial)
  mod_1 = glm(y ~ x1, data = sim_data, family = binomial)
  
  wald_test[i] = summary(mod_full)$coefficients["x2", "z value"]
  lrt_test[i] = anova(mod_1, mod_full, test = "LRT")[2, "Deviance"]
}

(b) Plot a histogram of the empirical values for the Wald test statistic. Overlay the density of the true distribution assuming a large sample.

hist(wald_test, freq = FALSE,
     xlab = "Empirical Values of z-values",
     main = "Histogram of the Wald Test Statistic",
     breaks = 30,
     col = "dodgerblue")
curve(dnorm(x, mean = 0, sd = 1), col = "darkorange", add = TRUE, lwd = 4)

(c) Use the empirical results for the Wald test statistic to estimate the probability of observing a test statistic larger than 1. Also report this probability using the true distribution of the test statistic assuming a large sample.

mean(wald_test > 1)
## [1] 0.168
pnorm(1, mean = 0, sd = 1, lower.tail = FALSE)  
## [1] 0.1587

(d) Plot a histogram of the empirical values for the likelihood ratio test statistic. Overlay the density of the true distribution assuming a large sample.

hist(lrt_test, freq = FALSE,
     xlab = "Empirical Values of Deviances",
     main = "Histogram of the LRT Statistic",
     breaks = 30,
     col = "dodgerblue")
k = 3 - 1  # df = difference in the number of predictors between the full and reduced models
curve(dchisq(x, df = k), col = "darkorange", add = TRUE, lwd = 4)

(e) Use the empirical results for the likelihood ratio test statistic to estimate the probability of observing a test statistic larger than 5. Also report this probability using the true distribution of the test statistic assuming a large sample.

mean(lrt_test > 5)
## [1] 0.0828
pchisq(5, df = k, lower.tail = FALSE)
## [1] 0.08208

(f) Repeat (a)-(e) but with simulation using a smaller sample size of 10. Based on these results, is this sample size large enough to use the standard normal and \(\chi^2\) distributions in this situation? Explain.

sample_size = 10
set.seed(420)
x1 = rnorm(n = sample_size)
x2 = rnorm(n = sample_size)
x3 = rnorm(n = sample_size)
beta_0 = 0.4
beta_1 = -0.35
num_sims = 2500
eta = beta_0 + beta_1 * x1
p = 1 / (1 + exp(-eta))

y = rep(0, sample_size)
sim_data_2 = data.frame(y, x1, x2, x3)

wald_test_2 = rep(0, num_sims)
lrt_test_2 = rep(0, num_sims)

for (i in 1:num_sims){
  sim_data_2$y = rbinom(n = sample_size, size = 1, prob = p) 
  
  mod_full = glm(y ~ x1 + x2 + x3, data = sim_data_2, family = binomial)
  mod_1 = glm(y ~ x1, data = sim_data_2, family = binomial)
  
  wald_test_2[i] = summary(mod_full)$coefficients["x2", "z value"]
  lrt_test_2[i] = anova(mod_1, mod_full, test = "LRT")[2, "Deviance"]
}


hist(wald_test_2, freq = FALSE,
     xlab = "Empirical Values of z-values",
     main = "Histogram of the Wald Test Statistic",
     breaks = 30,
     col = "dodgerblue")
curve(dnorm(x, mean = 0, sd = 1), col = "darkorange", add = TRUE, lwd = 4)

hist(lrt_test_2, freq = FALSE,
     xlab = "Empirical Values of Deviances",
     main = "Histogram of the LRT Statistic",
     breaks = 30,
     col = "dodgerblue")
k = 3 - 1  # df = difference in the number of predictors between the full and reduced models
curve(dchisq(x, df = k), col = "darkorange", add = TRUE, lwd = 4)

mean(wald_test_2 > 1)
## [1] 0.0796
mean(lrt_test_2 > 5)
## [1] 0.3324
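
For direct comparison, the large-sample reference probabilities from parts (c) and (e) (0.1587 and 0.0821) can be recomputed here:

pnorm(1, mean = 0, sd = 1, lower.tail = FALSE)
pchisq(5, df = k, lower.tail = FALSE)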

\(\color{blue}{\text{A sample size of 10 does not seem large enough to use either a standard normal or chi-squared distribution. }}\) \(\color{blue}{\text{It is noticeable that the histogram of the empirical results and the curve of the true distribution don't match well.}}\) \(\color{blue}{\text{The estimated and true probabilities are also pretty far off. }}\)


Exercise 2 (Surviving the Titanic)

For this exercise use the ptitanic data from the rpart.plot package. (The rpart.plot package depends on the rpart package.) Use ?rpart.plot::ptitanic to learn about this dataset. We will use logistic regression to help predict which passengers aboard the Titanic will survive based on various attributes.

# install.packages("rpart")
# install.packages("rpart.plot")
library(rpart)
library(rpart.plot)
data("ptitanic")

For simplicity, we will remove any observations with missing data. Additionally, we will create a test and train dataset.

ptitanic = na.omit(ptitanic)
set.seed(42)
trn_idx = sample(nrow(ptitanic), 300)
ptitanic_trn = ptitanic[trn_idx, ]
ptitanic_tst = ptitanic[-trn_idx, ]

(a) Consider the model

\[ \log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_3x_4 \]

where

\[ p({\bf x}) = P[Y = 1 \mid {\bf X} = {\bf x}] \]

is the probability that a certain passenger survives given their attributes, \(x_1\) and \(x_2\) are dummy variables for the second and third passenger classes (pclass), \(x_3\) is a dummy variable for sex, and \(x_4\) is age.

Fit this model to the training data and report its deviance.

tt_mod = glm(survived ~ pclass + sex + age + sex:age, data = ptitanic_trn, family = binomial)
# summary(tt_mod)
deviance(tt_mod)
## [1] 281.5
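
To see how R builds the dummy variables for pclass and sex (the \(x_1\) through \(x_4\) above), the design matrix of the fitted model can be inspected; this is an optional check, not part of the required answer.

head(model.matrix(tt_mod))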

(b) Use the model fit in (a) and an appropriate statistical test to determine if class played a significant role in surviving on the Titanic. Use \(\alpha = 0.01\). Report the test statistic, the p-value of the test, and a statistical decision.

tt_null = glm(survived ~ sex + age + sex:age, data = ptitanic_trn, family = binomial)
anova(tt_null, tt_mod, test = "LRT")
## Analysis of Deviance Table
## 
## Model 1: survived ~ sex + age + sex:age
## Model 2: survived ~ pclass + sex + age + sex:age
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
## 1       296        312                         
## 2       294        282  2     30.7  2.2e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(tt_null, tt_mod, test = "LRT")[2, "Deviance"]
## [1] 30.67
anova(tt_null, tt_mod, test = "LRT")[2, "Pr(>Chi)"]
## [1] 2.185e-07
anova(tt_null, tt_mod, test = "LRT")[2, "Pr(>Chi)"] < 0.01
## [1] TRUE

(c) Use the model fit in (a) and an appropriate statistical test to determine if an interaction between age and sex played a significant role in surviving on the Titanic. Use \(\alpha = 0.01\). Report the test statistic, the p-value of the test, and a statistical decision.

tt_no_int = glm(survived ~ pclass + sex + age, data = ptitanic_trn, family = binomial)
anova(tt_no_int, tt_mod, test = "LRT")
## Analysis of Deviance Table
## 
## Model 1: survived ~ pclass + sex + age
## Model 2: survived ~ pclass + sex + age + sex:age
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1       295        288                       
## 2       294        282  1      6.6     0.01 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(tt_no_int, tt_mod, test = "LRT")[2, "Deviance"]
## [1] 6.599
anova(tt_no_int, tt_mod, test = "LRT")[2, "Pr(>Chi)"]
## [1] 0.0102
anova(tt_no_int, tt_mod, test = "LRT")[2, "Pr(>Chi)"] < 0.01
## [1] FALSE

(d) Use the model fit in (a) as a classifier that seeks to minimize the misclassification rate. Classify each of the passengers in the test dataset. Report the misclassification rate, the sensitivity, and the specificity of this classifier. (Use survived as the positive class.)

pred = ifelse(predict(tt_mod, ptitanic_tst, type = "response") > 0.5, "survived", "died")
mean(pred != ptitanic_tst$survived)
## [1] 0.2118
# Functions from the text: https://daviddalpiaz.github.io/appliedstats/logistic-regression.html
make_conf_mat = function(predicted, actual){
  table(predicted = predicted, actual = actual)
}

get_sens = function(conf_mat){
  conf_mat[2, 2] / sum(conf_mat[, 2])
}

get_spec = function(conf_mat){
  conf_mat[1, 1] / sum(conf_mat[, 1])
}

(conf_mat = make_conf_mat(predicted = pred, actual = ptitanic_tst$survived))
##           actual
## predicted  died survived
##   died      368       80
##   survived   78      220
get_sens(conf_mat)
## [1] 0.7333
get_spec(conf_mat)
## [1] 0.8251

Exercise 3 (Breast Cancer Detection)

For this exercise we will use data found in wisc-train.csv and wisc-test.csv, which contain train and test data, respectively. wisc.csv is provided but not used. This is a modification of the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository. Only the first 10 feature variables have been provided. (And these are all you should use.)

You should consider coercing the response to be a factor variable if it is not stored as one after importing the data.

library(readr)
wisc_train <- read_csv("wisc-train.csv")
## Parsed with column specification:
## cols(
##   class = col_character(),
##   radius = col_double(),
##   texture = col_double(),
##   perimeter = col_double(),
##   area = col_double(),
##   smoothness = col_double(),
##   compactness = col_double(),
##   concavity = col_double(),
##   concave = col_double(),
##   symmetry = col_double(),
##   fractal = col_double()
## )
#View(wisc_train)
is.factor(wisc_train$class)
## [1] FALSE
wisc_train$class = as.factor(wisc_train$class)
is.factor(wisc_train$class)
## [1] TRUE
wisc_test <- read_csv("wisc-test.csv")
## Parsed with column specification:
## cols(
##   class = col_character(),
##   radius = col_double(),
##   texture = col_double(),
##   perimeter = col_double(),
##   area = col_double(),
##   smoothness = col_double(),
##   compactness = col_double(),
##   concavity = col_double(),
##   concave = col_double(),
##   symmetry = col_double(),
##   fractal = col_double()
## )
#View(wisc_test)
is.factor(wisc_test$class)
## [1] FALSE
wisc_test$class = as.factor(wisc_test$class)
is.factor(wisc_test$class)
## [1] TRUE

(a) The response variable class has two levels: M if a tumor is malignant, and B if a tumor is benign. Fit three models to the training data: an additive model using radius, smoothness, and texture as predictors; an additive model using all available predictors; and a model chosen via backwards selection with AIC, starting from a model containing all available predictors and their two-way interactions.

\(\color{blue}{\text{NOTE: warnings are suppressed for these chunks because glm repeatedly warns: }}\) \(\color{blue}{\text{glm.fit: fitted probabilities numerically 0 or 1 occurred.}}\)

wisc_add_sm = glm(class ~ radius + smoothness + texture, data = wisc_train, family = binomial)
wisc_add_all = glm(class ~ ., data = wisc_train, family = binomial)
wisc_full = glm(class ~ . ^ 2, data = wisc_train, family = binomial, maxit = 50)
wisc_select = step(wisc_full, trace = 0)

#summary(wisc_select)

For each, obtain a 5-fold cross-validated misclassification rate using the model as a classifier that seeks to minimize the misclassification rate. Based on this, which model is best? Relative to the best, are the other two underfitting or overfitting? Report the test misclassification rate for the model you picked as the best.

library(boot)
set.seed(1)
cv.glm(wisc_train, wisc_add_sm, K = 5)$delta[1]
## [1] 0.06823
set.seed(1)
cv.glm(wisc_train, wisc_add_all, K = 5)$delta[1]
## [1] 0.1286
set.seed(1)
cv.glm(wisc_train, wisc_select, K = 5)$delta[1]
## [1] 0.1699
pred = ifelse(predict(wisc_add_sm, wisc_test, type = "response") > 0.5, "M", "B")
mean(pred != wisc_test$class)
## [1] 0.08955
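
Note that cv.glm() uses average squared error as its default cost, so the delta values above are not literally misclassification rates. A 0/1 cost function can be supplied to cv.glm() to obtain a cross-validated misclassification rate directly; the sketch below assumes the usual 0.5 cutoff, and cost_misclass is a hypothetical helper, not part of the original solution.

# cv.glm passes the observed 0/1 response and the predicted probabilities to cost
cost_misclass = function(y, p_hat) mean(abs(y - p_hat) > 0.5)
set.seed(1)
cv.glm(wisc_train, wisc_add_sm, cost = cost_misclass, K = 5)$delta[1]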

Based on this, which model is best? Report the test misclassification rate for the model you picked as the best.

\(\color{blue}{\text{The small additive model using radius, smoothness, and texture as predictors is best. Its test misclassification rate is 0.0896.}}\)

Relative to the best, are the other two underfitting or over fitting?

\(\color{blue}{\text{ The other two models are overfitting }}\)

(b) In this situation, simply minimizing misclassifications might be a bad goal since false positives and false negatives carry very different consequences. Consider the M class as the “positive” label. Consider each of the probabilities stored in cutoffs in the creation of a classifier using the additive model fit in (a).

cutoffs = seq(0.01, 0.99, by = 0.01)

That is, consider each of the values stored in cutoffs as \(c\). Obtain the sensitivity and specificity in the test set for each of these classifiers. Using a single graphic, plot both sensitivity and specificity as a function of the cutoff used to create the classifier. Based on this plot, which cutoff would you use? (0 and 1 have not been considered for coding simplicity. If you like, you can instead consider these two values.)

\[ \hat{C}(\bf x) = \begin{cases} 1 & \hat{p}({\bf x}) > c \\ 0 & \hat{p}({\bf x}) \leq c \end{cases} \]

# Functions from the text: https://daviddalpiaz.github.io/appliedstats/logistic-regression.html
make_conf_mat = function(predicted, actual){
  table(predicted = predicted, actual = actual)
}

get_sens = function(conf_mat){
  conf_mat[2, 2] / sum(conf_mat[, 2])
}

get_spec = function(conf_mat){
  conf_mat[1, 1] / sum(conf_mat[, 1])
}

\(\color{blue}{\text{Additive model with three predictors, chosen in part (a).}}\)

n = length(cutoffs)
sens = rep(0, n)
spec = rep(0, n)

for (i in 1:n){
  pred = ifelse(predict(wisc_add_sm, wisc_test, type = "response") > cutoffs[i], "M", "B")
  conf_mat = make_conf_mat(predicted = pred, actual = wisc_test$class)
  sens[i] = get_sens(conf_mat)
  spec[i] = get_spec(conf_mat)
}

plot(sens ~ cutoffs, type = "l", lwd = 3,
     xlab = "Cutoffs", ylab = "Sensitivity & Specificity",
     main = "Cutoffs vs Results", col = "dodgerblue")
axis(1, seq(0, 1, 0.1))
lines(cutoffs, spec, col="darkorange", lty = 2, lwd = 3)
legend("bottom", c("Sensitivity", "Specificity"), lty = c(1, 2), lwd = 2,col = c("dodgerblue", "darkorange"))
grid(nx = 20, ny = 20)

\(\color{blue}{\text{It appears that the two curves intersect at a cutoff of about 0.66.}}\)
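
The crossing point can also be located numerically from the sens and spec vectors computed in the loop above (a quick check of the visual estimate):

# cutoff at which sensitivity and specificity are closest to each other
cutoffs[which.min(abs(sens - spec))]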

\(\color{blue}{\text{Full additive model from part (a).}}\)

n = length(cutoffs)
sens = rep(0, n)
spec = rep(0, n)

for (i in 1:n){
  pred = ifelse(predict(wisc_add_all, wisc_test, type = "response") > cutoffs[i], "M", "B")
  conf_mat = make_conf_mat(predicted = pred, actual = wisc_test$class)
  sens[i] = get_sens(conf_mat)
  spec[i] = get_spec(conf_mat)
}

plot(sens ~ cutoffs, type = "l", lwd = 2,
     xlab = "Cutoffs", ylab = "Sensitivity & Specificity",
     main = "Cutoffs vs Results", col = "dodgerblue")
axis(1, seq(0, 1, 0.1))
lines(cutoffs, spec, col="darkorange", lty = 2, lwd = 3)
legend("bottom", c("Sens", "Spec"), lty = c(1, 2), lwd = 3,
       col = c("dodgerblue", "darkorange"))
grid(nx = 20, ny = 20)

\(\color{blue}{\text{In this plot the two curves appear to intersect at a cutoff of about 0.8.}}\)