In this exercise, we will investigate the sampling distributions of test statistics for hypothesis tests in logistic regression. We will use the following predictors.

```
sample_size = 150
set.seed(420)
x1 = rnorm(n = sample_size)
x2 = rnorm(n = sample_size)
x3 = rnorm(n = sample_size)
```

Recall that

\[ p({\bf x}) = P[Y = 1 \mid {\bf X} = {\bf x}] \]

Consider the true model

\[ \log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = \beta_0 + \beta_1 x_1 \]

where

- \(\beta_0 = 0.4\)
- \(\beta_1 = -0.35\)
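For intuition, the success probability implied by this model is the inverse logit of the linear predictor. A quick sketch using the \(\beta\) values above (the `_demo` names and the value of `x1` are illustrative, not part of the exercise setup):

```r
# Inverse logit: p(x) = 1 / (1 + exp(-eta)), with eta = beta_0 + beta_1 * x1
beta_0 = 0.4
beta_1 = -0.35
x1_demo  = 0.5                        # arbitrary example value
eta_demo = beta_0 + beta_1 * x1_demo
p_demo   = 1 / (1 + exp(-eta_demo))
p_demo                                # approximately 0.556
```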

**(a)** To investigate the distributions, simulate from this model 2500 times. To do so, calculate

\[ P[Y = 1 \mid {\bf X} = {\bf x}] \]

for an observation, and then make a random draw from a Bernoulli distribution with that success probability. (Note that a Bernoulli distribution is a Binomial distribution with parameter \(n = 1\); there is no dedicated Bernoulli function in `R`.)
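So Bernoulli draws can be made with `rbinom()` by setting `size = 1`. A minimal illustration (seed and probability chosen arbitrarily):

```r
# A Bernoulli(p) draw is a Binomial(n = 1, p) draw, so rbinom() works directly
set.seed(42)                                  # arbitrary seed
draws = rbinom(n = 10, size = 1, prob = 0.7)  # ten Bernoulli(0.7) draws
draws                                         # each draw is 0 or 1
mean(draws)                                   # empirical success rate
```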

Each time, fit the model:

\[ \log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \]

Store the test statistics for two tests:

- The Wald test for \(H_0: \beta_2 = 0\), which we say follows a standard normal distribution for “large” samples
- The likelihood ratio test for \(H_0: \beta_2 = \beta_3 = 0\), which we say follows a \(\chi^2\) distribution (with some degrees of freedom) for “large” samples
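As a reminder of where each statistic comes from: the Wald statistic is the estimated coefficient divided by its standard error (the "z value" column of `summary()`), and the LRT statistic is the difference in residual deviances between the reduced and full models. A standalone sketch on freshly simulated data (the `_demo` names and sample size are illustrative only, so they don't clobber the exercise's variables):

```r
# Illustrative sketch, separate from the simulation study below
set.seed(1)
n_demo  = 50
x1_demo = rnorm(n_demo)
x2_demo = rnorm(n_demo)
x3_demo = rnorm(n_demo)
y_demo  = rbinom(n_demo, size = 1, prob = 1 / (1 + exp(-(0.4 - 0.35 * x1_demo))))

full    = glm(y_demo ~ x1_demo + x2_demo + x3_demo, family = binomial)
reduced = glm(y_demo ~ x1_demo, family = binomial)

# Wald: estimate / standard error, which is what summary() reports as "z value"
est = coef(summary(full))["x2_demo", "Estimate"]
se  = coef(summary(full))["x2_demo", "Std. Error"]
est / se

# LRT: difference in residual deviances of the two nested models
deviance(reduced) - deviance(full)
```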

```
beta_0 = 0.4
beta_1 = -0.35
num_sims = 2500

# True success probabilities: inverse logit of the linear predictor
eta = beta_0 + beta_1 * x1
p = 1 / (1 + exp(-eta))

y = rep(0, sample_size)
sim_data = data.frame(y, x1, x2, x3)

wald_test = rep(0, num_sims)
lrt_test = rep(0, num_sims)

for (i in 1:num_sims) {
  # Simulate the response, then fit the full and reduced models
  sim_data$y = rbinom(n = sample_size, size = 1, prob = p)
  mod_full = glm(y ~ x1 + x2 + x3, data = sim_data, family = binomial)
  mod_1 = glm(y ~ x1, data = sim_data, family = binomial)
  # Wald statistic for H0: beta_2 = 0
  wald_test[i] = summary(mod_full)$coefficients["x2", "z value"]
  # LRT statistic for H0: beta_2 = beta_3 = 0
  lrt_test[i] = anova(mod_1, mod_full, test = "LRT")[2, "Deviance"]
}
```

**(b)** Plot a histogram of the empirical values for the Wald test statistic. Overlay the density of the true distribution assuming a large sample.

```
hist(wald_test, freq = FALSE,
     xlab = "Empirical Values of z-values",
     main = "Histogram of the Wald Test Statistic",
     breaks = 30,
     col = "dodgerblue")
# Overlay the standard normal density; curve() evaluates its expression in x
curve(dnorm(x, mean = 0, sd = 1), col = "darkorange", add = TRUE, lwd = 4)
```

**(c)** Use the empirical results for the Wald test statistic to estimate the probability of observing a test statistic larger than 1. Also report this probability using the true distribution of the test statistic assuming a large sample.

`mean(wald_test > 1)`

`## [1] 0.168`

`pnorm(1, mean = 0, sd = 1, lower.tail = FALSE)`

`## [1] 0.1587`

**(d)** Plot a histogram of the empirical values for the likelihood ratio test statistic. Overlay the density of the true distribution assuming a large sample.

```
hist(lrt_test, freq = FALSE,
     xlab = "Empirical Values of Deviances",
     main = "Histogram of the LRT Statistic",
     breaks = 30,
     col = "dodgerblue")
# Degrees of freedom: parameters in the full model minus the reduced model
k = 4 - 2
curve(dchisq(x, df = k), col = "darkorange", add = TRUE, lwd = 4)
```

**(e)** Use the empirical results for the likelihood ratio test statistic to estimate the probability of observing a test statistic larger than 5. Also report this probability using the true distribution of the test statistic assuming a large sample.

`mean(lrt_test > 5)`

`## [1] 0.0828`

`pchisq(5, df = k, lower.tail = FALSE)`

`## [1] 0.08208`

**(f)** Repeat **(a)**-**(e)** but with simulation using a smaller sample size of 10. Based on these results, is this sample size large enough to use the standard normal and \(\chi^2\) distributions in this situation? Explain.

```
sample_size = 10
set.seed(420)
x1 = rnorm(n = sample_size)
x2 = rnorm(n = sample_size)
x3 = rnorm(n = sample_size)
```

```
beta_0 = 0.4
beta_1 = -0.35
num_sims = 2500

# True success probabilities for the smaller sample
eta = beta_0 + beta_1 * x1
p = 1 / (1 + exp(-eta))

y = rep(0, sample_size)
sim_data_2 = data.frame(y, x1, x2, x3)

wald_test_2 = rep(0, num_sims)
lrt_test_2 = rep(0, num_sims)

for (i in 1:num_sims) {
  sim_data_2$y = rbinom(n = sample_size, size = 1, prob = p)
  mod_full = glm(y ~ x1 + x2 + x3, data = sim_data_2, family = binomial)
  mod_1 = glm(y ~ x1, data = sim_data_2, family = binomial)
  wald_test_2[i] = summary(mod_full)$coefficients["x2", "z value"]
  lrt_test_2[i] = anova(mod_1, mod_full, test = "LRT")[2, "Deviance"]
}

hist(wald_test_2, freq = FALSE,
     xlab = "Empirical Values of z-values",
     main = "Histogram of the Wald Test Statistic",
     breaks = 30,
     col = "dodgerblue")
# Overlay the standard normal density
curve(dnorm(x, mean = 0, sd = 1), col = "darkorange", add = TRUE, lwd = 4)
```

```
hist(lrt_test_2, freq = FALSE,
     xlab = "Empirical Values of Deviances",
     main = "Histogram of the LRT Statistic",
     breaks = 30,
     col = "dodgerblue")
# Degrees of freedom: parameters in the full model minus the reduced model
k = 4 - 2
curve(dchisq(x, df = k), col = "darkorange", add = TRUE, lwd = 4)
```

With only 10 observations, the empirical distributions of both statistics depart noticeably from the standard normal and \(\chi^2\) densities overlaid on the histograms. A sample size of 10 is therefore not large enough to rely on the large-sample distributions of these test statistics in this situation.