Statistical Rethinking Chapter 5: EoC Questions
Easy Problems
5E1 Which of the linear models below are multiple linear regressions?
$$
\begin{align}
(1)\quad \mu_i &= \alpha + \beta x_i \\
(2)\quad \mu_i &= \beta_x x_i + \beta_z z_i \\
(3)\quad \mu_i &= \alpha + \beta(x_i - z_i) \\
(4)\quad \mu_i &= \alpha + \beta_x x_i + \beta_z z_i
\end{align}
$$

Answers:

- (1) Simple linear regression: only one predictor.
- (2) Multiple regression: two predictors, x and z.
- (3) Simple linear regression: only one slope, on the single derived predictor (x - z), which is a linear combination of the two variables.
- (4) Multiple regression: two predictors, x and z; similar to (2) but with an intercept as well.
5E3 Write down a multiple regression to evaluate the claim: Neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree; but together these variables are both positively associated with time to degree. Write down the model definition and indicate which side of zero each slope parameter should be on.
$$
\begin{align}
\operatorname{Time\_to\_Degree} &= \alpha + \beta \operatorname{Funding} \tag*{(will not show a relation)} \\
\operatorname{Time\_to\_Degree} &= \alpha + \beta \operatorname{LabSize} \tag*{(will not show a relation)} \\
\operatorname{Time\_to\_Degree} &= \alpha + \beta_1 \operatorname{Funding} + \beta_2 \operatorname{LabSize} \tag*{(will show a relation)}
\end{align}
$$
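Both slope parameters in the joint model should be on the positive side of zero, since the claim is that, together, funding and lab size are positively associated with time to degree. A fuller model definition (the priors here are illustrative only) might look like:

$$
\begin{align}
T_i &\sim \operatorname{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta_F F_i + \beta_L L_i \\
\alpha &\sim \operatorname{Normal}(0, 1) \\
\beta_F &\sim \operatorname{Normal}(0, 1) \\
\beta_L &\sim \operatorname{Normal}(0, 1) \\
\sigma &\sim \operatorname{Exponential}(1)
\end{align}
$$

where \(T\) is time to degree, \(F\) is funding, and \(L\) is laboratory size (all standardized); we expect both \(\beta_F\) and \(\beta_L\) to end up positive in the posterior.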
5E4 Suppose you have a single categorical predictor with 4 levels (unique values), labeled A, B, C and D. Let \(A_i\) be an indicator variable that is 1 where case i is in category A. Also suppose \(B_i\), \(C_i\), and \(D_i\) for the other categories. Now which of the following linear models are inferentially equivalent ways to include the categorical variable in a regression? Models are inferentially equivalent when it is possible to compute one posterior distribution from the posterior distribution of another model.
$$
\begin{align}
(1)\quad \mu_i &= \alpha + \beta_A A_i + \beta_B B_i + \beta_D D_i \\
(2)\quad \mu_i &= \alpha + \beta_A A_i + \beta_B B_i + \beta_C C_i + \beta_D D_i \\
(3)\quad \mu_i &= \alpha + \beta_B B_i + \beta_C C_i + \beta_D D_i \\
(4)\quad \mu_i &= \alpha_A A_i + \alpha_B B_i + \alpha_C C_i + \alpha_D D_i \\
(5)\quad \mu_i &= \alpha_A (1 - B_i - C_i - D_i) + \alpha_B B_i + \alpha_C C_i + \alpha_D D_i
\end{align}
$$
- (1) and (3) are equivalent: each uses an intercept plus n - 1 dummy variables, they just omit different reference categories ((1) omits C, (3) omits A).
- (4) and (5) are equivalent: both assign one intercept per category (the index-variable approach); in (5), \(1 - B_i - C_i - D_i\) is simply \(A_i\).
- (2) uses an intercept plus all four dummy variables, so it will run into a multi-collinearity issue (the parameters are not individually identifiable).

Every model except (2) should provide an equivalent posterior.
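To see the equivalence concretely, the intercepts of the index-variable model (4) can be computed directly from the posterior of, say, model (1), where C is the reference category:

$$
\begin{align}
\alpha_A &= \alpha + \beta_A, &
\alpha_B &= \alpha + \beta_B, &
\alpha_C &= \alpha, &
\alpha_D &= \alpha + \beta_D
\end{align}
$$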
Medium Problems
5M3 It is sometimes observed that the best predictor of fire risk is the presence of firefighters — States and localities with many firefighters also have more fires. Presumably firefighters do not cause fires. Nevertheless, this is not a spurious correlation. Instead fires cause firefighters. Consider the same reversal of causal inference in the context of the divorce and marriage data. How might a high divorce rate cause a higher marriage rate? Can you think of a way to evaluate this relationship, using multiple regression?
Here we are asked about a higher divorce rate leading to a higher marriage rate:

\[ \operatorname{Divorce} \rightarrow \operatorname{Marriage} \]

It could be that some people are simply more inclined towards marriage, and so are more willing to marry again after a divorce. We would like to control for the effect of this inclination, and a helpful statistic for that would be the remarriage rate.
$$
\begin{align}
\operatorname{MarriageRate} &\sim \operatorname{Normal}(\mu, \sigma) \\
\mu &= \alpha + \beta_1 \operatorname{DivorceRate} + \beta_2 \operatorname{RemarriageRate} \\
&\cdots
\end{align}
$$
Here, we are regressing marriage rate on both divorce rate and remarriage rate, which controls for the tendency of some people to marry at higher rates. After running this regression, if the coefficient on remarriage rate is reliably non-zero while the coefficient on divorce rate is concentrated near zero, that would imply that the hypothesised causal path from divorce rate to marriage rate was incorrect.
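A minimal quap() sketch of this regression. Remarriage rate is not in the WaffleDivorce data, so this assumes a data frame d with standardized columns Mar (marriage rate), D (divorce rate), and a hypothetical R (remarriage rate):

```r
library(rethinking)

# d is assumed to hold standardized columns:
# Mar (marriage rate), D (divorce rate), R (hypothetical remarriage rate)
m5m3 <- quap(
  alist(
    Mar ~ dnorm(mu, sigma),
    mu <- a + bD * D + bR * R,
    a ~ dnorm(0, 0.2),
    bD ~ dnorm(0, 0.5),
    bR ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ),
  data = d
)
precis(m5m3) # check whether bD concentrates near zero once R is included
```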
5M4 In the divorce data, States with high numbers of members of the Church of Jesus Christ of Latter-day Saints (LDS) have much lower divorce rates than the regression models expected. Find a list of LDS population by State and use those numbers as a predictor variable, predicting divorce rate using marriage rate, median age at marriage, and percent LDS population (possibly standardized). You may want to consider transformations of the raw percent LDS variable.
First, let us get the data from the Wikipedia page. The membership figures there are from 2019, while the WaffleDivorce dataset has statistics from 2010, but we will make do with this for now.
```r
library(rvest)
library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
library(readr)

church_pop <- read_html("https://en.wikipedia.org/wiki/The_Church_of_Jesus_Christ_of_Latter-day_Saints_membership_statistics_(United_States)") %>%
  html_table()
idx <- which(sapply(church_pop, function(x) names(x)[[1]] == "State"))
church_pop <- church_pop[[idx]] %>%
  select(State, LDS) %>%
  mutate(LDS = readr::parse_number(LDS) / 100 * 1000) # parse the "%" strings, then convert from percent to a per-1000 rate
```
data("WaffleDivorce")
<- WaffleDivorce
d
<- left_join(d, church_pop, by = c("Location" = "State"))
d2
# standardise LDS
$M <- standardize(d2$Marriage)
d2$A <- standardize(d2$MedianAgeMarriage)
d2$D <- standardize(d2$Divorce)
d2$L <- standardize(d2$LDS) d2
Now we will model the following relationship:
$$
\begin{align}
D &\sim \operatorname{Normal}(\mu, \sigma) \\
\mu &= a + b_A A + b_M M + b_L L \\
a &\sim \operatorname{Normal}(0, 10) \\
b_A &\sim \operatorname{Normal}(0, 10) \\
b_M &\sim \operatorname{Normal}(0, 10) \\
b_L &\sim \operatorname{Normal}(0, 10) \\
\sigma &\sim \operatorname{Exponential}(2)
\end{align}
$$
```r
m4 <- quap(
  alist(
    D ~ dnorm(mu, sigma),
    mu <- a + bA * A + bM * M + bL * L,
    a ~ dnorm(0, 10),
    bA ~ dnorm(0, 10),
    bM ~ dnorm(0, 10),
    bL ~ dnorm(0, 10),
    sigma ~ dexp(2)
  ),
  data = d2
)
precis(m4)
```
```
##                mean         sd       5.5%      94.5%
## a      3.390999e-09 0.10251785 -0.1638433  0.1638433
## bA    -7.552937e-01 0.15167557 -0.9977005 -0.5128868
## bM     6.430358e-03 0.15592224 -0.2427635  0.2556242
## bL    -3.444925e-01 0.12258461 -0.5404064 -0.1485786
## sigma  7.249487e-01 0.07096743  0.6115291  0.8383684
```
Similar to the previous regressions in the chapter, the median-age coefficient is negative (even more so now) and the marriage-rate coefficient is even closer to zero. The new variable, the (standardized) share of the state population belonging to the Church of Jesus Christ of Latter-day Saints, is reliably negative: states with a larger LDS share tend to have lower divorce rates.
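The question also hints at transformations of the raw percent variable. The LDS share is heavily right-skewed (very large in a couple of states, tiny in most others), so taking the log before standardizing is a reasonable alternative; a sketch:

```r
# re-express the per-1000 LDS rate on the log scale before standardizing
d2$L_log <- standardize(log(d2$LDS))
```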
5M5 One way to reason through multiple causation hypotheses is to imagine detailed mechanisms through which predictor variables may influence outcomes. For example, it is sometimes argued that the price of gasoline (predictor variable) is positively associated with lower obesity rates (outcome variable). However, there are at least two important mechanisms by which the price of gas could reduce obesity. First, it could lead to less driving and therefore more exercise. Second, it could lead to less driving, which leads to less eating out, which leads to less consumption of huge restaurant meals. Can you outline one or more multiple regressions that address these two mechanisms? Assume you can have any predictor data you need.
$$
\begin{align}
\operatorname{Obesity} &= \alpha + \beta_1 \operatorname{GasolinePrice} + \beta_2 \operatorname{DrivingTimePerMonth} \\
&\quad + \beta_3 \operatorname{RestaurantMealsPerMonth}
\end{align}
$$

If gasoline price affects obesity only through these two mechanisms, then \(\beta_1\) should move towards zero once driving time and restaurant meals are included alongside it.
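A quap() sketch of this comparison, assuming hypothetical standardized variables O (obesity rate), G (gasoline price), Dr (driving time per month), and R (restaurant meals per month) in a data frame dat; none of these are in the book's datasets:

```r
library(rethinking)

# dat is assumed to hold hypothetical standardized columns O, G, Dr, R
m_total <- quap(
  alist(
    O ~ dnorm(mu, sigma),
    mu <- a + bG * G,
    a ~ dnorm(0, 0.2),
    bG ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ),
  data = dat
)
m_mech <- quap(
  alist(
    O ~ dnorm(mu, sigma),
    mu <- a + bG * G + bDr * Dr + bR * R,
    a ~ dnorm(0, 0.2),
    bG ~ dnorm(0, 0.5),
    bDr ~ dnorm(0, 0.5),
    bR ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ),
  data = dat
)
plot(coeftab(m_total, m_mech)) # bG should shrink towards zero in m_mech
```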
Hard Problems
5H1 In the divorce example, suppose the DAG is: M -> A -> D. What are the implied conditional independencies of the graph? Are the data consistent with it?
```r
library(dagitty)

dag1 <- dagitty('dag{M -> A -> D}')
impliedConditionalIndependencies(dag1)
```

```
## D _||_ M | A
```
Thus the implied conditional independency here is that D is independent of M conditional on A, i.e. M tells us nothing new about D once we know A. We can test this using a multiple regression model; in fact, this has been tested in the chapter, where the coefficient on M is close to zero once A is included.
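dagitty can also test the implied conditional independencies directly against the data via localTests(); a sketch, assuming WaffleDivorce is already loaded as above (the column names must match the DAG's node names):

```r
d_std <- data.frame(
  M = standardize(WaffleDivorce$Marriage),
  A = standardize(WaffleDivorce$MedianAgeMarriage),
  D = standardize(WaffleDivorce$Divorce)
)
# tests each implied independency; an estimate near zero supports the DAG
localTests(dag1, data = d_std, type = "cis")
```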
We can also check the Markov equivalents of this DAG; the second one is the DAG used in the book.

```r
par(mfrow = c(3, 1))
# invisible() suppresses the uninformative list that lapply() would print
invisible(lapply(equivalentDAGs(dag1), plot))
```
5H2 Assuming that the DAG for the divorce example is indeed M -> A -> D, fit a new model and use it to estimate the counterfactual effect of halving a State’s marriage rate M. Use the counterfactual example from the chapter (starting on page 140) as a template.
```r
data(WaffleDivorce)

d <- list()
d$A <- standardize(WaffleDivorce$MedianAgeMarriage)
d$D <- standardize(WaffleDivorce$Divorce)
d$M <- standardize(WaffleDivorce$Marriage)
```
```r
m5h2 <- quap(
  alist(
    ## M -> A -> D
    D ~ dnorm(mu, sigma),
    mu <- a + bM * M + bA * A,
    a ~ dnorm(0, 0.2),
    bM ~ dnorm(0, 0.5),
    bA ~ dnorm(0, 0.5),
    sigma ~ dexp(1),
    ## M -> A
    A ~ dnorm(mu_A, sigma_A),
    mu_A <- aA + bMA * M,
    aA ~ dnorm(0, 0.2),
    bMA ~ dnorm(0, 0.5),
    sigma_A ~ dexp(1)
  ),
  data = d
)
precis(m5h2)
```
```
##                  mean         sd       5.5%      94.5%
## a        1.447161e-07 0.09707596 -0.1551460  0.1551463
## bM      -6.538046e-02 0.15077294 -0.3063447  0.1755838
## bA      -6.135136e-01 0.15098347 -0.8548144 -0.3722129
## sigma    7.851173e-01 0.07784322  0.6607088  0.9095258
## aA       2.449319e-08 0.08684786 -0.1387996  0.1387997
## bMA     -6.947375e-01 0.09572696 -0.8477277 -0.5417473
## sigma_A  6.817371e-01 0.06758011  0.5737310  0.7897432
```
```r
M_seq <- seq(from = -2, to = 2, length.out = 30)
```
First, the general counterfactual effect of M on A and D:
```r
# prep data
sim_dat <- data.frame(M = M_seq)
# simulate A and then D for each value in M_seq;
# vars tells sim() which variables to simulate and in which order
s <- sim(m5h2, data = sim_dat, vars = c("A", "D"))

par(mfrow = c(1, 2))
plot(sim_dat$M, colMeans(s$D), ylim = c(-2, 2), type = "l",
     xlab = "manipulated M", ylab = "counterfactual D")
shade(apply(s$D, 2, PI), sim_dat$M)
mtext("Total counterfactual effect of M on D")

plot(sim_dat$M, colMeans(s$A), ylim = c(-2, 2), type = "l",
     xlab = "manipulated M", ylab = "counterfactual A")
shade(apply(s$A, 2, PI), sim_dat$M)
mtext("Counterfactual effect M -> A")
```
For the effect of halving the marriage rate, we can halve each state's marriage rate, compute the posterior predictions for Divorce at these new marriage rates using the model above, and finally look at the distribution of differences between the predicted divorce rates under the original and the halved marriage rates.
```r
center <- attr(d$M, "scaled:center")
scale <- attr(d$M, "scaled:scale")

d2 <- WaffleDivorce %>%
  select(Location, Divorce, Marriage) %>%
  mutate(HM = Marriage / 2,
         HM = (HM - center) / scale, # put the halved rate on M's standardized scale
         M = standardize(Marriage))

s_half <- sim(m5h2, data = list(M = d2$HM), vars = c("A", "D"))
s_full <- sim(m5h2, data = list(M = d2$M), vars = c("A", "D"))

d2 <- d2 %>%
  mutate(HD = colMeans(s_half$D),
         # NB: scale and center here are M's scaling attributes; strictly,
         # converting D back to per-1000 units should use attr(d$D, ...)
         HD2 = HD * scale + center,
         FD = colMeans(s_full$D),
         FD2 = FD * scale + center)
```
```r
summary(d2$HD - d2$FD)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -1.4612 -1.0433 -0.8926 -0.9316 -0.7765 -0.5815
```

```r
summary(d2$HD2 - d2$FD2)
```

```
##   Min. 1st Qu. Median   Mean 3rd Qu.   Max.
## -5.549  -3.962 -3.390 -3.538  -2.949 -2.209
```
Thus, halving the marriage rate reduces the predicted divorce rate by roughly 0.6 to 1.5 standard deviations (about 2.2 to 5.5 per 1000 as back-transformed above, though that conversion reused M's scaling attributes rather than D's, so the raw-unit figures should be read with caution).
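For reference, a sketch of the back-transform using D's own scaling attributes (not run as part of the original analysis):

```r
# D was standardized, so its attribute stores Divorce's sd;
# the centers cancel in a difference, so only the scale is needed
scale_D <- attr(d$D, "scaled:scale")
summary((d2$HD - d2$FD) * scale_D) # change in divorces per 1000 adults
```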
5H3 Return to the milk energy model, m5.7. Suppose that the true causal relationship among the variables is:
```
M -> N
M -> K
N -> K
```
Now compute the counterfactual effect on K of doubling M. You will need to account for both the direct and indirect paths of causation. Use the counterfactual example from the chapter (starting on page 140) as a template.
```r
data(milk)

d <- milk
d$K <- standardize(d$kcal.per.g)
d$N <- standardize(d$neocortex.perc)
d$M <- standardize(log(d$mass))

# keep only the complete cases for the three variables
dcc <- d[complete.cases(d$K, d$N, d$M), ]

m5h3 <- quap(
  alist(
    # M -> N -> K
    K ~ dnorm(mu, sigma),
    mu <- a + bN * N + bM * M,
    a ~ dnorm(0, 0.2),
    bN ~ dnorm(0, 0.5),
    bM ~ dnorm(0, 0.5),
    sigma ~ dexp(1),
    # M -> N
    N ~ dnorm(mu_N, sigma_N),
    mu_N <- aN + bMN * M,
    aN ~ dnorm(0, 0.5),
    bMN ~ dnorm(0, 0.5),
    sigma_N ~ dexp(1)
  ),
  data = dcc
)
```
With the model fitted, let us draw the counterfactual plots to see the effect of manipulating M.
```r
M_seq <- seq(-2, 2, length.out = 30)

sim_dat <- sim(m5h3, data = list(M = M_seq), vars = c("N", "K"))

par(mfrow = c(1, 2))
plot(x = M_seq, y = colMeans(sim_dat$K), type = "l", ylim = c(-2, 2),
     xlab = "Manipulated M", ylab = "Total counterfactual K")
shade(apply(sim_dat$K, 2, PI), M_seq)
mtext("Total counterfactual effect of M on K")

plot(x = M_seq, y = colMeans(sim_dat$N), type = "l", ylim = c(-2, 2),
     xlab = "Manipulated M", ylab = "Counterfactual N")
shade(apply(sim_dat$N, 2, PI), M_seq)
mtext("Counterfactual effect M -> N")
```
Now let us check the effect of doubling M on K.
```r
centre <- attr(d$M, "scaled:center")
scale <- attr(d$M, "scaled:scale")

d2 <- dcc %>%
  select(kcal.per.g, neocortex.perc, mass) %>%
  mutate(mass = log(mass),
         # NB: this doubles log(mass); doubling mass itself would instead
         # add log(2) on the log scale (see the note below)
         m2 = mass * 2,
         M2 = (m2 - centre) / scale)

sim_double <- sim(m5h3, list(M = d2$M2), vars = c("N", "K"))
sim_orig <- sim(m5h3, list(M = dcc$M), vars = c("N", "K"))

d3 <- d2 %>%
  mutate(K2 = colMeans(sim_double$K),
         # NB: scale and centre are log(mass)'s scaling attributes; converting
         # K back to kcal/g should strictly use attr(d$K, ...)
         kc2 = K2 * scale + centre,
         K = colMeans(sim_orig$K),
         kc = K * scale + centre)
```
```r
summary(d3$K2 - d3$K)
```

```
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -0.75206 -0.54210 -0.22480 -0.25011 -0.09478  0.29238
```

```r
summary(d3$kc2 - d3$kcal.per.g)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.   Max.
## -1.0763 -0.5211  0.4727  0.4312  1.0667 2.2484
```
As run here, doubling M decreases K by up to about 0.75 standard deviations for some species and increases it by up to about 0.29 for others (roughly -1.08 to +2.25 kcal/g after the back-transform, which carries the scaling caveats noted in the code comments).
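If the intention is to double body mass itself rather than log(mass), the manipulation is additive on the log scale. A sketch using the objects above (no claim is made about its output, since it was not run for the original analysis):

```r
# doubling mass multiplies it by 2, i.e. adds log(2) on the log scale
M2_alt <- (log(2 * dcc$mass) - centre) / scale
sim_double_alt <- sim(m5h3, list(M = M2_alt), vars = c("N", "K"))
summary(colMeans(sim_double_alt$K) - colMeans(sim_orig$K))
```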
5H4 Here is an open practice problem to engage your imagination. In the divorce data, States in the southern United States have many of the highest divorce rates. Add the South indicator variable to the analysis. First, draw one or more DAGs that represent your ideas for how Southern American culture might influence any of the other three variables (D, M, or A). Then list the testable implications of your DAGs, if there are any, and fit one or more models to evaluate the implications. What do you think the influence of "Southerness" is?
Southernness could plausibly influence M and/or A.
```r
par(mfrow = c(1, 1))

# S influences M and A; A influences D and M; M influences D;
# S also influences D directly
dag4 <- dagitty("dag{
  M <- S -> A
  D <- A -> M
  D <- M
  D <- S
}")
coordinates(dag4) <- list(x = c(S = 1, M = 2, A = 0, D = 1),
                          y = c(S = 0, M = 1, A = 1, D = 2))
plot(dag4)
```
Testable implications:
- \(S \not\!\perp\!\!\!\perp A\)
- \(S \not\!\perp\!\!\!\perp M\)
- \(S \not\!\perp\!\!\!\perp D\)
The rest are the same as in the chapter.
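Since every pair of variables is connected by an edge in this DAG, dagitty should report no implied conditional independencies; the marginal associations above are its only testable implications:

```r
impliedConditionalIndependencies(dag4) # nothing returned: the DAG is fully connected
```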
data("WaffleDivorce")
<- WaffleDivorce %>%
d select(Divorce, Marriage, MedianAgeMarriage, South) %>%
mutate(across(Divorce:MedianAgeMarriage, standardize)) %>% # standardize the numerical ones %>%
mutate(South = ifelse(South == 0, 1, 2)) %>%
rename(M = Marriage, D = Divorce, A = MedianAgeMarriage, S = South)
round(cor(d), 2)
```
##       D    M     A     S
## D  1.00 0.37 -0.60  0.35
## M  0.37 1.00 -0.72  0.08
## A -0.60 -0.72 1.00 -0.25
## S  0.35 0.08 -0.25  1.00
```
South does seem to be related to Divorce and Age, but its relationship with Marriage is mostly non-existent.
Let us fit a multiple regression model for Divorce using all three as predictors, and compare it to a model without South. For South we will use the index-variable approach.
```r
m5h4_F <- quap(
  alist(
    D ~ dnorm(mu, sigma),
    mu <- a[S] + bM * M + bA * A,
    a[S] ~ dnorm(0, 1),
    bM ~ dnorm(0, 1),
    bA ~ dnorm(0, 1),
    sigma ~ dexp(1)
  ),
  data = d
)
precis(m5h4_F, depth = 2)
```
```
##             mean         sd       5.5%       94.5%
## a[1]  -0.12422445 0.12722620 -0.3275565  0.07910758
## a[2]   0.31168920 0.20500294 -0.0159451  0.63932349
## bM    -0.06654026 0.15544994 -0.3149793  0.18189878
## bA    -0.58922195 0.15972715 -0.8444968 -0.33394711
## sigma  0.76056173 0.07521938  0.6403466  0.88077683
```
a[1] and a[2], the parameters for non-Southern and Southern states respectively, both have zero inside their credible intervals (a[2] only barely), and thus we can conclude that they do not provide much extra information after conditioning on Marriage Rate and Median Age at marriage.
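A more direct check is the posterior contrast between the two intercepts, since comparing marginal intervals does not directly answer whether the parameters differ; a sketch using the standard rethinking workflow:

```r
post <- extract.samples(m5h4_F)
diff_a <- post$a[, 2] - post$a[, 1] # Southern minus non-Southern intercept
mean(diff_a)
PI(diff_a) # 89% percentile interval of the contrast
```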
We can compare the coefficients for M and A with a model without South.
```r
m5h4_A <- quap(
  alist(
    D ~ dnorm(mu, sigma),
    mu <- a + bM * M + bA * A,
    a ~ dnorm(0, 1),
    bM ~ dnorm(0, 1),
    bA ~ dnorm(0, 1),
    sigma ~ dexp(1)
  ),
  data = d
)
```
```r
plot(coeftab(m5h4_F, m5h4_A))
```
As can be seen, there is only a very slight difference in the parameter values.