Statistical Rethinking Chapter 11 End of Chapter Questions
Easy Questions
11E1
If an event has probability 0.35, what are the log odds of this event?
Odds = p / (1-p)
p <- 0.35
log(p / (1-p))
## [1] -0.6190392
11E2
If an event has log-odds 3.2, what is the probability of this event?
Prob = Odds / (1 + Odds)
lo <- 3.2
exp(lo) / (1 + exp(lo))
## [1] 0.9608343
11E3
Suppose that a coefficient in a logistic regression has value 1.7. What does this imply about the proportional change in odds of the outcome?
The odds of the outcome are multiplied by exp(1.7) ≈ 5.47 for every unit increase in the predictor.
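A quick check in R:
exp(1.7)
## [1] 5.473947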
11E4
Why do Poisson regressions sometimes require the use of an offset? Provide an example.
If we are comparing/using data whose counts are recorded over different units of time/space, then using an offset becomes necessary. For instance, let's say we are studying the number of people that visit a bank branch. If one branch records daily counts while the other records weekly totals, then comparing the branches requires an offset (the log of the exposure) so that both are modelled on a common rate. A minimal sketch of this setup is shown below.
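A minimal sketch of the bank-branch example (all numbers made up), mirroring the monastery example from the chapter: the log of each observation's exposure enters the linear model as an offset, so both intercepts end up on a common daily-rate scale.
library(rethinking)
set.seed(11)
# branch 1 is observed daily for 30 days; branch 2 reports weekly totals for 4 weeks
daily_visits  <- rpois(30, lambda = 10)     # true rate: 10 visits per day
weekly_visits <- rpois(4, lambda = 7 * 12)  # true rate: 12 visits per day
dat <- list(
  visits       = c(daily_visits, weekly_visits),
  log_exposure = log(c(rep(1, 30), rep(7, 4))), # offset: log of days observed
  branch       = c(rep(1L, 30), rep(2L, 4))
)
m_offset <- quap(
  alist(
    visits ~ dpois(lambda),
    log(lambda) <- log_exposure + a[branch], # the offset has no parameter attached
    a[branch] ~ dnorm(3, 0.5)
  ),
  data = dat
)
# exp(a[1]) and exp(a[2]) are now both daily rates and directly comparable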
Medium Questions
11M1
As explained in the chapter, binomial data can be organized in aggregated and disaggregated forms, without any impact on inference. But the likelihood of the data does change when the data are converted between the two formats. Can you explain why?
In the aggregated form, the probability of observing k successes out of n trials, with probability of success p, is:
\[ Pr(k|n, p) = {n \choose k} p^k (1-p)^{n-k} \]
For the non-aggregated case, the probability of k successes is:
\[ Pr(y_1, ..., y_n | p) = p^k (1-p)^{n-k} \]
In the aggregated form, we have the additional multiplicity term in front, counting the number of ways of getting k successes out of n trials. In the non-aggregated form, the data record the particular sequence in which the successes and failures occurred, so the multiplicity term is absent.
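A quick numeric check (made-up numbers: 6 successes in 9 trials, p = 0.4) shows that the two log-likelihoods differ by exactly the log of the multiplicity term:
n <- 9; k <- 6; p <- 0.4
# aggregated (binomial) log-likelihood
dbinom(k, size = n, prob = p, log = TRUE)
# disaggregated (Bernoulli) log-likelihood of one particular ordering
sum(dbinom(c(rep(1, k), rep(0, n - k)), size = 1, prob = p, log = TRUE))
# the difference is the log multiplicity term
lchoose(n, k)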
11M2
If a coefficient in a Poisson regression has value 1.7, what does this imply about the change in the outcome?
Let it be a regression with a single predictor and intercept. We have:
\[ \log (\lambda) = \alpha + \beta X \]
where \(\beta=1.7\). For one unit increase in X, we have:
\[ \begin{align} \log(\lambda_2) - \log(\lambda_1) &= \alpha + \beta (X+1) - \alpha - \beta(X) \\ \log (\frac{\lambda_2}{\lambda_1}) &= \beta \\ \text{Taking exponent of each side}: \\ \frac{\lambda_2}{\lambda_1} = \exp (\beta) \\ \implies \lambda_2 = \exp(\beta) \lambda_1 \end{align} \]
Thus for every unit increase in the predictor, the outcome (the expected count) is multiplied by exp(1.7) ≈ 5.47.
11M3
Explain why the logit link is appropriate for a binomial generalized linear model.
In a binomial model, the parameter we are mapping to the linear model is the probability of success. A probability must lie between 0 and 1, so we need a link that maps the linear model's space onto [0, 1]. The logit link achieves this.
\[ logit(p_i) = \log \frac{p_i}{1-p_i} = \alpha + \beta x_i \\ p_i = \frac{1}{1 + \exp (-( \alpha + \beta x_i) ) } \]
Plotting the above:
curve(1/(1+exp(-x)), from = -10, to = 10, n = 1e3)
abline(h = c(0,1), lty = 2)
mtext("Logit Link confines the output space to 0-1")
11M4
Explain why the log link is appropriate for a Poisson generalized linear model.
For a Poisson model, we want to map the linear space to a positive only space. The log link allows this. We build a linear model for the log of \(\lambda\), so that after exponentiating we get positive only values.
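A plot analogous to the logit one above, showing that exponentiating the linear model keeps the rate positive:
curve(exp(x), from = -5, to = 5, n = 1e3, xlab = "linear model value", ylab = "lambda")
abline(h = 0, lty = 2)
mtext("Log link confines the rate to positive values")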
11M5
What would it imply to use a logit link for the mean of a Poisson generalized linear model? Can you think of a real research problem for which this would make sense?
Using a logit link would constrain \(\lambda\) to lie within (0, 1). Since \(\lambda\) is the expected count, this would imply a process where the event occurs at most about once per unit of observation on average, with the rate saturating at 1 no matter how large the linear model gets.
If the counts really are just 0 or 1 we could use a binomial model instead. Using a Poisson with a logit link would make sense if we believe that the influence of a predictor diminishes as the rate approaches its ceiling of one event per observation. A sketch of this saturation follows below.
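A small sketch of that saturation (inv_logit is from the rethinking package):
curve(inv_logit(x), from = -5, to = 5, n = 1e3, xlab = "linear model value", ylab = "lambda")
abline(h = c(0, 1), lty = 2)
mtext("A logit-linked rate is bounded between 0 and 1")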
11M6
State the constraints for which the binomial and Poisson distributions have maximum entropy. Are the constraints different at all for binomial and Poisson? Why or why not?
The binomial distribution has maximum entropy when each trial must result in one of two events and the expected value is constant. The Poisson is the limit of the binomial when the probability of success is very small and the number of trials is very large (with the expected count held constant), so it has the same maximum-entropy constraints.
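A quick numeric illustration (made-up numbers): a binomial with large n and small p is nearly indistinguishable from a Poisson with lambda = n * p.
n <- 1000
p <- 0.002
round(rbind(binomial = dbinom(0:6, size = n, prob = p),
            poisson  = dpois(0:6, lambda = n * p)), 4)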
11M7
Use quap to construct a quadratic approximate posterior distribution for the chimpanzee model that includes a unique intercept for each actor, m11.4 (page 330). Compare the quadratic approximation to the posterior distribution produced instead from MCMC. Can you explain both the differences and the similarities between the approximate and the MCMC distributions? Relax the prior on the actor intercepts to Normal(0,10). Re-estimate the posterior using both ulam and quap. Do the differences increase or decrease? Why?
Let’s load the dataset and do the pre-processing:
library(rethinking)
data(chimpanzees)
d <- chimpanzees
d$treatment <- 1 + d$prosoc_left + 2 * d$condition

dat_list <- list(
  pulled_left = d$pulled_left,
  actor = d$actor,
  treatment = as.integer(d$treatment)
)
Repeating the model m11.4 using both quadratic approximation and MCMC:
model_formula <- alist(
  pulled_left ~ dbinom(1, p),
  logit(p) <- a[actor] + b[treatment],
  a[actor] ~ dnorm(0, 1.5),
  b[treatment] ~ dnorm(0, 0.5)
)
quap_model <- quap(model_formula, data = dat_list)
mcmc_model <- ulam(model_formula, data = dat_list, chains = 4, log_lik = TRUE, refresh = 0)
Comparing the model:
plot(coeftab(quap_model, mcmc_model), pch = 16)
We get similar posterior results out of both methods. Only for the second actor's intercept do we get a slight difference, where the quadratic approximation underestimates slightly. Let us look at the density plots for these variables for both models:
postq <- extract.samples(quap_model)
postm <- extract.samples(mcmc_model)

withr::with_par(
  list(mfrow = c(3, 3)),
  code = {
    for(i in 1:7){
      dens(postq$a[,i], col = 'black', main = paste0('a[',i,']'))
      dens(postm$a[,i], col = 'blue', add = TRUE)
      legend("topright", legend = c('quap', 'mcmc'), col = c("black","blue"), lty = 1)
    }
  }
)
withr::with_par(
  list(mfrow = c(2, 2)),
  code = {
    for(i in 1:4){
      dens(postq$b[,i], col = 'black', main = paste0('b[',i,']'))
      dens(postm$b[,i], col = 'blue', add = TRUE)
      legend("topright", legend = c('quap', 'mcmc'), col = c("black","blue"), lty = 1)
    }
  }
)
The posterior distribution seems to be the same in all variables except for the intercept of second actor. Zooming in:
dens(postq$a[,2], col = 'black', main = paste0('a[',2,']'))
dens(postm$a[,2], col = 'blue', add = TRUE)
legend("topright", legend = c('quap', 'mcmc'), col = c("black","blue"), lty = 1)
The MCMC posterior has a much longer tail on the right, whereas the quadratic approximation assumes a Gaussian distribution and thus produces a symmetric one.
Now using the relaxed priors:
model_formula2 <- alist(
  pulled_left ~ dbinom(1, p),
  logit(p) <- a[actor] + b[treatment],
  a[actor] ~ dnorm(0, 10),
  b[treatment] ~ dnorm(0, 0.5)
)
quap_model2 <- quap(model_formula2, data = dat_list)
mcmc_model2 <- ulam(model_formula2, data = dat_list, chains = 4, iter = 4000, log_lik = TRUE, refresh = 0)
## Warning: There were 35 divergent transitions after warmup. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
## Warning: Examine the pairs() plot to diagnose sampling problems
Comparing the model posterior:
plot(coeftab(quap_model2, mcmc_model2, mcmc_model), pch = 16)
Again, the difference is only in the second actor's intercept, and the difference is much larger this time. The quadratic approximation and the MCMC estimates are both quite different from the earlier models, though the quadratic approximation stays much closer to the original estimate.
Again looking at their specific distributions:
postq2 <- extract.samples(quap_model2)
postm2 <- extract.samples(mcmc_model2)
dens(postq2$a[,2], col = 'black', main = paste0('a[',2,']'))
dens(postm2$a[,2], col = 'blue', add = TRUE)
legend("topright", legend = c('quap', 'mcmc'), col = c("black","blue"), lty = 1)
The MCMC posterior is much more right-skewed this time. The second actor always pulled the left lever. To get a probability of 1 in the linear model (logit(p) <- a[actor] + b[treatment]), any sufficiently large value of a works. MCMC preserves this asymmetry, while the quadratic approximation has to produce a symmetric distribution, leading to the difference seen above.
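A quick check of the raw data (d is still the chimpanzees data frame loaded above) confirms this:
mean(d$pulled_left[d$actor == 2])
## [1] 1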
11M8
Revisit the data(Kline) islands example. This time drop Hawaii from the sample and refit the models. What changes do you observe?
Load and pre-process data
library(rethinking)
data("Kline")
d <- Kline
d$P <- scale(log(d$population))
d$contact_id <- ifelse(d$contact == "high", 2, 1)
dat_no_hawaii <- list(
  T = d$total_tools[-10],
  P = d$P[-10],
  cid = d$contact_id[-10]
)
dat <- list(
  T = d$total_tools,
  P = d$P,
  cid = d$contact_id
)
Duplicating the models from the text:
# intercept only
m11.9_orig <- ulam(
  alist(
    T ~ dpois(lambda),
    log(lambda) <- a,
    a ~ dnorm(3, 0.5)
  ),
  data = dat, chains = 4, log_lik = TRUE, refresh = 0
)

# interaction model
m11.10_orig <- ulam(
  alist(
    T ~ dpois(lambda),
    log(lambda) <- a[cid] + b[cid] * P,
    a[cid] ~ dnorm(3, 0.5),
    b[cid] ~ dnorm(0, 0.2)
  ),
  data = dat, chains = 4, log_lik = TRUE, refresh = 0
)
Creating the same two models (intercept-only and interaction) on the data without Hawaii:
# intercept only
m11.9 <- ulam(
  alist(
    T ~ dpois(lambda),
    log(lambda) <- a,
    a ~ dnorm(3, 0.5)
  ),
  data = dat_no_hawaii, chains = 4, log_lik = TRUE, refresh = 0
)

# interaction model
m11.10 <- ulam(
  alist(
    T ~ dpois(lambda),
    log(lambda) <- a[cid] + b[cid] * P,
    a[cid] ~ dnorm(3, 0.5),
    b[cid] ~ dnorm(0, 0.2)
  ),
  data = dat_no_hawaii, chains = 4, log_lik = TRUE, refresh = 0
)
Comparing the two intercept-only models (without and with Hawaii):
precis(m11.9)
## mean sd 5.5% 94.5% n_eff Rhat4
## a 3.422521 0.06088708 3.326239 3.518251 777.2346 1.002727
precis(m11.9_orig)
## mean sd 5.5% 94.5% n_eff Rhat4
## a 3.542926 0.05192324 3.45975 3.623907 541.5774 1.006696
We get a slightly lower mean in the model without Hawaii. Remember that on the total_tools vs population plot, Hawaii was an extreme observation far to the right; removing it lowers the average total_tools.
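To put the two intercepts on the outcome scale, we can exponentiate the posterior means reported above:
exp(3.54) # with Hawaii: roughly 35 tools on average
exp(3.42) # without Hawaii: roughly 31 tools on average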
For the interaction model, let's compare using the posterior predictions.
library(dplyr, warn.conflicts = TRUE)
library(scales)
library(ggplot2)
library(tidyr)
# set up the horizontal axis values to compute predictions at
ns <- 100
P_seq <- seq(from = -1.4, to = 3, length.out = ns)

P_scale <- attributes(dat$P)$'scaled:scale'
P_center <- attributes(dat$P)$'scaled:center'
real_scale <- function(x) exp(x * P_scale + P_center)

# function to get mean predictions and credible interval
get_preds <- function(model){

  lambda <- link(model, data = data.frame(P = P_seq, cid = 1))
  lmu <- apply(lambda, 2, mean)
  lci <- apply(lambda, 2, PI)

  lambda2 <- link(model, data = data.frame(P = P_seq, cid = 2))
  lmu2 <- apply(lambda2, 2, mean)
  lci2 <- apply(lambda2, 2, PI)

  tibble(mu = c(lmu, lmu2),
         min = c(lci[1,], lci2[1,]),
         max = c(lci[2,], lci2[2,]),
         cid = rep(c(1,2), each = length(P_seq)),
         P = rep(real_scale(P_seq), 2))
}
orig_model_preds <- get_preds(m11.10_orig)
new_preds <- get_preds(m11.10)

all_preds <- bind_rows(orig_model_preds %>% mutate(id = 'With Hawaii'),
                       new_preds %>% mutate(id = 'Without Hawaii')) %>%
  mutate(cid = factor(cid, labels = c("Low", "High")))

# Original point data
d2 <- mutate(d,
             cid = factor(contact_id, labels = c("Low", "High")),
             P_real = real_scale(P))
ggplot(all_preds, aes(x = P, y = mu)) +
  geom_line(aes(linetype = id), size = 1) +
  geom_ribbon(aes(ymin = min, ymax = max, fill = id), alpha = 0.5) +
  facet_wrap(~cid) +
  geom_point(aes(P_real, total_tools),
             data = d2,
             colour = '#1e59ae') +
  scale_x_continuous(labels = comma,
                     limits = c( real_scale( c(-2, 2.4) ) )
  ) +
  labs(x = "population",
       y = "total tools",
       title = "posterior predictions of model with and without hawaii")
For high contact the posterior predictions are pretty much the same for the two models. For low contact however, removing Hawaii results in a much lower trend. In fact after removing Hawaii, we get a posterior where the High contact always trends above Low contact. This is more easily visible in the next graph.
ggplot(all_preds, aes(x = P, y = mu, fill = cid)) +
  geom_line(aes(linetype = cid), size = 1) +
  geom_ribbon(aes(ymin = min, ymax = max), alpha = 0.5) +
  facet_wrap(~id) +
  geom_point(aes(P_real, total_tools),
             data = d2,
             colour = '#1e59ae') +
  scale_x_continuous(labels = comma,
                     limits = c( real_scale( c(-2, 2.4) ) )
  ) +
  labs(x = "population",
       y = "total tools",
       title = "posterior predictions of model with and without hawaii",
       subtitle = "Without Hawaii, High always trends above Low contact")
Hard Questions
11H1
Use WAIC or PSIS to compare the chimpanzee model that includes a unique intercept for each actor, m11.4 (page 330), to the simpler models fit in the same section. Interpret the results.
Load and process data and build the models:
library(rethinking)
data(chimpanzees)
d <- chimpanzees
d$treatment <- 1 + d$prosoc_left + 2 * d$condition

dat_list <- list(
  pulled_left = d$pulled_left,
  actor = d$actor,
  treatment = as.integer(d$treatment)
)
Build the two versions of the model:
# simpler model
m11.3 <- ulam(
  alist(
    pulled_left ~ dbinom(1, p),
    logit(p) <- a + b[treatment],
    a ~ dnorm(0, 1.5),
    b[treatment] ~ dnorm(0, 0.5)
  ),
  data = dat_list, chains = 4, log_lik = TRUE, refresh = 0
)

# model with separate intercept for each actor
m11.4 <- ulam(
  alist(
    pulled_left ~ dbinom(1, p),
    logit(p) <- a[actor] + b[treatment],
    a[actor] ~ dnorm(0, 1.5),
    b[treatment] ~ dnorm(0, 0.5)
  ),
  data = dat_list, chains = 4, log_lik = TRUE, refresh = 0
)
Comparing the two models:
compare(m11.3, m11.4, func = PSIS)
## PSIS SE dPSIS dSE pPSIS weight
## m11.4 532.2432 18.941424 0.0000 NA 8.479181 1.000000e+00
## m11.3 682.5767 9.197657 150.3335 18.40111 3.677269 2.267241e-33
m11.4, the model with a unique intercept for each actor, seems to be much better than the simpler model. dPSIS is about 150 here, with dSE comparatively small at around 18, so the difference appears meaningful.
11H2
The data contained in library(MASS);data(eagles) are records of salmon pirating attempts by Bald Eagles in Washington State. See ?eagles for details. While one eagle feeds, sometimes another will swoop in and try to steal the salmon from it. Call the feeding eagle the “victim” and the thief the “pirate.” Use the available data to build a binomial GLM of successful pirating attempts.
- Consider the following model:
\[ \begin{align} y_i &\sim \operatorname{Binomial}(n_i, p_i) \\ logit(p_i) &= \alpha + \beta_P P_i + \beta_V V_i + \beta_A A_i \\ \alpha &\sim \operatorname{Normal}(0, 1.5) \\ \beta_P, \beta_V, \beta_A &\sim \operatorname{Normal}(0, 0.5) \end{align} \]
where y is the number of successful attempts, n is the total number of attempts, P is a dummy variable indicating whether or not the pirate had large body size, V is a dummy variable indicating whether or not the victim had large body size, and finally A is a dummy variable indicating whether or not the pirate was an adult. Fit the model above to the eagles data, using both quap and ulam. Is the quadratic approximation okay?
- Now interpret the estimates. If the quadratic approximation turned out okay, then it’s okay to use the quap estimates. Otherwise stick to ulam estimates. Then plot the posterior predictions. Compute and display both (1) the predicted probability of success and its 89% interval for each row(i) in the data, as well as (2) the predicted success count and its 89% interval. What different information does each type of posterior prediction provide?
- Now try to improve the model. Consider an interaction between the pirate’s size and age (immature or adult). Compare this model to the previous one, using WAIC. Interpret.
Load Data and process:
library(MASS)
data("eagles")
d <- eagles %>%
  mutate(across(c(P, V), ~ifelse(.x == "S", 1, 2)), # 1 small, 2 large
         A = ifelse(A == "I", 1, 2)) # 1 Immature, 2 Adult
d
## y n P A V
## 1 17 24 2 2 2
## 2 29 29 2 2 1
## 3 17 27 2 1 2
## 4 20 20 2 1 1
## 5 1 12 1 2 2
## 6 15 16 1 2 1
## 7 0 28 1 1 2
## 8 1 4 1 1 1
We have a case of aggregated binomial. Let’s build the model:
model_formula <- alist(
  y ~ dbinom(n, p),
  logit(p) <- a + bp*P + bv*V + ba*A,
  a ~ dnorm(0, 1.5),
  c(bp, bv, ba) ~ dnorm(0, 0.5)
)

eagle_quap <- quap(model_formula, data = d)
eagle_ulam <- ulam(model_formula, data = d, chains = 4, refresh = 0, log_lik = TRUE)
Looking at the estimates for the quap model:
precis(eagle_quap)
## mean sd 5.5% 94.5%
## a -0.1931781 0.7749944 -1.4317688 1.045413
## bp 1.6017391 0.2968893 1.1272526 2.076226
## bv -1.6985597 0.3057593 -2.1872221 -1.209897
## ba 0.6304949 0.2927997 0.1625444 1.098445
Looking at the estimates for the ulam model:
precis(eagle_ulam)
## mean sd 5.5% 94.5% n_eff Rhat4
## a -0.2093892 0.7674663 -1.4391592 1.004011 866.4751 1.003539
## ba 0.6447600 0.2923562 0.1868849 1.100524 950.4861 1.004989
## bv -1.7089720 0.3053599 -2.1930222 -1.232504 1083.9136 1.000751
## bp 1.6112984 0.2964011 1.1515090 2.076793 1312.5922 1.001936
Both models produce almost the same results:
plot(coeftab(eagle_quap, eagle_ulam))
From the estimates (converted to proportional odds just below), we get:
- if the "Pirate" bird (the one stealing) had a large body, that increased the odds of a successful steal
- a large "Victim" bird decreased the odds of a successful steal
- an adult "Pirate" had higher odds of a successful steal
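Since each predictor is coded 1/2 (small/large or immature/adult), a one-unit increase corresponds to switching categories, so exponentiating each coefficient gives the proportional change in the odds of a successful steal. A quick conversion of the quap posterior:
post <- extract.samples(eagle_quap)
precis(list(
  large_pirate_odds = exp(post$bp), # multiplies the odds of success
  large_victim_odds = exp(post$bv),
  adult_pirate_odds = exp(post$ba)
))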
library(patchwork)

posterior_plot <- function(model){
  preds <- link(model)
  sims <- sim(model)
  p_mean <- colMeans(preds)
  p_ci <- apply(preds, 2, PI)
  y_mean <- colMeans(sims)
  y_ci <- apply(sims, 2, PI)

  plot_data <- tibble(mu = p_mean, y_mu = y_mean,
                      mu_min = p_ci[1,], mu_max = p_ci[2,],
                      y_min = y_ci[1,], y_max = y_ci[2,]) %>%
    bind_cols(d) %>%
    mutate(id = 1:n(),
           pct = y / n)

  p1 <- ggplot(plot_data, aes(id)) +
    geom_point(aes(y = pct, colour = "Original"), size = 2) +
    geom_point(aes(y = mu, colour = "Predicted"), size = 2) +
    geom_linerange(aes(ymin = mu_min, ymax = mu_max)) +
    scale_x_continuous(breaks = 1:8) +
    labs(x = "Bird",
         y = "Proportion of success",
         title = "Posterior predictions (probability scale)",
         colour = "") +
    theme_classic()

  p2 <- ggplot(plot_data, aes(id)) +
    geom_point(aes(y = y, colour = "Original"), size = 2) +
    geom_point(aes(y = y_mu, colour = "Predicted"), size = 2) +
    geom_linerange(aes(ymin = y_min, ymax = y_max)) +
    scale_x_continuous(breaks = 1:8) +
    labs(x = "Bird",
         y = "Number of successes",
         title = "Posterior predictions (Counts)",
         colour = "") +
    theme_classic()

  list(p1, p2)
}

plots <- posterior_plot(eagle_quap)
plots[[1]] / plots[[2]]
For the predicted probabilities, the model does a good job predicting for birds 1 and 3, but over or underpredicts for all the other birds. On the count scale, most of the actual counts fall within the expected 89% credible interval.
The probability plot, however, allows us to compare the 8 cases in spite of the difference in the number of trials. On the counts graph, the number of trials as well as the binomial sampling process comes into the picture (instead of just p, we are looking at Binomial(N, p)), so there's an added amount of uncertainty.
Next, we will consider an interaction between the pirate's size and age. Since quap and ulam agreed closely with the provided priors, either would do; the interaction model below is fit with ulam.
# using the same approach as in the chimpanzees example to create the interaction term
d$AP <- 1 + (d$P-1) + 2 * (d$A - 1)
Thus we have, for AP =
- 1 - Pirate is young and small
- 2 - Pirate is young and big
- 3 - Pirate is adult and small
- 4 - Pirate is adult and big
eagle_int <- ulam(
  alist(
    y ~ dbinom(n, p),
    logit(p) <- a + bv*V + bi[AP],
    a ~ dnorm(0, 1.5),
    bv ~ dnorm(0, 0.5),
    # c(ba, bp) ~ dnorm(0, 0.5),
    bi[AP] ~ dnorm(0, 1.5)
  ),
  data = d, chains = 4, refresh = 0, log_lik = TRUE
)
compare(eagle_ulam, eagle_int, func = WAIC)
## WAIC SE dWAIC dSE pWAIC weight
## eagle_int 48.23106 7.852345 0.00000 NA 8.322561 0.996226649
## eagle_ulam 59.38308 11.945086 11.15202 14.814 8.388807 0.003773351
dSE (about 14.8) is larger than dWAIC (about 11.2), so we can't say one model is clearly better than the other. WAIC nevertheless puts most of the weight on eagle_int for predictions.
Let’s look at the posterior predictions of the two models:
plots2 <- posterior_plot(eagle_int)
{
  (plots[[1]] + geom_hline(yintercept = 0.5, lty = 2)) /
  (plots2[[1]] + geom_hline(yintercept = 0.5, lty = 2))
}
The plot on the top is from the earlier model; the one on the bottom is from the model with the interaction term. The predictions now match the observed proportions more closely.
11H3
The data contained in data(salamanders) are counts of salamanders (Plethodon elongatus) from 47 different 49m^2 plots in northern California. The column SALAMAN is the count in each plot, and the columns PCTCOVER and FORESTAGE are percent of ground cover and age of trees in the plot, respectively. You will model SALAMAN as a Poisson variable.
- Model the relationship between density and percent cover, using a log-link (same as the example in the book and lecture). Use weakly informative priors of your choosing. Check the quadratic approximation again, by comparing quap to ulam. Then plot the expected counts and their 89% interval against percent cover. In which ways does the model do a good job? A bad job?
- Can you improve the model by using the other predictor, FORESTAGE? Try any models you think useful. Can you explain why FORESTAGE helps or does not help with prediction?
data("salamanders")
d <- salamanders
head(d)
## SITE SALAMAN PCTCOVER FORESTAGE
## 1 1 13 85 316
## 2 2 11 86 88
## 3 3 11 90 548
## 4 4 9 88 64
## 5 5 8 89 43
## 6 6 7 83 368
d$G <- d$PCTCOVER/100
d$FR <- log1p(d$FORESTAGE)
We will make the following model:
\[ \begin{align} \text{Salamander} &\sim \operatorname{Poisson}(\lambda) \\ \log \lambda_i &= \alpha + \beta_G G \\ \alpha &\sim \text{TBD} \\ \beta_G &\sim \text{TBD} \end{align} \]
The number of salamanders in an area typically ranges from 0 to 100, so we can use a similar prior as in the text for alpha: Normal(3, 0.5). For beta, Normal(0, 1) seems fine.
set.seed(10)
N <- 100
a <- rnorm(N, 3, 0.5)
b <- rnorm(N, 0, 1)
plot(NULL, xlim = c(0, 1), ylim = c(0, 150), xlab = "Percent Ground Cover", ylab = "Salamander Count")
mtext("b ~ dnorm(0, 1)")
for(i in 1:N)
  curve( exp(a[i] + b[i] * x), add = TRUE, col = grau(), from = 0, to = 1)
dat_list <- list(salamander = d$SALAMAN,
                 site = d$SITE,
                 G = d$G,
                 FR = d$FR)

model_formula <- alist(
  salamander ~ dpois(lambda),
  log(lambda) <- a + bg*G,
  a ~ dnorm(3, 0.5),
  bg ~ dnorm(0, 1)
)
sal_quap <- quap(model_formula, data = dat_list)
sal_ulam <- ulam(model_formula, data = dat_list, chains = 4, refresh = 0, log_lik = TRUE, iter = 2000)
plot(coeftab(sal_quap, sal_ulam))
We get similar predictions from both.
# postcheck(sal_ulam, prob = 0.89, window = 50)
# helper function to get posterior predictions
get_posterior_preds <- function(model, actual){
  preds <- link(model)
  y <- sim(model)
  preds_mu <- colMeans(preds)
  preds_ci <- apply(preds, 2, PI)

  y_ci <- apply(y, 2, PI)

  tibble(mu = preds_mu,
         mu_min = preds_ci[1,],
         mu_max = preds_ci[2,],
         y_min = y_ci[1,],
         y_max = y_ci[2,],
         id = 1:length(preds_mu),
         actual = actual)
}

# helper function similar to postcheck from rethinking
posterior_check <- function(model, actual){
  tbl <- get_posterior_preds(model, actual)
  ggplot(tbl, aes(x = id)) +
    geom_point(aes(y = actual, colour = "Actual")) +
    geom_point(aes(y = mu, colour = "Posterior")) +
    geom_linerange(aes(ymin = mu_min, ymax = mu_max)) +
    geom_point(aes(y = y_min), shape = 3) +
    geom_point(aes(y = y_max), shape = 3) +
    theme(legend.position = "top") +
    labs(colour = "")
}
posterior_check(sal_ulam, d$SALAMAN)
Our posterior predictions don’t do very well on the actual data, with most counts outside even the posterior credible intervals.
We will now try to use FORESTAGE as a predictor.
\[ \begin{align} \text{Salamander} &\sim \operatorname{Poisson}(\lambda) \\ \log \lambda_i &= \alpha + \beta_G G + \beta_F F \\ \alpha &\sim \operatorname{Normal}(3, 0.5) \\ \beta_G &\sim \operatorname{Normal}(0, 1) \\ \beta_F &\sim \text{TBD} \end{align} \] We now need a prior for the age of trees as well. Since we log-transformed it above (log1p), we expect it to range mostly between 0 and 7 (ages of 0-1000 years). By trial and error we settle on a Normal(0, 0.2) prior for this new parameter, which keeps the implied salamander counts mostly flat and between 0 and 100.
set.seed(10)
N <- 100
a <- rnorm(N, 3, 0.5)
b <- rnorm(N, 0, 1)
f <- rnorm(N, 0, 0.2)
age_seq <- seq(0, 7, length.out = N) # already on the log1p scale of forest age
plot(NULL, xlim = c(0, 1), ylim = c(0, 150), xlab = "Percent Ground Cover", ylab = "Salamander Count")
for(i in 1:N)
  curve( exp(a[i] + b[i] * x + f[i] * age_seq[i]), add = TRUE, col = grau(), from = 0, to = 1)
Modelling:
model_formula2 <- alist(
  salamander ~ dpois(lambda),
  log(lambda) <- a + bg*G + bf*FR,
  a ~ dnorm(3, 0.5),
  bg ~ dnorm(0, 1),
  bf ~ dnorm(0, 0.2)
)
sal_quap2 <- quap(model_formula2, data = dat_list)
sal_ulam2 <- ulam(model_formula2, data = dat_list, chains = 4, refresh = 0, log_lik = TRUE, iter = 2000)
compare(sal_ulam, sal_ulam2, func = WAIC)
## WAIC SE dWAIC dSE pWAIC weight
## sal_ulam 229.0977 24.7828 0.000000 NA 4.472417 0.772482
## sal_ulam2 231.5424 27.2342 2.444758 5.870347 6.422898 0.227518
Comparing the two models, the difference between them isn't large and dSE is higher than dWAIC, so there is not much difference in terms of prediction.
plot(coeftab(sal_ulam, sal_ulam2))
The coefficient for forest age is very close to zero. The coefficient for ground cover seems to have increased by a similar amount.
Trying out a model with only forest age as predictor. We do another prior check and flatten the prior to allow it to take greater values:
model_formula3 <- alist(
  salamander ~ dpois(lambda),
  log(lambda) <- a + bf*FR,
  a ~ dnorm(3, 0.5),
  bf ~ dnorm(0, 1)
)
sal_quap3 <- quap(model_formula3, data = dat_list)
compare(sal_quap, sal_quap3, sal_quap2)
## WAIC SE dWAIC dSE pWAIC weight
## sal_quap 229.4508 24.57573 0.000000 NA 4.467638 7.129233e-01
## sal_quap2 231.2700 26.88736 1.819252 5.768513 6.023297 2.870762e-01
## sal_quap3 257.9831 27.58798 28.532320 9.631389 5.198938 4.542848e-07
This does significantly worse than the earlier two models, so forest age does not seem to help much with prediction. Another attempt without taking the log of forest age does not give better results either.
Trying an interaction after binning forest age into two ranges: 0-100 years and > 100 years.
dd <- dat_list
dd$FR <- ifelse(dat_list$FR < log1p(100), 1, 2)
model_formula4 <- alist(
  salamander ~ dpois(lambda),
  log(lambda) <- a + bg[FR]*G,
  a ~ dnorm(3, 0.5),
  bg[FR] ~ dnorm(0, 1)
)
sal_quap4 <- quap(model_formula4, data = dd)
compare(sal_quap, sal_quap2, sal_quap4)
## WAIC SE dWAIC dSE pWAIC weight
## sal_quap 229.8389 24.51082 0.000000 NA 4.422341 0.6268483
## sal_quap2 231.6997 26.73276 1.860770 5.750061 6.305364 0.2472301
## sal_quap4 233.0490 26.49547 3.210091 5.345795 7.793906 0.1259215
Not much better. It is possible that percent ground cover and age of trees are correlated and thus provide much the same information, which would explain why we don't see much effect of forest age after including both variables. A quick check of that correlation follows below.
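A quick check of that correlation, on both the raw and the transformed scales used in the models:
cor(d$PCTCOVER, d$FORESTAGE)
cor(d$G, d$FR)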
11H4
The data in data(NWOGrants) are outcomes for scientific funding applications for the Netherlands Organization for Scientific Research (NWO) from 2010–2012 (see van der Lee and Ellemers (2015) for data and context). These data have a very similar structure to the UCBAdmit data discussed in the chapter. I want you to consider a similar question: What are the total and indirect causal effects of gender on grant awards? Consider a mediation path (a pipe) through discipline. Draw the corresponding DAG and then use one or more binomial GLMs to answer the question. What is your causal interpretation? If NWO’s goal is to equalize rates of funding between men and women, what type of intervention would be most effective?
Loading in data:
library(rethinking)
data("NWOGrants")
d <- NWOGrants
d
## discipline gender applications awards
## 1 Chemical sciences m 83 22
## 2 Chemical sciences f 39 10
## 3 Physical sciences m 135 26
## 4 Physical sciences f 39 9
## 5 Physics m 67 18
## 6 Physics f 9 2
## 7 Humanities m 230 33
## 8 Humanities f 166 32
## 9 Technical sciences m 189 30
## 10 Technical sciences f 62 13
## 11 Interdisciplinary m 105 12
## 12 Interdisciplinary f 78 17
## 13 Earth/life sciences m 156 38
## 14 Earth/life sciences f 126 18
## 15 Social sciences m 425 65
## 16 Social sciences f 409 47
## 17 Medical sciences m 245 46
## 18 Medical sciences f 260 29
First, let us visualise the data:
d %>%
  mutate(pct_awarded = awards/applications,
         discipline = reorder(discipline, pct_awarded),
         gender = ifelse(gender == 'm', "Male", "Female")) %>%
  tidyr::pivot_longer(cols = c(applications, pct_awarded)) %>%
  ggplot(aes(y = discipline, x = value)) +
  geom_col(aes(fill = gender), position = "dodge") +
  # scale_x_continuous(labels = percent) +
  guides(fill = guide_legend(reverse = TRUE)) +
  facet_wrap(~name, scales = "free_x") +
  labs(x = "",
       title = "Distribution of Awards/Applications across various disciplines",
       fill = "Gender")
From the above graph we can see that there is a disparity in the number of applications across disciplines as well as in the percentage of applications that receive awards. So there is definitely an effect of discipline.
Let's consider a DAG where gender affects both the choice of discipline and the number of awards, and discipline affects the number of awards as well.
library(dagitty)
library(ggdag)
nwo_dag <- dagitty("dag{
  G -> D -> A
  G -> A
}")
coordinates(nwo_dag) <- list(
x = c(G = 0, A = 1, D = 0.5),
y = c(G = 0, A = 0, D = 1)
)
ggdag(nwo_dag) + theme_void()
Modelling:
d$D <- coerce_index(d$discipline)
d$G <- ifelse(d$gender == 'm', 1, 2)

dat_list <- list(applications = d$applications,
                 awards = d$awards,
                 D = d$D,
                 G = d$G)

nwo_mod <- ulam(
  alist(
    awards ~ dbinom(applications, p),
    logit(p) <- a[G] + b[D],
    a[G] ~ dnorm(0, 1.5),
    b[D] ~ dnorm(0, 1.5)
  ),
  data = dat_list, chains = 4, refresh = 0, log_lik = TRUE, iter = 4000
)
precis(nwo_mod, depth = 2, pars = "a")
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] -1.160646 0.4714619 -1.910919 -0.4147350 593.5910 1.005315
## a[2] -1.298497 0.4739362 -2.052660 -0.5456001 597.8852 1.005284
The intercept for females “a[2]” is less than that of males. Checking the contrast on the logit and outcome scale:
post <- extract.samples(nwo_mod)
diff_a <- post$a[,1] - post$a[,2]
diff_p <- inv_logit(post$a[,1]) - inv_logit(post$a[,2])
precis(list(logit = diff_a, outcome = diff_p))
## mean sd 5.5% 94.5% histogram
## logit 0.13785118 0.10775873 -0.033754734 0.31199162 ▁▁▂▅▇▃▁▁▁
## outcome 0.02382615 0.01981842 -0.005619127 0.05705356 ▁▁▂▇▇▃▁▁▁▁
We see only a minor difference, with women worse off by about 2 percentage points on average.
Let us look at the posterior for disciplines:
b <- as.data.frame( precis(nwo_mod, 2, pars = 'b') )
b$discipline <- as.character(levels(d$discipline))

b[order(b$mean, decreasing = TRUE),] %>%
  # select(-n_eff, Rhat4, discipline) %>%
  mutate(across(-discipline, round, 2))
## mean sd 5.5% 94.5% n_eff Rhat4 discipline
## b[1] 0.16 0.51 -0.66 0.97 680.62 1.00 Chemical sciences
## b[7] 0.13 0.52 -0.71 0.97 722.73 1.00 Physics
## b[2] -0.18 0.49 -0.96 0.60 628.74 1.00 Earth/life sciences
## b[6] -0.19 0.50 -1.01 0.59 659.09 1.00 Physical sciences
## b[9] -0.39 0.50 -1.18 0.41 643.34 1.01 Technical sciences
## b[3] -0.42 0.48 -1.18 0.35 627.57 1.01 Humanities
## b[4] -0.46 0.50 -1.25 0.35 674.08 1.00 Interdisciplinary
## b[5] -0.52 0.48 -1.29 0.25 627.04 1.01 Medical sciences
## b[8] -0.64 0.48 -1.39 0.13 608.99 1.01 Social sciences
The above is ordered in decreasing order of the award coefficient, and this lines up with our earlier EDA plot, where Chemical sciences had the highest award rate and Social sciences the lowest.
postcheck(nwo_mod)
The model’s predictions do tend to agree with the observed counts. It seems this is a very similar case to the UCB Admissions case, with disparity between the awards rate for different disciplines and a gender split in applications sent across disciplines.
Looking at the rates of applications sent to different disciplines by gender:
dplyr::select(d, gender, applications, discipline) %>%
  tidyr::pivot_wider(names_from = gender, values_from = applications) %>%
  janitor::adorn_percentages(denominator = "row") %>%
  janitor::adorn_pct_formatting(digits = 0)
## discipline m f
## Chemical sciences 68% 32%
## Physical sciences 78% 22%
## Physics 88% 12%
## Humanities 58% 42%
## Technical sciences 75% 25%
## Interdisciplinary 57% 43%
## Earth/life sciences 55% 45%
## Social sciences 51% 49%
## Medical sciences 49% 51%
So it seems women send more applications to disciplines such as Social sciences and Medical sciences, which have lower award rates, and much fewer applications to disciplines like Chemical and Physical sciences, which have much higher award rates.
For NWO's goal of equalizing rates of funding between men and women, the award rate across the disciplines could be made more uniform. Another approach would be to try to attract more female applicants to the Chemical/Physical sciences fields.
11H5
Suppose that the NWO Grants sample has an unobserved confound that influences both choice of discipline and the probability of an award. One example of such a confound could be the career stage of each applicant. Suppose that in some disciplines, junior scholars apply for most of the grants. In other disciplines, scholars from all career stages compete. As a result, career stage influences discipline as well as the probability of being awarded a grant. Add these influences to your DAG from the previous problem. What happens now when you condition on discipline? Does it provide an un-confounded estimate of the direct path from gender to an award? Why or why not? Justify your answer with the backdoor criterion. If you have trouble thinking this through, try simulating fake data, assuming your DAG is true. Then analyze it using the model from the previous problem. What do you conclude? Is it possible for gender to have a real direct causal influence but for a regression conditioning on both gender and discipline to suggest zero influence?
Let’s add career stage as an influence for both discipline and award.
nwo_dag <- dagitty("dag{
  G [exposure]
  A [outcome]
  C [unobserved]
  G -> D -> A
  G -> A
  D <- C -> A
}")
coordinates(nwo_dag) <- list(
x = c(G = 0, A = 1, C = 1, D = 0 ),
y = c(G = 0, A = 0, C = 1, D = 1)
)
# ggdag(nwo_dag) + theme_void()
drawdag(nwo_dag)
drawopenpaths(nwo_dag, col_arrow = "navyblue", lwd = 2)
Let’s consider the paths from G -> A (Gender to Awards):
- G -> A (direct path)
- G -> D -> A (pipe)
Expanding the second path above, once we condition on D the relevant non-causal path is:
- G -> D <- C -> A (D is a collider here, with the unobserved C pointing into both D and A)
G and C collide at both A and D. For the G -> D <- C collider, conditioning on D induces an association between G and C. C is unobserved, but since C influences A, this induced association propagates into a spurious association between G and A, even if there is none in reality. So we should not condition on D if we want the total effect of G on A. Confirming the same via dagitty:
adjustmentSets(nwo_dag, exposure = "G", outcome = "A")
## {}
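We can also ask dagitty to enumerate the paths between G and A, with and without conditioning on D (a sketch; the Z argument supplies the conditioning set):
# with no conditioning, only the two causal routes are open
paths(nwo_dag, from = "G", to = "A")
# conditioning on D opens the non-causal path G -> D <- C -> A
paths(nwo_dag, from = "G", to = "A", Z = "D")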
Simulating fake data to confirm above
- Assume 5k applications sent from different disciplines
- Different number of applications per discipline
- Different disciplines have different number of female/male applicants
- Career stage is simplified as senior vs junior scholars
- Different disciplines have different number of junior/senior applicants
- the number of grants awarded is then decided using the DAG above, such that it is influenced by both gender and career stage; specifically, women and junior scholars are awarded fewer grants
Two models are then created, one that conditions on discipline and one that does not. We confirm our reasoning above: the model that does not condition on discipline correctly shows the larger disparity in grant rates between men and women, while the former shows a much smaller effect of gender.
set.seed(116)
N <- 5000 # assume 5000 applications sent in total

disciplines <- levels(d$discipline)

# divide the 5000 applications among the 9 disciplines
discipline <- sample(disciplines, N, replace = TRUE, prob = runif(9, 0.2, 0.8))

# divide the 5000 applications among female/male and junior/senior
female <- sample(c(0, 1), N, replace = TRUE, prob = runif(2, 0.3, 0.7))
junior <- sample(c(0, 1), N, replace = TRUE, prob = runif(2, 0.5, 0.7))

# general rate of grants awarded as a proportion of applications sent
general_grant_rate <- setNames(runif(9, 0.1, 0.4), disciplines)
junior_penalty <- 0.5 # penalty to junior scholars in rate of grant
female_penalty <- 0.6 # penalty to females

# function to compute rate of grants awarded given discipline, career stage and gender
grant_rate <- function(discipline, junior, female){
  general_grant_rate[discipline] * (junior_penalty)^junior * (female_penalty)^female
}

# combine all of the above into a single data frame
# and then create an aggregated data set
test <- tibble(discipline, female, junior) %>%
  count(discipline, female, junior, name = "applications") %>%
  mutate(award = round(applications * grant_rate(discipline, junior, female)),
         D = coerce_index(factor(discipline)),
         G = ifelse(female, 2, 1),
         S = ifelse(junior, 2, 1),
         pct = award / applications)

# simplified data set
dat_list <- with(test, list(applications = applications,
                            award = award,
                            D = D,
                            G = G,
                            S = S))
# model that conditions on D
model1 <- ulam(
  alist(
    award ~ dbinom(applications, p),
    logit(p) <- a[G] + b[D],
    a[G] ~ dnorm(0, 1.5),
    b[D] ~ dnorm(0, 1.5)
  ),
  data = dat_list, chains = 4, iter = 4000, refresh = 0
)

# model that does not condition on D
model2 <- ulam(
  alist(
    award ~ dbinom(applications, p),
    logit(p) <- a[G],
    a[G] ~ dnorm(0, 1.5)
  ),
  data = dat_list, chains = 4, iter = 4000, refresh = 0
)
diff_outcome_scale <- function(model) {
  post <- extract.samples(model)
  inv_logit(post$a[,1]) - inv_logit(post$a[,2])
}
diff_p_condition_d <- diff_outcome_scale(model1)
diff_p_no_condition <- diff_outcome_scale(model2)
precis(list(model1 = diff_p_condition_d, model2 = diff_p_no_condition))
## mean sd 5.5% 94.5% histogram
## model1 0.08521976 0.02620310 0.04669767 0.1300493 ▁▃▇▇▅▂▁▁▁
## model2 0.07181296 0.01000651 0.05554954 0.0877577 ▁▁▂▇▇▃▁▁▁
If we just look at the means, the difference between the two models does not seem extreme. However, if we look at their credible intervals, the model that conditions on D has a much wider interval for the gender difference, whereas the model that does not condition on D has a narrower interval, lending more credibility to the idea that there is a gender bias.
There could be issues with the simulation above that I have not yet realised; fixing them might also produce a clearer difference between the two models in the mean gender contrast.
11H6
The data in data(Primates301) are 301 primate species and associated measures. In this problem, you will consider how brain size is associated with social learning. There are three parts.
- Model the number of observations of social_learning for each species as a function of the log brain size. Use a Poisson distribution for the social_learning outcome variable. Interpret the resulting posterior.
- Some species are studied much more than others. So the number of reported instances of social_learning could be a product of research effort. Use the research_effort variable, specifically its logarithm, as an additional predictor variable. Interpret the coefficient for log research_effort. How does this model differ from the previous one?
- Draw a DAG to represent how you think the variables social_learning, brain, and research_effort interact. Justify the DAG with the measured associations in the two models above (and any other models you used).
data("Primates301")
d <- Primates301
head(d[,c("species", "social_learning", "brain", "research_effort")])
## species social_learning brain research_effort
## 1 nigroviridis 0 58.02 6
## 2 trichotis 0 NA 6
## 3 belzebul 0 52.84 15
## 4 caraya 0 52.63 45
## 5 guariba 0 51.70 37
## 6 palliata 3 49.88 79
There are some NAs in the data, so let’s filter them out.
d <- Primates301 %>%
  dplyr::select(species, social_learning, brain, research_effort, spp_id) %>%
  na.omit() %>%
  droplevels()
dim(d)
## [1] 150 5
Out of 301 rows, we now have 150 left.
dat_list <- list(species = coerce_index(d$species),
                 sl = d$social_learning,
                 brain = log(d$brain),
                 research = log(d$research_effort))
Model will be:
\[ \begin{align} \text{social\_learning} &\sim \operatorname{Poisson}(\lambda) \\ \log \lambda_i &= \alpha + \beta \log x_i \\ \alpha &\sim \operatorname{Normal}(3, 0.5) \\ \beta &\sim \operatorname{Normal}(0, 0.2) \end{align} \] where \(x\) is the brain size.
Priors: the prior for alpha can be the same as before; we expect counts to range between 0 and 100, with probably some extreme values. For beta, we imagine that brain volume for primates (from the smallest species up to gorillas) ranges roughly over the [5-1000] cc range.
N <- 100
a <- rnorm(N, 3, 0.5)
b <- rnorm(N, 0, 0.2)
plot(NULL, xlim = c(2, 7), ylim = c(0, 200), xlab = "log brain size", ylab = "social learning")
mtext("b ~ dnorm(0, 0.2)")
for(i in 1:N)
  curve( exp(a[i] + b[i] * x), add = TRUE, col = grau(), from = log(5), to = log(1000))
We settle on Normal(0, 0.2).
Modelling:
prim_mod1 <- ulam(
  alist(
    sl ~ dpois(lambda),
    log(lambda) <- a + b * brain,
    a ~ dnorm(3, 0.5),
    b ~ dnorm(0, 0.2)
  ),
  data = dat_list, chains = 4, iter = 4000, log_lik = TRUE, refresh = 0
)
precis(prim_mod1)
## mean sd 5.5% 94.5% n_eff Rhat4
## a -5.694815 0.24289635 -6.097134 -5.308912 1239.152 1.006466
## b 1.564574 0.04834444 1.487698 1.644743 1271.018 1.005911
The parameter estimate for brain size is around 1.57. Taking the exponent, we get about 4.81, so we expect lambda to be multiplied by this factor for a one-unit increase in log brain size. Or rather, looking at the equation:
\[ \begin{align} \log \lambda &= \alpha + 1.57 \log B \\ &\text{Taking the exponent of each side:} \\ \lambda &= \exp(-5.7)\, B^{1.57} \\ \lambda &= 0.0033\, B^{1.57} \end{align} \] Thus the rate of social learning is a power function of brain size, with exponent 1.57.
Plotting the posterior for a better look at the model
post <- extract.samples(prim_mod1)
brain_seq <- seq(from = min(dat_list$brain), to = max(dat_list$brain), length.out = 100)

lambda <- link(prim_mod1, data = list(brain = brain_seq))
lmu <- colMeans(lambda)
lci <- apply(lambda, 2, PI)

simy <- sim(prim_mod1, data = list(brain = brain_seq))
simy_ci <- apply(simy, 2, PI)

plot(exp(dat_list$brain), dat_list$sl, pch = 16, xlab = "brain size (grams)", ylab = "social learning instances")
lines(exp(brain_seq), lmu)
shade(simy_ci, exp(brain_seq), col = col.alpha('lightblue', alpha = 0.7))
shade(lci, exp(brain_seq))

pareto_k <- PSIS(prim_mod1, pointwise = TRUE)$k
idx <- which(pareto_k > 0.5)
points(exp(dat_list$brain[idx]), dat_list$sl[idx], pch = 6, cex = 1.5, col = "firebrick", lwd = 2)
The relationship follows the increasing power curve derived above. The 5 points with high Pareto k values (> 0.5) have also been highlighted.
Part (b)
Modelling with research_effort added. We use the same prior for both betas.
prim_mod2 <- ulam(
  alist(
    sl ~ dpois(lambda),
    log(lambda) <- a + bb*brain + br*research,
    a ~ dnorm(3, 0.5),
    c(bb, br) ~ dnorm(0, 0.2)
  ),
  data = dat_list, chains = 4, iter = 4000, log_lik = TRUE, refresh = 0
)
compare(prim_mod1, prim_mod2, func = WAIC)
## WAIC SE dWAIC dSE pWAIC weight
## prim_mod2 663.8711 170.8462 0.0000 NA 66.44911 1.000000e+00
## prim_mod1 1639.3996 655.6354 975.5285 535.7239 166.81626 1.467857e-212
The new model seems to be much better from a predictive point of view; however, dSE is greater than half of dWAIC, suggesting that the large difference is not as decisive as it looks.
Looking at its coefficients:
precis(prim_mod2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a -5.2724270 0.20360903 -5.5994234 -4.9523954 3126.197 1.000068
## br 1.3082097 0.05050304 1.2266821 1.3889137 2422.261 1.001420
## bb 0.2341085 0.05051157 0.1549084 0.3173637 2934.845 1.001139
The coefficient for brain size has shrunk a lot, to around 0.23. The coefficient for research effort is about 1.31, implying an exp(1.31) ≈ 3.7-fold increase in social learning for each unit increase in log research effort. The coefficients are interpreted in the same way as before.
Part (c)
primate_dag <- dagitty("dag{
  SL [outcome]
  RE [exposure]
  BR [exposure]
  SL <- BR
  SL <- RE
}")
ggdag(primate_dag) + theme_void()
We expect brain size to affect social learning: with larger brain sizes in primates, we would expect more displays of social learning. We postulate that research effort also affects the observed social learning counts. Since we have not studied all species equally, species that have not been studied extensively would likely show fewer recorded instances of social learning than similarly brained species that have been studied more.