Statistical Rethinking Chapter 14
This chapter further discusses multilevel models and how we can extend the idea of pooling from intercepts to slopes. This also improves the estimates of the intercepts, by borrowing information across parameter types.
This is the essence of the general varying effects strategy: Any batch of parameters with exchangeable index values can and probably should be pooled. Exchangeable just means the index values have no true ordering, because they are arbitrary labels. There’s nothing special about intercepts; slopes can also vary by unit in the data, and pooling information among them makes better use of the data.
There might be covariation between the intercepts and slopes, which lets us squeeze even more information out of the data.
Varying slopes by construction
To pool information across intercepts and slopes we need to model their covariance. In a conventional multilevel model, this is done via a joint multivariate Gaussian distribution for all the varying effects (intercepts and slopes).
For a fitted model, vcov returns the variance-covariance matrix of the parameters' posterior distribution. We will simulate the scenario of a robot visiting various coffee shops to understand this. Thus we go from separate Gaussian distributions for intercepts and slopes to a two-dimensional Gaussian distribution for intercepts (first dimension) and slopes (second dimension).
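To make this concrete, here is a minimal sketch (a hypothetical quap fit on simulated data, not part of the chapter) showing what vcov returns for a fitted model:

library(rethinking)
set.seed(1)
y <- rnorm(50, mean = 2, sd = 1)  # simulated wait times
m_demo <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu ~ dnorm(0, 10),
    sigma ~ dexp(1)
  ), data = list(y = y), start = list(mu = 1, sigma = 1)
)
vcov(m_demo)           # posterior variance-covariance matrix of (mu, sigma)
cov2cor(vcov(m_demo))  # the corresponding posterior correlation matrix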
Simulate the population
Premise: a robot visits coffee shops, in the morning or afternoon, and there's a wait time before getting coffee. Different cafés have different intercepts reflecting how busy they are (busier means a longer wait). Assume mornings are busier. Moreover, different cafés might have a different relationship between morning and afternoon wait times (varying slopes).
To define the population of cafés, we need to define the average wait times in morning and afternoon as well as the correlation between them.
a <- 3.5       # average morning wait time
b <- (-1)      # average difference afternoon wait time
sigma_a <- 1   # std dev in intercepts
sigma_b <- 0.5 # std dev in slopes
rho <- (-0.7)  # correlation between intercepts and slopes
To simulate sample cafés, we need to build a 2D multivariate Gaussian distribution. This requires a vector of 2 means and a 2x2 matrix of variances and covariances.
For the mean:
Mu <- c(a, b)
a is the mean intercept (the average morning wait time), while b is the mean slope (the average difference between afternoon and morning wait time).
The matrix of variances is arranged like this:
\[ \begin{pmatrix} \text{variance of intercepts} & \text{covariance of intercepts and slopes} \\ \text{covariance of intercepts and slopes} & \text{variance of slopes} \end{pmatrix} \]
In mathematical form:
\[ \begin{pmatrix} \sigma_\alpha^2 & \sigma_\alpha \sigma_\beta \rho \\ \sigma_\alpha \sigma_\beta \rho & \sigma_\beta^2 \end{pmatrix} \]
- The variance in intercepts is \(\sigma_\alpha^2\) while the variance in slopes is \(\sigma_\beta^2\) (these are found along the diagonal of the matrix)
- the other two entries are the same: the covariance between intercepts and slopes (the product of their standard deviations and the correlation), \(\sigma_\alpha \sigma_\beta \rho\)
There are several options to build this matrix.
# using matrix command directly
cov_ab <- sigma_a * sigma_b * rho
Sigma <- matrix(c(sigma_a^2, cov_ab, cov_ab, sigma_b^2), ncol = 2)
The other common way to build the covariance matrix treats the std dev and correlations separately and then matrix multiplies them to produce the covariance matrix.
sigmas <- c(sigma_a, sigma_b) # std devs
Rho <- matrix(c(1, rho, rho, 1), nrow = 2) # correlation matrix
# now matrix multiply to get covariance matrix
Sigma <- diag(sigmas) %*% Rho %*% diag(sigmas)
We are now ready to simulate some cafes, each with its own intercept and slope.
N_cafes <- 20
To simulate, we just sample randomly from the multivariate Gaussian distribution defined by Mu and Sigma:
library(MASS)
set.seed(5)
vary_effects <- mvrnorm(N_cafes, Mu, Sigma)
vary_effects
## [,1] [,2]
## [1,] 4.223962 -1.6093565
## [2,] 2.010498 -0.7517704
## [3,] 4.565811 -1.9482646
## [4,] 3.343635 -1.1926539
## [5,] 1.700971 -0.5855618
## [6,] 4.134373 -1.1444539
## [7,] 3.794469 -1.6264661
## [8,] 3.946598 -1.7152794
## [9,] 3.864267 -0.9071677
## [10,] 3.467614 -0.6804054
## [11,] 2.242875 -0.6181516
## [12,] 4.159506 -1.6592120
## [13,] 4.300283 -2.1125474
## [14,] 3.506948 -1.4406430
## [15,] 4.382086 -1.8798983
## [16,] 3.521133 -1.3506986
## [17,] 4.216713 -0.9192799
## [18,] 5.913003 -1.2313624
## [19,] 3.477306 -0.3570341
## [20,] 3.774899 -1.0570457
Each row is a café, with the first column defining its intercept and the second its slope.
a_cafe <- vary_effects[,1]
b_cafe <- vary_effects[,2]
Visualising the intercepts and slopes along with the distribution’s contours:
library(rethinking)
plot(a_cafe, b_cafe, col = "royalblue4",
xlab = "intercepts (a_cafe) - (Average morning wait)", ylab = "slopes (b_cafe) - (Avg diff between afternoon and morning wait time)")
# overlay population distribution
library(ellipse)
levels <- c(0.1, 0.3, 0.5, 0.8, 0.99)
for(l in seq_along(levels)){
  # lines(ellipse(Sigma, centre = Mu, level = l), col = col.alpha('black', 0.5))
  lines(ellipse(Sigma, centre = Mu, level = levels[l]), col = l)
}
The contour lines show the multivariate Gaussian population of intercepts and slopes for the 20 cafés sampled.
Simulate Observations
We simulated individual cafés and their avg properties. Now we need to simulate the robot visiting them and collecting data. Below we simulate 10 visits to each café, 5 in the morning and 5 in the afternoon. The robot records the wait time during each visit.
set.seed(22)
N_visits <- 10
afternoon <- rep(0:1, N_visits * N_cafes/2)
cafe_id <- rep(1:N_cafes, each = N_visits)
mu <- a_cafe[cafe_id] + b_cafe[cafe_id] * afternoon
sigma <- 0.5 # std dev within cafes
wait <- rnorm(N_visits * N_cafes, mu, sigma)
d <- data.frame(cafe = cafe_id, afternoon = afternoon, wait = wait)
head(d)
## cafe afternoon wait
## 1 1 0 3.967893
## 2 1 1 3.857198
## 3 1 0 4.727875
## 4 1 1 2.761013
## 5 1 0 4.119483
## 6 1 1 3.543652
In this case the data are balanced: each café was observed exactly 10 times, and the time of day is balanced as well. In general, though, the data do not need to be balanced.
The varying slopes model
We will use the simulated data and create a model to learn about the data generating process. The model will be similar to the varying intercepts model, but now we will have a joint population of intercepts and slopes instead of just a distribution of varying intercepts.
The probability of the data and the linear model:
\[\begin{align} W_i &\sim \text{Normal}(\mu_i, \sigma) & \text{[likelihood]} \\ \mu_i &= \alpha_{ \text{CAFÉ[i]} } + \beta_{CAFÉ[i]}A_i & \text{[linear model]} \end{align}\]
Then comes the matrix of varying intercepts and slopes with its covariance matrix:
\[\begin{align} \begin{bmatrix} \alpha_{ \text{CAFÉ[i]} } \\ \beta_{CAFÉ[i]} \end{bmatrix} &\sim \text{MVNormal} \left( \begin{bmatrix} \alpha \\ \beta \end{bmatrix}, S \right) & \text{[population of varying effects]} \\ S &= \begin{pmatrix} \sigma_\alpha & 0 \\ 0 & \sigma_\beta \end{pmatrix} R \begin{pmatrix} \sigma_\alpha & 0 \\ 0 & \sigma_\beta \end{pmatrix} & \text{[construct covariance matrix]} \end{align}\]
These lines state that each café has an intercept \(\alpha_{\text{CAFÉ[i]}}\) and slope \(\beta_{\text{CAFÉ[i]}}\) with a prior distribution defined by the 2d Gaussian distribution with means \(\alpha\) and \(\beta\) and covariance matrix S. This statement of prior will adaptively regularise the individual intercepts, slopes and the correlation among them.
The second line defines the construction of the covariance matrix, by factoring it into separate standard deviations, \(\sigma_\alpha\) and \(\sigma_\beta\) and a correlation matrix R.
Finally we have the hyper-priors that define the adaptive varying effects priors:
\[ \begin{align} \alpha &\sim \text{Normal}(5, 2) &\text{[prior for average intercept]} \\ \beta &\sim \text{Normal}(-1, 0.5) & \text{[prior for average slope]} \\ \sigma &\sim \text{Exponential}(1) & \text{[prior stddev within cafés]} \\ \sigma_\alpha &\sim \text{Exponential}(1) & \text{[prior stddev among intercepts]} \\ \sigma_\beta &\sim \text{Exponential}(1) & \text{[prior stddev among slopes]} \\ R &\sim \text{LKJcorr}(2) & \text{[prior for correlation matrix]} \end{align} \]
The final line defines the prior for the correlation matrix. It is not easy to conceptualise what a distribution of matrices means, but in this simple case we have a correlation matrix of 2x2 size. So it looks like this:
\[ R = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \]
where \(\rho\) is the correlation between intercepts and slopes. So we only need to define a prior for a single parameter. In larger matrices, this process is more complicated.
LKJcorr(2) defines a weakly informative prior on \(\rho\) that is sceptical of values near the extremes (-1 and 1). This distribution has a single parameter \(\eta\) that controls how sceptical the prior is of large correlations in the matrix. LKJcorr(1) defines a flat prior over all valid correlation matrices, while a value greater than 1 makes extreme correlations less likely.
Let’s sample some random matrices from it and visualise:
set.seed(123)
# eta = 2
R <- rlkjcorr(1e4, K = 2, eta = 2)
dens(R[,1,2], xlab = "correlation", ylim = c(0, 1.1))
text(0.2, y = 0.78, "eta = 2")
# eta = 1
R <- rlkjcorr(1e4, K = 2, eta = 1)
dens(R[,1,2], xlab = "correlation", add = TRUE, col = 'red')
text(0, y = 0.55, "eta = 1", col = 'red')
# eta = 4
R <- rlkjcorr(1e4, K = 2, eta = 4)
dens(R[,1,2], xlab = "correlation", add = TRUE, col = 'royalblue4')
text(0, y = 0.95, "eta = 4", col = 'royalblue4')
We finally fit a model to the above specification:
set.seed(867530)
m14.1 <- ulam(
  alist(
    wait ~ normal(mu, sigma),
    mu <- a_cafe[cafe] + b_cafe[cafe]*afternoon,
    c(a_cafe, b_cafe)[cafe] ~ multi_normal(c(a,b), Rho, sigma_cafe),
    a ~ normal(5, 2),
    b ~ normal(-1, 0.5),
    sigma_cafe ~ exponential(1),
    sigma ~ exponential(1),
    Rho ~ lkj_corr(2)
  ), data = d, chains = 4, cores = 4
)
multi_normal is a Stan function defining a multivariate Gaussian distribution; it takes a vector of means, a correlation matrix Rho, and a vector of standard deviations sigma_cafe, and constructs the covariance matrix internally. The corresponding R functions are dmvnorm and dmvnorm2.
precis(m14.1)
## mean sd 5.5% 94.5% n_eff Rhat4
## a 3.6521552 0.22366324 3.2989676 4.0164863 2326.240 1.000260
## b -1.1329358 0.14949313 -1.3671926 -0.9001166 2273.210 1.002161
## sigma 0.4739909 0.02641391 0.4339268 0.5182957 1822.516 0.999832
We get results that are very close to the actual simulation parameters.
Posterior correlations
Let’s inspect the posterior distributions of the varying effects. First, we will examine the posterior correlation between intercepts and slopes.
post <- extract.samples(m14.1)
dens(post$Rho[,1,2], xlim = c(-1, 1),
     col = 'royalblue4', lwd = 2, xlab = "correlation") # posterior
text(-0.25, 2, "posterior", col = 'royalblue4', font = 2, cex = 1.3)
R <- rlkjcorr(1e4, K = 2, eta = 2) # prior
dens(R[,1, 2], add = TRUE, lty = 2, lwd = 2)
text(0.25, 0.6, "prior", font = 2, cex = 1.3)
The posterior seems to have correctly learned the negative correlation between intercepts and slopes.
Redoing the model with a flatter prior and with a more informative prior for the correlation matrix:
m14.1_flat <- ulam(
  alist(
    wait ~ normal(mu, sigma),
    mu <- a_cafe[cafe] + b_cafe[cafe]*afternoon,
    c(a_cafe, b_cafe)[cafe] ~ multi_normal(c(a,b), Rho, sigma_cafe),
    a ~ normal(5, 2),
    b ~ normal(-1, 0.5),
    sigma_cafe ~ exponential(1),
    sigma ~ exponential(1),
    Rho ~ lkj_corr(1)
  ), data = d, chains = 4, cores = 4
)

m14.1_info <- ulam(
  alist(
    wait ~ normal(mu, sigma),
    mu <- a_cafe[cafe] + b_cafe[cafe]*afternoon,
    c(a_cafe, b_cafe)[cafe] ~ multi_normal(c(a,b), Rho, sigma_cafe),
    a ~ normal(5, 2),
    b ~ normal(-1, 0.5),
    sigma_cafe ~ exponential(1),
    sigma ~ exponential(1),
    Rho ~ lkj_corr(4)
  ), data = d, chains = 4, cores = 4
)
Comparing the posterior and prior for these new models:
post_flat <- extract.samples(m14.1_flat)
post_info <- extract.samples(m14.1_info)

withr::with_par(
  list(mfrow = c(2, 1)),
  code = {
    # for model with flatter prior for Rho
    dens(post_flat$Rho[,1,2], xlim = c(-1, 1),
         col = 'royalblue4', lwd = 2, xlab = "correlation") # posterior
    mtext("Flatter prior with eta = 1")
    text(-0.25, 2, "posterior", col = 'royalblue4', font = 2, cex = 1.3)
    R <- rlkjcorr(1e4, K = 2, eta = 1) # prior
    dens(R[,1, 2], add = TRUE, lty = 2, lwd = 2)
    text(0.25, 0.6, "prior", font = 2, cex = 1.3)

    # for model with more informative prior
    dens(post_info$Rho[,1,2], xlim = c(-1, 1),
         col = 'royalblue4', lwd = 2, xlab = "correlation") # posterior
    text(-0.25, 2, "posterior", col = 'royalblue4', font = 2, cex = 1.3)
    mtext("Informative prior with eta = 4")
    R <- rlkjcorr(1e4, K = 2, eta = 4) # prior
    dens(R[,1, 2], add = TRUE, lty = 2, lwd = 2)
    text(0.25, 0.6, "prior", font = 2, cex = 1.3)
  }
)
We get a similar posterior with both priors.
Posterior shrinkage
The multilevel model estimated the posterior distributions for intercepts and slopes of each café. The inferred correlation between them was used to pool info across them. This is similar to how the inferred variation among intercepts pools info among them and how the inferred variation between slopes is used to pool info for the individual slopes. The variances and correlations defined a Multivariate Gaussian prior which learned from the data to adaptively regularise both the intercepts and slopes.
We will plot the posterior mean varying effects and also overlay the raw unpooled estimates for comparison, as well as the contours of the inferred prior to help visualise the shrinkage.
#compute unpooled estimates directly from data
a1 <- sapply(1:N_cafes, function(i) mean(wait[cafe_id == i & afternoon == 0]))
b1 <- sapply(1:N_cafes, function(i) mean(wait[cafe_id == i & afternoon == 1])) - a1

# extract posterior means of partially pooled estimates
post <- extract.samples(m14.1)
a2 <- colMeans(post$a_cafe)
b2 <- colMeans(post$b_cafe)
# plot both and connect with lines
plot(a1, b1, xlab = "intercept", ylab = "slope",
pch = 16, col = rangi2, ylim = c(min(b1) - 0.1, max(b1) + 0.1),
xlim = c(min(a1) - 0.1, max(a1) + 0.1))
points(a2, b2, pch = 1)
for(i in 1:N_cafes)
lines( c(a1[i], a2[i]), c(b1[i], b2[i]) )
# superimposing contours of the population
# compute posterior mean bivariate Gaussian
Mu_est <- c(mean(post$a), mean(post$b))
rho_est <- mean(post$Rho[,1,2])
sa_est <- mean(post$sigma_cafe[,1])
sb_est <- mean(post$sigma_cafe[,2])
cov_ab <- sa_est*sb_est*rho_est
Sigma_est <- matrix(c(sa_est^2, cov_ab, cov_ab, sb_est^2), ncol = 2)
# draw contours
library(ellipse)
for(l in c(0.1, 0.3, 0.5, 0.8, 0.99))
lines(ellipse(Sigma_est, centre = Mu_est, level = l), col = col.alpha('black', 0.2))
- blue points are the unpooled estimates for each café
- open points are the posterior means from the model
- lines connect points belonging to same café
- each open point is displaced from its blue point towards the centre of the contours, as a result of shrinkage in both dimensions. Blue points farther from the centre experience more shrinkage, since they are less plausible given the inferred population.
The shrinkage does not follow straight lines towards the centre. Take, for instance, the point at the top middle: it had an average intercept (~3.5), lying in the middle along the horizontal axis, but an unusually high slope. Pooling information from the other cafés produces scepticism about that slope, and reducing the slope also reduces the intercept, because of the correlation between them. The angled shrinkage lines reflect this negative correlation between intercepts and slopes.
We next present this same information but on the outcome scale:
# convert varying effects to waiting times
wait_morning_1 <- (a1)
wait_afternoon_1 <- (a1 + b1)
wait_morning_2 <- (a2)
wait_afternoon_2 <- (a2 + b2)
# plot both and connect with lines
plot(wait_morning_1, wait_afternoon_1, xlab = "morning wait", ylab = "afternoon wait",
pch = 16, col = rangi2,
ylim = c(min(wait_afternoon_1) - 0.1, max(wait_afternoon_1) + 0.1),
xlim = c(min(wait_morning_1) - 0.1, max(wait_morning_1) + 0.1))
points(wait_morning_2, wait_afternoon_2, pch = 1)
for(i in 1:N_cafes)
lines(c(wait_morning_1[i], wait_morning_2[i]),
c(wait_afternoon_1[i], wait_afternoon_2[i]))
abline(a = 0, b = 1, lty = 2)
# For the contour we need variance and covariance. We can use some formula as
# well owing to simple relations among Gaussian random variables. But we will
# use simulation instead so we can see how to compute anything of interest
# now shrinkage distribution by simulation
v <- mvrnorm(1e4, Mu_est, Sigma_est)
v[,2] <- v[,1] + v[,2] # calculate afternoon wait
Sigma_est2 <- cov(v)
Mu_est2 <- Mu_est
Mu_est2[2] <- Mu_est[1] + Mu_est[2]
# draw contours
library(ellipse)
for(l in c(0.1, 0.3, 0.5, 0.8, 0.99))
lines(ellipse(Sigma_est2, centre = Mu_est2, level = l), col = col.alpha('black', 0.2))
- x-axis - expected morning wait time in minutes
- y-axis - expected afternoon wait
- blue points are unpooled raw estimates
- open points are posterior predictions using the pooled estimates
- diagonal dashed line shows when morning and afternoon wait time are equal
Shrinkage on the parameter scale naturally produces shrinkage on the outcome scale. The contours show the population of wait times and the population is now positively correlated, cafés with longer morning wait times also have longer afternoon wait times (they are popular). But the population lies mostly below the dashed line, implying on average wait time in afternoon is less than wait time in morning.
Advanced varying slopes
In this section we will see how to build models with more than two varying effects - varying intercepts plus more than one varying slope - as well as more than one type of cluster.
We will again look at the chimpanzee data, where there were two types of clusters: actors and blocks. We had two slopes there:
- one for the effect of pro social option (the side of the table with two pieces of food)
- one for the interaction between pro social option and the presence of another chimpanzee
We will model both types of clusters and place varying effects on the intercepts and both slopes. We will also use non-centred parametrisation.
Let’s look at the mathematical form of this cross-classified varying slopes model. The model has been broken into more than one linear model to make it easy to see the parts of intercepts and slopes. Here’s the likelihood and its linear model:
\[ \begin{align} L_i &\sim \text{Binomial}(1, p_i) \\ \text{logit}(p_i) &= \gamma_{\text{TID[i]}} + \alpha_{ \text{ ACTOR[i], TID[i] }} + \beta_{ \text{ BLOCK[i], TID[i] }} \end{align} \]
The model above contains:
- \(\gamma_{\text{TID[i]}}\) - average log-odds for each treatment
- \(\alpha_{ \text{ ACTOR[i], TID[i] }}\) - effect for each actor in each treatment
- \(\beta_{ \text{ BLOCK[i], TID[i] }}\) - effect for each block in each treatment
This is an interaction model that allows the effect of each treatment to vary by each actor and each block. The avg treatment (4) can thus vary by block (6) and each individual chimpanzee (7) can also respond (across blocks) to each treatment differently. Thus there are total [4 + 7x4 + 6x4] = 56 parameters.
The next part of the model builds the adaptive priors. Since there are two cluster types, actors and blocks, there are thus two multivariate Gaussian priors, and each is 4-dimensional since there are 4 treatments. The 2 priors:
\[\begin{align} \begin{bmatrix} \alpha_{j,1} \\ \alpha_{j,2} \\ \alpha_{j,3} \\ \alpha_{j,4} \end{bmatrix} &\sim \text{MVNormal} \left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ \end{bmatrix} , S_{ \text{ACTOR}} \right) \\ \begin{bmatrix} \beta_{j,1} \\ \beta_{j,2} \\ \beta_{j,3} \\ \beta_{j,4} \end{bmatrix} &\sim \text{MVNormal} \left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ \end{bmatrix} , S_{ \text{BLOCK}} \right) \end{align}\]
These state that actors and blocks come from 2 different statistical populations and within each the 4 features of each actor/block are related through a covariance matrix S specific to that population. There are no means in these priors, since we have already placed the average treatment effect \(\gamma\) in the linear model.
Here’s the ulam code for the above:
Centered parameterisation of the model
library(rethinking)
data("chimpanzees")
d <- chimpanzees
d$block_id <- d$block
d$treatment <- 1L + d$prosoc_left + 2L*d$condition
dat <- list(
  L = d$pulled_left,
  tid = d$treatment,
  actor = d$actor,
  block_id = as.integer(d$block_id)
)
set.seed(4387510)
m14.2 <- ulam(
  alist(
    L ~ dbinom(1, p),
    logit(p) <- g[tid] + alpha[actor, tid] + beta[block_id, tid],

    # adaptive priors
    vector[4]: alpha[actor] ~ multi_normal(0, Rho_actor, sigma_actor),
    vector[4]: beta[block_id] ~ multi_normal(0, Rho_block, sigma_block),

    # fixed priors
    g[tid] ~ dnorm(0, 1),
    sigma_actor ~ dexp(1),
    Rho_actor ~ dlkjcorr(4),
    sigma_block ~ dexp(1),
    Rho_block ~ dlkjcorr(4)
  ), data = dat, chains = 4, cores = 4, refresh = 0
)
## Warning: There were 68 divergent transitions after warmup. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
## Warning: There were 4 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: The largest R-hat is NA, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#r-hat
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
We get warnings about multiple “divergent transitions”. Let’s look at the trankplot.
trankplot(m14.2) # plot hidden due to its size
Non-centered parameterisation of the model
We saw in the previous chapter how re-parametrising a model can help. Let’s re-parametrise and see if it can solve the divergent transition problems. The goal is to factor all the parameters out of the adaptive priors and place them instead in the linear model. The only problematic parts are the covariance matrices.
We have the same strategy, just extended to matrices. We will again make z-scores for each varying effect, but this time they form a matrix of z-scores. Then we multiply those z-scores by the standard deviations and a Cholesky factor of the correlation matrix, so that we get back the varying effects on the right scale for the linear model. ulam has a function compose_noncentered for this.
set.seed(4387510)
m14.3 <- ulam(
  alist(
    L ~ binomial(1, p),
    logit(p) <- g[tid] + alpha[actor, tid] + beta[block_id, tid],

    # adaptive priors - non-centred
    transpars> matrix[actor,4]:alpha <- compose_noncentered(sigma_actor, L_Rho_actor, z_actor),
    transpars> matrix[block_id,4]:beta <- compose_noncentered(sigma_block, L_Rho_block, z_block),
    matrix[4, actor]: z_actor ~ normal(0, 1),
    matrix[4, block_id]: z_block ~ normal(0, 1),

    # fixed priors
    g[tid] ~ normal(0, 1),
    vector[4]: sigma_actor ~ dexp(1),
    cholesky_factor_corr[4]: L_Rho_actor ~ lkj_corr_cholesky(2),
    vector[4]: sigma_block ~ dexp(1),
    cholesky_factor_corr[4]: L_Rho_block ~ lkj_corr_cholesky(2),

    # compute ordinary correlation matrices from cholesky factors
    gq> matrix[4,4]:Rho_actor <<- Chol_to_Corr(L_Rho_actor),
    gq> matrix[4,4]:Rho_block <<- Chol_to_Corr(L_Rho_block)
  ), data = dat, chains = 4, cores = 4, log_lik = TRUE
)
There’s a lot of new stuff going on here, which we will discuss in the Overthinking section below. But we no longer get any divergent transitions.
The last two lines calculate the ordinary correlation matrices from the Cholesky factors. The gq tag tells Stan to do this calculation in its generated quantities block, only once per posterior sample.
Let’s compare the outputs of m14.2 and m14.3:
precis(m14.2, 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## g[1] 0.251119496 0.5240923 -0.589722676 1.0995014 374.17899 1.007284
## g[2] 0.615100604 0.4036047 -0.018195567 1.2488470 507.39178 1.001927
## g[3] 0.002871722 0.6048293 -0.964767368 1.0307544 421.91607 1.001647
## g[4] 0.635133098 0.5650633 -0.298483110 1.5343772 564.55238 1.001038
## sigma_actor[1] 1.439416868 0.5390114 0.788317074 2.4926716 247.20098 1.012765
## sigma_actor[2] 0.920588420 0.4064949 0.409574813 1.6509505 708.42121 1.006758
## sigma_actor[3] 1.848648307 0.5795607 1.141481755 2.9160061 590.24635 1.016448
## sigma_actor[4] 1.578518045 0.6346487 0.823363873 2.7720114 505.55033 1.003107
## sigma_block[1] 0.404538720 0.3192185 0.030015880 0.9933687 250.86418 1.011208
## sigma_block[2] 0.396218040 0.3706993 0.009988934 1.0180057 98.44694 1.038266
## sigma_block[3] 0.303780443 0.2845077 0.021647294 0.8134728 149.92421 1.022365
## sigma_block[4] 0.525033230 0.3911309 0.062454866 1.1914177 291.58784 1.001912
precis(m14.3, 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## g[1] 0.2433251 0.4852318 -0.514506189 0.9996260 777.7854 1.0099480
## g[2] 0.6349127 0.4003680 -0.003847579 1.2573623 1140.9881 0.9992423
## g[3] -0.0335479 0.5921058 -0.957479935 0.9373517 1191.3706 0.9997779
## g[4] 0.6477834 0.5553712 -0.219353437 1.5346800 1097.3323 0.9990984
## sigma_actor[1] 1.3941999 0.4926266 0.791166364 2.2388420 860.4268 0.9999113
## sigma_actor[2] 0.8930560 0.3917973 0.431389011 1.6075829 1284.8381 0.9997274
## sigma_actor[3] 1.8302110 0.5610336 1.127221304 2.8508855 1339.4654 0.9993560
## sigma_actor[4] 1.5795804 0.6009031 0.833857288 2.6164948 1347.9211 1.0014671
## sigma_block[1] 0.3972934 0.3091403 0.040625095 0.9471612 983.8448 1.0048769
## sigma_block[2] 0.4424893 0.3341946 0.039970995 1.0603355 815.1231 1.0001618
## sigma_block[3] 0.3050460 0.2732550 0.022120352 0.8332662 1755.5157 0.9999644
## sigma_block[4] 0.4731760 0.3769806 0.042506062 1.1500074 1210.6143 1.0027462
We get the same results from both models, but the n_eff values for the non-centred model are much higher. It also samples faster. Let’s create a scatterplot of the effective sample sizes from the two models:
# extract n_eff values for each model
n_eff_nc <- precis(m14.3, 3, pars = c("alpha", "beta"))$n_eff
n_eff_c <- precis(m14.2, 3, pars = c("alpha", "beta"))$n_eff
plot(n_eff_c, n_eff_nc, xlab = "Centered (default)", ylab = "non-centered (cholesky)", lwd = 1.5,
ylim = c(0, 2000), xlim = c(0, 2000))
abline(a = 0, b = 1, lty = 2)
The non-centred model clearly produces more effective samples per parameter. Practically, this means we can get similar results with fewer iterations.
The model has 76 parameters:
- 4 avg treatment effects
- 4x7 varying effects on actor
- 4x6 varying effects on block
- 8 standard deviations
- 12 free correlation parameters
However, the effective number of parameters is around 27:
WAIC(m14.3)
## WAIC lppd penalty std_err
## 1 544.8538 -245.4933 26.93358 19.78271
The two varying effects populations, for actors and blocks, regularise the varying effects, so each varying intercept and slope counts for less than one effective parameter on average.
The standard deviation parameters can give us a sense of how aggressively the varying effects are being regularised:
precis(m14.3, depth = 2, pars = c("sigma_actor", "sigma_block"))
## mean sd 5.5% 94.5% n_eff Rhat4
## sigma_actor[1] 1.3941999 0.4926266 0.79116636 2.2388420 860.4268 0.9999113
## sigma_actor[2] 0.8930560 0.3917973 0.43138901 1.6075829 1284.8381 0.9997274
## sigma_actor[3] 1.8302110 0.5610336 1.12722130 2.8508855 1339.4654 0.9993560
## sigma_actor[4] 1.5795804 0.6009031 0.83385729 2.6164948 1347.9211 1.0014671
## sigma_block[1] 0.3972934 0.3091403 0.04062509 0.9471612 983.8448 1.0048769
## sigma_block[2] 0.4424893 0.3341946 0.03997099 1.0603355 815.1231 1.0001618
## sigma_block[3] 0.3050460 0.2732550 0.02212035 0.8332662 1755.5157 0.9999644
## sigma_block[4] 0.4731760 0.3769806 0.04250606 1.1500074 1210.6143 1.0027462
These are just posterior means, and the amount of shrinkage averages over the entire posterior. However, we can still see from the small means that the regularisation is quite aggressive, especially for block.
The overfitting risk is much milder here compared to a model with only ordinary fixed effects.
Posterior Predictions
Let’s next check the posterior predictions against the average for each actor and each treatment.
# compute mean for each actor in each treatment
pl <- by(d$pulled_left, list(d$actor, d$treatment), mean)
# generate posterior predictions using link
datp <- list(
  actor = rep(1:7, each = 4),
  tid = rep(1:4, times = 7),
  block_id = rep(5, times = 4*7)
)
p_post <- link(m14.3, data = datp)
p_mu <- apply(p_post, 2, mean)
p_ci <- apply(p_post, 2, PI)
# set up plot
plot(NULL, xlim = c(1, 28), ylim = c(0, 1), xlab = "", ylab = "proportion left lever",
xaxt = "n", yaxt = "n")
axis(2, at = c(0, 0.5, 1), labels = c(0, 0.5, 1))
abline(h = 0.5, lty = 2)
# divide screen for actors
for(j in 1:7) abline(v = (j-1)*4 + 4.5, lwd = 0.5)
for(j in 1:7) text((j-1)*4 + 2.5, 1.1, concat("actor", j), xpd = TRUE)
xo <- 0.1 # offset distance to stagger raw data and prediction
# raw data
for(j in (1:7)[-2]){
lines( (j-1)*4 + c(1, 3) - xo, pl[j, c(1,3)], lwd = 2, col = "royalblue4")
lines( (j-1)*4 + c(2, 4) - xo, pl[j, c(2,4)], lwd = 2, col = "royalblue4")
}
points(1:28-xo, t(pl), pch = 16, col = "white", cex = 1.7)
points(1:28-xo, t(pl), pch = c(1, 1, 16, 16), col = "royalblue4", lwd = 2)
yoff <- 0.175
text( 1-xo, pl[1,1] - yoff, "R/N", pos = 1, cex = 0.8)
text( 2-xo, pl[1,2] + yoff, "L/N", pos = 3, cex = 0.8)
text( 3-xo, pl[1,3] - yoff, "R/P", pos = 1, cex = 0.8)
text( 4-xo, pl[1,4] + yoff, "L/P", pos = 3, cex = 0.8)
# posterior predictions
for(j in (1:7)[-2]){
lines( (j-1)*4+c(1,3)+xo , p_mu[(j-1)*4+c(1,3)] , lwd=2 )
lines( (j-1)*4+c(2,4)+xo , p_mu[(j-1)*4+c(2,4)] , lwd=2 )
}
for ( i in 1:28 ) lines( c(i,i)+xo , p_ci[,i] , lwd=1 )
points( 1:28+xo , p_mu , pch=16 , col="white" , cex=1.3 )
points( 1:28+xo , p_mu , pch=c(1,1,16,16) )
- raw data in blue
- posterior means and 89% CI in black
- open circles are treatments without a partner
- filled circles are with partner
- prosocial treatments alternate right-left-right-left, as labelled for actor 1

Compared to models from previous chapters, this model accommodates a lot more variation among individuals.
The posterior, however, does not just repeat the data, but applies shrinkage in several places. For instance, actor 2, who only pulled the left lever, has raw estimates clinging to the top, while the posterior predictions shrink inward. The shrinkage is stronger for treatments 1 and 2 than for the others, because those treatments had less variation among actors (look back at the m14.3 precis output). Less variation among actors in a treatment leads to more shrinkage among actors in that same treatment.
Overthinking: Non-centered parameterisation of the multilevel model
For inefficient chains, increasing the number of iterations can produce reliable samples from the posterior, but this is inefficient, and in some cases the chains could still be biased in subtle ways that are hard to detect. Re-parametrisation is a better way. Model m14.3 uses a Cholesky decomposition to take the covariance out of the prior. The top part of the model is the same as m14.2. The changes are the following extra lines:
# adaptive priors - non-centred
transpars> matrix[actor,4]:alpha <- compose_noncentered(sigma_actor, L_Rho_actor, z_actor),
transpars> matrix[block_id,4]:beta <- compose_noncentered(sigma_block, L_Rho_block, z_block),
matrix[4, actor]: z_actor ~ normal(0, 1),
matrix[4, block_id]: z_block ~ normal(0, 1),
The first two lines (starting with transpars>) define the matrices of varying effects alpha and beta. Each is a matrix with a row for each actor/block and a column for each treatment effect. compose_noncentered mixes the vector of standard deviations, the Cholesky correlation factor and the z-scores together to make a matrix of parameters on the correct scale for the linear model. This allows the matrices of z-scores to have simple normal(0, 1) priors.
The other change that makes the model non-centred is to replace the correlation matrices with Cholesky factors, declared as cholesky_factor_corr.
A CHOLESKY DECOMPOSITION L is a way to represent a square symmetric matrix like a correlation matrix R such that \(R=LL^T\). We can then multiply L by a matrix of uncorrelated samples of z-scores and end up with a matrix of correlated samples (the varying effects). This trick lets us take the covariance matrix out of the prior. We sample a matrix of uncorrelated z-scores and then multiply by the Cholesky factor and the std devs to get the varying effects with the correct scale and correlation.
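Here is a minimal sketch of the trick in plain R (the standard deviations and correlation below are illustrative values, not the model's): multiplying uncorrelated z-scores by diag(sigma) %*% L reproduces the target scales and correlation.

set.seed(11)
sds <- c(1, 0.5)                           # illustrative standard deviations
Rho_demo <- matrix(c(1, -0.7, -0.7, 1), 2) # target correlation matrix
L <- t(chol(Rho_demo))                     # lower-triangular Cholesky factor, Rho = L %*% t(L)
Z <- matrix(rnorm(2 * 1e4), nrow = 2)      # uncorrelated z-scores
X <- t(diag(sds) %*% L %*% Z)              # 'varying effects' with the right scale and correlation
round(cor(X), 2)                           # off-diagonal is about -0.7
round(apply(X, 2, sd), 2)                  # about 1 and 0.5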
The transpars> flags define the matrices alpha and beta as TRANSFORMED PARAMETERS, which means that Stan will include them in the posterior even though they are just functions of other parameters. They appear in the “transformed parameters” section of the Stan code.
scode <- stancode(m14.3)
## data{
## int L[504];
## int block_id[504];
## int actor[504];
## int tid[504];
## }
## parameters{
## matrix[4,7] z_actor;
## matrix[4,6] z_block;
## vector[4] g;
## vector<lower=0>[4] sigma_actor;
## cholesky_factor_corr[4] L_Rho_actor;
## vector<lower=0>[4] sigma_block;
## cholesky_factor_corr[4] L_Rho_block;
## }
## transformed parameters{
## matrix[7,4] alpha;
## matrix[6,4] beta;
## beta = (diag_pre_multiply(sigma_block, L_Rho_block) * z_block)';
## alpha = (diag_pre_multiply(sigma_actor, L_Rho_actor) * z_actor)';
## }
## model{
## vector[504] p;
## L_Rho_block ~ lkj_corr_cholesky( 2 );
## sigma_block ~ exponential( 1 );
## L_Rho_actor ~ lkj_corr_cholesky( 2 );
## sigma_actor ~ exponential( 1 );
## g ~ normal( 0 , 1 );
## to_vector( z_block ) ~ normal( 0 , 1 );
## to_vector( z_actor ) ~ normal( 0 , 1 );
## for ( i in 1:504 ) {
## p[i] = g[tid[i]] + alpha[actor[i], tid[i]] + beta[block_id[i], tid[i]];
## p[i] = inv_logit(p[i]);
## }
## L ~ binomial( 1 , p );
## }
## generated quantities{
## vector[504] log_lik;
## vector[504] p;
## matrix[4,4] Rho_actor;
## matrix[4,4] Rho_block;
## Rho_block = multiply_lower_tri_self_transpose(L_Rho_block);
## Rho_actor = multiply_lower_tri_self_transpose(L_Rho_actor);
## for ( i in 1:504 ) {
## p[i] = g[tid[i]] + alpha[actor[i], tid[i]] + beta[block_id[i], tid[i]];
## p[i] = inv_logit(p[i]);
## }
## for ( i in 1:504 ) log_lik[i] = binomial_lpmf( L[i] | 1 , p[i] );
## }
The calculations in the transformed parameters block merge the vectors of standard deviations, sigma_actor and sigma_block, with the Cholesky correlation factors, L_Rho_actor and L_Rho_block. The function diag_pre_multiply makes a diagonal matrix from the sigma vector and multiplies it with the Cholesky correlation factor, producing a Cholesky factor of the covariance matrix. This is then multiplied by the z-scores. The resulting matrix is transposed (the ' at the end) for convenience, so it can be indexed as alpha[actor, effect] instead of alpha[effect, actor].
In the model block of the Stan code, alpha and beta are then available as parameters, so the linear model part looks the same:
cat( regmatches(scode, regexpr("model\\{.*?\\}", scode)) )
## model{
## vector[504] p;
## L_Rho_block ~ lkj_corr_cholesky( 2 );
## sigma_block ~ exponential( 1 );
## L_Rho_actor ~ lkj_corr_cholesky( 2 );
## sigma_actor ~ exponential( 1 );
## g ~ normal( 0 , 1 );
## to_vector( z_block ) ~ normal( 0 , 1 );
## to_vector( z_actor ) ~ normal( 0 , 1 );
## for ( i in 1:504 ) {
## p[i] = g[tid[i]] + alpha[actor[i], tid[i]] + beta[block_id[i], tid[i]];
## p[i] = inv_logit(p[i]);
## }
- 1) Vector p is declared to hold our linear model calculations for each case
- 2-6) Priors are defined in terms of Cholesky correlation factors and vectors of standard deviations
- 7-8) z-score matrices are assigned priors using to_vector, since normal(0,1) applies to vectors, not matrices; the z-scores are still stored as matrices
- 9-13) The linear model is computed using the alpha and beta matrices from the transformed parameters block
The last section in the code is “generated quantities”, where variables that are functions of each sample can be calculated. Here it is used to transform the Cholesky factors into ordinary correlation matrices, so they can be interpreted as such, and to compute the log-probabilities needed to calculate WAIC/PSIS.
cat( regmatches(scode, regexpr("generated quantities\\{.*?\\}", scode)) )
## generated quantities{
## vector[504] log_lik;
## vector[504] p;
## matrix[4,4] Rho_actor;
## matrix[4,4] Rho_block;
## Rho_block = multiply_lower_tri_self_transpose(L_Rho_block);
## Rho_actor = multiply_lower_tri_self_transpose(L_Rho_actor);
## for ( i in 1:504 ) {
## p[i] = g[tid[i]] + alpha[actor[i], tid[i]] + beta[block_id[i], tid[i]];
## p[i] = inv_logit(p[i]);
## }
The function multiply_lower_tri_self_transpose is a compact and efficient way to perform the matrix algebra needed to turn the Cholesky factor L into the corresponding matrix \(R=LL^T\).
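As a quick check (a sketch that assumes the m14.3 fit from above is available), we can redo this algebra in R for a single posterior draw and compare it with the Rho_actor matrix Stan computed:

post3 <- extract.samples(m14.3)
L1 <- post3$L_Rho_actor[1, , ]   # one posterior draw of the Cholesky factor
round(L1 %*% t(L1), 3)           # L %*% t(L) reconstructs the correlation matrix
round(post3$Rho_actor[1, , ], 3) # matches Rho_actor from generated quantities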
Instruments and causal designs
In a DAG, many paths can connect a variable to an outcome. We want to leave open the causal paths and close the backdoors (the non-causal paths). Sometimes, closing all the backdoors might not be possible. We discuss in this section what can be done in such cases: exploit a combination of natural experiments and clever modelling that allow causal inference even when non-causal paths cannot be closed.
Instrumental variables
We focus on the example of studying the impact of education on wage. There could be any number of confounds that influence both education and wage.
Something like the above. The backdoor path E <- U -> W
stops us from learning the true impact of E on W.
Although we cannot condition on the unobserved U, we can instead find a suitable INSTRUMENTAL VARIABLE.
In causal terms, an instrumental variable is a variable that acts like a natural experiment on the exposure E. In technical terms, an instrumental variable Q is a variable that satisfies these criteria:
- Independent of U (Q \(\perp \! \!\! \perp\) U)
- Not independent of E (Q \(\not \! \perp \! \!\! \perp\) E)
- Q cannot influence W except through E
The last criterion is called the EXCLUSION RESTRICTION; it cannot be strictly tested and is often implausible. The first criterion cannot be tested either. But if domain knowledge suggests these hold, then we can continue. (Often we can't test the independence implications, but we might be able to test other implications in the form of inequality constraints.)
In the education and wage example, the instrument for education would look like the following:
library(dagitty)
dag_edu <- dagitty("dag{
  U [unobserved]
  E <- U -> W
  E -> W
  Q -> E
}")
coordinates(dag_edu) <- list(
  x = c(U = 1, E = 0, W = 2, Q = -1),
  y = c(U = 0, E = 1, W = 1, Q = 0)
)
drawdag(dag_edu)
In the above DAG, Q is an instrument. However, we cannot simply add it to the regression of W on E. E is a collider on the path Q -> E <- U -> W, so conditioning on E (by including it as a predictor) opens this non-causal path from Q to W (non-causal because changing Q produces no change in W through this path). The coefficient on Q then picks up the association between U and W, and the coefficient on E becomes even more confounded. Used this way, Q is called a BIAS AMPLIFIER.
As an example, suppose Q indicates the quarter of the year a person was born in: winter, spring, summer or fall. People born earlier in the year tend to get less schooling (they are biologically older when they start school, and they become eligible to drop out of school earlier). We will assume that Q is an instrument, i.e. it only influences W through E and is not influenced by the confounds U.
We will simulate the data on the above premise for 500 simulated people:
set.seed(73)
N <- 500
U_sim <- rnorm(N)
Q_sim <- sample(1:4, size = N, replace = TRUE)
E_sim <- rnorm(N, U_sim + Q_sim)
W_sim <- rnorm(N, U_sim + 0*E_sim)
dat_sim <- list(
  W = standardize(W_sim),
  E = standardize(E_sim),
  Q = standardize(Q_sim)
)
- Q varies from 1 to 4, larger values are associated with more education through addition of Q_sim to the mean of E_sim
- assumed true influence of education on wages is zero
- Q does influence education, so it can serve as an instrument for discovering E -> W
We will consider three models:
Naively regress wages on education
m14.4 <- ulam(
  alist(
    W ~ dnorm(mu, sigma),
    mu <- aW + bEW*E,
    aW ~ dnorm(0, 0.2),
    bEW ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = dat_sim, chains = 4, cores = 4, refresh = 0
)
precis(m14.4)
## mean sd 5.5% 94.5% n_eff Rhat4
## aW 0.001202152 0.04037652 -0.06347356 0.06607981 1706.487 1.0029059
## bEW 0.397292299 0.03973951 0.33541289 0.45833100 1786.064 1.0015440
## sigma 0.918847608 0.03116978 0.86996310 0.97097246 1846.831 0.9990899
This is an ordinary confound where the unobserved U ruins our inference. Even if E does increase W, the estimate from this model will still be biased upwards.
Add Q as predictor
m14.5 <- ulam(
  alist(
    W ~ dnorm(mu, sigma),
    mu <- aW + bEW*E + bQW*Q,
    aW ~ dnorm(0, 0.2),
    bEW ~ dnorm(0, 0.5),
    bQW ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = dat_sim, chains = 4, cores = 4, refresh = 0
)
precis(m14.5)
## mean sd 5.5% 94.5% n_eff Rhat4
## aW -0.001420348 0.03796389 -0.06310167 0.05880091 1859.084 0.9994690
## bEW 0.636837583 0.04656647 0.56318393 0.71159853 1394.526 1.0023708
## bQW -0.404594622 0.04705495 -0.47999277 -0.33128785 1495.233 1.0041473
## sigma 0.856158001 0.02734498 0.81318830 0.90082169 1805.371 0.9990282
As expected from the DAG, bQW picks up an association from U, and bEW is now even further from the truth (from 0.4 to 0.64). This is bias amplification.
Correctly use the instrumental variable Q
We need to use the generative model. We will write a simple generative version of the DAG which has 4 sub-models:
- Model for how wages W are caused by education E and the unobserved confound U:
\[ \begin{align} W_i &\sim \text{Normal}(\mu_{W,i}, \sigma_w) \\ \mu_{W,i} &= \alpha_W + \beta_{EW} E_i + U_i \end{align} \]
- Model for how education levels E are caused by quarter of birth Q (instrument) and the same unobserved confound U:
\[ \begin{align} E_i &\sim \text{Normal}(\mu_{E,i}, \sigma_E) \\ \mu_{E,i} &= \alpha_E + \beta_{QE} Q_i + U_i \end{align} \]
- Model for Q which just says one-quarter of all people are born in each quarter of the year:
\[ Q_i \sim \text{Categorical}([0.25, 0.25, 0.25, 0.25]) \]
- Final model says that the unobserved confound U is normally distributed with the mean zero and standard deviation one.
\[ U_i \sim \text{Normal}(0, 1) \]
U can have some other distribution as well.
Now we will translate the above generative model into a statistical one. One way is to use brute force by treating the \(U_i\) values as missing data and imputing them (we will see how to do this in next chapter). Another more efficient way is to average over them and estimate instead the covariance between W and E. We will do this by defining W and E as coming from a common multivariate normal distribution:
\[\begin{align} \begin{pmatrix} W_i \\ E_i \end{pmatrix} &\sim \text{MVNormal} \left( \begin{pmatrix} \mu_{W,i} \\ \mu_{E,i} \end{pmatrix} , S \right) \\ \mu_{W,i} &= \alpha_W + \beta_{EW} E_i \\ \mu_{E,i} &= \alpha_E + \beta_{QE} Q_i \end{align}\]
The matrix S above is the error covariance between wages and education. It is not the descriptive covariance between these variables, but the matrix equivalent of \(\sigma\) from a Gaussian regression. The above is a MULTIVARIATE LINEAR MODEL, a regression with multiple simultaneous outcomes, all modelled with a joint error structure. Each variable gets its own linear model, yielding the two \(\mu\) definitions.
E is both an outcome and a predictor inside the mean for \(W\), this is an implication of the DAG. All it says is that E might influence W and that also pairs of W, E values might have some residual correlation through the unobserved confound U.
Here’s the model for the above:
m14.6 <- ulam(
  alist(
    c(W, E) ~ multi_normal( c(muW, muE), Rho, Sigma),
    muW <- aW + bEW*E,
    muE <- aE + bQE*Q,
    c(aW, aE) ~ normal(0, 0.2),
    c(bEW, bQE) ~ normal(0, 0.5),
    Rho ~ lkj_corr(2),
    Sigma ~ exponential(1)
  ), data = dat_sim, chains = 4, cores = 4, refresh = 0
)
precis(m14.6, depth = 3)
## mean sd 5.5% 94.5% n_eff Rhat4
## aE 0.0005860671 3.524882e-02 -0.05518997 0.05854603 1252.2155 1.0045501
## aW 0.0001002765 4.470421e-02 -0.07179647 0.06960390 1324.5154 1.0003266
## bQE 0.5880281968 3.570527e-02 0.53087931 0.64550948 1093.1261 1.0013553
## bEW -0.0517993567 7.830728e-02 -0.17755412 0.06840691 768.0384 1.0081419
## Rho[1,1] 1.0000000000 0.000000e+00 1.00000000 1.00000000 NaN NaN
## Rho[1,2] 0.5437002594 5.345918e-02 0.45404924 0.62598226 972.9080 1.0062702
## Rho[2,1] 0.5437002594 5.345918e-02 0.45404924 0.62598226 972.9080 1.0062702
## Rho[2,2] 1.0000000000 8.084567e-17 1.00000000 1.00000000 1425.6774 0.9979980
## Sigma[1] 1.0258297083 4.857538e-02 0.95330462 1.10718216 1018.7797 1.0067045
## Sigma[2] 0.8091442654 2.610979e-02 0.76843024 0.85188647 1509.3109 0.9998755
- bEW, the estimated influence of education on wages, is small and has an interval including zero
- Rho[1,2], the correlation between the two outcomes (wages and education), is reliably positive, which reflects the common influence of U. This correlation is conditional on E (for W) and Q (for E). It is not the raw empirical correlation, but rather the residual correlation.
Simulations with other scenarios
Let’s look at a scenario where education has a positive influence but the confound hides it:
set.seed(73)
N <- 500
U_sim <- rnorm(N)
Q_sim <- sample(1:4, size = N, replace = TRUE)
E_sim <- rnorm(N, U_sim + Q_sim)
W_sim <- rnorm(N, -U_sim + 0.2*E_sim)
dat_sim <- list(
  W = standardize(W_sim),
  E = standardize(E_sim),
  Q = standardize(Q_sim)
)
Re-running the models m14.4 and m14.6 for the above scenario:
m14.4x <- ulam(m14.4, data = dat_sim, chains = 4, cores = 4)
m14.6x <- ulam(m14.6, data = dat_sim, chains = 4, cores = 4)
precis(m14.4x)
## mean sd 5.5% 94.5% n_eff Rhat4
## aW 0.001798145 0.04323516 -0.06609534 0.07079458 1763.498 1.0012874
## bEW -0.111945137 0.04527591 -0.18284859 -0.03822366 1891.031 0.9998971
## sigma 0.995766707 0.03195580 0.94597062 1.04793213 1874.175 0.9998731
We get a very small and negative effect of education, which is the opposite of the truth.
precis(m14.6x, 3)
## mean sd 5.5% 94.5% n_eff Rhat4
## aE 0.001942843 3.582773e-02 -0.05664315 0.05919863 1748.1816 0.9993583
## aW -0.002689091 4.679668e-02 -0.07857863 0.07125456 1824.0208 0.9989835
## bQE 0.588697162 3.460468e-02 0.53461303 0.64498450 1449.3879 0.9997106
## bEW 0.284889757 8.061355e-02 0.15901581 0.41491078 956.6505 1.0014599
## Rho[1,1] 1.000000000 0.000000e+00 1.00000000 1.00000000 NaN NaN
## Rho[1,2] -0.461918075 5.870532e-02 -0.55112341 -0.36776079 1098.4997 1.0004280
## Rho[2,1] -0.461918075 5.870532e-02 -0.55112341 -0.36776079 1098.4997 1.0004280
## Rho[2,2] 1.000000000 8.302803e-17 1.00000000 1.00000000 2110.6071 0.9979980
## Sigma[1] 1.074728061 4.557794e-02 1.00513637 1.15154227 1094.4124 1.0009016
## Sigma[2] 0.809381745 2.646818e-02 0.76858466 0.85355019 1793.4526 1.0005710
- bEW is positive and includes 0.2, the actual effect, in its credible interval
- the residual correlation is negative, since the confound positively influences one outcome and negatively influences the other
Using Dagitty for instrumental variables
{dagitty} has a function instrumentalVariables that will find instruments if they are present in a DAG.
library(dagitty)
dagIV <- dagitty("dag{
  Q -> E <- U -> W <- E
}")
instrumentalVariables(dagIV, exposure = "E", outcome = "W")
## Q
In general, it is not possible to statistically prove whether a variable is a good instrument. As always, we need scientific knowledge outside of the data to make sense of the data.
Other Designs
Instrumental variables are like natural experiments that impersonate a randomised experiment. In the previous example, the external shock from Q to E was like an experimental manipulation that allowed us to estimate the impact of that external shock and thereby derive a causal estimate.
There are other ways to find natural experiments besides instrumental variables. We consider two such ways.
In addition to the backdoor criterion, we have the FRONT-DOOR CRITERION which is relevant in a DAG like this:
We again want to find the causal influence of X on Y, but there is an unobserved confound U affecting both. We can do so if we can find a perfect mediator Z (see the sketch below). Then the causal effect can be estimated by expressing the generative model as a statistical model, similar to the instrumental variable example before.
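Since the figure is not reproduced here, the following is a sketch (my own layout, with X the exposure, Z the mediator, Y the outcome and U the unobserved confound) of the standard front-door DAG in dagitty; adjustmentSets confirms there is no backdoor adjustment set when U is unobserved.

library(dagitty)
dag_fd <- dagitty("dag{
  U [unobserved]
  X <- U -> Y
  X -> Z -> Y
}")
coordinates(dag_fd) <- list(
  x = c(X = 0, Z = 1, Y = 2, U = 1),
  y = c(X = 1, Z = 1, Y = 1, U = 0)
)
drawdag(dag_fd)
adjustmentSets(dag_fd, exposure = "X", outcome = "Y") # empty: no valid backdoor adjustment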
An example of this is the influence of social ties formed in college on voting behaviour in the US Senate (blog link), where the question is whether senators who went to the same college vote more similarly because of their social ties. The pure association between attending the same college and voting is possibly confounded by lots of things. The front-door trick is to find some mechanism through which social ties must act. In the case of the United States Senate, a mechanism could be who sits next to whom. It is easier to talk to and coordinate with people sitting nearby. And since junior members are often assigned seats effectively at random, seating is unlikely to share the same confounds as college attendance. Now consider some senators who attended UCLA. Some of them end up seated near one another. Others end up seated next to rival UC Berkeley alums. If the ones seated near one another vote more similarly to one another than to the UCLA alums seated elsewhere, that could be causal evidence that social ties influence voting, as mediated by proximity on the Senate floor.
A more common design is REGRESSION DISCONTINUITY (or RDD). Suppose that we want to estimate the effect of winning an academic award on future success. This is confounded by unobserved factors, like ability, that influence both the award and later success. But if we compare individuals who were just below the cutoff for the award to those who were just above the cutoff, these individuals should be similar in the unobserved factors. It's as if the award were applied at random, for individuals close to the cutoff. This is the idea behind regression discontinuity. In practice, one trend is fit for individuals above the cutoff and another for those below it. Then an estimate of the causal effect is the average difference between individuals just above and just below the cutoff. While the difference near the cutoff is of interest, the entire function influences this difference, so some care is needed in choosing functions for the overall relationship between the exposure and the outcome.
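As a rough illustration (the simulated data and variable names below are my own, not from the chapter), a bare-bones regression discontinuity estimate could look like the following sketch: fit one trend on each side of the cutoff and take the difference of their predictions at the cutoff.

set.seed(1)
n <- 1000
ability <- rnorm(n)                          # running variable (score that determines the award)
cutoff  <- 0
award   <- as.numeric(ability >= cutoff)     # award assigned at the cutoff
success <- rnorm(n, ability + 0.3*award)     # simulated outcome; true award effect = 0.3
d_rdd   <- data.frame(ability, award, success)
# separate trends above and below the cutoff (in practice, restrict to a window near the cutoff)
fit_lo <- lm(success ~ ability, data = d_rdd[d_rdd$ability <  cutoff, ])
fit_hi <- lm(success ~ ability, data = d_rdd[d_rdd$ability >= cutoff, ])
# estimated causal effect: difference in predicted success at the cutoff (close to 0.3)
predict(fit_hi, data.frame(ability = cutoff)) - predict(fit_lo, data.frame(ability = cutoff))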
Continuous categories and the Gaussian process
So far we have used varying effects (intercepts and slopes) for nominal variables. In certain cases, however, we might have continuous variables like age, income or stature. People of the same age are expected to share the same exposures, and people of similar age will still share some of them, although to a lesser degree, with this covariation falling off as the difference in age increases. In that case we do not want a unique intercept for every age, but similar intercepts for similar ages.
The general approach of applying varying effects to continuous categories is known as GAUSSIAN PROCESS REGRESSION. (Neal 1998 is a highly cited overview with notes on implementation - read later.) The general process involves defining some dimension along which cases differ (age, location, etc.), measuring the distance between each pair of cases, and then letting the model estimate a function for the covariance between pairs of cases at different distances.
Example: Spatial autocorrelation in Oceanic tools
In chapter 11 we considered the islands of Oceania and studied the development of complex tools as a factor of island population and degree of contact between them.
library(rethinking)
data(Kline)
Kline
## culture population contact total_tools mean_TU
## 1 Malekula 1100 low 13 3.2
## 2 Tikopia 1500 low 22 4.7
## 3 Santa Cruz 3600 low 24 4.0
## 4 Yap 4791 high 43 5.0
## 5 Lau Fiji 7400 high 33 5.0
## 6 Trobriand 8000 high 19 4.0
## 7 Chuuk 9200 high 40 3.8
## 8 Manus 13000 low 28 6.6
## 9 Tonga 17500 high 55 5.4
## 10 Hawaii 275000 low 71 6.6
The contact variable was categorical, with low and high categories. However, this variable takes no note of which society was in contact with which society. High contact among low-population islands may not have as much effect as contact with high-population societies.
Secondly, since tools were exchanged among societies, the total number of tools in each society is not really independent of the others (even after conditioning on all predictors), and we would expect close geographic neighbours to have more similar tool counts because of exchange.
Finally, closer islands may share unmeasurable geographic features that may lead to similar technological industries. Thus, space could matter in multiple ways.
We will use Gaussian Process Regression here. A distance matrix among the societies will be defined and then similarity in tool counts will be studied with the geographic distance.
Loading the data:
# load distance matrix
data("islandsDistMatrix") # matrix of pairwise distances b/w islands in 1000 kms
# display (measured in thousands of km)
Dmat <- islandsDistMatrix
colnames(Dmat) <- c("Ml", "Ti", "SC", "Ya", "Fi", "Tr", "Ch", "Mn", "To", "Ha")
round(Dmat, 1)
## Ml Ti SC Ya Fi Tr Ch Mn To Ha
## Malekula 0.0 0.5 0.6 4.4 1.2 2.0 3.2 2.8 1.9 5.7
## Tikopia 0.5 0.0 0.3 4.2 1.2 2.0 2.9 2.7 2.0 5.3
## Santa Cruz 0.6 0.3 0.0 3.9 1.6 1.7 2.6 2.4 2.3 5.4
## Yap 4.4 4.2 3.9 0.0 5.4 2.5 1.6 1.6 6.1 7.2
## Lau Fiji 1.2 1.2 1.6 5.4 0.0 3.2 4.0 3.9 0.8 4.9
## Trobriand 2.0 2.0 1.7 2.5 3.2 0.0 1.8 0.8 3.9 6.7
## Chuuk 3.2 2.9 2.6 1.6 4.0 1.8 0.0 1.2 4.8 5.8
## Manus 2.8 2.7 2.4 1.6 3.9 0.8 1.2 0.0 4.6 6.7
## Tonga 1.9 2.0 2.3 6.1 0.8 3.9 4.8 4.6 0.0 5.0
## Hawaii 5.7 5.3 5.4 7.2 4.9 6.7 5.8 6.7 5.0 0.0
This distance matrix will be used to estimate varying intercepts for each society that account for non-independence in tools as a function of their geographical similarity. Thus the varying intercept will be based on a continuous measure and these intercepts will be correlated among the neighbours.
We will use the “Scientific” tool model:
\[ \begin{align} T_i &\sim \text{Poisson}(\lambda_i) \\ \lambda_i &= \alpha P_i^\beta / \gamma \end{align} \] We would like to adjust these \(\lambda\) values by a varying intercept parameter. We can’t just add it, since then \(\lambda\) could become negative. So we will make it multiplicative.
\[ \begin{align} T_i &\sim \text{Poisson}(\lambda_i) \\ \lambda_i &= \exp (k_{ \text{SOCIETY}}) \alpha P_i^\beta / \gamma \end{align} \]
where \(\exp (k_{ \text{SOCIETY}})\) is the varying intercept which will be estimated using the geographic distance instead of based on some category.
The main part of the Gaussian process is the multivariate prior for these intercepts:
\[\begin{align} \begin{pmatrix} k_1 \\ k_2 \\ k_3 \\ \cdots \\ k_{10} \end{pmatrix} &\sim \text{MVNormal} \left( \begin{pmatrix} 0 \\ 0 \\ 0 \\ \cdots \\ 0 \end{pmatrix} , K \right) & \text{[prior for intercepts]} \\[2ex] K_{ij} &= \eta^2 \exp(-\rho^2 D_{ij}^2) + \delta_{ij} \sigma^2 & \text{[define covariance matrix]} \end{align}\]
The first line is the 10-Dimensional Gaussian prior for intercepts (there are 10 societies in the dist matrix). The vector of means is all zeros so that in the linear model, the average society will scale \(\lambda\) by exp(0) = 1. Negative values will scale it down and positive scale it up.
The covariance matrix is named K, and the covariance between any pair of societies i and j is \(K_{ij}\). This is defined in the second line and the formula uses three parameters - \(\eta, \rho, \sigma\) - to model how the covariance between societies changes.
The term \(\exp(-\rho^2 D_{ij}^2)\) gives K its shape. \(D_{ij}\) is the distance between the \(i\)-th and \(j\)-th societies, so the covariance between any two societies i and j declines exponentially with the squared distance between them. \(\rho\) determines the rate of decline: if it is large, the covariance declines rapidly. The distance doesn't have to be squared, but it is the most common assumption, and it allows the often-realistic property that covariance declines more quickly as distance grows.
Let’s visualise how this function looks, compared with a linear-distance alternative (we will use \(\rho=1\) in both cases):
# linear
curve(exp(-1*x), from = 0, to = 4, lty = 2, xlab = "distance", ylab = "correlation")
# squared
curve(exp(-1*x^2), from = 0, add = TRUE)
legend("topright", legend = c("linear", "squared"), lty = c(2, 1))
mtext("Shape of the function relating distance to the covaiance K", adj = 0, font = 2)
The vertical axis here shows only part of the total covariance function; we can think of it as the proportion of the maximum correlation between any pair of societies. The dashed curve produces an exact exponential decline, while the solid curve (squared distance) produces a half-Gaussian decline that is initially slower than the exponential but then accelerates and becomes faster than exponential.
\(\eta^2\) is the max covariance between any two societies. \(\delta_{ij}\sigma^2\) provides for extra covariance beyond \(\eta^2\) when \(i=j\). It does this because the function \(\delta_{ij}\) is equal to 1 when \(i=j\) and zero otherwise. In this example this term will not matter, since we only have one observation per society, but if we had more than one observation per society, then \(\sigma\) would describe how these observations covary.
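To make this concrete, here is a small sketch that builds the implied covariance matrix K directly from the distance matrix, using illustrative parameter values (etasq_ex = 0.2, rhosq_ex = 1, sigmasq_ex = 0.01 are placeholders, not estimates):
# sketch: the covariance matrix K implied by the kernel, for illustrative values
etasq_ex <- 0.2
rhosq_ex <- 1
sigmasq_ex <- 0.01
K_ex <- etasq_ex * exp(-rhosq_ex * Dmat^2) + diag(sigmasq_ex, nrow(Dmat))
round(K_ex[1:3, 1:3], 3)   # covariance among Malekula, Tikopia and Santa Cruz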
We will define priors for \(\rho, \eta, \sigma\). We will define them for the square of each parameter and estimate them on that scale, which is computationally easier. Since we don’t need \(\sigma\) for this model, we will fix it to an irrelevant constant (0.01 in the code below).
\[ \begin{align} \eta^2 &\sim \text{Exponential}(2) \\ \rho^2 &\sim \text{Exponential}(0.5) \end{align} \]
\(\rho^2\) and \(\eta^2\) must be positive, so we place exponential priors on them.
In the ulam code, we can signal that we want the squared-distance Gaussian process prior via GPL2.
data(Kline2) # load the ordinary data with coordinates
d <- Kline2
d$society <- 1:10 # index observations

dat_list <- list(
  T = d$total_tools,
  P = d$population,
  society = d$society,
  Dmat = islandsDistMatrix
)
m14.8 <- ulam(
  alist(
    T ~ dpois(lambda),
    lambda <- (a * P^b / g) * exp(k[society]),

    vector[10]:k ~ multi_normal(0, SIGMA),
    matrix[10, 10]:SIGMA <- cov_GPL2(Dmat, etasq, rhosq, 0.01),

    c(a, b, g) ~ dexp(1),
    etasq ~ dexp(2),
    rhosq ~ dexp(0.5)
  ), data = dat_list, chains = 4, cores = 4, iter = 2000
)
Checking the posterior:
precis(m14.8, depth = 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## k[1] -0.15980191 0.31469574 -0.67035102 0.32237263 797.6852 1.003305
## k[2] -0.01072118 0.30338388 -0.49010572 0.47149539 721.8542 1.004069
## k[3] -0.05800329 0.28613102 -0.50817851 0.38280989 737.6456 1.004128
## k[4] 0.36329975 0.26250892 -0.02026005 0.77714603 812.5776 1.001372
## k[5] 0.08829879 0.25911228 -0.29138759 0.48411944 820.2150 1.000576
## k[6] -0.37269278 0.27500408 -0.81739902 0.02589047 961.8621 1.002277
## k[7] 0.14844959 0.25438874 -0.22497992 0.55674986 891.3241 1.001565
## k[8] -0.20146935 0.26569493 -0.60056103 0.19937082 826.4089 1.001872
## k[9] 0.27483953 0.24475724 -0.08541670 0.66392123 872.2737 1.001038
## k[10] -0.15665215 0.32961040 -0.69718513 0.34502215 1264.8250 1.000045
## g 0.60732744 0.57221869 0.07406282 1.66837192 1943.6689 1.000279
## b 0.27721752 0.08421188 0.14517495 0.41705914 1265.7719 1.002080
## a 1.39219671 1.03710426 0.24284199 3.28157276 2422.9587 1.000494
## etasq 0.19367642 0.20739463 0.03176059 0.54443732 1203.1241 1.001731
## rhosq 1.34821846 1.59991043 0.08367097 4.35576233 2035.3172 1.000979
The coefficient for population, b, is similar to what it was before adding the Gaussian process. (We are not using log(lambda) in the linear model; if we did, we would use log population. We have simply removed the log from both sides of the equation.) Thus we cannot explain away all of the association between tool counts and population as a side effect of geographic contact.
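To spell out that parenthetical, the two ways of writing the mean are the same model on different scales (leaving out the varying intercept for brevity):
\[ \log \lambda_i = \log \alpha + \beta \log P_i - \log \gamma \iff \lambda_i = \alpha P_i^\beta / \gamma \]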
The k parameters are the varying intercepts for each society. Like a and b, they are on the log-count scale, so interpreting them raw is hard.
In order to understand the parameters that describe the covariance with distance, rhosq and etasq, we will want to plot the function they imply. The joint posterior distribution of these two parameters defines a posterior distribution of covariance functions. We can plot a bunch of them to get a sense of it.
Note: The uncertainty of the function is not the same as the uncertainty of the mean function
post <- extract.samples(m14.8)
prior <- extract.prior(m14.8, n = 1e3, refresh = 0)

x_seq <- seq(from = 0, to = 10, length.out = 100)

# compute the posterior mean covariance
pmcov <- sapply(x_seq, function(x) post$etasq * exp(-post$rhosq * x^2))
pmcov_mu <- colMeans(pmcov)

# compute the prior mean covariance
prcov <- sapply(x_seq, function(x) prior$etasq * exp(-prior$rhosq * x^2))
prcov_mu <- colMeans(prcov)
withr::with_par(
  list(mfrow = c(1, 2)),
  code = {
    # Prior
    # set up plot for the prior covariance functions
    plot(NULL, xlab = "distance (thousand km)", ylab = "covariance",
         xlim = c(0, 10), ylim = c(0, 2), main = "Gaussian process prior")
    # plot 50 functions sampled from the prior
    for(i in 1:50)
      curve(prior$etasq[i] * exp(-prior$rhosq[i] * x^2), add = TRUE, col = col.alpha("black", 0.3))

    # Posterior
    # set up plot and draw the posterior mean covariance function
    plot(NULL, xlab = "distance (thousand km)", ylab = "covariance",
         xlim = c(0, 10), ylim = c(0, 2), main = "Gaussian process posterior")
    lines(x_seq, pmcov_mu, lwd = 2)
    # plot 50 functions sampled from the posterior
    for(i in 1:50)
      curve(post$etasq[i] * exp(-post$rhosq[i] * x^2), add = TRUE, col = col.alpha("black", 0.3))
  }
)
Each combination of values for \(\rho^2\) and \(\eta^2\) produces a relationship between covariance and distance. The posterior mean relationship is shown by the thicker curve and represents a centre of plausibility, but the sampled curves show considerable uncertainty about the spatial covariance. The mean covariance at short distances is around 0.2, yet there are curves well above and well below that. However, the majority of the posterior curves decline to near-zero covariance before 4,000 km.
Since the covariances are on the log-count scale, it is hard to interpret them directly. So let’s look at the correlations among societies that are implied by the posterior median. First we push the parameters back through the function for K, the covariance matrix.
# compute posterior median covariance among societies
K <- matrix(0, nrow = 10, ncol = 10)
for(i in 1:10)
  for(j in 1:10)
    K[i,j] <- median(post$etasq) * exp(-median(post$rhosq) * islandsDistMatrix[i,j]^2)
diag(K) <- median(post$etasq) + 0.01
Second, we convert K to a correlation matrix:
# convert to correlation matrix
Rho <- round(cov2cor(K), 2)
# add row/col names for convenience
colnames(Rho) <- c("Ml", "Ti", "SC", "Ya", "Fi", "Tr", "Ch", "Mn", "To", "Ha")
rownames(Rho) <- colnames(Rho)
Rho
## Ml Ti SC Ya Fi Tr Ch Mn To Ha
## Ml 1.00 0.78 0.69 0.00 0.29 0.04 0.00 0.00 0.07 0
## Ti 0.78 1.00 0.86 0.00 0.29 0.04 0.00 0.00 0.05 0
## SC 0.69 0.86 1.00 0.00 0.15 0.10 0.01 0.01 0.02 0
## Ya 0.00 0.00 0.00 1.00 0.00 0.01 0.15 0.13 0.00 0
## Fi 0.29 0.29 0.15 0.00 1.00 0.00 0.00 0.00 0.59 0
## Tr 0.04 0.04 0.10 0.01 0.00 1.00 0.08 0.53 0.00 0
## Ch 0.00 0.00 0.01 0.15 0.00 0.08 1.00 0.30 0.00 0
## Mn 0.00 0.00 0.01 0.13 0.00 0.53 0.30 1.00 0.00 0
## To 0.07 0.05 0.02 0.00 0.59 0.00 0.00 0.00 1.00 0
## Ha 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1
The cluster of societies in the top left is highly correlated: Malekula (Ml), Tikopia (Ti) and Santa Cruz (SC), with correlations between roughly 0.7 and 0.9. These societies are very close together and also have similar tool totals, which we will visualise shortly. These correlations were estimated with population in the model, and thus suggest some additional resemblance even after accounting for the average association between population and tools. On the other end of the spectrum is Hawaii, which is very far from all the other societies, so its correlations have decayed to zero.
We will plot these correlations on a crude map of the Pacific Ocean. The Kline2 data frame provides latitude and longitude for each society that can be used for this purpose. We will also scale the size of each society in proportion to its log population.
# scale point size to logpop
psize <- d$logpop / max(d$logpop)
psize <- exp(psize * 1.5) - 2

# plot raw data and labels
plot(d$lon2, d$lat, xlab = "longitude", ylab = "latitude",
     col = rangi2, cex = psize, pch = 16, xlim = c(-50, 30))
labels <- as.character(d$culture)
text(d$lon2, d$lat, labels = labels, cex = 0.7, pos = c(2, 4, 3, 3, 4, 1, 3, 2, 4, 2))
# overlay lines shaded by Rho
for(i in 1:10)
  for(j in 1:10)
    if(i < j) # one line per unique i,j pair
      lines(c(d$lon2[i], d$lon2[j]), c(d$lat[i], d$lat[j]),
            lwd = 2, col = col.alpha("black", Rho[i,j]^2))
Darker lines indicate stronger correlations, with pure white being zero correlation and pure black perfect correlation. The three close societies, Santa Cruz, Malekula and Tikopia, stand out. To make even more sense of the figure, we should compare it against the simultaneous relationship between tools and log population. The plot below combines the average posterior predictive relationship between log population and total tools with the shaded correlation lines for each pair of societies:
# compute posterior median relationship, ignoring distance
logpop.seq <- seq(from = 6, to = 14, length.out = 30)

# the computation for this is not correct in the book; either of the following two methods works:
# lambda <- sapply(logpop.seq, function(lp) exp(log(post$a) + post$b*lp - log(post$g)))
lambda <- sapply(logpop.seq, function(lp) post$a * exp(lp)^post$b / post$g)

lambda.median <- apply(lambda, 2, median)
lambda.PI80 <- apply(lambda, 2, PI, prob = 0.8)
# plot raw data and labels
plot(d$logpop, d$total_tools, col = rangi2, cex = psize, pch = 16,
xlab = "log population", ylab = "total tools")
text(d$logpop, d$total_tools, labels = labels, cex = 0.7,
pos = c(4, 3, 4, 2, 2, 1, 4, 4, 4, 2))
# display posterior predictions
lines(logpop.seq, lambda.median, lty = 2)
lines(logpop.seq, lambda.PI80[1,], lty = 2)
lines(logpop.seq, lambda.PI80[2,], lty = 2)
# overlay correlations
for(i in 1:10)
for(j in 1:10)
if(i < j)
lines( c(d$logpop[i], d$logpop[j]), c(d$total_tools[i], d$total_tools[j]) ,
lwd = 2, col = col.alpha('black', Rho[i,j]^2))
- Malekula, Tikopia, and Santa Cruz all have fewer tools than expected for their populations, which is consistent with the high spatial covariance between them
- Manus and Trobriand have a substantial posterior correlation and fewer tools than expected for their population sizes
- Tonga has more tools than expected for its population, and its proximity to Fiji pulls up Fiji’s expected tools, even though Fiji is also geographically close to Malekula, Tikopia and Santa Cruz, which exert their own spatial influence on it. Thus the model believes that Fiji would have fewer tools than expected for its population as well, were it not for the influence of Tonga
The correlations that this model describes by geographic distance may be due to some other unmeasured commonalities as well.
For example, Manus and the Trobriands are geologically and ecologically quite different from Fiji and Tonga. So it could be availability of, for example, tool stone that explains some of the correlations. The Gaussian process regression is a grand and powerful descriptive model. As a result, its output is always compatible with many different causal explanations.
Rethinking: Dispersion by other names
The model in this section uses a Poisson likelihood, which is often sensitive to outliers, like the Hawaii data. You could use a gamma-Poisson likelihood instead, as explained in Chapter 12. But note that the varying effects in this example already induce additional dispersion around the Poisson mean. Adding Gaussian noise to each Poisson observation is another traditional way to handle over-dispersion in Poisson models. But do try the model with gamma-Poisson as well, so you can compare.
Let’s create a similar model, but instead of a Poisson distribution for the outcome we will use a gamma-Poisson (so that it models a distribution of lambdas).
m14.8_gm <- ulam(
  alist(
    T ~ dgampois(lambda, phi),
    lambda <- (a * P^b / g) * exp(k[society]),

    vector[10]:k ~ multi_normal(0, SIGMA),
    matrix[10, 10]:SIGMA <- cov_GPL2(Dmat, etasq, rhosq, 0.01),

    c(a, b, g) ~ dexp(1),
    etasq ~ dexp(2),
    rhosq ~ dexp(0.5),
    phi ~ dexp(1)
  ), data = dat_list, chains = 4, cores = 4, iter = 2000
)
precis(m14.8_gm)
## mean sd 5.5% 94.5% n_eff Rhat4
## g 0.7320523 0.7336745 0.069346861 2.1477996 2767.3143 1.0022359
## b 0.3176051 0.1247511 0.123251945 0.5204985 1770.9328 1.0018052
## a 1.2305711 1.0342075 0.167327625 3.1361356 2824.3360 0.9998828
## etasq 0.1704369 0.2144804 0.006663188 0.5469543 808.7455 1.0035203
## rhosq 1.7298606 1.8803577 0.061731497 5.4240206 2280.8263 1.0014933
## phi 3.5137902 1.6117319 1.460898173 6.3171336 2678.3096 1.0002951
The variance of a gamma-Poisson is equal to \(\lambda + \lambda^2/ \phi\). Thus large values of phi make it behave like a plain Poisson. The posterior for phi lies mostly between about 1.5 and 6.3 (the 89% interval above).
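As a rough sense of what a phi in that range implies (using round illustrative numbers, not model output): with \(\lambda = 30\) and \(\phi = 3.5\),
\[ \text{Var}(T) = \lambda + \frac{\lambda^2}{\phi} = 30 + \frac{30^2}{3.5} \approx 287, \]
compared with a variance of 30 under a pure Poisson; as \(\phi \to \infty\) the extra term vanishes and the two likelihoods coincide.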
Let us recreate the previous plot of population and tools, plus the correlation lines, for this new model.
island_plot <- function(post, title){
  # compute posterior median relationship, ignoring distance
  logpop.seq <- seq(from = 6, to = 14, length.out = 30)

  # the computation for this is not correct in the book; either of the following two methods works:
  # lambda <- sapply(logpop.seq, function(lp) exp(log(post$a) + post$b*lp - log(post$g)))
  lambda <- sapply(logpop.seq, function(lp) post$a * exp(lp)^post$b / post$g)

  lambda.median <- apply(lambda, 2, median)
  lambda.PI80 <- apply(lambda, 2, PI, prob = 0.8)

  # plot raw data and labels
  plot(d$logpop, d$total_tools, col = rangi2, cex = psize, pch = 16,
       xlab = "log population", ylab = "total tools", main = title)
  text(d$logpop, d$total_tools, labels = labels, cex = 0.7,
       pos = c(4, 3, 4, 2, 2, 1, 4, 4, 4, 2))
  # display posterior predictions
  lines(logpop.seq, lambda.median, lty = 2)
  lines(logpop.seq, lambda.PI80[1,], lty = 2)
  lines(logpop.seq, lambda.PI80[2,], lty = 2)
  # overlay correlations
  for(i in 1:10)
    for(j in 1:10)
      if(i < j)
        lines(c(d$logpop[i], d$logpop[j]), c(d$total_tools[i], d$total_tools[j]),
              lwd = 2, col = col.alpha("black", Rho[i,j]^2))
}
withr::with_par(
  list(mfrow = c(2, 1)),
  code = {
    # plot Poisson process model, then gamma-Poisson model
    island_plot(extract.samples(m14.8), "Model with Poisson process")
    island_plot(extract.samples(m14.8_gm), "Model with gamma-Poisson process")
  }
)
There are only minor differences in the predictions. The gamma-Poisson model by design expects more variation, so the bands are wider and it is less influenced by Hawaii.
Overthinking: Non-centered islands
To build a non-centred version of m14.8, we can use the same general trick of converting the covariance matrix to a Cholesky factor and then multiplying that factor by the z-scores of each varying intercept.
m14.8_nc <- ulam(
  alist(
    T ~ dpois(lambda),
    lambda <- (a * P^b / g) * exp(k[society]),

    # non-centred Gaussian process prior
    transpars> vector[10]: k <<- L_SIGMA * z,
    vector[10]: z ~ normal(0, 1),
    transpars> matrix[10, 10]: L_SIGMA <<- cholesky_decompose(SIGMA),
    transpars> matrix[10, 10]: SIGMA <<- cov_GPL2(Dmat, etasq, rhosq, 0.01),

    c(a, b, g) ~ dexp(1),
    etasq ~ dexp(2),
    rhosq ~ dexp(0.5)
  ), data = dat_list, chains = 4, cores = 4, iter = 2000
)
The only new element is the Stan function cholesky_decompose, which takes a covariance (or correlation) matrix and returns its Cholesky factor; this factor is then mixed with the z-scores as before to produce varying effects on the right scale.
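A quick base-R sketch of the same trick on a toy 2x2 covariance matrix (t(chol(S)) is the lower-triangular factor that cholesky_decompose returns in Stan):
# sketch: Cholesky factor times z-scores reproduces the target covariance
S <- matrix(c(1.0, 0.6, 0.6, 2.0), nrow = 2)    # toy covariance matrix
L <- t(chol(S))                       # lower-triangular factor, so L %*% t(L) equals S
z <- matrix(rnorm(2 * 1e4), nrow = 2) # standard-normal z-scores
k <- L %*% z                          # correlated effects with covariance S
round(cov(t(k)), 2)                   # approximately reproduces S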
precis(m14.8_nc, 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## z[1] -0.44857780 0.75509781 -1.69018870 0.73445394 2256.053 0.9997379
## z[2] 0.40697274 0.77528926 -0.82402776 1.64434409 2876.586 1.0004887
## z[3] -0.23203707 0.78474366 -1.46311106 1.03875185 2982.079 0.9997127
## z[4] 0.95914206 0.63341441 -0.05344655 2.00559899 2082.502 1.0012509
## z[5] 0.33806350 0.67473647 -0.70258648 1.44704622 2155.464 1.0006269
## z[6] -1.10894661 0.66127657 -2.18098619 -0.11157950 2458.436 0.9992716
## z[7] 0.33823199 0.59195781 -0.57881384 1.28698992 2599.182 0.9995257
## z[8] -0.38339851 0.68833183 -1.49722571 0.68918192 3382.758 1.0005782
## z[9] 0.76977283 0.66824315 -0.29096948 1.84632387 2546.455 0.9999848
## z[10] -0.48336787 0.82189832 -1.80475985 0.82182301 1713.897 0.9992089
## g 0.60964146 0.58712233 0.07337928 1.70900816 2400.131 1.0012097
## b 0.28011788 0.08426083 0.15116574 0.41530461 1535.304 0.9994702
## a 1.38080619 1.02938885 0.23902873 3.31075386 2938.760 1.0002431
## etasq 0.18781103 0.19336866 0.02731325 0.53120359 1798.337 1.0001300
## rhosq 1.34458673 1.64452691 0.08017669 4.43102900 2501.140 1.0006755
## k[1] -0.16132361 0.30092868 -0.65528193 0.29354865 2249.331 0.9996267
## k[2] -0.01475467 0.29613259 -0.46422385 0.45087156 2071.460 1.0003354
## k[3] -0.06581361 0.28178206 -0.50474261 0.36941120 2153.604 1.0001215
## k[4] 0.34905185 0.26091675 -0.04440591 0.76763989 2073.114 0.9998542
## k[5] 0.07271611 0.25782238 -0.31096921 0.48422630 2384.534 1.0003823
## k[6] -0.38531671 0.26810390 -0.82133839 0.00343289 2308.289 1.0000403
## k[7] 0.13949256 0.25611124 -0.24600434 0.54368480 2328.070 0.9996791
## k[8] -0.21009284 0.25933642 -0.61876572 0.17323417 2443.664 1.0000403
## k[9] 0.25815154 0.24680168 -0.09692506 0.63825506 2147.277 0.9997921
## k[10] -0.17304571 0.34474979 -0.71380672 0.35213670 1590.601 0.9996458
We can see that the n_eff and Rhat values are better than before.
Example: Phylogenetic distance
In the case of species, “distance” can be measured by how long it has been since they shared a common ancestor. Species with more recent common ancestors tend to have more highly correlated traits.
Phylogenetic distance can have 2 important causal influences:
- Two species that only recently separated tend to be more similar, assuming their traits are not maintained by selection but rather drifting neutrally around.
- Phylogenetic distance is a proxy for unobserved variables that generate covariance among species, even when selection matters. Closely related species likely share more of these. E.g. all mammals nurse their young with milk, flight in birds influences many traits. When not observed, phylogenetic distance is a potentially useful proxy for these variables.
We want to study the causal influence of Group size (G) on brain size (B). Our hypothesis is that larger group size leads to larger brains (maybe living in large society leads to larger brains to cope with the complexity of cooperation and manipulation). This hypothesis implies a causal time series:
library(dagitty)
<- dagitty("dag{
gb G_1 [unobserved]
B_1 [unobserved]
U_1 [unobserved]
U_2 [unobserved]
B_2 <- G_1 -> G_2
B_1 -> B_2 <- U_1
U_2 <- U_1 -> G_2
}")
coordinates(gb) <- list(
x = c(G_1 = 0, G_2 = 1, B_1 = 0, B_2 = 1, U_1 = 0, U_2 = 1),
y = c(G_1 = 0, G_2 = 0, B_1 = 1, B_2 = 1, U_1 = 2, U_2 = 2)
)drawdag(gb)
The subscripts represent time; each variable influences itself at the next time step. There is a causal influence of \(G_1\) on \(B_2\): a species’ recent group size influenced its current brain size. We would like to estimate this, but there are possibly many unobserved confounds. We have not observed the past, and this is where the branching history of species can help: phylogenetic relationships, expressed as a distance, can be used to partially reconstruct the confounds. This depends on having both a good phylogeny and a good model of the relationship between phylogenetic distance and trait evolution.
This approach looks like the following:
<- dagitty("dag{
gb2 U [unobserved]
G -> B
G <- M -> B
G <- U -> B
P -> U -> M
}")
coordinates(gb2) <- list(
x = c(G = 0, B = 2, M = 1, U = 1, P = 2),
y = c(G = 0, B = 0, M = 1, U = 2, P = 2)
)drawdag(gb2)
We are interested in \(G \to B\). There is one confound we know for sure: body mass M, which possibly influences both G and B. The unobserved U could potentially influence all three. Finally, we let phylogenetic relationships (P) influence U. P can be considered causal in the sense that if we travelled back in time and delayed a split between two species, it would change the expected differences in their traits; so it is the timing of the splits, not the phylogeny itself, that is causal. P may also influence the other variables, but those arrows are omitted for clarity.
For the path \(G \to B\), there are backdoor paths into G through both M and U. We can condition on M to block the first. For U, we would like to use P to reconstruct the covariation that U induces between G and B.
GLMs that try to include the phylogenetic distance often go by the name PHYLOGENETIC REGRESSION. There are many variants and they all use some function of phylogenetic distance to model the covariation between species. We will look at the basic phylogenetic regression first before looking at more flexible methods of modelling phylogenetic distance.
We will load in the primates data for this:
Load Data
library(rethinking)
data("Primates301") # primates data
data("Primates301_nex") # phylogeny data
# plot it using ape package
library(ape)
plot(ladderize(Primates301_nex), type = 'fan', font = 1, no.margin = TRUE, label.offset = 1, cex = 0.5)
The phylogeny is plotted above. This tree will be used as a way to model unobserved confounds. We would also like to deal with the fact that some groups of closely related species may be over-represented in nature, which produces imbalance in sampling. Varying effects can help us here. But we will get the varying effects from the phylogenetic tree structure.
Ordinary Regression
Before using the tree we will run an ordinary regression analysing (log) brain size as a function of (log) group size and (log) body mass. We will build this regression in a way that will help later: we can think of the species as a single outcome variable with 301 values (some more similar than others). In a typical regression we condition on predictor variables, after which the model expects no remaining correlations among the residuals. So we can write such a model using one big multivariate outcome distribution.
\[ \begin{align} B &\sim \text{MVNormal}(\mu, S) \\ \mu_i &= \alpha + \beta_G G_i + \beta_M M_i \end{align} \] where B is a vector of species brain sizes and S is a covariance matrix with as many rows and columns as there are species. In an ordinary regression, this matrix takes the form:
\[ S = \sigma^2 I \] where \(\sigma\) is the same standard deviation we have been using and I is an identity matrix. We can think of I as a correlation matrix with all correlations zero, so multiplying it by the variance gives each species the same residual variance. It is still an ordinary linear regression, just written with a single multivariate outcome.
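A tiny simulation sketch of that claim (the value of sigma and the number of species here are arbitrary): drawing from a multivariate normal with covariance \(\sigma^2 I\) is the same as drawing independent Normal(0, \(\sigma\)) residuals for each species.
# sketch: MVNormal with covariance sigma^2 * I is just independent normal residuals
library(MASS)
sigma <- 0.5
n_spp <- 5
S <- sigma^2 * diag(n_spp)
sim <- mvrnorm(1e4, mu = rep(0, n_spp), Sigma = S)
round(apply(sim, 2, sd), 2)   # each marginal sd is about 0.5
round(cor(sim), 2)            # off-diagonal correlations are about 0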
d <- Primates301
d$name <- as.character(d$name)
dstan <- d[complete.cases(d$group_size, d$body, d$brain), ]
spp_obs <- dstan$name
We have 151 species left now.
dat_list <- list(
  N_spp = nrow(dstan),
  M = standardize(log(dstan$body)),
  B = standardize(log(dstan$brain)),
  G = standardize(log(dstan$group_size)),
  Imat = diag(nrow(dstan))
)

m14.9 <- ulam(
  alist(
    B ~ multi_normal(mu, SIGMA),
    mu <- a + bM*M + bG*G,
    matrix[N_spp, N_spp]: SIGMA <- Imat * sigma_sq,
    a ~ normal(0, 1),
    c(bM, bG) ~ normal(0, 0.5),
    sigma_sq ~ exponential(1)
  ), data = dat_list, chains = 4, cores = 4, refresh = 0
)
precis(m14.9)
## mean sd 5.5% 94.5% n_eff Rhat4
## a 0.0001238052 0.017444976 -0.02719666 0.02776568 1842.179 1.000279
## bG 0.1244050170 0.022918814 0.08871709 0.16109367 1536.011 1.000626
## bM 0.8921027765 0.022788289 0.85619534 0.92917625 1522.766 1.000544
## sigma_sq 0.0473658948 0.005489545 0.03921367 0.05668674 1832.660 1.001699
We get a reliably positive association between brain size and group size, as well as a strong association with body mass. But we can’t interpret these as causal associations, since we know there are confounds.
Brownian motion
We will now conduct two different kinds of phylogenetic regression. In both we will replace the covariance matrix S above with a different matrix that encodes some phylogenetic information. The first type is the oldest and most conservative, a BROWNIAN MOTION interpretation of the phylogeny that implies a very particular covariance matrix. Brownian motion just means Gaussian random walks. If species traits drift randomly with respect to one another after speciation, then the covariance between a pair of species ends up being linearly related to the phylogenetic branch distance between them. In reality, traits don’t evolve neutrally, and they evolve at different rates in different parts of the tree, so this is a simplifying assumption.
We will now compute the implied covariance matrix, the distance matrix and show the relation between them.
library(ape)
tree_trimmed <- keep.tip(Primates301_nex, spp_obs)
Rbm <- corBrownian(phy = tree_trimmed)
V <- vcv(Rbm)
Dmat <- cophenetic(tree_trimmed)
plot(Dmat, V, xlab = "phylogenetic distance", ylab = "covariance")
Each point represents a pair of species. The x-axis is the phylogenetic (patristic) distance; the y-axis is the covariance under the Brownian model. The two are just inversely (and linearly) related to each other. This can be seen even more clearly if we look at their heat maps side by side.
withr::with_par(
  list(mfrow = c(1, 2)),
  code = {
    image(V, col = hcl.colors(256, "Geyser"))
    image(Dmat, col = hcl.colors(256, "Geyser"))
  }
)
We can insert this new matrix into our regression. The model remains the same overall. We need to get the rows and columns in the same order as the rest of the data and then convert it to a correlation matrix (so we can estimate the residual variance) and then plug it in the place of the identity matrix.
# put species in right order
dat_list$V <- V[spp_obs, spp_obs]

# convert to correlation matrix
dat_list$R <- dat_list$V / max(V)

# Brownian motion model
m14.10 <- ulam(
  alist(
    B ~ multi_normal(mu, SIGMA),
    mu <- a + bM*M + bG*G,
    matrix[N_spp, N_spp]: SIGMA <- R * sigma_sq, # changed Imat to R
    a ~ normal(0, 1),
    c(bM, bG) ~ normal(0, 0.5),
    sigma_sq ~ exponential(1)
  ), data = dat_list, chains = 4, cores = 4, refresh = 0
)
precis(m14.10)
## mean sd 5.5% 94.5% n_eff Rhat4
## a -0.18870473 0.16635621 -0.45915572 0.08124923 2588.157 0.9995629
## bG -0.01298129 0.02056278 -0.04591848 0.01999901 2339.116 1.0010232
## bM 0.70092261 0.03671297 0.64230300 0.75872818 2114.627 0.9992867
## sigma_sq 0.16109552 0.01956030 0.13259837 0.19452634 1963.052 1.0002300
The effect of group size now becomes almost zero, with posterior mass on both sides of zero. The big change from the previous model suggests that there is a lot of clustering of brain size in the tree, and that this produces a spurious relationship with group size, which also clusters in the tree. The correlation matrix determines how the model uses this clustering.
OU Process
Brownian motion is a special kind of Gaussian process in which the covariance declines in a very rigid way with increasing distance. It is common to use PAGEL’S LAMBDA instead, which scales all of the between-species correlations by a common factor and thus still has the same rigidity. Another common alternative is the ORNSTEIN-UHLENBECK PROCESS (OU process), a damped Brownian motion that tends to return towards some mean(s). It constrains the variation, so the relationship between phylogenetic distance and covariance is no longer linear but instead declines exponentially.
The OU process defines the covariance between 2 species i and j as:
\[ K(i, j) = \eta^2 \exp (- \rho^2 D_{ij}) \]
This is an exponential distance kernel (the previous one was a squared, or quadratic, kernel). The covariance between points declines more rapidly at short distances, making for much less smooth functions, and such kernels are usually harder to fit to data because of that roughness.
We will next use the OU process kernel (more generally, a kernel based on the L1 distance), which ulam provides as cov_GPL1.
# add scaled and reordered distance matrix
dat_list$Dmat <- Dmat[spp_obs, spp_obs] / max(Dmat)

m14.11 <- ulam(
  alist(
    B ~ multi_normal(mu, SIGMA),
    mu <- a + bM*M + bG*G,
    matrix[N_spp, N_spp]: SIGMA <- cov_GPL1(Dmat, etasq, rhosq, 0.01),
    a ~ normal(0, 1),
    c(bM, bG) ~ normal(0, 0.5),
    etasq ~ half_normal(1, 0.25),
    rhosq ~ half_normal(3, 0.25)
  ), data = dat_list, chains = 4, cores = 4, refresh = 0
)
precis(m14.11)
## mean sd 5.5% 94.5% n_eff Rhat4
## a -0.06514939 0.074336822 -0.18356731 0.05001109 1652.041 1.0023274
## bG 0.04947867 0.023314769 0.01060433 0.08632936 1621.553 1.0011844
## bM 0.83316949 0.028836676 0.78620300 0.87741109 1993.051 0.9998163
## etasq 0.03504978 0.006681566 0.02553404 0.04643753 1993.149 0.9998754
## rhosq 2.79804603 0.250248403 2.40591979 3.20004077 1765.693 0.9985964
Now group size is seemingly associated with brain size again, though the association is small; most of the posterior mass is above zero. The difference in results must be due to the different covariance function used. So let’s look at the posterior covariance functions implied by etasq and rhosq.
post <- extract.samples(m14.11)

plot(NULL, xlim = c(0, max(dat_list$Dmat)), ylim = c(0, 1.5),
     xlab = "phylogenetic distance", ylab = "covariance")
# posterior
for(i in 1:30)
  curve(post$etasq[i] * exp(-post$rhosq[i]*x), add = TRUE, col = rangi2)
# prior mean and 89% interval
eta <- abs(rnorm(1e3, 1, 0.25))
rho <- abs(rnorm(1e3, 3, 0.25))
d_seq <- seq(from = 0, to = 1, length.out = 50)
K <- sapply(d_seq, function(x) eta*exp(-rho*x))
lines(d_seq, colMeans(K), lwd = 2)
shade(apply(K, 2, PI), d_seq)
text(0.5, 0.5, "prior")
text(0.2, 0.1, "posterior", col = rangi2)
- x-axis is the scaled phylogenetic distance with one as maximum distance
- y-axis is the covariance
- blue curves are 30 draws from the posterior distribution
- black curve is the prior mean
The posterior is hugging the bottom, indicating a very low covariance between species at any distance.
There just isn’t a lot of phylogenetic covariance for brain sizes, at least according to this model and these data. As a result, the phylogenetic distance doesn’t completely explain away the association between group size and brain size, as it did in the Brownian motion model.
Social relations as correlated varying effects
This section presents an example where we construct a custom covariance matrix with special scientific meaning.
We will work with the KosterLeckie data, which contains two tables with data on household gift exchanges from Koster and Leckie: kl_dyads and kl_households.
We will be using kl_dyads, in which each row is a dyad (group of two) of households from a community in Nicaragua. We will be modelling gift exchanges between these households. The outcome variables giftsAB and giftsBA give the counts of gifts in either direction within each dyad. The variables hidA and hidB tell us the household IDs in each dyad, and did is a unique dyad ID.
Here’s the raw distribution of gifts across dyads.
The correlation here is:
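A minimal sketch of how to make that plot and compute this correlation, using the kl_dyads columns described above:
# sketch: raw gift counts across dyads and their correlation
library(rethinking)
data(KosterLeckie)
plot(kl_dyads$giftsAB, kl_dyads$giftsBA,
     xlab = "gifts A to B", ylab = "gifts B to A")
cor(kl_dyads$giftsAB, kl_dyads$giftsBA)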
We cannot take this correlation as a measure of balance in the exchange of gifts. There will be households that tend to exchange gifts more regularly than others, and households that are poor will give fewer gifts and receive more, which is a factor independent of the A/B labels. To statistically separate balanced exchange from generalised differences in giving and receiving, we need a model that treats these as separate. The model considered here is often called a SOCIAL RELATIONS MODEL (SRM).
We will model gifts from household A to household B as a combination of varying effects specific to the household and to the dyad. The outcome variables are Poisson counts. Thus, the first part of the model is:
\[ \begin{align} y_{A \to B} &\sim \text{Poisson}(\lambda_{AB}) \\ \log \lambda_{AB} &= \alpha + g_A + r_B + d_{AB} \end{align} \]
We also have a similar model from B to A:
\[ \begin{align} y_{B \to A} &\sim \text{Poisson}(\lambda_{BA}) \\ \log \lambda_{BA} &= \alpha + g_B + r_A + d_{BA} \end{align} \]
Thus each household H needs varying effects, a \(g_H\) and a \(r_H\). Plus, each dyad AB has two varying effects, \(d_{AB}\) and \(d_{BA}\). We want to allow g and r params to be correlated (maybe people who give a lot also receive a lot). We also want to allow the dyad effects to be correlated (is there balance within dyads?). We can achieve these objectives with two different multi-normal priors.
The first one represents the population of household effects:
\[ \begin{align} \begin{pmatrix} g_i \\ r_i \end{pmatrix} &\sim \text{MVNormal} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix} , \begin{pmatrix} \sigma_g^2 & \sigma_g \sigma_r \rho_{gr} \\ \sigma_g \sigma_r \rho_{gr} & \sigma_r^2 \end{pmatrix} \right) \end{align} \]
For any household i, its pair of g and r parameters is assigned a prior with a typical covariance matrix built from two standard deviations and a correlation parameter.
The second multi-normal prior will represent the population of dyad effects:
\[ \begin{align} \begin{pmatrix} d_{ij} \\ d_{ji} \end{pmatrix} &\sim \text{MVNormal} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix} , \begin{pmatrix} \sigma_d^2 & \sigma_d^2 \rho_d \\ \sigma_d^2 \rho_d & \sigma_d^2 \end{pmatrix} \right) \end{align} \]
For a dyad with households i and j, there is a pair of dyad effects with a prior using another covariance matrix. Since the labels i and j are arbitrary, we use the same variance \(\sigma_d^2\) for both effects, along with a correlation parameter \(\rho_d\). When \(\rho_d\) is positive, then within a dyad, if one household gives more than expected the other tends to give more as well, and vice versa. If \(\rho_d\) is near zero, there is no such pattern within dyads.
While building the model, attention needs to be given to constructing the dyad covariance in a custom way.
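Here is a sketch of the ulam model that the following paragraphs describe, closely following the book’s social relations model; the list name kl_data and the model name m_srm are placeholders:
# sketch: social relations model with household and dyad varying effects
kl_data <- list(
  N = nrow(kl_dyads),
  N_households = max(kl_dyads$hidB),
  did = kl_dyads$did,
  hidA = kl_dyads$hidA,
  hidB = kl_dyads$hidB,
  giftsAB = kl_dyads$giftsAB,
  giftsBA = kl_dyads$giftsBA
)

m_srm <- ulam(
  alist(
    giftsAB ~ poisson(lambdaAB),
    giftsBA ~ poisson(lambdaBA),
    log(lambdaAB) <- a + gr[hidA, 1] + gr[hidB, 2] + d[did, 1],
    log(lambdaBA) <- a + gr[hidB, 1] + gr[hidA, 2] + d[did, 2],
    a ~ normal(0, 1),

    # gr matrix of varying effects: column 1 = giving, column 2 = receiving
    vector[2]:gr[N_households] ~ multi_normal(0, Rho_gr, sigma_gr),
    Rho_gr ~ lkj_corr(4),
    sigma_gr ~ exponential(1),

    # dyad effects (non-centred)
    transpars> matrix[N, 2]:d <- compose_noncentered(rep_vector(sigma_d, 2), L_Rho_d, z),
    matrix[2, N]:z ~ normal(0, 1),
    cholesky_factor_corr[2]:L_Rho_d ~ lkj_corr_cholesky(8),
    sigma_d ~ exponential(1),

    # compute correlation matrix for dyads
    gq> matrix[2, 2]:Rho_d <<- Chol_to_Corr(L_Rho_d)
  ), data = kl_data, chains = 4, cores = 4, iter = 2000
)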
In the code above, the first part just describes the linear model. The g and r effects are combined in a single matrix gr for all households, so we use gr[A, 1] and gr[B, 2] to pick out the respective effects. The dyad effects d[did, 1] and d[did, 2] are for household A and household B respectively (arbitrarily ordered that way).
The second block of the code defines the matrix of giving and receiving effects as a multivariate normal distribution: gr is a matrix with one row per household and 2 columns, the first column holding the giving varying effects and the second the receiving varying effects.
The third block defines the special dyad matrix. It has been written in non-centred form for efficiency. rep_vector(sigma_d, 2) copies the standard deviation into a vector of length 2 and composes the covariance matrix so that we end up with the same variance for both effects.
Finally, the last line computes the correlation matrix for the dyads. This is necessary because we are using a Cholesky-factor parametrisation: Chol_to_Corr multiplies a matrix by its own transpose to turn the Cholesky factor back into the original correlation matrix. The gq> at the start places the line in Stan’s generated quantities block, which holds code that is executed after each Hamiltonian transition.
The model has 1264 parameters, around 600 of which are dyad parameters. Let’s look at the covariance matrix for some useful information:
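A sketch of how one might pull out that information (again using the placeholder model name m_srm):
# sketch: correlation and standard deviations of the giving/receiving effects
precis(m_srm, depth = 3, pars = c("Rho_gr", "sigma_gr"))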
The correlation between giving and receiving is negative (89% PI between -0.77 and -0.1). Thus those who give more receive less. The standard deviation parameters sigma_gr[1] and sigma_gr[2] show evidence that rates of giving are more variable than rates of receiving.
Let’s plot these giving and receiving effects to visualise the covariance structure in the parameters. For each household, we want to compute the posterior predictive giving and receiving rates, across all dyads. We can do this by using the linear model and adding a to each giving or receiving parameter. This yields 4000 samples x 25 households: g is the matrix of giving rates, r is the matrix of receiving rates, and Eg_mu and Er_mu hold the means on the outcome scale for each household.
There is uncertainty in each of these parameters. We could look at this for a single household in the following manner:
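A sketch of these computations (assuming the placeholder model name m_srm from the sketch above, and the 25 households in the sample):
# sketch: posterior predictive giving and receiving rates (assumes m_srm from above)
post <- extract.samples(m_srm)
g <- sapply(1:25, function(i) post$a + post$gr[, i, 1])   # latent giving, 4000 x 25
r <- sapply(1:25, function(i) post$a + post$gr[, i, 2])   # latent receiving, 4000 x 25
Eg_mu <- apply(exp(g), 2, mean)   # mean giving rate per household, outcome scale
Er_mu <- apply(exp(r), 2, mean)   # mean receiving rate per household, outcome scale

# uncertainty for a single household, e.g. household 1
plot(density(exp(g[, 1])), main = "household 1", xlab = "rate")
lines(density(exp(r[, 1])), col = rangi2)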
Lots of variation in both the giving and receiving rate for Household 1.
A cleaner visualisation can be made with contours. The bivariate distribution of each household’s g and r is approximately Gaussian (on the latent scale of the linear model), so its shape can be described by an ellipse, which can then be projected onto the outcome scale (see the sketch below). Such a graph again shows that households that give more tend to receive less, and that giving varies more than receiving.
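A sketch of that contour plot, building on g, r, Eg_mu and Er_mu from the previous sketch and using the ellipse package (an assumption of this sketch, not something named above):
# sketch: 50% posterior ellipses per household, projected to the outcome scale
library(ellipse)
plot(NULL, xlim = c(0, 9), ylim = c(0, 9),
     xlab = "generalised giving", ylab = "generalised receiving")
abline(a = 0, b = 1, lty = 2)
for(h in 1:25){
  Sigma_h <- cov(cbind(g[, h], r[, h]))                 # latent-scale covariance
  mu_h <- c(mean(g[, h]), mean(r[, h]))
  ell <- ellipse(Sigma_h, centre = mu_h, level = 0.5)   # ellipse on the latent scale
  lines(exp(ell[, 1]), exp(ell[, 2]), col = col.alpha("black", 0.5))
}
points(Eg_mu, Er_mu, pch = 16)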
Next we will look at the dyad effect.
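A sketch of how one would inspect the dyad parameters (again using the placeholder name m_srm):
# sketch: dyad standard deviation and correlation
precis(m_srm, depth = 3, pars = c("sigma_d", "Rho_d"))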
The correlation is positive and strong, and the variation between dyads is greater than the variation in rates of giving and receiving among households. This implies that dyads are balanced: if one household gives less than average, the other household gives less too, after accounting for generalised giving and receiving (i.e. the “residual” gift giving is highly correlated). The raw dyad effects can be plotted to see the strength of this pattern:
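A sketch of that plot, using the posterior samples drawn earlier:
# sketch: posterior mean dyad effects, one point per dyad (assumes post from the earlier sketch)
dy1 <- apply(post$d[, , 1], 2, mean)   # effect of household A in each dyad
dy2 <- apply(post$d[, , 2], 2, mean)   # effect of household B in each dyad
plot(dy1, dy2, xlab = "household A in dyad", ylab = "household B in dyad")
abline(a = 0, b = 1, lty = 2)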
These are only the posterior means; there is again uncertainty within each dyad. But overall the balance is quite high. This could reflect reciprocity adjusted for overall wealth levels, or it could reflect types of relationships among households, like kin obligations (not included in the model here).
Rethinking: Where everybody knows your name
The gift example is a SOCIAL NETWORK model. In that light, an important feature missing from this model is the TRANSITIVITY of social relationships. If household A is friends with household B, and household C is friends with household B, then households A and C are more likely to be friends. This isn’t magic. It just arises from unobserved factors that create correlated relationships. For example, people who go to the same pub tend to know one another. The pub is an unmeasured confound for inferring causes of social relations. Models that can estimate and expect transitivity can be better. This can be done using something called a STOCHASTIC BLOCK MODEL. To fit such a model, however, we’ll need some techniques in the next chapter.