Statistical Rethinking Chapter 9

Metropolis Algorithm - King moving around island

A ring of islands with different population sizes. The king has to spend time on each island in proportion to its population.

  1. The king starts on some island; each week he decides either to travel to an adjacent island or to stay put.
  2. He tosses a coin and, depending on the result, makes one of the two adjacent islands the proposal.
  3. He compares the relative populations of the current island (c) and the proposal island (p). Then
    • if p > c, he moves to the proposed island with probability 1
    • otherwise, he moves to the proposed island with probability p / c

Simulating the above.

num_weeks <- 1e5
positions <- rep(0, num_weeks)
current <- 10
for(i in 1:num_weeks){
    ## record current position
    positions[i] <- current
    
    # flip coin to generate proposal (one of the two adjacent islands)
    proposal <- current + sample( c(-1, 1), size = 1)
    
    # boundary looping
    if( proposal < 1)  proposal <- 10
    if( proposal > 10) proposal <- 1
    
    # move?
    prob_move <- proposal / current
    current <- ifelse(runif(1) < prob_move, proposal, current)
}

Plotting the results:

library(withr)
with_par(list(mfrow = c(1, 2)), 
         code = {
             plot(1:500, positions[1:500], xlab = "week", ylab = "island", pch = 20, col = 'blue')
             
             plot(table(positions), xlab = 'island', ylab = 'number of weeks', col = '#1E59AE')
         }
)

Metropolis Algorithm in High Dimensions

For high-dimensional distributions, samples concentrate away from the mode. If for instance we assume a multivariate Gaussian distribution (of varying dimension, as plotted below) with mean 0, the mode is at 0; however, as shown below, the samples end up further and further away from it as the number of dimensions increases.

library(rethinking)
T <- 1e3   # number of samples for each dimensionality D

plot(NULL, xlim = c(0, 35), ylim = c(0, 1), xlab = "Radial distance from mode", ylab = "Density")

for(D in c(1, 10, 100, 1000)){
    
    Y <- rethinking::rmvnorm(T, rep(0, D), diag(D))
    rad_dist <- function(Y) sqrt(sum(Y^2))
    Rd <- sapply(1:T, function(i) rad_dist(Y[i,]))
    dens(Rd, add = TRUE)
    k <- density(Rd)
    text(x = summary(k$x)['Median'], y = summary(k$y)['Max.'], labels = D, adj = c(0, -1.3))
    
}

Hamiltonian Monte Carlo

The HMC algorithm needs five things to run:

  1. a function U that returns the negative log-probability of the data at the current position (parameter values)
  2. a function grad_U that returns the gradient of the negative log-probability at the current position
  3. a step size epsilon
  4. a count of leapfrog steps L
  5. a starting position current_q

The position is a vector of parameter values, and the gradient is a vector of the same length.
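The plotting code further below uses HMC2 from the rethinking package to run individual transitions. As a rough sketch of what one such transition does with these five ingredients (leapfrog integration followed by a Metropolis-style accept step), here is a simplified illustration in base R; it is not the package's actual implementation, and HMC_step is just a hypothetical name.

# one HMC transition: leapfrog integration + accept/reject (illustration only)
HMC_step <- function(U, grad_U, epsilon, L, current_q) {
  q <- current_q
  p <- rnorm(length(q), 0, 1)          # sample a random momentum
  current_p <- p

  # leapfrog: half step for momentum, L full steps for position
  p <- p - epsilon * grad_U(q) / 2
  for (i in 1:L) {
    q <- q + epsilon * p
    if (i != L) p <- p - epsilon * grad_U(q)
  }
  p <- p - epsilon * grad_U(q) / 2

  # accept or reject based on the change in total energy (potential U + kinetic)
  current_H  <- U(current_q) + sum(current_p^2) / 2
  proposed_H <- U(q) + sum(p^2) / 2
  if (runif(1) < exp(current_H - proposed_H)) q else current_q
}

With the U and U_gradient functions defined below, a call like HMC_step(U, U_gradient, 0.03, 11, c(-0.1, 0.2)) would produce one new position.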

The U function below is custom built for the 2D Gaussian example. It expresses the log posterior as follows:

\[ \sum_i \log p (y_i | \mu_y, 1) + \sum_i \log p (x_i | \mu_x, 1) + \log p (\mu_y | 0, 0.5) + \log p (\mu_x | 0, 0.5) \]

Here \(p(x | a, b)\) denotes the Gaussian density of x with mean a and standard deviation b.

# U needs to return neg-log probability
U <- function(q, a = 0, b = 1, k = 0, d = 1){
  muy <- q[1]
  mux <- q[2]
  U <- (sum(dnorm(y, muy, 1, log = TRUE)) + 
          sum(dnorm(x, mux, 1, log = TRUE)) + 
          dnorm(muy, a, b, log = TRUE) + 
          dnorm(mux, k, d, log = TRUE)
  )
  return(-U)
}

The derivative of the logarithm of any univariate Gaussian with mean a and standard deviation b with respect to a is:

\[ \frac{\partial \log N (y| a, b)}{\partial a} = \frac{y - a}{b ^ 2} \]

The derivative of a sum is a sum of derivatives, so the gradient becomes:

\[ \begin{aligned} \frac{\partial U}{\partial \mu_x} &= \frac{\partial \log N(x | \mu_x, 1)}{\partial \mu_x} + \frac{\partial \log N (\mu_x | 0, 0.5)}{\partial \mu_x} \\ &= \sum_i \frac{x_i - \mu_x}{1^2} + \frac{0 - \mu_x}{0.5^2} \end{aligned} \]

Similar for \(\mu_y\).

# gradient function
# need vector of partial derivatives of U with respect to vector q

U_gradient <- function(q, a = 0 , b = 1, k = 0, d = 1){
  muy <- q[1]
  mux <- q[2]
  
  G1 <- sum(y - muy) + (a - muy) / b^2     # dU/dmuy
  G2 <- sum(x - mux) + (k - mux) / d^2     # dU/dmux
  return( c(-G1, -G2))                     # negative because energy is neg-log-prob
}

# test data
set.seed(7)
y <- rnorm(50)
x <- rnorm(50)
x <- as.numeric(scale(x))
y <- as.numeric(scale(y))
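As a quick sanity check (my addition, not from the book), we can compare the analytic gradient against a central finite-difference approximation of U at an arbitrary point; the two rows should match closely.

# compare analytic gradient of U with a numerical approximation
q_test <- c(0.1, -0.2)
eps <- 1e-6
num_grad <- sapply(1:2, function(j) {
  dq <- rep(0, 2); dq[j] <- eps
  (U(q_test + dq) - U(q_test - dq)) / (2 * eps)
})
rbind(analytic = U_gradient(q_test), numeric = num_grad)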

Plotting code:

library(shape) # for fancy arrows
Q <- list()
Q$q <- c(-0.1,0.2)
pr <- 0.3
plot( NULL , ylab="muy" , xlab="mux" , xlim=c(-pr,pr) , ylim=c(-pr,pr) )
step <- 0.03

L <- 11 # 0.03/28 for U-turns --- 11 for working example
n_samples <- 4
path_col <- col.alpha("black",0.5)
points( Q$q[1] , Q$q[2] , pch=4 , col="black" )

for ( i in 1:n_samples ) {
  Q <- HMC2( U , U_gradient , step , L , Q$q )
  if ( n_samples < 10 ) {
    for ( j in 1:L ) {
      
      # kinetic energy
      K0 <- sum(Q$ptraj[j,]^2)/2                        
      
      lines( Q$traj[ j : (j + 1) , 1 ] , Q$traj[ j : (j + 1) , 2] , col = path_col , lwd = 1 + 2 * K0 )
    }
    
    points( Q$traj[1:L+1,] , pch=16 , col="white" , cex=0.35 )
    Arrows( Q$traj[L,1] , Q$traj[L,2] , Q$traj[L+1,1] , Q$traj[L+1,2] ,
            arr.length=0.35 , arr.adj = 0.7 )
    text( Q$traj[L+1,1] , Q$traj[L+1,2] , i , cex=0.8 , pos=4 , offset=0.4 )
  }
  
  points( Q$traj[L+1,1] , Q$traj[L+1,2] , pch=ifelse( Q$accept==1 , 16 , 1 ) ,
          col=ifelse( abs(Q$dH)>0.1 , "red" , "black" ) )
}

ULAM

Redoing the rugged example, using quap first.

data(rugged) 
d <- rugged

# make log version of outcome 
d$log_gdp <- log( d$rgdppc_2000 )

# extract countries with GDP data 
dd <- d[ complete.cases(d$rgdppc_2000) , ]

# rescale variables 
dd$log_gdp_std <- dd$log_gdp / mean(dd$log_gdp) 
dd$rugged_std <- dd$rugged / max(dd$rugged)

# make vars to index Africa (1) or not (2)
dd$cid <- ifelse(dd$cont_africa == 1, 1, 2)

m8.3 <- quap( 
    alist(
        log_gdp_std ~ dnorm( mu , sigma ) ,
        mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) ,
        a[cid] ~ dnorm( 1 , 0.1 ) ,
        b[cid] ~ dnorm( 0 , 0.3 ) ,
        sigma ~ dexp( 1 )
    ) , data=dd 
)
precis(m8.3, depth = 2)
##             mean          sd        5.5%       94.5%
## a[1]   0.8865640 0.015674552  0.86151302  0.91161495
## a[2]   1.0505666 0.009935872  1.03468714  1.06644602
## b[1]   0.1324981 0.074199237  0.01391344  0.25108286
## b[2]  -0.1426057 0.054745410 -0.23009945 -0.05511197
## sigma  0.1094859 0.005934188  0.10000194  0.11896990

Now we will fit this using HMC.

We can use the same formula list as before, but we need to make two changes:

  • Perform any variable preprocessing beforehand. For instance, for polynomial terms, compute the powers ahead of time rather than wasting computation inside every HMC chain.
  • Make a new trimmed-down data frame (here, a list) with only the variables of interest. (This can be helpful if some of the unused variables contain NA values, in which case Stan will fail.)
dat_slim <- list(
  log_gdp_std = dd$log_gdp_std, 
  rugged_std = dd$rugged_std, 
  cid = as.integer(dd$cid)
)
str(dat_slim)
## List of 3
##  $ log_gdp_std: num [1:170] 0.88 0.965 1.166 1.104 0.915 ...
##  $ rugged_std : num [1:170] 0.138 0.553 0.124 0.125 0.433 ...
##  $ cid        : int [1:170] 1 2 2 2 2 2 2 2 2 1 ...

With lists, we can have variables of different lengths (which will be useful in multilevel models).
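For instance, a toy list (not part of the rugged data) can hold components of different lengths, which a data frame cannot:

str(list(y = rnorm(5), group_index = c(1L, 2L), N = 5L))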

Sample from posterior using ulam

m9.1 <- ulam( 
  alist(
    log_gdp_std ~ dnorm( mu , sigma ) ,
    mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) , 
    a[cid] ~ dnorm( 1 , 0.1 ) , 
    b[cid] ~ dnorm( 0 , 0.3 ) , 
    sigma ~ dexp( 1 )
  ) , data=dat_slim , chains=1 
)
## 
## SAMPLING FOR MODEL '4e638bc82c4247257e76c1202e0a46c7' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%]  (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%]  (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%]  (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%]  (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%]  (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%]  (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%]  (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%]  (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%]  (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%]  (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.107 seconds (Warm-up)
## Chain 1:                0.064 seconds (Sampling)
## Chain 1:                0.171 seconds (Total)
## Chain 1:

ulam converts the formula above to Stan code. The converted code can be viewed using stancode(m9.1).

We can see a summary using precis, just like with quap.

precis(m9.1, depth = 2)
##             mean          sd         5.5%       94.5%    n_eff     Rhat4
## a[1]   0.8862969 0.016440675  0.859857484  0.91166170 537.7017 1.0006389
## a[2]   1.0502280 0.009349024  1.035621917  1.06428462 670.3211 0.9997121
## b[1]   0.1355869 0.078702003  0.008922524  0.26239381 538.1529 1.0006114
## b[2]  -0.1414820 0.055829945 -0.231168322 -0.04905019 601.8853 0.9980997
## sigma  0.1116130 0.006568439  0.101756685  0.12238406 332.5514 0.9989425

There are two new columns. Details later.

  • n_eff - effectively the number of independent samples we managed to get
  • Rhat4 - an estimate of how well the Markov chain converged to the target distribution. It should approach 1.00 from above. The 4 indicates the fourth generation of the statistic. (Soon it may be Rhat5!)

Running multiple chains in parallel

m9.1 <- ulam( 
  alist(
    log_gdp_std ~ dnorm( mu , sigma ) ,
    mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) , 
    a[cid] ~ dnorm( 1 , 0.1 ) , 
    b[cid] ~ dnorm( 0 , 0.3 ) , 
    sigma ~ dexp( 1 )
  ) , data=dat_slim , chains=4, cores = 4 
)

show will tell us the model formula and how much time it took for each chain to run.

show(m9.1)
## Hamiltonian Monte Carlo approximation
## 2000 samples from 4 chains
## 
## Sampling durations (seconds):
##         warmup sample total
## chain:1   0.05   0.04  0.09
## chain:2   0.05   0.03  0.08
## chain:3   0.06   0.04  0.10
## chain:4   0.06   0.03  0.09
## 
## Formula:
## log_gdp_std ~ dnorm(mu, sigma)
## mu <- a[cid] + b[cid] * (rugged_std - 0.215)
## a[cid] ~ dnorm(1, 0.1)
## b[cid] ~ dnorm(0, 0.3)
## sigma ~ dexp(1)

2000 samples from all 4 chains. Each chain ran for 1000 iterations, but by default the first 500 are used for adaptation (warmup), leaving 500 samples per chain for inference.

precis(m9.1, 2)
##             mean          sd        5.5%       94.5%    n_eff     Rhat4
## a[1]   0.8868582 0.015129035  0.86291601  0.91099559 2666.123 0.9994667
## a[2]   1.0505662 0.010172545  1.03433807  1.06699208 3017.206 0.9997286
## b[1]   0.1340069 0.072391547  0.02013722  0.25062543 2260.158 0.9998661
## b[2]  -0.1404236 0.055773679 -0.22707214 -0.05420522 2464.641 0.9989013
## sigma  0.1114573 0.006239214  0.10189516  0.12161957 2367.507 0.9985288

The effective number of samples is more than 2000! This is because the adaptive sampler Stan uses is very good and generates sequential samples that can be better than uncorrelated: they can be anti-correlated. Essentially it explores the posterior distribution so efficiently that it can beat random sampling. It's Jaynes's principle in action.

Jaynes's principle - if there is a random way of doing something, then there is a non-randomised way that delivers better performance but requires more thought.

Visualisation

pairs(m9.1)

The posterior looks approximately multivariate Gaussian.

Checking the chain

Chain visualisations can help spot problems.

First we see a trace plot, which plots the samples in sequential order, joined by a line.

traceplot(m9.1)
## [1] 1000
## [1] 1
## [1] 1000

By default this shows the progress of all chains; we can use chains = 1 to show only one chain, as in the example below.
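For example, relying on the chains argument as described above:

traceplot(m9.1, chains = 1)   # trace plot for the first chain only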

We look for 3 things in these chains:

  1. stationarity - path of each chain staying within the same high-probability portion of the posterior distribution
  2. good mixing - means that the chain rapidly explores the full region. It does not slowly wander, but rapidly zig-zags
  3. convergence - multiple independent chains stick around the same region of high probability.

With multiple chains, they plot over each other, and it can become hard to see independent pathologies in individual chains. Another visualisation method is a plot of the distribution of the ranked samples, a TRACE RANK PLOT or TRANK PLOT.

This takes all the samples for each individual parameter and ranks them: the lowest sample gets rank 1, the largest gets the maximum rank (the number of samples across all chains). Then we draw a histogram of these ranks for each individual chain. If the chains are exploring the same space efficiently, the histograms should be similar to one another and largely overlapping.
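To make the ranking step concrete, here is a rough hand-rolled version for a single parameter. It is an illustration only, not trankplot's implementation, and it assumes the underlying stanfit object is reachable as m9.1@stanfit (the same slot used later with pairs).

# hand-rolled trank histograms for sigma (illustration only)
draws <- rstan::extract(m9.1@stanfit, pars = "sigma", permuted = FALSE)  # iterations x chains x 1
ranks <- array(rank(draws), dim = dim(draws)[1:2])                       # rank jointly across chains
withr::with_par(list(mfrow = c(2, 2)), code = {
  for (ch in 1:ncol(ranks))
    hist(ranks[, ch], breaks = 20, main = paste("chain", ch), xlab = "rank")
})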

trankplot(m9.1)

The horizontal axis is rank, from 1 to the number of samples (2000 in this example). The vertical axis is the frequency of ranks in each bin of the histogram.

Markov Chain Parameters

How many samples?

  1. What is important is the effective number of samples, not the raw number.
    • iter - default 1000
    • warmup - default iter/2, so 500 warmup samples + 500 real samples to use for inference
    • n_eff can be thought of as the length of a Markov chain with no autocorrelation that would provide the same quality of estimate as your chain. It can be larger than the length of the chain if the sequential samples are anti-correlated in the right way. n_eff is only an estimate.
  2. What do we want to know?
    • Only posterior means - a couple hundred could be enough.
    • exact shape in the extreme tails of the posterior, 99th percentile, etc. - then many more
    • tail ESS - the effective sample size (similar to n_eff) in the tails of the posterior
    • warmup - with Stan models, up to half of the total samples can be devoted to warmup. For simpler models, much less is sufficient. (A sketch of setting iter and warmup explicitly follows this list.)
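A minimal sketch (my addition) of setting iter and warmup explicitly in a ulam call. The values and the name m9.1_long are arbitrary, and this assumes ulam forwards iter and warmup to Stan as the defaults above describe.

m9.1_long <- ulam(
  alist(
    log_gdp_std ~ dnorm( mu , sigma ) ,
    mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) ,
    a[cid] ~ dnorm( 1 , 0.1 ) ,
    b[cid] ~ dnorm( 0 , 0.3 ) ,
    sigma ~ dexp( 1 )
  ) , data=dat_slim , chains=4 , cores=4 ,
  iter=2000 , warmup=1000   # 1000 warmup + 1000 sampling iterations per chain
)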

How many chains?

  1. When initially debugging, use a single chain. Some error messages don't display unless using one chain.
  2. When deciding whether the chains are valid, we need more than one chain.
  3. During the final run, from which we will make inferences, only one chain is required. Using more is fine too.

one short chain to debug, four chains for verification and inference

Rhat

Gelman-Rubin convergence diagnostic \(\hat R\). If n_eff is lower than the actual number of iterations, it means the chains are inefficient but possibly still okay. When Rhat is above 1.00 it usually indicates that the chain has not yet converged, and the samples may not be trustworthy. If you draw more iterations, it could be fine, or it could never converge. See the Stan user manual for more details. It's important, however, not to rely too much on these diagnostics. Like all heuristics, there are cases in which they provide poor advice. For example, Rhat can reach 1.00 even for an invalid chain. So view it as a signal of danger, but never of safety. For conventional models, these metrics typically work well.

Taming a wild chain

Broad flat regions of the posterior density can cause problems. They occur when we use flat priors. This can generate a wild wandering Markov chain that erratically samples extremely positive and extremely negative parameter values.

In the example below, we try to estimate the mean and standard deviation of two Gaussian observations, -1 and 1, but use flat priors.

y <- c(-1, 1)
set.seed(11)
m9.2 <- ulam(
  alist(
    y ~ dnorm(mu, sigma), 
    mu <- alpha,
    alpha ~ dnorm(0, 1000), 
    sigma ~ dexp(0.0001)
  ), data = list(y = y), chains = 3
)
## 
## SAMPLING FOR MODEL 'd26c527083e7eda89b17a8c2eccd6019' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%]  (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%]  (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%]  (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%]  (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%]  (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%]  (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%]  (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%]  (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%]  (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%]  (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.226 seconds (Warm-up)
## Chain 1:                0.142 seconds (Sampling)
## Chain 1:                0.368 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'd26c527083e7eda89b17a8c2eccd6019' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 0 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 2: Iteration: 100 / 1000 [ 10%]  (Warmup)
## Chain 2: Iteration: 200 / 1000 [ 20%]  (Warmup)
## Chain 2: Iteration: 300 / 1000 [ 30%]  (Warmup)
## Chain 2: Iteration: 400 / 1000 [ 40%]  (Warmup)
## Chain 2: Iteration: 500 / 1000 [ 50%]  (Warmup)
## Chain 2: Iteration: 501 / 1000 [ 50%]  (Sampling)
## Chain 2: Iteration: 600 / 1000 [ 60%]  (Sampling)
## Chain 2: Iteration: 700 / 1000 [ 70%]  (Sampling)
## Chain 2: Iteration: 800 / 1000 [ 80%]  (Sampling)
## Chain 2: Iteration: 900 / 1000 [ 90%]  (Sampling)
## Chain 2: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.281 seconds (Warm-up)
## Chain 2:                0.05 seconds (Sampling)
## Chain 2:                0.331 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'd26c527083e7eda89b17a8c2eccd6019' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 0 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 3: Iteration: 100 / 1000 [ 10%]  (Warmup)
## Chain 3: Iteration: 200 / 1000 [ 20%]  (Warmup)
## Chain 3: Iteration: 300 / 1000 [ 30%]  (Warmup)
## Chain 3: Iteration: 400 / 1000 [ 40%]  (Warmup)
## Chain 3: Iteration: 500 / 1000 [ 50%]  (Warmup)
## Chain 3: Iteration: 501 / 1000 [ 50%]  (Sampling)
## Chain 3: Iteration: 600 / 1000 [ 60%]  (Sampling)
## Chain 3: Iteration: 700 / 1000 [ 70%]  (Sampling)
## Chain 3: Iteration: 800 / 1000 [ 80%]  (Sampling)
## Chain 3: Iteration: 900 / 1000 [ 90%]  (Sampling)
## Chain 3: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.383 seconds (Warm-up)
## Chain 3:                0.074 seconds (Sampling)
## Chain 3:                0.457 seconds (Total)
## Chain 3:
## Warning: There were 42 divergent transitions after warmup. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: The largest R-hat is 1.11, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#r-hat
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
precis(m9.2)
##             mean        sd       5.5%     94.5%     n_eff   Rhat4
## alpha  -2.773519  324.9502 -524.00894  505.2174  85.81054 1.02871
## sigma 578.510588 1392.9147   13.45968 2160.1994 253.31969 1.01652

alpha should have been close to zero, but instead we get wild values and implausibly large intervals. sigma is also far too big. The n_eff and Rhat diagnostics don't look good either. We drew 1500 samples in total (3 chains), but the effective sample size is tiny.

The warning about divergent transitions after warmup tells us there are problems with the chains.

For simple models, changing the adapt_delta control parameter will usually remove the divergent transitions. We can try adding control = list(adapt_delta = 0.99) to the ulam call; the default value is 0.95. (It won't help in this case.)
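For example (as noted above, this particular change won't rescue m9.2, but it shows where the argument goes; m9.2_b is just a hypothetical name):

m9.2_b <- ulam(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- alpha,
    alpha ~ dnorm(0, 1000),
    sigma ~ dexp(0.0001)
  ), data = list(y = y), chains = 3,
  control = list(adapt_delta = 0.99)   # raise the target acceptance rate above the 0.95 default
)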

A second warning advises examining the pairs plot (this is Stan's pairs method, not rethinking's).

pairs(m9.2@stanfit)

This is like ulam's pairs plot, but with divergent transitions coloured in red.

traceplot(m9.2)
## [1] 1000
## [1] 1
## [1] 1000

The chains drift around and occasionally spike to extreme values. These are not healthy chains.

trankplot(m9.2)

The rank histograms spend long periods with one chain above or below the others, indicating poor exploration of the posterior.

To tame this chain, we only need slightly more informative (weakly informative) priors.

set.seed(11)
m9.3 <- ulam(
  alist(
    y ~ dnorm(mu, sigma), 
    mu <- alpha, 
    alpha ~ dnorm(1, 10), 
    sigma ~ dexp(1)
  ), data = list(y = y), chains = 3
)
## 
## SAMPLING FOR MODEL '593afb1307e5b69f89a21ad207f22894' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%]  (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%]  (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%]  (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%]  (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%]  (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%]  (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%]  (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%]  (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%]  (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%]  (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.04 seconds (Warm-up)
## Chain 1:                0.033 seconds (Sampling)
## Chain 1:                0.073 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL '593afb1307e5b69f89a21ad207f22894' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 0 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 2: Iteration: 100 / 1000 [ 10%]  (Warmup)
## Chain 2: Iteration: 200 / 1000 [ 20%]  (Warmup)
## Chain 2: Iteration: 300 / 1000 [ 30%]  (Warmup)
## Chain 2: Iteration: 400 / 1000 [ 40%]  (Warmup)
## Chain 2: Iteration: 500 / 1000 [ 50%]  (Warmup)
## Chain 2: Iteration: 501 / 1000 [ 50%]  (Sampling)
## Chain 2: Iteration: 600 / 1000 [ 60%]  (Sampling)
## Chain 2: Iteration: 700 / 1000 [ 70%]  (Sampling)
## Chain 2: Iteration: 800 / 1000 [ 80%]  (Sampling)
## Chain 2: Iteration: 900 / 1000 [ 90%]  (Sampling)
## Chain 2: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.044 seconds (Warm-up)
## Chain 2:                0.035 seconds (Sampling)
## Chain 2:                0.079 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL '593afb1307e5b69f89a21ad207f22894' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 0 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 3: Iteration: 100 / 1000 [ 10%]  (Warmup)
## Chain 3: Iteration: 200 / 1000 [ 20%]  (Warmup)
## Chain 3: Iteration: 300 / 1000 [ 30%]  (Warmup)
## Chain 3: Iteration: 400 / 1000 [ 40%]  (Warmup)
## Chain 3: Iteration: 500 / 1000 [ 50%]  (Warmup)
## Chain 3: Iteration: 501 / 1000 [ 50%]  (Sampling)
## Chain 3: Iteration: 600 / 1000 [ 60%]  (Sampling)
## Chain 3: Iteration: 700 / 1000 [ 70%]  (Sampling)
## Chain 3: Iteration: 800 / 1000 [ 80%]  (Sampling)
## Chain 3: Iteration: 900 / 1000 [ 90%]  (Sampling)
## Chain 3: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.048 seconds (Warm-up)
## Chain 3:                0.036 seconds (Sampling)
## Chain 3:                0.084 seconds (Total)
## Chain 3:
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
precis(m9.3)
##             mean       sd      5.5%    94.5%    n_eff    Rhat4
## alpha 0.09368772 1.291687 -1.586733 2.282757 294.9239 1.005993
## sigma 1.57242939 0.924527  0.693663 3.092426 254.8605 1.003729

Much better results this time.

traceplot(m9.3)
## [1] 1000
## [1] 1
## [1] 1000
trankplot(m9.3)

Both look better too.

pairs(m9.3@stanfit)

post <- extract.samples(m9.3)
withr::with_par(
  list(mfrow = c(1, 2)), 
  code = {
    plot(NULL, xlim = c(-15, 15), ylim = c(0, 0.4), xlab = "alpha", ylab = "Density")
    dens(post$alpha, lwd = 2, lty = 1, col = 'navyblue', add = TRUE)
    dens(rnorm(1e3, 1, 10), lwd = 2, lty = 2, add = TRUE)
    text(2, 0.25, labels = "Posterior", col = 'navyblue', adj = 0)
    text(10, 0.05, labels = "Prior", adj = 0)
    
    plot(NULL, xlim = c(0, 10), ylim = c(0, 0.7), xlab = "sigma", ylab = "Density")
    dens(post$sigma, lwd = 2, lty = 1, col = 'navyblue', add = TRUE)
    dens(rexp(1e3, 1), lwd = 2, lty = 2, add = TRUE)
  }
)

Non identifiable parameters

With highly correlated predictors there is a problem of non-identifiable parameters.

To construct a similar non-identifiable situation, we simulate 100 observations from a Gaussian distribution with mean zero and standard deviation 1.

set.seed(41)
y <- rnorm(100, mean = 0, sd = 1)

Then we fit the following model:

\[ \begin{align} y_i &\sim Normal(\mu, \sigma) \\ \mu &= \alpha_1 + \alpha_2 \\ \alpha_1 &\sim Normal(0, 1000) \\ \alpha_2 &\sim Normal(0, 1000) \\ \sigma &\sim Exponential(1) \end{align} \]

\(\alpha_1\) and \(\alpha_2\) cannot be identified separately; only their sum can be identified, and it should be close to zero after estimation.

set.seed(384)
m9.4 <- ulam(
  alist(
    y ~ dnorm(mu, sigma), 
    mu <- a1 + a2, 
    a1 ~ dnorm(0, 1000), 
    a2 ~ dnorm(0, 1000), 
    sigma ~ dexp(1)
  ), data = list(y = y), chains = 3
)
## 
## SAMPLING FOR MODEL 'e6372181749abf126ca67665f3f2df39' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%]  (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%]  (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%]  (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%]  (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%]  (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%]  (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%]  (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%]  (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%]  (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%]  (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 3.464 seconds (Warm-up)
## Chain 1:                3.746 seconds (Sampling)
## Chain 1:                7.21 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'e6372181749abf126ca67665f3f2df39' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 0 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 2: Iteration: 100 / 1000 [ 10%]  (Warmup)
## Chain 2: Iteration: 200 / 1000 [ 20%]  (Warmup)
## Chain 2: Iteration: 300 / 1000 [ 30%]  (Warmup)
## Chain 2: Iteration: 400 / 1000 [ 40%]  (Warmup)
## Chain 2: Iteration: 500 / 1000 [ 50%]  (Warmup)
## Chain 2: Iteration: 501 / 1000 [ 50%]  (Sampling)
## Chain 2: Iteration: 600 / 1000 [ 60%]  (Sampling)
## Chain 2: Iteration: 700 / 1000 [ 70%]  (Sampling)
## Chain 2: Iteration: 800 / 1000 [ 80%]  (Sampling)
## Chain 2: Iteration: 900 / 1000 [ 90%]  (Sampling)
## Chain 2: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 3.41 seconds (Warm-up)
## Chain 2:                3.871 seconds (Sampling)
## Chain 2:                7.281 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'e6372181749abf126ca67665f3f2df39' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 0 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 3: Iteration: 100 / 1000 [ 10%]  (Warmup)
## Chain 3: Iteration: 200 / 1000 [ 20%]  (Warmup)
## Chain 3: Iteration: 300 / 1000 [ 30%]  (Warmup)
## Chain 3: Iteration: 400 / 1000 [ 40%]  (Warmup)
## Chain 3: Iteration: 500 / 1000 [ 50%]  (Warmup)
## Chain 3: Iteration: 501 / 1000 [ 50%]  (Sampling)
## Chain 3: Iteration: 600 / 1000 [ 60%]  (Sampling)
## Chain 3: Iteration: 700 / 1000 [ 70%]  (Sampling)
## Chain 3: Iteration: 800 / 1000 [ 80%]  (Sampling)
## Chain 3: Iteration: 900 / 1000 [ 90%]  (Sampling)
## Chain 3: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 3.254 seconds (Warm-up)
## Chain 3:                3.592 seconds (Sampling)
## Chain 3:                6.846 seconds (Total)
## Chain 3:
## Warning: There were 1140 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: The largest R-hat is 2.47, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#r-hat
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
precis(m9.4)
##             mean           sd        5.5%      94.5%     n_eff    Rhat4
## a1    -37.528345 430.02223526 -605.995946 677.529388  1.853644 3.346379
## a2     37.716575 430.02596973 -677.288723 606.263939  1.853613 3.346438
## sigma   1.023165   0.06499162    0.921962   1.133209 16.062994 1.297504

The estimates look suspicious: the standard deviations are enormous, n_eff is tiny, and Rhat is far above 1. This is because we cannot estimate a1 and a2 separately, only their sum.

We also get a warning that a number of transitions after warmup exceeded the maximum treedepth. These indicate inefficient, but not necessarily broken, chains.

traceplot(m9.4)
## [1] 1000
## [1] 1
## [1] 1000

trankplot(m9.4)

The chains are neither stationary, nor mixing, nor converging. Seeing chains like this means: do not use these chains.

Again, weakly regularizing priors help.

m9.5 <- ulam(
  alist(
    y ~ dnorm(mu, sigma), 
    mu <- a1 + a2, 
    a1 ~ dnorm(0, 10), 
    a2 ~ dnorm(0, 10), 
    sigma ~ dexp(1)
  ), data = list(y = y), chains = 3
)
## 
## SAMPLING FOR MODEL '60743fdda5f692fb7902c70d7f6403b7' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%]  (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%]  (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%]  (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%]  (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%]  (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%]  (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%]  (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%]  (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%]  (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%]  (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 1.268 seconds (Warm-up)
## Chain 1:                1.114 seconds (Sampling)
## Chain 1:                2.382 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL '60743fdda5f692fb7902c70d7f6403b7' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 0 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 2: Iteration: 100 / 1000 [ 10%]  (Warmup)
## Chain 2: Iteration: 200 / 1000 [ 20%]  (Warmup)
## Chain 2: Iteration: 300 / 1000 [ 30%]  (Warmup)
## Chain 2: Iteration: 400 / 1000 [ 40%]  (Warmup)
## Chain 2: Iteration: 500 / 1000 [ 50%]  (Warmup)
## Chain 2: Iteration: 501 / 1000 [ 50%]  (Sampling)
## Chain 2: Iteration: 600 / 1000 [ 60%]  (Sampling)
## Chain 2: Iteration: 700 / 1000 [ 70%]  (Sampling)
## Chain 2: Iteration: 800 / 1000 [ 80%]  (Sampling)
## Chain 2: Iteration: 900 / 1000 [ 90%]  (Sampling)
## Chain 2: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 1.201 seconds (Warm-up)
## Chain 2:                1.068 seconds (Sampling)
## Chain 2:                2.269 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL '60743fdda5f692fb7902c70d7f6403b7' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 0 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:   1 / 1000 [  0%]  (Warmup)
## Chain 3: Iteration: 100 / 1000 [ 10%]  (Warmup)
## Chain 3: Iteration: 200 / 1000 [ 20%]  (Warmup)
## Chain 3: Iteration: 300 / 1000 [ 30%]  (Warmup)
## Chain 3: Iteration: 400 / 1000 [ 40%]  (Warmup)
## Chain 3: Iteration: 500 / 1000 [ 50%]  (Warmup)
## Chain 3: Iteration: 501 / 1000 [ 50%]  (Sampling)
## Chain 3: Iteration: 600 / 1000 [ 60%]  (Sampling)
## Chain 3: Iteration: 700 / 1000 [ 70%]  (Sampling)
## Chain 3: Iteration: 800 / 1000 [ 80%]  (Sampling)
## Chain 3: Iteration: 900 / 1000 [ 90%]  (Sampling)
## Chain 3: Iteration: 1000 / 1000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 1.185 seconds (Warm-up)
## Chain 3:                1.384 seconds (Sampling)
## Chain 3:                2.569 seconds (Total)
## Chain 3:
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
precis(m9.5)
##             mean         sd        5.5%     94.5%    n_eff    Rhat4
## a1     0.3261916 6.99143112 -11.2516530 10.758395 252.7479 1.013808
## a2    -0.1309445 6.98726504 -10.5498784 11.398724 252.4394 1.013793
## sigma  1.0341261 0.07176822   0.9262933  1.152745 490.5554 1.003023

Much better now. a1 and a2 still can’t be identified separately, but the results are not as terrible as before.
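As a quick check (my addition, not from the book), the sum a1 + a2 is well identified even though the individual parameters are not; its posterior concentrates near the sample mean of y:

post <- extract.samples(m9.5)
dens(post$a1 + post$a2, lwd = 2)               # posterior of the sum
precis(data.frame(a_sum = post$a1 + post$a2))  # narrow interval near zero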

traceplot(m9.5)
## [1] 1000
## [1] 1
## [1] 1000

trankplot(m9.5)

Much better here too; the chains seem stationary and mix well.