Statistical Rethinking Chapter 9
Metropolis Algorithm - King moving around islands
A ring of islands with different population sizes. The king has to spend time on each island in proportion to its population.
- The king is on some island; each week he decides to either travel to an adjacent island or stay put.
- He tosses a coin and, depending on the result, makes one of the two adjacent islands the target.
- He counts the relative population of the current island (c) and the proposal island (p). Then:
  - if p > c, move to the proposed island with probability 1
  - otherwise, move to the proposed island with probability p / c
Simulating the above.
num_weeks <- 1e5
positions <- rep(0, num_weeks)
current <- 10
for(i in 1:num_weeks){
  ## record current position
  positions[i] <- current

  # flip coin to generate proposal (one of adjacent islands)
  proposal <- current + sample( c(-1, 1), size = 1)

  # boundary looping
  if( proposal < 1) proposal <- 10
  if( proposal > 10) proposal <- 1

  # move?
  prob_move <- proposal / current
  current <- ifelse(runif(1) < prob_move, proposal, current)
}
Plotting the results:
library(withr)
with_par(list(mfrow = c(1, 2)),
code = {
plot(1:500, positions[1:500], xlab = "week", ylab = "island", pch = 20, col = 'blue')
plot(table(positions), xlab = 'island', ylab = 'number of weeks', col = '#1E59AE')
} )
Metropolis Algorithm
For high-dimensional distributions, samples do not come from near the mode. If for instance we assume a multivariate Gaussian distribution (of varying dimension, as plotted below) with mean 0, the mode is at 0; however, as shown below, the samples fall further and further away from it as the number of dimensions increases.
library(rethinking)
D <- 10
T <- 1e3
# Y <- rethinking::rmvnorm(T, rep(0, D), diag(D))
# rad_dist <- function(Y) sqrt(sum(Y^2))
# Rd <- sapply(1:T, function(i) rad_dist(Y[i,]))
# dens(Rd)

plot(NULL, xlim = c(0, 35), ylim = c(0, 1), xlab = "Radial distance from mode")
for(D in c(1, 10, 100, 1000)){
  Y <- rethinking::rmvnorm(T, rep(0, D), diag(D))
  rad_dist <- function(Y) sqrt(sum(Y^2))
  Rd <- sapply(1:T, function(i) rad_dist(Y[i,]))
  dens(Rd, add = TRUE)
  k <- density(Rd)
  text(x = summary(k$x)['Median'], y = summary(k$y)['Max.'], labels = D, adj = c(0, -1.3))
}
Hamiltonian Monte Carlo
The HMC algorithm needs five things to run:
- a function U that returns the negative log-probability of the data at the current position (parameter values)
- a function grad_U that returns the gradient of the negative log-probability at the current position
- a step size epsilon
- a count of leapfrog steps L
- a starting position current_q
The position is a vector of parameter values; the gradient is a vector of the same length.
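As a minimal sketch of how these five inputs fit together, here is one HMC transition written out in plain R, following the standard leapfrog recipe (the HMC2 function from rethinking, used below, follows the same logic while also recording the trajectory for plotting; the name HMC_step is just for illustration).
# one HMC transition (illustrative sketch, not the rethinking implementation)
HMC_step <- function(U, grad_U, epsilon, L, current_q) {
  q <- current_q
  p <- rnorm(length(q), 0, 1)          # random flick of momentum
  current_p <- p
  # leapfrog integration: half step for momentum, L full steps for position
  p <- p - epsilon * grad_U(q) / 2
  for (i in 1:L) {
    q <- q + epsilon * p
    if (i != L) p <- p - epsilon * grad_U(q)
  }
  p <- p - epsilon * grad_U(q) / 2
  p <- -p                              # negate momentum so the proposal is symmetric
  # Metropolis accept/reject step on the total energy (potential + kinetic)
  current_H  <- U(current_q) + sum(current_p^2) / 2
  proposed_H <- U(q) + sum(p^2) / 2
  if (runif(1) < exp(current_H - proposed_H)) q else current_q
}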
Below, a U function is custom built for the 2D Gaussian example. It expresses the log posterior as follows:
\[ \sum_i \log p (y_i | \mu_y, 1) + \sum_i \log p (x_i | \mu_x, 1) + \log p (\mu_y | 0, 0.5) + \log p (\mu_x | 0, 0.5) \]
Here \(p(x \mid a, b)\) denotes the Gaussian density of x with mean a and standard deviation b.
# U needs to return neg-log probability
U <- function(q, a = 0, b = 1, k = 0, d = 1){
  muy <- q[1]
  mux <- q[2]
  U <- (sum(dnorm(y, muy, 1, log = TRUE)) +
          sum(dnorm(x, mux, 1, log = TRUE)) +
          dnorm(muy, a, b, log = TRUE) +
          dnorm(mux, k, d, log = TRUE)
  )
  return(-U)
}
The derivative of the logarithm of any univariate Gaussian with mean a and standard deviation b with respect to a is:
\[ \frac{\partial \log N (y| a, b)}{\partial a} = \frac{y - a}{b ^ 2} \]
The derivative of a sum is the sum of the derivatives, so the gradient becomes: \[\begin{aligned} \frac{\partial U}{\partial \mu_x} &= \sum_i \frac{\partial \log N(x_i \mid \mu_x, 1)}{\partial \mu_x} + \frac{\partial \log N (\mu_x \mid 0, 0.5)}{\partial \mu_x} \\ &= \sum_i \frac{x_i - \mu_x}{1^2} + \frac{0 - \mu_x}{0.5^2} \end{aligned}\]
Similar for \(\mu_y\).
# gradient function
# need vector of partial derivatives of U with respect to vector q
U_gradient <- function(q, a = 0 , b = 1, k = 0, d = 1){
  muy <- q[1]
  mux <- q[2]

  G1 <- sum(y - muy) + (a - muy) / b^2 # dU/dmuy
  G2 <- sum(x - mux) + (k - mux) / d^2 # dU/dmux
  return( c(-G1, -G2)) # negative because energy is neg-log-prob
}
# test data
set.seed(7)
y <- rnorm(50)
x <- rnorm(50)
x <- as.numeric(scale(x))
y <- as.numeric(scale(y))
Plotting code:
library(shape) # for fancy arrows
Q <- list()
Q$q <- c(-0.1, 0.2)
pr <- 0.3
plot( NULL , ylab="muy" , xlab="mux" , xlim=c(-pr,pr) , ylim=c(-pr,pr) )
step <- 0.03
L <- 11 # 0.03/28 for U-turns --- 11 for working example
n_samples <- 4
path_col <- col.alpha("black",0.5)
points( Q$q[1] , Q$q[2] , pch=4 , col="black" )
for ( i in 1:n_samples ) {
  Q <- HMC2( U , U_gradient , step , L , Q$q )
  if ( n_samples < 10 ) {
    for ( j in 1:L ) {
      # kinetic energy
      K0 <- sum(Q$ptraj[j,]^2)/2
      lines( Q$traj[ j:(j+1) , 1 ] , Q$traj[ j:(j+1) , 2 ] , col = path_col , lwd = 1 + 2*K0 )
    }
    points( Q$traj[1:L+1,] , pch=16 , col="white" , cex=0.35 )
    Arrows( Q$traj[L,1] , Q$traj[L,2] , Q$traj[L+1,1] , Q$traj[L+1,2] ,
            arr.length=0.35 , arr.adj = 0.7 )
    text( Q$traj[L+1,1] , Q$traj[L+1,2] , i , cex=0.8 , pos=4 , offset=0.4 )
  }
  points( Q$traj[L+1,1] , Q$traj[L+1,2] , pch=ifelse( Q$accept==1 , 16 , 1 ) ,
          col=ifelse( abs(Q$dH)>0.1 , "red" , "black" ) )
}
ULAM
Redoing the rugged example, using quap first.
data(rugged)
d <- rugged

# make log version of outcome
d$log_gdp <- log( d$rgdppc_2000 )

# extract countries with GDP data
dd <- d[ complete.cases(d$rgdppc_2000) , ]

# rescale variables
dd$log_gdp_std <- dd$log_gdp / mean(dd$log_gdp)
dd$rugged_std <- dd$rugged / max(dd$rugged)

# make var to index Africa (1) or not (2)
dd$cid <- ifelse(dd$cont_africa == 1, 1, 2)
m8.3 <- quap(
  alist(
    log_gdp_std ~ dnorm( mu , sigma ) ,
    mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) ,
    a[cid] ~ dnorm( 1 , 0.1 ) ,
    b[cid] ~ dnorm( 0 , 0.3 ) ,
    sigma ~ dexp( 1 )
  ) , data=dd )
precis(m8.3, depth = 2)
## mean sd 5.5% 94.5%
## a[1] 0.8865640 0.015674552 0.86151302 0.91161495
## a[2] 1.0505666 0.009935872 1.03468714 1.06644602
## b[1] 0.1324981 0.074199237 0.01391344 0.25108286
## b[2] -0.1426057 0.054745410 -0.23009945 -0.05511197
## sigma 0.1094859 0.005934188 0.10000194 0.11896990
Now we will fit this using HMC.
We can use the same formula list as before, but we need to make two changes:
- Perform any variable preprocessing beforehand. For instance, compute polynomial powers up front rather than wasting computing power recomputing them inside every HMC chain.
- Make a new trimmed data frame with only the variables of interest. (This can be helpful because if some of the unused variables contain NA values, stan will fail.)
dat_slim <- list(
  log_gdp_std = dd$log_gdp_std,
  rugged_std = dd$rugged_std,
  cid = as.integer(dd$cid)
)
str(dat_slim)
## List of 3
## $ log_gdp_std: num [1:170] 0.88 0.965 1.166 1.104 0.915 ...
## $ rugged_std : num [1:170] 0.138 0.553 0.124 0.125 0.433 ...
## $ cid : int [1:170] 1 2 2 2 2 2 2 2 2 1 ...
With lists, we can have variables of different lengths (which will be useful in multilevel models).
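A tiny illustration of that point, with made-up variables just to show that a list, unlike a data frame, tolerates elements of different lengths:
# a data.frame would refuse elements of unequal length; a list is fine
str(list(y = rnorm(5), group_means = rnorm(2)))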
Sample from posterior using ulam
m9.1 <- ulam(
  alist(
    log_gdp_std ~ dnorm( mu , sigma ) ,
    mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) ,
    a[cid] ~ dnorm( 1 , 0.1 ) ,
    b[cid] ~ dnorm( 0 , 0.3 ) ,
    sigma ~ dexp( 1 )
  ) , data=dat_slim , chains=1 )
##
## SAMPLING FOR MODEL '4e638bc82c4247257e76c1202e0a46c7' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.107 seconds (Warm-up)
## Chain 1: 0.064 seconds (Sampling)
## Chain 1: 0.171 seconds (Total)
## Chain 1:
ulam converts the above formula to Stan code. The converted code can be viewed using stancode(m9.1).
We can see a summary using precis, just like with quap.
precis(m9.1, depth = 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8862969 0.016440675 0.859857484 0.91166170 537.7017 1.0006389
## a[2] 1.0502280 0.009349024 1.035621917 1.06428462 670.3211 0.9997121
## b[1] 0.1355869 0.078702003 0.008922524 0.26239381 538.1529 1.0006114
## b[2] -0.1414820 0.055829945 -0.231168322 -0.04905019 601.8853 0.9980997
## sigma 0.1116130 0.006568439 0.101756685 0.12238406 332.5514 0.9989425
There are two new columns. Details later.
- n_eff - effectively the number of independent samples we managed to get
- Rhat4 - an estimate of how well the Markov chain converged to the target distribution. It should approach 1 from above. The 4 indicates the fourth generation of this diagnostic. (Soon it will be Rhat5!)
Running multiple chains in parallel
m9.1 <- ulam(
  alist(
    log_gdp_std ~ dnorm( mu , sigma ) ,
    mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) ,
    a[cid] ~ dnorm( 1 , 0.1 ) ,
    b[cid] ~ dnorm( 0 , 0.3 ) ,
    sigma ~ dexp( 1 )
  ) , data=dat_slim , chains=4 , cores=4 )
show will tell us the model formula and how much time it took for each chain to run.
show(m9.1)
## Hamiltonian Monte Carlo approximation
## 2000 samples from 4 chains
##
## Sampling durations (seconds):
## warmup sample total
## chain:1 0.05 0.04 0.09
## chain:2 0.05 0.03 0.08
## chain:3 0.06 0.04 0.10
## chain:4 0.06 0.03 0.09
##
## Formula:
## log_gdp_std ~ dnorm(mu, sigma)
## mu <- a[cid] + b[cid] * (rugged_std - 0.215)
## a[cid] ~ dnorm(1, 0.1)
## b[cid] ~ dnorm(0, 0.3)
## sigma ~ dexp(1)
2000 samples from all 4 chains. Each chain produces 1000 samples, but by default the first 500 are used for adaptation.
precis(m9.1, 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8868582 0.015129035 0.86291601 0.91099559 2666.123 0.9994667
## a[2] 1.0505662 0.010172545 1.03433807 1.06699208 3017.206 0.9997286
## b[1] 0.1340069 0.072391547 0.02013722 0.25062543 2260.158 0.9998661
## b[2] -0.1404236 0.055773679 -0.22707214 -0.05420522 2464.641 0.9989013
## sigma 0.1114573 0.006239214 0.10189516 0.12161957 2367.507 0.9985288
The effective number of samples is more than 2000!! This is because the adaptive sampler Stan uses is "very good" and generates sequential samples that are better than uncorrelated. Essentially, it can explore the posterior distribution so efficiently that it can beat random sampling. It's Jaynes's principle in action.
Jaynes's principle - if there is a random way of doing something, then there is a non-randomised way that delivers better performance but requires more thought.
Visualisation
pairs(m9.1)
The posterior seems approximately multivariate Gaussian.
Checking the chain
Chain visualisations can help spot problems.
First we see a trace plot, which plots the samples in sequential order, joined by a line.
traceplot(m9.1)
## [1] 1000
## [1] 1
## [1] 1000
By default this shows the progress of all chains. We can use chains = 1 to get it for one chain only.
We look for 3 things in these chains:
- stationarity - path of each chain staying within the same high-probability portion of the posterior distribution
- good mixing - means that the chain rapidly explores the full region. It does not slowly wander, but rapidly zig-zags
- convergence - multiple independent chains stick around the same region of high probability.
With multiple chains, they plot over each other, and it can become hard to see independent pathologies in individual chains. Another visualisation method is a plot of the distribution of the ranked samples, a TRACE RANK PLOT or TRANK PLOT.
This takes all the samples for each individual parameter and ranks them: the lowest sample gets rank 1, the largest gets the maximum rank (the number of samples across all chains). Then we draw a histogram of these ranks for each individual chain. If the chains are exploring the same space efficiently, the histograms should be similar to one another and largely overlapping.
trankplot(m9.1)
The horizontal axis is rank, from 1 to the number of samples (2000 in this example). The vertical axis is the frequency of ranks in each bin of the histogram.
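As a rough sketch of what trankplot is doing under the hood, assuming m9.1 was fit with 4 chains as above, the ranks and per-chain histograms can be computed by hand for a single parameter (the rstan::extract call and the choice of 20 bins are illustrative, not what trankplot necessarily uses):
# pooled ranks of sigma, keeping the chain structure
draws <- rstan::extract(m9.1@stanfit, pars = "sigma", permuted = FALSE)[, , 1] # iterations x chains
ranks <- matrix(rank(draws), nrow = nrow(draws))
breaks <- seq(0, length(ranks), length.out = 21)   # 20 bins over all ranks
plot(NULL, xlim = c(0, length(ranks)), ylim = c(0, 2 * nrow(draws) / 20),
     xlab = "rank", ylab = "frequency")
for (ch in 1:ncol(ranks)) {
  h <- hist(ranks[, ch], breaks = breaks, plot = FALSE)
  lines(h$mids, h$counts, type = "s", col = ch)    # one histogram outline per chain
}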
Markov Chain Parameters
How many samples?
- What is important is the effective number of samples, not the raw number of samples.
- iter - default 1000
- warmup - default iter/2, so 500 warmup samples + 500 real samples to use for inference
- n_eff can be thought of as the length of a Markov chain with no autocorrelation that would provide the same quality of estimate as your chain. It can be larger than the length of the chain if the sequential samples are anti-correlated in the right way. n_eff is only an estimate.
- What do we want to know?
  - Only posterior means - a couple hundred samples could be enough.
  - The exact shape in the extreme tails of the posterior, the 99th percentile, etc. - then many more are needed.
  - tail ESS - the effective sample size (similar to n_eff) in the tails of the posterior.
- warmup - with Stan models, half of the total samples can be devoted to warmup. For simpler models, much less is sufficient.
How many chains?
- When initially debugging, use a single chain. Some error messages don't display unless using one chain.
- When deciding whether the chains are valid, we need more than one chain.
- During the final run, from which we will make inferences, only one chain is required. Using more is fine too.
one short chain to debug, four chains for verification and inference
Rhat
The Gelman-Rubin convergence diagnostic \(\hat R\). If n_eff is lower than the actual number of iterations, it means the chains are inefficient but still okay. When Rhat is above 1.00, it usually indicates that the chain has not yet converged, and the samples may not be trustworthy. If you draw more iterations, it could be fine, or it could never converge. See the Stan user manual for more details. It's important, however, not to rely too much on these diagnostics. Like all heuristics, there are cases in which they provide poor advice. For example, Rhat can reach 1.00 even for an invalid chain. So view it perhaps as a signal of danger, but never of safety. For conventional models, these metrics typically work well.
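The same diagnostics that precis reports can also be pulled straight from the underlying stanfit object, for instance (the column names here are rstan's, which labels the convergence diagnostic Rhat rather than Rhat4):
# effective sample size and Rhat for every parameter, straight from rstan
summary(m9.1@stanfit)$summary[, c("n_eff", "Rhat")]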
Taming a wild chain
Broad flat regions of the posterior density can cause problems. They occur when we use flat priors. This can generate a wild wandering Markov chain that erratically samples extremely positive and extremely negative parameter values.
In the example below, we try to estimate the mean and standard deviation of two Gaussian observations, -1 and 1, but use flat priors.
y <- c(-1, 1)
set.seed(11)
m9.2 <- ulam(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- alpha,
    alpha ~ dnorm(0, 1000),
    sigma ~ dexp(0.0001)
  ), data = list(y = y), chains = 3 )
##
## SAMPLING FOR MODEL 'd26c527083e7eda89b17a8c2eccd6019' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.226 seconds (Warm-up)
## Chain 1: 0.142 seconds (Sampling)
## Chain 1: 0.368 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'd26c527083e7eda89b17a8c2eccd6019' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 0 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.281 seconds (Warm-up)
## Chain 2: 0.05 seconds (Sampling)
## Chain 2: 0.331 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL 'd26c527083e7eda89b17a8c2eccd6019' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 0 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.383 seconds (Warm-up)
## Chain 3: 0.074 seconds (Sampling)
## Chain 3: 0.457 seconds (Total)
## Chain 3:
## Warning: There were 42 divergent transitions after warmup. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: The largest R-hat is 1.11, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#r-hat
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
precis(m9.2)
## mean sd 5.5% 94.5% n_eff Rhat4
## alpha -2.773519 324.9502 -524.00894 505.2174 85.81054 1.02871
## sigma 578.510588 1392.9147 13.45968 2160.1994 253.31969 1.01652
alpha should have been close to zero. However, we get crazy values and implausibly large intervals. Even sigma is too big. The n_eff and Rhat diagnostics don't look good either. We drew 1500 samples in total (3 chains), but the effective sample size is very small.
The warning about divergent transitions after warmup tells us there are problems with the chains.
For simple models, changing the adapt_delta control parameter will usually remove the divergent transitions. We can try adding control = list(adapt_delta = 0.99) to the ulam call; the default value is 0.95. (It won't be of help in this case.)
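For instance, such a refit would look roughly like this (m9.2b is a hypothetical name; as noted above, a higher adapt_delta will not rescue these flat priors):
m9.2b <- ulam(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- alpha,
    alpha ~ dnorm(0, 1000),
    sigma ~ dexp(0.0001)
  ), data = list(y = y), chains = 3,
  control = list(adapt_delta = 0.99) )   # default adapt_delta is 0.95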
A second warning advises examining the pairs plot (this is Stan's pairs, not rethinking's).
pairs(m9.2@stanfit)
This is like ulam's pairs plot, but with divergent transitions coloured in red.
traceplot(m9.2)
## [1] 1000
## [1] 1
## [1] 1000
The chains drift around and occasionally spike to extreme values. Not a healthy set of chains.
trankplot(m9.2)
The rank histograms spend long periods with one chain above or below the others. This indicates poor exploration of the posterior.
To tame this chain, we only need to use weakly informative priors instead of the flat ones.
set.seed(11)
m9.3 <- ulam(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- alpha,
    alpha ~ dnorm(1, 10),
    sigma ~ dexp(1)
  ), data = list(y = y), chains = 3 )
##
## SAMPLING FOR MODEL '593afb1307e5b69f89a21ad207f22894' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.04 seconds (Warm-up)
## Chain 1: 0.033 seconds (Sampling)
## Chain 1: 0.073 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL '593afb1307e5b69f89a21ad207f22894' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 0 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.044 seconds (Warm-up)
## Chain 2: 0.035 seconds (Sampling)
## Chain 2: 0.079 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL '593afb1307e5b69f89a21ad207f22894' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 0 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.048 seconds (Warm-up)
## Chain 3: 0.036 seconds (Sampling)
## Chain 3: 0.084 seconds (Total)
## Chain 3:
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
precis(m9.3)
## mean sd 5.5% 94.5% n_eff Rhat4
## alpha 0.09368772 1.291687 -1.586733 2.282757 294.9239 1.005993
## sigma 1.57242939 0.924527 0.693663 3.092426 254.8605 1.003729
Much better results this time.
traceplot(m9.3)
## [1] 1000
## [1] 1
## [1] 1000
trankplot(m9.3)
Both seem better too.
pairs(m9.3@stanfit)
post <- extract.samples(m9.3)
withr::with_par(
  list(mfrow = c(1, 2)),
  code = {
    plot(NULL, xlim = c(-15, 15), ylim = c(0, 0.4), xlab = "alpha", ylab = "Density")
    dens(post$alpha, lwd = 2, lty = 1, col = 'navyblue', add = TRUE)
    dens(rnorm(1e3, 1, 10), lwd = 2, lty = 2, add = TRUE)
    text(2, 0.25, labels = "Posterior", col = 'navyblue', adj = 0)
    text(10, 0.05, labels = "Prior", adj = 0)
    plot(NULL, xlim = c(0, 10), ylim = c(0, 0.7), xlab = "sigma", ylab = "Density")
    dens(post$sigma, lwd = 2, lty = 1, col = 'navyblue', add = TRUE)
    dens(rexp(1e3, 1), lwd = 2, lty = 2, add = TRUE)
  } )
Non identifiable parameters
With highly correlated predictors there is a problem of non-identifiable parameters.
To construct a similar non-identifiable situation, we simulate 100 observations from a Gaussian distribution with mean zero and standard deviation 1.
set.seed(41)
y <- rnorm(100, mean = 0, sd = 1)
Then we fit the following model:
\[ \begin{align} y_i &\sim Normal(\mu, \sigma) \\ \mu &= \alpha_1 + \alpha_2 \\ \alpha_1 &\sim Normal(0, 1000) \\ \alpha_2 &\sim Normal(0, 1000) \\ \sigma &\sim Exponential(1) \end{align} \]
\(\alpha_1\) and \(\alpha_2\) can't be identified individually; only their sum can be identified, and it should be close to zero after estimation.
set.seed(384)
m9.4 <- ulam(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a1 + a2,
    a1 ~ dnorm(0, 1000),
    a2 ~ dnorm(0, 1000),
    sigma ~ dexp(1)
  ), data = list(y = y), chains = 3 )
##
## SAMPLING FOR MODEL 'e6372181749abf126ca67665f3f2df39' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 3.464 seconds (Warm-up)
## Chain 1: 3.746 seconds (Sampling)
## Chain 1: 7.21 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'e6372181749abf126ca67665f3f2df39' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 0 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 3.41 seconds (Warm-up)
## Chain 2: 3.871 seconds (Sampling)
## Chain 2: 7.281 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL 'e6372181749abf126ca67665f3f2df39' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 0 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 3.254 seconds (Warm-up)
## Chain 3: 3.592 seconds (Sampling)
## Chain 3: 6.846 seconds (Total)
## Chain 3:
## Warning: There were 1140 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: The largest R-hat is 2.47, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#r-hat
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
precis(m9.4)
## mean sd 5.5% 94.5% n_eff Rhat4
## a1 -37.528345 430.02223526 -605.995946 677.529388 1.853644 3.346379
## a2 37.716575 430.02596973 -677.288723 606.263939 1.853613 3.346438
## sigma 1.023165 0.06499162 0.921962 1.133209 16.062994 1.297504
The estimates look suspicious: the standard deviations are very wide, n_eff is very small, and Rhat is high too. This is because we can't estimate a1 and a2 simultaneously, only their sum.
We also get a warning that some transitions after warmup exceeded the maximum treedepth. These indicate inefficient, but not broken, chains.
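If we only wanted to silence the treedepth warning, a hypothetical refit could raise the cap via Stan's control list as the warning suggests (m9.4b is a made-up name; this makes the sampler less likely to hit the limit, but does nothing about the underlying identifiability problem):
m9.4b <- ulam(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a1 + a2,
    a1 ~ dnorm(0, 1000),
    a2 ~ dnorm(0, 1000),
    sigma ~ dexp(1)
  ), data = list(y = y), chains = 3,
  control = list(max_treedepth = 15) )   # default max_treedepth is 10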
traceplot(m9.4)
## [1] 1000
## [1] 1
## [1] 1000
trankplot(m9.4)
The chains are neither stationary, nor mixing, nor converging. Seeing chains like this means: do not use these samples.
Again, weakly regularizing priors are helpful.
m9.5 <- ulam(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a1 + a2,
    a1 ~ dnorm(0, 10),
    a2 ~ dnorm(0, 10),
    sigma ~ dexp(1)
  ), data = list(y = y), chains = 3 )
##
## SAMPLING FOR MODEL '60743fdda5f692fb7902c70d7f6403b7' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 1.268 seconds (Warm-up)
## Chain 1: 1.114 seconds (Sampling)
## Chain 1: 2.382 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL '60743fdda5f692fb7902c70d7f6403b7' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 0 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 1.201 seconds (Warm-up)
## Chain 2: 1.068 seconds (Sampling)
## Chain 2: 2.269 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL '60743fdda5f692fb7902c70d7f6403b7' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 0 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 1.185 seconds (Warm-up)
## Chain 3: 1.384 seconds (Sampling)
## Chain 3: 2.569 seconds (Total)
## Chain 3:
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
precis(m9.5)
## mean sd 5.5% 94.5% n_eff Rhat4
## a1 0.3261916 6.99143112 -11.2516530 10.758395 252.7479 1.013808
## a2 -0.1309445 6.98726504 -10.5498784 11.398724 252.4394 1.013793
## sigma 1.0341261 0.07176822 0.9262933 1.152745 490.5554 1.003023
Much better now. a1 and a2 still can’t be identified separately, but the results are not as terrible as before.
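We can also check that, even though a1 and a2 are individually poorly identified, their sum is well identified. A quick posterior computation, assuming the fit above (post5 is just an illustrative name):
post5 <- extract.samples(m9.5)
precis(data.frame(a_sum = post5$a1 + post5$a2))  # the sum is tightly centred near the sample mean of y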
traceplot(m9.5)
## [1] 1000
## [1] 1
## [1] 1000
trankplot(m9.5)
Much better here too. The chains seem stationary and are mixing well.