Statistical Rethinking Chapter 9 End of Chapter Questions
Easy Questions
9E1
Which of the following is a requirement of the simple Metropolis algorithm?
- The parameters must be discrete.
- The likelihood function must be Gaussian.
- The proposal distribution must be symmetric
Only the third option is a requirement: the proposal distribution must be symmetric. That is, the chance of proposing a move from A to B must be the same as the chance of proposing a move from B to A.
9E2
Gibbs sampling is more efficient than the Metropolis algorithm. How does it achieve this extra efficiency? Are there any limitations to the Gibbs sampling strategy?
Gibbs sampling uses adaptive proposals, in which the current parameter values help adjust the proposal distribution. It does this by exploiting conjugate pairs of prior and likelihood, which have analytical solutions for the conditional posterior of each individual parameter, so no proposals are wasted. The limitations are that we may not want to use conjugate priors, and that Gibbs sampling becomes very inefficient for complex models with many, highly correlated parameters.
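A minimal sketch of the conjugacy idea (my own illustration, not from the chapter): with a Binomial likelihood and a Beta prior, the conditional posterior is again a Beta distribution with known parameters, so a Gibbs-style update can draw from it exactly instead of proposing and rejecting. Gibbs sampling applies this kind of exact conditional update to each parameter in turn.
# sketch (not from the chapter): Beta prior + Binomial likelihood -> Beta posterior,
# so the update for this parameter can be drawn exactly, no proposal/rejection needed
w <- 6; n <- 9                    # e.g. 6 waters in 9 globe tosses (as in Chapter 2)
prior_a <- 1; prior_b <- 1        # flat Beta(1, 1) prior
p_draws <- rbeta(1e4, prior_a + w, prior_b + n - w)   # exact draws from the conditional posterior
plot(density(p_draws), main = "Exact conjugate update for p")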
9E3
Which sort of parameters can Hamiltonian Monte Carlo not handle? Can you explain why?
Discrete parameters. HMC needs to evaluate the posterior (and its gradient) at any point in the continuous space defined by the parameter values. In the frictionless-particle analogy, the random initial flick and the curvature of the surface determine the path, and the particle can come to rest at any point. With discrete parameters there is no continuous surface to glide over, so this does not work.
9E4
Explain the difference between the effective number of samples, n_eff as calculated by Stan, and the actual number of samples.
The actual number of samples is, as the name states, the number of samples the HMC chains actually drew. n_eff, the effective number of samples, can be thought of as the length of a Markov chain with no autocorrelation that would provide the same quality of estimate as your chain; it is only an estimate. Because sequential samples are usually autocorrelated, n_eff is typically smaller than the actual number of samples, but it can be larger than the chain length if sequential samples are anti-correlated in the right way.
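A rough back-of-the-envelope illustration of why autocorrelation shrinks n_eff (my own sketch; Stan's estimator is more sophisticated): a common approximation is n_eff ≈ N / (1 + 2 Σ ρ_k), the chain length divided by a factor built from the chain's autocorrelations.
# sketch: a strongly autocorrelated AR(1) series has far fewer effective samples than draws
set.seed(1)
n_draws <- 1e4
x <- numeric(n_draws)
for (i in 2:n_draws) x[i] <- 0.9 * x[i - 1] + rnorm(1)   # AR(1) chain with strong autocorrelation
rho <- acf(x, lag.max = 100, plot = FALSE)$acf[-1]       # sample autocorrelations at lags 1..100
n_draws / (1 + 2 * sum(rho))                             # crude n_eff estimate, roughly n_draws / 19 here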
9E5
Which value should Rhat approach, when a chain is sampling the posterior distribution correctly?
1 from above.
9E6
Sketch a good trace plot for a Markov chain, one that is effectively sampling from the posterior distribution. What is good about its shape? Then sketch a trace plot for a malfunctioning Markov chain. What about its shape indicates malfunction?
Same as in notes.
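As a stand-in for the hand-drawn sketches, here is a small simulation (my own addition, not an actual MCMC run) of the two shapes. A healthy chain is stationary (it stays around a stable central value) and mixes well (rapid zig-zag through that region); a malfunctioning chain drifts or wanders, so its path never settles into a stable region.
# simulated stand-ins for a healthy vs. malfunctioning trace plot
set.seed(11)
n_iter <- 1000
good <- rnorm(n_iter)                        # stationary and well mixed
bad  <- cumsum(rnorm(n_iter, sd = 0.1))      # random walk: drifts, never stationary
par(mfrow = c(2, 1))
plot(good, type = "l", ylab = "value", main = "healthy: stationary, well mixed")
plot(bad,  type = "l", ylab = "value", main = "malfunctioning: wandering, not stationary")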
9E7
Repeat the problem above, but now for a trace rank plot.
Same as in notes.
Medium Questions
9M1
Re-estimate the terrain ruggedness model from the chapter, but now using a uniform prior for the standard deviation, sigma. The uniform prior should be dunif(0,1). Use ulam to estimate the posterior. Does the different prior have any detectible influence on the posterior distribution of sigma? Why or why not?
library(rethinking)
data(rugged)
d <- rugged

# make log version of outcome
d$log_gdp <- log( d$rgdppc_2000 )

# extract countries with GDP data
dd <- d[ complete.cases(d$rgdppc_2000) , ]

# rescale variables
dd$log_gdp_std <- dd$log_gdp / mean(dd$log_gdp)
dd$rugged_std <- dd$rugged / max(dd$rugged)

# make vars to index Africa (1) or not (2)
dd$cid <- ifelse(dd$cont_africa == 1, 1, 2)

# make list of data needed
dat_slim <- list(
  log_gdp_std = dd$log_gdp_std,
  rugged_std = dd$rugged_std,
  cid = as.integer(dd$cid)
)
# ulam with uniform prior for sigma
m1 <- ulam(
  alist(
    log_gdp_std ~ dnorm( mu , sigma ) ,
    mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) ,
    a[cid] ~ dnorm( 1 , 0.1 ) ,
    b[cid] ~ dnorm( 0 , 0.3 ) ,
    sigma ~ dunif( 0 , 1 )
  ) , data=dat_slim , chains=4
)
# ulam with exponential prior for sigma
m1_2 <- ulam(
  alist(
    log_gdp_std ~ dnorm( mu , sigma ) ,
    mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) ,
    a[cid] ~ dnorm( 1 , 0.1 ) ,
    b[cid] ~ dnorm( 0 , 0.3 ) ,
    sigma ~ dexp( 1 )
  ) , data=dat_slim , chains=4
)
precis(m1, depth = 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8868111 0.016274050 0.86131861 0.91313214 1989.316 1.0003813
## a[2] 1.0505551 0.009992798 1.03469034 1.06671626 2916.075 0.9985253
## b[1] 0.1324859 0.073954011 0.01733176 0.25188118 2110.858 0.9983620
## b[2] -0.1427374 0.057499301 -0.22940542 -0.05106997 2965.377 0.9991757
## sigma 0.1116927 0.006072165 0.10259805 0.12178330 2054.683 0.9998245
precis(m1_2, depth = 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8864282 0.016310891 0.85981870 0.91276211 2872.463 0.9993626
## a[2] 1.0506045 0.009812083 1.03459781 1.06623663 2837.327 0.9999244
## b[1] 0.1302804 0.075913341 0.01104798 0.25069253 2636.858 0.9992483
## b[2] -0.1433479 0.054924014 -0.23353630 -0.05891528 2618.432 0.9997755
## sigma 0.1115228 0.006019836 0.10248513 0.12144650 2378.389 0.9989469
The parameter estimates under both priors are essentially identical. The effective number of samples is somewhat higher for most parameters in the model with the exponential prior on sigma, but the difference is small. Let's look at the posterior distribution of sigma.
post1 <- extract.samples(m1)
post2 <- extract.samples(m1_2)

dens(post1$sigma, lwd = 2, col = 'navyblue')
dens(post2$sigma, lwd = 2, lty = 2, add = TRUE)
Again, the two posteriors look essentially the same. Both priors are weak relative to the data, so the likelihood dominates.
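To see why the prior choice matters so little here, we can plot the two priors directly (a quick sketch of my own, not part of the original solution): over the region where the posterior for sigma concentrates (around 0.11), dunif(0, 1) and dexp(1) are both nearly flat, so the data dominate.
# sketch: both priors are essentially flat near the posterior for sigma (~0.11)
curve(dunif(x, 0, 1), from = 0, to = 1, ylim = c(0, 1.2),
      lty = 1, xlab = "sigma", ylab = "prior density")
curve(dexp(x, rate = 1), from = 0, to = 1, lty = 2, add = TRUE)
abline(v = 0.11, col = "grey")   # approximate posterior mean of sigma from precis above
legend("topright", lty = c(1, 2), legend = c("dunif(0, 1)", "dexp(1)"))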
9M2
Modify the terrain ruggedness model again. This time, change the prior for b[cid] to dexp(0.3). What does this do to the posterior distribution? Can you explain it?
m2 <- ulam(
  alist(
    log_gdp_std ~ dnorm( mu , sigma ) ,
    mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) ,
    a[cid] ~ dnorm( 1 , 0.1 ) ,
    b[cid] ~ dexp( 0.3 ) ,
    sigma ~ dexp( 1 )
  ) , data=dat_slim , chains=4
)
precis(m2, depth = 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8874642 0.016767361 0.860813815 0.91409707 1724.159 1.0003537
## a[2] 1.0483307 0.010426554 1.031710968 1.06488842 2056.717 0.9984253
## b[1] 0.1448659 0.073507854 0.036232874 0.27368428 1408.651 0.9992732
## b[2] 0.0188353 0.018117025 0.001017523 0.05489939 1904.206 1.0001813
## sigma 0.1142173 0.006270161 0.104461415 0.12466233 1675.246 0.9997731
precis(m1_2, 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8864282 0.016310891 0.85981870 0.91276211 2872.463 0.9993626
## a[2] 1.0506045 0.009812083 1.03459781 1.06623663 2837.327 0.9999244
## b[1] 0.1302804 0.075913341 0.01104798 0.25069253 2636.858 0.9992483
## b[2] -0.1433479 0.054924014 -0.23353630 -0.05891528 2618.432 0.9997755
## sigma 0.1115228 0.006019836 0.10248513 0.12144650 2378.389 0.9989469
The estimates for b[2] are very different now: earlier it was -0.14 with an sd of 0.06; now it is 0.02 with an sd of 0.02. The difference arises because the exponential prior puts all of its probability mass on positive values, so b[2] can no longer be negative. The plot below compares the two priors.
withr::with_par(
  list(mfrow = c(1, 2)),
  code = {
    dens(rnorm(1e3, mean = 0, sd = 0.3))
    dens(rexp(1e3, rate = 0.3))
  }
)
9M3
Re-estimate one of the Stan models from the chapter, but at different numbers of warmup iterations. Be sure to use the same number of sampling iterations in each case. Compare the n_eff values. How much warmup is enough?
Let’s redo the model for ruggedness
# warmup of 100 iterations
m3_1 <- ulam(
  alist(
    log_gdp_std ~ dnorm( mu , sigma ) ,
    mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) ,
    a[cid] ~ dnorm( 1 , 0.1 ) ,
    b[cid] ~ dnorm( 0 , 0.3 ) ,
    sigma ~ dexp( 1 )
  ) , data=dat_slim , chains=1, warmup = 100
)

# warmup of 200 iterations
m3_2 <- ulam(
  alist(
    log_gdp_std ~ dnorm( mu , sigma ) ,
    mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) ,
    a[cid] ~ dnorm( 1 , 0.1 ) ,
    b[cid] ~ dnorm( 0 , 0.3 ) ,
    sigma ~ dexp( 1 )
  ) , data=dat_slim , chains=1, warmup = 200
)

# warmup of 300 iterations
m3_3 <- ulam(
  alist(
    log_gdp_std ~ dnorm( mu , sigma ) ,
    mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) ,
    a[cid] ~ dnorm( 1 , 0.1 ) ,
    b[cid] ~ dnorm( 0 , 0.3 ) ,
    sigma ~ dexp( 1 )
  ) , data=dat_slim , chains=1, warmup = 300
)

# warmup of 400 iterations
m3_4 <- ulam(
  alist(
    log_gdp_std ~ dnorm( mu , sigma ) ,
    mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) ,
    a[cid] ~ dnorm( 1 , 0.1 ) ,
    b[cid] ~ dnorm( 0 , 0.3 ) ,
    sigma ~ dexp( 1 )
  ) , data=dat_slim , chains=1, warmup = 400
)
The default number of iterations is 1000, so the number of post-warmup samples will be 900, 800, 700, and 600 for these four models. Looking at the effective number of samples for each:
precis(m3_1, 2) # 900
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8868196 0.015655697 0.86134499 0.91204240 1295.8535 0.9990307
## a[2] 1.0505861 0.009953961 1.03462147 1.06647458 843.9829 1.0002072
## b[1] 0.1353347 0.080317484 0.01443962 0.26884678 351.0417 0.9990841
## b[2] -0.1396552 0.056509423 -0.23439111 -0.04748051 661.4314 0.9989569
## sigma 0.1117724 0.005991255 0.10257824 0.12148386 528.2865 1.0053726
precis(m3_2, 2) # 800
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8867867 0.015157119 0.864024785 0.91126074 1131.4749 0.9994350
## a[2] 1.0505268 0.010513049 1.033731610 1.06759346 1243.7486 0.9989282
## b[1] 0.1378207 0.076746426 0.009455774 0.26236293 777.9604 0.9992472
## b[2] -0.1424589 0.051370490 -0.227434767 -0.06610859 510.1402 1.0021467
## sigma 0.1113461 0.006291357 0.102042173 0.12167635 1003.0415 0.9991153
precis(m3_3, 2) # 700
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8872582 0.015877073 0.86045015 0.91229131 668.5576 0.9986100
## a[2] 1.0509631 0.009964136 1.03476323 1.06646338 1061.9460 1.0007350
## b[1] 0.1362285 0.073149006 0.02233877 0.25432897 702.7756 0.9987956
## b[2] -0.1410841 0.050583953 -0.22221589 -0.05923978 659.8180 0.9989708
## sigma 0.1120259 0.006402932 0.10274074 0.12291943 577.4579 1.0014438
precis(m3_4, 2) # 600
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8863480 0.015493872 0.86065375 0.9117343 783.7135 0.9986833
## a[2] 1.0508508 0.010120473 1.03437132 1.0660946 813.4244 0.9984072
## b[1] 0.1338079 0.078338840 0.01253298 0.2633544 670.9734 0.9989067
## b[2] -0.1407705 0.052839966 -0.22217897 -0.0564480 648.7546 0.9983508
## sigma 0.1113028 0.006245091 0.10186310 0.1221233 912.9741 0.9987929
For the first two models, the effective number of samples for several parameters is much lower than the actual number of samples (for example b[1] in the warmup = 100 run). With warmup = 300 the gap mostly closes, and with warmup = 400 every parameter's n_eff exceeds the actual number of samples, implying the post-warmup samples are close to independent or even slightly anti-correlated. For this simple model, a few hundred warmup iterations appear to be enough.
Hard Questions
9H1
Run the model below and then inspect the posterior distribution and explain what it is accomplishing. Compare the samples for the parameters a and b. Can you explain the different trace plots? If you are unfamiliar with the Cauchy distribution, you should look it up. The key feature to attend to is that it has no expected value. Can you connect this fact to the trace plot?
mp <- ulam(
  alist(
    a ~ dnorm(0,1),
    b ~ dcauchy(0,1)
  ), data=list(y=1) , chains=1 )
##
## SAMPLING FOR MODEL 'bcf56ee89f6cf2a4224a4139ff01c7d4' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.06 seconds (Warm-up)
## Chain 1: 0.065 seconds (Sampling)
## Chain 1: 0.125 seconds (Total)
## Chain 1:
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
precis(mp)
## mean sd 5.5% 94.5% n_eff Rhat4
## a -0.07426057 0.9612544 -1.552383 1.438437 177.6474 1.0004785
## b 0.03172055 8.1080657 -8.845772 11.605233 183.6743 0.9987324
The model is simply sampling from the two given distributions; the single data point y=1 is not connected to either parameter. For 'a', the posterior mean and sd are close to the Normal(0, 1) we specified. For 'b', the mean is roughly in the right place, but the sd is huge, reflecting the heavy tails of the Cauchy.
traceplot(mp)
## [1] 1000
## [1] 1
## [1] 1000
The chain for 'a' appears stationary with few extreme values. The chain for 'b' shows occasional very large spikes. The Cauchy distribution has fat tails and no expected value, so these extreme excursions are expected, and they are why the chain struggles (note the low n_eff and the ESS warnings above).
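The "no expected value" point can be made concrete with a quick simulation (my own sketch, not from the original answer): the running mean of Cauchy draws never converges, which is exactly why a chain sampling a Cauchy keeps producing occasional extreme jumps in its trace.
# sketch: running mean of Cauchy samples never settles, unlike a Normal
set.seed(3)
n_cauchy <- 1e4
running_mean <- cumsum(rcauchy(n_cauchy, 0, 1)) / (1:n_cauchy)
plot(running_mean, type = "l", xlab = "draw", ylab = "running mean",
     main = "Cauchy(0,1): running mean does not converge")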
Plotting from posterior samples.
post <- extract.samples(mp)
dev.off()
## null device
## 1
withr::with_par(
  list(mfrow = c(1, 2)),
  code = {
    dens(post$a, lty = 1, col = 'navyblue', xlim = c(-4, 4))
    curve(dnorm(x, 0, 1), from = -4, to = 4, lty = 2, add = TRUE)
    legend("topright", lty = c(1, 2), legend = c("posterior", "expected"))
    mtext("Posterior for a")
    dens(post$b, lty = 1, col = 'navyblue', xlim = c(-20, 20))
    curve(dcauchy(x, 0, 1), from = -20, to = 20, lty = 2, add = TRUE)
    mtext("Posterior for b")
  }
)
The posterior for a matches the expected Normal(0, 1). The posterior for b is far more dispersed, with many extreme values, consistent with the heavy tails of the Cauchy.
9H2
Recall the divorce rate example from Chapter 5. Repeat that analysis, using ulam this time, fitting models m5.1, m5.2, and m5.3. Use compare to compare the models on the basis of WAIC or PSIS. To use WAIC or PSIS with ulam, you need add the argument log_log=TRUE. Explain the model comparison results.
The argument is log_lik, not log_log. It needs to be passed to ulam.
library(rethinking)
data(WaffleDivorce)
d <- WaffleDivorce

# standardize variables
d$D <- standardize( d$Divorce )
d$M <- standardize( d$Marriage )
d$A <- standardize( d$MedianAgeMarriage )

# dataset with just regression vars
dd <- list(D = d$D,
           M = d$M,
           A = d$A)
# Divorce against Age at marriage
m5.1 <- ulam(
  alist(
    D ~ dnorm(mu, sigma),
    mu <- a + bA*A,
    a ~ dnorm(0, 0.2),
    bA ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = dd, log_lik = TRUE
)
##
## SAMPLING FOR MODEL '5605ea5f0769a3b35ef3557a0659786d' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.052 seconds (Warm-up)
## Chain 1: 0.052 seconds (Sampling)
## Chain 1: 0.104 seconds (Total)
## Chain 1:
# divorce against marriage rate
m5.2 <- ulam(
  alist(
    D ~ dnorm( mu , sigma ) ,
    mu <- a + bM * M ,
    a ~ dnorm( 0 , 0.2 ) ,
    bM ~ dnorm( 0 , 0.5 ) ,
    sigma ~ dexp( 1 )
  ) , data = dd , log_lik = TRUE
)
##
## SAMPLING FOR MODEL '5562557d1eb945f85e38992afb87f838' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.083 seconds (Warm-up)
## Chain 1: 0.063 seconds (Sampling)
## Chain 1: 0.146 seconds (Total)
## Chain 1:
# D ~ A + M
m5.3 <- ulam(
  alist(
    D ~ dnorm( mu , sigma ) ,
    mu <- a + bM*M + bA*A ,
    a ~ dnorm( 0 , 0.2 ) ,
    bM ~ dnorm( 0 , 0.5 ) ,
    bA ~ dnorm( 0 , 0.5 ) ,
    sigma ~ dexp( 1 )
  ) , data = dd , log_lik = TRUE
)
##
## SAMPLING FOR MODEL '3fea1cdc1f8786451ed127638cb7b165' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.071 seconds (Warm-up)
## Chain 1: 0.063 seconds (Sampling)
## Chain 1: 0.134 seconds (Total)
## Chain 1:
coeftab_plot(coeftab(m5.1, m5.2, m5.3))
compare(m5.1, m5.2, m5.3, func = WAIC)
## WAIC SE dWAIC dSE pWAIC weight
## m5.1 125.9284 12.649997 0.0000000 NA 3.689475 0.5682455579
## m5.3 126.4815 12.469959 0.5531214 0.7773932 4.227350 0.4309503778
## m5.2 139.0496 9.705361 13.1212594 9.1870631 2.822440 0.0008040643
Models m5.1 and m5.3 have very similar WAIC, while m5.2 is clearly worse. Taking dWAIC and dSE together, adding marriage rate (the multiple regression m5.3) provides no predictive improvement over using median age at marriage alone. This matches the conclusion from Chapter 5: once we know median age at marriage, the marriage rate provides no additional information about the divorce rate.
9H3
Sometimes changing a prior for one parameter has unanticipated effects on other parameters. This is because when a parameter is highly correlated with another parameter in the posterior, the prior influences both parameters. Here’s an example to work and think through. Go back to the leg length example in Chapter 6 and use the code there to simulate height and leg lengths for 100 imagined individuals. Below is the model you fit before, resulting in a highly correlated posterior for the two beta parameters. This time, fit the model using ulam. Compare the posterior distribution produced by the code above to the posterior distribution produced when you change the prior for br so that it is strictly positive.
Note the constraints list. What this does is constrain the prior distribution of br so that it has positive probability only above zero. In other words, that prior ensures that the posterior distribution for br will have no probability mass below zero. Compare the two posterior distributions for m5.8s and m5.8s2. What has changed in the posterior distribution of both beta parameters? Can you explain the change induced by the change in prior?
library(rethinking)
N <- 100                          # number of individuals
set.seed(909)
height <- rnorm(N,10,2)           # sim total height of each
leg_prop <- runif(N,0.4,0.5)      # leg as proportion of height

# sim left leg as proportion + error
leg_left <- leg_prop * height + rnorm(N , 0 , 0.02)

# sim right leg as proportion + error
leg_right <- leg_prop * height + rnorm(N , 0 , 0.02)

# combine into data frame
d <- list(height = height,
          leg_left = leg_left,
          leg_right = leg_right)
m5.8s <- ulam(
  alist(
    height ~ dnorm( mu , sigma ) ,
    mu <- a + bl*leg_left + br*leg_right ,
    a ~ dnorm( 10 , 100 ) ,
    bl ~ dnorm( 2 , 10 ) ,
    br ~ dnorm( 2 , 10 ) ,
    sigma ~ dexp( 1 )
  ) , data=d, chains=4, log_lik = TRUE,
  start=list(a=10,bl=0,br=0.1,sigma=1) )
## Warning: There were 1191 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
# br lower bound at zero
m5.8s2 <- ulam(
  alist(
    height ~ dnorm( mu , sigma ) ,
    mu <- a + bl*leg_left + br*leg_right ,
    a ~ dnorm( 10 , 100 ) ,
    bl ~ dnorm( 2 , 10 ) ,
    br ~ dnorm( 2 , 10 ) ,
    sigma ~ dexp( 1 )
  ) , data=d, chains=4, log_lik = TRUE,
  constraints=list(br="lower=0"),
  start=list(a=10,bl=0,br=0.1,sigma=1) )
##
## SAMPLING FOR MODEL '99941c176379a8cca4e2c77751596ac1' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 8.237 seconds (Warm-up)
## Chain 1: 9.343 seconds (Sampling)
## Chain 1: 17.58 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL '99941c176379a8cca4e2c77751596ac1' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 0 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 8.343 seconds (Warm-up)
## Chain 2: 11.023 seconds (Sampling)
## Chain 2: 19.366 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL '99941c176379a8cca4e2c77751596ac1' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 0 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 7.242 seconds (Warm-up)
## Chain 3: 10.83 seconds (Sampling)
## Chain 3: 18.072 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL '99941c176379a8cca4e2c77751596ac1' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 0 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 8.419 seconds (Warm-up)
## Chain 4: 8.642 seconds (Sampling)
## Chain 4: 17.061 seconds (Total)
## Chain 4:
## Warning: There were 7 divergent transitions after warmup. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
## Warning: There were 1099 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
Comparing the posteriors of the two models.
precis(m5.8s)
## mean sd 5.5% 94.5% n_eff Rhat4
## a 0.9748631 0.28478330 0.5165577 1.4169806 906.1307 1.001488
## bl 0.2221700 2.44527124 -3.5677276 4.1929742 604.0307 1.002141
## br 1.7750616 2.45010299 -2.2081316 5.5593370 602.5882 1.002131
## sigma 0.6318075 0.04495998 0.5657285 0.7069093 815.1919 1.003927
precis(m5.8s2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a 0.9778575 0.29736505 0.5229600 1.4498872 736.8171 1.007853
## bl -0.8784301 1.92643703 -4.4109563 1.6538449 351.7605 1.007613
## br 2.8744066 1.92891258 0.3142771 6.4264434 352.0929 1.007655
## sigma 0.6311880 0.04538031 0.5639201 0.7085788 815.5653 1.008098
As mentioned back in Chapter 5, in this case we cannot determine the values of bl and br separately, only their sum. The sum of their posterior means is still approximately 2. However, since we have now forced br to be positive, the posterior interval for bl shifts toward more negative values.
coeftab_plot(coeftab(m5.8s, m5.8s2), pars = c("bl", "br"))
Looking at this plot, we can see that while the interval for br has moved to the right, the interval for bl has moved to the left by roughly the same amount. After all, it is only their sum that the data can pin down, so if br is constrained to be strictly positive, bl must shift toward the negative range to compensate.
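We can check this directly by comparing the posterior of the sum bl + br in the two models (my own check, not part of the original answer): the constraint shifts each coefficient, but their sum stays essentially unchanged.
# sketch: the posterior of bl + br is nearly identical in the two models
post_u <- extract.samples(m5.8s)    # unconstrained
post_c <- extract.samples(m5.8s2)   # br constrained to be positive
dens(post_u$bl + post_u$br, lwd = 2)
dens(post_c$bl + post_c$br, lwd = 2, lty = 2, add = TRUE)
legend("topright", lty = c(1, 2), legend = c("m5.8s", "m5.8s2"))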
pairs(m5.8s, pars = c("bl", "br"))
pairs(m5.8s2, pars = c("bl", "br"))
Here we can see more clearly that in the second model, where br is now right-skewed (truncated at zero), bl has become left-skewed by roughly the same amount.
9H4
For the two models fit in the previous problem, use WAIC or PSIS to compare the effective numbers of parameters for each model. You will need to use log_lik=TRUE to instruct ulam to compute the terms that both WAIC and PSIS need. Which model has more effective parameters? Why?
compare(m5.8s, m5.8s2, func = WAIC)
## WAIC SE dWAIC dSE pWAIC weight
## m5.8s2 194.6241 11.35733 0.0000000 NA 2.931005 0.5147153
## m5.8s 194.7418 11.24456 0.1177562 0.7445792 3.050454 0.4852847
compare(m5.8s, m5.8s2, func = PSIS)
## PSIS SE dPSIS dSE pPSIS weight
## m5.8s2 194.6475 11.41778 0.0000000 NA 2.942753 0.5176196
## m5.8s 194.7886 11.30956 0.1410153 0.7422116 3.073831 0.4823804
According to both WAIC and PSIS, the unconstrained model m5.8s has slightly more effective parameters (pWAIC/pPSIS of about 3.1 vs 2.9). One reason could be that the posterior standard deviations for bl and br are smaller in m5.8s2: the constraint on br reduces posterior variance, and the pWAIC penalty grows with the variance of the pointwise log-likelihood.
9H5
Modify the Metropolis algorithm code from the chapter to handle the case that the island populations have a different distribution than the island labels. This means the island’s number will not be the same as its population.
populations <- seq(100, 1000, by = 100)
island_pop <- sample(populations)   # shuffle so population no longer matches island number

num_weeks <- 1e5
positions <- rep(0, num_weeks)
current <- 10
for(i in 1:num_weeks){
  ## record current position
  positions[i] <- current

  # flip coin to generate proposal (one of the adjacent islands)
  proposal <- current + sample( c(-1, 1), size = 1)

  # boundary looping
  if( proposal < 1) proposal <- 10
  if( proposal > 10) proposal <- 1

  # move?
  prob_move <- island_pop[proposal] / island_pop[current] # changed here to use actual population sizes
  current <- ifelse(runif(1) < prob_move, proposal, current)
}
Plotting the above:
pop_prop <- island_pop / sum(island_pop)
stay_prop <- table(positions) / length(positions)

plot(1:10, pop_prop, type = 'h', lwd = 2)
lines( (1:10) + 0.5, stay_prop, type = 'h', lwd = 2, col = 'navyblue')
legend("topright", col = c("black", "navyblue"), legend = c("actual population", "time king stayed"), lwd = 2)
The two match: the time spent on each island is proportional to its (shuffled) population, not to its island number.
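A numeric version of the same check (my own addition): compare each island's share of the total population with the share of weeks the king spent there.
# numeric check: population share vs. share of weeks spent on each island
round(cbind(population_share = as.numeric(pop_prop),
            time_share = as.numeric(stay_prop)), 3)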
9H6
Modify the Metropolis algorithm code from the chapter to write your own simple MCMC estimator for globe tossing data and model from Chapter 2.
In the globe tossing example, we tossed a globe, caught it, and noted whether our finger was touching land or water. The data were used to estimate the proportion of water covering the globe. In equation form:
\[
\begin{align}
W &\sim Binomial(N, p) \\
p &\sim Uniform(0, 1)
\end{align}
\]
where W is the number of times we observed water during our N tosses and p is the proportion of water on the globe. We assume a flat Uniform(0, 1) prior for p.
We need to modify the Metropolis code above. Repeating the analogy from the text:
- Islands are the parameter values
- Population sizes are the posterior probabilities at each param value
- Weeks are the samples.
So we will start with p at some random value between 0 and 1. At each step, a proposal is generated as the current value plus or minus 0.01. The decision to move is made by computing the (unnormalized) posterior probability, likelihood times prior, of observing W waters in N tosses at both the proposal and the current value, and using their ratio as the probability of moving.
W <- 6   # from chapter 2, 6 water in 9 tosses
N <- 9   # from chapter 2

iters <- 1e5
positions <- rep(0, iters)   # keeping variable names similar for easy referencing
current <- runif(1)

for(i in 1:iters){
  ## record current position
  positions[i] <- current

  # flip coin to generate proposal (step of 0.01 up or down)
  proposal <- current + sample( c(-0.01, 0.01), size = 1)

  # boundary looping
  if( proposal < 0) proposal <- 1
  if( proposal > 1) proposal <- 0

  # compute the posterior probability as: likelihood times prior
  prob_proposal <- dbinom(W, N, proposal) * dunif(proposal, 0, 1)
  prob_current  <- dbinom(W, N, current) * dunif(current, 0, 1)

  # move?
  prob_move <- prob_proposal / prob_current
  current <- ifelse(runif(1) < prob_move, proposal, current)
}
Plotting the path the chain took (the equivalent of a trace plot) and the density of the sampled values:
withr::with_par(
  list(mfrow = c(2, 1)),
  code = {
    plot(1:iters, positions, type = "l", col = col.alpha('blue', alpha = 0.7), lwd = 0.7, main = "Path of p during 1e5 iterations")
    dens(positions, main = "Density plot for p")
    # abline(v = 0.72, lty = 2)
  }
)
The chain converges to the 0.6-0.8 range, and the density plot peaks at around 0.7.
res <- density(positions)
res$x[which.max(res$y)]
## [1] 0.683913
The distribution of p peaks at about 0.68, close to the analytical mode of 6/9 ≈ 0.67 that we expect with a flat prior.
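As a cross-check (my own addition), we can overlay the grid-approximation posterior from Chapter 2 on the Metropolis samples; the two should agree closely.
# compare Metropolis samples against the Chapter 2 grid-approximation posterior
p_grid <- seq(0, 1, length.out = 1000)
post_grid <- dbinom(W, N, p_grid) * dunif(p_grid, 0, 1)
post_grid <- post_grid / sum(post_grid) / (p_grid[2] - p_grid[1])   # normalize to a density
dens(positions, lwd = 2)
lines(p_grid, post_grid, lty = 2)
legend("topright", lty = c(1, 2), legend = c("Metropolis samples", "grid approximation"))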
9H7
Can you write your own Hamiltonian Monte Carlo algorithm for the globe tossing data, using the R code in the chapter? You will have to write your own functions for the likelihood and gradient, but you can use the HMC2 function.
The HMC algorithm needs five things to run:

- a function U that returns the negative log-probability of the data at the current position (parameter values): we can use dbinom here with the log = TRUE argument
- a function grad_U that returns the gradient of the negative log-probability at the current position (derived below)
- a step size epsilon: we will try different values
- a count of leapfrog steps L: different values will be tried
- a starting position current_q: we will start with some random number between 0 and 1

For the gradient, start from the binomial likelihood

\[
\mathcal{L}(p) = {n \choose k} p^k (1-p)^{n-k}
\]

Taking its log and differentiating with respect to p:

\[
\begin{align}
\log \mathcal{L}(p) &= \log {n \choose k} + k \log(p) + (n-k) \log(1-p) \\[2ex]
\frac{d \log \mathcal{L}(p)}{dp} &= 0 + \frac{k}{p} + (n-k)\frac{-1}{1-p} \\[2ex]
&= \frac{k}{p} - \frac{n-k}{1-p}
\end{align}
\]

Since U is the negative log-likelihood, grad_U returns the negative of this derivative.
Putting all of this together
# U needs to return neg-log-probability
U <- function( q , k = 6, n = 9) {
  -sum( dbinom(k, n, q, log = TRUE) )
}

# gradient of the neg-log-probability
U_gradient <- function( p, k = 6, n = 9 ) {
  -sum(k / p - (n - k) / (1 - p))
}

# starting position
Q <- list()
Q$q <- 0.5

step <- 0.029   # step size epsilon
L <- 7          # number of leapfrog steps

n_samples <- 100
samples <- rep(NA, n_samples)

# HMC
for (i in 1:n_samples) {
  Q <- HMC2(U, U_gradient, step, L, Q$q)
  samples[i] <- Q$q
}
# Plotting results
par(mfrow = c(2, 1))
# traceplot
plot(1:n_samples, samples, col = 'navyblue', lwd = 1, type = "l",
ylab = "proportion of water in globe", xlab = "Sample #", xlim = c(0, n_samples), ylim = c(0, 1), main = "Path p took")
# posterior density
dens(samples, xlim = c(0, 1), main = "Posterior distribution of p")
# print mean value of samples
mean(samples, na.rm = TRUE)
## [1] 0.639087
The above runs the HMC algorithm. The step size and the number of leapfrog steps can make a big difference; several combinations were tried, and it is hard to say any one is best. The posterior is centered around 0.64-0.7.
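As a final sanity check (my own addition): with a flat prior, the exact posterior for p is Beta(W + 1, N - W + 1) = Beta(7, 4), so the HMC samples should roughly track that curve (with only 100 samples the match will be coarse).
# overlay the exact Beta(7, 4) posterior on the HMC samples
dens(samples, xlim = c(0, 1), lwd = 2)
curve(dbeta(x, 6 + 1, 9 - 6 + 1), from = 0, to = 1, lty = 2, add = TRUE)
legend("topright", lty = c(1, 2), legend = c("HMC samples", "exact Beta(7, 4)"))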