Statistical Rethinking Chapter 9 End of Chapter Questions
Easy Questions
9E1
Which of the following is a requirement of the simple Metropolis algorithm?
- The parameters must be discrete.
- The likelihood function must be Gaussian.
- The proposal distribution must be symmetric
Only the third option is a requirement: the proposal distribution must be symmetric. That is, the chance of proposing a move from A to B must be the same as the chance of proposing a move from B to A.
9E2
Gibbs sampling is more efficient than the Metropolis algorithm. How does it achieve this extra efficiency? Are there any limitations to the Gibbs sampling strategy?
Gibbs sampling uses adaptive proposals, in which the current parameter values help adjust the proposal distribution. It does this by exploiting conjugate pairs of prior and likelihood, which have analytical solutions for the conditional posterior of each individual parameter, so no proposals are wasted. The limitations are that we may not want to use conjugate priors, and that Gibbs sampling becomes very inefficient for complex models with many, highly correlated parameters.
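A minimal sketch of the conjugacy idea (my own illustration, not from the chapter): with a Binomial likelihood and a Beta prior, the conditional posterior is again a Beta distribution with known parameters, so a Gibbs-style update can draw from it exactly instead of proposing and rejecting. Gibbs sampling applies this kind of exact conditional update to each parameter in turn.
# sketch (not from the chapter): Beta prior + Binomial likelihood -> Beta posterior,
# so the update for this parameter can be drawn exactly, no proposal/rejection needed
w <- 6; n <- 9                    # e.g. 6 waters in 9 globe tosses (as in Chapter 2)
prior_a <- 1; prior_b <- 1        # flat Beta(1, 1) prior
p_draws <- rbeta(1e4, prior_a + w, prior_b + n - w)   # exact draws from the conditional posterior
plot(density(p_draws), main = "Exact conjugate update for p")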
9E3
Which sort of parameters can Hamiltonian Monte Carlo not handle? Can you explain why?
Discrete parameters. HMC needs to evaluate the posterior (and its gradient) at any point in the continuous space defined by the parameter values. In the frictionless-particle analogy, the random initial flick and the curvature of the surface determine the path, and the particle can come to rest at any point. With discrete parameters there is no continuous surface to glide over, so this does not work.
9E4
Explain the difference between the effective number of samples, n_eff as calculated by Stan, and the actual number of samples.
The actual number of samples is, as the name states, the number of samples the HMC chains actually drew. n_eff, the effective number of samples, can be thought of as the length of a Markov chain with no autocorrelation that would provide the same quality of estimate as your chain; it is only an estimate. Because sequential samples are usually autocorrelated, n_eff is typically smaller than the actual number of samples, but it can be larger than the chain length if sequential samples are anti-correlated in the right way.
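A rough back-of-the-envelope illustration of why autocorrelation shrinks n_eff (my own sketch; Stan's estimator is more sophisticated): a common approximation is n_eff ≈ N / (1 + 2 Σ ρ_k), the chain length divided by a factor built from the chain's autocorrelations.
# sketch: a strongly autocorrelated AR(1) series has far fewer effective samples than draws
set.seed(1)
n_draws <- 1e4
x <- numeric(n_draws)
for (i in 2:n_draws) x[i] <- 0.9 * x[i - 1] + rnorm(1)   # AR(1) chain with strong autocorrelation
rho <- acf(x, lag.max = 100, plot = FALSE)$acf[-1]       # sample autocorrelations at lags 1..100
n_draws / (1 + 2 * sum(rho))                             # crude n_eff estimate, roughly n_draws / 19 here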
9E5
Which value should Rhat approach, when a chain is sampling the posterior distribution correctly?
1 from above.
9E6
Sketch a good trace plot for a Markov chain, one that is effectively sampling from the posterior distribution. What is good about its shape? Then sketch a trace plot for a malfunctioning Markov chain. What about its shape indicates malfunction?
Same as in notes.
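As a stand-in for the hand-drawn sketches, here is a small simulation (my own addition, not an actual MCMC run) of the two shapes. A healthy chain is stationary (it stays around a stable central value) and mixes well (rapid zig-zag through that region); a malfunctioning chain drifts or wanders, so its path never settles into a stable region.
# simulated stand-ins for a healthy vs. malfunctioning trace plot
set.seed(11)
n_iter <- 1000
good <- rnorm(n_iter)                        # stationary and well mixed
bad  <- cumsum(rnorm(n_iter, sd = 0.1))      # random walk: drifts, never stationary
par(mfrow = c(2, 1))
plot(good, type = "l", ylab = "value", main = "healthy: stationary, well mixed")
plot(bad,  type = "l", ylab = "value", main = "malfunctioning: wandering, not stationary")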
9E7
Repeat the problem above, but now for a trace rank plot.
Same as in notes.
Medium Questions
9M1
Re-estimate the terrain ruggedness model from the chapter, but now using a uniform prior for the standard deviation, sigma. The uniform prior should be dunif(0,1). Use ulam to estimate the posterior. Does the different prior have any detectible influence on the posterior distribution of sigma? Why or why not?
library(rethinking)
data(rugged)
d <- rugged

# make log version of outcome
d$log_gdp <- log( d$rgdppc_2000 )

# extract countries with GDP data
dd <- d[ complete.cases(d$rgdppc_2000) , ]

# rescale variables
dd$log_gdp_std <- dd$log_gdp / mean(dd$log_gdp)
dd$rugged_std <- dd$rugged / max(dd$rugged)

# make vars to index Africa (1) or not (2)
dd$cid <- ifelse(dd$cont_africa == 1, 1, 2)

# make list of data needed
dat_slim <- list(
  log_gdp_std = dd$log_gdp_std,
  rugged_std = dd$rugged_std,
  cid = as.integer(dd$cid)
)
# ulam with uniform prior for sigma
m1 <- ulam(
  alist(
    log_gdp_std ~ dnorm( mu , sigma ) ,
    mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) ,
    a[cid] ~ dnorm( 1 , 0.1 ) ,
    b[cid] ~ dnorm( 0 , 0.3 ) ,
    sigma ~ dunif( 0 , 1 )
  ) , data=dat_slim , chains=4
)
# ulam with exponential prior for sigma
m1_2 <- ulam(
  alist(
    log_gdp_std ~ dnorm( mu , sigma ) ,
    mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) ,
    a[cid] ~ dnorm( 1 , 0.1 ) ,
    b[cid] ~ dnorm( 0 , 0.3 ) ,
    sigma ~ dexp( 1 )
  ) , data=dat_slim , chains=4
)
precis(m1, depth = 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8868111 0.016274050 0.86131861 0.91313214 1989.316 1.0003813
## a[2] 1.0505551 0.009992798 1.03469034 1.06671626 2916.075 0.9985253
## b[1] 0.1324859 0.073954011 0.01733176 0.25188118 2110.858 0.9983620
## b[2] -0.1427374 0.057499301 -0.22940542 -0.05106997 2965.377 0.9991757
## sigma 0.1116927 0.006072165 0.10259805 0.12178330 2054.683 0.9998245
precis(m1_2, depth = 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8864282 0.016310891 0.85981870 0.91276211 2872.463 0.9993626
## a[2] 1.0506045 0.009812083 1.03459781 1.06623663 2837.327 0.9999244
## b[1] 0.1302804 0.075913341 0.01104798 0.25069253 2636.858 0.9992483
## b[2] -0.1433479 0.054924014 -0.23353630 -0.05891528 2618.432 0.9997755
## sigma 0.1115228 0.006019836 0.10248513 0.12144650 2378.389 0.9989469
The parameter estimates under both priors are essentially identical. The effective number of samples is somewhat higher for most parameters in the model with the exponential prior on sigma, but the difference is small. Let's look at the posterior distribution of sigma.
post1 <- extract.samples(m1)
post2 <- extract.samples(m1_2)

dens(post1$sigma, lwd = 2, col = 'navyblue')
dens(post2$sigma, lwd = 2, lty = 2, add = TRUE)
Again, the two posteriors look essentially the same. Both priors are weak relative to the data, so the likelihood dominates.
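To see why the prior choice matters so little here, we can plot the two priors directly (a quick sketch of my own, not part of the original solution): over the region where the posterior for sigma concentrates (around 0.11), dunif(0, 1) and dexp(1) are both nearly flat, so the data dominate.
# sketch: both priors are essentially flat near the posterior for sigma (~0.11)
curve(dunif(x, 0, 1), from = 0, to = 1, ylim = c(0, 1.2),
      lty = 1, xlab = "sigma", ylab = "prior density")
curve(dexp(x, rate = 1), from = 0, to = 1, lty = 2, add = TRUE)
abline(v = 0.11, col = "grey")   # approximate posterior mean of sigma from precis above
legend("topright", lty = c(1, 2), legend = c("dunif(0, 1)", "dexp(1)"))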
9M2
Modify the terrain ruggedness model again. This time, change the prior for b[cid] to dexp(0.3). What does this do to the posterior distribution? Can you explain it?
m2 <- ulam(
  alist(
    log_gdp_std ~ dnorm( mu , sigma ) ,
    mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) ,
    a[cid] ~ dnorm( 1 , 0.1 ) ,
    b[cid] ~ dexp( 0.3 ) ,
    sigma ~ dexp( 1 )
  ) , data=dat_slim , chains=4
)
precis(m2, depth = 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8874642 0.016767361 0.860813815 0.91409707 1724.159 1.0003537
## a[2] 1.0483307 0.010426554 1.031710968 1.06488842 2056.717 0.9984253
## b[1] 0.1448659 0.073507854 0.036232874 0.27368428 1408.651 0.9992732
## b[2] 0.0188353 0.018117025 0.001017523 0.05489939 1904.206 1.0001813
## sigma 0.1142173 0.006270161 0.104461415 0.12466233 1675.246 0.9997731
precis(m1_2, 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8864282 0.016310891 0.85981870 0.91276211 2872.463 0.9993626
## a[2] 1.0506045 0.009812083 1.03459781 1.06623663 2837.327 0.9999244
## b[1] 0.1302804 0.075913341 0.01104798 0.25069253 2636.858 0.9992483
## b[2] -0.1433479 0.054924014 -0.23353630 -0.05891528 2618.432 0.9997755
## sigma 0.1115228 0.006019836 0.10248513 0.12144650 2378.389 0.9989469
The estimates for b[2] are very different now: earlier it was -0.14 with an sd of 0.06; now it is 0.02 with an sd of 0.02. The difference arises because the exponential prior puts all of its probability mass on positive values, so b[2] can no longer be negative. The plot below compares the two priors.
withr::with_par(
  list(mfrow = c(1, 2)),
  code = {
    dens(rnorm(1e3, mean = 0, sd = 0.3))
    dens(rexp(1e3, rate = 0.3))
  }
)
9M3
Re-estimate one of the Stan models from the chapter, but at different numbers of warmup iterations. Be sure to use the same number of sampling iterations in each case. Compare the n_eff values. How much warmup is enough?
Let’s redo the model for ruggedness
# warmup of 100 iterations
m3_1 <- ulam(
  alist(
    log_gdp_std ~ dnorm( mu , sigma ) ,
    mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) ,
    a[cid] ~ dnorm( 1 , 0.1 ) ,
    b[cid] ~ dnorm( 0 , 0.3 ) ,
    sigma ~ dexp( 1 )
  ) , data=dat_slim , chains=1, warmup = 100
)

# warmup of 200 iterations
m3_2 <- ulam(
  alist(
    log_gdp_std ~ dnorm( mu , sigma ) ,
    mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) ,
    a[cid] ~ dnorm( 1 , 0.1 ) ,
    b[cid] ~ dnorm( 0 , 0.3 ) ,
    sigma ~ dexp( 1 )
  ) , data=dat_slim , chains=1, warmup = 200
)

# warmup of 300 iterations
m3_3 <- ulam(
  alist(
    log_gdp_std ~ dnorm( mu , sigma ) ,
    mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) ,
    a[cid] ~ dnorm( 1 , 0.1 ) ,
    b[cid] ~ dnorm( 0 , 0.3 ) ,
    sigma ~ dexp( 1 )
  ) , data=dat_slim , chains=1, warmup = 300
)

# warmup of 400 iterations
m3_4 <- ulam(
  alist(
    log_gdp_std ~ dnorm( mu , sigma ) ,
    mu <- a[cid] + b[cid]*( rugged_std - 0.215 ) ,
    a[cid] ~ dnorm( 1 , 0.1 ) ,
    b[cid] ~ dnorm( 0 , 0.3 ) ,
    sigma ~ dexp( 1 )
  ) , data=dat_slim , chains=1, warmup = 400
)
The default number of iterations is 1000, so the number of post-warmup samples will be 900, 800, 700, and 600 for these four models. Looking at the effective number of samples for each:
precis(m3_1, 2) # 900
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8868196 0.015655697 0.86134499 0.91204240 1295.8535 0.9990307
## a[2] 1.0505861 0.009953961 1.03462147 1.06647458 843.9829 1.0002072
## b[1] 0.1353347 0.080317484 0.01443962 0.26884678 351.0417 0.9990841
## b[2] -0.1396552 0.056509423 -0.23439111 -0.04748051 661.4314 0.9989569
## sigma 0.1117724 0.005991255 0.10257824 0.12148386 528.2865 1.0053726
precis(m3_2, 2) # 800
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8867867 0.015157119 0.864024785 0.91126074 1131.4749 0.9994350
## a[2] 1.0505268 0.010513049 1.033731610 1.06759346 1243.7486 0.9989282
## b[1] 0.1378207 0.076746426 0.009455774 0.26236293 777.9604 0.9992472
## b[2] -0.1424589 0.051370490 -0.227434767 -0.06610859 510.1402 1.0021467
## sigma 0.1113461 0.006291357 0.102042173 0.12167635 1003.0415 0.9991153
precis(m3_3, 2) # 700
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8872582 0.015877073 0.86045015 0.91229131 668.5576 0.9986100
## a[2] 1.0509631 0.009964136 1.03476323 1.06646338 1061.9460 1.0007350
## b[1] 0.1362285 0.073149006 0.02233877 0.25432897 702.7756 0.9987956
## b[2] -0.1410841 0.050583953 -0.22221589 -0.05923978 659.8180 0.9989708
## sigma 0.1120259 0.006402932 0.10274074 0.12291943 577.4579 1.0014438
precis(m3_4, 2) # 600
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8863480 0.015493872 0.86065375 0.9117343 783.7135 0.9986833
## a[2] 1.0508508 0.010120473 1.03437132 1.0660946 813.4244 0.9984072
## b[1] 0.1338079 0.078338840 0.01253298 0.2633544 670.9734 0.9989067
## b[2] -0.1407705 0.052839966 -0.22217897 -0.0564480 648.7546 0.9983508
## sigma 0.1113028 0.006245091 0.10186310 0.1221233 912.9741 0.9987929
For the first two models, the effective number of samples for several parameters is much lower than the actual number of samples (for example b[1] in the warmup = 100 run). With warmup = 300 the gap mostly closes, and with warmup = 400 every parameter's n_eff exceeds the actual number of samples, implying the post-warmup samples are close to independent or even slightly anti-correlated. For this simple model, a few hundred warmup iterations appear to be enough.
Hard Questions
9H1
Run the model below and then inspect the posterior distribution and explain what it is accomplishing. Compare the samples for the parameters a and b. Can you explain the different trace plots? If you are unfamiliar with the Cauchy distribution, you should look it up. The key feature to attend to is that it has no expected value. Can you connect this fact to the trace plot?
mp <- ulam(
  alist(
    a ~ dnorm(0,1),
    b ~ dcauchy(0,1)
  ), data=list(y=1) , chains=1 )
##
## SAMPLING FOR MODEL 'bcf56ee89f6cf2a4224a4139ff01c7d4' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.06 seconds (Warm-up)
## Chain 1: 0.065 seconds (Sampling)
## Chain 1: 0.125 seconds (Total)
## Chain 1:
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
precis(mp)
## mean sd 5.5% 94.5% n_eff Rhat4
## a -0.07426057 0.9612544 -1.552383 1.438437 177.6474 1.0004785
## b 0.03172055 8.1080657 -8.845772 11.605233 183.6743 0.9987324
The model is simply sampling from the two given distributions; the single data point y=1 is not connected to either parameter. For 'a', the posterior mean and sd are close to the Normal(0, 1) we specified. For 'b', the mean is roughly in the right place, but the sd is huge, reflecting the heavy tails of the Cauchy.
traceplot(mp)
## [1] 1000
## [1] 1
## [1] 1000
The chain for 'a' appears stationary with few extreme values. The chain for 'b' shows occasional very large spikes. The Cauchy distribution has fat tails and no expected value, so these extreme excursions are expected, and they are why the chain struggles (note the low n_eff and the ESS warnings above).
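The "no expected value" point can be made concrete with a quick simulation (my own sketch, not from the original answer): the running mean of Cauchy draws never converges, which is exactly why a chain sampling a Cauchy keeps producing occasional extreme jumps in its trace.
# sketch: running mean of Cauchy samples never settles, unlike a Normal
set.seed(3)
n_cauchy <- 1e4
running_mean <- cumsum(rcauchy(n_cauchy, 0, 1)) / (1:n_cauchy)
plot(running_mean, type = "l", xlab = "draw", ylab = "running mean",
     main = "Cauchy(0,1): running mean does not converge")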
Plotting from posterior samples.
post <- extract.samples(mp)
dev.off()
## null device
## 1
withr::with_par(
  list(mfrow = c(1, 2)),
  code = {
    dens(post$a, lty = 1, col = 'navyblue', xlim = c(-4, 4))
    curve(dnorm(x, 0, 1), from = -4, to = 4, lty = 2, add = TRUE)
    legend("topright", lty = c(1, 2), legend = c("posterior", "expected"))
    mtext("Posterior for a")
    dens(post$b, lty = 1, col = 'navyblue', xlim = c(-20, 20))
    curve(dcauchy(x, 0, 1), from = -20, to = 20, lty = 2, add = TRUE)
    mtext("Posterior for b")
  }
)
The posterior for a matches the expected Normal(0, 1). The posterior for b is far more dispersed, with many extreme values, consistent with the heavy tails of the Cauchy.
9H2
Recall the divorce rate example from Chapter 5. Repeat that analysis, using ulam this time, fitting models m5.1, m5.2, and m5.3. Use compare to compare the models on the basis of WAIC or PSIS. To use WAIC or PSIS with ulam, you need add the argument log_log=TRUE. Explain the model comparison results.
The argument is log_lik, not log_log. It needs to be passed to ulam.
library(rethinking)
data(WaffleDivorce)
d <- WaffleDivorce

# standardize variables
d$D <- standardize( d$Divorce )
d$M <- standardize( d$Marriage )
d$A <- standardize( d$MedianAgeMarriage )

# dataset with just regression vars
dd <- list(D = d$D,
           M = d$M,
           A = d$A)
# Divorce against Age at marriage
m5.1 <- ulam(
  alist(
    D ~ dnorm(mu, sigma),
    mu <- a + bA*A,
    a ~ dnorm(0, 0.2),
    bA ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = dd, log_lik = TRUE
)
##
## SAMPLING FOR MODEL '5605ea5f0769a3b35ef3557a0659786d' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.052 seconds (Warm-up)
## Chain 1: 0.052 seconds (Sampling)
## Chain 1: 0.104 seconds (Total)
## Chain 1:
# divorce against marriage rate
m5.2 <- ulam(
  alist(
    D ~ dnorm( mu , sigma ) ,
    mu <- a + bM * M ,
    a ~ dnorm( 0 , 0.2 ) ,
    bM ~ dnorm( 0 , 0.5 ) ,
    sigma ~ dexp( 1 )
  ) , data = dd , log_lik = TRUE
)
##
## SAMPLING FOR MODEL '5562557d1eb945f85e38992afb87f838' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.083 seconds (Warm-up)
## Chain 1: 0.063 seconds (Sampling)
## Chain 1: 0.146 seconds (Total)
## Chain 1:
# D ~ A + M
m5.3 <- ulam(
  alist(
    D ~ dnorm( mu , sigma ) ,
    mu <- a + bM*M + bA*A ,
    a ~ dnorm( 0 , 0.2 ) ,
    bM ~ dnorm( 0 , 0.5 ) ,
    bA ~ dnorm( 0 , 0.5 ) ,
    sigma ~ dexp( 1 )
  ) , data = dd , log_lik = TRUE
)
##
## SAMPLING FOR MODEL '3fea1cdc1f8786451ed127638cb7b165' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.071 seconds (Warm-up)
## Chain 1: 0.063 seconds (Sampling)
## Chain 1: 0.134 seconds (Total)
## Chain 1:
coeftab_plot(coeftab(m5.1, m5.2, m5.3))
compare(m5.1, m5.2, m5.3, func = WAIC)
## WAIC SE dWAIC dSE pWAIC weight
## m5.1 125.9284 12.649997 0.0000000 NA 3.689475 0.5682455579
## m5.3 126.4815 12.469959 0.5531214 0.7773932 4.227350 0.4309503778
## m5.2 139.0496 9.705361 13.1212594 9.1870631 2.822440 0.0008040643
Models m5.1 and m5.3 have very similar WAIC, while m5.2 is clearly worse. Taking dWAIC and dSE together, adding marriage rate (the multiple regression m5.3) provides no predictive improvement over using median age at marriage alone. This matches the conclusion from Chapter 5: once we know median age at marriage, the marriage rate provides no additional information about the divorce rate.
9H3
Sometimes changing a prior for one parameter has unanticipated effects on other parameters. This is because when a parameter is highly correlated with another parameter in the posterior, the prior influences both parameters. Here’s an example to work and think through. Go back to the leg length example in Chapter 6 and use the code there to simulate height and leg lengths for 100 imagined individuals. Below is the model you fit before, resulting in a highly correlated posterior for the two beta parameters. This time, fit the model using ulam. Compare the posterior distribution produced by the code above to the posterior distribution produced when you change the prior for br so that it is strictly positive.
Note the constraints list. What this does is constrain the prior distribution of br so that it has positive probability only above zero. In other words, that prior ensures that the posterior distribution for br will have no probability mass below zero. Compare the two posterior distributions for m5.8s and m5.8s2. What has changed in the posterior distribution of both beta parameters? Can you explain the change induced by the change in prior?
library(rethinking)
N <- 100                          # number of individuals
set.seed(909)
height <- rnorm(N,10,2)           # sim total height of each
leg_prop <- runif(N,0.4,0.5)      # leg as proportion of height

# sim left leg as proportion + error
leg_left <- leg_prop * height + rnorm(N , 0 , 0.02)

# sim right leg as proportion + error
leg_right <- leg_prop * height + rnorm(N , 0 , 0.02)

# combine into data frame
d <- list(height = height,
          leg_left = leg_left,
          leg_right = leg_right)
m5.8s <- ulam(
  alist(
    height ~ dnorm( mu , sigma ) ,
    mu <- a + bl*leg_left + br*leg_right ,
    a ~ dnorm( 10 , 100 ) ,
    bl ~ dnorm( 2 , 10 ) ,
    br ~ dnorm( 2 , 10 ) ,
    sigma ~ dexp( 1 )
  ) , data=d, chains=4, log_lik = TRUE,
  start=list(a=10,bl=0,br=0.1,sigma=1) )
## Warning: There were 1191 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
# br lower bound at zero
m5.8s2 <- ulam(
  alist(
    height ~ dnorm( mu , sigma ) ,
    mu <- a + bl*leg_left + br*leg_right ,
    a ~ dnorm( 10 , 100 ) ,
    bl ~ dnorm( 2 , 10 ) ,
    br ~ dnorm( 2 , 10 ) ,
    sigma ~ dexp( 1 )
  ) , data=d, chains=4, log_lik = TRUE,
  constraints=list(br="lower=0"),
  start=list(a=10,bl=0,br=0.1,sigma=1) )
##
## SAMPLING FOR MODEL '99941c176379a8cca4e2c77751596ac1' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 8.237 seconds (Warm-up)
## Chain 1: 9.343 seconds (Sampling)
## Chain 1: 17.58 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL '99941c176379a8cca4e2c77751596ac1' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 0 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 8.343 seconds (Warm-up)
## Chain 2: 11.023 seconds (Sampling)
## Chain 2: 19.366 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL '99941c176379a8cca4e2c77751596ac1' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 0 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 7.242 seconds (Warm-up)
## Chain 3: 10.83 seconds (Sampling)
## Chain 3: 18.072 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL '99941c176379a8cca4e2c77751596ac1' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 0 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 8.419 seconds (Warm-up)
## Chain 4: 8.642 seconds (Sampling)
## Chain 4: 17.061 seconds (Total)
## Chain 4:
## Warning: There were 7 divergent transitions after warmup. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
## Warning: There were 1099 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
Comparing the posteriors of the two models.
precis(m5.8s)
## mean sd 5.5% 94.5% n_eff Rhat4
## a 0.9748631 0.28478330 0.5165577 1.4169806 906.1307 1.001488
## bl 0.2221700 2.44527124 -3.5677276 4.1929742 604.0307 1.002141
## br 1.7750616 2.45010299 -2.2081316 5.5593370 602.5882 1.002131
## sigma 0.6318075 0.04495998 0.5657285 0.7069093 815.1919 1.003927
precis(m5.8s2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a 0.9778575 0.29736505 0.5229600 1.4498872 736.8171 1.007853
## bl -0.8784301 1.92643703 -4.4109563 1.6538449 351.7605 1.007613
## br 2.8744066 1.92891258 0.3142771 6.4264434 352.0929 1.007655
## sigma 0.6311880 0.04538031 0.5639201 0.7085788 815.5653 1.008098
As mentioned back in Chapter 5, in this case we cannot determine the values of bl and br separately, only their sum. The sum of their posterior means is still approximately 2. However, since we have now forced br to be positive, the posterior interval for bl shifts toward more negative values.
coeftab_plot(coeftab(m5.8s, m5.8s2), pars = c("bl", "br"))
Looking at this plot, we can see that while the interval for br has moved to the right, the interval for bl has moved to the left by roughly the same amount. After all, it is only their sum that the data can pin down, so if br is constrained to be strictly positive, bl must shift toward the negative range to compensate.
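We can check this directly by comparing the posterior of the sum bl + br in the two models (my own check, not part of the original answer): the constraint shifts each coefficient, but their sum stays essentially unchanged.
# sketch: the posterior of bl + br is nearly identical in the two models
post_u <- extract.samples(m5.8s)    # unconstrained
post_c <- extract.samples(m5.8s2)   # br constrained to be positive
dens(post_u$bl + post_u$br, lwd = 2)
dens(post_c$bl + post_c$br, lwd = 2, lty = 2, add = TRUE)
legend("topright", lty = c(1, 2), legend = c("m5.8s", "m5.8s2"))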
pairs(m5.8s, pars = c("bl", "br"))
pairs(m5.8s2, pars = c("bl", "br"))
Here we can see more clearly that in the second model, where br is now right-skewed (truncated at zero), bl has become left-skewed by roughly the same amount.
9H4
For the two models fit in the previous problem, use WAIC or PSIS to compare the effective numbers of parameters for each model. You will need to use log_lik=TRUE to instruct ulam to compute the terms that both WAIC and PSIS need. Which model has more effective parameters? Why?
compare(m5.8s, m5.8s2, func = WAIC)
## WAIC SE dWAIC dSE pWAIC weight
## m5.8s2 194.6241 11.35733 0.0000000 NA 2.931005 0.5147153
## m5.8s 194.7418 11.24456 0.1177562 0.7445792 3.050454 0.4852847
compare(m5.8s, m5.8s2, func = PSIS)
## PSIS SE dPSIS dSE pPSIS weight
## m5.8s2 194.6475 11.41778 0.0000000 NA 2.942753 0.5176196
## m5.8s 194.7886 11.30956 0.1410153 0.7422116 3.073831 0.4823804
According to both WAIC and PSIS, the unconstrained model m5.8s has slightly more effective parameters (pWAIC/pPSIS of about 3.1 vs 2.9). One reason could be that the posterior standard deviations for bl and br are smaller in m5.8s2: the constraint on br reduces posterior variance, and the pWAIC penalty grows with the variance of the pointwise log-likelihood.
9H5
Modify the Metropolis algorithm code from the chapter to handle the case that the island populations have a different distribution than the island labels. This means the island’s number will not be the same as its population.
populations <- seq(100, 1000, by = 100)
island_pop <- sample(populations)   # shuffle so population no longer matches island number

num_weeks <- 1e5
positions <- rep(0, num_weeks)
current <- 10
for(i in 1:num_weeks){
  ## record current position
  positions[i] <- current

  # flip coin to generate proposal (one of the adjacent islands)
  proposal <- current + sample( c(-1, 1), size = 1)

  # boundary looping
  if( proposal < 1) proposal <- 10
  if( proposal > 10) proposal <- 1

  # move?
  prob_move <- island_pop[proposal] / island_pop[current] # changed here to use actual population sizes
  current <- ifelse(runif(1) < prob_move, proposal, current)
}
Plotting the above:
pop_prop <- island_pop / sum(island_pop)
stay_prop <- table(positions) / length(positions)

plot(1:10, pop_prop, type = 'h', lwd = 2)
lines( (1:10) + 0.5, stay_prop, type = 'h', lwd = 2, col = 'navyblue')
legend("topright", col = c("black", "navyblue"), legend = c("actual population", "time king stayed"), lwd = 2)
The two match: the time spent on each island is proportional to its (shuffled) population, not to its island number.
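A numeric version of the same check (my own addition): compare each island's share of the total population with the share of weeks the king spent there.
# numeric check: population share vs. share of weeks spent on each island
round(cbind(population_share = as.numeric(pop_prop),
            time_share = as.numeric(stay_prop)), 3)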
9H6
Modify the Metropolis algorithm code from the chapter to write your own simple MCMC estimator for globe tossing data and model from Chapter 2.
In the globe tossing example, we tossed a globe, caught it, and noted whether our finger was touching land or water. The data were used to estimate the proportion of water covering the globe. In equation form:
\[
\begin{align}
W &\sim Binomial(N, p) \\
p &\sim Uniform(0, 1)
\end{align}
\]
where W is the number of times we observed water during our N tosses and p is the proportion of water on the globe. We assume a flat Uniform(0, 1) prior for p.
We need to modify the Metropolis code above. Repeating the analogy from the text:
- Islands are the parameter values
- Population sizes are the posterior probabilities at each param value
- Weeks are the samples.
So we will start with p at some random value between 0 and 1. At each step, a proposal is generated as the current value plus or minus 0.01. The decision to move is made by computing the (unnormalized) posterior probability, likelihood times prior, of observing W waters in N tosses at both the proposal and the current value, and using their ratio as the probability of moving.
W <- 6   # from chapter 2, 6 water in 9 tosses
N <- 9   # from chapter 2

iters <- 1e5
positions <- rep(0, iters)   # keeping variable names similar for easy referencing
current <- runif(1)

for(i in 1:iters){
  ## record current position
  positions[i] <- current

  # flip coin to generate proposal (step of 0.01 up or down)
  proposal <- current + sample( c(-0.01, 0.01), size = 1)

  # boundary looping
  if( proposal < 0) proposal <- 1
  if( proposal > 1) proposal <- 0

  # compute the posterior probability as: likelihood times prior
  prob_proposal <- dbinom(W, N, proposal) * dunif(proposal, 0, 1)
  prob_current  <- dbinom(W, N, current) * dunif(current, 0, 1)

  # move?
  prob_move <- prob_proposal / prob_current
  current <- ifelse(runif(1) < prob_move, proposal, current)
}
Plotting the path the chain took (the equivalent of a trace plot) and the density of the sampled values:
withr::with_par(
  list(mfrow = c(2, 1)),
  code = {
    plot(1:iters, positions, type = "l", col = col.alpha('blue', alpha = 0.7), lwd = 0.7, main = "Path of p during 1e5 iterations")
    dens(positions, main = "Density plot for p")
    # abline(v = 0.72, lty = 2)
  }
)
The chain converges to the 0.6-0.8 range, and the density plot peaks at around 0.7.
res <- density(positions)
res$x[which.max(res$y)]
## [1] 0.683913
The distribution of p peaks at about 0.68, close to the analytical mode of 6/9 ≈ 0.67 that we expect with a flat prior.
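As a cross-check (my own addition), we can overlay the grid-approximation posterior from Chapter 2 on the Metropolis samples; the two should agree closely.
# compare Metropolis samples against the Chapter 2 grid-approximation posterior
p_grid <- seq(0, 1, length.out = 1000)
post_grid <- dbinom(W, N, p_grid) * dunif(p_grid, 0, 1)
post_grid <- post_grid / sum(post_grid) / (p_grid[2] - p_grid[1])   # normalize to a density
dens(positions, lwd = 2)
lines(p_grid, post_grid, lty = 2)
legend("topright", lty = c(1, 2), legend = c("Metropolis samples", "grid approximation"))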
9H7
Can you write your own Hamiltonian Monte Carlo algorithm for the globe tossing data, using the R code in the chapter? You will have to write your own functions for the likelihood and gradient, but you can use the HMC2 function.
The HMC algorithm needs five things to run:

- a function U that returns the negative log-probability of the data at the current position (parameter values): we can use dbinom here with the log = TRUE argument
- a function grad_U that returns the gradient of the negative log-probability at the current position (derived below)
- a step size epsilon: we will try different values
- a count of leapfrog steps L: different values will be tried
- a starting position current_q: we will start with some random number between 0 and 1

For the gradient, start from the binomial likelihood

\[
\mathcal{L}(p) = {n \choose k} p^k (1-p)^{n-k}
\]

Taking its log and differentiating with respect to p:

\[
\begin{align}
\log \mathcal{L}(p) &= \log {n \choose k} + k \log(p) + (n-k) \log(1-p) \\[2ex]
\frac{d \log \mathcal{L}(p)}{dp} &= 0 + \frac{k}{p} + (n-k)\frac{-1}{1-p} \\[2ex]
&= \frac{k}{p} - \frac{n-k}{1-p}
\end{align}
\]

Since U is the negative log-likelihood, grad_U returns the negative of this derivative.
Putting all of this together
# U needs to return neg-log-probability
U <- function( q , k = 6, n = 9) {
  -sum( dbinom(k, n, q, log = TRUE) )
}

# gradient of the neg-log-probability
U_gradient <- function( p, k = 6, n = 9 ) {
  -sum(k / p - (n - k) / (1 - p))
}

# starting position
Q <- list()
Q$q <- 0.5

step <- 0.029   # step size epsilon
L <- 7          # number of leapfrog steps

n_samples <- 100
samples <- rep(NA, n_samples)

# HMC
for (i in 1:n_samples) {
  Q <- HMC2(U, U_gradient, step, L, Q$q)
  samples[i] <- Q$q
}
# Plotting results
par(mfrow = c(2, 1))
# traceplot
plot(1:n_samples, samples, col = 'navyblue', lwd = 1, type = "l",
ylab = "proportion of water in globe", xlab = "Sample #", xlim = c(0, n_samples), ylim = c(0, 1), main = "Path p took")
# posterior density
dens(samples, xlim = c(0, 1), main = "Posterior distribution of p")
# print mean value of samples
mean(samples, na.rm = TRUE)
## [1] 0.639087
The above runs the HMC algorithm. The step size and the number of leapfrog steps can make a big difference; several combinations were tried, and it is hard to say any one is best. The posterior is centered around 0.64-0.7.
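As a final sanity check (my own addition): with a flat prior, the exact posterior for p is Beta(W + 1, N - W + 1) = Beta(7, 4), so the HMC samples should roughly track that curve (with only 100 samples the match will be coarse).
# overlay the exact Beta(7, 4) posterior on the HMC samples
dens(samples, xlim = c(0, 1), lwd = 2)
curve(dbeta(x, 6 + 1, 9 - 6 + 1), from = 0, to = 1, lty = 2, add = TRUE)
legend("topright", lty = c(1, 2), legend = c("HMC samples", "exact Beta(7, 4)"))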