Chapter 7 EoC
Easy Problems
7E1
State the three motivating criteria that define information entropy. Try to express each in your own words.
We want a measure of uncertainty that is:
- Continuous - the measure should not change abruptly: small changes in the underlying probabilities should produce only small changes in uncertainty.
- Increases with the number of events - with more possible events, uncertainty should be larger, since it is harder to say which one will happen.
- Additive - if we have uncertainty about two independent events, the uncertainty about their combination should be the sum of the two individual uncertainties (a quick numerical check follows below).
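As a quick numerical check of the additivity criterion, we can compare the entropy of the joint distribution of two independent events with the sum of their individual entropies. This is a minimal sketch; the entropy helper simply implements the formula used in 7E2 below.

# two independent events and their joint distribution
p1 <- c(0.7, 0.3)
p2 <- c(0.2, 0.25, 0.25, 0.3)
joint <- as.vector(outer(p1, p2))   # joint probabilities p1_i * p2_j

entropy <- function(p) -sum(p * log(p))
entropy(joint)                      # should equal entropy(p1) + entropy(p2)
entropy(p1) + entropy(p2)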
7E2
Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?
Entropy is the negative of the average log-probability (the expected "surprise"):
\[ H(p) = - \sum_{i=1}^n p_i \log(p_i) \]
In our case, \(p = \{0.7, 0.3\}\), thus
\[ H(p) = - \left[ 0.7 \times \log(0.7) + 0.3 \times \log(0.3) \right] \approx 0.611 \]
p <- c(0.7, 0.3)
-sum(p * log(p))
## [1] 0.6108643
7E3
Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?
p <- c(0.2, 0.25, 0.25, 0.3)
-sum(p * log(p))
## [1] 1.376227
7E4
Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?
We can drop the event with zero probability, since \(p \log(p) \rightarrow 0\) as \(p \rightarrow 0\); the remaining three sides are equally likely (see the quick check after the code).
p <- c(1/3, 1/3, 1/3)
-sum(p * log(p))
## [1] 1.098612
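Dropping the zero-probability event is justified because \(p \log(p)\) goes to 0 as \(p\) goes to 0, so a vanishing category contributes nothing to the sum. A quick illustration (the entropy0 helper is only for this check, not from the book):

# p * log(p) shrinks to 0 as p shrinks
p <- c(1e-2, 1e-4, 1e-8)
p * log(p)

# an entropy helper that drops zero-probability events explicitly
entropy0 <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }
entropy0(c(1/3, 1/3, 1/3, 0))   # equals log(3), about 1.0986, as above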
Medium Questions
7M1
Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?
\[ \begin{align} \text{AIC} &= -2\,\text{lppd} + 2p \\ \text{WAIC}(y, \Theta) &= -2 \left( \text{lppd} - \sum_i \operatorname{var}_{\theta} \log p(y_i \mid \theta) \right) \end{align} \]
TOPIC | AIC | WAIC |
---|---|---|
Estimate | average out-of-sample deviance | out-of-sample KL divergence (on the deviance scale) |
Priors | must be flat or overwhelmed by the likelihood | no assumption |
Posterior | must be approximately multivariate Gaussian | no assumption |
N vs k | N >> k (sample size much larger than parameter count) | no assumption |

WAIC is the more general criterion: it reduces to AIC when the priors are flat or overwhelmed by the likelihood, the posterior is approximately multivariate Gaussian, and the sample size N is much larger than the number of parameters k.
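To make these definitions concrete, WAIC can be computed "by hand" from posterior samples, along the lines of the book's overthinking box on WAIC calculations; the simple cars regression below is only an illustrative example (priors as in that box).

library(rethinking)
data(cars)

m <- quap(
  alist(
    dist ~ dnorm( mu , sigma ),
    mu <- a + b*speed,
    a ~ dnorm( 0 , 100 ),
    b ~ dnorm( 0 , 10 ),
    sigma ~ dexp( 1 )
  ),
  data = cars
)

post <- extract.samples(m, n = 1000)

# log-likelihood of each observation under each posterior sample
logprob <- sapply(1:1000, function(s) {
  mu <- post$a[s] + post$b[s] * cars$speed
  dnorm(cars$dist, mu, post$sigma[s], log = TRUE)
})

# lppd is the log pointwise predictive density; pWAIC is the variance penalty
n_cases <- nrow(cars)
lppd  <- sapply(1:n_cases, function(i) log_sum_exp(logprob[i, ]) - log(1000))
pWAIC <- sapply(1:n_cases, function(i) var(logprob[i, ]))
-2 * (sum(lppd) - sum(pWAIC))   # WAIC, matching the formula above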
7M2
Explain the difference between model selection and model comparison. What information is lost under model selection?
In model selection we pick the model with the lowest criterion value (CV / WAIC / PSIS / etc.) and discard the rest. This throws away the information about relative model accuracy contained in the differences between criterion values. A difference could be real or could be due to chance in the particular sample used; computing the standard error of the difference (see the sketch below) tells us whether it is large relative to that sampling uncertainty.
In model comparison we instead keep several models and use their relative criterion values, together with causal reasoning, to understand how the different variables influence predictions.
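A minimal sketch of that standard-error calculation, reusing the cars model m from the sketch in 7M1 and adding an intercept-only comparison model (both purely illustrative):

# an intercept-only model fit to the same observations
m0 <- quap(
  alist(
    dist ~ dnorm( mu , sigma ),
    mu <- a,
    a ~ dnorm( 0 , 100 ),
    sigma ~ dexp( 1 )
  ),
  data = cars
)

# pointwise WAIC values, their summed difference, and the standard error of that difference
w1 <- WAIC(m, pointwise = TRUE)$WAIC
w0 <- WAIC(m0, pointwise = TRUE)$WAIC
diff <- w0 - w1
c(dWAIC = sum(diff), dSE = sqrt(length(diff) * var(diff)))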
7M3
When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.
The criterion is computed pointwise and then summed over all observations, so its value depends on which observations, and how many of them, are used; values computed on different data are not comparable.
To see this, we can fit the same model to two different subsets of the divorce data and compare the resulting WAIC values.
library(rethinking)
data(WaffleDivorce)
d <- WaffleDivorce
d$A <- standardize( d$MedianAgeMarriage )
d$D <- standardize( d$Divorce )
d$M <- standardize( d$Marriage )
model_spec <- alist(
  D ~ dnorm( mu , sigma ) ,
  mu <- a + bM*M + bA*A ,
  a ~ dnorm( 0 , 0.2 ) ,
  bM ~ dnorm( 0 , 0.5 ) ,
  bA ~ dnorm( 0 , 0.5 ) ,
  sigma ~ dexp( 1 )
)
set.seed(123)
d1 <- d[sample(nrow(d), size = 20),]
d2 <- d[sample(nrow(d), size = 20),]

m31 <- quap(model_spec, data = d1)
m32 <- quap(model_spec, data = d2)
set.seed(456)
WAIC(m31)
## WAIC lppd penalty std_err
## 1 35.36769 -14.20154 3.482303 4.802176
WAIC(m32)
## WAIC lppd penalty std_err
## 1 43.88227 -17.04094 4.900192 6.158421
Even with the same model specification, the two samples give different WAIC values, so criterion values are only comparable between models fit to exactly the same observations.
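As a further illustrative check on the role of the number of observations, we can fit the same specification to a larger subsample; the subsample size of 40 is arbitrary.

d3 <- d[sample(nrow(d), size = 40), ]
m33 <- quap(model_spec, data = d3)
WAIC(m33)
# with 40 rather than 20 points, the WAIC total will generally be larger,
# simply because it is a sum over more observations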
7M4
What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.
Let us again work on the divorce data set.
set.seed(1234)
# m1: wider priors on the coefficients
m1 <- quap(
  alist(
    D ~ dnorm( mu , sigma ) ,
    mu <- a + bM*M + bA*A ,
    a ~ dnorm( 0 , 1 ) ,
    bM ~ dnorm( 0 , 10 ) ,
    bA ~ dnorm( 0 , 10 ) ,
    sigma ~ dexp( 1 )
  ),
  data = d
)
# m2: more concentrated priors on the coefficients
m2 <- quap(
  alist(
    D ~ dnorm( mu , sigma ) ,
    mu <- a + bM*M + bA*A ,
    a ~ dnorm( 0 , 0.5 ) ,
    bM ~ dnorm( 0 , 0.5 ) ,
    bA ~ dnorm( 0 , 0.5 ) ,
    sigma ~ dexp( 1 )
  ),
  data = d
)
set.seed(876)
PSIS(m1)
## PSIS lppd penalty std_err
## 1 131.6418 -65.82091 7.129508 16.54336
PSIS(m2)
## PSIS lppd penalty std_err
## 1 130.5846 -65.29231 6.516369 15.53061
WAIC(m1)
## WAIC lppd penalty std_err
## 1 130.1132 -58.67062 6.385955 15.25715
WAIC(m2)
## WAIC lppd penalty std_err
## 1 130.3105 -58.7935 6.361742 15.03298
The penalty term (the effective number of parameters) does decrease with the more concentrated priors, though not by much here.
7M5
Provide an informal explanation of why informative priors reduce overfitting.
More informative priors, i.e. narrower and more constrictive ones, restrict the coefficients to a smaller range of plausible values, which makes the fit less susceptible to the noise present in the particular sample used to train the model.
7M6
Provide an informal explanation of why overly informative priors result in underfitting.
Overly informative priors are even narrower and more constrictive. They prevent the model from learning the regular features of the data, which leads to underfitting (see the sketch below).
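A small simulation sketch (made up for illustration, not from the book): when the prior on a slope is far narrower than the true value warrants, the posterior estimate is shrunk toward zero and the model underfits.

library(rethinking)
set.seed(11)

# simulate data with a true slope of 1
N <- 50
x <- rnorm(N)
y <- rnorm(N, mean = x, sd = 1)
dat <- data.frame(x = x, y = y)

# weakly informative slope prior
m_weak <- quap(
  alist(
    y ~ dnorm( mu , sigma ),
    mu <- a + b*x,
    a ~ dnorm( 0 , 1 ),
    b ~ dnorm( 0 , 1 ),
    sigma ~ dexp( 1 )
  ),
  data = dat
)

# overly informative (far too narrow) slope prior
m_tight <- quap(
  alist(
    y ~ dnorm( mu , sigma ),
    mu <- a + b*x,
    a ~ dnorm( 0 , 1 ),
    b ~ dnorm( 0 , 0.05 ),
    sigma ~ dexp( 1 )
  ),
  data = dat
)

precis(m_weak)    # slope close to the true value of 1
precis(m_tight)   # slope pulled strongly toward 0: underfitting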
Hard Questions
7H1
In 2007, The Wall Street Journal published an editorial (“We’re Number One, Alas”) with a graph of corporate tax rates in 29 countries plotted against tax revenue. A badly fit curve was drawn in (reconstructed at right), seemingly by hand, to make the argument that the relationship between tax rate and tax revenue increases and then declines, such that higher tax rates can actually produce less tax revenue. I want you to actually fit a curve to these data, found in
data(Laffer)
. Consider models that use tax rate to predict tax revenue. Compare, using WAIC or PSIS, a straight-line model to any curved models you like. What do you conclude about the relationship between tax rate and tax revenue?
data("Laffer")
<- Laffer
d
plot(d, pch = 20)
$T <- standardize(d$tax_rate)
d$R <- standardize(d$tax_revenue)
d
$T2 <- d$T ^ 2
d$T3 <- d$T ^ 3
d
set.seed(176)
# Linear model
m1 <- quap(
  alist(
    R ~ dnorm(mu, sigma),
    mu <- a + bT * T,
    a ~ dnorm(0, 1),
    bT ~ dnorm(0, 2),
    sigma ~ dexp(1)
  ),
  data = d
)
# model with degree-2 polynomial
m2 <- quap(
  alist(
    R ~ dnorm(mu, sigma),
    mu <- a + bT * T + bT2*T2,
    a ~ dnorm(0, 1),
    c(bT, bT2) ~ dnorm(0, 2),
    sigma ~ dexp(1)
  ),
  data = d
)
# model with degree-3 polynomial
m3 <- quap(
  alist(
    R ~ dnorm(mu, sigma),
    mu <- a + bT * T + bT2*T2 + bT3 * T3,
    a ~ dnorm(0, 1),
    c(bT, bT2, bT3) ~ dnorm(0, 2),
    sigma ~ dexp(1)
  ),
  data = d
)
set.seed(123)
plot(coeftab(m1, m2, m3))
compare(m1, m2, m3, func = WAIC)
## WAIC SE dWAIC dSE pWAIC weight
## m2 90.11047 24.37958 0.000000 NA 7.748140 0.4331979
## m1 90.40645 23.02251 0.295971 2.642355 6.764565 0.3736087
## m3 91.72548 24.09099 1.615006 1.923569 8.952587 0.1931934
We have three models, of polynomial degree 1, 2, and 3, and their WAIC values are very close: the largest difference is about 1.6 units, while the standard error of that difference is larger than the difference itself. Hence the three models are not expected to perform differently out of sample, and there is no compelling evidence that the relationship between tax rate and tax revenue is curved; a straight line describes the data about as well as the curved alternatives.
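For a visual check, we can overlay the posterior mean curves of the linear and quadratic models on the data. This is a quick sketch using link(); the prediction grid and line types are arbitrary choices.

T_seq <- seq(from = min(d$T), to = max(d$T), length.out = 50)
mu1 <- link(m1, data = data.frame(T = T_seq))
mu2 <- link(m2, data = data.frame(T = T_seq, T2 = T_seq^2))

plot(d$T, d$R, pch = 20, xlab = "tax rate (std)", ylab = "tax revenue (std)")
lines(T_seq, colMeans(mu1))            # straight line
lines(T_seq, colMeans(mu2), lty = 2)   # quadratic curve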
7H2
In the Laffer data, there is one country with a high tax revenue that is an outlier. Use PSIS and WAIC to measure the importance of this outlier in the models you fit in the previous problem. Then use robust regression with a Student’s t distribution to revisit the curve fitting problem. How much does a curved relationship depend upon the outlier point?
m1_psis <- PSIS(m1, pointwise = TRUE)
m1_waic <- WAIC(m1, pointwise = TRUE)

plot(m1_psis$k, m1_waic$penalty, pch = 20, col = "blue")
identify(m1_psis$k, m1_waic$penalty)
## integer(0)
There seems to be only one clear outlier (the data point in row 12). We now refit the models using a Student's t likelihood.
set.seed(123456789)
# Linear model
mt1 <- quap(
  alist(
    R ~ dstudent(1, mu, sigma),
    mu <- a + bT * T,
    a ~ dnorm(0, 1),
    bT ~ dnorm(0, 2),
    sigma ~ dexp(1)
  ),
  data = d
)
# model with degree-2 polynomial
mt2 <- quap(
  alist(
    R ~ dstudent(1, mu, sigma),
    mu <- a + bT * T + bT2*T2,
    a ~ dnorm(0, 1),
    c(bT, bT2) ~ dnorm(0, 2),
    sigma ~ dexp(1)
  ),
  data = d
)
# model with degree-3 polynomial
mt3 <- quap(
  alist(
    R ~ dstudent(1, mu, sigma),
    mu <- a + bT * T + bT2*T2 + bT3 * T3,
    a ~ dnorm(0, 1),
    c(bT, bT2, bT3) ~ dnorm(0, 2),
    sigma ~ dexp(1)
  ),
  data = d
)
plot(coeftab(mt1, mt2, mt3))
set.seed(123)
compare(mt1, mt2, mt3, func = WAIC)
## WAIC SE dWAIC dSE pWAIC weight
## mt2 73.58933 12.87005 0.0000000 NA 4.724156 0.4647543
## mt3 73.89431 13.43047 0.3049781 2.480088 5.124605 0.3990233
## mt1 76.04377 12.81175 2.4544395 5.453572 3.472395 0.1362224
compare(mt1, mt2, mt3, func = PSIS)
## PSIS SE dPSIS dSE pPSIS weight
## mt2 73.67734 13.14025 0.0000000 NA 4.837084 0.4702054
## mt3 74.00320 13.74874 0.3258558 2.679828 5.193882 0.3995112
## mt1 76.24426 12.96454 2.5669143 5.428816 3.585048 0.1302835
With shape \(\nu=1\) (very heavy tails), we no longer get any high Pareto k values in PSIS. The ranking of the models is essentially unchanged, although the absolute WAIC/PSIS values have decreased, so the weak preference for a curved model does not hinge on the outlier point.
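As a quick check of the Pareto k claim, we can compare the pointwise k values of the Gaussian and Student-t straight-line models fit above:

k_gauss  <- PSIS(m1, pointwise = TRUE)$k
k_robust <- PSIS(mt1, pointwise = TRUE)$k
round(cbind(gaussian = k_gauss, student_t = k_robust), 2)
# the outlier's k should be much smaller under the Student-t likelihood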
7H3
Consider three fictional Polynesian islands. On each there is a Royal Ornithologist charged by the king with surveying the bird population. They have each found the following proportions of 5 important bird species:

|        | Species.A| Species.B| Species.C| Species.D| Species.E|
|:-------|---------:|---------:|---------:|---------:|---------:|
|Island1 |      0.20|      0.20|      0.20|     0.200|     0.200|
|Island2 |      0.80|      0.10|      0.05|     0.025|     0.025|
|Island3 |      0.05|      0.15|      0.70|     0.050|     0.050|
Notice that each row sums to 1, all the birds. This problem has two parts. It is not computationally complicated. But it is conceptually tricky. First, compute the entropy of each island’s bird distribution. Interpret these entropy values. Second, use each island’s bird distribution to predict the other two. This means to compute the KL divergence of each island from the others, treating each island as if it were a statistical model of the other islands. You should end up with 6 different KL divergence values. Which island predicts the others best? Why?
First let us calculate the entropy of the three islands.
island1 <- c(0.2 , 0.2 , 0.2 , 0.2  , 0.2  )
island2 <- c(0.8 , 0.1 , 0.05, 0.025, 0.025)
island3 <- c(0.05, 0.15, 0.70, 0.050, 0.050)

entropy <- function(p) -sum(p * log(p))

sapply(list(island1, island2, island3), entropy)
## [1] 1.6094379 0.7430039 0.9836003
Island 1 has the highest entropy and therefore the greatest uncertainty about which species a randomly sampled bird belongs to, because all five species are equally common there.
Island 2, where species A makes up 80% of the birds, has the lowest entropy: with one species so dominant there is little uncertainty.
Island 3 sits in between, just above island 2, since it is also dominated by a single species but slightly less extremely so (70% vs 80%).
Now let us calculate the KL divergences. Each value measures the additional uncertainty introduced by using a distribution q to approximate the true distribution p, so smaller values indicate better prediction.
kld <- function(p, q) sum(p * (log(p) - log(q)))

islands <- list(island1 = island1,
                island2 = island2,
                island3 = island3)

round(
  outer(islands, islands, Vectorize(kld)),
  2
)
## island1 island2 island3
## island1 0.00 0.97 0.64
## island2 0.87 0.00 2.01
## island3 0.63 1.84 0.00
In this matrix the row is the true distribution p and the column is the model q, so the entry in row i, column j is the penalty for using island j to predict island i. The island1 column has the smallest values: using island 1 to predict the others introduces the least additional uncertainty, because its high-entropy (uniform) distribution is not badly surprised by anything. The worst case is using island 3 to predict island 2 (2.01), followed by using island 2 to predict island 3 (1.84), because each assigns very low probability to the species that dominates the other island.
7H4
Recall the marriage, age, and happiness collider bias example from Chapter 6. Run models m6.9 and m6.10 again (page 178). Compare these two models using WAIC (or PSIS, they will produce identical results). Which model is expected to make better predictions? Which model provides the correct causal inference about the influence of age on happiness? Can you explain why the answers to these two questions disagree?
d <- sim_happiness(seed = 1977, N_years = 1000)

d2 <- d[ d$age > 17, ] # adults
d2$A <- (d2$age - 18) / (65 - 18)
d2$mid <- d2$married + 1
m6.9 <- quap(
  alist(
    happiness ~ dnorm(mu, sigma),
    mu <- a[mid] + bA * A,
    a[mid] ~ dnorm(0, 1),
    bA ~ dnorm(0, 2),
    sigma ~ dexp(1)
  ),
  data = d2
)

m6.10 <- quap(
  alist(
    happiness ~ dnorm(mu, sigma),
    mu <- a + bA * A,
    a ~ dnorm(0, 1),
    bA ~ dnorm(0, 2),
    sigma ~ dexp(1)
  ),
  data = d2
)

compare(m6.9, m6.10, func = WAIC)
## WAIC SE dWAIC dSE pWAIC weight
## m6.9 2713.971 37.54465 0.0000 NA 3.738532 1.000000e+00
## m6.10 3101.906 27.74379 387.9347 35.40032 2.340445 5.768312e-85
Model m6.9 has the lower WAIC, and the difference is large relative to its uncertainty (dWAIC ≈ 388, dSE ≈ 35), so m6.9 is expected to make better predictions.
The strong association between age and happiness in m6.9 arises because we condition on the collider variable, marriage status; that non-causal association is nonetheless useful for predicting happiness.
In m6.10 we do not condition on the collider, which leaves us with the truth of the simulation: age has no causal effect on happiness. This model is causally correct, but because it ignores the predictive information carried by marriage status it naturally makes worse predictions, which is exactly what the WAIC comparison shows.
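The parameter estimates make this concrete: inspecting the two fits should show a clearly negative bA in m6.9 and a bA near zero in m6.10 (output not reproduced here).

precis(m6.9, depth = 2)   # bA clearly negative: the collider-induced association
precis(m6.10)             # bA close to zero: no causal effect of age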
7H5
Revisit the urban fox data, data(foxes), from the previous chapter's practice problems. Use WAIC or PSIS based model comparison on five different models, each using weight as the outcome, and containing these sets of predictor variables:
1. avgfood + groupsize + area
2. avgfood + groupsize
3. groupsize + area
4. avgfood
5. area
Can you explain the relative differences in WAIC scores, using the fox DAG from the previous chapter? Be sure to pay attention to the standard error of the score differences (dSE).
Reproducing the DAG from previous chapter.
library(dagitty)
library(ggdag)
<- dagitty("dag{
foxdag area -> avgfood
weight <- avgfood -> groupsize
groupsize -> weight
}")
coordinates(foxdag) <- list(
x = c(area = 1, avgfood = 0, groupsize = 2, weight = 1),
y = c(area = 2, avgfood = 1, groupsize = 1, weight = 0)
)
ggdag(foxdag, layout = 'circle', node_size = 0, text_col = "black") + theme_void()
Creating models:
library(dplyr)
data(foxes)
d <- foxes %>%
  mutate(across( c(avgfood:weight), standardize)) # the group id column is not standardised
m1 <- quap(
  alist(
    weight ~ dnorm(mu, sigma),
    mu <- a + ba*avgfood + bg*groupsize + br*area,
    a ~ dnorm(0, 1),
    c(ba, bg, br) ~ dnorm(0, 2),
    sigma ~ dexp(1)
  ),
  data = d
)

m2 <- quap(
  alist(
    weight ~ dnorm(mu, sigma),
    mu <- a + ba*avgfood + bg*groupsize,
    a ~ dnorm(0, 1),
    c(ba, bg) ~ dnorm(0, 2),
    sigma ~ dexp(1)
  ),
  data = d
)

m3 <- quap(
  alist(
    weight ~ dnorm(mu, sigma),
    mu <- a + bg*groupsize + br*area,
    a ~ dnorm(0, 1),
    c(bg, br) ~ dnorm(0, 2),
    sigma ~ dexp(1)
  ),
  data = d
)

m4 <- quap(
  alist(
    weight ~ dnorm(mu, sigma),
    mu <- a + ba*avgfood,
    a ~ dnorm(0, 1),
    ba ~ dnorm(0, 2),
    sigma ~ dexp(1)
  ),
  data = d
)

m5 <- quap(
  alist(
    weight ~ dnorm(mu, sigma),
    mu <- a + br*area,
    a ~ dnorm(0, 1),
    br ~ dnorm(0, 2),
    sigma ~ dexp(1)
  ),
  data = d
)

set.seed(123)
comparison <- compare(m1, m2, m3, m4, m5, func = "WAIC")
comparison
## WAIC SE dWAIC dSE pWAIC weight
## m1 323.9864 16.99723 0.0000000 NA 5.512473 0.353823861
## m2 324.1560 16.76150 0.1695620 3.744313 4.214765 0.325062749
## m3 324.2095 16.03360 0.2231018 4.213333 4.018860 0.316476289
## m4 333.8187 13.82894 9.8322370 8.639805 2.601301 0.002592651
## m5 334.2938 13.89293 10.3073409 8.677788 2.941945 0.002044450
There doesn't seem to be any decisive difference between the models. For the first three models the dWAIC values are tiny; for m4 and m5 the dWAIC is around 10, but dSE is around 8.7, so assuming roughly Gaussian errors an interval of dWAIC plus or minus 2 dSE still includes zero.
It is easier to visualise this.
plot(comparison)
From the fox DAG in the previous chapter, we had concluded that increasing groupsize reduces weight, that increasing avgfood increases weight, and that area acts only through avgfood, with no direct effect on weight. Those conclusions came from conditioning on the variables the DAG tells us to block. Because groupsize and avgfood push weight in opposite directions and largely cancel, the total predictive benefit of these predictors is small, which is why all five models end up with similar predictive accuracy; the two models without groupsize (m4 and m5) are slightly worse because they capture only the near-zero total effect of avgfood (or area) on weight.
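The DAG-based reasoning can also be checked mechanically with dagitty, using the foxdag object defined above; the exposure/outcome pair below is just one illustrative query.

# testable implications, e.g. area independent of weight given avgfood
impliedConditionalIndependencies(foxdag)

# covariates to adjust for when estimating the total effect of area on weight
adjustmentSets(foxdag, exposure = "area", outcome = "weight")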