Statistical Rethinking Chapter 11
Binomial Regression
\[ y \sim Binomial(n, p) \]
- y is a count (zero or a positive whole number)
- p is the probability any particular “trial” is a success
- n is the number of trials
The binomial distribution has maximum entropy when each trial must result in one of two events and the expected value is constant.
Two common GLM flavours that use binomial probability functions:
- LOGISTIC REGRESSION - when outcome variable can only take 0/1
- AGGREGATED BINOMIAL REGRESSION - individual trials with the same covariate values are aggregated together. The outcome can take the value zero or any positive integer up to n (the number of trials).
Both use the logit link function and are convertible into each other by aggregating the former or exploding the outcome of the latter.
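As a quick illustration (a sketch of my own, not from the book's code), the logit link maps a probability to the whole real line and its inverse maps back:

# round trip through the logit link
p_demo <- c(0.1, 0.5, 0.9)
log_odds <- log(p_demo / (1 - p_demo))  # logit: probability -> log-odds
round(1 / (1 + exp(-log_odds)), 2)      # inverse logit recovers 0.1 0.5 0.9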
Like other GLMs, binomial regression is not guaranteed to produce a nice multivariate Gaussian posterior distribution. Thus, quadratic approximation is not always satisfactory.
Adding log-probability calculations to a Stan Model
Adding log_lik = TRUE adds a block of code to the Stan model that calculates the log-probability for each outcome. These are returned as samples in the posterior, one log-probability per observation and sample. We can extract them by setting clean = FALSE in extract.samples.
post <- extract.samples(m11.4, clean = FALSE)
str(post)
## List of 3
## $ log_lik: num [1:2000, 1:504] -0.553 -0.465 -0.576 -0.435 -0.515 ...
## $ a : num [1:2000, 1:7] -0.819 -0.444 0.112 -0.712 -0.23 ...
## $ b : num [1:2000, 1:4] 0.517 -0.08 -0.362 0.105 -0.166 ...
## - attr(*, "source")= chr "ulam posterior: 2000 samples from object"
Running the Stan code directly via stan:
m11.4_stan_code <- stancode(m11.4)
## data{
## int pulled_left[504];
## int treatment[504];
## int actor[504];
## }
## parameters{
## vector[7] a;
## vector[4] b;
## }
## model{
## vector[504] p;
## b ~ normal( 0 , 0.5 );
## a ~ normal( 0 , 1.5 );
## for ( i in 1:504 ) {
## p[i] = a[actor[i]] + b[treatment[i]];
## p[i] = inv_logit(p[i]);
## }
## pulled_left ~ binomial( 1 , p );
## }
## generated quantities{
## vector[504] log_lik;
## vector[504] p;
## for ( i in 1:504 ) {
## p[i] = a[actor[i]] + b[treatment[i]];
## p[i] = inv_logit(p[i]);
## }
## for ( i in 1:504 ) log_lik[i] = binomial_lpmf( pulled_left[i] | 1 , p[i] );
## }
m11.4_stan <- stan(model_code = m11.4_stan_code, data = dat_list, chains = 4, refresh = 0)
compare(m11.4_stan, m11.4)
## Warning in compare(m11.4_stan, m11.4): Not all model fits of same class.
## This is usually a bad idea, because it implies they were fit by different algorithms.
## Check yourself, before you wreck yourself.
## WAIC SE dWAIC dSE pWAIC weight
## m11.4_stan 531.9190 18.88619 0.000000 NA 8.309466 0.5240454
## m11.4 532.1116 18.95189 0.192512 0.1240161 8.460971 0.4759546
Relative shark and absolute deer
Above we focused on Absolute Effects: how much difference the treatment makes in the probability of pulling a lever. This focuses on the counterfactual change a variable might make on an absolute scale of measurement, like P(event).
Relative Effects are more commonly seen in logistic regressions, which are the proportional changes in the odds of an outcome. These proportional odds relative effect sizes can be calculated by exponentiating the parameter of interest.
E.g. - the proportional odds of switching from treatment 2 to 4 (adding a partner):
post <- extract.samples(m11.4)
mean(exp(post$b[,4] - post$b[,2]))
## [1] 0.9275437
The switch from 2 to 4 multiplies the odds by 0.92, an 8% reduction in odds.
- Proportional odds don't tell us whether a variable is important or not. Other parameters could make the outcome very unlikely, in which case even a large proportional odds for some variable won't impact the outcome much.
- Relative effects are needed to make causal inferences, and can be conditionally important when baseline rates change.
Example - Deer kill more people than sharks, but people are more afraid of sharks. In terms of absolute risk, deer are more dangerous, but when you are in the ocean (conditioning on being in the ocean), the risk of being killed by deer becomes irrelevant.
For general advice, absolute risk often makes more sense. But to make predictions conditional on specific circumstances, we need relative risk.
Aggregated Binomial: Chimpanzees again, condensed
In the chimpanzees data, we had one row describing the outcome of a single pull. If we don't care about the order of the pulls, we could instead aggregate, for each chimpanzee, the number of left pulls for each combination of predictor values.
data(chimpanzees)
d <- chimpanzees
d$treatment <- 1 + d$prosoc_left + 2 * d$condition
d$side <- d$prosoc_left + 1 # right 1, left 2
d$cond <- d$condition + 1 # no partner 1, partner 2
d_aggregated <- aggregate(
  d$pulled_left,
  list(treatment = d$treatment, actor = d$actor,
       side = d$side, cond = d$cond),
  sum
)
colnames(d_aggregated)[5] <- "left_pulls"
head(d_aggregated)
## treatment actor side cond left_pulls
## 1 1 1 1 1 6
## 2 1 2 1 1 18
## 3 1 3 1 1 5
## 4 1 4 1 1 6
## 5 1 5 1 1 6
## 6 1 6 1 1 14
Creating a list for ulam:
dat <- with(d_aggregated, list(
  left_pulls = left_pulls,
  treatment = treatment,
  actor = actor,
  side = side,
  cond = cond
))

m11.6 <- ulam(
  alist(
    left_pulls ~ dbinom(18, p),
    logit(p) <- a[actor] + b[treatment],
    a[actor] ~ dnorm(0, 1.5),
    b[treatment] ~ dnorm(0, 0.5)
  ), data = dat, chains = 4, log_lik = TRUE
)
For the outcome left_pulls, there are now 18 trials on each row, and the likelihood defines the probability of each count out of 18 trials (18 is the size, i.e. the number of trials).
precis(m11.6, 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] -0.43524282 0.3349244 -0.99271485 0.06985630 595.8587 1.004768
## a[2] 3.88166443 0.7439820 2.76928308 5.12654859 1215.1551 1.002619
## a[3] -0.74222146 0.3458104 -1.30417690 -0.20455884 671.6294 1.003079
## a[4] -0.74156929 0.3356084 -1.27927854 -0.21416583 640.1288 1.005457
## a[5] -0.44151691 0.3314547 -0.99411022 0.06296176 660.2078 1.003795
## a[6] 0.49004328 0.3369249 -0.05754514 1.03988955 747.9984 1.001492
## a[7] 1.97812909 0.4312653 1.30113963 2.68229867 933.4373 1.002767
## b[1] -0.05208483 0.2904467 -0.50211857 0.42994952 606.1239 1.004669
## b[2] 0.48006684 0.2943675 0.01948375 0.96513710 640.4628 1.004369
## b[3] -0.39414992 0.2823222 -0.84744524 0.06605805 622.6605 1.005406
## b[4] 0.36502775 0.2839415 -0.08428650 0.80945103 586.8508 1.005540
We have the same posterior distribution as in m11.4.
compare(m11.6, m11.4, func = PSIS)
## Warning in compare(m11.6, m11.4, func = PSIS): Different numbers of observations found for at least two models.
## Model comparison is valid only for models fit to exactly the same observations.
## Number of observations for each model:
## m11.6 28
## m11.4 504
## Some Pareto k values are high (>0.5). Set pointwise=TRUE to inspect individual points.
## PSIS SE dPSIS dSE pPSIS weight
## m11.6 114.1010 8.402184 0.0000 NA 8.372782 1.000000e+00
## m11.4 532.1668 18.973287 418.0658 9.444121 8.488602 1.652605e-91
The PSIS is very different for the two models, even though they have the same posterior distribution. This is because the aggregated model contains an extra factor in its log-probabilities.
For example, when calculating dbinom(6, 9, 0.2)
(6 successes in 9 trials):
\[ Pr(6|9, p) = \frac{9!}{6! (9-6)!} p^6 (1-p)^{9-6} \] The \({n \choose k}\) term in the beginning is not present when we are considering these as 9 separate 0/1 trials, which is why the aggregated probability gets larger.
\[ Pr(1, 1, 1, 1, 1, 1, 0, 0, 0|p) = p^6 (1-p)^{9-6} \]
This is why the PSIS/WAIC scores end up being smaller.
A simple example:
# deviance of aggregated 6 in 9
-2 * dbinom(6, 9, 0.2, log = TRUE)
# deviance of non-aggregated
-2 * sum(dbern(c(1, 1, 1, 1, 1, 1, 0, 0, 0), 0.2, log = TRUE))
## [1] 11.79048
## [1] 20.65212
The other two warnings above were regarding:
- high Pareto k values: each row now corresponds to 18 observations, so instead of leave-one-out CV, PSIS is effectively imagining leave-18-out CV
- different numbers of observations between the two models
For calculating Pareto k (and PSIS/WAIC generally), prefer the non-aggregated logistic regression.
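As a sketch of what "exploding" the aggregated data back into 0/1 rows could look like (my own code, assuming the d_aggregated data frame built above):

# expand each aggregated row into 18 Bernoulli rows:
# left_pulls ones followed by (18 - left_pulls) zeros
d_long <- d_aggregated[rep(1:nrow(d_aggregated), each = 18), ]
d_long$pulled_left <- unlist(lapply(d_aggregated$left_pulls,
                                    function(n) c(rep(1, n), rep(0, 18 - n))))
nrow(d_long)  # 28 * 18 = 504, matching the disaggregated chimpanzees data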
Aggregated Binomial: Graduate school admissions
The number of trials will differ across rows here.
library(rethinking)
data("UCBadmit")
d <- UCBadmit
d
## dept applicant.gender admit reject applications
## 1 A male 512 313 825
## 2 A female 89 19 108
## 3 B male 353 207 560
## 4 B female 17 8 25
## 5 C male 120 205 325
## 6 C female 202 391 593
## 7 D male 138 279 417
## 8 D female 131 244 375
## 9 E male 53 138 191
## 10 E female 94 299 393
## 11 F male 22 351 373
## 12 F female 24 317 341
Dataset description - graduate school applications to 6 different academic departments at UC Berkeley. admit and reject give the number of candidates accepted and rejected; applications is their sum. Each application is a 0/1 outcome, but the data are aggregated by department and gender in the above data frame. The 12 rows represent 4526 applications.
Goal: evaluate whether there is any evidence of bias due to gender in admissions. Thus we have the following model:
\[ \begin{align} A_i &\sim \operatorname{Binomial}(N_i, p_i) \\ \operatorname{logit}(p_i) &= \alpha_{GID[i]} \\ \alpha_j &\sim \operatorname{Normal}(0, 1.5) \end{align} \]
- \(N_i\) indicates applications[i], the number of applications on row i
- \(GID[i]\) indexes the gender of an applicant, with 1 as male and 2 as female
dat_list <- list(
  admit = d$admit,
  applications = d$applications,
  gid = ifelse(d$applicant.gender == 'male', 1, 2)
)
m11.7 <- ulam(
  alist(
    admit ~ dbinom(applications, p),
    logit(p) <- a[gid],
    a[gid] ~ dnorm(0, 1.5)
  ), data = dat_list, chains = 4, refresh = 0
)
precis(m11.7, depth = 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] -0.2211682 0.03840252 -0.2852569 -0.1608678 1245.137 1.004073
## a[2] -0.8305841 0.05073940 -0.9122689 -0.7496837 1571.675 1.000126
The posterior for male applicants (a[1]) is higher than that for female applicants (a[2]). To study this difference we need to compute the contrast. Next we calculate this contrast on both the logit scale and the outcome (probability) scale.
post <- extract.samples(m11.7)
diff_a <- post$a[,1] - post$a[,2]
diff_p <- inv_logit(post$a[,1]) - inv_logit(post$a[,2])
precis(list(diff_a = diff_a, diff_p = diff_p))
## mean sd 5.5% 94.5% histogram
## diff_a 0.6094160 0.06368576 0.5062891 0.7129079 ▁▁▁▃▇▇▃▂▁▁
## diff_p 0.1413241 0.01432651 0.1182072 0.1643099 ▁▁▂▃▇▇▅▂▁▁
The log-odds difference is positive, implying a higher probability of admission for male applicants. On the probability scale the difference is between 12% and 16%.
Posterior predictions of the model, using the function postcheck:
postcheck(m11.7)
# draw lines connecting points from same dept (6 dept A-F)
for(i in 1:6){
  x <- 1 + 2 * (i - 1)
  y1 <- d$admit[x] / d$applications[x]
  y2 <- d$admit[x + 1] / d$applications[x + 1]
  lines( c(x, x + 1), c(y1, y2), col = rangi2, lwd = 2)
  text( x + 0.5, (y1 + y2)/2 + 0.05, d$dept[x], cex = 0.8, col = rangi2)
}
In the above plot, blue points are the observed proportions admitted for each row in the data; points in the same department are connected by a line.
Open points, the tiny vertical black lines within them and the crosses are expected proportions, 89% intervals of the expectation and 89% interval of the simulated samples respectively.
Drawing the above via ggplot:
pred <- link(m11.7) # get posterior means
sim <- sim(m11.7)   # get posterior observations
mu.mean <- colMeans(pred)
mu.PI <- apply(pred, 2, PI)
y.PI <- apply(sim, 2, PI)

plot_data <- tibble(mu = mu.mean,
                    mu_min = mu.PI[1,], mu_max = mu.PI[2,],
                    y_min = y.PI[1,], y_max = y.PI[2,]) %>%
  bind_cols(d) %>%
  mutate(pct = admit / applications,
         y_min = y_min / applications,
         y_max = y_max / applications)

plot_data %>%
  ggplot(aes(x = 1:12, y = pct)) +
  geom_point(col = rangi2, size = 2) +
  geom_line(aes(group = dept), col = rangi2, size = 1.3) +
  geom_point(aes(y = mu)) +
  geom_linerange(aes(ymin = mu_min, ymax = mu_max)) +
  geom_point(aes(y = y_max), shape = 3) +
  geom_point(aes(y = y_min), shape = 3) +
  scale_x_continuous(breaks = 1:12) +
  expand_limits(y = c(0,1)) +
  # geom_text(aes(x = seq(1, 12, 2), y = pct,
  #               label = dept), data = plot_data[seq(1, 12, 2),]) +
  geom_text(aes(label = dept, x = seq(1, 12, 2) + .5, y = pct),
            data = group_by(plot_data, dept) %>% summarise(pct = mean(pct)),
            vjust = 0, nudge_y = 0.02) +
  theme_classic() +
  labs(x = "case",
       y = "admit",
       title = "Posterior validation check")
There are only 2 departments where women had a lower rate of admission than men (C and E); however, the model predicts a lower rate of admission for women in all departments, with about 14% lower probability for women. Why is this?
We asked the model: what are the average probabilities of admission for women and men across all departments? This misleads us, since men and women did not apply to the same departments, and different departments had different admission rates. Admission rates decline from department A to F, and women applied more to departments C-F, which had lower admission rates.
library(tidyr)
d %>%
  select(dept, applicant.gender, applications) %>%
  pivot_wider(names_from = dept, values_from = applications) %>%
  janitor::adorn_percentages(denominator = "col") %>%
  janitor::adorn_pct_formatting(digits = 0)
## applicant.gender A B C D E F
## male 88% 96% 35% 53% 33% 52%
## female 12% 4% 65% 47% 67% 48%
Just inspecting the posterior would not have revealed this. We therefore need to ask the model for the average probability of admission for women and men within each department. The model that asks this is:
\[ \begin{align} A_i &\sim \operatorname{Binomial}(N_i, p_i) \\ \operatorname{logit}(p_i) &= \alpha_{GID[i]} + \delta_{DEPT[i]} \\ \alpha_j &\sim \operatorname{Normal}(0, 1.5) \\ \delta_k &\sim \operatorname{Normal}(0, 1.5) \end{align} \] DEPT indexes departments 1..6. Each department gets its own log-odds of admission.
dat_list$dept_id <- coerce_index(d$dept)

m11.8 <- ulam(
  alist(
    admit ~ dbinom(applications, p),
    logit(p) <- a[gid] + delta[dept_id],
    a[gid] ~ dnorm(0, 1.5),
    delta[dept_id] ~ dnorm(0, 1.5)
  ), data = dat_list, chains = 4, iter = 4000, refresh = 0 # running for only 1k iterations leads to low n_eff
)
precis(m11.8, depth = 2, digits = 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] -0.5520104 0.5117017 -1.3725927 0.2682022 626.3210 1.012220
## a[2] -0.4560839 0.5131792 -1.2761948 0.3561463 624.2397 1.011948
## delta[1] 1.1322041 0.5143820 0.3084477 1.9562326 640.2520 1.012186
## delta[2] 1.0882261 0.5183216 0.2649670 1.9274819 637.1130 1.011943
## delta[3] -0.1269123 0.5139436 -0.9474742 0.6987928 627.5999 1.012040
## delta[4] -0.1600542 0.5143283 -0.9762738 0.6684237 629.1758 1.011977
## delta[5] -0.6041416 0.5175813 -1.4367327 0.2243923 625.2480 1.011818
## delta[6] -2.1573141 0.5295754 -3.0065894 -1.3076195 667.1045 1.011010
The intercept for male applicants is now, on average, a little smaller than the one for female applicants. Let us calculate the contrasts on both the logit and outcome scales again.
post <- extract.samples(m11.8)
diff_a <- post$a[,1] - post$a[,2]
diff_p <- inv_logit(post$a[,1]) - inv_logit(post$a[,2])
precis(list(diff_a = diff_a, diff_p = diff_p), digits = 2)
## mean sd 5.5% 94.5% histogram
## diff_a -0.09592655 0.08218244 -0.2254343 0.033872044 ▁▁▁▁▂▅▇▇▅▂▁▁▁▁
## diff_p -0.02144790 0.01869915 -0.0516093 0.007442745 ▁▁▂▇▇▂▁▁
Very small difference now, with male candidates worse off by about 2% on average.
The delta posterior means show lower admission rates as we go from dept A to F.
Gender influences the choice of department, and department influences the chance of admission, so there is a pipe G -> D -> A.
library(dagitty)
library(ggdag)
admission_dag <- dagitty("dag{
  G -> A;
  G -> D -> A}")
coordinates(admission_dag) <- list( x = c(G = 0, D = 1, A = 2),
y = c(G = 0, D = 1, A = 0))
ggdag(admission_dag) + theme_void()
There is an indirect causal path G -> D -> A from gender to acceptance. So to infer the direct effect G -> A, we need to condition on D to close the indirect path, which m11.8 does. This is an example of MEDIATION analysis.
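As a quick check (my own addition, not in the original notes), dagitty can report which variables must be conditioned on for the direct effect of G on A:

# for the direct effect G -> A the adjustment set should be { D }, the mediator
adjustmentSets(admission_dag, exposure = "G", outcome = "A", effect = "direct")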
postcheck(m11.8)
The model now lines up much better with the actual proportions of acceptances.
Poisson regression
Binomial GLMs are appropriate when the outcome is a count from zero to some known upper bound. When the probability of an event p is very small and the number of trials N is large, the binomial takes on a special shape: the expected value \(Np\) and the variance \(Np(1-p)\) approach each other. This special shape is the POISSON DISTRIBUTION. It allows us to model counts of events when N is very large or unknown.
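A quick numeric check of this limit (my own sketch): a binomial with many trials and a tiny success probability is well approximated by a Poisson with \(\lambda = Np\).

# Binomial(1000, 0.002) versus Poisson(2) for the first few counts
n_trials <- 1000
p_small <- 0.002
round(dbinom(0:4, size = n_trials, prob = p_small), 4)
round(dpois(0:4, lambda = n_trials * p_small), 4)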
Poisson distribution has only a single parameter.
\[ y_i \sim \operatorname{Poisson}(\lambda) \] \(\lambda\) is the expected value of the outcome y, as well as the expected variance of the counts y. The conventional link function is the log link.
\[ \begin{align} y_i &\sim \operatorname{Poisson}(\lambda_i) \\ \log(\lambda_i) &= \alpha + \beta(x_i - \bar x) \end{align} \] The log link ensures that \(\lambda_i\) is always positive (required for the expected value of a count outcome). It also implies an exponential relationship between the predictors and the expected value. Exponential relationships grow very quickly, so we need to check whether the log link makes sense at all ranges of the predictor variables.
Example: Oceanic tool complexity
Theories predict that larger populations will both develop and sustain more complex tool kits. The islands of Oceania provide natural variation in population size due to differences in island size, and thus can be used to test these ideas. Another suggestion is that contact rates among populations effectively increase population size (as it’s relevant to tech evolution).
We will work with data on counts of unique tool types for 10 historical Oceanic societies.
library(rethinking)
data("Kline")
d <- Kline
d
## culture population contact total_tools mean_TU
## 1 Malekula 1100 low 13 3.2
## 2 Tikopia 1500 low 22 4.7
## 3 Santa Cruz 3600 low 24 4.0
## 4 Yap 4791 high 43 5.0
## 5 Lau Fiji 7400 high 33 5.0
## 6 Trobriand 8000 high 19 4.0
## 7 Chuuk 9200 high 40 3.8
## 8 Manus 13000 low 28 6.6
## 9 Tonga 17500 high 55 5.4
## 10 Hawaii 275000 low 71 6.6
Note: the number of rows is not the same as the "sample size" in a count model.
Any rules you’ve been taught about minimum sample sizes for inference are just non-Bayesian superstitions. If you get the prior back, then the data aren’t enough. It’s that simple.
total_tools is the outcome variable. We will model the following ideas:
- the number of tools increases with log population size (the theory says log, that's why log)
- the number of tools increases with the contact rate among islands; islands that are better networked may acquire or sustain more tool types
- the impact of population on tool counts is moderated by high contact, i.e. we will look for a positive interaction between log population and contact rate
some pre-processing first:
d$P <- scale(log(d$population))
d$contact_id <- ifelse(d$contact == "high", 2, 1)
Mathematically, the model is:
\[ \begin{align} T_i &\sim \operatorname{Poisson}(\lambda_i) \\ \log \lambda_i &= \alpha _{CID[i]} + \beta_{CID[i]} \log P_i \\ \alpha_j &\sim TBD \\ \beta_j &\sim TBD \end{align} \]
P is population (the standardized log population in the code), and CID is contact_id.
Again, because of the link function, a flat prior on the linear scale will not be flat on the outcome scale. For instance, consider an intercept-only model with a vague Normal(0, 10) prior:
\[ \begin{align} T_i &\sim \operatorname{Poisson}(\lambda_i) \\ \log \lambda_i &= \alpha \\ \alpha &\sim \operatorname{Normal}(0, 10) \end{align} \] Since \(\alpha\) has a normal distribution, \(\lambda\) will have a log-normal distribution. Plotting the above prior on the outcome scale:
curve( dlnorm(x, 0, 10), from = 0, to = 100, n = 200)
The horizontal range is 0-100, based on the knowledge that historical tool kits in the Pacific were in this range. For the above prior, there is a huge spike near zero (implying zero tools on average) and then a very long tail. The mean of a log-normal distribution is \(\exp (\mu + \sigma^2 / 2)\), which here evaluates to \(\exp(50)\), an absurdly large number.
# some simulation of above point
a <- rnorm(1e4, 0, 10)
lambda <- exp(a)
mean(lambda)
## [1] 5.04663e+11
A slightly better prior is Normal(3, 0.5):
curve( dlnorm(x, 3, 0.5), from = 0, to = 100, n = 200)
The mean is now \(\exp (3 + 0.5^2/2)\), a little over 20. We have not looked at the mean of the total_tools column, and we should not, since we are deciding on priors.
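A quick check of that prior mean (my own sketch):

# analytic mean of the log-normal implied by a ~ Normal(3, 0.5), plus a simulated check
exp(3 + 0.5^2 / 2)
mean(exp(rnorm(1e5, 3, 0.5)))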
Another prior is required for \(\beta\). We can check a conventional flat prior like Normal(0, 10) first.
We will simulate together with the intercept and plot 100 trends of standardized log population against total tools:
N <- 100
a <- rnorm(N, 3, 0.5)
b <- rnorm(N, 0, 10)
plot(NULL, xlim = c(-2, 2), ylim = c(0, 100), xlab = "log population (std)", ylab = "total tools")
mtext("b ~ dnorm(0, 10)")
for(i in 1:N)
curve( exp(a[i] + b[i] * x), add = TRUE, col = grau())
The majority of these priors imply either explosive growth just after the mean log population (0) or a similarly steep decline just before it.
Let's try again with a Normal(0, 0.2):
set.seed(10)
N <- 100
a <- rnorm(N, 3, 0.5)
b <- rnorm(N, 0, 0.2)
plot(NULL, xlim = c(-2, 2), ylim = c(0, 100), xlab = "log population (std)", ylab = "total tools")
mtext("b ~ dnorm(0, 0.2)")
for(i in 1:N)
curve( exp(a[i] + b[i] * x), add = TRUE, col = grau())
Strong relationships are still possible, but most of the mass is for rather flat relationships between total tools and log population.
Viewing these outcomes on the original scale is easier to understand, so we will do that now (population also has a natural zero, which we want to keep in sight). First we view them on the unstandardized log scale.
x_seq <- seq(from = log(100), to = log(200000), length.out = 100)
lambda <- sapply(x_seq, function(x) exp(a + b*x))
plot(NULL, xlim = range(x_seq), ylim = c(0, 500), xlab = 'log population', ylab = "total tools")
for(i in 1:N)
  lines(x_seq, lambda[i,], col = grau(), lwd = 1.5)
mtext("a ~ dnorm(3, 0.5), b ~ dnorm(0, 0.2)")
We expect to see at most around 100 tools in the data. Most of the trends are in that range, although a few explosive options still remain.
The same on the original population scale:
plot(NULL, xlim = range(exp(x_seq)), ylim = c(0, 500), xlab = 'population', ylab = "total tools")
for(i in 1:N)
  lines(exp(x_seq), lambda[i,], col = grau(), lwd = 1.5)
mtext("a ~ dnorm(3, 0.5), b ~ dnorm(0, 0.2)")
On the raw population scale, the curves bend the other direction. (Diminishing returns with increasing population).
Next we create the two models: a simple intercept-only model and the interaction model.
dat <- list(
  T = d$total_tools,
  P = d$P,
  cid = d$contact_id
)

# intercept only
m11.9 <- ulam(
  alist(
    T ~ dpois(lambda),
    log(lambda) <- a,
    a ~ dnorm(3, 0.5)
  ), data = dat, chains = 4, log_lik = TRUE
)

# interaction model
m11.10 <- ulam(
  alist(
    T ~ dpois(lambda),
    log(lambda) <- a[cid] + b[cid] * P,
    a[cid] ~ dnorm(3, 0.5),
    b[cid] ~ dnorm(0, 0.2)
  ), data = dat, chains = 4, log_lik = TRUE
)
comparing the two with PSIS:
k <- PSIS(m11.10, pointwise = TRUE)$k
## Some Pareto k values are very high (>1). Set pointwise=TRUE to inspect individual points.
compare(m11.9, m11.10, func = PSIS)
## Some Pareto k values are very high (>1). Set pointwise=TRUE to inspect individual points.
## PSIS SE dPSIS dSE pPSIS weight
## m11.10 85.87362 13.45394 0.00000 NA 7.243008 1.000000e+00
## m11.9 140.94833 33.31489 55.07471 32.71973 7.956396 1.098195e-12
- We get a warning about Pareto k, suggesting influential points (this data set has only 10 points, so this is quite likely).
- The intercept-only model has a worse score than the interaction model.
- The intercept-only model has a higher "effective number of parameters" (pPSIS).
Model complexity (a model's tendency to overfit) and parameter count only have a clear relationship in a simple linear regression with flat priors. Once a distribution is bounded, parameter values near the boundary produce less overfitting than those far from it. The same applies to data distributions: any count near zero is harder to overfit. Overfitting risk therefore depends both on the structural details of the model and on the composition of the sample.
In this sample, a major source of overfitting risk is the highly influential point flagged by PSIS. We next plot the data along with the posterior predictions for the expected number of tools at each population size and contact rate:
plot(dat$P, dat$T, xlab = "log population (std)", ylab = "total tools",
col = rangi2, pch = ifelse(dat$cid == 1, 1, 16), lwd = 2,
ylim = c(0, 75), cex = 1 + normalize(k))
# set up the horizontal axis values to compute predictions at
ns <- 100
P_seq <- seq(from = -1.4, to = 3, length.out = ns)

# predictions for cid = 1 (low contact)
lambda <- link(m11.10, data = data.frame(P = P_seq, cid = 1))
lmu <- apply(lambda, 2, mean)
lci <- apply(lambda, 2, PI)
shade(lci, P_seq, xpd = TRUE)
lines(P_seq, lmu, lty = 2, lwd = 1.5)

# predictions for cid = 2 (high contact)
lambda2 <- link(m11.10, data = data.frame(P = P_seq, cid = 2))
lmu2 <- apply(lambda2, 2, mean)
lci2 <- apply(lambda2, 2, PI)
shade(lci2, P_seq, xpd = TRUE)
lines(P_seq, lmu2, lty = 1, lwd = 1.5)
The same plot in ggplot2.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(glue)
library(patchwork)
library(scales)

# set up the horizontal axis values to compute predictions at
ns <- 100
P_seq <- seq(from = -1.4, to = 3, length.out = ns)

# predictions for cid = 1 (low contact)
lambda <- link(m11.10, data = data.frame(P = P_seq, cid = 1))
lmu <- apply(lambda, 2, mean)
lci <- apply(lambda, 2, PI)

# predictions for cid = 2 (high contact)
lambda2 <- link(m11.10, data = data.frame(P = P_seq, cid = 2))
lmu2 <- apply(lambda2, 2, mean)
lci2 <- apply(lambda2, 2, PI)

P_scale <- attributes(dat$P)$'scaled:scale'
P_center <- attributes(dat$P)$'scaled:center'

real_scale <- function(x) exp(x * P_scale + P_center)

d2 <- d %>%
  mutate(cid = factor(contact_id, labels = c("Low", "High")),
         P_real = real_scale(P),
         Pareto_k = k,
         label = ifelse(k < 0.5, culture, glue("{culture} ({round(Pareto_k, 2)})")))

plot_data <- tibble(P_seq,
                    lmu1_mu = lmu, lci1_min = lci[1,], lci1_max = lci[2,],
                    lmu2_mu = lmu2, lci2_min = lci2[1,], lci2_max = lci2[2,]) %>%
  pivot_longer(cols = -P_seq, names_pattern = ".{3}(\\d)_(.*)",
               names_to = c("cid", ".value")) %>%
  mutate(cid = factor(cid, labels = c("Low", "High")),
         cid = as.character(cid),
         P_real = real_scale(P_seq))

p1 <- plot_data %>%
  ggplot(aes(x = P_seq, y = mu, fill = cid)) +
  geom_line(aes(linetype = cid), show.legend = FALSE) +
  geom_ribbon(aes(ymin = min, ymax = max), alpha = 0.3) +
  geom_point(aes(P, total_tools, shape = cid),
             data = d2,
             shape = ifelse(dat$cid == 1, 1, 16),
             size = 3 + normalize(k),
             stroke = 1.5, colour = '#1E59AE') +
  geom_text(aes(label = label, x = P, y = total_tools), data = d2,
            check_overlap = TRUE, vjust = 0, hjust = 0) +
  labs(x = "log population (std)", y = "total tools",
       fill = "Contact", shape = "Contact",
       title = "Posterior predictions by `m11.10`",
       subtitle = "High Pareto k values plotted")

p2 <- plot_data %>%
  ggplot(aes(x = P_real, y = mu, fill = cid)) +
  geom_line(aes(linetype = cid), show.legend = FALSE) +
  geom_ribbon(aes(ymin = min, ymax = max), alpha = 0.3) +
  geom_point(aes(P_real, total_tools, shape = cid),
             data = d2,
             shape = ifelse(dat$cid == 1, 1, 16),
             size = 3 + normalize(k), stroke = 1.5, colour = '#1E59AE') +
  labs(x = "population", y = "total tools",
       fill = "Contact", shape = "Contact",
       title = "Posterior predictions by `m11.10` with predictor on real scale") +
  scale_x_continuous(labels = comma,
                     limits = c(real_scale(c(-2, 2.4))))

p1 / p2
Hawaii, Tonga, Yap, and Trobriand have high Pareto k values and are influential points. Most are not too influential, but Hawaii is very influential, which is most obvious in the plot with population on the natural scale: it has an extreme population size and the most tools.
This does not mean that Hawaii is an outlier and should be removed, only that it strongly influences the posterior distribution.
Also from the figure we can see that high-contact societies have a higher trend than low-contact societies when population size is low, but at high population the means cross, and high contact can even be lower according to the model. The model believes this because there are no high-contact societies with a population as large as Hawaii in the data. However, a counterfactual Hawaii with the same population but high contact should have at least as many tools as the real Hawaii.
The model allows this because the intercept is a free parameter (it can take any value the data suggest), whereas we would like the trend for \(\lambda\) to pass through the origin: the total number of tools at zero population should be zero!
Tools: Scientific Model
A scientific model would say that tools develop over time and are not produced all at once. Innovation adds them, and innovation is proportional to population but with diminishing returns. Processes of loss remove tools, and this loss is proportional to the number of tools, without diminishing returns. These forces balance to produce a tool kit of some equilibrium size.
Thus we can write the change in the expected number of tools in one time step as:
\[ \Delta T=\alpha P^\beta - \gamma T \] P is population size, T is the number of tools, and \(\alpha\), \(\beta\), \(\gamma\) are parameters to be estimated.
At equilibrium, \(\Delta T=0\), solving for T:
\[ \hat T = \frac{\alpha P ^ \beta}{\gamma} \] We will use this inside a Poisson model, which is the max entropy model for a count variable with no upper bound. But we don’t have a linear model now nor a link function:
\[ \begin{align} T_i &\sim \operatorname{Poisson}(\lambda_i) \\ \lambda_i &= \alpha P_i^\beta / \gamma \end{align} \] We need to ensure that \(\lambda\) remains positive, and for that we need to ensure that the three parameters remain positive. We will use exponential and log-normal priors for this. Exponential for \(\beta\) and \(\gamma\) and log-normal for \(\alpha\). We also want to allow interaction with contact rate, and we will allow \(\alpha\) and \(\beta\) to vary with it.
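Before fitting, a rough prior-predictive sketch (my own addition, not from the original notes) of the tool curves implied by the priors used below:

# sample from the priors used in m11.11 and plot the implied curves T = exp(a) * P^b / g
n_sim <- 50
a_sim <- rnorm(n_sim, 1, 1)   # alpha on the log scale
b_sim <- rexp(n_sim, 1)       # elasticity of population, forced positive
g_sim <- rexp(n_sim, 1)       # loss rate, forced positive
P_grid <- seq(100, 300000, length.out = 100)
plot(NULL, xlim = range(P_grid), ylim = c(0, 200),
     xlab = "population", ylab = "total tools (prior predictive)")
for (i in 1:n_sim)
  lines(P_grid, exp(a_sim[i]) * P_grid^b_sim[i] / g_sim[i], col = grau())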
dat2 <- list(T = d$total_tools, P = d$population, cid = d$contact_id)
m11.11 <- ulam(
  alist(
    T ~ dpois(lambda),
    lambda <- exp(a[cid]) * P^b[cid] / g,
    a[cid] ~ dnorm(1, 1),
    b[cid] ~ dexp(1),
    g ~ dexp(1)
  ), data = dat2, chains = 4, log_lik = TRUE
)
Plotting the posterior predictions for above:
P_seq2 <- real_scale(P_seq)

get_mu_ci <- function(model, cid){
  mu <- link(model, data = data.frame(P = P_seq2, cid = cid))
  lmu <- colMeans(mu)
  ci <- apply(mu, 2, PI)
  tibble(P = P_seq2, mu = lmu, min = ci[1,], max = ci[2,], Contact = cid)
}

plot_data2 <- bind_rows(get_mu_ci(m11.11, 1), get_mu_ci(m11.11, 2)) %>%
  mutate(Contact = factor(Contact, labels = c("Low", "High")))

plot_data2 %>%
  ggplot(aes(x = P, y = mu)) +
  geom_line(aes(linetype = Contact), show.legend = TRUE) +
  geom_ribbon(aes(ymin = min, ymax = max, fill = Contact), alpha = 0.3) +
  geom_point(aes(P_real, total_tools),
             data = d2,
             shape = ifelse(dat$cid == 1, 1, 16), size = 3 + normalize(k), colour = '#1E59AE') +
  labs(x = "population", y = "total tools",
       title = "Posterior predictions by `m11.11` with predictor on real scale") +
  scale_x_continuous(labels = comma, limits = c(0, 3e5)) +
  ylim(0, 200)
This time the trends for both high- and low-contact societies pass through the origin by design, and thus the high-contact trend always remains above the low-contact one.
Negative Binomial (gamma-Poisson) models
Typically there is a lot of unexplained variation in Poisson models, possibly from unobserved influences (case to case), generating variation in the true \(\lambda\)’s. This variation is called rate heterogeneity and ignoring it can cause confounds. A common extension to Poisson GLMs is to use a Negative Binomial distribution. Also called Gamma-Poisson, it is a mixture of different Poisson distributions and is the Poisson analogue of Student-t model (which is a mixture of different normal distributions).
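A small simulation (my own sketch, not from the original notes) of the mixture idea: if each observation gets its own rate drawn from a gamma distribution, the resulting counts are over-dispersed relative to a plain Poisson with the same mean.

# plain Poisson with mean 2 versus a gamma-Poisson mixture with the same mean
lambda_mix <- rgamma(1e4, shape = 2, rate = 1)  # heterogeneous rates, mean 2
y_pois <- rpois(1e4, 2)
y_gampo <- rpois(1e4, lambda_mix)
c(var(y_pois), var(y_gampo))  # the mixture's variance exceeds its mean of 2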
Example: Exposure and the offset
\(\lambda\) in a Poisson model is interpreted both as the expected value and as a rate (number of events per unit time or distance). This dual interpretation lets us model cases where the EXPOSURE varies across cases i. For instance, if one set of records gives events per day and another gives events per week, we can model both as follows:
\(\lambda = \mu / \tau\) (lambda = expected number of events \(\mu\) per unit time or distance \(\tau\)). The link then becomes:
\[ \begin{align} y_i &\sim \operatorname{Poisson}(\lambda_i) \\ \log \lambda_i &= \log \frac{\mu_i}{\tau_i} = \alpha + \beta x_i \\ \log \lambda_i &= \log \mu_i- \log\tau_i = \alpha + \beta x_i \\ \text{thus we have:}\\ \log \mu_i &= \log\tau_i + \alpha + \beta x_i \\ \end{align} \] When \(\tau_i=1\), then equation becomes as before. But in other cases where values differ, we can now model them as:
\[ \begin{align} y_i &\sim \operatorname{Poisson}(\mu_i) \\ \log \mu_i &= \log \tau_i + \alpha + \beta x_i \end{align} \] \(\tau\) is a column in the data. We can also put a parameter in front of it to model the hypothesis that it is not constant with time.
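For example (my own notation, not from the book), putting a free coefficient \(\beta_\tau\) on the log-exposure, with \(\beta_\tau = 1\) recovering the fixed offset:
\[ \begin{align} y_i &\sim \operatorname{Poisson}(\mu_i) \\ \log \mu_i &= \beta_\tau \log \tau_i + \alpha + \beta x_i \end{align} \]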
Poisson example where exposure varies across observations
The Poisson distribution assumes the rate of events is constant in time (or space). So when the length of observation, the area of sampling, or the intensity of sampling varies, the counts also vary naturally. To handle this we use the method above of adding the log of the exposure to the model.
We will use simulation for such an example.
Example: Assume there’s a monastery, where monks write/copy manuscripts each day. We have the data on the number of manuscripts completed each day. Suppose that the true rate is \(\lambda = 1.5\) manuscripts per day.
Simulating a month of daily counts:
num_days <- 30
y <- rpois(num_days, 1.5)
Assume that the monastery is turning a tidy profit, so there are considerations about purchasing another monastery; before the purchase we would like to know how productive it would be. Unfortunately, the target monastery does not keep daily records, only weekly totals. Suppose the daily rate at the new monastery is actually \(\lambda = 0.5\) manuscripts per day. To simulate data on a weekly basis, we multiply this average by 7, the exposure:
num_weeks <- 4
y_new <- rpois(num_weeks, 0.5 * 7)
To analyze both y and y_new, the daily and weekly totals for the two monasteries, we add the logarithm of the exposure to the linear model.
# organising the data in a data.frame
y_all <- c(y, y_new)
exposure <- c(rep(1, 30), rep(7, 4))
monastery <- c(rep(0, 30), rep(1, 4))
d <- data.frame(y = y_all, days = exposure, monastery = monastery)

# viewing few rows from top and bottom
d[c(1:5, 30:34),]
## y days monastery
## 1 4 1 0
## 2 3 1 0
## 3 1 1 0
## 4 1 1 0
## 5 3 1 0
## 30 0 1 0
## 31 1 7 1
## 32 4 7 1
## 33 3 7 1
## 34 3 7 1
Building the model:
# compute the offset
d$log_days <- log(d$days)

# fit the model
m11.12 <- quap(
  alist(
    y ~ dpois(lambda),
    log(lambda) <- log_days + a + b*monastery,
    a ~ dnorm(0, 1),
    b ~ dnorm(0, 1)
  ), data = d
)
To compute the posterior distribution of \(\lambda\) in each monastery, we use our usual approach of sampling from the posterior and then using the linear model, but without the offset. We don't use the offset when computing predictions because the parameters are already on the daily scale for both monasteries.
post <- extract.samples(m11.12)
lambda_old <- exp(post$a)
lambda_new <- exp(post$a + post$b)
precis(data.frame(lambda_old, lambda_new))
## mean sd 5.5% 94.5% histogram
## lambda_old 1.7540189 0.2406760 1.402250 2.1600524 ▁▁▅▇▅▂▁▁▁▁
## lambda_new 0.4581979 0.1296616 0.286542 0.6862387 ▁▂▇▇▅▂▁▁▁▁▁▁▁
We can see that the new monastery produces only about 0.5 manuscripts per day, far fewer than the old one, so it is not worth buying.
Multinomial and categorical models
If instead of only two possible events we have more than two, with the probability of each event constant across trials, then the maximum entropy distribution is the MULTINOMIAL DISTRIBUTION.
If there are K types of events with probabilities \(p_1, ..., p_K\), then the probability of observing \(y_1, ..., y_K\) events of each type out of n trials is:
\[ \large Pr(y_1, ..., y_K |n, p_1, ..., p_K) = \frac{n!}{\prod_i y_i!} \prod_{i=1}^{K} p_i^{y_i} \]
The fraction at the front just counts the number of different orderings that give the same counts \(y_1, ..., y_K\) (the multiplicity).
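A quick numeric check of this formula (my own sketch) using R's built-in dmultinom:

# probability of counts (6, 2, 1) in n = 9 trials, computed two ways
p_k <- c(0.5, 0.3, 0.2)
y_k <- c(6, 2, 1)
dmultinom(y_k, prob = p_k)
factorial(9) / prod(factorial(y_k)) * prod(p_k^y_k)  # matches the formula above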
If we instead put each event in its own row (with an outcome y taking one of K values to indicate the event type), then the model can also be called a categorical regression (in machine learning, a maximum entropy classifier).
Building a GLM from a multinomial is complicated, since the number of modeling choices grows with the number of event types. There are also two different approaches to constructing the likelihood in such cases: the first is based directly on the multinomial likelihood and uses a generalisation of the logit link (called the explicit approach from here on); the second transforms the multinomial likelihood into a series of Poisson likelihoods.
The conventional and natural link is the MULTINOMIAL LOGIT, also known as the SOFTMAX function. It takes a vector of scores, one for each event type, and computes the probability of a particular type of event k as:
\[ Pr(k|s_1, s_2, ..., s_K) = \frac{\exp (s_k)}{\sum_{i=1}^K \exp(s_i)} \]
Combined with this link, this type of GLM may be called multinomial logistic regression. We get multiple linear models: \(K-1\) linear models for K types of events. One of the outcome values is chosen as a "pivot" and the others are modeled relative to it. In each of the \(K - 1\) linear models, the predictors can be different.
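A small sketch of the softmax link and of why one score can be fixed at zero as the pivot (softmax_fn is my own helper; the rethinking package provides an equivalent softmax):

# adding the same constant to every score leaves the probabilities unchanged,
# so one score can always be pivoted to zero
softmax_fn <- function(s) exp(s) / sum(exp(s))
s <- c(0.5, 1.0, 2.5)
round(softmax_fn(s), 3)
round(softmax_fn(s - s[3]), 3)  # identical probabilities with the third score at 0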
There are two basic cases:
- predictors have different values for different values of the outcome - useful when each type of event has its own quantitative traits and we want to estimate the association between those traits and the prob with which each event type appears in the data
- parameters are distinct for each value of the outcome - useful when we are interested in features of some entity that produce each event
Predictors matched to outcomes
Example: modelling career choice for a number of young adults, with income as a relevant predictor. The same parameter \(\beta_{Income}\) appears in each linear model, but a different income value multiplies the parameter in each linear model.
The simulation below has 3 careers, each with its own income trait. These traits are used to assign a score to each type of event. During the model fit, one of these scores is held constant and the other 2 are estimated using the known income traits (one career serves as the base category).
# simulate career choices among 500 individuals
N <- 500
income <- c(1, 2, 5)
score <- 0.5 * income
p <- softmax(score[1], score[2], score[3])

# now simulate career choice
career <- rep(NA, N)
set.seed(34302)
for(i in 1:N)
  career[i] <- sample(1:3, size = 1, prob = p)
We will use the dcategorical likelihood, which is the multinomial logistic regression distribution. It applies when the outcome variable (career) contains the individual event type on each row. To convert the scores to probabilities, we will use softmax. Each possible career gets its own linear model with its own features. If income doesn't predict career choice, we still want an intercept to account for differences in frequency.
We model the following directly in stan:
code_m11.13 <- "
data{
int N; //number of individuals
int K; //no of possible careers
int career[N]; //outcome
vector[K] career_income;
}
parameters{
vector[K-1] a; //intercepts
real<lower=0> b; // association of income with choice
}
model{
vector[K] p;
vector[K] s;
a ~ normal(0, 1);
b ~ normal(0, 0.5);
s[1] = a[1] + b * career_income[1];
s[2] = a[2] + b * career_income[2];
s[3] = 0; //pivot
p = softmax(s);
career ~ categorical(p);
}
"
# setting up the data list
dat_list <- list(N = N, K = 3, career = career, career_income = income)

# invoking stan
m11.13 <- stan(model_code = code_m11.13, data = dat_list, chains = 4, refresh = 0)
## Warning: There were 20 divergent transitions after warmup. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
## Warning: Examine the pairs() plot to diagnose sampling problems
precis(m11.13, 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] -2.1334023 0.1825217 -2.43719901 -1.8484860 936.1140 1.002283
## a[2] -1.7836026 0.2438831 -2.22753073 -1.4580678 694.6702 1.004127
## b 0.1309601 0.1081770 0.00876241 0.3434542 662.6389 1.003789
Since the parameters are relative to the pivot, they are hard to interpret. Depending on the pivot, they could end up positive or negative, but we get the same results on the outcome scale. From the above we can only say that the coefficient of career income b is positive; the effect size is not clear.
We will now conduct a counterfactual simulation in which the income of one career is changed, and ask how the probabilities change in the presence of the competing careers. Let's imagine doubling the income of career 2:
post <- extract.samples(m11.13)

# set up logit scores
s1 <- with(post, a[,1] + b*income[1])
s2_orig <- with(post, a[,2] + b*income[2])
s2_new <- with(post, a[,2] + b*income[2]*2)

# compute probabilities for original and counterfactual
p_orig <- sapply(1:length(post$b), function(i) softmax(c(s1[i], s2_orig[i], 0)))
p_new <- sapply(1:length(post$b), function(i) softmax(c(s1[i], s2_new[i], 0)))

# summarize
p_diff <- p_new[2,] - p_orig[2,]
precis(p_diff)
## mean sd 5.5% 94.5% histogram
## p_diff 0.04092539 0.03737854 0.002381389 0.1160494 ▇▅▃▂▁▁▁▁▁▁▁▁▁
There is an average 4% increase in the probability of choosing the career when its income is doubled. This value is conditional on comparing to the other careers in the calculation, since these models do not produce independent predictions for a specific set of options.
Predictors matched to observations
This time each observed outcome has unique predictor values. For the career choice example, suppose we now want to estimate the association between each person's family income and which career they choose. The predictor variable then has the same value in each linear model for each row in the data, but there is a unique parameter multiplying it in each linear model.
N <- 500
# simulate family incomes for each individual
family_income <- runif(N)

# assign a unique coefficient for each type of event
b <- c(-2, 0, 2)
career <- rep(NA, N)
for(i in 1:N){
  score <- 0.5 * (1:3) + b*family_income[i]
  p <- softmax(score[1], score[2], score[3])
  career[i] <- sample(1:3, size = 1, prob = p)
}
code_m11.14 <- "
data{
int N;
int K;//
int career[N]; //outcome
real family_income[N];
}
parameters{
vector[K-1] a;
vector[K-1] b;
}
model{
vector[K] p;
vector[K] s;
a ~ normal(0, 1.5);
b ~ normal(0, 1);
for(i in 1:N){
for(j in 1:(K-1))
s[j] = a[j] + b[j] + b[j] * family_income[i];
s[K] = 0; //the pivot
p = softmax(s);
career[i] ~ categorical(p);
}
}
"
dat_list <- list(N = N, K = 3, career = career, family_income = family_income)
m11.14 <- stan(model_code = code_m11.14, data = dat_list, chains = 4, refresh = 0)
precis(m11.14, 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 1.518171 0.7384046 0.3329661 2.702179 1234.954 1.004278
## a[2] 1.621640 0.5477794 0.7401744 2.510460 1176.257 1.000157
## b[1] -2.785628 0.5377749 -3.6638384 -1.933436 1227.779 1.004176
## b[2] -2.147225 0.3854539 -2.7639371 -1.529621 1226.078 1.000071
Again these parameters are hard to interpret directly, so we use posterior predictions and look at the results on the outcome scale.
post <- extract.samples(m11.14)

# set up logit scores; use the same family income for s1 and s2_orig, double it for s2_new
s1 <- with(post, a[,1] + b[,1] * family_income[1])
s2_orig <- with(post, a[,2] + b[,2] * family_income[1])
s2_new <- with(post, a[,2] + b[,2] * family_income[1] * 2)

# compute probabilities for original and counterfactual
p_orig <- sapply(1:nrow(post$b), function(i) softmax(c(s1[i], s2_orig[i], 0)))
p_new <- sapply(1:nrow(post$b), function(i) softmax(c(s1[i], s2_new[i], 0)))

# summarize
p_diff <- p_new[2,] - p_orig[2,]
precis(p_diff)
## mean sd 5.5% 94.5% histogram
## p_diff -0.1154827 0.02817352 -0.1557814 -0.06510914 ▁▁▃▇▇▃▂▁▁
The difference is about 12% and negative, implying that doubling family income decreases the probability of choosing career 2 by roughly 12%.
Multinomial in disguise as Poisson
Another way to fit a multinomial/categorical model is to refactor it into a series of Poisson likelihoods, which is usually computationally easier. We will reuse the binomial example of UCB admissions (the binomial is a special case of the multinomial).
library(rethinking)
data(UCBadmit)
d <- UCBadmit
We will now use Poisson regression to model both the rate of admission and the rate of rejection, and compare the inference to the binomial model's probability of admission.
Both the binomial and Poisson models are constructed below:
# binomial model of overall admission probability
m_binom <- quap(
  alist(
    admit ~ dbinom(applications, p),
    logit(p) <- a,
    a ~ dnorm(0, 1.5)
  ), data = d
)

# Poisson model of overall admission rate and rejection rate
# 'reject' is a reserved keyword in Stan
dat <- list(admit = d$admit, rej = d$reject)
m_pois <- ulam(
  alist(
    admit ~ dpois(lambda1),
    rej ~ dpois(lambda2),
    log(lambda1) <- a1,
    log(lambda2) <- a2,
    c(a1, a2) ~ dnorm(0, 1.5)
  ), data = dat, chains = 3, cores = 3
)
We will consider only the posterior means for simplicity (although the entire posterior matters).
First, the inferred binomial probability of admission, across the entire data set, is:
inv_logit(coef(m_binom))
## a
## 0.3878045
In the Poisson model, the implied probability of admission is given by:
\[ p_{ADMIT} = \frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{\exp (a_1)}{\exp(a_1) + \exp(a_2)} \]
In code form:
k <- coef(m_pois)
a1 <- k['a1']
a2 <- k['a2']
exp(a1) / (exp(a1) + exp(a2))
## a1
## 0.3876592
We get the same inference as in the binomial model.