Mini-Languages for Statistical Models
Lesson 6 with Ian Carroll
Contents
- Objectives for this lesson
- R’s formula notation
- Linear models
- Generalized linear models
- Linear Mixed Models
- Meet Stan
Objectives for this lesson
- Learn about R functions and extensions for statistical modeling
- Understand the “formula” part of model specification
- Meet Stan, a programming language for probability statements
- Pay no attention to these important topics:
- Experimental/Sampling design
- Model validation
- Hypothesis tests
- Model comparison
Specific achievements
- Write a
lm
formula with an interaction term - Use a non-gaussian family in the
glm
function - Add a “random effect” to a
lmer
formula - Understand a fully customizeable Stan formula
R’s formula notation
The basic function for fitting a regression in R is the lm
function, standing for linear model.
wt_len <- weight ~ hindfoot_length
wt_len.fit <- lm(formula = wt_len, data = animals)
The lm
function uses the given formula and the data types of portal
to compute the best fitting model in this family (i.e. determine the optimum coefficients)
Mini-language
The implementation of lm
is a specific solution to a general problem: how can a human easily write down a statistical model that a machine can interpret, relate to data, and optimize through a given fitting procedure?
For the lm
function, the answers are:
- Use a unquoted “formula” expression, mixing variable names and algebra-like operators, that maps to a design matrix.
- Check the variables used in the formula to complete the design matrix.
- Compute the linear least-squares estimator and more.
Note that “the details” are left to the lm
function where possible. For example, whether a variable is a factor or numeric is read from the data frame; the formula does not distinguish between data types.
Mini-language
Across different packages in R, this formula “mini-language” has evolved to allow complex model specification. Each one adds syntax to the formula or new arguments to the fitting function. On a far-distant branch of this evolution is the Stan modeling language, which provides the greatest flexibility but demands in return the most descriptive language.
Linear models
The formula requires a response variable left of a “~” and any number of predictors to its right.
Formula | Equivalent | Description |
y ~ a |
y ~ 1 + a |
constant and one predictor |
y ~ -1 + a |
y ~ 0 + a |
one predictor with no constant |
y ~ a + b |
y ~ 1 + a + b |
constant and two predictors |
y ~ a:b |
constant and one predictor, the interaction of (at least) one factor and another variable | |
y ~ a*b |
y ~ 1 + a + b + a:b |
constant and three predictors |
y ~ a*b - a |
y ~ 1 + b + a:b |
constant and two predictors |
y ~ (a + b + ... )^n |
constant and all combinations of predictors up to order n |
Linear model
In addition, certain functions are allowed within the formula definition.
animals <- read.csv('data/animals.csv', stringsAsFactors = FALSE, na.strings = '')
fit <- lm(
log(weight) ~ hindfoot_length,
data = animals)
summary(fit)
Linear mixed model fit by REML ['lmerMod']
Formula: log(weight) ~ (1 | species_id) + hindfoot_length
Data: animals
REML criterion at convergence: -11966.7
Scaled residuals:
Min 1Q Median 3Q Max
-11.7117 -0.4802 0.1040 0.6123 7.4102
Random effects:
Groups Name Variance Std.Dev.
species_id (Intercept) 0.25479 0.5048
Residual 0.03943 0.1986
Number of obs: 30738, groups: species_id, 24
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.2138725 0.1051703 21.05
hindfoot_length 0.0436769 0.0007761 56.28
Correlation of Fixed Effects:
(Intr)
hndft_lngth -0.175
Exercise 1
Regress hindfoot_length against weight and species_id. Does it appear that the Chihuahuan Desert’s common kangaroo rat (DM) have inordinately large feet for their weight?
Pay attention to factors
Data type matters in statistical modelling. For the predictors in a linear model, the most important distinction is discrete versus continuous.
animals$species_id <- factor(animals$species_id)
fit <- lm(
log(weight) ~ species_id,
data = animals)
summary(fit)
Call:
lm(formula = log(weight) ~ species_id, data = animals)
Residuals:
Min 1Q Median 3Q Max
-2.28157 -0.10063 0.02803 0.12574 1.48272
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.12857 0.03110 68.448 < 2e-16 ***
species_idDM 1.62159 0.03117 52.031 < 2e-16 ***
species_idDO 1.74519 0.03134 55.690 < 2e-16 ***
species_idDS 2.63791 0.03139 84.024 < 2e-16 ***
species_idNL 2.89645 0.03170 91.373 < 2e-16 ***
species_idOL 1.29724 0.03181 40.780 < 2e-16 ***
species_idOT 1.04031 0.03142 33.110 < 2e-16 ***
species_idOX 0.91176 0.09066 10.056 < 2e-16 ***
species_idPB 1.30426 0.03135 41.609 < 2e-16 ***
species_idPE 0.92374 0.03165 29.188 < 2e-16 ***
species_idPF -0.07717 0.03155 -2.446 0.014447 *
species_idPH 1.28769 0.04869 26.446 < 2e-16 ***
species_idPI 0.82629 0.08004 10.323 < 2e-16 ***
species_idPL 0.79433 0.04665 17.029 < 2e-16 ***
species_idPM 0.90396 0.03189 28.349 < 2e-16 ***
species_idPP 0.69278 0.03133 22.114 < 2e-16 ***
species_idPX 0.81448 0.15075 5.403 6.61e-08 ***
species_idRF 0.45257 0.03934 11.505 < 2e-16 ***
species_idRM 0.20908 0.03137 6.665 2.70e-11 ***
species_idRO 0.18447 0.08004 2.305 0.021191 *
species_idRX 0.56824 0.15075 3.769 0.000164 ***
species_idSF 1.87613 0.04504 41.656 < 2e-16 ***
species_idSH 2.10024 0.03572 58.803 < 2e-16 ***
species_idSO 1.80152 0.04504 40.000 < 2e-16 ***
species_idSS 2.32672 0.15075 15.434 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2086 on 32258 degrees of freedom
(3266 observations deleted due to missingness)
Multiple R-squared: 0.9216, Adjusted R-squared: 0.9215
F-statistic: 1.58e+04 on 24 and 32258 DF, p-value: < 2.2e-16
The difference between 1 and 24 degrees of freedom between the last two models—with one fixed effect each—arises from the discreteness of species_id
.
Generalized linear models
The lm
function treats the response variable as numeric—the glm
function lifts this restriction and others. Not through the formula
syntax, which is the same for calls to lm
and glm
, but through addition of the family
argument.
GLM families
The family
argument determines the family of probability distributions in which the response variable belongs. A key difference between families is the data type and range.
Family | Support | Default link |
---|---|---|
gaussian |
numeric |
identity, log, inverse |
binomial |
2-level factor |
logit, probit, cauchit, log, cloglog |
poisson |
non-negative integer |
log, identity, sqrt |
Gamma |
positive numeric |
inverse, identity, log |
inverse.gaussian |
positive numeric |
“1/mu^2”, inverse, identity, log |
Exercise 2
The default value for family
is gaussian(link = 'identity')
. Replace lm
with glm
(changing nothing else), to again fit the formula log(weight) ~ species_id
. Compare the summary()
between lm()
and glm()
, and identify something that is the same and something that is different in the output.
Link with care
Correct use of the link
argument to the family requires in-depth knowledge about generallize linear models—not our objective here. A common mistake to avoid, however, is assuming that glm
applies the transfromation given as link
to the response variable.
fit <- glm(weight ~ hindfoot_length,
family = gaussian(link = `log`),
data = animals)
Logistic regression
Calling glm
with familly = binomial
using the default “logit” link performs logistic regression.
animals$sex <- factor(animals$sex)
fit <- glm(sex ~ hindfoot_length,
family = binomial,
data = animals)
summary(fit)
Call:
glm(formula = sex ~ hindfoot_length, family = binomial, data = animals)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.366 -1.207 1.057 1.124 1.246
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.179215 0.036495 -4.911 9.08e-07 ***
hindfoot_length 0.009571 0.001186 8.068 7.17e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43408 on 31369 degrees of freedom
Residual deviance: 43343 on 31368 degrees of freedom
(4179 observations deleted due to missingness)
AIC: 43347
Number of Fisher Scoring iterations: 3
Weight
Both the lm
and glm
function allow a vector of weights
the same length as the response. Weights can be necessary for logistic regression, depending on the format of the data. The binomial
family glm
works with three different response variable formats.
factor
with two levels, as shown in the table abovematrix
of typeinteger
with two columns for the count of “successes” and “failures”.numeric
between 0 and 1, specifying the proprtion of “successes” out ofweights
trials.
Exercise 3
You are standing in the Chihuahuan desert, when a pocket mouse (genus Perognathus) suddenly runs up your pant leg. It doesn’t weigh your pocket down much, relative to your many similar pocket mouse experiences. Run a binomial family GLM on the two Perognathus species in the Portal data that may help you predict to which species it belongs. Hint: The second level()
of a factor is the “success” in a binomial GLM.
Linear Mixed Models
The lme4 package expands the formula “mini-language” to allow descriptions of “random effects”. In the context of this package, variables added to the right of the ~
in the usual way are “fixed effects”—they consume a well-defined number of degrees of freedom. Variables added within (...|...)
are “random effects”.
The “variable intercepts” and “random slopes” classes of model are the two most common extensions to a formula with one variable.
Formula | Description |
---|---|
y ~ a |
constant and one fixed effect |
y ~ (1 | b) + a |
random intercept for each level in b and one fixed effect |
y ~ (1 + a | b) |
random intercepts and “slopes” w.r.t. a for each level in b |
Random intercept
The lmer
and glmer
functions fit linear and generalized linear models with the lme4
formula syntax.
library(lme4)
fit <- lmer(log(weight) ~ (1 | species_id) + hindfoot_length,
data = animals)
summary(fit)
Linear mixed model fit by REML ['lmerMod']
Formula: log(weight) ~ (1 | species_id) + hindfoot_length
Data: animals
REML criterion at convergence: -11966.7
Scaled residuals:
Min 1Q Median 3Q Max
-11.7117 -0.4802 0.1040 0.6123 7.4102
Random effects:
Groups Name Variance Std.Dev.
species_id (Intercept) 0.25479 0.5048
Residual 0.03943 0.1986
Number of obs: 30738, groups: species_id, 24
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.2138725 0.1051703 21.05
hindfoot_length 0.0436769 0.0007761 56.28
Correlation of Fixed Effects:
(Intr)
hndft_lngth -0.175
The familiar assessment of model residuals is absent from the summary due to the lack of a widely accepted measure of null and residual deviance. The notions of model saturation, degrees of freedom, and independence of observations have all crossed onto thin ice.
Lack of independence
Models with random effects should be understood as specifying multiple, overlapping probability statements about the observations.
In a lm
or glm
fit, each response is conditionally independent, given it’s predictors and the model coefficients. Each observation corresponds to it’s own probability statement. In a model with random effects, each response is no longer conditionally independent, given it’s predictors and model coefficients.
Exercise 4
Adjust the formula log(weight) ~ hindfoot_length
to fit a “random intercepts” model, grouping by a different categorical variable (i.e. not species_id) in the animals
data frame.
Random slopes
Adding a numeric variable after the constant within a grouping specified by (...|...)
produces a “random slope” model. Here, separate coefficients for hindfoot_length are allowed for each species.
fit <- lmer(log(weight) ~ (1 + hindfoot_length | species_id),
data = animals)
summary(fit)
Call:
lm(formula = log(weight) ~ species_id, data = animals)
Residuals:
Min 1Q Median 3Q Max
-2.28157 -0.10063 0.02803 0.12574 1.48272
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.12857 0.03110 68.448 < 2e-16 ***
species_idDM 1.62159 0.03117 52.031 < 2e-16 ***
species_idDO 1.74519 0.03134 55.690 < 2e-16 ***
species_idDS 2.63791 0.03139 84.024 < 2e-16 ***
species_idNL 2.89645 0.03170 91.373 < 2e-16 ***
species_idOL 1.29724 0.03181 40.780 < 2e-16 ***
species_idOT 1.04031 0.03142 33.110 < 2e-16 ***
species_idOX 0.91176 0.09066 10.056 < 2e-16 ***
species_idPB 1.30426 0.03135 41.609 < 2e-16 ***
species_idPE 0.92374 0.03165 29.188 < 2e-16 ***
species_idPF -0.07717 0.03155 -2.446 0.014447 *
species_idPH 1.28769 0.04869 26.446 < 2e-16 ***
species_idPI 0.82629 0.08004 10.323 < 2e-16 ***
species_idPL 0.79433 0.04665 17.029 < 2e-16 ***
species_idPM 0.90396 0.03189 28.349 < 2e-16 ***
species_idPP 0.69278 0.03133 22.114 < 2e-16 ***
species_idPX 0.81448 0.15075 5.403 6.61e-08 ***
species_idRF 0.45257 0.03934 11.505 < 2e-16 ***
species_idRM 0.20908 0.03137 6.665 2.70e-11 ***
species_idRO 0.18447 0.08004 2.305 0.021191 *
species_idRX 0.56824 0.15075 3.769 0.000164 ***
species_idSF 1.87613 0.04504 41.656 < 2e-16 ***
species_idSH 2.10024 0.03572 58.803 < 2e-16 ***
species_idSO 1.80152 0.04504 40.000 < 2e-16 ***
species_idSS 2.32672 0.15075 15.434 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2086 on 32258 degrees of freedom
(3266 observations deleted due to missingness)
Multiple R-squared: 0.9216, Adjusted R-squared: 0.9215
F-statistic: 1.58e+04 on 24 and 32258 DF, p-value: < 2.2e-16
Allowing each animal to use a constant and hindfoot_length coefficient shared only by other members of its species
is a lot like fitting separate regressions for each species. Actual utility arises when combining this random slope predictor with other predictors not grouped by species.
Generalized Mixed Models
The glmer
function merely adds to lmer
the option to specify an exponential family distribution for the response variable.
Meet Stan
Stan is the name of a program that performs an entirely different kind of model fitting, and the Stan modelling language is yet another way of writing a “formula”. Stan is similar to JAGS and BUGS: it is a “sampler”, but does not use the Gibbs algorithm.
Sampling vs. Fitting
The result of model fitting through lm
and its descendents in R is an estimated coefficient coupled with the uncertainty in that estimate.
The results produced by Stan are tables of numbers with a column for each coefficient in the model. Each row represents an equally likely set of coefficients for the model, as judged by its resulting fit to the data.
An example
The formulas we use previously to describe a “random intercepts” model are directly included in a Stan formula.
data {
int N;
vector[N] log_weight;
vector[N] hindfoot_length;
int species_id[N];
int M;
}
parameters {
real beta0;
vector[M] beta1;
real beta2;
real<lower=0> sigma;
real<lower=0> sigma_genus;
}
transformed parameters {
vector[N] mu;
for (i in 1:N)
mu = beta0 + beta1[species_id]+ hindfoot_length * beta2;
}
model {
log_weight ~ normal(mu, sigma);
beta0 ~ normal(0, 10);
beta1 ~ normal(0, sigma_genus);
beta2 ~ normal(0, 10);
sigma_genus ~ cauchy(0, 5);
}
RStan
The rstan package is a wrapper for sending data to the Stan program and getting back results.
library(rstan)
stanimals <- animals %>%
select(weight, species_id, hindfoot_length) %>%
na.omit() %>%
mutate(log_weight = log(weight),
species_idx = as.integer(factor(species_id))) %>%
select(-weight, -species_id)
stanimals <- c(
N = nrow(stanimals),
M = max(stanimals$species_idx),
as.list(stanimals))
samp <- stan(file = 'worksheet-6.stan',
data = stanimals,
iter = 1000, chains = 3)
saveRDS(samp, 'stanimals.RDS')
Parameter “samples”
An example samp
output, saved to “RDS” after a long-running Stan job, is available in the handout’s data folder. Each row represents an equally likely set of coefficients for the model, as judged by its resulting fit to the data.
Summary
- Several “formula” languages co-exist in R packages
- Linear models and generalized linear models are similar
- Mixed effects models use grouping to add “random effects”
-
“Generalized” modeling functiongs add a
family
specification. - Sampling with Stan lets you specify any structure and use any probability distribution. You pay a great cost in speed and difficulty.