Mini-Languages for Statistical Models
Lesson 2 with Ian Carroll
Contents
- Objectives for this lesson
- R’s formula notation
- Linear models
- Generalized linear models
- Linear Mixed Models
- Where to now
- Solutions
Objectives for this lesson
- Learn about R functions and extensions for statistical modeling
- Understand the “formula” part of model specification
- Introduce increasingly complex “linear models”
- Pay no attention to these important topics:
- Experimental/Sampling design
- Model validation
- Hypothesis tests
- Model comparison
Specific achievements
- Write a
lm
formula with an interaction term - Use a non-gaussian family in the
glm
function - Add a “random effect” to a
lmer
formula
R’s formula notation
The basic function for fitting a regression in R is the lm()
function, standing for linear model.
wt_len <- weight ~ hindfoot_length
wt_len.fit <- lm(formula = wt_len,
data = animals)
The lm()
function uses the given formula and the data types of animals
to
compute the best fitting model in this family (i.e. determine the optimum
coefficients)
Formulas need Data
The implementation of lm()
is a specific solution to a general problem: how
can a human easily write down a statistical model that a machine can interpret,
relate to data, and optimize through a given fitting procedure?
The lm()
function requires both careful human input and correct metadata:
- Use a unquoted “formula” expression that mixes variable names and algebra-like operators, to define a design matrix.
- Check the variables used in the formula to complete the design matrix.
- Compute the linear least-squares estimator and more.
Note that “the details” are left to the lm()
function where possible. For
example, whether a variable is a factor or numeric is read from the data frame;
the formula does not distinguish between data types.
The Mini-language(s)
Across different packages in R, this formula “mini-language” has evolved to allow complex model specification and fitting. Each package adds syntax to the formula or new arguments to the fitting function. On a far-distant branch of this evolution is the Stan modeling language, which provides the greatest flexibility but demands in return the most descriptive language.
Linear models
The formula requires a response variable left of a ~
and any number of predictors to its right.
Formula | Description |
y ~ a |
constant and one predictor |
y ~ -1 + a |
one predictor with no constant |
y ~ a + b |
constant and two predictors |
y ~ a:b |
constant and one predictor, the interaction of a factor and another variable |
y ~ a*b |
constant and three predictors |
y ~ a*b - a |
constant and two predictors |
y ~ (a + b + ... )^n |
constant and all combinations of predictors up to order n |
In addition, certain functions are allowed within the formula definition.
animals <- read.csv('data/animals.csv',
na.strings = '')
fit <- lm(
hindfoot_length ~ log(weight),
data = animals)
summary(fit)
Call:
lm(formula = hindfoot_length ~ log(weight), data = animals)
Residuals:
Min 1Q Median 3Q Max
-26.573 -2.976 0.987 3.483 33.738
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.69213 0.14048 -61.88 <2e-16 ***
log(weight) 10.95644 0.03973 275.80 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.119 on 30736 degrees of freedom
(4811 observations deleted due to missingness)
Multiple R-squared: 0.7122, Adjusted R-squared: 0.7122
F-statistic: 7.607e+04 on 1 and 30736 DF, p-value: < 2.2e-16
Metadata matters
For the predictors in a linear model, there is a big difference between factors and numbers.
fit <- lm(
log(weight) ~ species_id,
data = animals)
summary(fit)
Call:
lm(formula = log(weight) ~ species_id, data = animals)
Residuals:
Min 1Q Median 3Q Max
-2.28157 -0.10063 0.02803 0.12574 1.48272
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.12857 0.03110 68.448 < 2e-16 ***
species_idDM 1.62159 0.03117 52.031 < 2e-16 ***
species_idDO 1.74519 0.03134 55.690 < 2e-16 ***
species_idDS 2.63791 0.03139 84.024 < 2e-16 ***
species_idNL 2.89645 0.03170 91.373 < 2e-16 ***
species_idOL 1.29724 0.03181 40.780 < 2e-16 ***
species_idOT 1.04031 0.03142 33.110 < 2e-16 ***
species_idOX 0.91176 0.09066 10.056 < 2e-16 ***
species_idPB 1.30426 0.03135 41.609 < 2e-16 ***
species_idPE 0.92374 0.03165 29.188 < 2e-16 ***
species_idPF -0.07717 0.03155 -2.446 0.014447 *
species_idPH 1.28769 0.04869 26.446 < 2e-16 ***
species_idPI 0.82629 0.08004 10.323 < 2e-16 ***
species_idPL 0.79433 0.04665 17.029 < 2e-16 ***
species_idPM 0.90396 0.03189 28.349 < 2e-16 ***
species_idPP 0.69278 0.03133 22.114 < 2e-16 ***
species_idPX 0.81448 0.15075 5.403 6.61e-08 ***
species_idRF 0.45257 0.03934 11.505 < 2e-16 ***
species_idRM 0.20908 0.03137 6.665 2.70e-11 ***
species_idRO 0.18447 0.08004 2.305 0.021191 *
species_idRX 0.56824 0.15075 3.769 0.000164 ***
species_idSF 1.87613 0.04504 41.656 < 2e-16 ***
species_idSH 2.10024 0.03572 58.803 < 2e-16 ***
species_idSO 1.80152 0.04504 40.000 < 2e-16 ***
species_idSS 2.32672 0.15075 15.434 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2086 on 32258 degrees of freedom
(3266 observations deleted due to missingness)
Multiple R-squared: 0.9216, Adjusted R-squared: 0.9215
F-statistic: 1.58e+04 on 24 and 32258 DF, p-value: < 2.2e-16
The difference between 1 and 24 degrees of freedom between the last two
models—with one fixed effect each—arises because species_id
is a factor
while weight is a numeric vector.
Exercise 1
Regress “hindfoot_length” against “species_id” and the log of “weight”. Does it appear that the Chihuahuan Desert’s common kangaroo rats (DM) have largish feet for their weight?
Generalized linear models
The lm
function treats the response variable as numeric—the glm
function lifts this restriction and others. Not through the formula
syntax, which is the same for calls to lm
and glm
, but through addition of the family
argument.
GLM families
The family
argument determines the family of probability distributions in which the response variable belongs. A key difference between families is the data type and range.
Family | Data Type | Default link |
---|---|---|
gaussian |
double |
identity, log, inverse |
binomial |
boolean |
logit, probit, cauchit, log, cloglog |
poisson |
integer |
log, identity, sqrt |
Gamma |
double |
inverse, identity, log |
inverse.gaussian |
double |
“1/mu^2”, inverse, identity, log |
The model fit in exercise 1 is (almost) identically produced with a GLM using the Gaussian family and identity link.
fit <- glm(log(weight) ~ species_id,
family = gaussian,
data = animals)
summary(fit)
Call:
glm(formula = log(weight) ~ species_id, family = gaussian, data = animals)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.28157 -0.10063 0.02803 0.12574 1.48272
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.12857 0.03110 68.448 < 2e-16 ***
species_idDM 1.62159 0.03117 52.031 < 2e-16 ***
species_idDO 1.74519 0.03134 55.690 < 2e-16 ***
species_idDS 2.63791 0.03139 84.024 < 2e-16 ***
species_idNL 2.89645 0.03170 91.373 < 2e-16 ***
species_idOL 1.29724 0.03181 40.780 < 2e-16 ***
species_idOT 1.04031 0.03142 33.110 < 2e-16 ***
species_idOX 0.91176 0.09066 10.056 < 2e-16 ***
species_idPB 1.30426 0.03135 41.609 < 2e-16 ***
species_idPE 0.92374 0.03165 29.188 < 2e-16 ***
species_idPF -0.07717 0.03155 -2.446 0.014447 *
species_idPH 1.28769 0.04869 26.446 < 2e-16 ***
species_idPI 0.82629 0.08004 10.323 < 2e-16 ***
species_idPL 0.79433 0.04665 17.029 < 2e-16 ***
species_idPM 0.90396 0.03189 28.349 < 2e-16 ***
species_idPP 0.69278 0.03133 22.114 < 2e-16 ***
species_idPX 0.81448 0.15075 5.403 6.61e-08 ***
species_idRF 0.45257 0.03934 11.505 < 2e-16 ***
species_idRM 0.20908 0.03137 6.665 2.70e-11 ***
species_idRO 0.18447 0.08004 2.305 0.021191 *
species_idRX 0.56824 0.15075 3.769 0.000164 ***
species_idSF 1.87613 0.04504 41.656 < 2e-16 ***
species_idSH 2.10024 0.03572 58.803 < 2e-16 ***
species_idSO 1.80152 0.04504 40.000 < 2e-16 ***
species_idSS 2.32672 0.15075 15.434 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.04351748)
Null deviance: 17905.2 on 32282 degrees of freedom
Residual deviance: 1403.8 on 32258 degrees of freedom
(3266 observations deleted due to missingness)
AIC: -9551.9
Number of Fisher Scoring iterations: 2
- Question
- Shouldn’t the coeficient estimates between this
glm()
and the previouslm()
be different? - Answer
- True, the
lm()
function uses a different (less general) fitting procedure than theglm()
function, which uses IWLS. But on 64-bit machines, it’s rare to encounter an actual difference.
Exercise 2
Weight is actually a positive integer in this dataset. Fit the log of weight
against species ID using lm()
, and fit raw weight against species ID using
glm()
with the Poisson family. According the the anova()
table, are both
models plausible? Hint: you may need to provide additional arguments to
anova()
.
Learn to link
Correct use of the link
argument to the family requires in-depth knowledge
about generalized linear models—not our objective here. A common mistake to
avoid, however, is assuming that glm
applies the transfromation given as
link
to the response variable.
## THIS IS PROBABLY NOT WHAT YOU WANT!
fit <- glm(weight ~ hindfoot_length,
family = gaussian(link = `log`),
data = animals)
Logistic regression
Calling glm
with familly = binomial
using the default “logit” link performs logistic regression.
animals$sex <- factor(animals$sex)
fit <- glm(sex ~ hindfoot_length,
family = binomial,
data = animals)
summary(fit)
Call:
glm(formula = sex ~ hindfoot_length, family = binomial, data = animals)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.366 -1.207 1.057 1.124 1.246
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.179215 0.036495 -4.911 9.08e-07 ***
hindfoot_length 0.009571 0.001186 8.068 7.17e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43408 on 31369 degrees of freedom
Residual deviance: 43343 on 31368 degrees of freedom
(4179 observations deleted due to missingness)
AIC: 43347
Number of Fisher Scoring iterations: 3
Observation weights
Both the lm()
and glm()
function allow a vector of weights the same length
as the response. Weights can be necessary for logistic regression, depending on
the format of the data. The binomial
family glm()
works with three different
response variable formats.
factor
with two levels (binary)matrix
of typeinteger
with two columns for the count of “successes” and “failures”.numeric
between 0 and 1, specifying the proprtion of “successes” out ofweights
trials.
Depending on your predictors, these three formats are not interchangeable.
Exercise 3
You are standing in the Chihuahuan desert, when a pocket mouse (genus
Perognathus) suddenly runs up your pant leg. It weighs down your pocket quite
a bit, relative to your many similar pocket mouse experiences. Run a binomial
family GLM on the two Perognathus species in the animals table that may help you
predict to which species it belongs. Hint: Begin by mutating species_id
into a
factor()
with levels = c("PF", "PH")
.
Linear Mixed Models
The lme4 package expands the formula “mini-language” to allow descriptions of “random effects”.
- normal predictors are “fixed effects”
(...|...)
expressions describe “random effects”
In the context of this package, variables
added to the right of the ~
in the usual way are “fixed effects”—they
consume a well-defined number of degrees of freedom. Variables added within
(...|...)
are “random effects”.
The “random intercepts” and “random slopes” models are the two most common extensions to a formula with one variable.
Formula | Description |
---|---|
y ~ a |
constant and one fixed effect |
y ~ (1 | b) + a |
random intercept for each level in b and one fixed effect |
y ~ a + (a | b) |
random intercept and slope w.r.t. a for each level in b |
Random intercept
The lmer
and glmer
functions fit linear and generalized linear models with
the lme4
formula syntax.
library(lme4)
fit <- lmer(
hindfoot_length ~ (1|species_id) + log(weight),
data = animals)
summary(fit)
Linear mixed model fit by REML ['lmerMod']
Formula: hindfoot_length ~ (1 | species_id) + log(weight)
Data: animals
REML criterion at convergence: 107596.2
Scaled residuals:
Min 1Q Median 3Q Max
-17.3183 -0.5004 0.0254 0.5731 21.4700
Random effects:
Groups Name Variance Std.Dev.
species_id (Intercept) 47.377 6.883
Residual 1.927 1.388
Number of obs: 30738, groups: species_id, 24
Fixed effects:
Estimate Std. Error t value
(Intercept) 16.77000 1.41229 11.87
log(weight) 2.13065 0.03799 56.09
Correlation of Fixed Effects:
(Intr)
log(weight) -0.087
The familiar assessment of model residuals is absent from the summary due to the lack of a widely accepted measure of null and residual deviance. The notions of model saturation, degrees of freedom, and independence of observations have all crossed onto thin ice.
Non-independence
Models with random effects should be understood as specifying multiple, overlapping probability statements about the observations.
In a lm
or glm
fit, each response is conditionally independent, given it’s predictors and the model coefficients. Each observation corresponds to it’s own probability statement. In a model with random effects, each response is no longer conditionally independent, given it’s predictors and model coefficients.
Exercise 4
Write down the formula for a random intercepts model with a fixed effect of sex and a random effect of plot on each animal’s weight.
Random slope
Adding a numeric variable on the left of a grouping specified with (...|...)
produces a “random slope” model. Here, separate coefficients for hindfoot_length are allowed for each species.
fit <- lmer(
hindfoot_length ~
log(weight) + (log(weight) | species_id),
data = animals)
summary(fit)
Linear mixed model fit by REML ['lmerMod']
Formula: hindfoot_length ~ log(weight) + (log(weight) | species_id)
Data: animals
REML criterion at convergence: 107030.8
Scaled residuals:
Min 1Q Median 3Q Max
-17.4953 -0.4753 0.0349 0.5259 21.6529
Random effects:
Groups Name Variance Std.Dev. Corr
species_id (Intercept) 19.7197 4.4407
log(weight) 0.6731 0.8205 0.76
Residual 1.8905 1.3750
Number of obs: 30738, groups: species_id, 24
Fixed effects:
Estimate Std. Error t value
(Intercept) 17.7170 0.9367 18.914
log(weight) 1.6920 0.1840 9.196
Correlation of Fixed Effects:
(Intr)
log(weight) 0.573
Generalized Mixed Models
The glmer
function merely adds to lmer
the option to specify an exponential
family distribution for the response variable.
Where to now
- Don’t start with non-linear mixed models! Work your way up through
lm()
glm()
lmer()
glmer()
- Take time to understand the whole
summary()
, the hazards of secondary data compel you. The vignettes for lme4 provide excellent documentation. - When all else fails, ask on Cross Validated (and Ben Bolker very well might answer).
Solutions
Solution 1
fit <- lm(
hindfoot_length ~ species_id + log(weight),
data = animals)
The estimated coefficient for “species_idDM” and associated “***” suggest “yes”.
summary(fit)
Linear mixed model fit by REML ['lmerMod']
Formula: hindfoot_length ~ log(weight) + (log(weight) | species_id)
Data: animals
REML criterion at convergence: 107030.8
Scaled residuals:
Min 1Q Median 3Q Max
-17.4953 -0.4753 0.0349 0.5259 21.6529
Random effects:
Groups Name Variance Std.Dev. Corr
species_id (Intercept) 19.7197 4.4407
log(weight) 0.6731 0.8205 0.76
Residual 1.8905 1.3750
Number of obs: 30738, groups: species_id, 24
Fixed effects:
Estimate Std. Error t value
(Intercept) 17.7170 0.9367 18.914
log(weight) 1.6920 0.1840 9.196
Correlation of Fixed Effects:
(Intr)
log(weight) 0.573
Solution 2
fit_lm <- lm(log(weight) ~ species_id,
data = animals)
fit_glm <- glm(weight ~ species_id,
family = poisson,
data = animals)
anova(fit_lm)
Analysis of Variance Table
Response: log(weight)
Df Sum Sq Mean Sq F value Pr(>F)
species_id 24 16501.4 687.56 15800 < 2.2e-16 ***
Residuals 32258 1403.8 0.04
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit_glm, test = 'Chisq')
Analysis of Deviance Table
Model: poisson, link: log
Response: weight
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 32282 775950
species_id 24 718546 32258 57404 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Solution 3
perognathus <- animals
perognathus$species_id <- factor(
perognathus$species_id, levels = c('PF', 'PH'))
fit <- glm(species_id ~ weight,
family = binomial,
data = perognathus)
The negative intercept is due to the much higher frequency of “failures” (the first level or P. flavus). The positive effect of weight means you’ve probably got a P hispidus in your pocket.
summary(fit)
Call:
lm(formula = hindfoot_length ~ species_id + log(weight), data = animals)
Residuals:
Min 1Q Median 3Q Max
-24.0407 -0.6951 0.0352 0.7954 29.8038
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.4710 0.2222 38.127 < 2e-16 ***
species_idDM 19.5411 0.2164 90.316 < 2e-16 ***
species_idDO 18.8751 0.2189 86.234 < 2e-16 ***
species_idDS 31.3797 0.2320 135.262 < 2e-16 ***
species_idNL 13.0945 0.2382 54.967 < 2e-16 ***
species_idOL 4.7893 0.2176 22.011 < 2e-16 ***
species_idOT 5.0538 0.2129 23.743 < 2e-16 ***
species_idOX 5.4411 0.6553 8.303 < 2e-16 ***
species_idPB 10.3330 0.2144 48.196 < 2e-16 ***
species_idPE 5.2348 0.2137 24.499 < 2e-16 ***
species_idPF 2.7361 0.2101 13.023 < 2e-16 ***
species_idPH 10.0344 0.3277 30.622 < 2e-16 ***
species_idPI 7.3669 0.5336 13.807 < 2e-16 ***
species_idPL 5.3377 0.3119 17.115 < 2e-16 ***
species_idPM 5.5085 0.2152 25.601 < 2e-16 ***
species_idPP 7.2761 0.2102 34.623 < 2e-16 ***
species_idPX 4.7670 1.0036 4.750 2.05e-06 ***
species_idRF 3.5411 0.2637 13.430 < 2e-16 ***
species_idRM 2.9976 0.2090 14.343 < 2e-16 ***
species_idRO 1.9825 0.5327 3.722 0.000198 ***
species_idRX 4.2909 1.0034 4.276 1.90e-05 ***
species_idSF 9.7021 0.3100 31.298 < 2e-16 ***
species_idSH 11.1301 0.2535 43.909 < 2e-16 ***
species_idSO 8.7729 0.3093 28.364 < 2e-16 ***
log(weight) 2.1277 0.0380 55.998 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.388 on 30713 degrees of freedom
(4811 observations deleted due to missingness)
Multiple R-squared: 0.9789, Adjusted R-squared: 0.9788
F-statistic: 5.924e+04 on 24 and 30713 DF, p-value: < 2.2e-16
Solution 4
weight ~ (1 | plot_id) + sex