Regression
Lesson 6 with Ian Carroll
Lesson Objectives
- Learn about R functions and extensions for statistical modeling
- Understand the “formula” part of model specification
- Introduce increasingly complex “linear models”
- Meet Stan, the MCMC sampler
Lesson Non-objectives
(Pay no attention to these very important topics.)
- Experimental/sampling design
- Model validation
- Hypothesis tests
- Model comparison
Specific Achievements
- Write a
lm
formula with an interaction term - Use a non-gaussian family in the
glm
function - Add a “random effect” to a
lmer
formula - Sample parameter values for a GLM with rstan
Formula Notation
The basic function for fitting a regression in R is the lm()
function,
standing for linear model.
> wt_len <- weight ~ hindfoot_length
> wt_len.fit <- lm(formula = wt_len,
+ data = animals)
The lm()
function takes a formula argument and a data argument, and computes
the best fitting linear model (i.e. determine the optimum coefficients via
least-squared-error).
Formulas + Data
The implementation of lm()
is a specific solution to a general problem: how
can a human easily write down a statistical model that a machine can interpret,
relate to data, and optimize through a given fitting procedure?
The lm()
function requires both careful human input and correct metadata:
- Use a unquoted “formula” expression that mixes variable names and algebra-like operators, to define a design matrix.
- Check the variables used in the formula to complete the design matrix.
- Compute the linear least-squares estimator and more.
Note that “the details” are left to the lm()
function where possible. For
example, whether a variable is a factor or numeric is read from the data frame;
the formula does not distinguish between data types.
Formula Mini-language(s)
- “base R” function
lm()
- “base R” function
glm()
- lme4 function
lmer()
- lme4 function
glmer()
- rstan models for Stan
Across different packages in R, the formula “mini-language” has evolved to allow complex model specification and fitting. Each package adds syntax to the formula or new arguments to the fitting function. On a far-distant branch of this evolution is the Stan modeling language, which provides the greatest flexibility but demands in return the most descriptive language.
Linear Models
The formula requires a response variable left of a ~
and any number of
predictors to its right.
Formula | Description |
y ~ a |
constant and one predictor |
y ~ -1 + a |
one predictor with no constant |
y ~ a + b |
constant and two predictors |
y ~ a:b |
constant and one predictor, the interaction of a factor and another variable |
y ~ a*b |
constant and three predictors |
y ~ a*b - a |
constant and two predictors |
y ~ (a + b + ... )^n |
constant and all combinations of predictors up to order n |
In addition, certain functions are allowed within the formula definition.
animals <- read.csv('data/animals.csv',
na.strings = '')
fit <- lm(
hindfoot_length ~ log(weight),
data = animals)
> summary(fit)
Call:
lm(formula = hindfoot_length ~ log(weight), data = animals)
Residuals:
Min 1Q Median 3Q Max
-26.573 -2.976 0.987 3.483 33.738
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.69213 0.14048 -61.88 <2e-16 ***
log(weight) 10.95644 0.03973 275.80 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.119 on 30736 degrees of freedom
(4811 observations deleted due to missingness)
Multiple R-squared: 0.7122, Adjusted R-squared: 0.7122
F-statistic: 7.607e+04 on 1 and 30736 DF, p-value: < 2.2e-16
Metadata Matters
For the predictors in a linear model, there is a big difference between factors and numbers.
fit <- lm(
log(weight) ~ species_id,
data = animals)
> summary(fit)
Call:
lm(formula = log(weight) ~ species_id, data = animals)
Residuals:
Min 1Q Median 3Q Max
-2.28157 -0.10063 0.02803 0.12574 1.48272
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.12857 0.03110 68.448 < 2e-16 ***
species_idDM 1.62159 0.03117 52.031 < 2e-16 ***
species_idDO 1.74519 0.03134 55.690 < 2e-16 ***
species_idDS 2.63791 0.03139 84.024 < 2e-16 ***
species_idNL 2.89645 0.03170 91.373 < 2e-16 ***
species_idOL 1.29724 0.03181 40.780 < 2e-16 ***
species_idOT 1.04031 0.03142 33.110 < 2e-16 ***
species_idOX 0.91176 0.09066 10.056 < 2e-16 ***
species_idPB 1.30426 0.03135 41.609 < 2e-16 ***
species_idPE 0.92374 0.03165 29.188 < 2e-16 ***
species_idPF -0.07717 0.03155 -2.446 0.014447 *
species_idPH 1.28769 0.04869 26.446 < 2e-16 ***
species_idPI 0.82629 0.08004 10.323 < 2e-16 ***
species_idPL 0.79433 0.04665 17.029 < 2e-16 ***
species_idPM 0.90396 0.03189 28.349 < 2e-16 ***
species_idPP 0.69278 0.03133 22.114 < 2e-16 ***
species_idPX 0.81448 0.15075 5.403 6.61e-08 ***
species_idRF 0.45257 0.03934 11.505 < 2e-16 ***
species_idRM 0.20908 0.03137 6.665 2.70e-11 ***
species_idRO 0.18447 0.08004 2.305 0.021191 *
species_idRX 0.56824 0.15075 3.769 0.000164 ***
species_idSF 1.87613 0.04504 41.656 < 2e-16 ***
species_idSH 2.10024 0.03572 58.803 < 2e-16 ***
species_idSO 1.80152 0.04504 40.000 < 2e-16 ***
species_idSS 2.32672 0.15075 15.434 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2086 on 32258 degrees of freedom
(3266 observations deleted due to missingness)
Multiple R-squared: 0.9216, Adjusted R-squared: 0.9215
F-statistic: 1.58e+04 on 24 and 32258 DF, p-value: < 2.2e-16
The difference between 1 and 24 degrees of freedom between the last two
models—with one fixed effect each—arises because species_id
is a factor
while weight is a numeric vector.
Generalized Linear Models
The lm
function treats the response variable as numeric—the glm
function lifts this restriction and others. Not through the formula
syntax, which is the same for calls to lm
and glm
, but through addition of the family
argument.
GLM Families
The family
argument determines the family of probability distributions in which the response variable belongs. A key difference between families is the data type and range.
Family | Data Type | Default link |
---|---|---|
gaussian |
double |
identity, log, inverse |
binomial |
boolean |
logit, probit, cauchit, log, cloglog |
poisson |
integer |
log, identity, sqrt |
Gamma |
double |
inverse, identity, log |
inverse.gaussian |
double |
“1/mu^2”, inverse, identity, log |
The model fit in exercise 1 is (almost) identically produced with a GLM using the Gaussian family and identity link.
fit <- glm(log(weight) ~ species_id,
family = gaussian,
data = animals)
> summary(fit)
Call:
glm(formula = log(weight) ~ species_id, family = gaussian, data = animals)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.28157 -0.10063 0.02803 0.12574 1.48272
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.12857 0.03110 68.448 < 2e-16 ***
species_idDM 1.62159 0.03117 52.031 < 2e-16 ***
species_idDO 1.74519 0.03134 55.690 < 2e-16 ***
species_idDS 2.63791 0.03139 84.024 < 2e-16 ***
species_idNL 2.89645 0.03170 91.373 < 2e-16 ***
species_idOL 1.29724 0.03181 40.780 < 2e-16 ***
species_idOT 1.04031 0.03142 33.110 < 2e-16 ***
species_idOX 0.91176 0.09066 10.056 < 2e-16 ***
species_idPB 1.30426 0.03135 41.609 < 2e-16 ***
species_idPE 0.92374 0.03165 29.188 < 2e-16 ***
species_idPF -0.07717 0.03155 -2.446 0.014447 *
species_idPH 1.28769 0.04869 26.446 < 2e-16 ***
species_idPI 0.82629 0.08004 10.323 < 2e-16 ***
species_idPL 0.79433 0.04665 17.029 < 2e-16 ***
species_idPM 0.90396 0.03189 28.349 < 2e-16 ***
species_idPP 0.69278 0.03133 22.114 < 2e-16 ***
species_idPX 0.81448 0.15075 5.403 6.61e-08 ***
species_idRF 0.45257 0.03934 11.505 < 2e-16 ***
species_idRM 0.20908 0.03137 6.665 2.70e-11 ***
species_idRO 0.18447 0.08004 2.305 0.021191 *
species_idRX 0.56824 0.15075 3.769 0.000164 ***
species_idSF 1.87613 0.04504 41.656 < 2e-16 ***
species_idSH 2.10024 0.03572 58.803 < 2e-16 ***
species_idSO 1.80152 0.04504 40.000 < 2e-16 ***
species_idSS 2.32672 0.15075 15.434 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.04351748)
Null deviance: 17905.2 on 32282 degrees of freedom
Residual deviance: 1403.8 on 32258 degrees of freedom
(3266 observations deleted due to missingness)
AIC: -9551.9
Number of Fisher Scoring iterations: 2
- Question
- Should the coeficient estimates between this Gaussian
glm()
and the previouslm()
be different? - Answer
- It’s possible. The
lm()
function uses a different (less general) fitting procedure than theglm()
function, which uses IWLS. But with 64 bits used to store very precise numbers, it’s rare to encounter an noticeable difference.
Logistic Regression
Calling glm
with familly = binomial
using the default “logit” link performs
logistic regression.
animals$sex <- factor(animals$sex)
fit <- glm(sex ~ log(hindfoot_length),
family = binomial,
data = animals)
> summary(fit)
Call:
glm(formula = sex ~ log(hindfoot_length), family = binomial,
data = animals)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.313 -1.214 1.075 1.120 1.423
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.73611 0.11123 -6.618 3.64e-11 ***
log(hindfoot_length) 0.25205 0.03333 7.563 3.93e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43408 on 31369 degrees of freedom
Residual deviance: 43351 on 31368 degrees of freedom
(4179 observations deleted due to missingness)
AIC: 43355
Number of Fisher Scoring iterations: 3
Observation Weights
Both the lm()
and glm()
function allow a vector of weights the same length
as the response. Weights can be necessary for logistic regression, depending on
the format of the data. The binomial
family glm()
works with three different
response variable formats.
factor
with only or coerced to two levels (binary)matrix
with two columns for the count of “successes” and “failures”numeric
proprtion of “successes” out ofweights
trials
Depending on your model, these three formats are not necesarilly interchangeable.
Linear Mixed Models
The lme4 package expands the formula “mini-language” to allow descriptions of “random effects”.
- normal predictors are “fixed effects”
(...|...)
expressions describe “random effects”
In the context of this package, variables
added to the right of the ~
in the usual way are “fixed effects”—they
consume a well-defined number of degrees of freedom. Variables added within
(...|...)
are “random effects”.
The “random intercepts” and “random slopes” models are the two most common extensions to a formula with one variable.
Formula | Description |
---|---|
y ~ a |
constant and one fixed effect |
y ~ (1 | b) + a |
random intercept for each level in b and one fixed effect |
y ~ a + (a | b) |
random intercept and slope w.r.t. a for each level in b |
Random Intercept
The lmer
and glmer
functions fit linear and generalized linear models with
the lme4
formula syntax.
library(lme4)
fit <- lmer(
hindfoot_length ~ (1|species_id) + log(weight),
data = animals)
Warning: 'rBind' is deprecated.
Since R version 3.2.0, base's rbind() should work fine with S4 objects
> summary(fit)
Linear mixed model fit by REML ['lmerMod']
Formula: hindfoot_length ~ (1 | species_id) + log(weight)
Data: animals
REML criterion at convergence: 107596.2
Scaled residuals:
Min 1Q Median 3Q Max
-17.3183 -0.5004 0.0254 0.5731 21.4700
Random effects:
Groups Name Variance Std.Dev.
species_id (Intercept) 47.377 6.883
Residual 1.927 1.388
Number of obs: 30738, groups: species_id, 24
Fixed effects:
Estimate Std. Error t value
(Intercept) 16.77000 1.41230 11.87
log(weight) 2.13065 0.03799 56.09
Correlation of Fixed Effects:
(Intr)
log(weight) -0.087
The familiar assessment of model residuals is absent from the summary due to the lack of a widely accepted measure of null and residual deviance. The notions of model saturation, degrees of freedom, and independence of observations have all crossed onto thin ice.
Non-independence
Models with random effects should be understood as specifying multiple, overlapping probability statements about the observations.
In a lm
or glm
fit, each response is conditionally independent, given it’s
predictors and the model coefficients. Each observation corresponds to it’s own
probability statement. In a model with random effects, each response is no
longer conditionally independent, given it’s predictors and model coefficients.
Random Slope
Adding a numeric variable on the left of a grouping specified with (...|...)
produces a “random slope” model. Here, separate coefficients for hindfoot_length
are allowed for each species.
fit <- lmer(
hindfoot_length ~
log(weight) + (log(weight) | species_id),
data = animals)
> summary(fit)
Linear mixed model fit by REML ['lmerMod']
Formula: hindfoot_length ~ log(weight) + (log(weight) | species_id)
Data: animals
REML criterion at convergence: 107030.8
Scaled residuals:
Min 1Q Median 3Q Max
-17.4953 -0.4753 0.0349 0.5259 21.6529
Random effects:
Groups Name Variance Std.Dev. Corr
species_id (Intercept) 19.7199 4.4407
log(weight) 0.6731 0.8204 0.76
Residual 1.8905 1.3750
Number of obs: 30738, groups: species_id, 24
Fixed effects:
Estimate Std. Error t value
(Intercept) 17.7170 0.9367 18.914
log(weight) 1.6920 0.1840 9.196
Correlation of Fixed Effects:
(Intr)
log(weight) 0.573
Generalized Mixed Models
The glmer
function merely adds to lmer
the option to specify several
exponential family distributions for the response variable.
Meet Stan
Stan is the name of a program that performs an entirely different kind of model fitting, and the Stan modelling language is yet another way of writing a “formula”. Stan is similar to JAGS and BUGS—it is a “sampler”, but does not use the Gibbs algorithm.
Sampling vs. Fitting
The result of model fitting through lm
and its descendents in R is an estimated coefficient coupled with the uncertainty in that estimate.
The results produced by Stan are tables of numbers with a column for each coefficient in the model. Each row represents an equally likely set of coefficients for the model, as judged by its resulting fit to the data.
An Example
The formulas we use previously to describe a “random intercepts” model are directly included in a Stan model file, along with necessary information on the data.
data {
int N;
int M;
vector[N] hindfoot_length;
int species_id[N];
vector[N] log_weight;
}
The parameters section of the model defines the parameters for which samples will be returned.
parameters {
real beta0;
vector[M] beta1;
real beta2;
real<lower=0> sigma0;
real<lower=0> sigma1;
}
The model section specifies all the probability distributions used in the likelihood and the priors.
model {
vector[N] mu;
mu = beta0 + beta1[species_id] + beta2 * log_weight;
hindfoot_length ~ normal(mu, sigma0);
beta1 ~ normal(0, sigma1);
beta0 ~ normal(0, 5);
beta2 ~ normal(0, 5);
sigma0 ~ cauchy(0, 5);
sigma1 ~ cauchy(0, 5);
}
RStan
The rstan package is a wrapper for sending data to the Stan program and getting back results.
library(dplyr)
stanimals <- animals %>%
select(hindfoot_length, species_id, weight) %>%
na.omit() %>%
mutate(
log_weight = log(weight),
species_id = as.integer(factor(species_id))) %>%
select(-weight)
stanimals <- c(
N = nrow(stanimals),
M = max(stanimals$species_id),
as.list(stanimals))
The stan
function reads the model file, compiles an hidden executable based on the model and data, runs it and reads the results back into R.
library(rstan)
samp <- stan(file = 'worksheet-6.stan',
data = stanimals,
iter = 1000, chains = 2, cores = 2)
saveRDS(samp, 'stanimals.RDS')
Parameter Inference
An example samp
output, saved to “RDS” after a long-running Stan job, is
available in the handout’s data folder. Each row represents an equally likely
set of coefficients for the model, as judged by its resulting fit to the data.
> plot(samp, pars = c('beta0', 'beta2', 'sigma0', 'sigma1'))
ci_level: 0.8 (80% intervals)
outer_level: 0.95 (95% intervals)
The “random intercepts” due to species_id are also parameters in the
model—which merely claimed they are normally distributed with variance
sigma1
.
> plot(samp, pars = 'beta1')
ci_level: 0.8 (80% intervals)
outer_level: 0.95 (95% intervals)
Perhaps we should revisit that claim, and choose a asymmetric distribution.
Pre-compiled Stan
Sampling with Stan lets you specify any structure and use any probability distribution. You pay quite a cost in speed and difficulty, but both are going down with rstanarm.
The pre-compiled Stan analog of the example above takes a similar model formula as lme4, the usual data frame, and adds “default” priors.
library(rstanarm)
fit <- stan_lmer(
hindfoot_length ~ (1 | species_id) + log(weight),
data = animals,
chains = 2, cores = 2, iter = 1000)
The stan_glmer
function allows you to additionally specify a family, same as the glm
function employed above.
A little advice
- Don’t start with generalized linear mixed models! Work your way up through
lm()
glm()
lmer()
glmer()
stan_lm()
- Take time to understand the
summary()
—the hazards of secondary data compel you! The vignettes for lme4 provide excellent documentation.
-
When all else fails with lme4, ask on Cross Validated (and Ben Bolker very well might answer).
-
When all else fails with Stan, ask the Stan users/devs community.
Exercises
Exercise 1
Regress “hindfoot_length” against “species_id” and the log of “weight”. Does it appear that the Chihuahuan Desert’s common kangaroo rats (DM) have largish feet for their weight?
Exercise 2
Weight is actually a positive integer in this dataset. Fit the log of weight
against species ID using lm()
, and fit raw weight against species ID using
glm()
with the Poisson family. According the the anova()
table, are both
models plausible? Hint: you may need to provide additional arguments to
anova()
.
Exercise 3
You are standing in the Chihuahuan desert, when a pocket mouse (genus
Perognathus) suddenly runs up your pant leg. It weighs down your pocket quite
a bit, relative to your many similar pocket mouse experiences. Run a binomial
family GLM on the two Perognathus species in the animals table that may help you
predict to which species it belongs. Hint: Begin by mutating species_id
into a
factor()
with levels = c("PF", "PH")
.
Exercise 4
Write down the formula for a random intercepts model with a fixed effect of sex and a random effect of plot on each animal’s weight.
Exercise 5
Use stan_glm
to evaluate the logistic regression performed earlier with glm
on sex as predicted by the log of hindfoot_length.
Solutions
Solution 1
fit <- lm(
hindfoot_length ~ species_id + log(weight),
data = animals)
The estimated coefficient for “species_idDM” and associated “***” suggest “yes”.
> summary(fit)
Call:
lm(formula = hindfoot_length ~ species_id + log(weight), data = animals)
Residuals:
Min 1Q Median 3Q Max
-24.0407 -0.6951 0.0352 0.7954 29.8038
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.4710 0.2222 38.127 < 2e-16 ***
species_idDM 19.5411 0.2164 90.316 < 2e-16 ***
species_idDO 18.8751 0.2189 86.234 < 2e-16 ***
species_idDS 31.3797 0.2320 135.262 < 2e-16 ***
species_idNL 13.0945 0.2382 54.967 < 2e-16 ***
species_idOL 4.7893 0.2176 22.011 < 2e-16 ***
species_idOT 5.0538 0.2129 23.743 < 2e-16 ***
species_idOX 5.4411 0.6553 8.303 < 2e-16 ***
species_idPB 10.3330 0.2144 48.196 < 2e-16 ***
species_idPE 5.2348 0.2137 24.499 < 2e-16 ***
species_idPF 2.7361 0.2101 13.023 < 2e-16 ***
species_idPH 10.0344 0.3277 30.622 < 2e-16 ***
species_idPI 7.3669 0.5336 13.807 < 2e-16 ***
species_idPL 5.3377 0.3119 17.115 < 2e-16 ***
species_idPM 5.5085 0.2152 25.601 < 2e-16 ***
species_idPP 7.2761 0.2102 34.623 < 2e-16 ***
species_idPX 4.7670 1.0036 4.750 2.05e-06 ***
species_idRF 3.5411 0.2637 13.430 < 2e-16 ***
species_idRM 2.9976 0.2090 14.343 < 2e-16 ***
species_idRO 1.9825 0.5327 3.722 0.000198 ***
species_idRX 4.2909 1.0034 4.276 1.90e-05 ***
species_idSF 9.7021 0.3100 31.298 < 2e-16 ***
species_idSH 11.1301 0.2535 43.909 < 2e-16 ***
species_idSO 8.7729 0.3093 28.364 < 2e-16 ***
log(weight) 2.1277 0.0380 55.998 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.388 on 30713 degrees of freedom
(4811 observations deleted due to missingness)
Multiple R-squared: 0.9789, Adjusted R-squared: 0.9788
F-statistic: 5.924e+04 on 24 and 30713 DF, p-value: < 2.2e-16
Solution 2
fit_lm <- lm(log(weight) ~ species_id,
data = animals)
fit_glm <- glm(weight ~ species_id,
family = poisson,
data = animals)
> anova(fit_lm)
Analysis of Variance Table
Response: log(weight)
Df Sum Sq Mean Sq F value Pr(>F)
species_id 24 16501.4 687.56 15800 < 2.2e-16 ***
Residuals 32258 1403.8 0.04
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> anova(fit_glm, test = 'Chisq')
Analysis of Deviance Table
Model: poisson, link: log
Response: weight
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 32282 775950
species_id 24 718546 32258 57404 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Solution 3
perognathus <- animals
perognathus$species_id <- factor(
perognathus$species_id, levels = c('PF', 'PH'))
fit <- glm(species_id ~ weight,
family = binomial,
data = perognathus)
The negative intercept is due to the much higher frequency of “failures” (the first level or P. flavus). The positive effect of weight means you’ve probably got a P hispidus in your pocket.
> summary(fit)
Call:
glm(formula = species_id ~ weight, family = binomial, data = perognathus)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.83312 -0.03294 -0.02487 -0.01878 2.25500
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.57789 1.94715 -6.460 1.05e-10 ***
weight 0.56207 0.09342 6.017 1.78e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 305.082 on 1578 degrees of freedom
Residual deviance: 23.771 on 1577 degrees of freedom
(33970 observations deleted due to missingness)
AIC: 27.771
Number of Fisher Scoring iterations: 10
Solution 4
> weight ~ (1 | plot_id) + sex
Solution 5
library(rstanarm)
fit <- stan_glm(
sex ~ log(hindfoot_length),
data = animals,
family = binomial(),
chains = 2, cores = 2, iter = 1000)
If you need to catch-up before a section of code will work, just squish it's 🍅 to copy code above it into your clipboard. Then paste into your interpreter's console, run, and you'll be ready to start in on that section. Code copied by both 🍅 and 📋 will also appear below, where you can edit first, and then copy, paste, and run again.
# Nothing here yet!