Mini-Languages for Statistical Models

Lesson 6 with Ian Carroll

Contents


Objectives for this lesson

Specific achievements

Top of Section


R’s formula notation

The basic function for fitting a regression in R is the lm function, standing for linear model.

wt_len <- weight ~ hindfoot_length
wt_len.fit <- lm(formula = wt_len, data = animals)

The lm function uses the given formula and the data types of portal to compute the best fitting model in this family (i.e. determine the optimum coefficients)

Mini-language

The implementation of lm is a specific solution to a general problem: how can a human easily write down a statistical model that a machine can interpret, relate to data, and optimize through a given fitting procedure?

For the lm function, the answers are:

  1. Use a unquoted “formula” expression, mixing variable names and algebra-like operators, that maps to a design matrix.
  2. Check the variables used in the formula to complete the design matrix.
  3. Compute the linear least-squares estimator and more.

Note that “the details” are left to the lm function where possible. For example, whether a variable is a factor or numeric is read from the data frame; the formula does not distinguish between data types.

Mini-language

Across different packages in R, this formula “mini-language” has evolved to allow complex model specification. Each one adds syntax to the formula or new arguments to the fitting function. On a far-distant branch of this evolution is the Stan modeling language, which provides the greatest flexibility but demands in return the most descriptive language.

Top of Section


Linear models

The formula requires a response variable left of a “~” and any number of predictors to its right.

Formula Equivalent Description
y ~ a y ~ 1 + a constant and one predictor
y ~ -1 + a y ~ 0 + a one predictor with no constant
y ~ a + b y ~ 1 + a + b constant and two predictors
y ~ a:b   constant and one predictor, the interaction of (at least) one factor and another variable
y ~ a*b y ~ 1 + a + b + a:b constant and three predictors
y ~ a*b - a y ~ 1 + b + a:b constant and two predictors
y ~ (a + b + ... )^n   constant and all combinations of predictors up to order n

Linear model

In addition, certain functions are allowed within the formula definition.

animals <- read.csv('data/animals.csv', stringsAsFactors = FALSE, na.strings = '')
fit <- lm(
  log(weight) ~ hindfoot_length,
  data = animals)
summary(fit)
Linear mixed model fit by REML ['lmerMod']
Formula: log(weight) ~ (1 | species_id) + hindfoot_length
   Data: animals

REML criterion at convergence: -11966.7

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-11.7117  -0.4802   0.1040   0.6123   7.4102 

Random effects:
 Groups     Name        Variance Std.Dev.
 species_id (Intercept) 0.25479  0.5048  
 Residual               0.03943  0.1986  
Number of obs: 30738, groups:  species_id, 24

Fixed effects:
                 Estimate Std. Error t value
(Intercept)     2.2138725  0.1051703   21.05
hindfoot_length 0.0436769  0.0007761   56.28

Correlation of Fixed Effects:
            (Intr)
hndft_lngth -0.175

Exercise 1

Regress hindfoot_length against weight and species_id. Does it appear that the Chihuahuan Desert’s common kangaroo rat (DM) have inordinately large feet for their weight?

Pay attention to factors

Data type matters in statistical modelling. For the predictors in a linear model, the most important distinction is discrete versus continuous.

animals$species_id <- factor(animals$species_id)
fit <- lm(
  log(weight) ~ species_id,
  data = animals)
summary(fit)

Call:
lm(formula = log(weight) ~ species_id, data = animals)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.28157 -0.10063  0.02803  0.12574  1.48272 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.12857    0.03110  68.448  < 2e-16 ***
species_idDM  1.62159    0.03117  52.031  < 2e-16 ***
species_idDO  1.74519    0.03134  55.690  < 2e-16 ***
species_idDS  2.63791    0.03139  84.024  < 2e-16 ***
species_idNL  2.89645    0.03170  91.373  < 2e-16 ***
species_idOL  1.29724    0.03181  40.780  < 2e-16 ***
species_idOT  1.04031    0.03142  33.110  < 2e-16 ***
species_idOX  0.91176    0.09066  10.056  < 2e-16 ***
species_idPB  1.30426    0.03135  41.609  < 2e-16 ***
species_idPE  0.92374    0.03165  29.188  < 2e-16 ***
species_idPF -0.07717    0.03155  -2.446 0.014447 *  
species_idPH  1.28769    0.04869  26.446  < 2e-16 ***
species_idPI  0.82629    0.08004  10.323  < 2e-16 ***
species_idPL  0.79433    0.04665  17.029  < 2e-16 ***
species_idPM  0.90396    0.03189  28.349  < 2e-16 ***
species_idPP  0.69278    0.03133  22.114  < 2e-16 ***
species_idPX  0.81448    0.15075   5.403 6.61e-08 ***
species_idRF  0.45257    0.03934  11.505  < 2e-16 ***
species_idRM  0.20908    0.03137   6.665 2.70e-11 ***
species_idRO  0.18447    0.08004   2.305 0.021191 *  
species_idRX  0.56824    0.15075   3.769 0.000164 ***
species_idSF  1.87613    0.04504  41.656  < 2e-16 ***
species_idSH  2.10024    0.03572  58.803  < 2e-16 ***
species_idSO  1.80152    0.04504  40.000  < 2e-16 ***
species_idSS  2.32672    0.15075  15.434  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2086 on 32258 degrees of freedom
  (3266 observations deleted due to missingness)
Multiple R-squared:  0.9216,	Adjusted R-squared:  0.9215 
F-statistic: 1.58e+04 on 24 and 32258 DF,  p-value: < 2.2e-16

The difference between 1 and 24 degrees of freedom between the last two models—with one fixed effect each—arises from the discreteness of species_id.

Top of Section


Generalized linear models

The lm function treats the response variable as numeric—the glm function lifts this restriction and others. Not through the formula syntax, which is the same for calls to lm and glm, but through addition of the family argument.

GLM families

The family argument determines the family of probability distributions in which the response variable belongs. A key difference between families is the data type and range.

Family Support Default link
gaussian numeric identity, log, inverse
binomial 2-level factor logit, probit, cauchit, log, cloglog
poisson non-negative integer log, identity, sqrt
Gamma positive numeric inverse, identity, log
inverse.gaussian positive numeric “1/mu^2”, inverse, identity, log

Exercise 2

The default value for family is gaussian(link = 'identity'). Replace lm with glm (changing nothing else), to again fit the formula log(weight) ~ species_id. Compare the summary() between lm() and glm(), and identify something that is the same and something that is different in the output.

Correct use of the link argument to the family requires in-depth knowledge about generallize linear models—not our objective here. A common mistake to avoid, however, is assuming that glm applies the transfromation given as link to the response variable.

fit <- glm(weight ~ hindfoot_length,
    family = gaussian(link = `log`),
    data = animals)

Logistic regression

Calling glm with familly = binomial using the default “logit” link performs logistic regression.

animals$sex <- factor(animals$sex)
fit <- glm(sex ~ hindfoot_length,
           family = binomial,
           data = animals)
summary(fit)

Call:
glm(formula = sex ~ hindfoot_length, family = binomial, data = animals)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.366  -1.207   1.057   1.124   1.246  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -0.179215   0.036495  -4.911 9.08e-07 ***
hindfoot_length  0.009571   0.001186   8.068 7.17e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43408  on 31369  degrees of freedom
Residual deviance: 43343  on 31368  degrees of freedom
  (4179 observations deleted due to missingness)
AIC: 43347

Number of Fisher Scoring iterations: 3

Weight

Both the lm and glm function allow a vector of weights the same length as the response. Weights can be necessary for logistic regression, depending on the format of the data. The binomial family glm works with three different response variable formats.

  1. factor with two levels, as shown in the table above
  2. matrix of type integer with two columns for the count of “successes” and “failures”.
  3. numeric between 0 and 1, specifying the proprtion of “successes” out of weights trials.

Exercise 3

You are standing in the Chihuahuan desert, when a pocket mouse (genus Perognathus) suddenly runs up your pant leg. It doesn’t weigh your pocket down much, relative to your many similar pocket mouse experiences. Run a binomial family GLM on the two Perognathus species in the Portal data that may help you predict to which species it belongs. Hint: The second level() of a factor is the “success” in a binomial GLM.

Top of Section


Linear Mixed Models

The lme4 package expands the formula “mini-language” to allow descriptions of “random effects”. In the context of this package, variables added to the right of the ~ in the usual way are “fixed effects”—they consume a well-defined number of degrees of freedom. Variables added within (...|...) are “random effects”.

The “variable intercepts” and “random slopes” classes of model are the two most common extensions to a formula with one variable.

Formula Description
y ~ a constant and one fixed effect
y ~ (1 | b) + a random intercept for each level in b and one fixed effect
y ~ (1 + a | b) random intercepts and “slopes” w.r.t. a for each level in b

Random intercept

The lmer and glmer functions fit linear and generalized linear models with the lme4 formula syntax.

library(lme4)
fit <- lmer(log(weight) ~ (1 | species_id) + hindfoot_length,
            data = animals)
summary(fit)
Linear mixed model fit by REML ['lmerMod']
Formula: log(weight) ~ (1 | species_id) + hindfoot_length
   Data: animals

REML criterion at convergence: -11966.7

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-11.7117  -0.4802   0.1040   0.6123   7.4102 

Random effects:
 Groups     Name        Variance Std.Dev.
 species_id (Intercept) 0.25479  0.5048  
 Residual               0.03943  0.1986  
Number of obs: 30738, groups:  species_id, 24

Fixed effects:
                 Estimate Std. Error t value
(Intercept)     2.2138725  0.1051703   21.05
hindfoot_length 0.0436769  0.0007761   56.28

Correlation of Fixed Effects:
            (Intr)
hndft_lngth -0.175

The familiar assessment of model residuals is absent from the summary due to the lack of a widely accepted measure of null and residual deviance. The notions of model saturation, degrees of freedom, and independence of observations have all crossed onto thin ice.

Lack of independence

Models with random effects should be understood as specifying multiple, overlapping probability statements about the observations.

In a lm or glm fit, each response is conditionally independent, given it’s predictors and the model coefficients. Each observation corresponds to it’s own probability statement. In a model with random effects, each response is no longer conditionally independent, given it’s predictors and model coefficients.

Exercise 4

Adjust the formula log(weight) ~ hindfoot_length to fit a “random intercepts” model, grouping by a different categorical variable (i.e. not species_id) in the animals data frame.

Random slopes

Adding a numeric variable after the constant within a grouping specified by (...|...) produces a “random slope” model. Here, separate coefficients for hindfoot_length are allowed for each species.

fit <- lmer(log(weight) ~ (1 + hindfoot_length | species_id),
            data = animals)
summary(fit)

Call:
lm(formula = log(weight) ~ species_id, data = animals)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.28157 -0.10063  0.02803  0.12574  1.48272 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.12857    0.03110  68.448  < 2e-16 ***
species_idDM  1.62159    0.03117  52.031  < 2e-16 ***
species_idDO  1.74519    0.03134  55.690  < 2e-16 ***
species_idDS  2.63791    0.03139  84.024  < 2e-16 ***
species_idNL  2.89645    0.03170  91.373  < 2e-16 ***
species_idOL  1.29724    0.03181  40.780  < 2e-16 ***
species_idOT  1.04031    0.03142  33.110  < 2e-16 ***
species_idOX  0.91176    0.09066  10.056  < 2e-16 ***
species_idPB  1.30426    0.03135  41.609  < 2e-16 ***
species_idPE  0.92374    0.03165  29.188  < 2e-16 ***
species_idPF -0.07717    0.03155  -2.446 0.014447 *  
species_idPH  1.28769    0.04869  26.446  < 2e-16 ***
species_idPI  0.82629    0.08004  10.323  < 2e-16 ***
species_idPL  0.79433    0.04665  17.029  < 2e-16 ***
species_idPM  0.90396    0.03189  28.349  < 2e-16 ***
species_idPP  0.69278    0.03133  22.114  < 2e-16 ***
species_idPX  0.81448    0.15075   5.403 6.61e-08 ***
species_idRF  0.45257    0.03934  11.505  < 2e-16 ***
species_idRM  0.20908    0.03137   6.665 2.70e-11 ***
species_idRO  0.18447    0.08004   2.305 0.021191 *  
species_idRX  0.56824    0.15075   3.769 0.000164 ***
species_idSF  1.87613    0.04504  41.656  < 2e-16 ***
species_idSH  2.10024    0.03572  58.803  < 2e-16 ***
species_idSO  1.80152    0.04504  40.000  < 2e-16 ***
species_idSS  2.32672    0.15075  15.434  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2086 on 32258 degrees of freedom
  (3266 observations deleted due to missingness)
Multiple R-squared:  0.9216,	Adjusted R-squared:  0.9215 
F-statistic: 1.58e+04 on 24 and 32258 DF,  p-value: < 2.2e-16

Allowing each animal to use a constant and hindfoot_length coefficient shared only by other members of its species is a lot like fitting separate regressions for each species. Actual utility arises when combining this random slope predictor with other predictors not grouped by species.

Generalized Mixed Models

The glmer function merely adds to lmer the option to specify an exponential family distribution for the response variable.

Top of Section


Meet Stan

Stan is the name of a program that performs an entirely different kind of model fitting, and the Stan modelling language is yet another way of writing a “formula”. Stan is similar to JAGS and BUGS: it is a “sampler”, but does not use the Gibbs algorithm.

Sampling vs. Fitting

The result of model fitting through lm and its descendents in R is an estimated coefficient coupled with the uncertainty in that estimate.

The results produced by Stan are tables of numbers with a column for each coefficient in the model. Each row represents an equally likely set of coefficients for the model, as judged by its resulting fit to the data.

An example

The formulas we use previously to describe a “random intercepts” model are directly included in a Stan formula.

data {
  int N;
  vector[N] log_weight;
  vector[N] hindfoot_length;
  int species_id[N];
  int M;
} 
parameters {
  real beta0;
  vector[M] beta1;
  real beta2;
  real<lower=0> sigma;
  real<lower=0> sigma_genus;
}
transformed parameters {
  vector[N] mu;
  for (i in 1:N)
    mu = beta0 + beta1[species_id]+ hindfoot_length * beta2;
}
model {
  log_weight ~ normal(mu, sigma);
  beta0 ~ normal(0, 10);
  beta1 ~ normal(0, sigma_genus);
  beta2 ~ normal(0, 10);
  sigma_genus ~ cauchy(0, 5);
}

RStan

The rstan package is a wrapper for sending data to the Stan program and getting back results.

library(rstan)

stanimals <- animals %>%
  select(weight, species_id, hindfoot_length) %>%
  na.omit() %>%
  mutate(log_weight = log(weight),
         species_idx = as.integer(factor(species_id))) %>%
  select(-weight, -species_id)
stanimals <- c(
  N = nrow(stanimals),
  M = max(stanimals$species_idx),
  as.list(stanimals))

samp <- stan(file = 'worksheet-6.stan',
             data = stanimals,
             iter = 1000, chains = 3)
saveRDS(samp, 'stanimals.RDS')

Parameter “samples”

An example samp output, saved to “RDS” after a long-running Stan job, is available in the handout’s data folder. Each row represents an equally likely set of coefficients for the model, as judged by its resulting fit to the data.

Summary

Top of Section