Week 5: Correlations and Regressions

Yiling Huo

2024-06-04

So far we’ve talked about statistical tests that compare groups/conditions. But you might wonder: what if my Independent Variable is not categorical (e.g. time, age, temperature)? Sure, we can try to force those continuous measures into categories (e.g. young vs. old, cold vs. warm), but the classification might not be straightforward in many cases, and moreover, converting continuous measures to categories might make you fail to detect the underlying effect. So today we will talk about ways to deal with continuous Independent Variables (and eventually see that categorical variables can be treated as special cases of continuous variables).

1 Correlation

Informally, correlation refers to an association or relationship between two variables; this relationship can go in either direction. For example, body weight is associated with height; general fitness is related to the time taken to run 100 yards; drowning accidents are associated with the consumption of ice cream; shoe size is associated with the speed of doing jigsaw puzzles in children; …1 Quick Q: Try to think about the ‘reason’ behind the last two cases, and think about what correlations can truly tell you.

The strength of a correlation tells you how different values of one variable are related to the value of the other. If two variables are strongly correlated, you can readily predict one value from the other; if they are not correlated, knowing one doesn’t tell you anything about the value of the other.

1.1 An example

Let’s look at an example of two correlated variables. Fourteen pre-schoolers are given a test of verbal reasoning (max score = 30) and a picture vocabulary test (max score = 20). Here are their scores:

Task                 s1  s2  s3  s4  s5  s6  s7  s8  s9  s10  s11  s12  s13  s14
Verbal reasoning     25  21  15  16  23  29  13  19  23  21   15   26   11   20
Picture vocabulary   15  11   9  12  14  18  12  12  17  14   12   13   10   12

It looks like there is a tendency that as verbal reasoning scores increase, picture vocabulary scores also increase. If we fit a straight line to these data, we can see that the actual data points are distributed quite closely around the line. In fact, the closer the data are to this line, the stronger the correlation.

1.2 Pearson’s correlation coefficient (\(r\))

But how do we quantify this correlation? We do that with the Pearson product-moment correlation coefficient (or just Pearson’s correlation coefficient). Pearson’s correlation coefficient (\(r\)) lies between -1 and +1, and greater magnitudes (closer to +/-1) indicate stronger relationships.

A positive correlation means that increases in one variable are accompanied by increases in the other; while a negative correlation means that increases in one variable are accompanied by decreases in the other. Finally, zero correlation means that changes (either increases or decreases) in one variable are unrelated to the changes in the other variable.

In the above example, \(r=0.72\); this is a rather strong positive correlation.
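
If you want to check this number yourself, here is a minimal sketch of the calculation in R, using the scores from the table above (cor() computes Pearson’s \(r\) by default):

# scores from the table above
verbal_reasoning <- c(25, 21, 15, 16, 23, 29, 13, 19, 23, 21, 15, 26, 11, 20)
picture_vocabulary <- c(15, 11, 9, 12, 14, 18, 12, 12, 17, 14, 12, 13, 10, 12)
# Pearson's r is the default method of cor(); this should reproduce the r of about 0.72 reported above
cor(verbal_reasoning, picture_vocabulary)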

The magnitude of a correlation can be roughly classified as:

\[ 0 < \text{very weak} < 0.2 < \text{weak} < 0.4 < \text{moderate} < 0.6 < \text{strong} < 0.8 < \text{very strong} < 1 \]

Correlation can also be used as a statistical test. In this case, the null hypothesis is that there is no correlation between the two variables (the true correlation is zero), and the alternative hypothesis is that there is a (non-zero) correlation.

Note that in a correlation, there is no distinction between the predictor and the outcome (there is no distinction between the Independent and the Dependent variables).

Pearson’s correlation requires that:
  • both variables are continuous (interval or ratio) measures;
  • the relationship between the two variables is linear;
  • both variables are (roughly) normally distributed.

Here are some examples of what data might look like with different correlation coefficients (with the best-fitting line drawn). Here is also an online tool for you to develop some intuition about Pearson’s \(r\).

Perfect positive correlation.
Trivial positive correlation.
Strong positive correlation.
Nearly perfect negative correlation.
Trivial negative correlation.
Strong negative correlation.

But correlation coefficients can be misleading! The reason, again, is that correlation coefficients assume something about the relationship. For example, Pearson’s \(r\) assumes a linear relationship between the variables, which might not always be the case. So before you rush to calculate the correlation coefficient, always draw a scatter plot and look at the relationship first. For example, here is a case with a nearly perfect relationship but a very small \(r\):

\(y = x^2 + \text{noise}\).
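
If you want to see this for yourself, here is a small simulation sketch along the same lines as the figure; the exact value depends on the random noise, but \(r\) should come out close to zero even though \(y\) is almost completely determined by \(x\):

# simulate a nearly perfect but non-linear relationship
set.seed(1)
x <- seq(-3, 3, length.out = 100)
y <- x^2 + rnorm(100, sd = 0.5)
# Pearson's r is close to zero despite the strong relationship
cor(x, y)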

1.3 Spearman’s correlation coefficient (\(\rho\))

Unlike Pearson’s correlation, Spearman’s rank-order correlation (\(\rho\)) doesn’t assume a linear relationship2 But it does assume a monotonic relationship between the two variables. A monotonic relationship means that the relationship is either positive or negative, and at no point in the data does this change. In the \(y = x^2\) example, x and y have a non-monotonic relationship (y decreases then increases as x increases), so Spearman’s \(\rho\) cannot help us either. , and does not require data to be normally distributed. In some cases, Spearman’s correlation can help you quantify the relationship better.
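
As a quick illustration (with made-up numbers, not real data): in a monotonic but non-linear relationship, the ranks line up perfectly, so Spearman’s \(\rho\) is 1 while Pearson’s \(r\) stays below 1.

# a monotonic but non-linear relationship (simulated for illustration)
x <- 1:20
y <- exp(x / 4)
# Pearson's r is below 1 because the relationship is not a straight line
cor(x, y)
# Spearman's rho is exactly 1 because the ranks match perfectly
cor(x, y, method = 'spearman')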

1.4 Correlation ≠ Causation!

If one variable is genuinely causally related to another (A causes B), there will be some kind of correlation between them. But you can’t be sure of the reverse: a correlation between A and B does not tell you that A causes B.

A correlation between two variables can be caused by their relationship to other variables. For example, children run faster as they get older; children also learn more words as they get older. But you cannot say that running causes children’s vocabulary size to increase, or vice versa.

Here is another example: Per capita consumption of mozzarella cheese is correlated with civil engineering doctorates awarded. But you should know that eating cheese doesn’t make you a doctor.

1.5 Do it in R: Pearson’s correlation coefficient

Let’s use the above example.

  1. Prepare data: Data should be in short form, where each row represents one observation (in this example, one year).
  2. Check assumptions
  3. Run the correlation test using cor.test().
##    year mozzarella_consumed engineering_doctorates
## 1  2000                 9.3                    480
## 2  2001                 9.7                    501
## 3  2002                 9.7                    540
## 4  2003                 9.7                    552
## 5  2004                 9.9                    547
## 6  2005                10.2                    622
## 7  2006                10.5                    655
## 8  2007                11.0                    701
## 9  2008                10.6                    712
## 10 2009                10.6                    708
# Check for normality
shapiro.test(df$mozzarella_consumed)
## 
##  Shapiro-Wilk normality test
## 
## data:  df$mozzarella_consumed
## W = 0.94053, p-value = 0.559
shapiro.test(df$engineering_doctorates)
## 
##  Shapiro-Wilk normality test
## 
## data:  df$engineering_doctorates
## W = 0.89536, p-value = 0.1947
# Run the Pearson's correlation test
cor.test(df$mozzarella_consumed, df$engineering_doctorates, method='pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  df$mozzarella_consumed and df$engineering_doctorates
## t = 9.5274, df = 8, p-value = 1.217e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8300027 0.9904491
## sample estimates:
##       cor 
## 0.9586478
  4. Report the correlation

Pearson’s correlation coefficient revealed that the amount of mozzarella cheese consumed is significantly correlated with the number of engineering doctorates granted (\(r(8)=0.96, p<0.001\)).

2 Regression

A related idea to correlation is regression. You can think of correlation as indicating how much knowledge about one variable tells you about another. In regression, we make that explicit: how well can we predict a \(y\) variable from \(x\)? People’s intuition is often to think of \(x\) as ‘causing’ \(y\); but again this is not always justified (a significant predictor in a regression also does NOT tell you anything about causality).

2.1 Simple linear regression

Pearson’s correlation quantifies the strength of a linear relationship between a pair of variables, whereas linear regression expresses the relationship in the form of an equation. You might have noticed the ‘best-fitting’ lines I added to some of the scatter plots, and in the plainest language, regression is all about calculating that line.

2.1.1 Simple linear regression: Example

Suppose a researcher measured the mean length of utterance (MLU) in a group of young children (2-4 years) in a neighbourhood, and found that it depends on age (\(r=0.85\)). In other words, you can predict a particular child’s MLU from their age reasonably well, or at least better than if you don’t know it.

Now a new family moves into the neighbourhood; how would you go about making a guess for their child’s MLU? Without any information about the child’s age, your best guess can only be the average MLU. But if you know their age, you can predict their MLU based on age (i.e. MLU as a function of age).

In a simple linear regression, the model function can be written as

\[ y = \alpha + {\beta}x \]

where \(\alpha\) is the intercept (where the line crosses the y-axis), and \(\beta\) is the gradient or slope (how steep the line is). Data can be described by

\[ y_i = \alpha + {\beta}x_i + \epsilon_i \]

where \(\epsilon\) is the error or residual (the difference between predicted and observed value).

2.1.2 Assumptions of simple linear regression

  • Linearity: the relationship between the predictor (\(x\)) and the outcome (\(y\)) is assumed to be linear.
  • Normality of residuals: the residuals (errors) are assumed to be normally distributed.
  • Homoscedasticity (homogeneity of residual variance): the residuals are assumed to have a constant variance.
  • Independence of residuals (error terms): the residuals are assumed to be independent of one another.

2.1.3 Effect size of a linear regression: R-squared

The \(R^2\) is a statistical measure in a regression model that determines the proportion of variance in the dependent variable that can be explained by the independent variable (aka variance accounted for). In other words, the \(R^2\) shows how well the data fit the regression model (the goodness of fit).

In a simple linear regression, the \(R^2\) is simply the square of the Pearson correlation coefficient. The \(R^2\) can be interpreted in similar ways as the correlation coefficient: an \(R^2\) closer to 1 indicates that a larger proportion of the variability in the outcome can be explained by the regression model, while a number closer to 0 indicates less variance accounted for.

2.1.4 Do it in R: Simple linear regression

  1. Prepare data3 You can download the data here.: data should be in short form: the two measures for the same observation should be in the same row.
# import data
simple_regression_exp <- read.csv('mlu.csv', header = TRUE)
head(simple_regression_exp)
##   participant      age       MLU
## 1           1 3.245548 11.704914
## 2           2 3.085281 12.035840
## 3           3 2.624829  7.582783
## 4           4 3.426735 11.920344
## 5           5 2.923928 11.801822
## 6           6 2.438216  7.292018
# visualise the data using a scatter plot
library(ggplot2)
ggplot(simple_regression_exp, aes(x=age, y=MLU)) + geom_point() + 
  geom_smooth(method=lm, se=FALSE, color = 'darkgrey') + 
  theme_light()

  2. Compute the linear model
simple_regression_model <- lm(MLU ~ age, data = simple_regression_exp)
summary(simple_regression_model)
## 
## Call:
## lm(formula = MLU ~ age, data = simple_regression_exp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9116 -0.9599  0.2080  0.8839  3.0391 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.6878     1.7405  -2.693   0.0118 *  
## age           4.9756     0.5784   8.603 2.39e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.502 on 28 degrees of freedom
## Multiple R-squared:  0.7255, Adjusted R-squared:  0.7157 
## F-statistic:    74 on 1 and 28 DF,  p-value: 2.39e-09
  3. Interpret the summary output

The summary output shows 6 components, including:

  • call: the function used to compute the regression model.
  • residuals: a quick view of the distribution of the residuals, which by definition have a mean of zero. Therefore, the median should not be far from zero, and the minimum and maximum should be roughly equal in absolute value.
  • coefficients: the regression model’s \(\beta\) coefficients and their statistical significance.
  • residual standard error (RSE), \(R^2\), and the F-statistics: metrics that are used to check how well the model fits to the data.

How to interpret the linear regression summary table:

The model function: prediction = intercept estimate + predictor estimate * predictor value. In this case, the model function is \(y = -4.6878 + 4.9756*\text{age}\).
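
If you would rather let R do this arithmetic, the predict() function applies the fitted model to new data; a minimal sketch (the age value of 3 is just an example):

# predicted MLU for a (hypothetical) three-year-old, using the fitted model
predict(simple_regression_model, newdata = data.frame(age = 3))
# the same number by hand, from the model function above
-4.6878 + 4.9756 * 3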

Significance of the predictor: read the t-value and the p-value in the coefficients table. The null hypothesis in a regression is that there is no relationship between \(x\) and \(y\). The alternative hypothesis is that there is a relationship between \(x\) and \(y\), or that the value of \(x\) can predict the value of \(y\). In this example, we find a significant predictor age (p<0.001).

In this case, we also have a significant intercept (p = 0.01), which basically means that when age = 0 (where the line crosses the y-axis), the predicted MLU is significantly different from 0. This means very little to us in this case (what do you mean a new-born baby is predicted to have a non-zero utterance length?), but a significant intercept may be interpretable in other cases.

Overall significance: read the F-statistic at the bottom of the output. The F-statistic assesses whether at least one predictor variable has a non-zero coefficient. In a simple linear regression, this is not really interesting because we just have one predictor in total. In fact, in this case, the F value is simply the square of the t value for age (\(8.603^2 \approx 74\)).

The effect size: read the \(R^2\) (for simple regression) or the adjusted \(R^2\) (for multiple regression).
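
As a quick sanity check of the link between regression and correlation, here is a small sketch (using the objects from this example) showing that the multiple R-squared of a simple linear regression is just the squared Pearson correlation:

# R-squared reported by the simple regression ...
summary(simple_regression_model)$r.squared
# ... equals the squared Pearson correlation between the two variables
cor(simple_regression_exp$age, simple_regression_exp$MLU)^2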

  4. Check the assumptions (using diagnostic plots)
library(performance)
check_model(simple_regression_model)

  5. Report the regression

Simple linear regression was used to predict children’s mean length of utterance (MLU) from their age. Age explained a significant amount of the variance in MLU, F(1,28)=74, p<0.001, \(R^2\)=0.73. The regression coefficient (\(\beta\)=4.98) indicated that a one-year increase in age corresponded, on average, to an increase in MLU of 4.98 words.

2.2 Multiple linear regression

Multiple linear regression is just a linear regression with more than one predictor variable. The model function just has more terms than the simple linear regression:

\[ y = \alpha + {\beta_1}x_1 + {\beta_2}x_2 \]

And the data can be described by

\[ y_i = \alpha + {\beta_1}x_{1i} + {\beta_2}x_{2i} + \epsilon_i \]

2.2.1 Multiple linear regression example

Let’s say the researcher in the last example also collected data on those children’s maternal education, measured in years of education (YoE); and they want to know whether MLU can be predicted as a function of age and maternal education.

2.2.2 Do it in R: Multiple linear regression

  1. Prepare data4 You can download the data here.
  2. Run the model (and check assumptions)
multiple_regression_exp <- read.csv('mlu-multi.csv', header = TRUE)
head(multiple_regression_exp)
##   participant      age       MLU  YoE
## 1           1 3.245548 11.704914 13.1
## 2           2 3.085281 12.035840  7.5
## 3           3 2.624829  7.582783 12.3
## 4           4 3.426735 11.920344  9.5
## 5           5 2.923928 11.801822 10.3
## 6           6 2.438216  7.292018 13.8

# compute the linear regression model
## Note that I put down age + YoE because in this example we are not interested in the interaction between age and YoE.
## If you want the interaction to be included in the model, put down age * YoE instead.
multiple_regression_model <- lm(MLU ~ age + YoE, data = multiple_regression_exp)
summary(multiple_regression_model)
## 
## Call:
## lm(formula = MLU ~ age + YoE, data = multiple_regression_exp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9530 -0.9383  0.2373  0.8541  3.0798 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.87977    1.89934  -2.569    0.016 *  
## age          4.95104    0.59472   8.325  6.2e-09 ***
## YoE          0.02455    0.08812   0.279    0.783    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.527 on 27 degrees of freedom
## Multiple R-squared:  0.7263, Adjusted R-squared:  0.706 
## F-statistic: 35.82 on 2 and 27 DF,  p-value: 2.532e-08
  3. Model selection

From the summary output we can see that YoE is not a significant predictor in the model.

A good model not only needs to fit the data well, it also needs to be parsimonious (i.e. not more complex than necessary). If you are choosing between a simple model with one predictor and a complex model with, say, 10 predictors, the complex model needs to provide a much better fit to the data in order to be the better model. If it can’t, then the simpler model should be preferred.

In this case, we can compare whether the model with just one predictor (age; the one we got from the simple regression) or the model with both age and YoE as predictors is the better one. For this we can use the anova()5 Note that the function we used to actually conduct an ANOVA is called aov(), and the anova() function is used to compare regression models. function.

# compare difference models using anova()
# model one (one predictor: age)
model_1 <- lm(MLU ~ age, data = multiple_regression_exp)
# model two (two predictors: age and YoE)
model_2 <- lm(MLU ~ age + YoE, data = multiple_regression_exp)

# compare model one with model two
anova(model_1, model_2)
## Analysis of Variance Table
## 
## Model 1: MLU ~ age
## Model 2: MLU ~ age + YoE
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     28 63.168                           
## 2     27 62.987  1   0.18105 0.0776 0.7827

From the anova() output, we can see that adding YoE as a predictor did not significantly improve the model fit. Thus we should reject model two and stick to the simpler model one.
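
Another common way to weigh fit against complexity, shown here only as an optional extra, is to compare information criteria such as the AIC; base R’s AIC() accepts several models at once, and lower values indicate a better trade-off:

# compare the two models' AIC values (lower = better trade-off between fit and parsimony)
AIC(model_1, model_2)

Here model_1 should come out with the lower AIC, pointing to the same conclusion as the anova() comparison.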

2.3 Mixed-effects linear regression

If you go and read recent journal papers, this is perhaps one of the most common statistical analyses you’ll see. In fact, mixed-effects linear regressions are starting to replace t-tests and ANOVAs6 You might wonder how t-tests and ANOVAs are related to regressions. In fact, a t-test can be thought of as a simple linear regression model where the predictor variable can only take on two values (two groups/conditions), and an ANOVA can similarly be thought of as a linear regression model where the predictor(s) can only take on certain values. And in fact, if you code your variable right (by giving the groups/conditions fitting numeric values, for example 1 for the group that received treatment, 0 for the group that did not receive treatment), a regression model will output the same results as a t-test or an ANOVA.
Another important advantage of mixed-effects regressions over t-tests and ANOVAs is that you don’t have to compute the by-participant (or by-item) averages first. This gives you a larger sample size and more power.
in many fields including psychology and experimental linguistics. This is because mixed-effects models can account for what’s called random effects: systematic variation in the data that’s not part of the experimental manipulation.
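
As an aside, if you want to convince yourself of the footnote’s claim that a t-test is just a regression with a two-valued predictor, here is a minimal sketch with made-up data (all names and numbers are invented for illustration):

# made-up data: 0 = no treatment, 1 = treatment (coded as in the footnote)
set.seed(2)
group <- rep(c(0, 1), each = 20)
score <- 10 + 2 * group + rnorm(40)
# a classic (equal-variance) independent-samples t-test ...
t.test(score[group == 0], score[group == 1], var.equal = TRUE)
# ... gives the same p-value (and the same t-value, up to its sign)
# as the slope of a simple linear regression on the 0/1 predictor
summary(lm(score ~ group))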

Say for example you are interested in whether word frequency affects reading speed, and you gathered some participants and had them read a bunch of frequent vs. non-frequent words while you measured their reading times. Your data points are not going to be truly independent from each other: data collected from the same person will be correlated in some ways; some people just read faster than others, and the difference between frequent vs. non-frequent words might be slightly larger for some people than for others. Mixed-effects models give you a way to determine whether there is a significant effect of frequency once all those random effects are accounted for.

A mixed-effects linear model predicts the outcome based on both fixed-effects (the predictors/IVs) and random-effects7 Some common random factors in experimental psychology etc.:
participant
experimenter
experiment item
participant origin (different schools, neighbourhoods, etc.)
. The data can be described as

\[ y={\beta}X + uZ + \epsilon \]

where

\({\beta}X\) represents the structure of fixed effects (i.e. \({\beta_1}x_1 + {\beta_2}x_2 + ...\)),

\(uZ\) represents the structure of random effects (all of the non-manipulation factors and their effects), and

\(\epsilon\) represents the errors/residuals.

2.3.1 Random intercepts, random slopes, and the maximal random effect structure

In plain words, a model with random effects allows the model predictions to systematically vary based on a random factor. There are different ways that this systematic variation can happen: random intercepts and random slopes.

A model with random intercepts allows each level of the random factor to have a different intercept, while keeping the slopes constant. In contrast, random slopes allow each level of the random factor to have a different slope, while all predictions share the same intercept. A random factor can have random intercepts, random slopes for any number of the fixed factors, or both. Here are some example data and what the model predictions look like with by-participant random intercepts, by-participant random slopes, or both.

Who decides what random factors should be in a model? You do. Just like it’s your responsibility to design the experiment (and thus decide what the fixed factors are), it’s your responsibility to decide what random factors to include, and whether those should have random intercepts and/or slopes. What random factors to include usually isn’t too difficult: you’ll have a fairly good idea as long as you understand your experiment design. For example, in an experiment where 50 participants each read 100 English sentences and their reading times are measured, it’s a good idea to include participant and sentence (experimental item) as two random factors, as there are likely to be by-participant and by-sentence/item differences in the data that are systematic (e.g. some people read faster than others; some sentences take longer to read than others).

But how do I decide whether participant and item should have random intercepts and/or random slopes? You can always plot your data and decide. It sounds stressful, but luckily there are some guidelines we can follow. Barr et al. (2012) suggested that we should always begin with what’s called the maximal random effect structure: including both random intercepts and random slopes for all random factors. And unless the maximal model has problems8 Usually non-convergence or singular-fit problems. Don’t worry, R will warn you if those happen.
And of course another possibility is that your computer isn’t powerful enough to handle the maximal random effect structure (and you don’t have access to a more powerful one), which surprisingly isn’t too uncommon in my experience.
(in which case plot your data and see what terms you can take out), we should stick to it.
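
To make the possibilities concrete, here is a sketch of the by-participant random effect terms in lme4 syntax for the word-frequency example, written in the same schematic style as the lmer() formula shown in section 2.3.3; rt, frequency, participant, and your_data are placeholders, not real variables:

library(lme4)
# random intercepts and random slopes for frequency (the maximal by-participant structure)
lmer(rt ~ frequency + (1 + frequency | participant), data = your_data)
# random intercepts only
lmer(rt ~ frequency + (1 | participant), data = your_data)
# random slopes only (no random intercepts)
lmer(rt ~ frequency + (0 + frequency | participant), data = your_data)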

2.3.2 Coding categorical factors

If you have a categorical factor in a regression (fixed or random), you’ll need to code it (convert it to numbers) before the computer can make sense of it. Sometimes this is straightforward: for example, if you have two groups of patients, one that underwent treatment and one that didn’t, it’s easy to say that the group that did not undergo treatment is some sort of default/reference (and should receive the value 0), and the group that had treatment is where something actually happened (and should receive the value 1). What I just described is called treatment coding, or dummy coding.

But a lot of the time there is no reference-change relationship between the levels of your factor: male vs. female, north vs. south, for example. There is no default, and all levels of those factors are equal; they’re just categorically different. In this case we usually use what’s called sum coding or deviation coding, which makes all levels of the factor equally far away from the reference point (0). If you have two levels in a factor, they might be coded as -0.5, +0.5, or -1, +1.

There are a ton of other coding schemes but they’re outside the scope of this class.
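
Both coding schemes are available as built-in contrast functions in R; here is a minimal sketch with a hypothetical two-level factor:

# a hypothetical two-level factor
group <- factor(c('control', 'treatment'))
# treatment (dummy) coding: the reference level gets 0, the other level gets 1
contr.treatment(2)
# sum (deviation) coding: the two levels get +1 and -1
contr.sum(2)
# attach a coding scheme to the factor before fitting a model
contrasts(group) <- contr.sum(2)
contrasts(group)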

2.3.3 Do it in R: Mixed-effects linear regression

Suppose the data from the last example is collected from two different neighbourhoods: one in the north and one in the south. Suppose that we are not interested in the effect of region (north vs. south) on MLU, it just happens that we had to collect data from two regions because there weren’t enough children in one region. We can include region (north vs. south) as a random factor.

  1. Prepare data9 You can download the data here.
mixed_regression_exp <- read.csv('mlu-mix.csv', header=TRUE)
head(mixed_regression_exp)
##   participant      age       MLU region
## 1           1 3.245548 11.704914  south
## 2           2 3.085281 12.035840  south
## 3           3 2.624829  7.582783  north
## 4           4 3.426735 11.920344  south
## 5           5 2.923928 11.801822  south
## 6           6 2.438216  7.292018  north
  2. Code categorical factors.
mixed_regression_exp$region <- as.factor(mixed_regression_exp$region)
contrasts(mixed_regression_exp$region) <- contr.sum(2)
contrasts(mixed_regression_exp$region)
##       [,1]
## north    1
## south   -1
  3. Run the regression model (lme4 package).

The formula for a mixed-effects linear model with a maximal random effect structure is:

lmer(DV ~ IV1 * IV2 + (1 + IV1 * IV2 | RV1) + (1 + IV1 * IV2 | RV2), data = data)

where 1 indicates random intercepts (0 for no random intercepts), and the fixed factor terms in the random effect structure indicate random slopes.

# the summary table of mixed effect models does not tell you the p value by default
# This is because p-values can be sometimes misleading in a mixed model
# But for this class that's out of our scope
# we can force the p-values to show using the lmerTest package
library(lme4)
library(lmerTest)
mixed_model <- lmer(MLU ~ age + (1 + age | region), data = mixed_regression_exp)
summary(mixed_model)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: MLU ~ age + (1 + age | region)
##    Data: mixed_regression_exp
## 
## REML criterion at convergence: 100
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.17690 -0.49386 -0.00459  0.52190  1.92606 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  region   (Intercept) 5.850962 2.41888       
##           age         0.008988 0.09481  -1.00
##  Residual             1.568340 1.25233       
## Number of obs: 30, groups:  region, 2
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)  
## (Intercept)   3.0603     3.1498  1.5322   0.972    0.459  
## age           2.3770     0.8872 18.5487   2.679    0.015 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##     (Intr)
## age -0.875
  4. Check the assumptions (same as linear regression).
library(performance)
check_model(mixed_model)

  5. Report the results: fixed and random effect structures, and effects.

Mixed-effects linear regression with one fixed factor age as well as by-region random intercepts and random slopes revealed a significant main effect of age on the mean length of utterance in children (\(\beta\)=2.38, t=2.68, p=0.015).

2.4 Generalised linear mixed-effects models

The mixed-effect linear regression seems good, but what if my data is not normally distributed?

In some easy cases, you might be able to transform your data to a normal distribution and run the linear model, for example when your data is log-normally distributed. A typical example of a lognormal distribution is reaction times (RT). This is a highly common measure in psychology, but usually its distribution is not exactly normal: RTs generally centre around some value, and occasionally people will be very slow (e.g. when they lose focus or when a trial is extra hard), but no one is going to be magically a lot faster in the same way (and obviously RT cannot be negative). If you plot the data, it usually looks right-skewed (a long tail to the right). Data like this may not be normally distributed, but their natural logarithm may be. In that case, we say that the data follows a lognormal distribution. If your data is log-normally distributed, you can simply log-transform your data before running the linear model (i.e. fit the linear model to the logarithm of your data instead of the raw data).
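
In practice, the log transformation is just a small change to the model formula. Here is a minimal sketch, assuming a reaction-time data set with hypothetical columns rt, frequency, participant, and item:

library(lme4)
library(lmerTest)
# fit the mixed model to log-transformed reaction times instead of raw RTs
# (rt_data and its columns are hypothetical placeholders)
log_rt_model <- lmer(log(rt) ~ frequency + (1 | participant) + (1 | item), data = rt_data)
summary(log_rt_model)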

Other times you might not be so lucky. What if my data are yes-no responses? Generalised linear (mixed) models can help you in many of these cases: they allow data from distributions other than the normal distribution. Some distributions that GLMMs can deal with include:

  • the binomial distribution (binary data, e.g. yes vs. no responses);
  • the Poisson distribution (count data);
  • the gamma distribution (skewed, strictly positive continuous data).

2.4.1 GLMM example: Mixed-effects logistic regression

In this class I’ll just talk about one case of GLMMs that might be handy for us: the binomial distribution and the logistic regression. Your data follows a binomial distribution if it’s binary (e.g. yes vs. no, success vs. failure); this can be common in surveys and some experimental measures.

There is no time to delve into how logistic regression works10 Basically, the linear predictor is linked to the predictions (here, the probability of success) using a link function; in this case the inverse logit function maps the linear predictor onto probabilities., so let’s just take an example and learn how to do it in R.

Suppose an experimental phonologist wants to know whether a phonological rule in a language A is gradually disappearing (i.e. whether applying such a rule depends on the speaker’s age). 50 participants varying in age took part in an experiment where they were each asked to read out 20 words where the rule could apply. Data was recorded such that if the speaker applied the rule, the response was coded as 1; otherwise the response was coded as 0.

  1. Prepare data11 You can download the data here.
logit_exp <- read.csv('phonology.csv', header = TRUE)
head(logit_exp)
##   subject age word response
## 1       1  40    1        1
## 2       1  40    2        1
## 3       1  40    3        1
## 4       1  40    4        1
## 5       1  40    5        0
## 6       1  40    6        1

Frequency of each type of response by subject, ordered by subject age.
  2. Run the logistic regression model
# here we are only including by-participant random intercepts
# the optimizer is there to optimize the way the computer finds the model parameters (coefficients), the details are outside of this class's scope
logistic_model <- glmer(response ~ age + (1 | subject), 
  data = logit_exp, 
  family = 'binomial',
  glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e5)))
summary(logistic_model)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: response ~ age + (1 | subject)
##    Data: logit_exp
## Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
## 
##      AIC      BIC   logLik deviance df.resid 
##   1008.9   1023.6   -501.5   1002.9      997 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.7810  0.2092  0.4178  0.5561  1.2839 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev.
##  subject (Intercept) 0.01252  0.1119  
## Number of obs: 1000, groups:  subject, 50
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.41072    0.39618  -6.085 1.17e-09 ***
## age          0.09897    0.01109   8.925  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##     (Intr)
## age -0.979
  3. Check the assumptions:
library(performance)
check_model(logistic_model)

  4. Report the results

Mixed-effect logistic regression with one fixed factor age as well as by-participant random intercepts revealed a significant effect of age on whether a speaker applies the phonological rule in question (\(\beta\)=0.1, z=8.93, p<0.001).

In this case the researcher might also be interested in exactly how fast this phonological rule is disappearing, so reporting the following may be helpful (although you need to do some calculations based on the results):

It was found that with each additional year of age, the odds of applying the phonological rule increased by about 10.4%12 \(e^{\beta}\) (95% CI [8%, 13%])13 \(e^{{\beta}-1.96\times\text{SE}}\), \(e^{{\beta}+1.96\times\text{SE}}\).
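
These numbers can be computed directly from the fitted model; here is a minimal sketch using the model above (fixef() and the coefficient table are part of the lme4 output we already have):

# extract the age coefficient and its standard error from the fitted model
age_beta <- fixef(logistic_model)['age']
age_se <- summary(logistic_model)$coefficients['age', 'Std. Error']
# multiplicative change in the odds of applying the rule per additional year of age
exp(age_beta)
# approximate 95% confidence interval for that change
exp(age_beta + c(-1.96, 1.96) * age_se)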

3 And finally if you want to go even further

If you have been to some of the phonology or computational linguistics modules, you might remember learning something called “Bayes’ rule”. And if you read a lot of journal papers, you might have noticed some of the newest papers using what’s called “Bayesian statistics”.

What we have been talking about, in fact, is all part of what’s called “frequentist statistics”, and Bayesian statistics is a quite different framework for dealing with data (and is obviously beyond the scope of this class).

The main difference between our statistics and Bayesian statistics is that, in plain words, Bayesian statistics allows you to ask the model to incorporate prior beliefs about the underlying effects. Say other labs have run the same experiment nine times, and their overall conclusion is that there is some effect of \(x\) on \(y\) with certain coefficients. Now suppose I’ve done another replication of the same experiment. With Bayesian statistics, I can ask the model to answer the question “given the previous conclusion and my new data, what should be the newest conclusion about the effect of \(x\) on \(y\)?”. If you’re doing some replication work, this will obviously be helpful.

But don’t worry if you can’t wrap your head around this yet. What we learned in this class should be more than enough for your projects/theses. I hope after this class you’re a bit more confident in dealing with data, and of course my code is here for you to take. Good luck!

Further readings

Field, A. P., Miles, J., & Field, Zoë. (2012). 6 Correlation. In Discovering statistics using r.

Field, A. P., Miles, J., & Field, Zoë. (2012). 7 Regression. In Discovering statistics using r.

Field, A. P., Miles, J., & Field, Zoë. (2012). 8 Logistic regression. In Discovering statistics using r.

Phillips, N. D. (2018). Regression. In YaRrr! The pirate’s guide to r.

Poldrack, R. A. (2019). Modeling continuous relationships. In Statistical thinking for the 21st century.

Poldrack, R. A. (2019). The General Linear Model. In Statistical thinking for the 21st century.

More advanced:

Demidenko, E. (2013). 2 MLE for LME Model. In Mixed models theory and applications with R / Eugene Demidenko. (2nd ed.). Wiley.

Demidenko, E. (2013). 7 Generalized Linear Mixed Models. In Mixed models theory and applications with R / Eugene Demidenko. (2nd ed.). Wiley.

Even more advanced:

Nicenboim, B., Schad D., Vasishth S. (2024) An Introduction to Bayesian Data Analysis for Cognitive Science.