All data sets (except for one) in the exercises are randomly generated fake data.
50 students took an exam (scores 0-100). They were also asked how many hours they spent studying for the exam. A teacher wants to know whether there is a relationship between the hours spent studying and the exam results, and whether they can predict future students’ exam results based on the hours spent studying.
data1 <- read.csv("exam.csv", header=TRUE)
head(data1)
## time_spent test_score participant
## 1 3.6972636 66 1
## 2 3.2692988 71 2
## 3 2.8380752 62 3
## 4 0.5846604 64 4
## 5 1.4220910 72 5
## 6 4.2502780 83 6
library(ggplot2)
ggplot(data1, aes(x=time_spent, y=test_score)) + geom_point() + geom_smooth(method=lm, se=FALSE, color = 'darkgrey') + theme_light()
The teacher would like to answer two questions: whether the two variables are correlated, and whether exam results can be predicted from hours spent studying. The first question can be answered using a correlation test, but the second has to be answered by a regression. Now, a simple regression will give you the correlation coefficient either way (take the square root of \(R^2\) and give it the sign of the slope), so you can either do both tests or just the regression. I’ll show you how to do both.
# Pearson's correlation
cor.test(data1$time_spent, data1$test_score, method='pearson')
##
## Pearson's product-moment correlation
##
## data: data1$time_spent and data1$test_score
## t = 3.3965, df = 48, p-value = 0.001379
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1844476 0.6401119
## sample estimates:
## cor
## 0.4401954
# simple linear regression
slm <- lm(test_score ~ time_spent, data=data1)
summary(slm)
##
## Call:
## lm(formula = test_score ~ time_spent, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.895 -10.950 1.203 9.144 30.372
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.886 5.039 9.304 2.54e-12 ***
## time_spent 5.095 1.500 3.397 0.00138 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.49 on 48 degrees of freedom
## Multiple R-squared: 0.1938, Adjusted R-squared: 0.177
## F-statistic: 11.54 on 1 and 48 DF, p-value: 0.001379
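As a quick sanity check of the claim that the regression gives you the correlation for free: take the square root of \(R^2\) and give it the sign of the slope.
# for a simple regression, |r| = sqrt(R^2); the sign comes from the slope
sign(coef(slm)["time_spent"]) * sqrt(summary(slm)$r.squared)  # ~0.44
cor(data1$time_spent, data1$test_score)                       # same value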
library(performance)
check_model(slm)
A simple linear regression model with one predictor, time spent studying, revealed a significant effect of study time on exam results (\(\beta\)=5.1, t=3.4, p=0.001): each additional hour spent studying was associated with a 5.1-point increase in exam score.
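If the teacher wants to actually predict a future student’s score, predict() does the work. A minimal sketch, assuming a hypothetical student who studied 3 hours:
# predicted score for a hypothetical 3-hour studier, with a 95% prediction interval
predict(slm, newdata = data.frame(time_spent = 3), interval = "prediction")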
Lexical decision is a popular task in psycho-/neurolinguistics. In a lexical decision task, the participant is presented with a word and asked to decide whether it is a real word in their language. Typically, the reaction times (RTs) needed to make a correct decision are recorded. This time is associated with the difficulty of lexical access (pulling the lexical item out of your mental lexicon).
A psycholinguist wants to study whether difficulty of lexical access (as measured by lexical decision RTs) is affected by the target word’s frequency and its length. 50 participants performed the lexical decision task with words varying in frequency and length. Each participant completed 100 items. Reaction times are log-transformed.
data2 <- read.csv("lexical.csv", header=TRUE)
head(data2)
## subject item frequency length logrt rt
## 1 1 1 109 8 6.809362 906.2921
## 2 1 2 190 5 6.369214 583.5989
## 3 1 3 216 9 6.304127 546.8242
## 4 1 4 118 6 6.995545 1091.7583
## 5 1 5 204 9 6.410343 608.1025
## 6 1 6 201 7 6.354005 574.7901
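The logrt column is just the natural log of the rt column; a quick check on the first row:
# logrt is the natural log of rt
log(906.2921)  # 6.809362, matching row 1's logrt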
library(ggplot2)
# let's do two scatter plots, one for each predictor
ggplot(data2, aes(x=frequency, y=logrt)) + geom_point() + geom_smooth(method=lm, se=FALSE, color = 'darkgrey') + theme_light()
ggplot(data2, aes(x=length, y=logrt)) + geom_point() + geom_smooth(method=lm, se=FALSE, color = 'darkgrey') + theme_light()
We have two predictors, word frequency and word length, both continuous. We should also include two random factors, participant and item: since participants were measured repeatedly, and items were measured repeatedly as well (each participant completed 100 items), we’d expect some systematic variability there. This calls for a mixed-effects linear regression with two fixed factors (word frequency, word length) and two random factors (participant, item).
data2$subject <- as.factor(data2$subject)
data2$item <- as.factor(data2$item)
Include by-participant and by-item random intercepts, (1|subject) + (1|item). (This is in case your computer cannot handle the maximal random-effects structure; mine couldn’t.) Use REML=FALSE and control=lmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e5)) while running your model to optimize performance.
library(lme4)
library(lmerTest)
library(performance)
mm <- lmer(logrt ~ frequency*length + (1|subject) + (1|item), data=data2, REML=FALSE, control=lmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e5)))
summary(mm)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
## method [lmerModLmerTest]
## Formula: logrt ~ frequency * length + (1 | subject) + (1 | item)
## Data: data2
## Control: lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
##
## AIC BIC logLik deviance df.resid
## -6907.4 -6861.7 3460.7 -6921.4 4993
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.5572 -0.6578 0.0239 0.6508 3.6652
##
## Random effects:
## Groups Name Variance Std.Dev.
## item (Intercept) 1.491e-05 0.003861
## subject (Intercept) 1.016e-03 0.031867
## Residual 1.435e-02 0.119789
## Number of obs: 5000, groups: item, 100; subject, 50
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 7.406e+00 2.942e-02 1.039e+02 251.755 <2e-16 ***
## frequency -4.566e-03 1.291e-04 9.924e+01 -35.369 <2e-16 ***
## length 3.991e-03 3.695e-03 9.924e+01 1.080 0.283
## frequency:length -1.336e-05 1.731e-05 9.924e+01 -0.772 0.442
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) frqncy length
## frequency -0.953
## length -0.951 0.952
## frqncy:lngt 0.886 -0.956 -0.963
check_model(mm)
A mixed-effects linear regression with two fixed factors, Frequency and Length, as well as by-participant and by-item random intercepts revealed a significant main effect of Frequency on logRT (\(\beta\)=-0.0046, t=-35.37, p<0.001), but there was no significant main effect of Length and no interaction between Frequency and Length.
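If you want to confirm that the interaction term really adds nothing, one option is a likelihood-ratio test against a model without it. A sketch (mm_add is a name introduced here):
# refit without the interaction and compare by likelihood-ratio test
mm_add <- update(mm, . ~ . - frequency:length)
anova(mm_add, mm)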
Scalar implicature refers to implicatures derived from quantifiers. If I say “James ate some of the apples”, many people will conclude that James in fact did not eat all of the apples. Some adjectives can have the same effect: if I say “It’s warm today”, many will conclude that it’s warm but not hot today.
An experimental pragmatician wants to investigate whether there is a difference in how often people derive the implicature from the word “some” versus the word “warm”. 100 participants each read 25 sentences that contained “some” and 25 sentences that contained “warm”, and for each sentence, the participant answered the yes-no question “Would you conclude that not all….” or “Would you conclude that …. is not hot”. If the answer was yes, the response was coded as 1; if the answer was no, it was coded as 0.
data3 <- read.csv("scalar.csv", header=TRUE)
head(data3)
## participant item condition response
## 1 1 1 some 1
## 2 1 2 some 1
## 3 1 3 some 1
## 4 1 4 some 1
## 5 1 5 some 1
## 6 1 6 some 1
library(ggplot2)
# we have binomial data and a categorical IV; in this case perhaps the best way is to plot the proportions/frequencies
data3_frequency <- as.data.frame(table(subset(data3, select = -c(participant, item))))
ggplot(data=data3_frequency, aes(x=condition, y=Freq, fill=response)) + geom_bar(stat="identity", width=0.4)
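The two conditions have equal trial counts here, so raw counts are comparable; still, if you’d rather read off proportions directly, a sketch using position="fill" on the raw data:
# alternative: stack the bars to proportions instead of raw counts
ggplot(data3, aes(x = condition, fill = factor(response))) +
  geom_bar(position = "fill", width = 0.4) +
  labs(y = "proportion", fill = "response")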
We have one categorical predictor, condition (warm vs. some). Most importantly, our data are binomial (yes vs. no, 1 vs. 0). We should also try to include two random factors: participant and item. Later on you’ll only be asked to include item; this is because in the example data there is not much variance that can be explained by participant, and the model will run into singular-fit problems if we force it to include participant (more on this in the last exercise).
data3$participant <- as.factor(data3$participant)
data3$item <- as.factor(data3$item)
data3$condition <- as.factor(data3$condition)
Include a by-item random intercept and slope, (1+condition|item). Use glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e5)) while running your model to optimize performance. The factor is Condition with two levels: warm and some. There’s no obvious default level here, so we should use sum coding.
library(lme4)
library(lmerTest)
library(performance)
# important to sum contrast condition
contrasts(data3$condition) <- contr.sum(2)
contrasts(data3$condition)
## [,1]
## some 1
## warm -1
# avoid naming the object "glm", which would mask the base R function stats::glm
glmm <- glmer(response ~ condition + (1+condition|item), data = data3, family = 'binomial', control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e5)))
summary(glmm)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: response ~ condition + (1 + condition | item)
## Data: data3
## Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
##
## AIC BIC logLik deviance df.resid
## 5965.3 5997.8 -2977.6 5955.3 4995
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.3938 -1.1624 0.5503 0.6623 0.9887
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## item (Intercept) 8.968e-05 0.00947
## condition1 1.258e-01 0.35466 -0.96
## Number of obs: 5000, groups: item, 50
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.89642 0.05958 15.046 < 2e-16 ***
## condition1 0.26910 0.05958 4.517 6.28e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## condition1 -0.003
check_model(glmm)
A mixed-effects logistic regression model was fitted to the data with one fixed factor, Condition (warm vs. some), and one random factor, item. The fixed factor was sum contrasted. Results revealed a significant main effect of condition on participants’ responses (\(\beta\)=0.27, z=4.52, p<0.001), suggesting that speakers derived significantly fewer scalar implicatures with the word “warm” than with the word “some”.
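To make the log-odds estimates easier to read, you can back-transform the fixed effects to probabilities. A sketch using the estimates above (intercept plus or minus the condition effect, since “some” is coded 1 and “warm” -1); note this ignores the random effects:
# back-transform log-odds to probabilities with the logistic function
plogis(0.89642 + 0.26910)  # P(implicature) for "some", ~0.76
plogis(0.89642 - 0.26910)  # P(implicature) for "warm", ~0.65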
In the handout I wrote about how to choose your random-effects structure in a mixed model: always start with the maximal structure, and unless the model gives you a problem, stick with it. But will maximal structures often give you problems? In my experience, yes. Often this is because the data I collect from an experiment are not big enough to support a maximal random-effects structure. Sure, one participant usually gives me a few dozen trials, but that is still just a few thousand data points with a couple dozen participants. That’s not a big data set at all, unfortunately, especially if you have more than one fixed factor.
Now what do you do? Find out which terms are least important to the model and remove them one by one, until the model can be supported by your data. But how exactly? I’ll give you some real data I collected so you can try it out.
Suppose the study was about whether a preceding phonological cue can facilitate lexical access. (The design I describe here is different from my real design, but for our purposes they’re close enough.) In each trial, participants were presented with two pictures (e.g. an orange and a pear) and listened to a spoken phrase that identified one of the pictures. The spoken phrase contained a phonological cue that was either informative about the target’s identity (“an orange”, as opposed to “a pear”: the informative condition) or uninformative (“that orange”: the uninformative condition). Participants’ task was to decide which of the pictures was named, as quickly as possible. Their RTs were recorded and log-transformed. Each participant completed 20 items. I needed a statistical test to determine whether participants’ reaction times were faster in the informative condition than in the uninformative condition.
library(ggplot2)
library(lme4)
library(lmerTest)
library(performance)
data <- read.csv("rt.csv", header = TRUE)
head(data)
## subject item condition rt logrt
## 1 s07 2 informative 1466 7.290293
## 2 s07 4 informative 1156 7.052721
## 3 s07 5 uninformative 1166 7.061334
## 4 s07 6 informative 1162 7.057898
## 5 s07 7 uninformative 1051 6.957497
## 6 s07 10 informative 780 6.659294
ggplot(data, aes(y = logrt, x = condition)) + geom_boxplot()
Convert subject, item, and condition to factors.
data$subject <- as.factor(data$subject)
data$item <- as.factor(data$item)
data$condition <- as.factor(data$condition)
Sum-contrast condition, and fit a mixed-effects linear model to the logRT, with one fixed factor Condition, as well as by-participant/subject and by-item random intercepts and slopes (the maximal structure, matching the model below).
contrasts(data$condition) <- contr.sum(2)
contrasts(data$condition)
## [,1]
## informative 1
## uninformative -1
model <- lmer(logrt ~ condition + (1+condition | subject) + (1+condition | item), data=data)
## boundary (singular) fit: see help('isSingular')
The model should give you a singularity warning. Basically, this means that some components of your random-effects structure are estimated to explain (nearly) zero variance, i.e. to have almost no impact on the model.
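You can also ask lme4 directly whether the fit is singular:
# programmatic singularity check; returns TRUE for this fit
isSingular(model)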
We need to identify which random-effects term(s) are causing the singularity problem and remove them from the model. Read the model summary output and find the term in the random-effects structure that explains the least variance.
summary(model)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: logrt ~ condition + (1 + condition | subject) + (1 + condition |
## item)
## Data: data
##
## REML criterion at convergence: -388.1
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.1331 -0.5617 -0.0508 0.5347 4.5499
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 4.218e-03 0.064947
## condition1 1.232e-05 0.003511 -1.00
## item (Intercept) 1.182e-02 0.108722
## condition1 8.312e-03 0.091169 -0.43
## Residual 2.610e-02 0.161552
## Number of obs: 678, groups: subject, 38; item, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 6.86631 0.02724 25.66625 252.095 <2e-16 ***
## condition1 -0.01647 0.02136 17.80720 -0.771 0.451
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## condition1 -0.382
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
Looks like the by-subject random slope for condition has the least variance explained (note also its correlation of -1.00 with the by-subject intercept, another symptom of a degenerate fit).
# plot by subject
ggplot(data, aes(y = logrt, x = condition, colour = subject, group = subject)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
# plot by item
ggplot(data, aes(y = logrt, x = condition, colour = item, group = item)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
For subject, is there large variability in intercepts? And in slopes? For item, is there large variability in intercepts? And in slopes? For Subject, it seems that there is some intercept-wise variability and perhaps less slope-wise variability. Compare that to Item: for different items, there is clearly large intercept-wise as well as slope-wise variability.
model <- lmer(logrt ~ condition + (1 | subject) + (1+condition | item), data=data)
summary(model)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: logrt ~ condition + (1 | subject) + (1 + condition | item)
## Data: data
##
## REML criterion at convergence: -387.9
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.1319 -0.5597 -0.0604 0.5339 4.5519
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 0.004197 0.06479
## item (Intercept) 0.011821 0.10872
## condition1 0.008302 0.09112 -0.43
## Residual 0.026111 0.16159
## Number of obs: 678, groups: subject, 38; item, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 6.86622 0.02723 25.63338 252.19 <2e-16 ***
## condition1 -0.01644 0.02134 17.78758 -0.77 0.451
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## condition1 -0.373
check_model(model)
A mixed-effects linear model with one fixed factor Condition (informative vs. uninformative), by-participant random intercepts, and by-item random intercepts and slopes was fitted to the data. The factor was sum coded. Results revealed no significant effect of Condition on logRT (\(\beta\)=-0.016, t=-0.77, p=0.45), suggesting no difference between the conditions in RT.
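If you want reassurance that dropping the by-subject slope cost the model nothing, one option is a likelihood-ratio comparison against the (singular) maximal model; anova() will refit both models with ML before comparing. A sketch (model_max is a name introduced here):
# refit the maximal model and compare; expect essentially no difference
model_max <- lmer(logrt ~ condition + (1+condition | subject) + (1+condition | item), data=data)
anova(model, model_max)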