All data sets in the exercises (except for one) are randomly generated fake data.
50 students took an exam (scores 0-100). They were also asked how many hours they spent studying for the exam. A teacher wants to know whether there is a relationship between the hours spent studying and the exam results, and whether they can predict future students’ exam results based on the hours spent studying.
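A sketch of how one might test this in R, assuming a data frame exams with columns hours and score (these names are my own, for illustration):

```r
# Simple linear regression: do study hours predict exam scores?
m_exam <- lm(score ~ hours, data = exams)
summary(m_exam)  # slope, p-value, and R-squared for the relationship

# Predict a future student's score from their study hours, e.g. 12 hours
predict(m_exam, newdata = data.frame(hours = 12))
```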
Lexical decision is a popular task in psycho-/neurolinguistics. In a lexical decision task, the participant is presented with a word and asked to decide whether or not it is a real word in their language. Typically, the reaction times (RTs) needed to make a correct decision are recorded. This time is associated with the difficulty of lexical access (pulling the lexical item out of your mental lexicon).
A psycholinguist wants to study whether difficulty of lexical access (as measured by lexical decision RTs) is affected by the target word’s frequency and its length. 50 participants performed the lexical decision task with words varying in frequency and length. Each participant completed 100 items. Reaction times are log-transformed.
Use the random-effects structure (1|subject) + (1|item). (This is in case your computer cannot handle the maximal random-effects structure; mine couldn’t.) Use REML=FALSE and control=lmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e5)) while running your model to optimize performance.
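A sketch of this model in lme4, assuming the data live in a data frame called lexdec with columns logRT, frequency, length, subject, and item (these names are my own, for illustration):

```r
library(lme4)

# Linear mixed model: log-transformed RT as a function of word frequency
# and word length, with by-subject and by-item random intercepts
m_lex <- lmer(logRT ~ frequency + length + (1 | subject) + (1 | item),
              data = lexdec,
              REML = FALSE,
              control = lmerControl(optimizer = "bobyqa",
                                    optCtrl = list(maxfun = 2e5)))
summary(m_lex)
```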
Scalar implicature refers to implicatures derived from quantifiers. If I say “James ate some of the apples”, many people will conclude that James in fact did not eat all of the apples. Some adjectives can have the same effect: if I say “It’s warm today”, many will conclude that it’s warm but not hot today.
An experimental pragmatician wants to investigate whether there is a difference in how often people derive the implicature for the word “some” and for the word “warm”. 100 participants each read 25 sentences that contained “some” and 25 sentences that contained “warm”, and for each sentence the participant answered the yes-no question “Would you conclude that not all….” or “Would you conclude that …. is not hot”. If the answer was yes, their response was coded as 1, and if the answer was no, the response was coded as 0.
Use the random-effects structure (1+condition|item). Use control=glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e5)) while running your model to optimize performance.
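Since the response is binary (1/0), this calls for a logistic mixed model. A sketch, assuming a data frame si with columns response, condition, and item (illustrative names), and using only the by-item structure given above:

```r
library(lme4)

# Logistic mixed model: probability of deriving the implicature by condition
m_si <- glmer(response ~ condition + (1 + condition | item),
              data = si,
              family = binomial,
              control = glmerControl(optimizer = "bobyqa",
                                     optCtrl = list(maxfun = 2e5)))
summary(m_si)
```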
In the handout I wrote about how to choose your random-effects structure in a mixed model: always start with the maximal structure, and unless the model gives you a problem, stick with it. But do maximal structures often give you problems? In my experience, yes, actually. Often this is because the data I collect from an experiment are not big enough to support a maximal random-effects structure. Sure, one participant usually gives me a few dozen trials, but that’s just a few thousand data points if I have a couple dozen participants. That’s not a big data set at all, unfortunately, especially if you have more than one fixed factor.
Now what do you do? You find out which terms are least important to the model and remove them one by one, until the model can be supported by your data. But how exactly do you do that? I’ll give you some of the real data I collected so you can try it out.
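In code, the loop looks something like this (a minimal sketch, with hypothetical names y, cond, subject, item, and a data frame d):

```r
library(lme4)

# 1. Fit the maximal model
m_max <- lmer(y ~ cond + (1 + cond | subject) + (1 + cond | item), data = d)
isSingular(m_max)  # TRUE means some variance or correlation hit the boundary

# 2. Remove the least important term (e.g. a random slope whose variance
#    is estimated at nearly zero) and refit
m_red <- lmer(y ~ cond + (1 + cond | item) + (1 | subject), data = d)

# 3. Check that the removed term wasn't carrying real variance
anova(m_red, m_max, refit = FALSE)
```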
Suppose the study was about whether a preceding phonological cue can facilitate lexical access. (The design I describe here is different from my real design, but for our purposes they’re close enough.) In each trial, participants were presented with two pictures (e.g. an orange and a pear) and listened to a spoken phrase that identified one of the pictures. The spoken phrase contained a phonological cue that was either informative about the target’s identity (an orange, as opposed to a pear; the informative condition) or uninformative (that orange; the uninformative condition). The participants’ task was to decide, as quickly as possible, which of the two pictures was named. Their RT was recorded and log-transformed. Each participant completed 20 items. I needed to run a statistical test to determine whether participants’ reaction times were faster in the informative condition than in the uninformative condition.
## subject item condition rt logrt
## 1 s07 2 informative 1466 7.290293
## 2 s07 4 informative 1156 7.052721
## 3 s07 5 uninformative 1166 7.061334
## 4 s07 6 informative 1162 7.057898
## 5 s07 7 uninformative 1051 6.957497
## 6 s07 10 informative 780 6.659294
Make the variables subject, item, and condition factors. Sum-contrast the factor condition, and fit a mixed-effects linear model to logrt, with one fixed factor, condition, as well as by-subject and by-item random intercepts and slopes (the maximal random-effects structure).
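One way to do this (a sketch; the data frame is called data, as in the output below):

```r
library(lme4)

# Convert the grouping variables and the predictor to factors
data$subject   <- factor(data$subject)
data$item      <- factor(data$item)
data$condition <- factor(data$condition)

# Sum contrasts for the two-level factor condition
contrasts(data$condition) <- contr.sum(2)

# Maximal model: by-subject and by-item random intercepts and slopes
m_max <- lmer(logrt ~ condition + (1 + condition | subject) + (1 + condition | item),
              data = data)
summary(m_max)
```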
The model should give you a singularity warning. Basically, this means that the random-effects structure was estimated at the boundary of the parameter space: some variance component is (nearly) zero, or some random-effect correlation is exactly ±1, so some of your random effects are estimated to have nearly no impact on the model.
## boundary (singular) fit: see help('isSingular')
## Linear mixed model fit by REML ['lmerMod']
## Formula: logrt ~ condition + (1 + condition | subject) + (1 + condition |
## item)
## Data: data
##
## REML criterion at convergence: -388.1
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.1331 -0.5617 -0.0508 0.5347 4.5499
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 4.218e-03 0.064947
## condition1 1.232e-05 0.003511 -1.00
## item (Intercept) 1.182e-02 0.108722
## condition1 8.312e-03 0.091169 -0.43
## Residual 2.610e-02 0.161552
## Number of obs: 678, groups: subject, 38; item, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 6.86631 0.02724 252.095
## condition1 -0.01647 0.02136 -0.771
##
## Correlation of Fixed Effects:
## (Intr)
## condition1 -0.382
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
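To see which terms are the culprits, inspect the variance components directly, or run a principal components analysis of the random-effects structure with lme4’s rePCA() (continuing with m_max from the sketch above):

```r
# Variance components: note the tiny by-subject slope variance and the
# -1.00 by-subject correlation in the output above
VarCorr(m_max)

# PCA of the random-effects covariance matrices: components explaining
# ~0% of the variance suggest an over-specified structure
summary(rePCA(m_max))
```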
library(ggplot2)

# plot by subject
ggplot(data, aes(y = logrt, x = condition, colour = subject, group = subject)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# plot by item
ggplot(data, aes(y = logrt, x = condition, colour = item, group = item)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
Looking at the plot by subject, is there large variability in the intercepts? And in the slopes? Looking at the plot by item, is there large variability in the intercepts? And in the slopes?