Week 4: More ANOVAs, Effect sizes, and Non-parametric tests

Yiling Huo

2024-05-28

Last week we talked about how to use t-tests and one-way ANOVAs to compare means. Today we are going to talk about cases where things get just a bit more complicated: what if I care about the effect of more than one thing (say I want to know whether word frequency and/or word length affect reading time)? And what if my data is not normally distributed?

1 Factors, levels, main effects, and interactions

A factor is an independent (or predictor) variable that is nominal (i.e. categorical). A level refers to a sub-category of the factor. A factor must have at least 2 levels (i.e. it must vary in some way). For example, if I’m interested in whether having a pet reduces stress level, I’ll have one factor Pet, with two levels: someone either has a pet, or doesn’t have a pet.
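In R terms, a factor is what factor() or as.factor() gives you. A quick sketch with a made-up Pet variable:

# a factor with two levels (hypothetical data)
pet <- factor(c('pet', 'no pet', 'pet', 'pet', 'no pet'))
nlevels(pet)  # 2
levels(pet)   # "no pet" "pet"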

Factorial designs:

Many biological, psychological, and clinical phenomena involve more than one variable; for example, the risk of cancer is associated with many variables such as genetics, age, alcohol use, and smoking. Similarly, an experiment has a factorial design when it has 2 or more independent variables / factors. Such designs are often described using numbers and multiplication signs, where each number represents an independent variable and its value gives the number of levels of that factor (e.g. a 2 x 2 x 2 design has three factors, each with two levels).
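To make the counting concrete, here is a quick sketch in R (with hypothetical factor names) listing the cells of a 2 x 2 x 2 design:

# the 8 cells of a hypothetical 2 x 2 x 2 design: one row per combination of levels
expand.grid(genetic_risk = c('yes', 'no'),
            smoking = c('yes', 'no'),
            alcohol_use = c('yes', 'no'))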

Here is an example design for a study investigating the effect of different types of treatment on the outcome of a disorder, where patients are also grouped by their biological sex.

Quick Q: how many factors does this study have and how many levels does each factor have?

Treatment        Group
Talking therapy  Female
Talking therapy  Male
Medication       Female
Medication       Male
Placebo          Female
Placebo          Male
No treatment     Female
No treatment     Male

A main effect refers to the effect of one factor on the outcome, while completely ignoring the different levels of other factors. An interaction refers to the combined effect of 2 or more independent variables on the dependent variable: the way in which the effect of one independent variable may depend on the level of another independent variable. We’ll talk a bit more about it in a factorial ANOVA example.

2 Factorial ANOVAs

2.1 Two-way between-subjects ANOVA

Suppose I’m interested in the effect of L1 reading ability and L2 proficiency on L2 reading comprehension. A group of 40 L2 English learners participated in an English reading comprehension experiment. Participants were grouped by their L1 reading ability (high vs. low), and their L2 proficiency (high vs. low).

This study has a 2 x 2 factorial design:
L1 reading ability (high vs. low)
L2 proficiency (high vs. low)

Number of participants   High L1 ability   Low L1 ability
High L2 proficiency      10                10
Low L2 proficiency       10                10

Example data.

Mean scores           High L1 ability   Low L1 ability
High L2 proficiency   95                75
Low L2 proficiency    65                60

2.1.1 Main effect of L2 proficiency

From the example data, we find a main effect of L2 proficiency: when the scores are averaged across the levels of L1 ability, proficient L2 learners are better at L2 reading comprehension than less proficient L2 learners.

Mean scores           High L1 ability   Low L1 ability   Mean
High L2 proficiency   95                75               85.0
Low L2 proficiency    65                60               62.5

2.1.2 Main effect of L1 ability

In this case, we also find a main effect of L1 ability: when scores are averaged across levels of L2 proficiency, people with better L1 reading ability perform better in L2 reading comprehension than people with worse L1 reading ability.

Mean scores           High L1 ability   Low L1 ability
High L2 proficiency   95                75.0
Low L2 proficiency    65                60.0
Mean                  80                67.5

2.1.3 Interaction between L1 ability and L2 proficiency

The effect of one independent variable may depend on the level of another independent variable: for the low L2 proficiency group, having higher L1 ability does not improve scores much; but for the high L2 proficiency group, having higher L1 ability means performing much better. More formally, the effect of L1 ability is larger in the high L2 proficiency group than in the low L2 proficiency group.

Mean scores           High L1 ability   Low L1 ability
High L2 proficiency   95                75
Low L2 proficiency    65                60

2.1.4 Possible patterns

In a 2 x 2 factorial design, we can have these patterns of main effects and interactions:

No main effects; No interaction.
One main effect - L1 ability; No interaction.
One main effect - L2 proficiency; No interaction.
Two main effects; No interaction.
Two main effects; One interaction.
No main effects; One interaction.

2.1.5 Two-way between-subjects ANOVA: logic and assumptions

A two-way ANOVA can tell us:

  • If there is a main effect of IV1 (L1 ability)
  • If there is a main effect of IV2 (L2 proficiency)
  • If there is an interaction between the two factors

A two-way ANOVA computes and compares four different sources of the total variance: between-group variance, which is further broken down into variation due to IV1, variation due to IV2, and variation due to their interaction; plus within-group variance.
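If you want to see this decomposition with your own eyes, here is a minimal sketch using a made-up balanced 2 x 2 dataset (hypothetical numbers, not the workshop data); the four sums of squares computed by hand match the Sum Sq column that aov() reports:

# a toy balanced 2 x 2 dataset (hypothetical numbers)
set.seed(1)
toy <- expand.grid(A = c('a1', 'a2'), B = c('b1', 'b2'))
toy <- toy[rep(1:4, each = 5), ]  # 5 observations per cell
toy$y <- rnorm(20, mean = 10) + (toy$A == 'a2') * 2 + (toy$B == 'b2') * 3

grand <- mean(toy$y)
# variation due to each IV: deviations of its marginal means from the grand mean
ss_A <- sum(table(toy$A) * (tapply(toy$y, toy$A, mean) - grand)^2)
ss_B <- sum(table(toy$B) * (tapply(toy$y, toy$B, mean) - grand)^2)
# within-group variance: deviations of observations from their cell means
cell <- interaction(toy$A, toy$B)
ss_within <- sum((toy$y - ave(toy$y, cell))^2)
# the interaction picks up the between-cell variation the main effects miss
ss_AB <- sum((toy$y - grand)^2) - ss_A - ss_B - ss_within
c(ss_A, ss_B, ss_AB, ss_within)
# compare with the Sum Sq column of:
summary(aov(y ~ A * B, data = toy))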

The two-way between-subjects ANOVA assumes more or less the same things as its one-way counterpart:

  • Normality: Residuals (observation - group mean) should be normally distributed in each group \(\approx\) normal distribution of data in each group
  • Homogeneity of variances: Variances are approximately equal for every group
  • Continuous data
  • Independence of observations

2.1.6 Do it in R: two-way between-subjects ANOVA

  1. Prepare the data (you can download the data here).
  2. Visualise and get descriptive statistics
  3. Run the ANOVA
  4. Check the assumptions
  5. Post-hoc tests (think about this example: do we need post-hocs? If we do, why?)
    Remember that we needed post-hocs because an ANOVA only tells us that at least one group is different from the others. To see exactly which group(s) differ, we need pairwise t-tests.
    In this example, each factor has only two levels, so when we find a main effect, it is obvious that it must be those two groups that differ from each other! In fact, if you run an ANOVA with one factor and two levels, you will get the same results as a t-test.
    However, we had two factors in our ANOVA that interacted with each other. In this case, post-hocs are still useful for exploring the interaction. Could it be that L1 ability only affects those with high L2 proficiency and does not affect those with low L2 proficiency at all? Or could it be that L1 ability affects everyone, just with a larger effect for the highly proficient L2 learners? Or… Post-hocs will tell us the answer!
# Two-way between-subject ANOVA: Example
# Effect of L1 reading ability and L2 proficiency on L2 reading comprehension

# first, let's import our data
data_1 <- read.csv('l1-l2-comprehension.csv',header = TRUE)
data_1$L2_proficiency <- as.factor(data_1$L2_proficiency)
data_1$L1_ability <- as.factor(data_1$L1_ability)

# Visualisation and descriptives

# Line graph using ggplot2
# to make the line graph, we need to have group means
library(ggplot2)
data_1_mean <- aggregate(score ~ L1_ability + L2_proficiency, data = data_1, FUN = mean)
ggplot(data_1_mean, aes(x = L2_proficiency, y = score, group=L1_ability, color = L1_ability)) + 
  geom_point() + 
  geom_line() + 
  theme_light() +
  ylim(50, 100)

# Get the descriptives
library(psych)
describeBy(score ~ L1_ability*L2_proficiency, data = data_1)
## 
##  Descriptive statistics by group 
## L1_ability: high
## L2_proficiency: high
##       vars  n mean   sd median trimmed  mad min max range  skew kurtosis  se
## score    1 10 92.6 6.95     93   93.38 6.67  79 100    21 -0.58    -0.91 2.2
## ------------------------------------------------------------ 
## L1_ability: low
## L2_proficiency: high
##       vars  n mean    sd median trimmed  mad min max range skew kurtosis  se
## score    1 10   76 11.07     74   74.88 12.6  63  98    35 0.54    -0.96 3.5
## ------------------------------------------------------------ 
## L1_ability: high
## L2_proficiency: low
##       vars  n mean    sd median trimmed mad min max range  skew kurtosis   se
## score    1 10 61.5 10.95   62.5   61.38 8.9  45  79    34 -0.09    -1.28 3.46
## ------------------------------------------------------------ 
## L1_ability: low
## L2_proficiency: low
##       vars  n mean    sd median trimmed  mad min max range skew kurtosis   se
## score    1 10 57.7 10.34   55.5   57.62 9.64  42  74    32 0.26    -1.25 3.27
# Run the two-way between-subjects ANOVA
data_1_result <- aov(score ~ L1_ability * L2_proficiency, data = data_1)
summary(data_1_result)
##                           Df Sum Sq Mean Sq F value   Pr(>F)    
## L1_ability                 1   1040    1040  10.471   0.0026 ** 
## L2_proficiency             1   6101    6101  61.401 2.72e-09 ***
## L1_ability:L2_proficiency  1    410     410   4.122   0.0498 *  
## Residuals                 36   3577      99                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Check the assumptions of the two-way between-subject ANOVA
# 1. Normality
# use Shapiro-Wilk test
data_1_residuals <- residuals(object = data_1_result)
shapiro.test(data_1_residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  data_1_residuals
## W = 0.97943, p-value = 0.6682
# 2. Homogeneity of variance
# use Levene's test
library(car)
leveneTest(score ~ L1_ability * L2_proficiency, data = data_1)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  3  0.7463 0.5316
##       36
# Post-hoc tests
# Pair-wise t-test using Bonferroni
# the pairwise.t.test() function can only take one factor column, so we need a single column that represents both of our factors.
data_1$group <- paste('L1_',data_1$L1_ability,'_L2_',data_1$L2_proficiency, sep = '')
data_1$group <- as.factor(data_1$group)
pairwise.t.test(data_1$score, data_1$group, p.adjust.method = 'bonferroni')
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  data_1$score and data_1$group 
## 
##                L1_high_L2_high L1_high_L2_low L1_low_L2_high
## L1_high_L2_low 2.1e-07         -              -             
## L1_low_L2_high 0.0040          0.0149         -             
## L1_low_L2_low  1.7e-08         1.0000         0.0013        
## 
## P value adjustment method: bonferroni
  6. Report the results

The results of a two-way Analysis of Variance (ANOVA) with two between-subjects factors, L1 reading ability (high vs. low) and L2 proficiency (high vs. low), indicated a significant main effect of L1 reading ability (F(1,36)=10.47, p=0.003), as well as a main effect of L2 proficiency (F(1,36)=61.4, p<0.001). A significant interaction was also found (F(1,36)=4.12, p=0.049). Pairwise t-tests with Bonferroni correction revealed that while there was a significant effect of L1 ability for the high L2 proficiency groups (p=0.004), there was no significant effect of L1 reading ability for the low L2 proficiency groups (p=1).

3 Within-subject ANOVAs: the ez package

Now that we’ve talked about one-way and factorial ANOVAs, you might have noticed that so far all of our examples only had between-subject factors. What if my factor(s) are within-subject?

Within-subject ANOVAs are not very different from between-subject ANOVAs, and here are some examples. A few different packages can run within-subjects ANOVAs; for this workshop we are going to use the ez package.

3.1 One-way within-subjects ANOVA: Example

Effect of written text on perceived clarity of degraded speech:

10 participants took part in the experiment. Participants listened to vocoded words (vocoding: a procedure that removes speech’s fine structure while preserving low-frequency temporal information) and rated speech clarity on a 7-point scale. In the Before condition, the written text of the word was presented 800 ms before the onset of the speech; in the Simultaneous condition, it was presented at the same time as the onset of the speech; in the After condition, it was presented 800 ms after the onset of the speech. Each participant completed 15 trials in each condition. (Example adapted from Sohoglu, E., Peelle, J. E., Carlyon, R. P., & Davis, M. H. (2014). Top-down influences of written text on perceived clarity of degraded speech. Journal of Experimental Psychology: Human Perception and Performance, 40(1), 186. Note that the data used in this example are simulated and conclusions may differ from the real experiment.)

3.1.1 Do it in R: one-way within-subjects ANOVA

  1. Prepare the data (you can download the data here).
  2. Visualise and get descriptives
  3. Run the ANOVA (using the ez package, and specify within-subject factor(s))
  4. Check the assumptions:
  • Normality of residuals
  • Sphericity: analogous to homogeneity of variance in other ANOVAs; it only matters when a factor has more than two levels
  5. Post-hoc tests
# One-way within-subjects ANOVA: Example

# First, let's import our data
one_within_exp <- read.csv('vocoded.csv', header = TRUE)

# let's order our factor 
one_within_exp$condition <- factor(one_within_exp$condition, levels = c('before', 'simultaneous', 'after'))

# Let's get some descriptives and visualise our data
# descriptives
library(psych)
describeBy(clarity ~ condition, data = one_within_exp)
## 
##  Descriptive statistics by group 
## condition: before
##         vars   n mean   sd median trimmed  mad min max range  skew kurtosis  se
## clarity    1 150 5.35 1.27    5.5    5.42 0.74   2   7     5 -0.39    -0.69 0.1
## ------------------------------------------------------------ 
## condition: simultaneous
##         vars   n mean   sd median trimmed  mad min max range  skew kurtosis
## clarity    1 150 5.14 1.32      5     5.2 1.48   2   7     5 -0.26    -0.81
##           se
## clarity 0.11
## ------------------------------------------------------------ 
## condition: after
##         vars   n mean  sd median trimmed  mad min max range  skew kurtosis   se
## clarity    1 150 3.87 1.5      4    3.88 1.48   1   7     6 -0.09    -0.53 0.12
# visualisation, you can use line plots too
boxplot(clarity ~ condition, data = one_within_exp)

# remember that to do t-tests or ANOVAs we need by-participant mean
# in this case, it means that we need to average across the 15 trials in each condition for each participant
# in other words, condition and participant are our grouping variables
one_within_exp_by_par <- aggregate(clarity ~ participant+condition, data = one_within_exp, FUN = mean)

# run the within-subjects ANOVA with the package ez
# note that ezANOVA() checks for sphericity during computation
library(ez)
one_within_exp_result <- ezANOVA(data = one_within_exp_by_par, dv=.(clarity), wid=.(participant),within=.(condition), type = 3)
one_within_exp_result
## $ANOVA
##      Effect DFn DFd        F            p p<.05       ges
## 2 condition   2  18 56.52274 1.740421e-08     * 0.8179273
## 
## $`Mauchly's Test for Sphericity`
##      Effect         W       p p<.05
## 2 condition 0.6250943 0.15268      
## 
## $`Sphericity Corrections`
##      Effect       GGe       p[GG] p[GG]<.05       HFe        p[HF] p[HF]<.05
## 2 condition 0.7273226 1.10469e-06         * 0.8314024 2.257353e-07         *
# post-hoc: pairwise t-test (within-subjects)
pairwise.t.test(one_within_exp_by_par$clarity, one_within_exp_by_par$condition, p.adjust.method = 'bonferroni', paired = TRUE)
## 
##  Pairwise comparisons using paired t tests 
## 
## data:  one_within_exp_by_par$clarity and one_within_exp_by_par$condition 
## 
##              before  simultaneous
## simultaneous 0.16303 -           
## after        3.9e-05 0.00012     
## 
## P value adjustment method: bonferroni
  6. Report the results

A one-way within-subjects ANOVA revealed a significant main effect of Condition (F(2,18)=56.52, p<0.001). Pairwise t-tests with Bonferroni correction revealed significantly lower clarity ratings in the After condition than in the other two conditions (p’s < 0.001), while there was no difference in clarity between the Before and the Simultaneous conditions (p=0.16).

4 Mixed ANOVAs

Mixed ANOVAs refer to factorial ANOVAs with both within-subject and between-subject factors.

4.1 Mixed ANOVA: Example

Effect of language aptitude on the acquisition of programming languages: 40 participants took ten 45-minute sessions of Python training. Before training, participants’ language aptitude was measured using the Modern Language Aptitude Test. Participants were grouped into two groups of 20: high language aptitude and low language aptitude. (Example adapted from Prat, C. S., Madhyastha, T. M., Mottarella, M. J., & Kuo, C. H. (2020). Relating natural language aptitude to individual differences in learning programming languages. Scientific Reports, 10(1), 1-10. Note that the data used in this example are simulated and conclusions may differ from the real experiment.)

Participants’ programming ability was measured after the 1st, 5th, and 10th training sessions using an examination (max score 100).

2 x 3 design: aptitude (high vs. low, between-subjects) and session (1st vs. 5th vs. 10th, within-subjects).

4.1.1 Do it in R: mixed ANOVA using the ez package

  1. Prepare the data (you can download the data here).
  2. Visualise and get descriptives
  3. Run the ANOVA (using the ez package, and specify within-subject factor(s))
  4. Check the assumptions:
  • Normality of residuals
  • Homogeneity of variance for between-subject factors; sphericity for within-subject factors
  5. Post-hoc tests
# Mixed ANOVA: Example

# Let's import our data
mixed_exp <- read.csv('python.csv', header = TRUE)

# let's make categorical factors
mixed_exp$participant <- as.factor(mixed_exp$participant)
mixed_exp$session <- as.factor(mixed_exp$session)
mixed_exp$aptitude <- as.factor(mixed_exp$aptitude)

# Let's get some descriptives and visualise our data
# descriptives
library(psych)
describeBy(score ~ aptitude*session, data = mixed_exp)
## 
##  Descriptive statistics by group 
## aptitude: High
## session: 1
##       vars  n  mean   sd median trimmed mad min max range skew kurtosis   se
## score    1 20 24.75 8.62   23.5   24.31 8.9  11  43    32 0.39    -0.82 1.93
## ------------------------------------------------------------ 
## aptitude: Low
## session: 1
##       vars  n  mean   sd median trimmed mad min max range skew kurtosis   se
## score    1 20 19.75 9.87     20   19.31 8.9   5  38    33 0.21       -1 2.21
## ------------------------------------------------------------ 
## aptitude: High
## session: 5
##       vars  n mean   sd median trimmed   mad min max range skew kurtosis   se
## score    1 20 61.6 9.57     63   61.12 10.38  47  81    34 0.15    -1.12 2.14
## ------------------------------------------------------------ 
## aptitude: Low
## session: 5
##       vars  n  mean   sd median trimmed   mad min max range  skew kurtosis   se
## score    1 20 49.05 9.32   48.5   49.38 10.38  31  65    34 -0.33    -0.79 2.08
## ------------------------------------------------------------ 
## aptitude: High
## session: 10
##       vars  n  mean   sd median trimmed  mad min max range skew kurtosis   se
## score    1 20 83.65 8.13   82.5   83.44 7.41  69  99    30  0.1    -0.75 1.82
## ------------------------------------------------------------ 
## aptitude: Low
## session: 10
##       vars  n  mean   sd median trimmed  mad min max range skew kurtosis  se
## score    1 20 65.45 9.39   66.5   65.12 7.41  48  88    40 0.25    -0.19 2.1
# visualisation
# for two-way designs, line graph can help us understand the data structure better
# to make the line graph, we need to have group means
library(ggplot2)
mixed_exp_mean <- aggregate(score ~ aptitude*session, data = mixed_exp, FUN = mean)
ggplot(mixed_exp_mean, aes(x = session, y = score, group=aptitude, color = aptitude)) + 
  geom_point() + 
  geom_line() + 
  theme_light() 

# assumption check: normality (of data in each group)
# use the Shapiro-Wilk test; there is no single column coding the six groups,
# so build a grouping label from aptitude and session
tapply(mixed_exp$score, paste(mixed_exp$aptitude, mixed_exp$session, sep = '_'), shapiro.test)
## $High_1
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.96375, p-value = 0.6212
## 
## 
## $High_10
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.97549, p-value = 0.8638
## 
## 
## $High_5
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.94779, p-value = 0.3348
## 
## 
## $Low_1
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.96006, p-value = 0.545
## 
## 
## $Low_10
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.97615, p-value = 0.8753
## 
## 
## $Low_5
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.95303, p-value = 0.4154
# assumption check: homogeneity of variance (of between subject factor)
# use Levene's test
library(rstatix)
mixed_exp %>%
  group_by(session) %>%
  levene_test(score ~ aptitude)
## # A tibble: 3 × 5
##   session   df1   df2 statistic     p
##   <fct>   <int> <int>     <dbl> <dbl>
## 1 1           1    38     0.465 0.499
## 2 5           1    38     0.295 0.590
## 3 10          1    38     0.262 0.612
# run the mixed ANOVA with package ez
library(ez)
mixed_exp_result <- ezANOVA(data = mixed_exp, dv = .(score), wid = .(participant), between = .(aptitude), within = .(session), type = 3)
mixed_exp_result
## $ANOVA
##             Effect DFn DFd          F            p p<.05        ges
## 2         aptitude   1  38  44.721448 6.532571e-08     * 0.30775274
## 3          session   2  76 356.779288 2.345908e-39     * 0.85384879
## 4 aptitude:session   2  76   5.590335 5.431865e-03     * 0.08386419
## 
## $`Mauchly's Test for Sphericity`
##             Effect         W         p p<.05
## 3          session 0.9812797 0.7049646      
## 4 aptitude:session 0.9812797 0.7049646      
## 
## $`Sphericity Corrections`
##             Effect       GGe        p[GG] p[GG]<.05      HFe        p[HF]
## 3          session 0.9816237 1.114937e-38         * 1.034592 2.345908e-39
## 4 aptitude:session 0.9816237 5.728230e-03         * 1.034592 5.431865e-03
##   p[HF]<.05
## 3         *
## 4         *
# post-hoc: 
# in this case, first, we want to look at the effect of aptitude at each time point
aptitude_effect <- mixed_exp %>%
  group_by(session) %>%
  anova_test(dv = score, wid = participant, between = aptitude) %>%
  get_anova_table() %>%
  adjust_pvalue(method = "bonferroni")
aptitude_effect
## # A tibble: 3 × 9
##   session Effect     DFn   DFd     F            p `p<.05`   ges       p.adj
##   <fct>   <chr>    <dbl> <dbl> <dbl>        <dbl> <chr>   <dbl>       <dbl>
## 1 1       aptitude     1    38  2.91 0.096        ""      0.071 0.288      
## 2 5       aptitude     1    38 17.7  0.000154     "*"     0.317 0.000462   
## 3 10      aptitude     1    38 42.9  0.0000000997 "*"     0.531 0.000000299
# then, we can also look at the effect of session for each aptitude group
session_effect <- mixed_exp %>%
  group_by(aptitude) %>%
  anova_test(dv = score, wid = participant, within = session) %>%
  get_anova_table() %>%
  adjust_pvalue(method = "bonferroni")
session_effect
## # A tibble: 2 × 9
##   aptitude Effect    DFn   DFd     F        p `p<.05`   ges    p.adj
##   <fct>    <chr>   <dbl> <dbl> <dbl>    <dbl> <chr>   <dbl>    <dbl>
## 1 High     session     2    38  197. 8.43e-21 *       0.889 1.69e-20
## 2 Low      session     2    38  160. 3.27e-19 *       0.806 6.54e-19
# because we have more than two levels of session, we also need to run pairwise t-tests to determine exactly which session(s) are different from the other(s).
session_effect_pt <- mixed_exp %>%
  group_by(aptitude) %>%
  rstatix::pairwise_t_test(score ~ session, paired = TRUE, p.adjust.method = "bonferroni")
print.data.frame(session_effect_pt)
##   aptitude   .y. group1 group2 n1 n2  statistic df        p    p.adj
## 1     High score      1      5 20 20 -11.120608 19 9.26e-10 2.78e-09
## 2     High score      1     10 20 20 -26.989311 19 1.29e-16 3.87e-16
## 3     High score      5     10 20 20  -6.597268 19 2.59e-06 7.77e-06
## 4      Low score      1      5 20 20 -11.487291 19 5.40e-10 1.62e-09
## 5      Low score      1     10 20 20 -15.223988 19 4.24e-12 1.27e-11
## 6      Low score      5     10 20 20  -7.610057 19 3.49e-07 1.05e-06
##   p.adj.signif
## 1         ****
## 2         ****
## 3         ****
## 4         ****
## 5         ****
## 6         ****
  6. Report the results

A mixed ANOVA with one within-subject factor, Session (1st vs. 5th vs. 10th session), and one between-subject factor, Aptitude (high vs. low), revealed significant main effects of Session (F(2, 76)=356.78, p<0.001) and Aptitude (F(1, 38)=44.72, p<0.001), as well as an interaction between Aptitude and Session (F(2, 76)=5.59, p=0.005). Post-hoc pairwise t-tests suggest that for both High Aptitude and Low Aptitude participants, test scores improved significantly from each test point to the next (10th > 5th > 1st, all p’s < 0.001). After the first session, no significant difference was found between the two groups of participants (adjusted p=0.29); however, High Aptitude participants performed significantly better than Low Aptitude participants after the 5th and the 10th sessions (all p’s < 0.001).

5 The ultimate choose-the-test: t-tests and ANOVAs

How many IVs, and how many groups/conditions?

  • One IV, one group/condition: one-sample t-test
  • One IV, two groups/conditions: check the type of IV
    • Between-subject: independent samples t-test
    • Within-subject: paired samples t-test
  • One IV, more than two groups/conditions: check the type of IV
    • Between-subject: between-subjects ANOVA
    • Within-subject: within-subjects ANOVA
  • Two or more IVs: check the type(s) of IV
    • Between-subject only: between-subjects ANOVA
    • Within-subject only: within-subjects ANOVA
    • Both: mixed ANOVA

How to choose the correct test.

6 Effect sizes

Effect size refers to the size of an effect (obviously). This is different from statistical significance: with a large enough sample, even a tiny difference can be statistically significant, while effect sizes reflect the practical value of a difference.
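A quick simulation sketch of the difference (made-up numbers): with a huge sample, even a trivially small difference comes out statistically significant.

set.seed(42)
a <- rnorm(100000, mean = 0)
b <- rnorm(100000, mean = 0.02)  # a true difference of only 0.02 standard deviations
t.test(a, b)$p.value             # very likely below 0.05
mean(b) - mean(a)                # yet the difference itself is tiny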

6.1 Effect size of t-tests: Cohen’s \(d\)

The effect size of a t-test is usually reported in terms of Cohen’s \(d\). Cohen’s \(d\) indicates the difference between means, standardized by (divided by) the standard deviation of the measure.

\[ d = \frac{\bar{X} - \bar{Y}}{\text{s.d.}} \]
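As a quick illustration of the formula, here is a sketch with made-up vectors x and y, using the pooled standard deviation of the two groups (which, for two independent groups, is also what the cohen.d() function below uses by default):

# Cohen's d by hand (hypothetical scores for two groups)
x <- c(7, 9, 6, 8, 10)
y <- c(12, 14, 11, 13, 15)
pooled_sd <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
                    (length(x) + length(y) - 2))
(mean(x) - mean(y)) / pooled_sd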

6.1.1 Do it in R: Cohen’s \(d\)

Let’s use our first example of an independent samples t-test: performance of SLI children and TD children in a non-word repetition task.

# let's skip the assumption checks and run the t-test
t.test(score ~ group, var.equal=TRUE, data = df)
## 
##  Two Sample t-test
## 
## data:  score by group
## t = -2.7027, df = 18, p-value = 0.01457
## alternative hypothesis: true difference in means between group SLI and group TD is not equal to 0
## 95 percent confidence interval:
##  -9.419927 -1.180073
## sample estimates:
## mean in group SLI  mean in group TD 
##               7.5              12.8
# Now, let's calculate Cohen's d with the help of the effsize package
# the effsize:: prefix makes sure the function we call is from the effsize package
# (the psych package also has a cohen.d() function)
library(effsize)
effsize::cohen.d(score ~ group, data = df)
## 
## Cohen's d
## 
## d estimate: -1.20868 (large)
## 95 percent confidence interval:
##     lower     upper 
## -2.230434 -0.186926

We see that we have a Cohen’s \(d\) of -1.2, which is a relatively large effect. (Cohen classified effect sizes as small (\(d=0.2\)), medium (\(d=0.5\)), or large (\(d \geq 0.8\)). See a visualisation here.)

6.2 Effect sizes of ANOVAs: Eta-squared (\(\eta^2\))
and Partial Eta-squared (\(\eta_p^2\))

The effect size of an ANOVA can be reported as \(\eta^2\). Eta-squared (\(\eta^2\)) is the proportion of the total variance that is explained by the factor in a one-way ANOVA, whereas partial eta-squared (\(\eta_p^2\)), used for factorial ANOVAs, is the proportion of variance that a factor explains out of the variance not explained by the other factors.

6.2.1 Do it in R: (Partial) Eta-squared (\(\eta^2\))

Let’s take our one-way between-subjects ANOVA example (vocabulary size).

# let's run the ANOVA
one_between_exp_1_result <- aov(score ~ group, data = one_between_exp_1)
# take a look at our results using the summary() function
summary(one_between_exp_1_result)
##             Df Sum Sq Mean Sq F value Pr(>F)    
## group        2 282478  141239   396.4 <2e-16 ***
## Residuals   27   9621     356                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Let's calculate our eta-squared
# same function also gives us partial eta-squared when we feed it a factorial ANOVA's results.
library(lsr)
etaSquared(one_between_exp_1_result)
##          eta.sq eta.sq.part
## group 0.9670626   0.9670626

We have eta-squared = 0.97, which is a very large effect. ((Partial) eta-squared can be interpreted as: \(0 < \text{trivial effect} < 0.01 < \text{small effect} < 0.06 < \text{moderate effect} < 0.14 < \text{large effect}\).)
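As a sanity check on the definition, we can recover the same number by hand from the ANOVA table (anova() reports the same Sum Sq values as summary() above):

# eta-squared by hand: SS_effect / SS_total
aov_tab <- anova(one_between_exp_1_result)
ss_effect <- aov_tab['group', 'Sum Sq']
ss_residual <- aov_tab['Residuals', 'Sum Sq']
ss_effect / (ss_effect + ss_residual)  # 0.967, matching etaSquared()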

7 Non-parametric tests

If you’ve been paying attention to the assumptions of the tests we have talked about, you’ll probably have noticed that all of them require the data to be normally distributed in some way. But what if my data isn’t normally distributed?

The statistical tests we have talked about so far are all parametric tests, meaning that they require the data to follow a certain distribution (the normal distribution, in our case). Non-parametric tests, in contrast, do not require a certain distribution, and thus can be used when your data do not meet the assumptions of parametric tests. But why don’t we always run non-parametric tests then, and forget about normal distributions? Because parametric tests generally have more statistical power: if an effect in fact exists, parametric tests are more likely to detect it.

Parametric test              Non-parametric equivalent
Independent samples t-test   Mann-Whitney U test
Paired samples t-test        Wilcoxon signed rank test
One-way ANOVA                Kruskal-Wallis test
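For reference, base R has built-in functions for all three; a minimal sketch with made-up vectors:

# hypothetical scores for two groups
x <- c(1, 2, 3, 4, 5)
y <- c(6, 9, 12, 15, 18)
wilcox.test(x, y)                 # Mann-Whitney U test (independent samples)
wilcox.test(x, y, paired = TRUE)  # Wilcoxon signed rank test (paired samples)
kruskal.test(list(x, y))          # Kruskal-Wallis test (here with just two groups)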

7.1 The Chi-squared (\({\chi}^2\)) test: when data is categorical

In this class we will not delve into the non-parametric equivalents of t-tests and ANOVAs. Instead, let’s talk about the Chi-squared test. This test is useful when your dependent variable is categorical (and therefore obviously not normally distributed). For example, suppose a biologist is interested in whether a certain chromosome W affects/determines the sex of a newly-discovered bird species. The biologist tested 100 chicks and determined their sex and whether they have the W chromosome. In this case, the Independent Variable is whether the bird has the W chromosome (categorical), and the Dependent Variable is the bird’s sex (also categorical: male or female). For a human example, suppose a neurolinguist is interested in whether having a certain gene x increases the chance of developing dyslexia. In this case the data are also categorical: someone either develops dyslexia or they don’t.

What kind of statistic can we compare in these situations, given that means and medians are meaningless for nominal categories? In the Chi-squared case, it’s all about counting! The Chi-squared test compares observed frequencies to expected frequencies (based on the null hypothesis). The bigger the difference between the expected and observed frequencies, the more likely the null hypothesis is to be rejected.
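To make the counting logic concrete, here is a sketch that computes the expected frequencies by hand, using the observed counts from the candy example below:

# observed counts from the candy example below
obs <- matrix(c(100, 50, 80, 70), nrow = 2, byrow = TRUE,
              dimnames = list(c('boy', 'girl'), c('Haribo', 'MM')))
# under the null hypothesis: expected count = row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
expected  # 90 Haribo and 60 M&Ms expected in each row
# the test statistic sums (observed - expected)^2 / expected over all cells
sum((obs - expected)^2 / expected)  # about 5.56 before Yates' continuity correction

(By default, chisq.test() applies Yates’ continuity correction to 2 x 2 tables, which is why it reports the slightly smaller value of 5.01 below.)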

7.1.1 Chi-squared test example: candies

300 children (150 boys, 150 girls) are asked if they prefer Haribo or M&Ms. Question: Is there a difference between boys & girls in the preferred sweet? (Quick Q: what is the IV and what is the DV in this study? What data types are they?)

  • Null hypothesis: no difference between boys and girls in preferred sweet
  • Alternative hypothesis: there is a difference between boys and girls in preferred sweet

Although non-parametric, the Chi-squared test still makes some basic assumptions about your data: the data need to be randomly sampled, observations should be independent, and groups should be mutually exclusive. All of these are things you can (and should) ensure while collecting your data, so you don’t really need to check any assumptions in R.

7.1.2 Do it in R: Chi-squared test

  1. Prepare the data (you can download the data here).
  2. Visualise and get descriptives.
  3. Convert data to a frequency table.
  4. Run the Chi-squared test.
# let's import our data
chi_exp <- read.csv('candy.csv',header = TRUE)
head(chi_exp)
##   participant gender  sweet
## 1           1   girl     MM
## 2           2   girl Haribo
## 3           3    boy     MM
## 4           4   girl     MM
## 5           5    boy     MM
## 6           6    boy Haribo
# for a chi squared test, we need the frequency table
chi_freq <- table(chi_exp$gender, chi_exp$sweet)
chi_freq
##       
##        Haribo  MM
##   boy     100  50
##   girl     80  70
# visualise using bar plot (ggplot2)
library(ggplot2)
chi_plot <- as.data.frame(table(subset(chi_exp, select = -c(participant))))
ggplot(data=chi_plot, aes(x=gender, y=Freq, fill=sweet)) +
  geom_bar(stat="identity", width=0.4)

# compute the chi squared test using chisq.test()
chisq.test(chi_freq)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  chi_freq
## X-squared = 5.0139, df = 1, p-value = 0.02514

From the result, we observe a significant difference between boys and girls in their preferred sweets.

  5. Reporting a Chi-squared test

A Chi-squared test for association was conducted comparing boys and girls on their choice of favoured sweet between Haribo and M&Ms. There was a significant association between the gender of the child and the sweet chosen, \(\chi^2(1)=5.01\), p=0.03.

Further readings

Field, A. P., Miles, J., & Field, Z. (2012). Factorial ANOVA (GLM 3). In Discovering Statistics Using R.

Field, A. P., Miles, J., & Field, Z. (2012). Repeated measures designs (GLM 4). In Discovering Statistics Using R.

Field, A. P., Miles, J., & Field, Z. (2012). Mixed designs (GLM 5). In Discovering Statistics Using R.

Field, A. P., Miles, J., & Field, Z. (2012). Non-parametric tests. In Discovering Statistics Using R.

Field, A. P., Miles, J., & Field, Z. (2012). Categorical data. In Discovering Statistics Using R.

Phillips, N. D. (2018). Hypothesis tests. In YaRrr! The Pirate’s Guide to R.

Phillips, N. D. (2018). ANOVA. In YaRrr! The Pirate’s Guide to R.

Poldrack, R. A. (2019). Modeling categorical relationships. In Statistical Thinking for the 21st Century.