Week 3: T-tests and ANOVAs

Yiling Huo

2024-05-14

So far, we’ve talked about null hypothesis testing, normal distribution, and some descriptive statistics. This week and next week, we are going to talk about a few ways to test null hypotheses by comparing means. And for the last week, we are going to talk about regression models.

1 T-tests

As researchers, many of the questions we try to answer with statistics boil down to something like “Is there a difference between X and Y?”. The t-test is a simple statistical test that can help us answer such questions by determining whether there is a significant difference between the means of two sets of data.

1.1 Independent vs. Repeated measures

Look at these questions that we can answer with a t-test:

Notice that sometimes we are interested in how two groups of subjects compare, and sometimes we are interested in how results change within the same group depending on the conditions.

Independent measures, also known as between-subject measures, describes experiments where different participants are measured in each group or condition. For example, we can compare 20 students from university A to 20 students from university B (assuming no one attends both universities simultaneously). When experiments use independent measures, usually the number of data points = the number of participants. Results in the two groups are independent from one another.

                 Group A                           Group B
participant no.  1, 2, 3, 4, 5, 6, 7, 8, 9, 10     11, 12, 13, 14, 15, 16, 17, 18, 19, 20

Certain topics can only be investigated in independent designs (e.g., comparing between sexes or demographic groups, or a test procedure or treatment that can only be run once). With independent measures, there is less risk of practice or fatigue effects. There is also less risk of data loss due to participant drop out.

Repeated measures, also known as within-subject measures, describes experiments where the same participants are measured in each group or condition. For example, we might be interested in whether the IQ of a group of 20 children is higher when they are 10 years old than it was when they were 5 years old. In an experiment with repeated measures, each participant is tested (at least) twice. Therefore, you can expect a correlation between results: children who had a higher IQ when they were 5 are more likely to have a high IQ at 10.

                 Condition A                       Condition B
participant no.  1, 2, 3, 4, 5, 6, 7, 8, 9, 10     1, 2, 3, 4, 5, 6, 7, 8, 9, 10

With repeated measures, participants act as their own control. Therefore, the level of error variance in the data is reduced. Experiments with repeated measures can also be quicker and sometimes cheaper to run than between-subject experiments.

Quick Q: Read the research questions again and identify whether each should be answered with an experiment with independent or repeated measures.

1.2 Independent samples t-test

The independent samples t-test is used to compare the means of two groups in between-subject designs.

1.2.1 Assumptions of Independent samples t-test

All statistical tests have some assumptions about the data. The independent samples t-test assumes that:

  1. Scores come from two different groups of participants
    • i.e. each participant is only measured once
  2. Scores should approximate a normal distribution
  3. Variances are roughly equal in the two groups
    • ‘Levene’s test for equality of error variances’
  4. Data is of a continuous (interval or ratio) nature

1.2.2 Independent samples t-test example

The performance of 10 children with specific language impairment (SLI) is compared to 10 typically-developing aged-matched control children on a test of non-word repetition.

group SLI, participant no.    1    2    3    4    5    6    7    8    9   10
score                         6   17    2    8    9    3   12   10    5    3
group TD, participant no.    11   12   13   14   15   16   17   18   19   20
score                        10   14   12    9   18   16    8   20   10   11

Boxplot with means shown.


In the most general terms, a t statistic is the difference between two means divided by a measure of the variability of that difference. In the case of the independent samples t-test, the t value is the difference between the two group means, divided by the standard error of the difference between the two means.

Standard errors: Imagine randomly picking a sample from the population, for example randomly picking 10 children from all of the children in the world and testing their word repetition accuracy. Each time we do this, the mean score of these 10 children is likely to be slightly different. Now say we do this (randomly picking 10 children) 100 times: we will have a distribution of 100 mean values. The standard deviation of this distribution of means is the standard error (SE).

Calculating the SE from one experiment: But we never conduct the same experiment over and over again! Instead, we can estimate the SE from our one experiment. To calculate the SE of a mean from one experiment’s data: \(SE=\frac{s.d.}{\sqrt{n}}\).

\[t=\frac{M_1-M_2}{SE}=\frac{M_1-M_2}{\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}}\]
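To make the formula concrete, here is a minimal sketch in base R that computes t by hand for the SLI and TD scores from the example. The object names (sli, td, mean_diff, se_diff, t_by_hand) are chosen here for illustration only.

```r
# Scores from the independent samples example (10 children per group)
sli <- c(6, 17, 2, 8, 9, 3, 12, 10, 5, 3)
td  <- c(10, 14, 12, 9, 18, 16, 8, 20, 10, 11)

# Numerator: difference between the two group means
mean_diff <- mean(sli) - mean(td)

# Denominator: standard error of the difference between the means
se_diff <- sqrt(var(sli) / length(sli) + var(td) / length(td))

# t value: about -2.70 for these data
t_by_hand <- mean_diff / se_diff
t_by_hand
```

Because the two groups are the same size here, this value is the same as the one from a t-test that assumes equal variances.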

The bigger the difference between the two means, the bigger the t value. The smaller the standard error is (i.e., the more accurately you determine the difference between the means), the bigger the t value is.
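The idea behind the standard error can also be explored with a small simulation. The sketch below repeatedly draws samples of 10 from a normally distributed population and compares the spread of the sample means with \(s.d./\sqrt{n}\). The population values (mean 100, s.d. 15), the seed and the number of repetitions are assumptions made purely for illustration; they are not data from the lecture.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical population (assumed values, not lecture data)
pop_mean <- 100
pop_sd   <- 15
n        <- 10      # participants per sample
n_reps   <- 10000   # how many times we repeat the "experiment"

# Draw n_reps samples of size n and keep the mean of each sample
sample_means <- replicate(n_reps, mean(rnorm(n, mean = pop_mean, sd = pop_sd)))

# The standard deviation of the sample means is the standard error
se_simulated <- sd(sample_means)

# The formula SE = s.d. / sqrt(n), here using the known population s.d.
se_formula <- pop_sd / sqrt(n)

c(simulated = se_simulated, formula = se_formula)
```

The two numbers should be close to each other. Increasing n makes both smaller, which is why larger samples give larger t values for the same difference in means.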

1.2.3 Do it in R: Independent samples t-test

  1. Build a data frame in long form: each row represents one observation, with group and score as columns.
score <- c(6, 17, 2, 8, 9, 3, 12, 10, 5, 3, 10, 14, 12, 9, 18,
    16, 8, 20, 10, 11)
group <- c("SLI", "SLI", "SLI", "SLI", "SLI", "SLI", "SLI", "SLI",
    "SLI", "SLI", "TD", "TD", "TD", "TD", "TD", "TD", "TD", "TD",
    "TD", "TD")
df <- data.frame(group, score)
df
##    group score
## 1    SLI     6
## 2    SLI    17
## 3    SLI     2
## 4    SLI     8
## 5    SLI     9
## 6    SLI     3
## 7    SLI    12
## 8    SLI    10
## 9    SLI     5
## 10   SLI     3
## 11    TD    10
## 12    TD    14
## 13    TD    12
## 14    TD     9
## 15    TD    18
## 16    TD    16
## 17    TD     8
## 18    TD    20
## 19    TD    10
## 20    TD    11
  2. Check the assumptions
    • Normal distribution: Shapiro-Wilk test
    • Homogeneity of variance: Levene’s test
# normality: Shapiro-Wilk test, run separately for each group
tapply(df$score, df$group, shapiro.test)
## $SLI
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.93703, p-value = 0.5205
## 
## 
## $TD
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.92153, p-value = 0.3699
# homogeneity of variance: Levene's test (leveneTest() is in the car package)
library(car)
leveneTest(score ~ factor(group), data = df)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1  0.1783 0.6778
##       18
  3. Independent samples t-test
t.test(score ~ group, var.equal = TRUE, data = df)
## 
##  Two Sample t-test
## 
## data:  score by group
## t = -2.7027, df = 18, p-value = 0.01457
## alternative hypothesis: true difference in means between group SLI and group TD is not equal to 0
## 95 percent confidence interval:
##  -9.419927 -1.180073
## sample estimates:
## mean in group SLI  mean in group TD 
##               7.5              12.8
  4. Report the results

An independent samples t-test showed that 10 children with SLI performed significantly worse [t(18)=-2.703, p=0.015] on a test of non-word repetition (M=7.5, s.d.=4.7) than 10 age-matched typically-developing controls (M=12.8, s.d.=4.05).

Reporting statistics:
Statistical tests need to be reported once the data have been analysed, usually in a short paragraph, containing:
1. Statistical test that was performed
2. The measures that were compared (with the different levels if there are any)
3. The means and standard deviations
4. Significant or non-significant?
5. Specific test value – in this case the t-statistic (later you will see the \(F\)-value, \(\chi^2\), \(r\), etc.)
6. Degrees of freedom – convention dictates that these are placed in rounded parentheses
7. The specific \(p\)-value. (as well as confidence intervals/effect size)

Degrees of freedom:
The degrees of freedom are the number of values in a calculation that are free to vary while still achieving a specific outcome. For example, I want a group of five numbers that has a mean of 10. The first four numbers can vary freely, but once they are chosen, the last number is fixed. Therefore, there are 4 degrees of freedom.
10, 11, x, y, z (mean = 10) (x=?? y=?? z=??)
10, 11, 9, 12, x (mean = 10) (x must be 8)
300, 4123, -50, 890, x (mean = 10) (x must be -5213)
In t-tests, degrees of freedom define the shape of the t-distribution used to calculate the p value. \(d.f. = n-1\) is the basic method for calculating degrees of freedom. In the case of independent samples t-test, two means are calculated, thus \(d.f. = n_1 + n_2 -2\). The degree of freedom reflects sample size. The higher the d.f., the more power to reject \(H_0\).
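Both ideas in this note can be sketched in a few lines of base R: the last of five numbers is fixed once the mean and the other four are known, and the degrees of freedom select the t-distribution from which the p value is read. The object names are chosen here for illustration; the t value is the rounded one from the independent samples example.

```r
# The last of five numbers is fixed once the mean and the other four are chosen
target_mean <- 10
first_four  <- c(10, 11, 9, 12)
last_number <- target_mean * 5 - sum(first_four)
last_number  # 8

# Two-sided p value for the independent samples example
t_value <- -2.7027          # t value reported by t.test()
df_ind  <- 10 + 10 - 2      # n1 + n2 - 2
p_value <- 2 * pt(-abs(t_value), df = df_ind)
p_value                     # about 0.0146
```

Here pt() gives the area in the lower tail of the t-distribution with the given degrees of freedom, and doubling it gives the two-sided p value.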

1.3 Paired samples t-test

The paired samples t-test is used to compare two groups or conditions in repeated measures/within-subjects designs.

1.3.1 Assumptions of the paired samples t-test

  • Data is of a continuous nature
  • Same participants perform in both conditions
    • i.e., each participant is measured twice
    • so typically there is a correlation between the two sets of scores
  • The difference in scores should approximate a normal distribution

1.3.2 Paired samples t-test example

The performance of 10 children with SLI is measured on a task of non-word repetition before and after they receive a specialised treatment.

condition    s1   s2   s3   s4   s5   s6   s7   s8   s9  s10
before        6   17    7    2    8    9    3   12   10    5
after         8   15    8    5   12   15    5   10    9    8
difference    2   -2    1    3    4    6    2   -2   -1    3

Boxplot with means shown.


In a paired samples t-test, the t value is the average difference between the two scores, divided by the standard error of the differences:

\[ t=\frac{\bar{x}_{diff}}{\frac{S}{\sqrt{n}}} \]

In our example, the average difference is 1.6, the standard deviation of the difference is 2.63, sample size is 10. So \(t = \frac{1.6}{\frac{2.63}{3.162}} = 1.92\). Our degree of freedom is \(n-1 = 9\).
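The same calculation can be written out in base R. This is a minimal sketch; the object names are chosen here for illustration, and the after-treatment score for s9 is entered as 9, which is the value that gives the difference of -1 listed for that participant.

```r
# Scores before and after treatment, in participant order s1 to s10
before <- c(6, 17, 7, 2, 8, 9, 3, 12, 10, 5)
after  <- c(8, 15, 8, 5, 12, 15, 5, 10, 9, 8)

# One difference score per participant
dif <- after - before
n   <- length(dif)

# t = mean difference / standard error of the differences
t_paired  <- mean(dif) / (sd(dif) / sqrt(n))
df_paired <- n - 1

# Two-sided p value from the t-distribution with n - 1 degrees of freedom
p_paired <- 2 * pt(-abs(t_paired), df = df_paired)

c(mean_dif = mean(dif), sd_dif = sd(dif), t = t_paired, df = df_paired, p = p_paired)
```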

1.3.3 Do it in R: Paired samples t-test

  1. Make a data frame: data should be in long form, one observation per row.
    Short form and long form of data:
    Data in short form puts all measures from one participant in one row. One row = one participant.
    Data in long form puts each measure in a different row, even when they are from the same participant. Condition or group is usually indicated by a column. One row = one observation/measure.
    Both forms are useful for different purposes.
score <- c(6, 17, 7, 2, 8, 9, 3, 12, 10, 5, 8, 15, 8, 5, 12,
    15, 5, 10, 9, 8)
condition <- rep(c("before", "after"), each = 10)
participant <- rep(c(1:10), 2)
paired_exp <- data.frame(participant, condition, score)
paired_exp
##    participant condition score
## 1            1    before     6
## 2            2    before    17
## 3            3    before     7
## 4            4    before     2
## 5            5    before     8
## 6            6    before     9
## 7            7    before     3
## 8            8    before    12
## 9            9    before    10
## 10          10    before     5
## 11           1     after     8
## 12           2     after    15
## 13           3     after     8
## 14           4     after     5
## 15           5     after    12
## 16           6     after    15
## 17           7     after     5
## 18           8     after    10
## 19           9     after     9
## 20          10     after     8
  2. Check the assumption: Normality (of the difference): Shapiro-Wilk test
# check normality (of the differences): first calculate the differences
dif <- with(paired_exp, score[condition == "after"] - score[condition ==
    "before"])
# use the Shapiro-Wilk test
shapiro.test(dif)
## 
##  Shapiro-Wilk normality test
## 
## data:  dif
## W = 0.94191, p-value = 0.5745
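  3. Paired-samples t-test
# after minus before; the two-vector call also works in R >= 4.4.0, where a formula with paired = TRUE is an error
with(paired_exp, t.test(score[condition == "after"], score[condition == "before"], paired = TRUE))
## 
##  Paired t-test
## 
## data:  score[condition == "after"] and score[condition == "before"]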
## t = 1.9215, df = 9, p-value = 0.08684
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -0.2836223  3.4836223
## sample estimates:
## mean difference 
##             1.6
  4. Report the results

A paired samples t-test revealed that the performance of the 10 children with SLI did not significantly improve (t(9)=1.92, p=0.087) after the treatment (before: mean=7.9, s.d.=4.43; after: mean=9.5, s.d.=3.57).
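If your data arrive in short form, they can be rearranged into the long form used in this section. The sketch below shows one of several ways to do this in base R, using the same scores; the object names short_form and long_form are chosen here for illustration.

```r
# Short form: one row per participant, one column per condition
short_form <- data.frame(
  participant = 1:10,
  before = c(6, 17, 7, 2, 8, 9, 3, 12, 10, 5),
  after  = c(8, 15, 8, 5, 12, 15, 5, 10, 9, 8)
)

# Long form: one row per observation, with condition as a column
long_form <- data.frame(
  participant = rep(short_form$participant, times = 2),
  condition   = rep(c("before", "after"), each = nrow(short_form)),
  score       = c(short_form$before, short_form$after)
)

nrow(short_form)  # 10 rows: one per participant
nrow(long_form)   # 20 rows: one per observation
```

Building the long form by hand like this keeps the participants in the same order within each condition, which matters when the two sets of scores are later paired up.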

1.4 How to choose the correct t-test?

One sample t-test:
For time reasons I did not include it in the lecture. The idea of a one-sample t-test is very simple: is the mean of a group of values significantly different from a specific number? For example, you might be interested in whether the IQ of students at UCL is different from the average (100). To run a one sample t-test in R: t.test(data, mu=100), where mu is the number you want to compare your data with.

Here’s a map of how to choose the correct t-test.

How to choose the correct t-test.

How many groups?
  • one group: One sample t-test
  • two groups: Participants in group 1 and group 2 are the same?
    • yes: Paired samples t-test
    • no: Independent samples t-test
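The one-sample t-test can be tried on data we already have: a paired samples t-test gives the same result as a one-sample t-test that compares the difference scores with 0. The sketch below is an illustration using the before/after scores from the paired samples example; the object names are chosen here for illustration.

```r
# Before/after scores from the paired samples example
before <- c(6, 17, 7, 2, 8, 9, 3, 12, 10, 5)
after  <- c(8, 15, 8, 5, 12, 15, 5, 10, 9, 8)
dif    <- after - before

# One-sample t-test: is the mean difference different from 0?
one_sample <- t.test(dif, mu = 0)

# Paired samples t-test on the same scores
paired <- t.test(after, before, paired = TRUE)

# Both tests give the same t value and the same p value
c(one_sample = unname(one_sample$statistic), paired = unname(paired$statistic))
c(one_sample = one_sample$p.value, paired = paired$p.value)
```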

2 Analysis of Variance (ANOVA)

So far with t-tests, we only talked about experiments with one or two groups. What if you have more? If we only had t-tests, we would need to do a t-test for each pair of groups. If you have three groups A, B, and C, you would need to compare groups A & B, B & C, and A & C (3 t-tests). If you have four groups, you would need to do 6 t-tests… That’s a lot of t-tests! Not only would it take a lot of time and effort, doing multiple tests also increases the chance of a Type I error. This is because at \(\alpha = 0.05\), each test we conduct has a 5% chance of a false positive when \(H_0\) is true, and these chances accumulate across tests.

The ANOVA is unaffected by this problem because no matter how many groups you have, you will only run one test which provides the main effects and interactions. Therefore, alpha remains at the “desired” level of 0.05.
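To get a feel for how quickly the problem grows, the sketch below counts the pairwise t-tests needed for 3, 4 and 5 groups and computes the chance of at least one Type I error across them. It assumes the tests are independent, which is a simplification (pairwise tests on the same data are not fully independent), so the numbers are approximate. The object names are chosen here for illustration.

```r
alpha    <- 0.05
n_groups <- c(3, 4, 5)

# Number of pairwise comparisons among the groups
n_tests <- choose(n_groups, 2)

# Chance of at least one false positive across all tests,
# assuming independent tests and that H0 is true for every comparison
familywise_error <- 1 - (1 - alpha)^n_tests

data.frame(n_groups, n_tests, familywise_error = round(familywise_error, 3))
```

With three groups the chance of at least one false positive is already about 14%, and with four groups it is about 26%.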

As with t-tests, the \(H_0\) in an ANOVA is that all means are equal. Unlike in t-tests, the \(H_1\) in an ANOVA is that at least one of the means is different from the others (non-directional), and we don’t know which one(s) stand out unless we do further tests.

Whereas t-tests use the mean as the basic statistic to evaluate the null hypothesis, ANOVA uses the variance. Variance refers to a way to represent variability in data: \(S^2 = \frac{\Sigma(x_i - \bar{x})^2}{n-1}\). Variance is the average of the squared differences from the mean, equivalent to the square of the standard deviation.
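The variance formula can be checked by hand in base R. This sketch uses the SLI scores from the independent samples t-test example; the object names are chosen here for illustration.

```r
# SLI scores from the independent samples t-test example
sli <- c(6, 17, 2, 8, 9, 3, 12, 10, 5, 3)

# Squared differences from the mean
squared_dev <- (sli - mean(sli))^2

# Sum of squared differences divided by n - 1
variance_by_hand <- sum(squared_dev) / (length(sli) - 1)

# Compare with the built-in function and with the squared standard deviation
c(by_hand = variance_by_hand, var = var(sli), sd_squared = sd(sli)^2)
```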

2.0.1 Basic logic of ANOVA

Conceptually, the ANOVA compares variability due to the experiment manipulation with any other sources of variability:

\[F=\frac{\text{Variability BETWEEN groups}}{\text{Variability WITHIN groups}}\]

The variability (variance) between groups is attributed to the independent variable (the predictor, the experiment manipulation), e.g., people behave differently because they receive different types of treatment. The variability within each group is attributed to chance and individual difference. The F-ratio is the between-group variance divided by the within-group variance.

More between-group variability = larger F

More within-group variability = smaller F

2.0.2 Using ANOVA for null hypothesis testing

The F-ratio is the crucial statistic when using the ANOVA for null hypothesis testing. The F-ratio must be evaluated with the appropriate degrees of freedom. There are two different degrees of freedom in the ANOVA: one is the number of groups minus 1, the other is the number of subjects minus the number of groups.
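
Written out with the two degrees of freedom, the F-ratio takes the following form. The symbols \(k\) (number of groups), \(N\) (total number of subjects), \(SS\) (sum of squared deviations) and \(MS\) (mean square, i.e. a variance) are notation introduced here for illustration.

\[F=\frac{MS_{between}}{MS_{within}}=\frac{SS_{between}/(k-1)}{SS_{within}/(N-k)}\]

For example, with three groups of 10 subjects each, the degrees of freedom are \(3-1=2\) and \(30-3=27\).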

2.1 One-way between-subjects ANOVA

The one-way between-subjects ANOVA is used when you have two or more groups and a single between-subject independent variable.

How many groups/conditions?
  • one group: One sample t-test
  • two groups: Type of IV?
    • within-subject: Paired samples t-test
    • between-subject: Independent samples t-test
  • two groups or more: Type(s) of IV?
    • within-subject: Within-subjects ANOVA
    • between-subject: Between-subjects ANOVA
    • both: mixed ANOVA

How to choose the correct test.

2.1.1 Assumptions of One-way between-subjects ANOVA

  1. Normality: Residuals (observation - group mean) should be normally distributed in each group \(\approx\) normal distribution of data in each group
  2. Homogeneity of variances: Variances are approximately equal for every group
  3. Continuous data
  4. Independence of observations:
    • Each subject’s score on the outcome variable is independent of other subjects’ scores in the same group
    • and no subject overlap between groups
  5. One predictor variable (aka factor) with 3+ levels
    • In fact, you can have just two, and you get the same answers as a t-test
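
The last point can be checked on the SLI/TD data from the t-test section: with two groups, the F value from a one-way ANOVA is the square of the t value, and the p values are identical. This is a minimal sketch that uses oneway.test() with var.equal = TRUE as a compact way to get the one-way ANOVA F value; the object names are chosen here for illustration.

```r
# SLI/TD data from the independent samples t-test example
score <- c(6, 17, 2, 8, 9, 3, 12, 10, 5, 3,
           10, 14, 12, 9, 18, 16, 8, 20, 10, 11)
group <- factor(rep(c("SLI", "TD"), each = 10))
two_groups <- data.frame(group, score)

# Independent samples t-test assuming equal variances
t_res <- t.test(score ~ group, var.equal = TRUE, data = two_groups)

# One-way ANOVA on the same two groups
f_res <- oneway.test(score ~ group, var.equal = TRUE, data = two_groups)

# F equals t squared, and the p values match
c(t_squared = unname(t_res$statistic)^2, F = unname(f_res$statistic))
c(p_t = t_res$p.value, p_F = f_res$p.value)
```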

2.1.2 One-way between-subjects ANOVA: Example

10 native English speakers, 10 intermediate level L2 English speakers, and 10 advanced level L2 English speakers took a vocabulary size test. Scores are measured in hundreds of words.

group             s1   s2   s3   s4   s5   s6   s7   s8   s9  s10
native           289  269  265  261  301  256  283  279  251  282
group            s11  s12  s13  s14  s15  s16  s17  s18  s19  s20
intermediate L2   53   40   23   12   77   31   50   36   57   38
group            s21  s22  s23  s24  s25  s26  s27  s28  s29  s30
advanced L2       89  120  118  149  122  135   76  104  117   95

Native group: mean = 273.6, s.d. = 15.81

Intermediate L2 group: mean = 41.7, s.d. = 18.48

Advanced L2: mean = 112.5, s.d. = 21.85
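
Before handing the work over to R's ANOVA function, the F-ratio for these data can be built up by hand from the between-group and within-group variability. This is a sketch in base R; the object names are chosen here for illustration, and the score for native speaker s4 is entered as 261, the value that matches the group mean of 273.6.

```r
# Vocabulary scores for the three groups
groups <- list(
  native       = c(289, 269, 265, 261, 301, 256, 283, 279, 251, 282),
  intermediate = c(53, 40, 23, 12, 77, 31, 50, 36, 57, 38),
  advanced     = c(89, 120, 118, 149, 122, 135, 76, 104, 117, 95)
)

k <- length(groups)                 # number of groups
N <- sum(lengths(groups))           # total number of participants
grand_mean <- mean(unlist(groups))  # mean of all 30 scores

# Between-group variability: group means around the grand mean
ss_between <- sum(sapply(groups, function(g) length(g) * (mean(g) - grand_mean)^2))
ms_between <- ss_between / (k - 1)

# Within-group variability: scores around their own group mean
ss_within <- sum(sapply(groups, function(g) sum((g - mean(g))^2)))
ms_within <- ss_within / (N - k)

# F-ratio and its p value with (k - 1, N - k) degrees of freedom
f_ratio <- ms_between / ms_within
p_value <- pf(f_ratio, df1 = k - 1, df2 = N - k, lower.tail = FALSE)

c(ss_between = ss_between, ss_within = ss_within, F = f_ratio, p = p_value)
```

The between-group variability is several hundred times larger than the within-group variability, which is what produces the very large F value.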

2.1.3 Do it in R: One-way between-subjects ANOVA

  1. Prepare your data: data should be in long form
group <- c(rep("native", 10), rep("L2_intermediate", 10), rep("L2_advanced",
    10))
score <- c(289, 269, 265, 261, 301, 256, 283, 279, 251, 282,
    53, 40, 23, 12, 77, 31, 50, 36, 57, 38, 89, 120, 118, 149,
    122, 135, 76, 104, 117, 95)
participant <- c(1:30)
one_between_exp_1 <- data.frame(participant, group, score)
one_between_exp_1$group <- as.factor(one_between_exp_1$group)
one_between_exp_1
##    participant           group score
## 1            1          native   289
## 2            2          native   269
## 3            3          native   265
## 4            4          native   261
## 5            5          native   301
## 6            6          native   256
## 7            7          native   283
## 8            8          native   279
## 9            9          native   251
## 10          10          native   282
## 11          11 L2_intermediate    53
## 12          12 L2_intermediate    40
## 13          13 L2_intermediate    23
## 14          14 L2_intermediate    12
## 15          15 L2_intermediate    77
## 16          16 L2_intermediate    31
## 17          17 L2_intermediate    50
## 18          18 L2_intermediate    36
## 19          19 L2_intermediate    57
## 20          20 L2_intermediate    38
## 21          21     L2_advanced    89
## 22          22     L2_advanced   120
## 23          23     L2_advanced   118
## 24          24     L2_advanced   149
## 25          25     L2_advanced   122
## 26          26     L2_advanced   135
## 27          27     L2_advanced    76
## 28          28     L2_advanced   104
## 29          29     L2_advanced   117
## 30          30     L2_advanced    95
  2. Run the ANOVA
# run the one-way ANOVA
# Note: for the t-tests, we ran the code directly (t.test(...)).
# However, in R, the results of a statistical test can also be assigned to an object (a <- t.test(...)).
# Assigning the test results to an object lets us:
#    1. save the results
#    2. perform further tests on the results (such as checking ANOVA assumptions)
one_between_exp_1_result <- aov(score ~ group, data = one_between_exp_1)
# take a look at our results using the summary() function
summary(one_between_exp_1_result)
##             Df Sum Sq Mean Sq F value Pr(>F)    
## group        2 282478  141239   396.4 <2e-16 ***
## Residuals   27   9621     356                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  3. Check the assumptions (we do this after running the ANOVA because the aov object has the residuals already calculated, so we don’t need to calculate them ourselves):
    • Normality
    • Homogeneity of variance
# 1. Normality (of residuals)
# using Shapiro-Wilk test
one_between_exp_1_residuals <- residuals(object = one_between_exp_1_result)
shapiro.test(one_between_exp_1_residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  one_between_exp_1_residuals
## W = 0.98292, p-value = 0.8967
# 2. Homogeneity of variance
library(car)
leveneTest(score ~ group, data = one_between_exp_1)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  2  0.2038 0.8168
##       27
  4. Post-hoc tests
    ANOVA post-hoc tests:
    The ANOVA only tells us whether there is at least one group that is different from the others. But a lot of the time we need more detail than that: we are usually interested in exactly which group(s) stood out. In this case, we can do pairwise t-tests following the ANOVA. But wait a second, didn’t I just say the reason we were running ANOVAs in the first place was that multiple t-tests increase the chance of Type I errors? You are right to ask this question. When doing pairwise t-tests, we can correct for the number of tests, so that the overall chance of a Type I error stays at (or below) 5%. One way to do this is the Bonferroni correction, which divides the overall alpha level by the number of tests being conducted. For example, if you are doing 20 t-tests simultaneously (on the same data set), the alpha level of each test should be \(\alpha = 0.05/20 = 0.0025\) (i.e. you’d only say you found a significant difference if \(p < 0.0025\)). Equivalently, each p value can be multiplied by the number of tests and compared with 0.05, which is how R reports Bonferroni-adjusted p values.
    You might wonder: why don’t we just do corrected pairwise t-tests and forget about ANOVAs? We usually want to run as few statistical tests on our data as possible. ANOVAs give you some immediate, easy-to-interpret overall results. When you have a significant result, then of course you’d have to do the pairwise t-tests to explore that in more detail, but when you don’t find any significant effects with the ANOVA, you are able to stop right there with the conclusion that you found no evidence that any of the groups stood out, so you ran just one test on your data instead of many.
    How to read the pairwise t-test results:
    The pairwise t-test gives us a table of p-values for the pairwise comparisons. Each p-value in a cell is the result of a comparison between the group in the column and the group in the row. We see that all of the pairwise comparisons are highly significant, which means that there are significant differences between each of our three groups.
# one-way ANOVA post hoc: pairwise t-tests with Bonferroni correction
pairwise.t.test(one_between_exp_1$score, one_between_exp_1$group,
                 p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  one_between_exp_1$score and one_between_exp_1$group 
## 
##                 L2_advanced L2_intermediate
## L2_intermediate 1.6e-08     -              
## native          < 2e-16     < 2e-16        
## 
## P value adjustment method: bonferroni
  5. Report the results

The results of a one-way analysis of variance (ANOVA) with one between-subject factor Group (native vs. intermediate L2 vs. advanced L2) indicate a significant main effect of Group (F(2, 27)=396.4, p<0.001). Follow-up pairwise t-tests using the Bonferroni correction method revealed that the main effect was due to significant differences between all pairs of groups (all p’s<0.001) (Native: mean=273.6, s.d.=15.81; Intermediate L2: mean=41.7, s.d.=18.48; Advanced L2: mean=112.5, s.d.=21.85).
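
To see what the Bonferroni correction in the post-hoc step does to the p values, the sketch below runs the pairwise t-tests on the same vocabulary data with and without the correction, and then reproduces the corrected values by multiplying each uncorrected p value by the number of tests (capped at 1). The object names are chosen here for illustration.

```r
# Vocabulary data from the one-way ANOVA example
group <- rep(c("native", "L2_intermediate", "L2_advanced"), each = 10)
score <- c(289, 269, 265, 261, 301, 256, 283, 279, 251, 282,
           53, 40, 23, 12, 77, 31, 50, 36, 57, 38,
           89, 120, 118, 149, 122, 135, 76, 104, 117, 95)

# Pairwise t-tests without and with Bonferroni correction
raw  <- pairwise.t.test(score, group, p.adjust.method = "none")$p.value
bonf <- pairwise.t.test(score, group, p.adjust.method = "bonferroni")$p.value

# Number of pairwise comparisons actually made (empty cells are NA)
n_tests <- sum(!is.na(raw))

# Bonferroni by hand: multiply each p value by the number of tests, cap at 1
manual <- pmin(1, raw * n_tests)

n_tests
all.equal(manual, bonf)
```

Comparing an adjusted p value with 0.05 leads to the same decision as comparing the unadjusted p value with 0.05 divided by the number of tests.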

Further readings

Field, A. P., Miles, J., & Field, Z. (2012). Comparing two means. In Discovering statistics using R.

Field, A. P., Miles, J., & Field, Z. (2012). Comparing several means: ANOVA (GLM 1). In Discovering statistics using R.

Phillips, N. D. (2018). Hypothesis tests. In YaRrr! The pirate’s guide to R.

Phillips, N. D. (2018). ANOVA. In YaRrr! The pirate’s guide to R.

Poldrack, R. A. (2019). Hypothesis testing. In Statistical thinking for the 21st century.

Poldrack, R. A. (2019). Comparing means. In Statistical thinking for the 21st century.