So far, we’ve talked about null hypothesis testing, the normal distribution, and some descriptive statistics. This week and next week, we are going to talk about a few ways to test null hypotheses by comparing means. And for the last week, we are going to talk about regression models.
As researchers, many of the questions we try to answer with statistics boil down to something like “Is there a difference between X and Y?”. The t-test is a simple statistical test that can help us answer such questions by determining whether there is a significant difference between the means of two sets of data.
Look at these questions that we can answer with a t-test:
Notice that sometimes we are interested in how two groups of subjects compare, and sometimes we are interested in how results change within the same group depending on the conditions.
Independent measures, also known as between-subject measures, describes experiments where different participants are measured in each group or condition. For example, we can compare 20 students from university A to 20 students from university B (assuming no one attends both universities simultaneously). When experiments use independent measures, usually the number of data points = the number of participants. Results in the two groups are independent from one another.
 | Group A | Group B |
---|---|---|
participant no. | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 |
Certain topics can only be investigated in independent designs (e.g., comparing between sexes or demographic groups, or a test procedure or treatment that can only be run once). With independent measures, there is less risk of practice or fatigue effects. There is also less risk of data loss due to participant drop out.
Repeated measures, also known as within-subject measures, describes experiments where the same participants are measured in each condition. For example, we might be interested in whether the IQ of a group of 20 children at age 10 has improved compared with their IQ at age 5. In an experiment with repeated measures, each participant is tested (at least) twice. Therefore, you can expect a correlation between results: children who had a higher IQ at age 5 are more likely to have a high IQ at age 10.
 | Condition A | Condition B |
---|---|---|
participant no. | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 |
With repeated measures, participants act as their own control. Therefore, the level of error variance in the data is reduced. Experiments with repeated measures can also be quicker and sometimes cheaper to run than between-subject experiments.
Quick Q: Read the research questions again and identify whether each should be answered with an experiment with independent or repeated measures.
The independent samples t-test is used to compare the mean of two groups in between-subject designs.
All statistical tests have some assumptions about the data. The independent samples t-test assumes that:
The performance of 10 children with specific language impairment (SLI) is compared to 10 typically-developing age-matched control children on a test of non-word repetition.
group_SLI | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
score | 6 | 17 | 2 | 8 | 9 | 3 | 12 | 10 | 5 | 3 |
group_TD | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
---|---|---|---|---|---|---|---|---|---|---|
score | 10 | 14 | 12 | 9 | 18 | 16 | 8 | 20 | 10 | 11 |
In the most general terms, a t statistic is the difference between two means divided by a measure of the variability of that difference.
Standard errors: Imagine randomly picking a sample from the population, for example randomly picking 10 children from all of the children in the world and testing their word repetition accuracy. Each time we do this, the mean score of these 10 children is likely to be slightly different. Now say we do this (randomly picking 10 children) 100 times: we will have a distribution of 100 mean values. The standard deviation of this distribution of means is the standard error (SE).
Calculating the SE from one experiment: But we never conduct the same experiment over and over again! Instead, we can estimate the SE from our one experiment's data: \(SE=\frac{s.d.}{\sqrt{n}}\).
In the case of the independent samples t-test, the t value is the difference between the two group means, divided by the standard error of the difference between the two means:
\[t=\frac{M}{SE}=\frac{M_1-M_2}{\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}}\]
The bigger the difference between the two means, the bigger the t value. The smaller the standard error is (i.e., the more accurately you determine the difference between the means), the bigger the t value is.
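The idea of the standard error as the spread of sample means can be sketched with a small simulation. This is only an illustration: the population mean and standard deviation below are hypothetical values, and 5000 repetitions are used instead of 100 so that the simulated value is stable.

```r
# Sketch: the standard error is the standard deviation of sample means.
set.seed(1)          # arbitrary seed so the run is repeatable
pop_mean <- 10       # hypothetical population mean
pop_sd   <- 4        # hypothetical population standard deviation
n        <- 10       # children per sample

# Draw many samples of n children and keep the mean of each sample
sample_means <- replicate(5000, mean(rnorm(n, mean = pop_mean, sd = pop_sd)))

se_simulated   <- sd(sample_means)     # spread of the sample means
se_theoretical <- pop_sd / sqrt(n)     # s.d. / sqrt(n) with the true s.d.

# In practice we only have one sample, so we estimate the SE from it
one_sample   <- rnorm(n, mean = pop_mean, sd = pop_sd)
se_estimated <- sd(one_sample) / sqrt(n)

round(c(simulated = se_simulated,
        theoretical = se_theoretical,
        estimated = se_estimated), 3)
```

The simulated and theoretical values should be close to each other, while the estimate from a single sample will vary from run to run.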
score <- c(6, 17, 2, 8, 9, 3, 12, 10, 5, 3, 10, 14, 12, 9, 18,
16, 8, 20, 10, 11)
group <- c("SLI", "SLI", "SLI", "SLI", "SLI", "SLI", "SLI", "SLI",
"SLI", "SLI", "TD", "TD", "TD", "TD", "TD", "TD", "TD", "TD",
"TD", "TD")
df <- data.frame(group, score)
df
## group score
## 1 SLI 6
## 2 SLI 17
## 3 SLI 2
## 4 SLI 8
## 5 SLI 9
## 6 SLI 3
## 7 SLI 12
## 8 SLI 10
## 9 SLI 5
## 10 SLI 3
## 11 TD 10
## 12 TD 14
## 13 TD 12
## 14 TD 9
## 15 TD 18
## 16 TD 16
## 17 TD 8
## 18 TD 20
## 19 TD 10
## 20 TD 11
# check normality within each group using the Shapiro-Wilk test
tapply(df$score, df$group, shapiro.test)
## $SLI
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.93703, p-value = 0.5205
##
##
## $TD
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.92153, p-value = 0.3699
# check homogeneity of variance using Levene's test
library(car)
leveneTest(score ~ factor(group), data = df)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.1783 0.6778
## 18
t.test(score ~ group, var.equal = TRUE, data = df)
##
## Two Sample t-test
##
## data: score by group
## t = -2.7027, df = 18, p-value = 0.01457
## alternative hypothesis: true difference in means between group SLI and group TD is not equal to 0
## 95 percent confidence interval:
## -9.419927 -1.180073
## sample estimates:
## mean in group SLI mean in group TD
## 7.5 12.8
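The t value reported above can be reproduced by hand from the formula given earlier. The sketch below re-enters the two groups as plain vectors (the names `sli` and `td` are just for illustration). Because the two groups have the same size, the unpooled standard error in the formula equals the pooled one used with `var.equal = TRUE`.

```r
# Independent samples t-test by hand for the SLI vs. TD example
sli <- c(6, 17, 2, 8, 9, 3, 12, 10, 5, 3)
td  <- c(10, 14, 12, 9, 18, 16, 8, 20, 10, 11)

mean_diff <- mean(sli) - mean(td)                        # M1 - M2
se_diff   <- sqrt(var(sli) / length(sli) +
                  var(td)  / length(td))                 # SE of the difference
t_value   <- mean_diff / se_diff
df_t      <- length(sli) + length(td) - 2                # n1 + n2 - 2

# Two-tailed p value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_t)

round(c(t = t_value, df = df_t, p = p_value), 4)
```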
An independent samples t-test showed that 10 SLI children performed significantly worse [t(18)=-2.703, p=0.015] on a test of non-word repetition (M=7.5, s.d.=4.7) than 10 age-matched typically-developing controls (M=12.8, s.d.=4.05).
Reporting statistics:
Statistical tests need to be reported once the data have been analysed, usually in a short paragraph, containing:
1. Statistical test that was performed
2. The measures that were compared (with the different levels if there are any)
3. The means and standard deviations
4. Significant or non-significant?
5. Specific test value – in this case the t-statistic (later you will see the \(F\)-value, \(\chi^2\), \(r\), etc.)
6. Degrees of freedom – convention dictates that these are placed in rounded parentheses
7. The specific \(p\)-value (as well as confidence intervals/effect size)
Degrees of freedom:
The degrees of freedom are the number of values in a calculation that are free to vary while still achieving a specific outcome. For example, I want a group of five numbers that has a mean of 10. The first four numbers can vary freely, but once they are chosen, the last number is fixed. Therefore, there are 4 degrees of freedom.
10, 11, x, y, z (mean = 10) (x=?? y=?? z=??)
10, 11, 9, 12, x (mean = 10) (x must be 8)
300, 4123, -50, 890, x (mean = 10) (x must be -5213)
In t-tests, degrees of freedom define the shape of the t-distribution used to calculate the p value. \(d.f. = n-1\) is the basic method for calculating degrees of freedom. In the case of independent samples t-test, two means are calculated, thus \(d.f. = n_1 + n_2 -2\). The degree of freedom reflects sample size. The higher the d.f., the more power to reject \(H_0\).
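As a rough illustration of both points above, the sketch below first recovers the "forced" fifth number from the mean example, and then shows how the critical t value for a two-tailed test at \(\alpha = 0.05\) shrinks as the degrees of freedom grow. The particular d.f. values chosen are arbitrary.

```r
# 1. Once four numbers are chosen, the fifth is fixed by the required mean
first_four <- c(10, 11, 9, 12)
target_mean <- 10
last_number <- 5 * target_mean - sum(first_four)
last_number

# 2. The t distribution depends on the degrees of freedom:
#    critical values for a two-tailed test at alpha = 0.05
dfs  <- c(2, 9, 18, 100)          # arbitrary example values
crit <- qt(0.975, df = dfs)
names(crit) <- dfs
round(crit, 3)

# For comparison, the corresponding critical value of the normal distribution
round(qnorm(0.975), 3)
```

With more degrees of freedom, a smaller t value is enough to reach significance, which is one way to see why larger samples give more power.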
The paired samples t-test is used to compare two conditions in repeated measures (within-subject) designs.
The performance of 10 children with SLI is measured on a task of non-word repetition before and after they receive a specialised treatment.
condition | s1 | s2 | s3 | s4 | s5 | s6 | s7 | s8 | s9 | s10 |
---|---|---|---|---|---|---|---|---|---|---|
before | 6 | 17 | 7 | 2 | 8 | 9 | 3 | 12 | 10 | 5 |
after | 8 | 15 | 8 | 5 | 12 | 15 | 5 | 10 | 9 | 8 |
difference | s1 | s2 | s3 | s4 | s5 | s6 | s7 | s8 | s9 | s10 |
---|---|---|---|---|---|---|---|---|---|---|
difference | 2 | -2 | 1 | 3 | 4 | 6 | 2 | -2 | -1 | 3 |
In a paired samples t-test, the t value is the average difference between the two scores, divided by the standard error of the differences:
\[ t=\frac{\bar{x}_{diff}}{\frac{S}{\sqrt{n}}} \]
In our example, the average difference is 1.6, the standard deviation of the difference is 2.63, sample size is 10. So \(t = \frac{1.6}{\frac{2.63}{3.162}} = 1.92\). Our degree of freedom is \(n-1 = 9\).
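The same calculation can be sketched in R. The `after` score for s9 is entered as 9 here, which is the value consistent with the difference row (-1) in the table.

```r
# Paired samples t-test by hand for the before/after example
before <- c(6, 17, 7, 2, 8, 9, 3, 12, 10, 5)
after  <- c(8, 15, 8, 5, 12, 15, 5, 10, 9, 8)

dif     <- after - before                  # one difference per child
n       <- length(dif)
se_dif  <- sd(dif) / sqrt(n)               # standard error of the differences
t_value <- mean(dif) / se_dif
df_t    <- n - 1

round(c(mean_diff = mean(dif), sd_diff = sd(dif), t = t_value, df = df_t), 3)
```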
score <- c(6, 17, 7, 2, 8, 9, 3, 12, 10, 5, 8, 15, 8, 5, 12,
15, 5, 10, 9, 8)
condition <- rep(c("before", "after"), each = 10)
participant <- rep(c(1:10), 2)
paired_exp <- data.frame(participant, condition, score)
paired_exp
## participant condition score
## 1 1 before 6
## 2 2 before 17
## 3 3 before 7
## 4 4 before 2
## 5 5 before 8
## 6 6 before 9
## 7 7 before 3
## 8 8 before 12
## 9 9 before 10
## 10 10 before 5
## 11 1 after 8
## 12 2 after 15
## 13 3 after 8
## 14 4 after 5
## 15 5 after 12
## 16 6 after 15
## 17 7 after 5
## 18 8 after 10
## 19 9 after 9
## 20 10 after 8
# check normality of the differences: first calculate the differences
dif <- with(paired_exp, score[condition == "after"] - score[condition ==
"before"])
# use the Shapiro-Wilk test
shapiro.test(dif)
##
## Shapiro-Wilk normality test
##
## data: dif
## W = 0.94191, p-value = 0.5745
t.test(score ~ condition, data = paired_exp, paired = TRUE)
##
## Paired t-test
##
## data: score by condition
## t = 1.9215, df = 9, p-value = 0.08684
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -0.2836223 3.4836223
## sample estimates:
## mean difference
## 1.6
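An alternative way to run the same test is to pass the two sets of scores as separate vectors instead of using the formula interface. This may be useful because, as far as I am aware, newer versions of R no longer accept `paired = TRUE` together with a formula; treat that as an assumption to check against your own installation. The sketch also shows that a paired t-test is equivalent to a one-sample t-test on the differences.

```r
# Paired t-test using two vectors (after - before)
before <- c(6, 17, 7, 2, 8, 9, 3, 12, 10, 5)
after  <- c(8, 15, 8, 5, 12, 15, 5, 10, 9, 8)

paired_result <- t.test(after, before, paired = TRUE)
paired_result

# Equivalent: one-sample t-test on the differences against 0
diff_result <- t.test(after - before, mu = 0)

c(paired = unname(paired_result$statistic),
  one_sample_on_diff = unname(diff_result$statistic))
```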
A paired samples t-test revealed that the performance of 10 children with SLI did not significantly improve (t(9)=1.92, p=0.087) after the treatment (before: mean=7.9, s.d.=4.43; after: mean=9.5, s.d.=3.57).
Here’s a map of how to choose the correct t-test.
One sample t-test:
For time reasons I did not include it in the lecture. The idea of a one-sample t-test is very simple: is the mean of a group of values significantly different from a specific number? For example, you might be interested in whether the IQ of students at UCL is different from the average (100). To run a one sample t-test in R: t.test(data, mu=100), where mu is the number you want to compare your data with.
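A minimal sketch of a one-sample t-test follows. The IQ scores are made up purely for illustration and are not real data.

```r
# One-sample t-test: is the mean IQ different from 100?
# NOTE: these ten scores are hypothetical, invented for this example
iq <- c(102, 110, 98, 115, 107, 95, 120, 104, 111, 99)

one_sample_result <- t.test(iq, mu = 100)
one_sample_result

# The same t value by hand: (sample mean - mu) / standard error
t_by_hand <- (mean(iq) - 100) / (sd(iq) / sqrt(length(iq)))
t_by_hand
```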
So far with t-tests, we only talked about experiments with one or two groups. What if you have more? If we only had t-tests, we would need to do a t-test with each pair of the groups. If you have three groups A, B, and C, you would need to compare groups A & B, B & C, A & C (3 t-tests). If you have four groups, you would need to do 6 t-tests… That’s a lot of t-tests! Not only would it take a lot of time and effort, doing multiple tests increases the chance of a Type I error as well. This is because at \(\alpha = 0.05\), each time we conduct a statistical test when the null hypothesis is true, we have a 5% chance of making a Type I error, and these chances add up across tests.
The ANOVA is unaffected by this problem because no matter how many groups you have, you will only run one test which provides the main effects and interactions. Therefore, alpha remains at the “desired” level of 0.05.
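The inflation of the Type I error rate with multiple t-tests can be sketched numerically. The calculation assumes the tests are independent, which is not strictly true for pairwise comparisons that share groups, so the numbers are an approximation.

```r
# Chance of at least one Type I error across all pairwise t-tests
# (assumes independent tests, so treat the result as approximate)
alpha    <- 0.05
n_groups <- 3:6
n_tests  <- choose(n_groups, 2)          # number of pairwise comparisons
fwer     <- 1 - (1 - alpha)^n_tests      # familywise error rate

data.frame(n_groups, n_tests, fwer = round(fwer, 3))
```

With three groups the chance of at least one false positive is already about 14%, and with four groups it is about 26%.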
As with t-tests, the \(H_0\) in an ANOVA is the assumption that all means are equal. Unlike t-tests, the \(H_1\) in an ANOVA is that at least one of the means is different from the others (non-directional), and we don’t know which one(s) stand out unless we do further tests.
Whereas t-tests use the mean as the basic statistic to evaluate the null hypothesis, ANOVA uses the variance. Variance refers to a way to represent variability in data: \(S^2 = \frac{\Sigma(x_i - \bar{x})^2}{n-1}\). Variance is the average of the squared differences from the mean, equivalent to the square of the standard deviation.
Conceptually, the ANOVA compares variability due to the experiment manipulation with any other sources of variability:
\[F=\frac{\text{Variability BETWEEN groups}}{\text{Variability WITHIN groups}}\]
The variability (variance) between groups is attributed to the independent variable (the predictor, the experiment manipulation), e.g., people behave differently because they receive different types of treatment. The variability within each group is attributed to chance and individual difference. The F-ratio is the between-group variance divided by the within-group variance.
The F-ratio is the critical statistic when using the ANOVA for null hypothesis testing. The F-ratio must be evaluated with the appropriate degrees of freedom. There are two different degrees of freedom in the ANOVA: one is the number of groups - 1, the other is the number of subjects - the number of groups.
The one-way between-subjects ANOVA is used when you have one between-subject independent variable with two or more groups.
10 native English speakers, 10 intermediate level L2 English speakers, and 10 advanced level L2 English speakers participated in a vocabulary size test. Scores are measured in hundreds of words.
group | s1 | s2 | s3 | s4 | s5 | s6 | s7 | s8 | s9 | s10 |
---|---|---|---|---|---|---|---|---|---|---|
native | 289 | 269 | 265 | 261 | 301 | 256 | 283 | 279 | 251 | 282 |
group | s11 | s12 | s13 | s14 | s15 | s16 | s17 | s18 | s19 | s20 |
---|---|---|---|---|---|---|---|---|---|---|
intermediate L2 | 53 | 40 | 23 | 12 | 77 | 31 | 50 | 36 | 57 | 38 |
group | s21 | s22 | s23 | s24 | s25 | s26 | s27 | s28 | s29 | s30 |
---|---|---|---|---|---|---|---|---|---|---|
advanced L2 | 89 | 120 | 118 | 149 | 122 | 135 | 76 | 104 | 117 | 95 |
Native group: mean = 273.6, s.d. = 15.81
Intermediate L2 group: mean = 41.7, s.d. = 18.48
Advanced L2: mean = 112.5, s.d. = 21.85
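The F-ratio described earlier can be computed by hand for these three groups. The sketch below is an illustration only; the variable names are arbitrary, and the native score for s4 is entered as 261, the value that reproduces the group mean of 273.6.

```r
# One-way between-subjects ANOVA by hand
groups <- list(
  native          = c(289, 269, 265, 261, 301, 256, 283, 279, 251, 282),
  L2_intermediate = c(53, 40, 23, 12, 77, 31, 50, 36, 57, 38),
  L2_advanced     = c(89, 120, 118, 149, 122, 135, 76, 104, 117, 95)
)

all_scores <- unlist(groups)
grand_mean <- mean(all_scores)
k          <- length(groups)        # number of groups
n_total    <- length(all_scores)    # number of subjects

# Variability BETWEEN groups: group means around the grand mean
ss_between <- sum(sapply(groups, function(g) length(g) * (mean(g) - grand_mean)^2))
# Variability WITHIN groups: scores around their own group mean
ss_within  <- sum(sapply(groups, function(g) sum((g - mean(g))^2)))

df_between <- k - 1                 # number of groups - 1
df_within  <- n_total - k           # number of subjects - number of groups

f_ratio <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_ratio, df_between, df_within, lower.tail = FALSE)

round(c(F = f_ratio, df1 = df_between, df2 = df_within), 2)
```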
group <- c(rep("native", 10), rep("L2_intermediate", 10), rep("L2_advanced",
10))
score <- c(289, 269, 265, 261, 301, 256, 283, 279, 251, 282,
53, 40, 23, 12, 77, 31, 50, 36, 57, 38, 89, 120, 118, 149,
122, 135, 76, 104, 117, 95)
participant <- c(1:30)
one_between_exp_1 <- data.frame(participant, group, score)
one_between_exp_1$group <- as.factor(one_between_exp_1$group)
one_between_exp_1
## participant group score
## 1 1 native 289
## 2 2 native 269
## 3 3 native 265
## 4 4 native 261
## 5 5 native 301
## 6 6 native 256
## 7 7 native 283
## 8 8 native 279
## 9 9 native 251
## 10 10 native 282
## 11 11 L2_intermediate 53
## 12 12 L2_intermediate 40
## 13 13 L2_intermediate 23
## 14 14 L2_intermediate 12
## 15 15 L2_intermediate 77
## 16 16 L2_intermediate 31
## 17 17 L2_intermediate 50
## 18 18 L2_intermediate 36
## 19 19 L2_intermediate 57
## 20 20 L2_intermediate 38
## 21 21 L2_advanced 89
## 22 22 L2_advanced 120
## 23 23 L2_advanced 118
## 24 24 L2_advanced 149
## 25 25 L2_advanced 122
## 26 26 L2_advanced 135
## 27 27 L2_advanced 76
## 28 28 L2_advanced 104
## 29 29 L2_advanced 117
## 30 30 L2_advanced 95
# run the one-way ANOVA
# Note: for the t-tests, we ran the code directly (t.test(...)).
# However, in R, the results of a statistical test can also be assigned to an object (a <- t.test(...)).
# Assigning the test results to an object:
# 1. saves the results
# 2. allows us to perform further tests on the results (such as checking ANOVA assumptions)
one_between_exp_1_result <- aov(score ~ group, data = one_between_exp_1)
# take a look at our results using the summary() function
summary(one_between_exp_1_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 282478 141239 396.4 <2e-16 ***
## Residuals 27 9621 356
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 1. Normality (of residuals)
# using Shapiro-Wilk test
one_between_exp_1_residuals <- residuals(object = one_between_exp_1_result)
shapiro.test(one_between_exp_1_residuals)
##
## Shapiro-Wilk normality test
##
## data: one_between_exp_1_residuals
## W = 0.98292, p-value = 0.8967
# 2. Homogeneity of variance
library(car)
leveneTest(score ~ group, data = one_between_exp_1)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.2038 0.8168
## 27
# one-way ANOVA post hoc: pairwise t-tests with Bonferroni correction
pairwise.t.test(one_between_exp_1$score, one_between_exp_1$group,
p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: one_between_exp_1$score and one_between_exp_1$group
##
## L2_advanced L2_intermediate
## L2_intermediate 1.6e-08 -
## native < 2e-16 < 2e-16
##
## P value adjustment method: bonferroni
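What the Bonferroni correction does to each p value can be sketched with a few made-up numbers: every raw p value is multiplied by the number of comparisons and capped at 1. The raw p values below are hypothetical and are not taken from the vocabulary example.

```r
# Bonferroni correction: multiply each p value by the number of tests (max 1)
raw_p <- c(0.010, 0.020, 0.500)     # hypothetical raw p values

adjusted <- p.adjust(raw_p, method = "bonferroni")
manual   <- pmin(1, raw_p * length(raw_p))

data.frame(raw_p, adjusted, manual)
```

Comparing the adjusted p values with 0.05 keeps the overall Type I error rate across the whole set of comparisons at or below 0.05.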
The results of a one-way analysis of variance (ANOVA) with one between-subject factor Group (native vs. intermediate L2 vs. advanced L2) indicate a significant main effect of Group (F(2, 27)=396.4, p<0.001). Follow-up pairwise t-tests using the Bonferroni correction revealed that the main effect was due to significant differences between all pairs of groups (all p’s<0.001) (Native: mean=273.6, s.d.=15.81; Intermediate L2: mean=41.7, s.d.=18.48; Advanced L2: mean=112.5, s.d.=21.85).
Field, A. P., Miles, J., & Field, Z. (2012). Comparing two means. In Discovering statistics using R.
Field, A. P., Miles, J., & Field, Z. (2012). Comparing several means: ANOVA (GLM 1). In Discovering statistics using R.
Phillips, N. D. (2018). Hypothesis tests. In YaRrr! The pirate’s guide to R.
Phillips, N. D. (2018). ANOVA. In YaRrr! The pirate’s guide to R.
Poldrack, R. A. (2019). Hypothesis testing. In Statistical thinking for the 21st century.
Poldrack, R. A. (2019). Comparing means. In Statistical thinking for the 21st century.