The Randomised Controlled Trial (RCT) is regarded by many as the gold standard method for evaluating interventions. In this section we’ll look at an example of RCT data and introduce the basic statistical methods that can be used to analyse the results.
An RCT is effective because it is designed to counteract the systematic biases that can confound the relationship between the exposure and the outcome of interest.
RCTs have become such a bedrock of medical research that standards for reporting them have been developed. The CONSORT flowchart is a useful way of documenting the flow of participants through a trial. CONSORT stands for Consolidated Standards of Reporting Trials, which are endorsed by many medical journals. Indeed, if you plan to publish an intervention study in one of those journals, you are likely to be required to show you have followed the guidelines. The relevant information is available on the 'Enhancing the QUAlity and Transparency Of health Research' (EQUATOR) network website. The EQUATOR network site covers not only RCTs but also the full spectrum of reporting guidelines for many types of clinical research design.
Statisticians often complain that researchers will come along with a collection of data and ask for advice as to how to analyse it. Sir Ronald Fisher (one of the most famous statisticians of all time) commented:
“To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.”
-Sir Ronald Fisher, Presidential Address to the First Indian Statistical Congress, 1938.
His point was that very often the statistician would have advised doing something different in the first place, had they been consulted at the outset. Once the data are collected, it may be too late to rescue the study from a fatal flaw.
The answer to the question “How should I analyse my data?” depends crucially on what hypothesis is being tested.
In the case of an intervention trial, the question will usually be: did intervention X make a difference to outcome Y in people with condition Z?
There is, in this case, a clear null hypothesis: that the intervention was ineffective, and that the outcome for the intervention group would have been just the same if it had not been given. The null hypothesis significance testing approach answers just that question: it tells you how likely your data are if the null hypothesis were true. To do that, you compare the distribution of outcome scores in the intervention group and the control group. In this context we do not simply look at the difference in means between the two groups; usually we also consider whether that difference is greater than a pre-specified effect size.
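To make the logic concrete, here is a minimal simulated sketch (not part of the trial data, and with arbitrary means and SDs) in which the null hypothesis is true by construction, so any difference between groups reflects only the play of chance:
# Simulated example: two groups drawn from the same distribution (the null is true)
set.seed(1)
control      <- rnorm(30, mean = 20, sd = 8)   # hypothetical outcome scores
intervention <- rnorm(30, mean = 20, sd = 8)   # same distribution: no real effect
t.test(intervention, control)   # the p-value says how surprising data like these would be under the null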
To illustrate data analysis, we will use a real dataset that can be
retrieved from the ESRC data
archive. We will focus only on a small subset of the data, which
comes from an intervention study in which teaching assistants
administered an individual reading and language intervention to children
with Down syndrome. A wait-list RCT design was used, but we
will focus here on just the first two phases, in which half the children
were randomly assigned to intervention, and the remainder formed a
control group.
Several language and reading measures were included in the study, giving
a total of 11 outcomes. Here we will illustrate the analysis with just
one of the outcomes - letter-sound coding - which was administered at
baseline (t1) and immediately after the intervention (t2). In this basic
example we will not introduce the pre-specified “effect size” concept in
the hypothesis test.
library(yarrr)      # pirateplot
library(flextable)  # formatted tables
library(ggplot2)    # scatter plots
library(ggpubr)     # ggscatter, stat_regline_equation
library(dplyr)      # %>% pipe
library(rstatix)    # anova_test, get_anova_table
dsdat <- read.csv("dse-rli-trial-data-archive.csv")
dsdat <- dsdat[dsdat$included==1,]            # keep only participants included in the trial
dsdat$group <- as.factor(dsdat$group)
levels(dsdat$group) <- c("Intervention","Control")
dsdat$pretest  <- dsdat$letter_sound_t1       # baseline (t1) letter-sound coding score
dsdat$posttest <- dsdat$letter_sound_t2       # post-intervention (t2) letter-sound coding score
pirateplot(posttest~group, theme=1, cex.lab=1.5, data=dsdat, point.o=1, xlab="", ylab="Post-intervention score")
The figure shows results on letter-sound coding after one group had received the intervention. This test had also been administered at baseline, but we will focus first just on the outcome results.
Raw data should always be inspected prior to any data analysis, in
order to check that the distribution of scores looks sensible. One hears
of horror stories where, for instance, an error code of 999 got included
in an analysis, distorting all the statistics. Or where an outcome score
was entered as 10 rather than 100. Visualising the data is useful when
checking whether the results are in the right numerical range for the
particular outcome measure.
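Simple numeric summaries complement the plot when screening for impossible values; a minimal sketch (the range_ok helper and the 0-100 bounds are purely illustrative, not taken from the test manual):
# Quick screen for miscoded or out-of-range values before any analysis
summary(dsdat[, c("pretest", "posttest")])
range_ok <- function(x, lo, hi) x >= lo & x <= hi     # illustrative helper
which(!range_ok(dsdat$posttest, 0, 100))              # indices of suspect scores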
The pirate plot (https://www.psychologicalscience.org/observer/yarrr-the-pirates-guide-to-r)
is a useful way of showing means and distributions as well as individual
data points.
A related step is to check whether the distribution of the data meets the assumptions of the proposed statistical analysis. Many common statistical procedures assume that continuous variables are normally distributed. Formal statistical approaches to checking assumptions are beyond the scope of this section, but just eyeballing the data is useful, and can detect obvious cases of non-normality, ceiling or floor effects, or clumpy data where only certain values are possible. Data with these features may need special treatment (for example, scale transformations or non-parametric tests). For the data in the figure, although neither group has a classically normal distribution, we do not see major problems with ceiling or floor effects, and there is a reasonable spread of scores in both groups. To keep to basic methods here, we will use parametric approaches (i.e. exploring differences in mean values).
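For a quick visual check of distributional shape in each group, base R histograms (or normal quantile plots) can be eyeballed; a minimal sketch:
# Quick visual check of distributional shape, one panel per group
par(mfrow = c(1, 2))
hist(dsdat$posttest[dsdat$group == "Intervention"], main = "Intervention", xlab = "Post-test score")
hist(dsdat$posttest[dsdat$group == "Control"], main = "Control", xlab = "Post-test score")
par(mfrow = c(1, 1))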
# Descriptive statistics by group
mytab <- psych::describeBy(dsdat$posttest, group=dsdat$group, mat=TRUE, digits=3)
mytab <- mytab[,c(2,4:6)]                    # keep group, n, mean and sd columns
colnames(mytab) <- c('Group','N','Mean','SD')
mytab[1:2,1] <- c("Intervention","Control")
ft <- flextable(mytab)
ft
| Group | N | Mean | SD |
|---|---|---|---|
| Intervention | 28 | 22.286 | 7.282 |
| Control | 26 | 16.346 | 9.423 |
The next step is to compute some basic statistics to get a feel for the effect size. As discussed, the standard deviation (SD) is a key statistic for gauging the size of an intervention effect. In these results, one mean is higher than the other, but there is overlap between the groups. Statistical analysis gives us a way of quantifying how much confidence we can place in the group difference: in particular, how likely results like these would be if there were no real impact of the intervention and the observed difference just reflected the play of chance. In this case the difference between means is around 6 points and the average SD is around 8, so the effect size (Cohen's d) is about .75 - a large effect size for a language intervention.
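As a rough check on that arithmetic, Cohen's d can be computed directly from the group means and a pooled SD; a minimal sketch, assuming dsdat has been prepared as above:
# Cohen's d for the posttest scores: difference in means divided by the pooled SD
m <- tapply(dsdat$posttest, dsdat$group, mean)
s <- tapply(dsdat$posttest, dsdat$group, sd)
n <- tapply(dsdat$posttest, dsdat$group, length)
pooled_sd <- sqrt(((n[1]-1)*s[1]^2 + (n[2]-1)*s[2]^2) / (sum(n)-2))
(m["Intervention"] - m["Control"]) / pooled_sd   # roughly 0.7 with these data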
The simplest way of measuring the intervention effect is therefore just to compare outcome (posttest) measures on a t-test. We can use a one-tailed test with confidence, given that we anticipate outcomes will be better after intervention. One-tailed tests are often treated with suspicion, because they can be used by researchers engaged in p-hacking, but where we predict a directional effect they are entirely appropriate and give greater power than a two-tailed test. When reporting the result of a t-test, researchers should always report all the statistics: the value of t, the degrees of freedom, the means and SDs, and the confidence interval around the mean difference, as well as the p-value. This not only helps readers understand the magnitude and reliability of the effect of interest: it also allows the study to be readily incorporated in a meta-analysis. Results from the t-test are shown in the following table. (Note that with a one-tailed test the confidence interval on one side extends to infinity: this is because a one-tailed test asks only whether the true difference is greater than a specified value, and disregards results that go in the opposite direction.)
# One-tailed t-test: intervention mean predicted to be greater than control mean
myt1 <- t.test(dsdat$posttest~dsdat$group, var.equal=TRUE, alternative='greater')
mytabt <- c(round(myt1$statistic,3), round(myt1$parameter,0), round(myt1$p.value,3),
            round(myt1$estimate[1]-myt1$estimate[2],3), round(myt1$conf.int,3))
mytabt <- as.data.frame(t(mytabt))            # one-row data frame for display
colnames(mytabt) <- c('t','df','p','mean diff.','lowerCI','upperCI')
flextable(mytabt)
| t | df | p | mean diff. | lowerCI | upperCI |
|---|---|---|---|---|---|
| 2.602 | 52 | 0.006 | 5.94 | 2.117 | Inf |
Interpreting this table, we can conclude that the data offer evidence against the null hypothesis, i.e. evidence of a substantial effect of the intervention. Note that the mean difference here can be interpreted as an estimate of the Average Treatment Effect (ATE) in causal language.
The t-test on outcomes is easy to do, but it misses an opportunity to control for one unwanted source of variation, namely individual differences in the initial level of the language measure. For this reason, researchers often prefer to take difference scores (the difference between outcome and baseline measures) and apply a t-test to these. While this has some advantages over relying on raw outcome measures, it also has disadvantages, because the amount of change that is possible from baseline to outcome is not the same for everyone. A child with a very low score at baseline has more "room for improvement" than one who has an average score. For this reason, analysis of difference scores is not generally recommended.
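For completeness, a t-test on difference scores would look like the minimal sketch below; the gain variable is created here purely to illustrate the approach just described, not as a recommended analysis:
# t-test on gain (difference) scores: posttest minus pretest
dsdat$gain <- dsdat$posttest - dsdat$pretest
t.test(gain ~ group, data = dsdat, var.equal = TRUE)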
Rather than taking difference scores, it is preferable to analyse differences in outcome measures after making a statistical adjustment that takes into account the initial baseline scores, using a method known as analysis of covariance or ANCOVA.
Of note: because this is an RCT, we do not expect statistically significant differences between groups in the baseline scores, but we do expect that including the baseline as a covariate will help improve the precision of the estimated effect on the outcome.
pirateplot(pretest~group, theme=1, cex.lab=1.5, data=dsdat, point.o=1, xlab="", ylab="Pre-intervention score")
# Two-sided (Welch) t-test comparing baseline (pretest) scores between groups
myt1 <- t.test(dsdat$pretest~dsdat$group)
mytabt <- c(round(myt1$statistic,3), round(myt1$parameter,0), round(myt1$p.value,3),
            round(myt1$estimate[1]-myt1$estimate[2],3), round(myt1$conf.int,3))
mytabt <- as.data.frame(t(mytabt))
colnames(mytabt) <- c('t','df','p','mean diff.','lowerCI','upperCI')
flextable(mytabt)
| t | df | p | mean diff. | lowerCI | upperCI |
|---|---|---|---|---|---|
| 0.942 | 50 | 0.351 | 2.242 | -2.539 | 7.023 |
As expected given the randomization process, there is no statistically significant difference in pretest scores; nonetheless, pretest scores are associated with the outcome of interest:
ggplot(dsdat, aes(x = pretest, y = posttest, color = group)) +
geom_point() + # Add the points to the plot
scale_color_manual(values = c("Intervention" = "red", "Control" = "blue"))
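The strength of this association can also be quantified with a simple correlation; a minimal sketch:
# Pearson correlation between baseline and outcome scores (pooled across groups)
cor.test(dsdat$pretest, dsdat$posttest)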
Coming back to the ANCOVA method: in practice, this method usually gives results that are very similar to those you would obtain from an analysis of difference scores, but the precision, and hence the statistical power, is greater.
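One way to see this with the current data is to compare the standard error of the group effect from a difference-score model with that from the baseline-adjusted (ANCOVA-style) model; a minimal sketch, with lm_diff and lm_ancova created here just for this comparison:
# Standard error of the group effect under the two approaches
lm_diff   <- lm(I(posttest - pretest) ~ group, data = dsdat)   # difference-score analysis
lm_ancova <- lm(posttest ~ pretest + group, data = dsdat)      # baseline-adjusted analysis
summary(lm_diff)$coefficients["groupControl", "Std. Error"]
summary(lm_ancova)$coefficients["groupControl", "Std. Error"]  # typically smaller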
However, the data do need to meet certain assumptions of the method. We can, for example, start with a plot to check whether there is a linear relationship between pretest and posttest scores in both groups, i.e. whether the points cluster around a straight line.
#plot to check linearity by eye
ggscatter(
dsdat, x = "pretest", y = "posttest",
color = "group", add = "reg.line"
)+
stat_regline_equation(
aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~~"), color = group)
)
Pretest vs posttest scores in the Down syndrome data
Inspection of the plot confirms that the relationship between pretest and posttest looks reasonably linear in both groups. (Note that it also shows that there are slightly more children with very low scores at pretest in the control group. This is just a chance finding - the kind of thing that can easily happen when you have relatively small numbers of children randomly assigned to groups.) Randomization does not always guarantee "perfect" balance, especially with small sample sizes. What matters is balance on average: as we have seen, pretest scores are not significantly different between the two groups, as is expected from randomization.
Now, let's estimate the Average Treatment Effect (ATE), first with only group as a factor, and then adding the pretest score. We use two methods (an ANOVA test and linear regression) that are substantially equivalent:
aov.1 <- dsdat %>% anova_test(posttest ~ group)
aov.1.full <- aov(posttest ~ group, data=dsdat)
aov.1.full$coefficients # the same as the simple linear regression model
## (Intercept) groupControl
## 22.28571 -5.93956
res.aov <- dsdat %>% anova_test(posttest ~ pretest +group)
resaov.1.full <- aov(posttest ~ pretest +group, data=dsdat)
resaov.1.full$coefficients # the same as the bivariable linear regression model
## (Intercept) pretest groupControl
## 10.3645402 0.7762625 -4.1993676
mylm1 <- lm(posttest~group,data=dsdat) # is equivalent to run aov.1
mylm <- lm(posttest~pretest+group,data=dsdat) #is equivalent to run res.aov
To better visualize results:
tab1 <- as.data.frame(get_anova_table(aov.1))
tab2 <- as.data.frame(get_anova_table(res.aov))
tab3 <- as.data.frame(summary(mylm1)$coefficients)
tab4 <- as.data.frame(summary(mylm)$coefficients)
source <-c('Intercept','Pretest','Group')
tab4 <- cbind(source,tab4)
colnames(tab4)[5]<-'p'
colnames(tab3)[4]<-'p'
tab1$p <- round(tab1$p,3)
tab2$p <- round(tab2$p,3)
tab3$p <- round(tab3$p,3)
tab4$p <- round(tab4$p,3)
tab1<-tab1[,-6]
tab2<-tab2[,-6]
ft1<-flextable(tab1)
ft1<-set_caption(ft1,'Analysis of posttest only (ANOVA)')
ft1
| Effect | DFn | DFd | F | p | ges |
|---|---|---|---|---|---|
| group | 1 | 52 | 6.773 | 0.012 | 0.115 |
ft2 <- flextable(tab2)
ft2<-set_caption(ft2,'Analysis adjusted for pretest scores (ANCOVA)')
ft2
| Effect | DFn | DFd | F | p | ges |
|---|---|---|---|---|---|
| pretest | 1 | 51 | 94.313 | 0.000 | 0.649 |
| group | 1 | 51 | 9.301 | 0.004 | 0.154 |
# ft3 shows the equivalent results from linear regression
ft3<-flextable(tab3)
ft3<-set_caption(ft3,'Linear regression with group as predictor' )
ft3
| Estimate | Std. Error | t value | p |
|---|---|---|---|
| 22.28571 | 1.583656 | 14.072320 | 0.000 |
| -5.93956 | 2.282291 | -2.602455 | 0.012 |
ft4<-flextable(tab4)
ft4<-set_caption(ft4,'Linear regression with pretest and group as predictor' )
ft4
| source | Estimate | Std. Error | t value | p |
|---|---|---|---|---|
| Intercept | 10.3645402 | 1.55058301 | 6.684286 | 0.000 |
| Pretest | 0.7762625 | 0.07993237 | 9.711491 | 0.000 |
| Group | -4.1993676 | 1.37698465 | -3.049684 | 0.004 |
The tables show the same data analysed in two ways: first using ANOVA or simple linear regression to compare only the post-test scores (i.e. the ATE), and then using ANCOVA or the corresponding two-predictor linear regression model to adjust the scores for the baseline (pretest) values (i.e. the CATE: the effect conditional on the pretest value). The effect size in the ANOVA/ANCOVA approaches is shown as ges, which stands for "generalised eta squared". You can see that the ATE estimated by the regression coefficient of the simple linear regression model with only group as an independent variable is the same (apart from sign) as the mean difference reported by the t-test. Of note: if you want to explore this method (ANCOVA) further, see for example: https://www.datanovia.com/en/lessons/ancova-in-r/.
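To report the baseline-adjusted estimate with a confidence interval, the regression framing makes this straightforward; a minimal sketch using the mylm model fitted above:
# 95% confidence intervals for the coefficients of the baseline-adjusted model
confint(mylm)
# The "groupControl" row gives the interval for the adjusted group difference
# (negative values indicate lower scores in the Control group).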