Load the required library:

require(tigerstats)

Introduction: sampling basics

Recall that most fundamental epidemiological/health research questions actually break down into three parts.

To answer any of these questions, the starting point involves dealing with the sample.

In order to make valid conclusions about any research question, we first need to make sure we are dealing with a good sample.

Here we will discuss various techniques for drawing samples, with some notes about the strengths and weaknesses of these different sampling techniques.

Population versus Sample

An important distinction to make clear before we go any further is the distinction between a sample and a population.

Population

A population is the set of all subjects of interest.

Sample

A sample is the subset of the population in which we measure data.

Let’s consider these two definitions with a research question.

Research Question: In the United States, what is the mean height of adult males (18 years +)?

The population that we are dealing with in this case is all U.S. adult males. One way to find an exact answer to this research question would be to survey the entire population. However, this is nearly impossible! It would be much quicker and easier to measure only a subset of the population, a sample.

However, if we want our sample to be an accurate reflection of the population, we can’t just choose any sample that we wish.

The way in which we collect our sample is very important.

For the time being, let’s suppose that we were able to choose an appropriate sample (and we’ll talk more about how this is done later).
Suppose that our sample of U.S. men is an accurate representation of the U.S. population of men. Then we might discuss two different means: the mean height of the sample and the mean height of the population. These are both descriptions, as opposed to inferences. There are a couple of differences, however:

  • Mean Height of the Sample: a statistic, computed directly from the subset we measured.

  • Mean Height of the Population: a parameter, which in practice is typically unknown.

Our goal is to use the information we’ve gathered from the sample to infer, or predict, something about the population.
For our example, we want to predict the population mean, using our knowledge of the sample. The accuracy of our sample mean relies heavily upon how well our sample represents the population at large. If our sample does a poor job at representing the population, then any inferences that we make about the population are also going to be poor. Thus, it is very important to select a good sample!

Note: If we already knew everything about a population, it would be useless to gather a sample in order to infer something about the population. We would already have this information! Using statistics as an inferential tool means that you don’t have information about the entire population to start with. If you were able to measure the entire population, that would be called a census.

Types of Samples

There are two main kinds of sampling: random and non-random.

Non-random sampling is a technique in which the researcher selects the sample based on subjective judgment rather than random selection. This method depends heavily on the expertise of the researcher; it is carried out by observation, and it is widely used in qualitative research. Not all members of the population have an equal chance of participating in the study, unlike random sampling, where each member of the population has a known chance of being selected. Non-probability sampling is most useful for exploratory studies like a pilot survey (deploying a survey to a smaller sample than the pre-determined sample size). Researchers use this method in studies where random probability sampling is impossible due to time or cost considerations. We do not explore this issue further here (most of the time it is also related to data availability issues). Note that in clinical and epidemiological studies we assume that the ideal sampling scheme is a random sample of the target population, even if the treatment assignment in the sample is not random (unless you are doing a Randomized Clinical Trial, the gold standard!).

Random Sampling

There are basically four different methods of random sampling:

  • Simple Random Sampling (SRS)
  • Systematic Sampling
  • Stratified Sampling
  • Cluster Sampling

The simple random sample (SRS) is the basic type of sample. The other types are included to give you comparisons to the SRS and to aid you if you want to explore these topics further in the future.

It will be helpful to work with an example as we describe each of these methods, so let’s use the following set of 28 students from FakeSchool as our target population from which we will sample.

data(FakeSchool)
head(FakeSchool)
##   Students Sex class  GPA Honors
## 1    Alice   F    Fr 3.80    Yes
## 2     Brad   M    Fr 2.60    Yes
## 3    Caleb   M    Fr 2.25     No
## 4    Daisy   F    Fr 2.10     No
## 5     Faye   F    Fr 2.00     No
## 6      Eva   F    Fr 1.80     No

This is a data frame with 28 observations on the following 5 variables:

  • Students: Name of each student

  • Sex: sex of the student

  • class: class rank of the student

  • GPA: grade point average

  • Honors: whether or not the student is in the Honors Program

Keep in mind that we would not know information about an entire population in real life. We are using this “population” for demonstration purposes only.

Our goal is to describe how the different sampling techniques are implemented and to comment on their strengths and weaknesses.

We can easily compute the true mean GPA for the students at FakeSchool by averaging the values in the fourth column of the dataset.

This will be the population mean. We will call it \(\mu\) (“mu”).

mu <- mean(~GPA,data=FakeSchool)
mu
## [1] 2.766429

Again, the population parameter, \(\mu\), is not typically known !!! If it were known, there would be no reason to estimate it. However, the point of this example is to practice selecting different types of samples and to compare the performance of these different sampling techniques.

Simple Random Sample

In simple random sampling, for a given sample size \(n\) every set of \(n\) members of the population has the same chance to be the sample that is actually selected.

We often use the acronym SRS as an abbreviation for “simple random sampling”.

Intuitively, let’s think of simple random sampling as follows: we find a big box, and for each member of the population we put into the box a ticket that has the name of the individual written on it. All tickets are the same size and shape. Mix up the tickets thoroughly in the box. Then pull out a ticket at random, set it aside, pull out another ticket, set it aside, and so on until the desired number of tickets have been selected.
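
To make the analogy concrete, here is a minimal base-R sketch of the ticket-box procedure (the popsamp function used below does the same thing more conveniently):

# Draw 7 "tickets" (rows) at random, without replacement
FakeSchool[sample(nrow(FakeSchool), 7), ]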

Let’s select a simple random sample of 7 elements without replacement.

We can accomplish this easily with the popsamp function (from the tigerstats package loaded above). This function requires two pieces of information:

  • the size of the sample
  • the dataset from which to draw the sample

Remember that sampling without replacement means that once we draw an element from the population, we do not put it back so that it can be drawn again.
We would not want to draw with replacement, as this could result in a sample containing the same person more than once, which would not be a good representation of the entire school. By default, the popsamp function samples without replacement. If you want to sample with replacement, you need to add a third argument to the function: replace=TRUE. Typically, we sample without replacement in most cases. [Some exceptions arise in propensity score-based methods, for the purpose of matching (see examples in block 3).]

Since we may want to access this sample later, it’s a good idea to store our sample in an object.

set.seed(314159)
srs <- popsamp(7,FakeSchool)
srs
##    Students Sex class GPA Honors
## 10    Chris   M    So 4.0    Yes
## 2      Brad   M    Fr 2.6    Yes
## 16 Brittany   F    Jr 3.9     No
## 20   Eliott   M    Jr 1.9     No
## 21    Garth   M    Jr 1.1     No
## 23      Bob   M    Sr 3.8    Yes
## 13     Eric   M    So 2.1     No

Let’s now calculate the mean GPA for the 7 sampled students. This will be the sample mean, \(\bar{x}_{srs}\). We will use the subscript ‘srs’ to remind ourselves that this is the sample mean for the simple random sample.

xbar.srs <- mean(~GPA,data=srs)
xbar.srs
## [1] 2.771429

Strengths

  • The selection of one element does not affect the selection of others.

  • Each possible sample, of a given size, has an equal chance of being selected.

  • Simple random samples tend to be good representations of the population.

  • Requires little knowledge of the population.

Weaknesses

  • If there are small subgroups within the population, a SRS may not give an accurate representation of that subgroup. In fact, it may not include it at all. This is especially true if the sample size is small, as in our example.

  • If the population is large and widely dispersed, it can be costly (both in time and money) to collect the data.

Systematic Sample

In a systematic sample, the members of the population are put in a row.
Then 1 out of every \(k\) members is selected.
The starting point is randomly chosen from the first \(k\) elements, and then elements are sampled at the same location in each subsequent segment of size \(k\).

To illustrate the idea, let’s take a 1-in-4 systematic sample from our FakeSchool population.

We will start by randomly selecting our starting element.

set.seed(49464)
start=sample(1:4,1)
start
## [1] 4

So, we will start with element 4, which is Daisy and choose every 4th element after that for our sample.
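
One minimal way to pull out that 1-in-4 sample and compute its mean GPA (the names sys.samp and xbar.sys are illustrative):

# Keep every 4th row, beginning at the random start
sys.samp <- FakeSchool[seq(from = start, to = nrow(FakeSchool), by = 4), ]
sys.samp
xbar.sys <- mean(~GPA, data = sys.samp)
xbar.sys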

##    Students Sex class GPA Honors
## 4     Daisy   F    Fr 2.1     No
## 8    Andrea   F    So 4.0    Yes
## 12   Felipe   M    So 3.0     No
## 16 Brittany   F    Jr 3.9     No
## 20   Eliott   M    Jr 1.9     No
## 24     Carl   M    Sr 3.1     No
## 28    Grace   F    Sr 1.4     No
## [1] 2.771429

The mean GPA of the systematic sample, the sample mean, \(\bar{x}_{sys}\), is 2.7714286.

Strengths

  • Assures a random sampling of the population.

  • When the population is an ordered list, a systematic sample could give a better representation of the population than a SRS.

  • Can be used in situations where a SRS is difficult or impossible. It is especially useful when the population that you are studying is arranged in time.

For example, suppose you are interested in the average amount of money that people spend at the grocery store on a Wednesday evening. A systematic sample could be obtained by selecting every 10th person who walks into the store.

Weaknesses

  • Not every combination has an equal chance of being selected. Many combinations will never be selected using a systematic sample.

  • Beware of periodicity in the population. If, after ordering, the selections match some pattern in the list (skip interval), the sample may not be representative of the population.

Stratified Sample

In a stratified sample, the population must first be separated into homogeneous groups, or strata.
Each element belongs to only one stratum, and each stratum consists of elements that are alike in some way.
A simple random sample is then drawn from each stratum, and these samples are combined to form the stratified sample.

Let’s take a stratified sample of 7 elements from FakeSchool using the following strata: Honors, Not Honors.
First, let’s determine how many elements belong to each stratum:
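
A quick way to get these counts (a one-line sketch using the same formula interface as mean() above):

xtabs(~Honors, data = FakeSchool)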

## Honors
##  No Yes 
##  16  12

So there are 12 Honors students and 16 non-Honors students at FakeSchool.

There are various ways to determine how many students to include from each stratum.
For example, you could choose to select the same number of students from each stratum.
Another strategy is to use a proportionate stratified sample.
In a proportionate stratified sample, the number of students selected from each stratum is proportional to the representation of the stratum in the population.
For example, \(\frac{12}{28} \times 100\% \approx 42.86\%\) of the population are Honors students.
This means that there should be \(0.4286 \times 7 \approx 3\) Honors students in the sample, so there should be \(7-3=4\) non-Honors students.
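
The same allocation in code (a tiny sketch mirroring the arithmetic above):

n.honors <- round(12/28 * 7) # 3 Honors students
n.nonhonors <- 7 - n.honors  # 4 non-Honors students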

Let’s go through the coding to draw these samples.
Check out how we use the subset function to pull out the Honors students from the rest of the population:

set.seed(1837)
honors=subset(FakeSchool,Honors=="Yes")
honors
##    Students Sex class  GPA Honors
## 1     Alice   F    Fr 3.80    Yes
## 2      Brad   M    Fr 2.60    Yes
## 8    Andrea   F    So 4.00    Yes
## 9     Betsy   F    So 4.00    Yes
## 10    Chris   M    So 4.00    Yes
## 11    Dylan   M    So 3.50    Yes
## 15     Adam   M    Jr 3.98    Yes
## 17   Cassie   F    Jr 3.75    Yes
## 18    Derek   M    Jr 3.10    Yes
## 19    Faith   F    Jr 2.50    Yes
## 22   Angela   F    Sr 4.00    Yes
## 23      Bob   M    Sr 3.80    Yes

Next, we take a SRS of size 3 from the Honors students:

honors.samp=popsamp(3,honors)
honors.samp
##    Students Sex class  GPA Honors
## 15     Adam   M    Jr 3.98    Yes
## 19    Faith   F    Jr 2.50    Yes
## 18    Derek   M    Jr 3.10    Yes

The same method will work for non-Honors students.

set.seed(17365)
nonhonors=subset(FakeSchool,Honors=="No") 
nonhonors.samp=popsamp(4,nonhonors) 
nonhonors.samp
##    Students Sex class GPA Honors
## 26    Frank   M    Sr 2.0     No
## 28    Grace   F    Sr 1.4     No
## 13     Eric   M    So 2.1     No
## 25    Diana   F    Sr 2.9     No

We can put this together to create our stratified sample.
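
A minimal sketch of this step (the names strat.samp and xbar.strat are illustrative):

strat.samp <- rbind(honors.samp, nonhonors.samp)
strat.samp
xbar.strat <- mean(~GPA, data = strat.samp)
xbar.strat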

##    Students Sex class  GPA Honors
## 15     Adam   M    Jr 3.98    Yes
## 19    Faith   F    Jr 2.50    Yes
## 18    Derek   M    Jr 3.10    Yes
## 26    Frank   M    Sr 2.00     No
## 28    Grace   F    Sr 1.40     No
## 13     Eric   M    So 2.10     No
## 25    Diana   F    Sr 2.90     No
## [1] 2.568571

The sample mean for the stratified sample, \(\bar{x}_{strat}\), is 2.5685714.

Strengths

  • Representative of the population, because elements from all strata are included in the sample.

  • Ensures that specific groups are represented, sometimes even proportionally, in the sample.

  • Since each stratified sample will be distributed similarly, the amount of variability between samples is decreased.

  • Allows comparisons to be made between strata, if necessary. For example, a stratified sample allows you to easily compare the mean GPA of Honors students to the mean GPA of non-Honors students.

Weaknesses

  • Requires prior knowledge of the population. You have to know some characteristics about the population to be able to split into strata.

Cluster Sample

In cluster sampling the population is partitioned into groups, called clusters. The clusters, which are composed of elements, are not necessarily of the same size. Each element should belong to one cluster only and none of the elements of the population should be left out.

The clusters, and not the elements, become the units to be sampled.

Whenever a cluster is sampled, every element within it is observed.

In cluster sampling, usually only a few clusters are sampled.

Hence, in order to increase the precision of the estimates, the population should be partitioned into clusters in such a way that the clusters will have similar mean values. As the elements inside the clusters are not sampled, the variance within clusters does not contribute to the sampling variance of the estimators.

Cluster sampling is often more cost effective than other sampling designs, as one does not have to sample all the clusters. However, if the size of a cluster is large it might not be possible to observe all its elements.

Cluster sampling could be used when natural groups are evident in the population.
The clusters should all be similar to each other: each cluster should be a small-scale representation of the population.
To take a cluster sample, a random sample of the clusters is chosen. The elements of the randomly chosen clusters make up the sample.

Let’s take now a cluster sample using the grade level (freshmen, sophomore, junior, senior) of FakeSchool as the clusters.
Let’s take a random sample of 2 of them. Remember that this is really a basic-level example (single-stage cluster sampling).

set.seed(17393)
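# Note: this draws 2 of the 28 class entries, so the same level could in
# principle appear twice; sample(levels(FakeSchool$class), 2) would
# guarantee two distinct clusters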
clusters=sample(FakeSchool$class,2,replace=FALSE)
clusters
## [1] Fr Sr
## Levels: Fr Jr So Sr
cluster1=subset(FakeSchool,class==clusters[1])
cluster2=subset(FakeSchool,class==clusters[2])
clust.samp=rbind(cluster1,cluster2)
clust.samp
##    Students Sex class  GPA Honors
## 1     Alice   F    Fr 3.80    Yes
## 2      Brad   M    Fr 2.60    Yes
## 3     Caleb   M    Fr 2.25     No
## 4     Daisy   F    Fr 2.10     No
## 5      Faye   F    Fr 2.00     No
## 6       Eva   F    Fr 1.80     No
## 7     Georg   M    Fr 1.40     No
## 22   Angela   F    Sr 4.00    Yes
## 23      Bob   M    Sr 3.80    Yes
## 24     Carl   M    Sr 3.10     No
## 25    Diana   F    Sr 2.90     No
## 26    Frank   M    Sr 2.00     No
## 27       Ed   M    Sr 1.50     No
## 28    Grace   F    Sr 1.40     No
xbar.clust=mean(clust.samp$GPA)
xbar.clust
## [1] 2.475

The sample mean for the clustered sample, \(\bar{x}_{clust}\), is 2.475.

Strengths

  • Makes it possible to sample if there is no list of the entire population, but there is only a list of subpopulations.
    For example, there is not a list of all church members in the United States. However, there is a list of churches that you could sample and then acquire the members list from each of the selected churches.

Weaknesses

  • Not always representative of the population. Elements within clusters could be similar to one another based on some characteristic(s). This can lead to over-representation or under-representation of those characteristics in the sample.

Some considerations about the different sampling methods

Note that we have not discussed the error of the sampling estimates here [we will review some related ideas when talking about sample size].

In general, cluster sampling is more economical and feasible than SRS.

However, we must point out that the standard errors of estimates obtained from cluster sampling are often high compared with those obtained from samples of the same number of listing units chosen by other sampling designs.

The reason for this situation is that listing units within the same cluster are often homogeneous with respect to many characteristics.

For example, households on the same block are often quite similar with respect to socioeconomic status, ethnicity, and other variables.

Because of homogeneity among listing units within the same cluster, selection of more than one household within the same cluster, as is done in cluster sampling, is in a sense redundant. The effect of this redundancy becomes evident in the high standard errors of estimates that are often seen in cluster sampling.

If we were to choose between cluster sampling and some alternative design solely on the basis of cost or feasibility, cluster sampling would inevitably be the sampling design of choice.

On the other hand, if we were to choose a design solely on the basis of reliability of estimates, then cluster sampling would rarely be the design of choice.

However, because it is possible to take a larger sample for a fixed cost with cluster sampling, greater precision may sometimes be attained than is possible with other methods.

Generally, in choosing between cluster sampling and alternatives, we use criteria that incorporate both reliability and cost.

In fact, we generally choose the sampling design that gives the lowest possible standard error at a specified cost or, conversely, the sampling design that yields, at the lowest cost, estimates having pre-specified standard error (precision).

Randomized assignment to treatment

Randomization is a statistical technique used to ensure the comparability of treatment groups in clinical trials by introducing a deliberate element of chance.

First of all, it tends to balance known and unknown covariates and, thus, to produce structural equality of the treatment groups (exchangeability).

Second, by ensuring effective blinding of treatment allocations from investigators and patients, randomization helps to avoid bias caused by the selection of patients.

Finally, randomization contributes to the internal validity of a trial that provides the basis for statistical inference (and the estimation of causal effects in the population).

The randomization concept was first introduced by R. A. Fisher in his classic book “The Design of Experiments” (1935). The importance of randomization for clinical trials was first noted in the 1940s by the epidemiologist Sir A. Bradford Hill, who realized that successful blinding of treatment allocations was impossible without randomization.

Since that time, regulators have advocated the use of randomization in their guidelines, and several different randomization procedures have been proposed in the literature. It has been noticed that different randomization procedures behave differently, e.g., concerning their susceptibility to bias and their potential to control power and the type I error probability. An overview containing the latest developments can be found in Rosenberger and Lachin (2016).

We will use here the R library randomizr to illustrate some basic examples of randomization procedures.

Install and then load the required library:

library(randomizr)

There are two main purposes for the generation of randomization sequences:

  1. The generation of a single sequence for the allocation of patients in a clinical trial.
  2. The generation of multiple sequences in order to assess the statistical properties of a randomization procedure (long-run frequentist approach!).

A hypothetical experiment

Suppose that we are conducting an experiment among the 592 individuals in the built-in HairEyeColor dataset. As we’ll see, there are many ways to randomly assign subjects to treatments. We’ll step through three common designs, each associated with one of the randomizr functions: simple_ra(), complete_ra(), and block_ra(). [Not covered here: cluster_ra() and block_and_cluster_ra().]

We first need to transform the dataset, which has each row describe a type of subject, to a new dataset in which each row describes an individual subject.

# Load built-in dataset
data(HairEyeColor)
HairEyeColor <- data.frame(HairEyeColor)
head(HairEyeColor)
##    Hair   Eye  Sex Freq
## 1 Black Brown Male   32
## 2 Brown Brown Male   53
## 3   Red Brown Male   10
## 4 Blond Brown Male    3
## 5 Black  Blue Male   11
## 6 Brown  Blue Male   50
dim(HairEyeColor)
## [1] 32  4
# Transform so each row is a subject
# Columns describe subject's hair color, eye color, and gender
hec <- HairEyeColor[rep(1:nrow(HairEyeColor),
                        times = HairEyeColor$Freq), 1:3]

N <- nrow(hec)

# Fix the rownames
rownames(hec) <- NULL
head(hec)
##    Hair   Eye  Sex
## 1 Black Brown Male
## 2 Black Brown Male
## 3 Black Brown Male
## 4 Black Brown Male
## 5 Black Brown Male
## 6 Black Brown Male
dim(hec)
## [1] 592   3

Typically, researchers know some basic information about their subjects before deploying treatment. For example, they usually know how many subjects there are in the study population (N), and they usually know some basic demographic information about each subject.

Our new dataset has 592 subjects. We have three pre-treatment covariates: Hair, Eye, and Sex, which describe the hair color, eye color, and gender of each subject.

Let’s suppose that the treatment under study aims to improve a certain numerical “self-confidence” score.

For didactic purposes, we now will create simulated potential outcomes.

We’ll call the untreated outcome Y0 and we’ll call the treated outcome Y1.

Imagine that in the absence of any intervention, the outcome (Y0) is correlated with our pre-treatment covariates.

Imagine further that the effectiveness of the intervention also varies according to these covariates, i.e., the difference between Y1 and Y0 is correlated with the pre-treatment covariates.

If we were really running an experiment, we would only observe either Y0 or Y1 for each subject, but since we are simulating, we generate both.

Our inferential target is the average treatment effect (ATE), which is defined as the average difference between Y0 and Y1.

# Set the random seed for reproducibility
set.seed(343)

# Create untreated and treated outcomes for all subjects
hec <- within(hec,{
  Y0 <- rnorm(n = N,mean = (2*as.numeric(Hair) + -4*as.numeric(Eye) + -6*as.numeric(Sex)), sd = 5)
  Y1 <- Y0 + 6*as.numeric(Hair) + 4*as.numeric(Eye) + 2*as.numeric(Sex)
})

hist(hec$Y0, xlab="Potential Outcome Score if untreated")

hist(hec$Y1, xlab="Potential Outcome Score if treated")

hist(hec$Y1-hec$Y0, xlab="Distribution of potential outcomes differences")
abline(v=25.26, col="red", lwd=2)

# Calculate the true ATE
with(hec, mean(Y1 - Y0))
## [1] 25.26351

Therefore, the ATE in the population is 25.26: in this example there is an average improvement in the self-confidence score of 25.26 units in the treated with respect to the untreated. This comes from an ideal experiment: treating all subjects vs. not treating all subjects! We are now ready to randomly allocate treatment assignments to subjects. Let’s start by contrasting simple and complete random assignment.

Simple random assignment

Simple random assignment assigns all subjects to treatment with an equal probability by flipping a (possibly weighted) coin for each subject. The main trouble with simple random assignment is that the number of subjects assigned to treatment is itself a random number: depending on the random assignment, a different number of subjects might be assigned to each group.

The simple_ra() function has one required argument N, the total number of subjects. If no other arguments are specified, simple_ra() assumes a two-group design and a 0.50 probability of assignment.

Z <- simple_ra(N = N)
table(Z)
## Z
##   0   1 
## 301 291

To change the probability of assignment (i.e. using a weighted coin), you should specify the prob argument:

Z <- simple_ra(N = N, prob = 0.30)
table(Z)
## Z
##   0   1 
## 402 190

If you specify num_arms without changing prob_each, simple_ra() will assume equal probabilities across all arms.

Z <- simple_ra(N = N, num_arms = 3)
table(Z)
## Z
##  T1  T2  T3 
## 191 215 186

You can also just specify the specific probabilities of your multiple arms. These probabilities must sum to 1.

Z <- simple_ra(N = N, prob_each = c(.2, .2, .6))
table(Z)
## Z
##  T1  T2  T3 
## 118 119 355

You can also give a name to your treatment arms.

Z <- simple_ra(N = N, prob_each = c(.2, .2, .6),
               conditions=c("control", "placebo", "treatment"))
table(Z)
## Z
##   control   placebo treatment 
##       132       108       352

Complete random assignment

Complete random assignment is very similar to simple random assignment, except that the researcher can specify exactly how many units should be assigned to each condition.

The syntax for complete_ra() is very similar to that of simple_ra().

The argument m is the number of units assigned to treatment in two-arm designs; it is analogous to simple_ra()’s prob. Similarly, the argument m_each is analogous to prob_each.

If you only specify N, complete_ra() assigns exactly half of the subjects to each of the two treatments.

Z <- complete_ra(N = N)
table(Z)
## Z
##   0   1 
## 296 296

To change the number of units assigned, specify the m argument:

Z <- complete_ra(N = N, m = 200)
table(Z)
## Z
##   0   1 
## 392 200

If you specify multiple arms, complete_ra() will assign an equal (up to rounding) number of units to each arm.

Z <- complete_ra(N = N, num_arms = 3)
table(Z)
## Z
##  T1  T2  T3 
## 197 198 197

You can also specify exactly how many units should be assigned to each arm. The total of m_each must equal N.

Z <- complete_ra(N = N, m_each = c(100, 200, 292))
table(Z)
## Z
##  T1  T2  T3 
## 100 200 292

You can also name your treatment arms:

Z <- complete_ra(N = N, m_each = c(100, 200, 292),
               conditions = c("control", "placebo", "treatment"))
table(Z)
## Z
##   control   placebo treatment 
##       100       200       292

OPTIONAL: stratified random assignment

Stratified random assignment is a powerful tool when used well.

In this design, subjects are sorted into strata (usually according to their pre-treatment covariates), and then complete random assignment is conducted within each stratum. For example, a researcher might stratify on gender, assigning exactly half of the men and exactly half of the women to treatment.

Why stratify?

The first reason is related to heterogeneity of treatment effect (HTE): is the treatment effect different for men versus women?

Of course, such heterogeneity could also be explored if complete random assignment had been used, but stratifying on a covariate defends a researcher (somewhat) against claims of data dredging.

The second reason is to increase precision. If the stratifying variables are associated with the outcome, then strata may help to decrease sampling variability.

It’s important, however, not to overstate these advantages. The gains from a stratified design can often be realized through covariate adjustment alone.

Stratifying can also produce some complications for estimation. For example, it can produce different probabilities of assignment for different subjects, related to their characteristics. This complication is typically addressed in one of two ways: “controlling for strata” in a regression context, or the use of inverse probability weights (IPW), in which units are weighted by the inverse of the probability that the unit is in the condition that it is in (see the examples in block 3 for more details; a small sketch follows the block_m_each example below).

In the randomizr library, the word block is used in place of stratum. Note that this name could be confused with block randomization, which is not what this function implements (to explore block randomization, see the blockrand package: https://cran.r-project.org/web/packages/blockrand/index.html).

The only required argument to block_ra() is blocks, which is a vector of length N that describes which block (strata) a unit belongs to. Blocks (strata) can be a factor, character, or numeric variable. If no other arguments are specified, block_ra() assigns an approximately equal proportion of each block to treatment.

Z <- block_ra(blocks = hec$Hair)
table(Z, hec$Hair)
##    
## Z   Black Brown Red Blond
##   0    54   143  35    64
##   1    54   143  36    63

For multiple treatment arms, use the num_arms argument, with or without the conditions argument:

Z <- block_ra(blocks = hec$Hair, num_arms = 3)
table(Z, hec$Hair)
##     
## Z    Black Brown Red Blond
##   T1    36    95  24    43
##   T2    36    96  23    42
##   T3    36    95  24    42
Z <- block_ra(blocks = hec$Hair, conditions = c("Control", "Placebo", "Treatment"))
table(Z, hec$Hair)
##            
## Z           Black Brown Red Blond
##   Control      36    95  23    42
##   Placebo      36    96  24    43
##   Treatment    36    95  24    42

block_ra() provides a number of ways to adjust the number of subjects assigned to each condition. The prob_each argument describes what proportion of each block should be assigned to each treatment arm. Note, of course, that block_ra() still uses complete random assignment within each block; the appropriate number of units to assign to treatment within each block is automatically determined.

Z <- block_ra(blocks = hec$Hair, prob_each = c(.3, .7))
table(Z, hec$Hair)
##    
## Z   Black Brown Red Blond
##   0    32    86  21    38
##   1    76   200  50    89

For finer control, use the block_m_each argument, which takes a matrix with as many rows as there are blocks, and as many columns as there are treatment conditions. Remember that the rows are in the same order as sort(unique(blocks)), a command that is good to run before constructing a block_m_each matrix.

sort(unique(hec$Hair))
## [1] Black Brown Red   Blond
## Levels: Black Brown Red Blond
block_m_each <- rbind(c(78, 30),
                      c(186, 100),
                      c(51, 20),
                      c(87,40))

block_m_each
##      [,1] [,2]
## [1,]   78   30
## [2,]  186  100
## [3,]   51   20
## [4,]   87   40
Z <- block_ra(blocks = hec$Hair, block_m_each = block_m_each)
table(Z, hec$Hair)
##    
## Z   Black Brown Red Blond
##   0    78   186  51    87
##   1    30   100  20    40
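
This design gives subjects in different blocks different probabilities of treatment, which is exactly the estimation complication flagged above. As a hedged sketch of the IPW idea (assuming the block_ra_probabilities() helper is available in your version of randomizr; the weight vector w is purely illustrative):

# Per-unit probability of each condition implied by the block_m_each design
probs <- block_ra_probabilities(blocks = hec$Hair, block_m_each = block_m_each)
# Weight each unit by the inverse of the probability of its assigned condition
w <- ifelse(Z == 1, 1/probs[, 2], 1/probs[, 1])
head(w)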

How to create blocks? In the HairEyeColor dataset, we could make blocks for each unique combination of hair color, eye color, and sex.

blocks <- with(hec, paste(Hair, Eye, Sex, sep = "_"))
Z <- block_ra(blocks = blocks)
head(table(blocks, Z))
##                     Z
## blocks                0  1
##   Black_Blue_Female   4  5
##   Black_Blue_Male     5  6
##   Black_Brown_Female 18 18
##   Black_Brown_Male   16 16
##   Black_Green_Female  1  1
##   Black_Green_Male    2  1

Save your random assignment

Whenever you conduct a random assignment for use in an experiment, save it! At a minimum, the random assignment should be saved with an id variable in a csv.

hec <- within(hec,{
  Z_blocked <- complete_ra(N = N, m_each = c(100, 200, 292),
               conditions = c("control", "placebo", "treatment"))
  id_var <- 1:nrow(hec)
})
write.csv(hec[,c("id_var", "Z_blocked")], file = "C:\\Users\\Giulia Barbati\\OneDrive - Università degli Studi di Trieste\\Stat_Learning_Epi\\AA 24_25\\Block 2\\R codes\\MyRandomAssignment.csv")

Note that there are many other packages that can be used to perform randomization procedures in R: https://cran.r-project.org/web/views/ClinicalTrials.html

Estimating causal effects: the magic of randomized trials!

Let’s now conduct a small simulation with our HairEyeColor dataset, in order to obtain a distribution of ATE estimates under both simple and complete random assignment: remember that the true effect that we previously obtained was 25.26.

sims <- 1000

# Set up empty vectors to collect results
simple_ests    <- rep(NA, sims)
complete_ests  <- rep(NA, sims)

# Loop through the simulation sims (= 1000) times
for(i in 1:sims){
  hec <- within(hec,{
    
    # Conduct both kinds of random assignment
    Z_simple    <- simple_ra(N = N)
    Z_complete  <- complete_ra(N = N)
    
    # Define the observed outcomes
    Y_simple    <- Y1*Z_simple + Y0*(1-Z_simple)
    Y_complete  <- Y1*Z_complete + Y0*(1-Z_complete)
    })
  
  # Estimate ATE under both types of randomization
  fit_simple   <- lm(Y_simple ~ Z_simple, data=hec)
  fit_complete <- lm(Y_complete ~ Z_complete, data=hec)
  
  # Save the estimates
  simple_ests[i]   <- coef(fit_simple)[2]
  complete_ests[i] <- coef(fit_complete)[2]
}
mean(simple_ests)
## [1] 25.24697
mean(complete_ests)
## [1] 25.27613

As we expect from randomized assignment, we obtain ATE estimates consistent with the true one (long-run frequentist approach).

Since we are statisticians and always want to have a look at the variability across these estimates, let’s do it.

This simulation allows us to measure the standard error directly, since the vectors simple_ests and complete_ests describe the sampling distribution of each design. (In this case, the standard deviation is the standard error of the sampling distribution of the estimator).

sd(simple_ests)
## [1] 0.6032299
sd(complete_ests)
## [1] 0.604826

Therefore we can also evaluate the 95% confidence intervals around estimates using the normal approximation as:

c(mean(simple_ests)- 1.96*sd(simple_ests), mean(simple_ests)+ 1.96*sd(simple_ests))
## [1] 24.06464 26.42930
c(mean(complete_ests)-1.96*sd(complete_ests),mean(complete_ests)+1.96*sd(complete_ests))
## [1] 24.09067 26.46159

Therefore, in both cases we have obtained an unbiased estimator of the true ATE! Nice work by the randomization process that we applied.

The randomised controlled trial (RCT): a worked example

The Randomised Controlled Trial (RCT) is regarded by many as the gold standard method for evaluating interventions. In this section we’ll look at an example of RCT data and introduce the basic statistical methods that can be used to analyse the results.

An RCT is effective simply because it is designed to counteract all of the systematic biases that can confound the relationship between the exposure and the outcome of interest.

RCTs have become such a bedrock of medical research that standards for reporting them have been developed. The CONSORT flowchart is a useful way of documenting the flow of participants through a trial. CONSORT stands for Consolidated Standards of Reporting Trials, which are endorsed by many medical journals. Indeed, if you plan to publish an intervention study in one of those journals, you are likely to be required to show you have followed the guidelines. The relevant information is available on the ‘Enhancing the QUAlity and Transparency Of health Research’ EQUATOR network website. The EQUATOR network site covers not only RCTs but also the full spectrum guidelines of many types of clinical research designs.

Statistical analysis of a RCT

Statisticians often complain that researchers will come along with a collection of data and ask for advice as to how to analyse it. Sir Ronald Fisher (one of the most famous statisticians of all time) commented:

“To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.”

-Sir Ronald Fisher, Presidential Address to the First Indian Statistical Congress, 1938.

His point was that very often the statistician would have advised doing something different in the first place, had they been consulted at the outset. Once the data are collected, it may be too late to rescue the study from a fatal flaw.

The answer to the question “How should I analyse my data?” depends crucially on what hypothesis is being tested.

In the case of an intervention trial, the hypothesis will usually be: did intervention X make a difference to outcome Y in people with condition Z?
There is, in this case, a clear null hypothesis – that the intervention was ineffective, and that the outcome of the intervention group would have been just the same if it had not been done. The null hypothesis significance testing approach answers just that question: it tells you how likely your data are if the null hypothesis were true. To do that, you compare the distribution of outcome scores in the intervention group and the control group. In this context we don’t just look at the difference in means between the two groups; usually we consider whether that difference is greater than a pre-specified effect size.

Steps to take before data analysis

  • General sanity check on dataset - are values within expected range?
  • Check assumptions
  • Plot data to get sense of likely range of results

Sample dataset with illustrative analysis

To illustrate data analysis, we will use a real dataset that can be retrieved from the ESRC data archive. We will focus only on a small subset of the data, which comes from an intervention study in which teaching assistants administered an individual reading and language intervention to children with Down syndrome. A wait-list RCT design was used, but we will focus here on just the first two phases, in which half the children were randomly assigned to intervention, and the remainder formed a control group.
Several language and reading measures were included in the study, giving a total of 11 outcomes. Here we will illustrate the analysis with just one of the outcomes - letter-sound coding - which was administered at baseline (t1) and immediately after the intervention (t2). In this basic example we will not introduce the pre-specified “effect size” concept in the hypothesis test.

dsdat <- read.csv("dse-rli-trial-data-archive.csv")
dsdat <- dsdat[dsdat$included==1,]
dsdat$group <- as.factor(dsdat$group)
levels(dsdat$group) <- c("Intervention","Control")
dsdat$pretest <- dsdat$letter_sound_t1
dsdat$posttest <- dsdat$letter_sound_t2
library(yarrr) # provides pirateplot()
pirateplot(posttest~group,theme=1,cex.lab=1.5,data=dsdat, point.o=1, xlab="", ylab="Post-intervention score")

The figure shows results on letter-sound coding after one group had received the intervention. This test had also been administered at baseline, but we will focus first just on the outcome results.

Raw data should always be inspected prior to any data analysis, in order to check that the distribution of scores looks sensible. One hears of horror stories where, for instance, an error code of 999 got included in an analysis, distorting all the statistics. Or where an outcome score was entered as 10 rather than 100. Visualising the data is useful when checking whether the results are in the right numerical range for the particular outcome measure.
The pirate plot (https://www.psychologicalscience.org/observer/yarrr-the-pirates-guide-to-r) is a useful way of showing means and distributions as well as individual data points.

A related step is to check whether the distribution of the data meets the assumptions of the proposed statistical analysis. Many common statistical procedures assume that continuous variables are normally distributed. Statistical approaches to checking assumptions are beyond the scope of this section, but just eyeballing the data is useful, and can detect obvious cases of non-normality, ceiling or floor effects, or clumpy data, where only certain values are possible. Data with these features may need special treatment (such as a scale transformation or a non-parametric test). For the data in the figure, although neither distribution is classically normal, we do not see major problems with ceiling or floor effects, and there is a reasonable spread of scores in both groups. To keep to basic methods here, we will use parametric approaches (i.e., exploring differences in mean values!).

library(flextable) # provides flextable() and set_caption(), used throughout below
mytab <- psych::describeBy(dsdat$posttest, group=dsdat$group,mat=TRUE,digits=3)
mytab <- mytab[,c(2,4:6)]
colnames(mytab)<-c('Group','N','Mean','SD')
mytab[1:2,1]<-c("Intervention","Control")
ft <- flextable(mytab)
ft

Group          N    Mean     SD
Intervention   28   22.286   7.282
Control        26   16.346   9.423

The next step is just to compute some basic statistics to get a feel for the effect size. As discussed, the standard deviation (SD) is a key statistic for measuring an intervention effect. In these results, one mean is higher than the other, but there is overlap between the groups. Statistical analysis gives us a way of quantifying how much confidence we can place in the group difference: in particular, how likely is it that there is no real impact of intervention and the observed results just reflect the play of chance. In this case we can see that the difference between means is around 6 points and the average SD is around 8, so the effect size (Cohen’s d) is about .75 - a large effect size for a language intervention.
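
A quick sketch of that arithmetic, using the exact descriptives from the table above (note this uses a simple average of the two SDs rather than the pooled SD, so it is only a rough d):

# Rough Cohen's d: mean difference divided by the average SD
(22.286 - 16.346) / ((7.282 + 9.423) / 2) # about 0.71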

Simple t-test on outcome data

The simplest way of measuring the intervention effect is therefore to just compare outcome (posttest) measures on a t-test. We can use a one-tailed test with confidence, given that we anticipate outcomes will be better after intervention. One-tailed tests are often treated with suspicion, because they can be used by researchers engaged in p-hacking, but where we predict a directional effect, they are entirely appropriate and give greater power than a two-tailed test. When reporting the result of a t-test, researchers should always report all the statistics: the value of t, the degrees of freedom, the means and SDs, and the confidence interval around the mean difference, as well as the p-value. This not only helps readers understand the magnitude and reliability of the effect of interest: it also allows for the study to readily be incorporated in a meta-analysis. Results from a t-test are shown in the following Table. [Note that with a one-tailed test, the confidence interval on one side will extend to infinity: this is because a one-tailed test assumes that the true result is greater than a specified mean value, and disregards results that go in the opposite direction].

myt1 <- t.test(dsdat$posttest~dsdat$group,var.equal=T,alternative='greater')

mytabt <- c(round(myt1$statistic,3),round(myt1$parameter,0), round(myt1$p.value,3),round(myt1$estimate[1]-myt1$estimate[2],3),round(myt1$conf.int,3))
mytabt <- as.matrix(mytabt,ncol=6)
mytabt <- as.data.frame(t(mytabt))
colnames(mytabt)<-c('t','df','p','mean diff.','lowerCI','upperCI')
flextable(mytabt)
T-test on outcomes

t       df   p       mean diff.   lowerCI   upperCI
2.602   52   0.006   5.94         2.117     Inf

Interpreting this table, we can conclude that the data offer evidence for rejecting the null hypothesis, i.e., in the direction of a substantial effect of the intervention. Note that the mean difference here can be interpreted as an estimate of the ATE (Average Treatment Effect) in causal language.

Supplementary Material

T-test on difference scores?? :-(

The t-test on outcomes is easy to do, but it misses an opportunity to control for one unwanted source of variation, namely individual differences in the initial level of the language measure. For this reason, researchers often prefer to take difference scores (the difference between outcome and baseline measures) and apply a t-test to these. While this has some advantages over reliance on raw outcome measures, it also has disadvantages, because the amount of change that is possible from baseline to outcome is not the same for everyone. A child with a very low score at baseline has more “room for improvement” than one with an average score. For this reason, analysis of difference scores is not generally recommended.
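
For illustration only, here is a minimal sketch of that (not recommended) difference-score analysis; the variable gain is introduced just for this sketch:

# t-test on pre-post difference scores - shown only for comparison
dsdat$gain <- dsdat$posttest - dsdat$pretest
t.test(gain ~ group, data = dsdat, var.equal = TRUE)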

Analysis of covariance on outcome scores

Rather than taking difference scores, it is preferable to analyse differences in outcome measures after making a statistical adjustment that takes into account the initial baseline scores, using a method known as analysis of covariance (ANCOVA). In practice, this method usually gives results very similar to those you would obtain from an analysis of difference scores, but the precision, and hence the statistical power, is greater.
However, the data do need to meet certain assumptions of the method. We can, for example, start with a plot to check whether there is a linear relationship between pretest and posttest scores in both groups - i.e., whether the points cluster around a straight line.

# plot to check linearity by eye
library(ggpubr) # provides ggscatter() and stat_regline_equation()
ggscatter(
  dsdat, x = "pretest", y = "posttest",
  color = "group", add = "reg.line"
  )+
  stat_regline_equation(
    aes(label =  paste(..eq.label.., ..rr.label.., sep = "~~~~"), color = group)
    )
Pretest vs posttest scores in the Down syndrome data

Inspection of the plot confirms that the relationship between pretest and posttest looks reasonably linear in both groups. Note that it also shows slightly more children with very low scores at pretest in the control group. This is just a chance finding - the kind of thing that can easily happen when relatively small numbers of children are randomly assigned to groups. Randomization does not always guarantee “perfect” balance, especially for small sample sizes. What matters is balance on average. On average, pretest scores are not significantly different between the two groups, as is expected from randomization:

mytcheck <- t.test(dsdat$pretest~dsdat$group)
mytcheck
## 
##  Welch Two Sample t-test
## 
## data:  dsdat$pretest by dsdat$group
## t = 0.94182, df = 49.914, p-value = 0.3508
## alternative hypothesis: true difference in means between group Intervention and group Control is not equal to 0
## 95 percent confidence interval:
##  -2.539301  7.022817
## sample estimates:
## mean in group Intervention      mean in group Control 
##                   15.35714                   13.11538

Now, let’s estimate the ATE (Average Treatment Effect) with only group as a factor, and then adding the pretest level: we use different methods (ANOVA and linear regression) that are substantially equivalent:

library(rstatix) # provides anova_test(), get_anova_table(), and the %>% pipe
aov.1   <- dsdat %>% anova_test(posttest ~ group)
aov.1.full <- aov(posttest ~ group, data=dsdat)
aov.1.full$coefficients # the same as the simple linear regression model 
##  (Intercept) groupControl 
##     22.28571     -5.93956
res.aov <- dsdat %>% anova_test(posttest ~ pretest + group)
resaov.1.full <- aov(posttest ~ pretest + group, data=dsdat)
resaov.1.full$coefficients # the same as the two-predictor linear regression model 
##  (Intercept)      pretest groupControl 
##   10.3645402    0.7762625   -4.1993676
mylm1   <- lm(posttest~group,data=dsdat) # equivalent to aov.1
mylm    <- lm(posttest~pretest+group,data=dsdat) # equivalent to res.aov

To better visualize results:

tab1 <- as.data.frame(get_anova_table(aov.1))
tab2 <- as.data.frame(get_anova_table(res.aov))
tab3 <- as.data.frame(summary(mylm1)$coefficients)
tab4 <- as.data.frame(summary(mylm)$coefficients)

source <-c('Intercept','Pretest','Group')
tab4 <- cbind(source,tab4)
colnames(tab4)[5]<-'p'


colnames(tab3)[4]<-'p'

tab1$p <- round(tab1$p,3)
tab2$p <- round(tab2$p,3)
tab3$p <- round(tab3$p,3)
tab4$p <- round(tab4$p,3)


tab1<-tab1[,-6]
tab2<-tab2[,-6]

ft1<-flextable(tab1)
ft1<-set_caption(ft1,'Analysis of posttest only (ANOVA)')
ft1
Analysis of posttest only (ANOVA)

Effect   DFn   DFd   F       p       ges
group    1     52    6.773   0.012   0.115

ft2 <- flextable(tab2)
ft2<-set_caption(ft2,'Analysis adjusted for pretest scores (ANCOVA)')
ft2
Analysis adjusted for pretest scores (ANCOVA)

Effect    DFn   DFd   F        p       ges
pretest   1     51    94.313   0.000   0.649
group     1     51    9.301    0.004   0.154

# ft3 shows the equivalent results from linear regression
ft3<-flextable(tab3)
ft3<-set_caption(ft3,'Linear regression with group as predictor' )
ft3
Linear regression with group as predictor

source         Estimate    Std. Error   t value      p
(Intercept)    22.28571    1.583656     14.072320    0.000
groupControl   -5.93956    2.282291     -2.602455    0.012

ft4<-flextable(tab4)
ft4<-set_caption(ft4,'Linear regression with pretest and group as predictor' )
ft4
Linear regression with pretest and group as predictor

source      Estimate      Std. Error    t value     p
Intercept   10.3645402    1.55058301    6.684286    0.000
Pretest     0.7762625     0.07993237    9.711491    0.000
Group       -4.1993676    1.37698465    -3.049684   0.004

The table shows the same data analysed first by using ANOVA, or equivalently simple linear regression, to compare only the post-test scores (i.e., the ATE), and then by using ANCOVA, or equivalently the two-predictor linear regression model, to adjust scores for the baseline (pretest) values (i.e., the CATE: the effect conditional on the pretest value). The effect size in the ANOVA/ANCOVA approaches is shown as ges, which stands for “generalised eta squared”. You can notice that the ATE estimated by the regression coefficient of the simple linear regression model with only group as an independent variable is the same as the mean difference reported by the t-test. Of note: if you want to go deeper into this method (ANCOVA), see for example: https://www.datanovia.com/en/lessons/ancova-in-r/.

References

Fisher RA (1935). The Design of Experiments. Oliver & Boyd.

Armitage P (1982). “The Role of Randomization in Clinical Trials.” Statistics in Medicine, 1(4), 345-352. doi:10.1002/sim.4780010412.

Chalmers I (1999). “Why Transition from Alternation to Randomisation in Clinical Trials Was Made.” BMJ, 319(7221), 1372. doi:10.1136/bmj.319.7221.1372.

ICH E9 (1998). “Statistical Principles for Clinical Trials.” Current version dated 1998-02-05. URL http://www.ich.org/.

Rosenberger WF, Lachin JM (2016). Randomization in Clinical Trials: Theory and Practice. John Wiley & Sons. doi:10.1002/9781118742112.