Load the required data for the examples:
load("FakeSchool.rda")
Recall that most fundamental epidemiological/health research questions actually break down into 3 parts:
Descriptive Statistics: What relationship can we observe between the variables, in the sample?
Inferential Statistics: Supposing we see a relationship in the sample data, how much evidence is provided for a relationship in the population? Does the data provide lots of evidence for a relationship in the population, or could the relationship we see in the sample be due just to chance variation in the sampling process that gave us the data? [we ignore for the moment problems related to bias (systematic errors), we only consider here the random variation related to the sampling mechanism]
Prediction/Counterfactual prediction: Using data collected on a sample, about features and outcomes, are we able to predict what will happen in the future to “new” subjects having measured the same features ? Are we able to evaluate the impact of a treatment on the outcomes of subjects with that features ? (this is related to regression modelling / causal inference tools)
To answer all these questions, the starting point involve dealing with the sample.
In order to make valid conclusions about any research question, we first need to make sure we are dealing with a good sample.
Here we will discuss various techniques for drawing samples, with some notes about the strengths and weaknesses of these different sampling techniques.
An important distinction that we want to make sure has been made clear before we go any further is the distinction between a sample and a population.
A population is the set of all subjects of interest.
A sample is the subset of the population in which we measure data.
Let’s consider these two definitions with a research question.
Research Question: In the United States, what is the mean height of adult males (18 years +)?
The population that we are dealing with in this case is all U.S. adult males. One way to find an exact answer to this research question would be to survey the entire population. However, this is nearly impossible! It would be much quicker and easier to measure only a subset of the population, a sample.
However, if we want our sample to be an accurate reflection of the population, we can’t just choose any sample that we wish.
The way in which we collect our sample is very important.
For the time being, let’s suppose that we were able to choose an
appropriate sample (and we’ll talk more about how this is done
later).
Suppose that our sample of U.S. men is an accurate representation of the
U.S. population of men. Then, we might discuss two different means: the
mean height of the sample and the mean height of the
population. These are both descriptions, as
opposed to inferences. There are a couple of
differences, however.
Mean Height of the Sample
Mean Height of the Population
Our goal is to use the information we’ve gathered from the
sample to infer, or predict,
something about the population.
For our example, we want to predict the population mean, using our
knowledge of the sample. The accuracy of our sample mean relies
heavily upon how well our sample represents the
population at large. If our sample does a poor job at
representing the population, then any inferences that we make about the
population are also going to be poor. Thus, it is very important to
select a good sample!
Note: If we already knew everything about a population, it would be useless to gather a sample in order to infer something about the population. We would already have this information! Using statistics as an inferential tool means that you don’t have information about the entire population to start with. If you are able to sample the entire population, this would be called a census.
There are 2 main kinds of sampling:
Random Sampling
Non-Random Sampling
Non-Random sampling is defined as a sampling technique in which the researcher selects samples based on the subjective judgment of the researcher rather than random selection. This sampling method depends heavily on the expertise of the researchers. It is carried out by observation, and researchers use it widely in qualitative research.Not all members of the population have an equal chance of participating in the study, unlike random sampling, where each member of the population has a known chance of being selected. Non-probability sampling is most useful for exploratory studies like a pilot survey (deploying a survey to a smaller sample compared to pre-determined sample size). Researchers use this method in studies where it is not possible to draw using random probability sampling due to time or cost considerations. We do not explore here further this issue, (most of the times related also to the data availability issues). Note that in clinical and epidemiological studies we assume that the ideal sampling scheme is to have a random sample of the target population, even if the treatment assignment in the sample is not random (unless you are doing a Randomized Clinical Trial, the gold standard!) ….
There are basically four different methods of random sampling:
The simple random sample (SRS) is the basic type of sample. The other types are more advanced methods that will not be discussed in this course.
It will be helpful to work with an example as we describe each of these methods, so let’s use the following set of 28 students from FakeSchool as our target population from which we will sample.
data(FakeSchool)
head(FakeSchool)
## Students Sex class GPA Honors
## 1 Alice F Fr 3.80 Yes
## 2 Brad M Fr 2.60 Yes
## 3 Caleb M Fr 2.25 No
## 4 Daisy F Fr 2.10 No
## 5 Faye F Fr 2.00 No
## 6 Eva F Fr 1.80 No
This is a data frame with 28 observations on the following 5 variables:
Students: Name of each student
Sex: sex of the student
class: class rank of the student
GPA: grade point average
Honors: whether or not he student is in the Honors Program
Keep in mind that we would not know information about an entire population in real life, We are using this “population” for demonstration purposes only
Our goal is to describe how the different sampling techniques are implemented, and comment on the strengths and weaknesses of them.
We can easily compute the true mean GPA for the students at FakeSchool by averaging the values in the fourth column of the dataset.
This will be the population mean. We will call it \(\mu\) (“mu”).
mu <- mean(FakeSchool$GPA)
mu
## [1] 2.766429
Again, the population parameter, \(\mu\), is not typically known !!! If it were known, there would be no reason to estimate it. However, the point of this example is to practice selecting different types of samples and to compare the performance of these different sampling techniques.
In simple random sampling, for a given sample size \(n\) every set of \(n\) members of the population has the same chance to be the sample that is actually selected.
We often use the acronym SRS as an abbreviation for “simple random sampling”.
Intuitively, let’s think of simple random sampling as follows: we find a big box, and for each member of the population we put into the box a ticket that has the name of the individual written on it. All tickets are the same size and shape. Mix up the tickets thoroughly in the box. Then pull out a ticket at random, set it aside, pull out another ticket, set it aside, and so on until the desired number of tickets have been selected.
Let’s select a simple random sample of 7 elements
without replacement.
We need two pieces of information:
Remember that sampling without replacement means that once
we draw an element from the population, we do not put it back so that it
can be drawn again.
We would not want to draw with replacement as this could
possibly result with a sample containing the same person more than once.
This would not be a good representation of the entire school. Typically,
we will sample without replacement in most of cases. [Some exceptions
are used in the propensity score-based methods, for the purpose
of matching (see examples in block 3)].
Since we may want to access this sample later, it’s a good idea to store our sample in an object.
set.seed(314159)
# 1. Get the total number of rows in the population
n_total <- nrow(FakeSchool)
# 2. Randomly select 7 subjects
sample_indices <- sample(1:n_total, size = 7, replace = FALSE)
# 3. Create the sample data frame by subsetting
srs <- FakeSchool[sample_indices, ]
Let’s now calculate the mean GPA for the 7 sampled students. This will be the sample mean, \(\bar{x}_{srs}\). We will use the subscript ‘srs’ to remind ourselves that this is the sample mean for the simple random sample.
xbar.srs <- mean(srs$GPA)
xbar.srs
## [1] 2.771429
Strengths
The selection of one element does not affect the selection of others.
Each possible sample, of a given size, has an equal chance of being selected.
Simple random samples tend to be good representations of the population.
Requires little knowledge of the population.
Weaknesses
If there are small subgroups within the population, a SRS may not give an accurate representation of that subgroup. In fact, it may not include it at all. This is especially true if the sample size is small, as in our example.
If the population is large and widely dispersed, it can be costly (both in time and money) to collect the data.
Randomization is a statistical technique used to ensure the comparability of treatment groups in clinical trials by introducing a deliberate element of chance.
First of all, it tends to balance known and unknown covariates and, thus, to produce structural equality of the treatment groups (exchangeability).
Second, by ensuring effective blinding of treatment allocations from investigators and patients, randomization helps to avoid bias caused by the selection of patients.
Finally, randomization contributes to the internal validity of a trial that provides the basis for statistical inference (and the estimation of causal effects in the population).
The randomization concept was firstly introduced by R. A. Fisher in his classic book “The Design of Experiments” (1935). The importance of randomization for clinical trials was first noted in the 1940s by the epidemiologist Sir A. Bradford Hill who realized that successful blinding of treatment allocations was impossible without randomization.
Since that time, regulators have advocated the use of randomization in their guidelines and several different randomization procedures have been proposed in the literature. It has been noticed that different randomization procedures behave differently, e.g., concerning their susceptibility to bias and their potential to control power and type-I-error probability. An overview containing the latest developments can be found in Rosenberger and Lachin (2016).
We will use here the R library randomizr to illustrate some basic examples of randomization procedures.
Install and then load the required library:
library(randomizr)
There are two main purposes for the generation of randomization sequences:
Suppose that we are conducting an experiment among the 592 individuals in the built-in HairEyeColor dataset. As we’ll see, there are many ways to randomly assign subjects to treatments. We’ll step through three common designs, each associated with one of the five randomizr functions: simple_ra(), complete_ra(), block_ra(). [Not covered here: cluster_ra(), and block_and_cluster_ra()]
We first need to transform the dataset, which has each row describe a type of subject, to a new dataset in which each row describes an individual subject.
# Load built-in dataset
data(HairEyeColor)
HairEyeColor <- data.frame(HairEyeColor)
head(HairEyeColor)
## Hair Eye Sex Freq
## 1 Black Brown Male 32
## 2 Brown Brown Male 53
## 3 Red Brown Male 10
## 4 Blond Brown Male 3
## 5 Black Blue Male 11
## 6 Brown Blue Male 50
dim(HairEyeColor)
## [1] 32 4
# Transform so each row is a subject
# Columns describe subject's hair color, eye color, and gender
hec <- HairEyeColor[rep(1:nrow(HairEyeColor),
times = HairEyeColor$Freq), 1:3]
N <- nrow(hec)
# Fix the rownames
rownames(hec) <- NULL
head(hec)
## Hair Eye Sex
## 1 Black Brown Male
## 2 Black Brown Male
## 3 Black Brown Male
## 4 Black Brown Male
## 5 Black Brown Male
## 6 Black Brown Male
dim(hec)
## [1] 592 3
Typically, researchers know some basic information about their subjects before deploying treatment. For example, they usually know how many subjects there are in the study population (N), and they usually know some basic demographic information about each subject.
Our new dataset has 592 subjects. We have three pre-treatment covariates: Hair, Eye, and Sex, which describe the hair color, eye color, and gender of each subject.
Let’s suppose that the treatment under study is aimed to improve a certain numerical score of “self-confidence”.
For didactic purposes, we now will create simulated potential outcomes.
We’ll call the untreated outcome Y0 and we’ll call the treated outcome Y1.
Imagine that in the absence of any intervention, the outcome (Y0) is correlated with our pre-treatment covariates.
Imagine further that the effectiveness of the intervention also varies according to these covariates, i.e., the difference between Y1 and Y0 is correlated with the pre-treatment covariates.
If we were really running an experiment, we would only observe either Y0 or Y1 for each subject, but since we are simulating, we generate both.
Our inferential target is the average treatment effect (ATE), which is defined as the average difference between Y0 and Y1.
# Set a the random seed for reproducibility
set.seed(343)
# Create untreated and treated outcomes for all subjects
hec <- within(hec,{
Y0 <- rnorm(n = N,mean = (2*as.numeric(Hair) + -4*as.numeric(Eye) + -6*as.numeric(Sex)), sd = 5)
Y1 <- Y0 + 6*as.numeric(Hair) + 4*as.numeric(Eye) + 2*as.numeric(Sex)
})
hist(hec$Y0, xlab="Potential Outcome Score if untreated")
hist(hec$Y1, xlab="Potential Outcome Score if treated")
hist(hec$Y1-hec$Y0, xlab="Distribution of potential outcomes differences")
abline(v=25.26, col="red", lwd=2)
# Calculate the true ATE
with(hec, mean(Y1 - Y0))
## [1] 25.26351
Therefore, the ATE on the population is 25.26 : in this example there is an average improvement in the score of self-confidence in the treated of 25.26 units with respect to the untreated. This comes from an ideal experiment : treating all subjects vs not treating all subjects ! We are now ready to randomly allocate treatment assignments to subjects. Let’s start by contrasting simple and complete random assignment.
Simple random assignment assigns all subjects to treatment with an equal probability by flipping a (possibly weighted) coin for each subject. The main trouble with simple random assignment is that the number of subjects assigned to treatment is itself a random number - depending on the random assignment, therefore a different number of subjects might be assigned to each group.
The simple_ra() function has one required argument N, the total number of subjects. If no other arguments are specified, simple_ra() assumes a two-group design and a 0.50 probability of assignment.
Z <- simple_ra(N = N)
table(Z)
## Z
## 0 1
## 301 291
To change the probability of assignment (i.e. using a weighted coin), you should specify the prob argument:
Z <- simple_ra(N = N, prob = 0.30)
table(Z)
## Z
## 0 1
## 402 190
If you specify num_arms without changing prob_each, simple_ra() will assume equal probabilities across all arms.
Z <- simple_ra(N = N, num_arms = 3)
table(Z)
## Z
## T1 T2 T3
## 191 215 186
You can also just specify the specific probabilities of your multiple arms. These probabilities must sum to 1.
Z <- simple_ra(N = N, prob_each = c(.2, .2, .6))
table(Z)
## Z
## T1 T2 T3
## 118 119 355
You can also give a name to your treatment arms.
Z <- simple_ra(N = N, prob_each = c(.2, .2, .6),
conditions=c("control", "placebo", "treatment"))
table(Z)
## Z
## control placebo treatment
## 132 108 352
Complete random assignment is very similar to simple random assignment, except that the researcher can specify exactly how many units should be assigned to each condition.
The syntax for complete_ra() is very similar to that of simple_ra().
The argument m is the number of units assigned to treatment in two-arm designs; it is analogous to simple_ra()’s prob. Similarly, the argument m_each is analogous to prob_each.
If you only specify N, complete_ra() assigns exactly half of the subjects to each of the two treatments.
Z <- complete_ra(N = N)
table(Z)
## Z
## 0 1
## 296 296
To change the number of units assigned, specify the m argument:
Z <- complete_ra(N = N, m = 200)
table(Z)
## Z
## 0 1
## 392 200
If you specify multiple arms, complete_ra() will assign an equal (within rounding!) number of units to treatment.
Z <- complete_ra(N = N, num_arms = 3)
table(Z)
## Z
## T1 T2 T3
## 197 198 197
You can also specify exactly how many units should be assigned to each arm. The total of m_each must equal N.
Z <- complete_ra(N = N, m_each = c(100, 200, 292))
table(Z)
## Z
## T1 T2 T3
## 100 200 292
You can also name your treatment arms:
Z <- complete_ra(N = N, m_each = c(100, 200, 292),
conditions = c("control", "placebo", "treatment"))
table(Z)
## Z
## control placebo treatment
## 100 200 292
Stratified random assignment is a powerful tool when used well.
In this design, subjects are sorted into strata (usually according to their pre-treatment covariates), and then complete random assignment is conducted within each stratum. For example, a researcher might stratify on gender, assigning exactly half of the men and exactly half of the women to treatment.
Why stratify ?
The first reason is related to heterogeneity of treatment effect (HTE): is the treatment effect different for men versus women?
Of course, such heterogeneity could be also explored if complete random assignment had been used, but stratifying on a covariate defends a researcher (somewhat) against claims of data dredging.
The second reason is to increase precision. If the stratifying variables are associated with the outcome, then strata may help to decrease sampling variability.
It’s important, however, not to overstate these advantages. The gains from a stratified design can often be realized through covariate adjustment alone.
Stratifying can also produce some complications for estimation. For example, it can produce different probabilities of assignment for different subjects, related to their characteristics. This complication is typically addressed in one of two ways: “controlling for strata” in a regression context, or the use of inverse probability weights (IPW), in which units are weighted by the inverse of the probability that the unit is in the condition that it is in (see for more details examples in block 3).
In the R library the word block is used in place of stratum. I think that this name could be confused with block randomization that instead is not the topic of this function (to explore block randomization see the package https://cran.r-project.org/web/packages/blockrand/index.html).
The only required argument to block_ra() is blocks, which is a vector of length N that describes which block (strata) a unit belongs to. Blocks (strata) can be a factor, character, or numeric variable. If no other arguments are specified, block_ra() assigns an approximately equal proportion of each block to treatment.
Z <- block_ra(blocks = hec$Hair)
table(Z, hec$Hair)
##
## Z Black Brown Red Blond
## 0 54 143 35 64
## 1 54 143 36 63
For multiple treatment arms, use the num_arms argument, with or without the conditions argument:
Z <- block_ra(blocks = hec$Hair, num_arms = 3)
table(Z, hec$Hair)
##
## Z Black Brown Red Blond
## T1 36 95 24 43
## T2 36 96 23 42
## T3 36 95 24 42
Z <- block_ra(blocks = hec$Hair, conditions = c("Control", "Placebo", "Treatment"))
table(Z, hec$Hair)
##
## Z Black Brown Red Blond
## Control 36 95 23 42
## Placebo 36 96 24 43
## Treatment 36 95 24 42
block_ra() provides a number of ways to adjust the number of subjects assigned to each conditions. The prob_each argument describes what proportion of each block should be assigned to treatment arm. Note of course, that block_ra() still uses complete random assignment within each block; the appropriate number of units to assign to treatment within each block is automatically determined.
Z <- block_ra(blocks = hec$Hair, prob_each = c(.3, .7))
table(Z, hec$Hair)
##
## Z Black Brown Red Blond
## 0 32 86 21 38
## 1 76 200 50 89
For finer control, use the block_m_each argument, which takes a matrix with as many rows as there are blocks, and as many columns as there are treatment conditions. Remember that the rows are in the same order as sort(unique(blocks)), a command that is good to run before constructing a block_m_each matrix.
sort(unique(hec$Hair))
## [1] Black Brown Red Blond
## Levels: Black Brown Red Blond
block_m_each <- rbind(c(78, 30),
c(186, 100),
c(51, 20),
c(87,40))
block_m_each
## [,1] [,2]
## [1,] 78 30
## [2,] 186 100
## [3,] 51 20
## [4,] 87 40
Z <- block_ra(blocks = hec$Hair, block_m_each = block_m_each)
table(Z, hec$Hair)
##
## Z Black Brown Red Blond
## 0 78 186 51 87
## 1 30 100 20 40
How to create blocks? In the HairEyeColor dataset, we could make blocks for each unique combination of hair color, eye color, and sex.
blocks <- with(hec, paste(Hair, Eye, Sex, sep = "_"))
Z <- block_ra(blocks = blocks)
head(table(blocks, Z))
## Z
## blocks 0 1
## Black_Blue_Female 4 5
## Black_Blue_Male 5 6
## Black_Brown_Female 18 18
## Black_Brown_Male 16 16
## Black_Green_Female 1 1
## Black_Green_Male 2 1
Whenever you conduct a random assignment for use in an experiment, save it! At a minimum, the random assignment should be saved with an id variable in a csv.
hec <- within(hec,{
Z_blocked <- complete_ra(N = N, m_each = c(100, 200, 292),
conditions = c("control", "placebo", "treatment"))
id_var <- 1:nrow(hec)
})
write.csv(hec[,c("id_var", "Z_blocked")], file = "MyRandomAssignment.csv")
Note that there are many others package that can be used to perform randomization procedures in R: https://cran.r-project.org/web/views/ClinicalTrials.html
Let’s now conduct a small simulation with our HairEyeColor dataset, in order to obtain a distribution of ATE estimates under both simple and complete random assignment: remember that the true effect that we previously obtained was 25.26.
sims <- 1000
# Set up empty vectors to collect results
simple_ests <- rep(NA, sims)
complete_ests <- rep(NA, sims)
# Loop through simulation 2000 times
for(i in 1:sims){
hec <- within(hec,{
# Conduct both kinds of random assignment
Z_simple <- simple_ra(N = N)
Z_complete <- complete_ra(N = N)
# Define the observed outcomes
Y_simple <- Y1*Z_simple + Y0*(1-Z_simple)
Y_complete <- Y1*Z_complete + Y0*(1-Z_complete)
})
# Estimate ATE under both types of randomization
fit_simple <- lm(Y_simple ~ Z_simple, data=hec)
fit_complete <- lm(Y_complete ~ Z_complete, data=hec)
# Save the estimates
simple_ests[i] <- coef(fit_simple)[2]
complete_ests[i] <- coef(fit_complete)[2]
}
mean(simple_ests)
## [1] 25.24697
mean(complete_ests)
## [1] 25.27613
As we expect from randomized assignment, we obtain ATE estimates consistent with the true one (long-run frequentist approach).
Since we are statisticians and we always want to have a look to the variability across these estimates, let’s do it.
This simulation allows us to measure the standard error directly, since the vectors simple_ests and complete_ests describe the sampling distribution of each design. (In this case, the standard deviation is the standard error of the sampling distribution of the estimator).
sd(simple_ests)
## [1] 0.6032299
sd(complete_ests)
## [1] 0.604826
Therefore we can also evaluate the 95% confidence intervals around estimates using the normal approximation as:
c(mean(simple_ests)- 1.96*sd(simple_ests), mean(simple_ests)+ 1.96*sd(simple_ests))
## [1] 24.06464 26.42930
c(mean(complete_ests)-1.96*sd(complete_ests),mean(complete_ests)+1.96*sd(complete_ests))
## [1] 24.09067 26.46159
Therefore we have obtained in both cases unbiased estimator of the true ATE ! Nice work done by the randomization process that we applied.
Fisher, R. A. (1935). The design of experiments. Oliver & Boyd.
Armitage P (1982). “The Role of Randomization in Clinical Trials.” Statistics in Medicine, 1(4), 345-352. doi:10.1002/sim.4780010412.
Chalmers I (1999). “Why Transition from Alternation to Randomisation in Clinical Trials Was Made.” BMJ, 319(7221), 1372. doi:10.1136/bmj.319.7221.1372.
ICH E9 (1998). “Statistical Principles for Clinical Trials.” Current version dated 1998-02-05. Last access 2014-09-01., URL http://www.ich.org/.
Rosenberger WF, Lachin JM (2016). Randomization in Clinical Trials: Theory and Practice. John Wiley & Sons. doi:10.1002/9781118742112.