Load the required library:
require(tigerstats)
Recall that most fundamental epidemiological/biomedical research questions actually break down into 2 parts:
Descriptive Statistics: What relationship can we observe between the variables, in the sample?
Inferential Statistics: Supposing we see a relationship in the sample data, how much evidence does it provide for a relationship in the population? Does the data provide a lot of evidence for a relationship in the population, or could the relationship we see in the sample be due just to chance variation in the sampling process that gave us the data? (For the moment we ignore problems related to bias, i.e. systematic errors; we consider here only the random variation related to the sampling mechanism.)
Both parts of answering research questions involve dealing with the sample.
In order to make valid conclusions about any research question, we first need to make sure we are dealing with a good sample.
Here we will discuss various techniques for drawing samples, with some notes about the strengths and weaknesses of these different sampling techniques.
An important distinction that we want to make sure has been made clear before we go any further is the distinction between a sample and a population.
A population is the set of all subjects of interest.
A sample is the subset of the population in which we measure data.
Let’s consider these two definitions with a research question.
Research Question: In the United States, what is the mean height of adult males (18 years +)?
The population that we are dealing with in this case is all U.S. adult males. One way to find an exact answer to this research question would be to survey the entire population. However, this is nearly impossible! It would be much quicker and easier to measure only a subset of the population, a sample.
However, if we want our sample to be an accurate reflection of the population, we can’t just choose any sample that we wish. The way in which we collect our sample is very important.
For the time being, let’s suppose that we were able to choose an
appropriate sample (and we’ll talk more about how this is done
later).
Suppose that our sample of U.S. men is an accurate representation of the
U.S. population of men. Then, we might discuss two different means: the
mean height of the sample and the mean height of the
population. These are both descriptions, as
opposed to inferences. There are a couple of
differences, however.
Mean Height of the Sample: a quantity we can actually calculate, because we have measured every individual in the sample.
Mean Height of the Population: the quantity we would really like to know, but typically cannot calculate directly, because we have not measured the entire population.
Our goal is to use the information we’ve gathered from the sample to infer, or predict, something about the population. For our example, we want to predict the population mean, using our knowledge of the sample. The accuracy of our sample mean relies heavily upon how well our sample represents the population at large. If our sample does a poor job at representing the population, then any inferences that we make about the population are also going to be poor. Thus, it is very important to select a good sample!
Note: If we already knew everything about a population, it would be useless to gather a sample in order to infer something about the population. We would already have this information! Using statistics as an inferential tool means that you don’t have information about the entire population to start with. If you are able to sample the entire population, this would be called a census.
There are 2 main kinds of sampling:
Random Sampling
Non-Random Sampling
Non-random sampling is a technique in which the researcher selects the sample based on subjective judgment rather than random selection. This method depends heavily on the expertise of the researcher; it is often carried out by observation and is widely used in qualitative research. Unlike random sampling, where each member of the population has a known chance of being selected, not all members of the population have an equal chance of participating in the study. Non-probability sampling is most useful for exploratory studies such as a pilot survey (deploying a survey to a sample smaller than the pre-determined sample size). Researchers also use it when drawing a random probability sample is not possible due to time or cost considerations. We do not explore this issue further here (it is most often related to data availability). Note that in clinical and epidemiological studies we assume that the ideal sampling scheme is a random sample of the target population, even if the treatment assignment within the sample is not random (unless you are running a randomized clinical trial, the gold standard!).
There are basically four different methods of random sampling: simple random sampling, systematic sampling, stratified sampling, and cluster sampling.
The simple random sample (SRS) is the most basic type of sample. The other types have been included to give you comparisons to the SRS and to help you if you want to explore these topics further in the future.
It will be helpful to work with an example as we describe each of these methods, so let’s use the following set of 28 students from FakeSchool as our population from which we will sample.
data(FakeSchool)
head(FakeSchool)
## Students Sex class GPA Honors
## 1 Alice F Fr 3.80 Yes
## 2 Brad M Fr 2.60 Yes
## 3 Caleb M Fr 2.25 No
## 4 Daisy F Fr 2.10 No
## 5 Faye F Fr 2.00 No
## 6 Eva F Fr 1.80 No
This is a data frame with 28 observations on the following 5 variables:
Students: Name of each student
Sex: sex of the student
class: class rank of the student
GPA: grade point average
Honors: whether or not the student is in the Honors Program
Keep in mind that we would not know information about an entire population in real life. We are using this “population” for demonstration purposes only.
Our goal is to describe how the different sampling techniques are implemented, and comment on the strengths and weaknesses of them.
We can easily compute the true mean GPA for the students at
FakeSchool by averaging the values in the fourth column of the
dataset.
This will be the population mean. We will call it \(\mu\) (“mu”).
mu <- mean(~GPA,data=FakeSchool)
mu
## [1] 2.766429
Again, the population parameter, \(\mu\), is not typically known! If it were known, there would be no reason to estimate it. However, the point of this example is to practice selecting different types of samples and to compare the performance of these different sampling techniques.
In simple random sampling, for a given sample size \(n\) every set of \(n\) members of the population has the same chance to be the sample that is actually selected.
We often use the acronym SRS as an abbreviation for “simple random sampling”.
Intuitively, let’s think of simple random sampling as follows: we find a big box, and for each member of the population we put into the box a ticket that has the name of the individual written on it. All tickets are the same size and shape. Mix up the tickets thoroughly in the box. Then pull out a ticket at random, set it aside, pull out another ticket, set it aside, and so on until the desired number of tickets have been selected.
Let’s select a simple random sample of 7 elements without replacement. We can accomplish this easily with the built-in function popsamp in R. This function requires two pieces of information: the size of the sample we want to draw, and the dataset (the population) from which to draw it.
Remember that sampling without replacement means that once we draw an element from the population, we do not put it back, so it cannot be drawn again. We would not want to draw with replacement, as this could result in a sample containing the same person more than once, which would not be a good representation of the entire school. By default, the popsamp function samples without replacement. If you want to sample with replacement, you need to add a third argument to the function: replace=TRUE.
We will sample without replacement in most cases. Some exceptions arise in propensity score-based methods, for the purpose of matching (see examples in block 3).
Since we may want to access this sample later, it’s a good idea to store our sample in an object.
set.seed(314159)
srs <- popsamp(7,FakeSchool)
srs
## Students Sex class GPA Honors
## 10 Chris M So 4.0 Yes
## 2 Brad M Fr 2.6 Yes
## 16 Brittany F Jr 3.9 No
## 20 Eliott M Jr 1.9 No
## 21 Garth M Jr 1.1 No
## 23 Bob M Sr 3.8 Yes
## 13 Eric M So 2.1 No
Let’s now calculate the mean GPA for the 7 sampled students. This will be the sample mean, \(\bar{x}_{srs}\). We will use the subscript ‘srs’ to remind ourselves that this is the sample mean for the simple random sample.
xbar.srs <- mean(~GPA,data=srs)
xbar.srs
## [1] 2.771429
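For comparison, the same kind of simple random sample can be drawn with base R alone. This is just a sketch of the equivalent idea (even with the same seed, it will not necessarily select the same rows as popsamp):
set.seed(314159)
rows <- sample(nrow(FakeSchool), size = 7, replace = FALSE)  # 7 row indices at random, no repeats
srs2 <- FakeSchool[rows, ]                                   # the sampled students
mean(srs2$GPA)                                               # their mean GPA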
Strengths
The selection of one element does not affect the selection of others.
Each possible sample, of a given size, has an equal chance of being selected.
Simple random samples tend to be good representations of the population.
Requires little knowledge of the population.
Weaknesses
If there are small subgroups within the population, a SRS may not give an accurate representation of that subgroup. In fact, it may not include it at all. This is especially true if the sample size is small, as in our example.
If the population is large and widely dispersed, it can be costly (both in time and money) to collect the data.
In a systematic sample, the members of the population are put in a row. Then 1 out of every \(k\) members is selected. The starting point is randomly chosen from the first \(k\) elements, and then elements are sampled at the same position in each of the subsequent segments of size \(k\).
To illustrate the idea, let’s take a 1-in-4 systematic sample from our FakeSchool population.
We will start by randomly selecting our starting element.
set.seed(49464)
start=sample(1:4,1)
start
## [1] 4
So, we will start with element 4, which is Daisy and choose every 4th element after that for our sample.
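One way to draw this sample in R, as a sketch, assuming the rows of FakeSchool are ordered 1 through 28 as listed above (object names follow the pattern used elsewhere in this section):
sys.samp <- FakeSchool[seq(from = start, to = nrow(FakeSchool), by = 4), ]  # rows 4, 8, ..., 28
sys.samp
xbar.sys <- mean(~GPA, data = sys.samp)
xbar.sys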
## Students Sex class GPA Honors
## 4 Daisy F Fr 2.1 No
## 8 Andrea F So 4.0 Yes
## 12 Felipe M So 3.0 No
## 16 Brittany F Jr 3.9 No
## 20 Eliott M Jr 1.9 No
## 24 Carl M Sr 3.1 No
## 28 Grace F Sr 1.4 No
## [1] 2.771429
The mean GPA of the systematic sample, the sample mean, \(\bar{x}_{sys}\), is 2.7714286.
Strengths
Assures a random sampling of the population.
When the population is an ordered list, a systematic sample could give a better representation of the population than a SRS.
Can be used in situations where a SRS is difficult or impossible. It is especially useful when the population that you are studying is arranged in time.
For example, suppose you are interested in the average amount of money that people spend at the grocery store on a Wednesday evening. A systematic sample could be used by selecting every 10th person that walks into the store.
Weaknesses
Not every combination has an equal chance of being selected. Many combinations will never be selected using a systematic sample.
Beware of periodicity in the population. If, after ordering, the selections match some pattern in the list (skip interval), the sample may not be representative of the population.
In a stratified sample, the population must first be separated into homogeneous groups, or strata. Each element belongs to exactly one stratum, and each stratum consists of elements that are alike in some way. A simple random sample is then drawn from each stratum, and these samples are combined to make the stratified sample.
Let’s take a stratified sample of 7 elements from FakeSchool using
the following strata: Honors, Not Honors.
First, let’s determine how many elements belong to each stratum:
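One way to get these counts is with a simple tabulation, for example:
xtabs(~Honors, data = FakeSchool)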
## Honors
## No Yes
## 16 12
So there are 12 Honors students at FakeSchool and 16 non-Honors students at FakeSchool.
There are various ways to determine how many students to include from each stratum. For example, you could choose to select the same number of students from each stratum.
Another strategy is to use a proportionate stratified sample.
In a proportionate stratified sample, the number of students selected from each stratum is proportional to the representation of that stratum in the population.
For example, \(\frac{12}{28} \times 100\% \approx 42.86\%\) of the population are Honors students. This means that there should be \(\frac{12}{28} \times 7 = 3\) Honors students in the sample, and so there should be \(7-3=4\) non-Honors students in the sample.
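This allocation can also be computed directly in R; a small sketch with illustrative object names:
n <- 7                         # total sample size
n.honors <- round(12/28 * n)   # 3 Honors students
n.nonhonors <- n - n.honors    # 4 non-Honors students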
Let’s go through the coding to draw these samples.
Check out how we use the subset function to pull out the Honors students from the rest of the population:
set.seed(1837)
honors=subset(FakeSchool,Honors=="Yes")
honors
## Students Sex class GPA Honors
## 1 Alice F Fr 3.80 Yes
## 2 Brad M Fr 2.60 Yes
## 8 Andrea F So 4.00 Yes
## 9 Betsy F So 4.00 Yes
## 10 Chris M So 4.00 Yes
## 11 Dylan M So 3.50 Yes
## 15 Adam M Jr 3.98 Yes
## 17 Cassie F Jr 3.75 Yes
## 18 Derek M Jr 3.10 Yes
## 19 Faith F Jr 2.50 Yes
## 22 Angela F Sr 4.00 Yes
## 23 Bob M Sr 3.80 Yes
Next, we take a SRS of size 3 from the Honors students:
honors.samp=popsamp(3,honors)
honors.samp
## Students Sex class GPA Honors
## 15 Adam M Jr 3.98 Yes
## 19 Faith F Jr 2.50 Yes
## 18 Derek M Jr 3.10 Yes
The same method will work for non-Honors students.
set.seed(17365)
nonhonors=subset(FakeSchool,Honors=="No")
nonhonors.samp=popsamp(4,nonhonors)
nonhonors.samp
## Students Sex class GPA Honors
## 26 Frank M Sr 2.0 No
## 28 Grace F Sr 1.4 No
## 13 Eric M So 2.1 No
## 25 Diana F Sr 2.9 No
We can put this together to create our stratified sample.
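A sketch of the combining step and the resulting sample mean (object names are illustrative):
strat.samp <- rbind(honors.samp, nonhonors.samp)   # stack the two stratum samples
strat.samp
xbar.strat <- mean(~GPA, data = strat.samp)
xbar.strat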
## Students Sex class GPA Honors
## 15 Adam M Jr 3.98 Yes
## 19 Faith F Jr 2.50 Yes
## 18 Derek M Jr 3.10 Yes
## 26 Frank M Sr 2.00 No
## 28 Grace F Sr 1.40 No
## 13 Eric M So 2.10 No
## 25 Diana F Sr 2.90 No
## [1] 2.568571
The sample mean for the stratified sample, \(\bar{x}_{strat}\), is 2.5685714.
Strengths
Representative of the population, because elements from all strata are included in the sample.
Ensures that specific groups are represented, sometimes even proportionally, in the sample.
Since each stratified sample will be distributed similarly, the amount of variability between samples is decreased.
Allows comparisons to be made between strata, if necessary. For example, a stratified sample allows you to easily compare the mean GPA of Honors students to the mean GPA of non-Honors students.
Weaknesses
Requires knowing, before sampling, which stratum every element of the population belongs to; this information is not always available.
Can be more complex and costly to plan and carry out than a SRS.
In cluster sampling the population is partitioned into groups, called clusters. The clusters, which are composed of elements, are not necessarily of the same size. Each element should belong to one cluster only and none of the elements of the population should be left out.
The clusters, and not the elements, become the units to be sampled.
Whenever a cluster is sampled, every element within it is observed.
In cluster sampling, usually only a few clusters are sampled.
Hence, in order to increase the precision of the estimates, the population should be partitioned into clusters in such a way that the clusters will have similar mean values. As the elements inside the clusters are not sampled, the variance within clusters does not contribute to the sampling variance of the estimators.
Cluster sampling is often more cost effective than other sampling designs, as one does not have to sample all the clusters. However, if the size of a cluster is large it might not be possible to observe all its elements.
Cluster sampling can be used when natural groups are evident in the population. The clusters should all be similar to each other: each cluster should be a small-scale representation of the population.
To take a cluster sample, a random sample of the clusters is chosen. The elements of the randomly chosen clusters make up the sample.
Let’s now take a cluster sample using the grade level (freshman, sophomore, junior, senior) of FakeSchool as the clusters. Let’s take a random sample of 2 of them. Remember that this is really a basic-level example (single-stage cluster sampling).
set.seed(17393)
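# Note: this draws two entries from the class column, which could in principle
# return the same class twice; sampling from unique(FakeSchool$class) would
# guarantee two distinct clusters. Here the draw returns two different classes.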
clusters=sample(FakeSchool$class,2,replace=FALSE)
clusters
## [1] Fr Sr
## Levels: Fr Jr So Sr
cluster1=subset(FakeSchool,class==clusters[1])
cluster2=subset(FakeSchool,class==clusters[2])
clust.samp=rbind(cluster1,cluster2)
clust.samp
## Students Sex class GPA Honors
## 1 Alice F Fr 3.80 Yes
## 2 Brad M Fr 2.60 Yes
## 3 Caleb M Fr 2.25 No
## 4 Daisy F Fr 2.10 No
## 5 Faye F Fr 2.00 No
## 6 Eva F Fr 1.80 No
## 7 Georg M Fr 1.40 No
## 22 Angela F Sr 4.00 Yes
## 23 Bob M Sr 3.80 Yes
## 24 Carl M Sr 3.10 No
## 25 Diana F Sr 2.90 No
## 26 Frank M Sr 2.00 No
## 27 Ed M Sr 1.50 No
## 28 Grace F Sr 1.40 No
xbar.clust=mean(clust.samp$GPA)
xbar.clust
## [1] 2.475
The sample mean for the clustered sample, \(\bar{x}_{clust}\), is 2.475.
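Since this example started from a known population, we can compare the four estimates with the true mean \(\mu\). A quick sketch, assuming the objects xbar.sys and xbar.strat were created as in the sketches above:
c(mu = mu,
  srs = xbar.srs,
  systematic = xbar.sys,
  stratified = xbar.strat,
  cluster = xbar.clust)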
Strengths
Often more cost effective and feasible than other designs, since only a few clusters need to be visited and every element within a sampled cluster is observed.
Weaknesses
Estimates often have higher standard errors than those obtained from other designs with the same number of listing units, because elements within the same cluster tend to be similar (see the discussion below).
Note that we have not discussed the error of the sampling estimates here; we will review some related ideas when talking about sample size.
In general, cluster sampling is more economical and feasible than SRS.
However, we must point out that the standard errors of estimates obtained from cluster sampling are often high compared with those obtained from samples of the same number of listing units chosen by other sampling designs.
The reason for this situation is that listing units within the same cluster are often homogeneous with respect to many characteristics.
For example, households on the same block are often quite similar with respect to socioeconomic status, ethnicity, and other variables.
Because of homogeneity among listing units within the same cluster, selection of more than one household within the same cluster, as is done in cluster sampling, is in a sense redundant.
The effect of this redundancy becomes evident in the high standard errors of estimates that are often seen in cluster sampling.
If we were to choose between cluster sampling and some alternative design solely on the basis of cost or feasibility, cluster sampling would inevitably be the sampling design of choice.
On the other hand, if we were to choose a design solely on the basis of reliability of estimates, then cluster sampling would rarely be the design of choice.
However, because it is possible to take a larger sample for a fixed cost with cluster sampling, greater precision may sometimes be attained than is possible with other methods.
Generally, in choosing between cluster sampling and alternatives, we use criteria that incorporate both reliability and cost.
In fact, we generally choose the sampling design that gives the lowest possible standard error at a specified cost or, conversely, the sampling design that yields, at the lowest cost, estimates having pre-specified standard error (precision).