Load the required library:

library(tigerstats)

Introduction

Recall that most fundamental epidemiological/biomedical research questions actually break down into 2 parts:

Both parts of answering research questions involve dealing with the sample.

To draw valid conclusions about any research question, we first need to make sure we are dealing with a good sample.

Here we will discuss various techniques for drawing samples, with some notes about the strengths and weaknesses of these different sampling techniques.

Population versus Sample

Before we go any further, we need to be clear about the distinction between a sample and a population.

Population

A population is the set of all subjects of interest.

Sample

A sample is the subset of the population in which we measure data.

Let’s consider these two definitions with a research question.

Research Question: In the United States, what is the mean height of adult males (18 years +)?

The population that we are dealing with in this case is all U.S. adult males. One way to find an exact answer to this research question would be to survey the entire population. However, this is nearly impossible! It would be much quicker and easier to measure only a subset of the population, a sample.

However, if we want our sample to be an accurate reflection of the population, we can’t just choose any sample that we wish. The way in which we collect our sample is very important.

For the time being, let’s suppose that we were able to choose an appropriate sample (and we’ll talk more about how this is done later).
Suppose that our sample of U.S. men is an accurate representation of the U.S. population of men. Then, we might discuss two different means: the mean height of the sample and the mean height of the population. These are both descriptions, as opposed to inferences. There are a couple of differences, however.

Mean Height of the Sample

Mean Height of the Population

Our goal is to use the information we’ve gathered from the sample to infer, or predict, something about the population. For our example, we want to predict the population mean, using our knowledge of the sample. The accuracy of our sample mean relies heavily upon how well our sample represents the population at large. If our sample does a poor job at representing the population, then any inferences that we make about the population are also going to be poor. Thus, it is very important to select a good sample!

Note: If we already knew everything about a population, it would be useless to gather a sample in order to infer something about the population. We would already have this information! Using statistics as an inferential tool means that you don’t have information about the entire population to start with. If you are able to sample the entire population, this would be called a census.
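To make the difference between the population mean and the sample mean concrete, here is a minimal sketch in base R. The population of heights is simulated, so the numbers used (10,000 men, mean 175 cm, standard deviation 7 cm) are purely hypothetical assumptions and not real U.S. data.

```r
# Hypothetical population: heights (cm) of 10,000 adult males.
# The mean and sd used here are illustrative assumptions only.
set.seed(1)
pop_heights <- rnorm(10000, mean = 175, sd = 7)

# Population mean: normally unknown in practice.
mu_height <- mean(pop_heights)

# Sample mean: computed from a random subset of 50 men.
samp_heights <- sample(pop_heights, size = 50)
xbar_height <- mean(samp_heights)

# The sample mean estimates the population mean, with some sampling error.
c(population = mu_height, sample = xbar_height)
```

In real studies only the second number is available; the first is what we try to infer.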

Types of Samples

There are 2 main kinds of sampling: non-random sampling and random sampling.

Non-random sampling is a sampling technique in which the researcher selects the sample based on subjective judgment rather than random selection. This method depends heavily on the expertise of the researchers. It is carried out by observation, and researchers use it widely in qualitative research. Not all members of the population have an equal chance of participating in the study, unlike random sampling, where each member of the population has a known chance of being selected.

Non-random sampling (also called non-probability sampling) is most useful for exploratory studies such as a pilot survey (deploying a survey to a smaller sample than the pre-determined sample size). Researchers use it in studies where random sampling is not possible because of time or cost considerations. We do not explore this issue further here; in practice it is often also tied to data availability.

Note that in clinical and epidemiological studies we assume that the ideal sampling scheme is a random sample of the target population, even if the treatment assignment in the sample is not random (unless you are running a randomized clinical trial, the gold standard!).

Random Sampling

There are basically four different methods of random sampling:

  • Simple Random Sampling (SRS)
  • Systematic Sampling
  • Stratified Sampling
  • Cluster Sampling

The simple random sample (SRS) is the basic type of sample. The other types are included to give you comparisons to the SRS and to help you in the future if you want to explore these topics further.

It will be helpful to work with an example as we describe each of these methods, so let’s use the following set of 28 students from FakeSchool as our population from which we will sample.

data(FakeSchool)
head(FakeSchool)
##   Students Sex class  GPA Honors
## 1    Alice   F    Fr 3.80    Yes
## 2     Brad   M    Fr 2.60    Yes
## 3    Caleb   M    Fr 2.25     No
## 4    Daisy   F    Fr 2.10     No
## 5     Faye   F    Fr 2.00     No
## 6      Eva   F    Fr 1.80     No

This is a data frame with 28 observations on the following 5 variables:

  • Students: Name of each student

  • Sex: sex of the student

  • class: class rank of the student

  • GPA: grade point average

  • Honors: whether or not the student is in the Honors Program

Keep in mind that we would not know information about an entire population in real life. We are using this “population” for demonstration purposes only.

Our goal is to describe how the different sampling techniques are implemented, and comment on the strengths and weaknesses of them.

We can easily compute the true mean GPA for the students at FakeSchool by averaging the values in the fourth column of the dataset.
This will be the population mean. We will call it \(\mu\) (“mu”).

mu <- mean(~GPA,data=FakeSchool)
mu
## [1] 2.766429

Again, the population parameter, \(\mu\), is not typically known! If it were known, there would be no reason to estimate it. However, the point of this example is to practice selecting different types of samples and to compare the performance of these different sampling techniques.

Simple Random Sample

In simple random sampling, for a given sample size \(n\) every set of \(n\) members of the population has the same chance to be the sample that is actually selected.

We often use the acronym SRS as an abbreviation for “simple random sampling”.

Intuitively, let’s think of simple random sampling as follows: we find a big box, and for each member of the population we put into the box a ticket that has the name of the individual written on it. All tickets are the same size and shape. Mix up the tickets thoroughly in the box. Then pull out a ticket at random, set it aside, pull out another ticket, set it aside, and so on until the desired number of tickets have been selected.

Let’s select a simple random sample of 7 elements without replacement.

We can accomplish this easily with the popsamp function from the tigerstats package. This function requires two pieces of information:

  • the size of the sample
  • the dataset from which to draw the sample

Remember that sampling without replacement means that once we draw an element from the population, we do not put it back, so it cannot be drawn again. We would not want to draw with replacement, as this could result in a sample containing the same person more than once, which would not be a good representation of the entire school. By default, the popsamp function samples without replacement. If you want to sample with replacement, you need to add a third argument to the function: replace=TRUE. In most cases we will sample without replacement. Some exceptions arise in propensity score-based methods, for the purpose of matching (see examples in block 3).
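The difference between the two modes can be sketched with the base R function sample(). This is only an illustration on a hypothetical toy population (the data frame toy_pop is made up); it is not taken from the tigerstats source and does not claim to show how popsamp works internally.

```r
# Hypothetical toy population of 10 labelled individuals.
toy_pop <- data.frame(id = 1:10, name = LETTERS[1:10])

set.seed(2024)
# Without replacement: the selected row indices are all distinct.
idx_without <- sample(nrow(toy_pop), size = 5, replace = FALSE)
srs_without <- toy_pop[idx_without, ]

# With replacement: the same row may be drawn more than once.
idx_with <- sample(nrow(toy_pop), size = 5, replace = TRUE)
srs_with <- toy_pop[idx_with, ]

# Check for repeated individuals in each sample.
anyDuplicated(idx_without) == 0   # always TRUE when replace = FALSE
anyDuplicated(idx_with) == 0      # may be FALSE when replace = TRUE
```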

Since we may want to access this sample later, it’s a good idea to store our sample in an object.

set.seed(314159)
srs <- popsamp(7,FakeSchool)
srs
##    Students Sex class GPA Honors
## 10    Chris   M    So 4.0    Yes
## 2      Brad   M    Fr 2.6    Yes
## 16 Brittany   F    Jr 3.9     No
## 20   Eliott   M    Jr 1.9     No
## 21    Garth   M    Jr 1.1     No
## 23      Bob   M    Sr 3.8    Yes
## 13     Eric   M    So 2.1     No

Let’s now calculate the mean GPA for the 7 sampled students. This will be the sample mean, \(\bar{x}_{srs}\). We will use the subscript ‘srs’ to remind ourselves that this is the sample mean for the simple random sample.

xbar.srs <- mean(~GPA,data=srs)
xbar.srs
## [1] 2.771429

Strengths

  • The selection of one element does not affect the selection of others.

  • Each possible sample, of a given size, has an equal chance of being selected.

  • Simple random samples tend to be good representations of the population.

  • Requires little knowledge of the population.

Weaknesses

  • If there are small subgroups within the population, a SRS may not give an accurate representation of that subgroup. In fact, it may not include it at all. This is especially true if the sample size is small, as in our example.

  • If the population is large and widely dispersed, it can be costly (both in time and money) to collect the data.
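The first weakness can be quantified. For a simple random sample without replacement, the chance of missing a subgroup entirely follows the hypergeometric distribution. The sketch below uses a population of 28 and a sample of 7 as in our example, but the subgroup sizes are hypothetical and the helper name p_miss_subgroup is ours.

```r
# Probability that an SRS of size n drawn from N individuals
# contains nobody from a subgroup of size s (hypergeometric).
p_miss_subgroup <- function(N, s, n) {
  dhyper(0, m = s, n = N - s, k = n)
}

# Hypothetical subgroup sizes in a population of 28, sample size 7.
sizes <- c(1, 2, 4)
data.frame(subgroup_size = sizes,
           p_missed = sapply(sizes, p_miss_subgroup, N = 28, n = 7))
```

Even a subgroup of 4 students (one seventh of the school) has a substantial chance of being left out of a sample this small.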

Systematic Sample

In a systematic sample, the members of the population are put in a row.
Then 1 out of every \(k\) members is selected.
The starting point is randomly chosen from the first \(k\) elements and then elements are sampled at the same location in each of the subsequent segments of size \(k\).

To illustrate the idea, let’s take a 1-in-4 systematic sample from our FakeSchool population.

We will start by randomly selecting our starting element.

set.seed(49464)
start=sample(1:4,1)
start
## [1] 4

So, we will start with element 4, which is Daisy, and choose every 4th element after that for our sample.

##    Students Sex class GPA Honors
## 4     Daisy   F    Fr 2.1     No
## 8    Andrea   F    So 4.0    Yes
## 12   Felipe   M    So 3.0     No
## 16 Brittany   F    Jr 3.9     No
## 20   Eliott   M    Jr 1.9     No
## 24     Carl   M    Sr 3.1     No
## 28    Grace   F    Sr 1.4     No
## [1] 2.771429

The mean GPA of the systematic sample, the sample mean, \(\bar{x}_{sys}\), is 2.7714286.
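The code that produced this sample is not shown above. One way such a selection could be written in base R is sketched below; the helper systematic_sample and the toy data frame are hypothetical and only illustrate the row-selection logic.

```r
# Take a 1-in-k systematic sample from the rows of a data frame.
# `start` defaults to a random position among the first k rows.
systematic_sample <- function(df, k, start = sample(seq_len(k), 1)) {
  stopifnot(k >= 1, start >= 1, start <= k)
  rows <- seq(from = start, to = nrow(df), by = k)
  df[rows, , drop = FALSE]
}

# Hypothetical ordered population of 28 units with made-up scores.
toy_pop <- data.frame(id = 1:28,
                      score = round(seq(1.3, 4.0, length.out = 28), 1))
sys_samp <- systematic_sample(toy_pop, k = 4, start = 4)
sys_samp$id            # positions selected
mean(sys_samp$score)   # sample mean of the made-up scores
```

Under the same assumptions, systematic_sample(FakeSchool, k = 4, start = start) would select rows 4, 8, ..., 28, which are the row numbers listed in the output above.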

Strengths

  • Assures a random sampling of the population.

  • When the population is an ordered list, a systematic sample could give a better representation of the population than a SRS.

  • Can be used in situations where a SRS is difficult or impossible. It is especially useful when the population that you are studying is arranged in time.

For example, suppose you are interested in the average amount of money that people spend at the grocery store on a Wednesday evening. A systematic sample could be taken by selecting every 10th person who walks into the store.

Weaknesses

  • Not every combination has an equal chance of being selected. Many combinations will never be selected using a systematic sample.

  • Beware of periodicity in the population. If the ordered list has a repeating pattern whose period lines up with the skip interval \(k\), the sample keeps landing on the same part of the pattern and may not be representative of the population.
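Both weaknesses can be seen in a small, deliberately artificial example. The vector toy_values below is hypothetical: every 4th unit is high, so the pattern has the same period as a 1-in-4 skip interval.

```r
# Hypothetical ordered list with a period of 4: every 4th unit is high.
toy_values <- rep(c(1, 1, 1, 5), times = 7)   # 28 units
k <- 4

# A 1-in-4 systematic design has only k possible samples, one per start.
sys_means <- sapply(seq_len(k), function(start) {
  mean(toy_values[seq(start, length(toy_values), by = k)])
})

mean(toy_values)   # population mean
sys_means          # sample mean for start = 1, 2, 3, 4

# For comparison: number of distinct simple random samples of size 7.
choose(28, 7)
```

Here no systematic sample comes close to the population mean: three of the four possible samples contain only low values and the fourth contains only high values.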

Stratified Sample

In a stratified sample, the population must first be separated into homogeneous groups, or strata.
Each element belongs to exactly one stratum, and each stratum consists of elements that are alike in some way.
A simple random sample is then drawn from each stratum, and these samples are combined to make the stratified sample.

Let’s take a stratified sample of 7 elements from FakeSchool using the following strata: Honors, Not Honors.
First, let’s determine how many elements belong to each stratum:

## Honors
##  No Yes 
##  16  12

So there are 12 Honors students at FakeSchool and 16 non-Honors students at FakeSchool.

There are various ways to determine how many students to include from each stratum. For example, you could choose to select the same number of students from each stratum.
Another strategy is to use a proportionate stratified sample.
In a proportionate stratified sample, the number of students selected from each stratum is proportional to that stratum's share of the population.
For example, \(\frac{12}{28} \times 100\% \approx 42.86\%\) of the population are Honors students.
This means that there should be \(\frac{12}{28} \times 7 = 3\) Honors students in the sample. So there should be \(7 - 3 = 4\) non-Honors students in the sample.
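The allocation arithmetic can be written out in a few lines of base R. The stratum counts are the ones reported in the table above; the variable names are ours, and rounding to whole students is an assumption that happens to be harmless here because the shares divide evenly.

```r
# Stratum sizes as reported in the table above.
stratum_sizes <- c(No = 16, Yes = 12)
n_total <- 7

# Proportionate allocation: sample size per stratum, rounded to integers.
shares <- stratum_sizes / sum(stratum_sizes)
allocation <- round(n_total * shares)

shares
allocation
sum(allocation) == n_total   # rounding can break this in general
```

With less convenient numbers the rounded allocations may not add up to the target sample size, in which case one stratum has to be adjusted by hand.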

Let’s go through the coding to draw these samples.
Notice how we use the subset function to pull out the Honors students from the rest of the population:

set.seed(1837)
honors=subset(FakeSchool,Honors=="Yes")
honors
##    Students Sex class  GPA Honors
## 1     Alice   F    Fr 3.80    Yes
## 2      Brad   M    Fr 2.60    Yes
## 8    Andrea   F    So 4.00    Yes
## 9     Betsy   F    So 4.00    Yes
## 10    Chris   M    So 4.00    Yes
## 11    Dylan   M    So 3.50    Yes
## 15     Adam   M    Jr 3.98    Yes
## 17   Cassie   F    Jr 3.75    Yes
## 18    Derek   M    Jr 3.10    Yes
## 19    Faith   F    Jr 2.50    Yes
## 22   Angela   F    Sr 4.00    Yes
## 23      Bob   M    Sr 3.80    Yes

Next, we take a SRS of size 3 from the Honors students:

honors.samp=popsamp(3,honors)
honors.samp
##    Students Sex class  GPA Honors
## 15     Adam   M    Jr 3.98    Yes
## 19    Faith   F    Jr 2.50    Yes
## 18    Derek   M    Jr 3.10    Yes

The same method will work for non-Honors students.

set.seed(17365)
nonhonors=subset(FakeSchool,Honors=="No") 
nonhonors.samp=popsamp(4,nonhonors) 
nonhonors.samp
##    Students Sex class GPA Honors
## 26    Frank   M    Sr 2.0     No
## 28    Grace   F    Sr 1.4     No
## 13     Eric   M    So 2.1     No
## 25    Diana   F    Sr 2.9     No

We can put this together to create our stratified sample.

##    Students Sex class  GPA Honors
## 15     Adam   M    Jr 3.98    Yes
## 19    Faith   F    Jr 2.50    Yes
## 18    Derek   M    Jr 3.10    Yes
## 26    Frank   M    Sr 2.00     No
## 28    Grace   F    Sr 1.40     No
## 13     Eric   M    So 2.10     No
## 25    Diana   F    Sr 2.90     No
## [1] 2.568571

The sample mean for the stratified sample, \(\bar{x}_{strat}\), is 2.5685714.
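The whole procedure (split into strata, draw an SRS in each, stack the pieces) can also be wrapped in one function. The sketch below is a base R illustration on a hypothetical toy population; stratified_sample and its arguments are our own names, not part of tigerstats.

```r
# Draw an SRS of a given size within each stratum and stack the results.
# `sizes` is a named vector: names are stratum labels, values are sample sizes.
stratified_sample <- function(df, strata_col, sizes) {
  parts <- lapply(names(sizes), function(s) {
    stratum <- df[df[[strata_col]] == s, , drop = FALSE]
    stratum[sample(nrow(stratum), sizes[[s]]), , drop = FALSE]
  })
  do.call(rbind, parts)
}

# Hypothetical population: 12 "Yes" and 16 "No" units with made-up scores.
set.seed(42)
toy_pop <- data.frame(
  id    = 1:28,
  group = rep(c("Yes", "No"), times = c(12, 16)),
  score = round(runif(28, min = 1, max = 4), 2)
)
strat_samp <- stratified_sample(toy_pop, "group", c(Yes = 3, No = 4))
table(strat_samp$group)
mean(strat_samp$score)
```

For the two samples already drawn above, a call such as rbind(honors.samp, nonhonors.samp) followed by the mean of the GPA column would presumably reproduce the combined table; that call is an assumption, since the original code for this step is not shown.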

Strengths

  • Representative of the population, because elements from all strata are included in the sample.

  • Ensures that specific groups are represented, sometimes even proportionally, in the sample.

  • Since each stratified sample will be distributed similarly, the amount of variability between samples is decreased.

  • Allows comparisons to be made between strata, if necessary. For example, a stratified sample allows you to easily compare the mean GPA of Honors students to the mean GPA of non-Honors students.

Weaknesses

  • Requires prior knowledge of the population. You have to know some characteristics of the population to be able to split it into strata.

Cluster Sample

In cluster sampling the population is partitioned into groups, called clusters. The clusters, which are composed of elements, are not necessarily of the same size. Each element should belong to one cluster only and none of the elements of the population should be left out.

The clusters, and not the elements, become the units to be sampled.

Whenever a cluster is sampled, every element within it is observed.

In cluster sampling, usually only a few clusters are sampled.

Hence, in order to increase the precision of the estimates, the population should be partitioned into clusters in such a way that the clusters will have similar mean values. As the elements inside the clusters are not sampled, the variance within clusters does not contribute to the sampling variance of the estimators.

Cluster sampling is often more cost effective than other sampling designs, as one does not have to sample all the clusters. However, if the size of a cluster is large it might not be possible to observe all its elements.

Cluster sampling could be used when natural groups are evident in the population.
The clusters should all be similar to each other: each cluster should be a small-scale representation of the population.
To take a cluster sample, a random sample of the clusters is chosen. The elements of the randomly chosen clusters make up the sample.

Let’s now take a cluster sample using the grade level (freshman, sophomore, junior, senior) of FakeSchool as the clusters.
We will take a random sample of 2 of them. Remember that this is a very basic example (single-stage cluster sampling).

set.seed(17393)
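# Caution: this draws two rows from the class column, so the same class can
# come up twice; drawing from unique(FakeSchool$class) would avoid that.
clusters=sample(FakeSchool$class,2,replace=FALSE)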
clusters
## [1] Fr Sr
## Levels: Fr Jr So Sr
cluster1=subset(FakeSchool,class==clusters[1])
cluster2=subset(FakeSchool,class==clusters[2])
clust.samp=rbind(cluster1,cluster2)
clust.samp
##    Students Sex class  GPA Honors
## 1     Alice   F    Fr 3.80    Yes
## 2      Brad   M    Fr 2.60    Yes
## 3     Caleb   M    Fr 2.25     No
## 4     Daisy   F    Fr 2.10     No
## 5      Faye   F    Fr 2.00     No
## 6       Eva   F    Fr 1.80     No
## 7     Georg   M    Fr 1.40     No
## 22   Angela   F    Sr 4.00    Yes
## 23      Bob   M    Sr 3.80    Yes
## 24     Carl   M    Sr 3.10     No
## 25    Diana   F    Sr 2.90     No
## 26    Frank   M    Sr 2.00     No
## 27       Ed   M    Sr 1.50     No
## 28    Grace   F    Sr 1.40     No
xbar.clust=mean(clust.samp$GPA)
xbar.clust
## [1] 2.475

The sample mean for the clustered sample, \(\bar{x}_{clust}\), is 2.475.
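A slightly more defensive version of single-stage cluster sampling draws from the distinct cluster labels, so that the selected clusters are guaranteed to differ and each cluster is equally likely to be chosen. The sketch below uses base R and a hypothetical toy population; cluster_sample is our own helper name.

```r
# Single-stage cluster sample: pick m distinct clusters at random,
# then keep every row belonging to the selected clusters.
cluster_sample <- function(df, cluster_col, m) {
  labels <- unique(as.character(df[[cluster_col]]))
  chosen <- sample(labels, size = m, replace = FALSE)
  df[df[[cluster_col]] %in% chosen, , drop = FALSE]
}

# Hypothetical population: 4 clusters of 7 units with made-up scores.
set.seed(7)
toy_pop <- data.frame(
  id      = 1:28,
  cluster = rep(c("Fr", "So", "Jr", "Sr"), each = 7),
  score   = round(runif(28, min = 1, max = 4), 2)
)
clust_toy <- cluster_sample(toy_pop, "cluster", m = 2)
unique(clust_toy$cluster)
mean(clust_toy$score)
```

Because whole clusters are kept, the final sample size depends on which clusters are drawn whenever the clusters differ in size.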

Strengths

  • Makes it possible to sample if there is no list of the entire population, but there is only a list of subpopulations.
    For example, there is not a list of all church members in the United States. However, there is a list of churches that you could sample and then acquire the members list from each of the selected churches.

Weaknesses

  • Not always representative of the population. Elements within clusters could be similar to one another based on some characteristic(s). This can lead to over-representation or under-representation of those characteristics in the sample.

Some considerations about the different sampling methods

Note that we have not yet discussed the error of the sampling estimates; we will review some related ideas when we talk about sample size.

In general, cluster sampling is more economical and feasible than SRS.

However, we must point out that the standard errors of estimates obtained from cluster sampling are often high compared with those obtained from samples of the same number of listing units chosen by other sampling designs.

The reason for this situation is that listing units within the same cluster are often homogeneous with respect to many characteristics.

For example, households on the same block are often quite similar with respect to socioeconomic status, ethnicity, and other variables.

Because of homogeneity among listing units within the same cluster, selection of more than one household within the same cluster, as is done in cluster sampling, is in a sense redundant.

The effect of this redundancy becomes evident in the high standard errors of estimates that are often seen in cluster sampling.
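A small simulation can illustrate this effect. Everything in the sketch below is hypothetical: the population is generated so that units within a cluster are much more alike than units from different clusters, and the two designs are compared at the same number of sampled units.

```r
set.seed(123)

# Hypothetical population: 20 clusters of 10 units each.
# Cluster means differ a lot (sd = 1); units within a cluster are alike (sd = 0.2).
n_clusters <- 20
cluster_size <- 10
cluster_id <- rep(seq_len(n_clusters), each = cluster_size)
cluster_effect <- rnorm(n_clusters, mean = 0, sd = 1)
y <- 3 + cluster_effect[cluster_id] + rnorm(n_clusters * cluster_size, sd = 0.2)

# One SRS of 20 units versus one cluster sample of 2 clusters (also 20 units).
srs_mean <- function() mean(sample(y, 20))
clust_mean <- function() mean(y[cluster_id %in% sample(n_clusters, 2)])

# Repeat each design many times and compare the spread of the estimates.
srs_means <- replicate(2000, srs_mean())
clust_means <- replicate(2000, clust_mean())

c(se_srs = sd(srs_means), se_cluster = sd(clust_means))
```

The spread of the cluster-sample means is clearly larger than that of the SRS means in this setup, because 20 units from 2 homogeneous clusters carry far less independent information than 20 units drawn from the whole population.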

If we were to choose between cluster sampling and some alternative design solely on the basis of cost or feasibility, cluster sampling would inevitably be the sampling design of choice.

On the other hand, if we were to choose a design solely on the basis of reliability of estimates, then cluster sampling would rarely be the design of choice.

However, because it is possible to take a larger sample for a fixed cost with cluster sampling, greater precision may sometimes be attained than is possible with other methods.

Generally, in choosing between cluster sampling and alternatives, we use criteria that incorporate both reliability and cost.

In fact, we generally choose the sampling design that gives the lowest possible standard error at a specified cost or, conversely, the sampling design that yields, at the lowest cost, estimates having pre-specified standard error (precision).