Load the required library:
library(tigerstats)
Recall that most fundamental epidemiological/biomedical research questions break down into two parts:
Descriptive Statistics: What relationship can we observe between the variables, in the sample?
Inferential Statistics: Supposing we see a relationship in the sample data, how much evidence does it provide for a relationship in the population? Do the data provide strong evidence for a relationship in the population, or could the relationship we see in the sample be due merely to chance variation in the sampling process that produced the data? [For the moment we ignore problems related to bias (systematic error); we consider here only the random variation related to the sampling mechanism.]
Both parts of answering research questions involve dealing with the sample.
In order to make valid conclusions about any research question, we first need to make sure we are dealing with a good sample.
Here we will discuss various techniques for drawing samples, with some notes about the strengths and weaknesses of these different sampling techniques.
An important distinction that we want to make sure has been made clear before we go any further is the distinction between a sample and a population.
A population is the set of all subjects of interest.
A sample is the subset of the population in which we measure data.
Let’s consider these two definitions with a research question.
Research Question: In the United States, what is the mean height of adult males (18 years +)?
The population that we are dealing with in this case is all U.S. adult males. One way to find an exact answer to this research question would be to survey the entire population. However, this is nearly impossible! It would be much quicker and easier to measure only a subset of the population, a sample.
However, if we want our sample to be an accurate reflection of the population, we can’t just choose any sample that we wish. The way in which we collect our sample is very important.
For the time being, let’s suppose that we were able to choose an
appropriate sample (and we’ll talk more about how this is done
later).
Suppose that our sample of U.S. men is an accurate representation of the
U.S. population of men. Then, we might discuss two different means: the
mean height of the sample and the mean height of the
population. These are both descriptions, as
opposed to inferences. There are a couple of
differences, however.
Mean Height of the Sample: a statistic, computed directly from the measured subjects; its value varies from sample to sample.
Mean Height of the Population: a parameter, a fixed characteristic of the whole population that in practice is unknown.
Our goal is to use the information we’ve gathered from the sample to infer, or predict, something about the population. For our example, we want to predict the population mean, using our knowledge of the sample. The accuracy of our sample mean relies heavily upon how well our sample represents the population at large. If our sample does a poor job at representing the population, then any inferences that we make about the population are also going to be poor. Thus, it is very important to select a good sample!
Note: If we already knew everything about a population, it would be useless to gather a sample in order to infer something about the population; we would already have this information! Using statistics as an inferential tool means that you don’t have information about the entire population to start with. (Measuring the entire population, rather than a sample, is called a census.)
There are 2 main kinds of sampling:
Random Sampling
Non-Random Sampling
Non-random sampling is defined as a sampling technique in which the researcher selects the sample based on subjective judgment rather than random selection. This method depends heavily on the expertise of the researcher; it is carried out by observation, and it is widely used in qualitative research. Not all members of the population have a chance of participating in the study, unlike random sampling, where each member of the population has a known chance of being selected. Non-probability sampling is most useful for exploratory studies such as a pilot survey (deploying a survey to a sample smaller than the pre-determined sample size). Researchers also use it in studies where random probability sampling is impossible due to time or cost constraints (often related to data-availability issues as well); we do not explore this further here. Note that in clinical and epidemiological studies we assume that the ideal sampling scheme is a random sample of the target population, even if the treatment assignment within the sample is not random (unless you are running a randomized clinical trial, the gold standard!).
There are basically four different methods of random sampling: simple random sampling, systematic sampling, stratified sampling, and cluster sampling.
The simple random sample (SRS) is the basic type of sample. The other types are included to give you comparisons to the SRS and to help you if you later want to explore these topics more deeply.
It will be helpful to work with an example as we describe each of these methods, so let’s use the following set of 28 students from FakeSchool as our population from which we will sample.
data(FakeSchool)
head(FakeSchool)
## Students Sex class GPA Honors
## 1 Alice F Fr 3.80 Yes
## 2 Brad M Fr 2.60 Yes
## 3 Caleb M Fr 2.25 No
## 4 Daisy F Fr 2.10 No
## 5 Faye F Fr 2.00 No
## 6 Eva F Fr 1.80 No
This is a data frame with 28 observations on the following 5 variables:
Students: Name of each student
Sex: sex of the student
class: class rank of the student
GPA: grade point average
Honors: whether or not the student is in the Honors Program
Keep in mind that in real life we would not have information about an entire population; we are using this “population” for demonstration purposes only.
Our goal is to describe how the different sampling techniques are implemented, and comment on the strengths and weaknesses of them.
We can easily compute the true mean GPA for the students at
FakeSchool by averaging the values in the fourth column of the
dataset.
This will be the population mean. We will call it \(\mu\) (“mu”).
mu <- mean(~GPA,data=FakeSchool)
mu
## [1] 2.766429
Again, the population parameter, \(\mu\), is not typically known! If it were known, there would be no reason to estimate it. However, the point of this example is to practice selecting different types of samples and to compare how these different sampling techniques perform.
In simple random sampling, for a given sample size \(n\) every set of \(n\) members of the population has the same chance to be the sample that is actually selected.
We often use the acronym SRS as an abbreviation for “simple random sampling”.
Intuitively, let’s think of simple random sampling as follows: we find a big box, and for each member of the population we put into the box a ticket that has the name of the individual written on it. All tickets are the same size and shape. Mix up the tickets thoroughly in the box. Then pull out a ticket at random, set it aside, pull out another ticket, set it aside, and so on until the desired number of tickets have been selected.
Let’s select a simple random sample of 7 elements without replacement. We can accomplish this easily with the built-in function popsamp in R. This function requires two pieces of information: the size of the sample to draw and the population (data frame) to draw it from.
Remember that sampling without replacement means that once
we draw an element from the population, we do not put it back so that it
can be drawn again. We would not want to draw with replacement
as this could possibly result with a sample containing the same person
more than once. This would not be a good representation of the entire
school. By default, the popsamp
function always samples
without replacement. If you want to sample with replacement, you would
need to add a third argument to the function: replace=TRUE
.
Typically, we will sample without replacement. Some exceptions arise in propensity score-based methods, for the purpose of matching (see examples in block 3).
Since we may want to access this sample later, it’s a good idea to store our sample in an object.
set.seed(314159)
srs <- popsamp(7,FakeSchool)
srs
## Students Sex class GPA Honors
## 10 Chris M So 4.0 Yes
## 2 Brad M Fr 2.6 Yes
## 16 Brittany F Jr 3.9 No
## 20 Eliott M Jr 1.9 No
## 21 Garth M Jr 1.1 No
## 23 Bob M Sr 3.8 Yes
## 13 Eric M So 2.1 No
Let’s now calculate the mean GPA for the 7 sampled students. This will be the sample mean, \(\bar{x}_{srs}\). We will use the subscript ‘srs’ to remind ourselves that this is the sample mean for the simple random sample.
xbar.srs <- mean(~GPA,data=srs)
xbar.srs
## [1] 2.771429
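As a cross-check, the same kind of SRS can be drawn with base R alone. This is a minimal sketch (with the same seed it will not necessarily reproduce popsamp’s selection):
# Draw 7 row indices without replacement, then keep those rows
set.seed(314159)
srs.base <- FakeSchool[sample(nrow(FakeSchool), size = 7, replace = FALSE), ]
srs.base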
Strengths
The selection of one element does not affect the selection of others.
Each possible sample, of a given size, has an equal chance of being selected.
Simple random samples tend to be good representations of the population.
Requires little knowledge of the population.
Weaknesses
If there are small subgroups within the population, a SRS may not give an accurate representation of that subgroup. In fact, it may not include it at all. This is especially true if the sample size is small, as in our example.
If the population is large and widely dispersed, it can be costly (both in time and money) to collect the data.
In a systematic sample, the members of the population are put in a row. Then 1 out of every \(k\) members is selected. The starting point is randomly chosen from the first \(k\) elements, and then elements are sampled at the same location in each subsequent segment of size \(k\).
To illustrate the idea, let’s take a 1-in-4 systematic sample from our FakeSchool population.
We will start by randomly selecting our starting element.
set.seed(49464)
start=sample(1:4,1)
start
## [1] 4
So we will start with element 4, which is Daisy, and choose every 4th element after that for our sample.
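The code that constructs the systematic sample is not shown in the source; a minimal sketch consistent with the output below is:
# Keep every 4th row, starting from the randomly chosen start (rows 4, 8, ..., 28)
sys.samp <- FakeSchool[seq(from = start, to = nrow(FakeSchool), by = 4), ]
sys.samp
# Mean GPA of the systematic sample
xbar.sys <- mean(~GPA, data = sys.samp)
xbar.sys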
## Students Sex class GPA Honors
## 4 Daisy F Fr 2.1 No
## 8 Andrea F So 4.0 Yes
## 12 Felipe M So 3.0 No
## 16 Brittany F Jr 3.9 No
## 20 Eliott M Jr 1.9 No
## 24 Carl M Sr 3.1 No
## 28 Grace F Sr 1.4 No
## [1] 2.771429
The mean GPA of the systematic sample, the sample mean, \(\bar{x}_{sys}\), is 2.7714286.
Strengths
Assures a random sampling of the population.
When the population is an ordered list, a systematic sample could give a better representation of the population than a SRS.
Can be used in situations where a SRS is difficult or impossible. It is especially useful when the population that you are studying is arranged in time.
For example, suppose you are interested in the average amount of money that people spend at the grocery store on a Wednesday evening. A systematic sample could be drawn by selecting every 10th person who walks into the store.
Weaknesses
Not every combination has an equal chance of being selected. Many combinations will never be selected using a systematic sample.
Beware of periodicity in the population. If the skip interval matches some pattern in the ordered list, the sample may not be representative of the population.
In a stratified sample, the population must first be separated into homogeneous groups, or strata. Each element belongs to only one stratum, and each stratum consists of elements that are alike in some way. A simple random sample is then drawn from each stratum, and these are combined to form the stratified sample.
Let’s take a stratified sample of 7 elements from FakeSchool using
the following strata: Honors, Not Honors.
First, let’s determine how many elements belong to each stratum:
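The tabulation code is not shown in the source; one way to produce the count below (a sketch) is:
xtabs(~Honors, data = FakeSchool)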
## Honors
## No Yes
## 16 12
So there are 12 Honors students at FakeSchool and 16 non-Honors students at FakeSchool.
There are various ways to determine how many students to include from
each stratum. For example, you could choose to select the same number of
students from each stratum.
Another strategy is to use a proportionate stratified
sample.
In a proportionate stratified sample, the number of students
selected from each stratum is proportional to the representation of the
strata in the population.
For example, \(\frac{12}{28} \times 100\% \approx 42.9\%\) of the population are Honors students.
This means that there should be \(0.429 \times 7 \approx 3\) Honors students in the
sample, and hence \(7 - 3 = 4\) non-Honors students in the sample.
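These allocation counts can also be computed in code; a small sketch:
# Proportional allocation: stratum size / population size * total sample size
round(table(FakeSchool$Honors) / nrow(FakeSchool) * 7)  # 4 non-Honors, 3 Honors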
Let’s go through the coding to draw these samples.
Check out how we use the subset function to pull out the Honors students from the rest of the population:
set.seed(1837)
honors=subset(FakeSchool,Honors=="Yes")
honors
## Students Sex class GPA Honors
## 1 Alice F Fr 3.80 Yes
## 2 Brad M Fr 2.60 Yes
## 8 Andrea F So 4.00 Yes
## 9 Betsy F So 4.00 Yes
## 10 Chris M So 4.00 Yes
## 11 Dylan M So 3.50 Yes
## 15 Adam M Jr 3.98 Yes
## 17 Cassie F Jr 3.75 Yes
## 18 Derek M Jr 3.10 Yes
## 19 Faith F Jr 2.50 Yes
## 22 Angela F Sr 4.00 Yes
## 23 Bob M Sr 3.80 Yes
Next, we take a SRS of size 3 from the Honors students:
honors.samp=popsamp(3,honors)
honors.samp
## Students Sex class GPA Honors
## 15 Adam M Jr 3.98 Yes
## 19 Faith F Jr 2.50 Yes
## 18 Derek M Jr 3.10 Yes
The same method will work for non-Honors students.
set.seed(17365)
nonhonors=subset(FakeSchool,Honors=="No")
nonhonors.samp=popsamp(4,nonhonors)
nonhonors.samp
## Students Sex class GPA Honors
## 26 Frank M Sr 2.0 No
## 28 Grace F Sr 1.4 No
## 13 Eric M So 2.1 No
## 25 Diana F Sr 2.9 No
We can put this together to create our stratified sample.
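The assembly code is not shown in the source; a sketch consistent with the output below:
# Combine the two stratum samples and compute the sample mean GPA
strat.samp <- rbind(honors.samp, nonhonors.samp)
strat.samp
xbar.strat <- mean(~GPA, data = strat.samp)
xbar.strat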
## Students Sex class GPA Honors
## 15 Adam M Jr 3.98 Yes
## 19 Faith F Jr 2.50 Yes
## 18 Derek M Jr 3.10 Yes
## 26 Frank M Sr 2.00 No
## 28 Grace F Sr 1.40 No
## 13 Eric M So 2.10 No
## 25 Diana F Sr 2.90 No
## [1] 2.568571
The sample mean for the stratified sample, \(\bar{x}_{strat}\), is 2.5685714.
Strengths
Representative of the population, because elements from all strata are included in the sample.
Ensures that specific groups are represented, sometimes even proportionally, in the sample.
Since each stratified sample will be distributed similarly, the amount of variability between samples is decreased.
Allows comparisons to be made between strata, if necessary. For example, a stratified sample allows you to easily compare the mean GPA of Honors students to the mean GPA of non-Honors students.
Weaknesses
Requires knowing, for every element of the population, the stratum to which it belongs, which may be difficult or costly to establish.
More complex to organize and analyze than a simple random sample.
In cluster sampling the population is partitioned into groups, called clusters. The clusters, which are composed of elements, are not necessarily of the same size. Each element should belong to one cluster only and none of the elements of the population should be left out.
The clusters, and not the elements, become the units to be sampled.
Whenever a cluster is sampled, every element within it is observed.
In cluster sampling, usually only a few clusters are sampled.
Hence, in order to increase the precision of the estimates, the population should be partitioned into clusters in such a way that the clusters will have similar mean values. As the elements inside the clusters are not sampled, the variance within clusters does not contribute to the sampling variance of the estimators.
Cluster sampling is often more cost effective than other sampling designs, as one does not have to sample all the clusters. However, if the size of a cluster is large it might not be possible to observe all its elements.
Cluster sampling can be used when natural groups are evident in the population. The clusters should all be similar to each other: each cluster should be a small-scale representation of the population. To take a cluster sample, a random sample of the clusters is chosen. The elements of the randomly chosen clusters make up the sample.
Let’s now take a cluster sample using the grade level (freshman, sophomore, junior, senior) of FakeSchool as the clusters. We take a random sample of 2 of them. Remember that this is a basic-level example (single-stage cluster sampling).
set.seed(17393)
# Note: this samples 2 values from the class column itself; with this seed we get
# two distinct levels, but sampling from unique(FakeSchool$class) would guarantee
# that the same cluster cannot be drawn twice.
clusters=sample(FakeSchool$class,2,replace=FALSE)
clusters
## [1] Fr Sr
## Levels: Fr Jr So Sr
cluster1=subset(FakeSchool,class==clusters[1])
cluster2=subset(FakeSchool,class==clusters[2])
clust.samp=rbind(cluster1,cluster2)
clust.samp
## Students Sex class GPA Honors
## 1 Alice F Fr 3.80 Yes
## 2 Brad M Fr 2.60 Yes
## 3 Caleb M Fr 2.25 No
## 4 Daisy F Fr 2.10 No
## 5 Faye F Fr 2.00 No
## 6 Eva F Fr 1.80 No
## 7 Georg M Fr 1.40 No
## 22 Angela F Sr 4.00 Yes
## 23 Bob M Sr 3.80 Yes
## 24 Carl M Sr 3.10 No
## 25 Diana F Sr 2.90 No
## 26 Frank M Sr 2.00 No
## 27 Ed M Sr 1.50 No
## 28 Grace F Sr 1.40 No
xbar.clust=mean(clust.samp$GPA)
xbar.clust
## [1] 2.475
The sample mean for the clustered sample, \(\bar{x}_{clust}\), is 2.475.
Strengths
Often more economical and more feasible than other designs, since only the sampled clusters need to be observed; this is especially true for large, widely dispersed populations (see the discussion below).
Weaknesses
Standard errors of the estimates are often higher than for other designs with the same number of listing units, because elements within the same cluster tend to be homogeneous (see the discussion below).
Note that we have not discussed the error of the sampling estimates here; we will review some related ideas when talking about sample size.
In general, cluster sampling is more economical and feasible than SRS.
However, we must point out that the standard errors of estimates obtained from cluster sampling are often high compared with those obtained from samples of the same number of listing units chosen by other sampling designs.
The reason for this situation is that listing units within the same cluster are often homogeneous with respect to many characteristics.
For example, households on the same block are often quite similar with respect to socioeconomic status, ethnicity, and other variables.
Because of homogeneity among listing units within the same cluster, selection of more than one household within the same cluster, as is done in cluster sampling, is in a sense redundant.
The effect of this redundancy becomes evident in the high standard errors of estimates that are often seen in cluster sampling.
If we were to choose between cluster sampling and some alternative design solely on the basis of cost or feasibility, cluster sampling would inevitably be the sampling design of choice.
On the other hand, if we were to choose a design solely on the basis of reliability of estimates, then cluster sampling would rarely be the design of choice.
However, because it is possible to take a larger sample for a fixed cost with cluster sampling, greater precision may sometimes be attained than is possible with other methods.
Generally, in choosing between cluster sampling and alternatives, we use criteria that incorporate both reliability and cost.
In fact, we generally choose the sampling design that gives the lowest possible standard error at a specified cost or, conversely, the sampling design that yields, at the lowest cost, estimates having pre-specified standard error (precision).
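One standard way to quantify this trade-off, not developed further in these notes, is the design effect: for single-stage cluster samples with clusters of equal size \(m\), the variance inflation relative to a SRS of the same total size is approximately \(\text{deff} = 1 + (m - 1)\rho\), where \(\rho\) is the intraclass correlation. The more homogeneous the clusters (the larger \(\rho\)), the higher the standard errors for a given total sample size.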
Randomization is a design technique that ensures the comparability of treatment groups in clinical trials by introducing a deliberate element of chance. First of all, it tends to balance known and unknown covariates and thus to produce structural equality of the treatment groups (exchangeability). Second, by ensuring effective blinding of treatment allocations from investigators and patients, randomization helps to avoid bias caused by the selection of patients. Finally, randomization contributes to the internal validity of a trial, which provides the basis for statistical inference (and the estimation of causal effects in the population). The randomization concept was first introduced by R. A. Fisher in his classic book “The Design of Experiments” (1935). The importance of randomization for clinical trials was first noted in the 1940s by the epidemiologist Sir A. Bradford Hill, who realized that successful blinding of treatment allocations was impossible without randomization. Since that time, regulators have advocated the use of randomization in their guidelines, and several different randomization procedures have been proposed in the literature. It has been noticed that different randomization procedures behave differently, e.g., concerning their susceptibility to bias and their potential to control power and type I error probability. An overview containing the latest developments can be found in Rosenberger and Lachin (2016).
We will use here the R library randomizr to illustrate some basic examples of randomization procedures.
Install and then load the required library:
library(randomizr)
There are two main purposes for the generation of randomization sequences:
1. The generation of a single sequence for the allocation of patients in a clinical trial.
2. The generation of multiple sequences in order to assess the statistical properties of a randomization procedure.
Suppose that we are conducting an experiment among the 592 individuals in the built-in HairEyeColor dataset. As we’ll see, there are many ways to randomly assign subjects to treatments. We’ll step through three common designs, each associated with one of the randomizr functions: simple_ra(), complete_ra(), and block_ra(). [Not covered here: cluster_ra() and block_and_cluster_ra().]
We first need to transform the dataset, which has each row describe a type of subject, to a new dataset in which each row describes an individual subject.
# Load built-in dataset
data(HairEyeColor)
HairEyeColor <- data.frame(HairEyeColor)
head(HairEyeColor)
## Hair Eye Sex Freq
## 1 Black Brown Male 32
## 2 Brown Brown Male 53
## 3 Red Brown Male 10
## 4 Blond Brown Male 3
## 5 Black Blue Male 11
## 6 Brown Blue Male 50
dim(HairEyeColor)
## [1] 32 4
# Transform so each row is a subject
# Columns describe subject's hair color, eye color, and gender
hec <- HairEyeColor[rep(1:nrow(HairEyeColor),
times = HairEyeColor$Freq), 1:3]
N <- nrow(hec)
# Fix the rownames
rownames(hec) <- NULL
head(hec)
## Hair Eye Sex
## 1 Black Brown Male
## 2 Black Brown Male
## 3 Black Brown Male
## 4 Black Brown Male
## 5 Black Brown Male
## 6 Black Brown Male
dim(hec)
## [1] 592 3
Typically, researchers know some basic information about their subjects before deploying treatment. For example, they usually know how many subjects there are in the study population (N), and they usually know some basic demographic information about each subject.
Our new dataset has 592 subjects. We have three pre-treatment covariates, Hair, Eye, and Sex, which describe the hair color, eye color, and gender of each subject.
For didactic purposes, we now will create simulated potential outcomes.
We’ll call the untreated outcome Y0 and we’ll call the treated outcome Y1.
Imagine that in the absence of any intervention, the outcome (Y0) is correlated with our pre-treatment covariates.
Imagine further that the effectiveness of the intervention also varies according to these covariates, i.e., the difference between Y1 and Y0 is correlated with the pre-treatment covariates.
If we were really running an experiment, we would only observe either Y0 or Y1 for each subject, but since we are simulating, we generate both.
Our inferential target is the average treatment effect (ATE), which is defined as the average difference between Y0 and Y1.
# Set the random seed for reproducibility
set.seed(343)
# Create untreated and treated outcomes for all subjects
hec <- within(hec,{
Y0 <- rnorm(n = N,mean = (2*as.numeric(Hair) + -4*as.numeric(Eye) + -6*as.numeric(Sex)), sd = 5)
Y1 <- Y0 + 6*as.numeric(Hair) + 4*as.numeric(Eye) + 2*as.numeric(Sex)
})
hist(hec$Y0, xlab="Potential Outcome Score if untreated")
hist(hec$Y1, xlab="Potential Outcome Score if treated")
hist(hec$Y1-hec$Y0, xlab="Distribution of potential outcomes differences")
abline(v=25.26, col="red", lwd=2)
# Calculate the true ATE
with(hec, mean(Y1 - Y0))
## [1] 25.26351
Therefore, the ATE in the population is 25.26: imagine, for example, that the intervention is aimed at improving a certain “self-confidence” score; in this example there is an average improvement in the score of 25.26 units in the treated with respect to the untreated. This comes from an ideal experiment: treating all subjects vs. not treating all subjects! We are now ready to randomly allocate treatment assignments to subjects. Let’s start by contrasting simple and complete random assignment.
Simple random assignment assigns every subject to treatment with an equal probability by flipping a (possibly weighted) coin for each subject. The main trouble with simple random assignment is that the number of subjects assigned to treatment is itself random: depending on the realized assignment, a different number of subjects might end up in each group.
The simple_ra() function has one required argument N, the total number of subjects. If no other arguments are specified, simple_ra() assumes a two-group design and a 0.50 probability of assignment.
Z <- simple_ra(N = N)
table(Z)
## Z
## 0 1
## 301 291
To change the probability of assignment (i.e. using a weighted coin), you should specify the prob argument:
Z <- simple_ra(N = N, prob = 0.30)
table(Z)
## Z
## 0 1
## 402 190
If you specify num_arms without changing prob_each, simple_ra() will assume equal probabilities across all arms.
Z <- simple_ra(N = N, num_arms = 3)
table(Z)
## Z
## T1 T2 T3
## 191 215 186
You can also specify the exact probabilities for each of your multiple arms. These probabilities must sum to 1.
Z <- simple_ra(N = N, prob_each = c(.2, .2, .6))
table(Z)
## Z
## T1 T2 T3
## 118 119 355
You can also name your treatment arms.
Z <- simple_ra(N = N, prob_each = c(.2, .2, .6),
conditions=c("control", "placebo", "treatment"))
table(Z)
## Z
## control placebo treatment
## 132 108 352
Complete random assignment is very similar to simple random assignment, except that the researcher can specify exactly how many units are assigned to each condition.
The syntax for complete_ra() is very similar to that of simple_ra(). The argument m is the number of units assigned to treatment in two-arm designs; it is analogous to simple_ra()’s prob. Similarly, the argument m_each is analogous to prob_each.
If you only specify N, complete_ra() assigns exactly half of the subjects to treatment.
Z <- complete_ra(N = N)
table(Z)
## Z
## 0 1
## 296 296
To change the number of units assigned, specify the m argument:
Z <- complete_ra(N = N, m = 200)
table(Z)
## Z
## 0 1
## 392 200
If you specify multiple arms, complete_ra() will assign an equal (within rounding!) number of units to treatment.
Z <- complete_ra(N = N, num_arms = 3)
table(Z)
## Z
## T1 T2 T3
## 197 198 197
You can also specify exactly how many units should be assigned to each arm. The total of m_each must equal N.
Z <- complete_ra(N = N, m_each = c(100, 200, 292))
table(Z)
## Z
## T1 T2 T3
## 100 200 292
You can also name your treatment arms:
Z <- complete_ra(N = N, m_each = c(100, 200, 292),
conditions = c("control", "placebo", "treatment"))
table(Z)
## Z
## control placebo treatment
## 100 200 292
Stratified random assignment is a powerful tool when used well. In this design, subjects are sorted into strata (usually according to their pre-treatment covariates), and then complete random assignment is conducted within each stratum. For example, a researcher might stratify on gender, assigning exactly half of the men and exactly half of the women to treatment.
Why stratify?
The first reason is related to treatment effect heterogeneity (HTE): is the treatment effect different for men versus women? Such heterogeneity could also be explored if complete random assignment had been used, but stratifying on a covariate defends a researcher (somewhat) against claims of data dredging.
The second reason is to increase precision. If the stratifying variables are associated with the outcome, then strata may help to decrease sampling variability.
It’s important, however, not to overstate these advantages. The gains from a stratified design can often be realized through covariate adjustment alone.
Stratifying can also produce some complications for estimation. For example, it can produce different probabilities of assignment for different subjects, related to their characteristics. This complication is typically addressed in one of two ways: “controlling for strata” in a regression context, or the use of inverse probability weights (IPW), in which units are weighted by the inverse of the probability that the unit is in the condition that it is in (see for more details examples in block 3).
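As a sketch of the IPW idea in code, assuming randomizr’s declaration interface (declare_ra(), conduct_ra(), and obtain_condition_probabilities(), which are part of the package but not used elsewhere in these notes):
# Declare a block (stratum) design with unequal assignment probabilities
decl <- declare_ra(blocks = hec$Hair, prob_each = c(.3, .7))
Z <- conduct_ra(decl)                              # draw an assignment
p.obs <- obtain_condition_probabilities(decl, Z)   # probability of each unit's observed condition
ipw <- 1 / p.obs                                   # inverse probability weights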
In this R library the word “block” is used in place of “stratum”. I think this name could be confused with block randomization, which is not what this function implements (to explore block randomization, see the blockrand package: https://cran.r-project.org/web/packages/blockrand/index.html).
The only required argument to block_ra() is blocks, which is a vector of length N that describes which block (strata) a unit belongs to. Blocks (strata) can be a factor, character, or numeric variable. If no other arguments are specified, block_ra() assigns an approximately equal proportion of each block to treatment.
Z <- block_ra(blocks = hec$Hair)
table(Z, hec$Hair)
##
## Z Black Brown Red Blond
## 0 54 143 35 64
## 1 54 143 36 63
For multiple treatment arms, use the num_arms argument, with or without the conditions argument:
Z <- block_ra(blocks = hec$Hair, num_arms = 3)
table(Z, hec$Hair)
##
## Z Black Brown Red Blond
## T1 36 95 24 43
## T2 36 96 23 42
## T3 36 95 24 42
Z <- block_ra(blocks = hec$Hair, conditions = c("Control", "Placebo", "Treatment"))
table(Z, hec$Hair)
##
## Z Black Brown Red Blond
## Control 36 95 23 42
## Placebo 36 96 24 43
## Treatment 36 95 24 42
block_ra() provides a number of ways to adjust the number of subjects assigned to each condition. The prob_each argument describes what proportion of each block should be assigned to each treatment arm. Note, of course, that block_ra() still uses complete random assignment within each block; the appropriate number of units to assign to treatment within each block is automatically determined.
Z <- block_ra(blocks = hec$Hair, prob_each = c(.3, .7))
table(Z, hec$Hair)
##
## Z Black Brown Red Blond
## 0 32 86 21 38
## 1 76 200 50 89
For finer control, use the block_m_each argument, which takes a matrix with as many rows as there are blocks, and as many columns as there are treatment conditions. Remember that the rows are in the same order as sort(unique(blocks)), a command that is good to run before constructing a block_m_each matrix.
sort(unique(hec$Hair))
## [1] Black Brown Red Blond
## Levels: Black Brown Red Blond
block_m_each <- rbind(c(78, 30),
c(186, 100),
c(51, 20),
c(87,40))
block_m_each
## [,1] [,2]
## [1,] 78 30
## [2,] 186 100
## [3,] 51 20
## [4,] 87 40
Z <- block_ra(blocks = hec$Hair, block_m_each = block_m_each)
table(Z, hec$Hair)
##
## Z Black Brown Red Blond
## 0 78 186 51 87
## 1 30 100 20 40
How to create blocks? In the HairEyeColor dataset, we could make blocks for each unique combination of hair color, eye color, and sex.
blocks <- with(hec, paste(Hair, Eye, Sex, sep = "_"))
Z <- block_ra(blocks = blocks)
head(table(blocks, Z))
## Z
## blocks 0 1
## Black_Blue_Female 4 5
## Black_Blue_Male 5 6
## Black_Brown_Female 18 18
## Black_Brown_Male 16 16
## Black_Green_Female 1 1
## Black_Green_Male 2 1
Whenever you conduct a random assignment for use in an experiment, save it! At a minimum, the random assignment should be saved with an id variable in a csv.
hec <- within(hec,{
Z_blocked <- complete_ra(N = N, m_each = c(100, 200, 292),
conditions = c("control", "placebo", "treatment"))  # named Z_blocked, though this example saves a complete_ra assignment
id_var <- 1:nrow(hec)
})
write.csv(hec[,c("id_var", "Z_blocked")], file = "MyRandomAssignment.csv")
Note that there are many other packages that can be used to perform randomization procedures in R: https://cran.r-project.org/web/views/ClinicalTrials.html
Let’s now conduct a small simulation with our HairEyeColor dataset, in order to obtain a distribution of ATE estimates under both simple and complete random assignment. Remember that the true effect we previously obtained was 25.26.
sims <- 1000
# Set up empty vectors to collect results
simple_ests <- rep(NA, sims)
complete_ests <- rep(NA, sims)
# Loop through the simulation sims (1000) times
for(i in 1:sims){
hec <- within(hec,{
# Conduct both kinds of random assignment
Z_simple <- simple_ra(N = N)
Z_complete <- complete_ra(N = N)
# Define the observed outcomes
Y_simple <- Y1*Z_simple + Y0*(1-Z_simple)
Y_complete <- Y1*Z_complete + Y0*(1-Z_complete)
})
# Estimate ATE under both types of randomization
fit_simple <- lm(Y_simple ~ Z_simple, data=hec)
fit_complete <- lm(Y_complete ~ Z_complete, data=hec)
# Save the estimates
simple_ests[i] <- coef(fit_simple)[2]
complete_ests[i] <- coef(fit_complete)[2]
}
mean(simple_ests)
## [1] 25.24697
mean(complete_ests)
## [1] 25.27613
As we expect from randomized assignment, we obtain ATE estimates consistent with the true one.
Since we are statisticians, we always want to have a look at the variability of these estimates, so let’s do it. This simulation allows us to measure the standard error directly, since the vectors simple_ests and complete_ests describe the sampling distribution of each design. (In this case, the standard deviation of the simulated estimates is the standard error of the sampling distribution of the estimator.)
sd(simple_ests)
## [1] 0.6032299
sd(complete_ests)
## [1] 0.604826
Therefore, we can also evaluate 95% confidence intervals around the estimates using the normal approximation:
c(mean(simple_ests)- 1.96*sd(simple_ests), mean(simple_ests)+ 1.96*sd(simple_ests))
## [1] 24.06464 26.42930
c(mean(complete_ests)-1.96*sd(complete_ests),mean(complete_ests)+1.96*sd(complete_ests))
## [1] 24.09067 26.46159
Therefore, in both cases we have obtained an unbiased estimate of the true ATE! Nice work done by the randomization process that we applied.
Fisher RA (1935). The Design of Experiments. Oliver & Boyd.
Armitage P (1982). “The Role of Randomization in Clinical Trials.” Statistics in Medicine, 1(4), 345-352. doi:10.1002/sim.4780010412.
Chalmers I (1999). “Why Transition from Alternation to Randomisation in Clinical Trials Was Made.” BMJ, 319(7221), 1372. doi:10.1136/bmj.319.7221.1372.
ICH E9 (1998). “Statistical Principles for Clinical Trials.” Current version dated 1998-02-05, last accessed 2014-09-01. URL http://www.ich.org/.
Rosenberger WF, Lachin JM (2016). Randomization in Clinical Trials: Theory and Practice. John Wiley & Sons. doi:10.1002/9781118742112.
The randomised controlled trial (RCT) is regarded by many as the gold standard method for evaluating interventions. We have discussed some of the limitations of this approach that can make it less suitable for evaluating effectiveness in the real world. But in this section we’ll look at an example of RCT data and introduce the basic methods that can be used to analyse the results.
An RCT is effective simply because it is designed to counteract all of the systematic biases that can confound the relationship between the exposure and the outcome of interest.
RCTs have become such a bedrock of medical research that standards for reporting them have been developed. The CONSORT flowchart is a useful way of documenting the flow of participants through a trial. CONSORT stands for Consolidated Standards of Reporting Trials, which are endorsed by many medical journals. Indeed, if you plan to publish an intervention study in one of those journals, you are likely to be required to show you have followed the guidelines. The relevant information is available on the ‘Enhancing the QUAlity and Transparency Of health Research’ (EQUATOR) network website. The EQUATOR network site covers not only RCTs but also guidelines for the full spectrum of clinical research designs.
Statisticians often complain that researchers will come along with a collection of data and ask for advice as to how to analyse it. Sir Ronald Fisher (one of the most famous statisticians of all time) commented:
“To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.”
-Sir Ronald Fisher, Presidential Address to the First Indian Statistical Congress, 1938.
His point was that very often the statistician would have advised doing something different in the first place, had they been consulted at the outset. Once the data are collected, it may be too late to rescue the study from a fatal flaw.
The answer to the question “How should I analyse my data?” depends crucially on what hypothesis is being tested. In the case of an intervention trial, the hypothesis will usually be “did intervention X make a difference to outcome Y in people with condition Z?” There is, in this case, a clear null hypothesis: that the intervention was ineffective, and the outcome of the intervention group would have been just the same if it had not been done. The null hypothesis significance testing approach answers just that question: it tells you how likely your data are if the null hypothesis were true. To do that, you compare the distribution of outcome scores in the intervention group and the control group. In this context we don’t just look at the difference in means between the two groups; usually we consider whether that difference is greater than a pre-specified effect size.
To illustrate data analysis, we will use a real dataset that can be
retrieved from the ESRC data
archive. We will focus only on a small subset of the data, which
comes from an intervention study in which teaching assistants
administered an individual reading and language intervention to children
with Down syndrome. A wait-list RCT design was used, but we
will focus here on just the first two phases, in which half the children
were randomly assigned to intervention, and the remainder formed a
control group.
Several language and reading measures were included in the study, giving
a total of 11 outcomes. Here we will illustrate the analysis with just
one of the outcomes - letter-sound coding - which was administered at
baseline (t1) and immediately after the intervention (t2).
library(here)   # file paths relative to the project root
library(yarrr)  # provides pirateplot()
dsdat <- read.csv(here("datasets", "dse-rli-trial-data-archive.csv"))
dsdat <- dsdat[dsdat$included==1,]
dsdat$group <- as.factor(dsdat$group)
levels(dsdat$group)<-c("Intervention","Control")
dsdat$pretest <- dsdat$letter_sound_t1
dsdat$posttest <- dsdat$letter_sound_t2
pirateplot(posttest~group,theme=1,cex.lab=1.5,data=dsdat, point.o=1, xlab="", ylab="Post-intervention score")
The figure shows results on letter-sound coding after one group had received the intervention. This test had also been administered at baseline, but we will focus first just on the outcome results.
Raw data should always be inspected prior to any data analysis, in
order to just check that the distribution of scores looks sensible. One
hears of horror stories where, for instance, an error code of 999 got
included in an analysis, distorting all the statistics. Or where an
outcome score was entered as 10 rather than 100. Visualising the data is
useful when checking whether the results are in the right numerical
range for the particular outcome measure.
The pirate plot (https://www.psychologicalscience.org/observer/yarrr-the-pirates-guide-to-r)
is a useful way of showing means and distributions as well as individual
data points.
A related step is to check whether the distribution of the data meets the assumptions of the proposed statistical analysis. Many common statistical procedures assume that continuous variables are normally distributed. Statistical approaches to checking assumptions are beyond the scope of this section, but just eyeballing the data is useful and can detect obvious cases of non-normality, ceiling or floor effects, or clumpy data, where only certain values are possible. Data with these features may need special treatment (for example, scale transformations or non-parametric tests). For the data in the figure, although neither distribution is classically normal, we do not see major problems with ceiling or floor effects, and there is a reasonable spread of scores in both groups. To keep to basic methods, we will use parametric approaches (i.e. exploring differences in mean values!).
mytab <- psych::describeBy(dsdat$posttest, group=dsdat$group,mat=TRUE,digits=3)  # requires the psych package
mytab <- mytab[,c(2,4:6)]
colnames(mytab)<-c('Group','N','Mean','SD')
mytab[1:2,1]<-c("Intervention","Control")
library(flextable)  # for formatted tables
ft <- flextable(mytab)
ft
Group | N | Mean | SD |
---|---|---|---|
Intervention | 28 | 22.286 | 7.282 |
Control | 26 | 16.346 | 9.423 |
The next step is just to compute some basic statistics to get a feel for the effect size. As discussed, the standard deviation (SD) is a key statistic for measuring an intervention effect. In these results, one mean is higher than the other, but there is overlap between the groups. Statistical analysis gives us a way of quantifying how much confidence we can place in the group difference: in particular, how likely is it that there is no real impact of intervention and the observed results just reflect the play of chance. In this case we can see that the difference between means is around 6 points and the average SD is around 8, so the effect size (Cohen’s d) is about .75 - a large effect size for a language intervention.
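The back-of-the-envelope effect size can be checked in code; a sketch using the pooled SD (the exact value will differ slightly from the rounded figures above):
# Cohen's d from group means and the pooled SD
m <- tapply(dsdat$posttest, dsdat$group, mean)
s <- tapply(dsdat$posttest, dsdat$group, sd)
n <- tapply(dsdat$posttest, dsdat$group, length)
sd.pooled <- sqrt(((n[1]-1)*s[1]^2 + (n[2]-1)*s[2]^2) / (sum(n) - 2))
unname((m[1] - m[2]) / sd.pooled)  # roughly 0.7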
The simplest way of measuring the intervention effect is therefore to just compare outcome (posttest) measures on a t-test. We can use a one-tailed test with confidence, given that we anticipate outcomes will be better after intervention. One-tailed tests are often treated with suspicion, because they can be used by researchers engaged in p-hacking, but where we predict a directional effect, they are entirely appropriate and give greater power than a two-tailed test. When reporting the result of a t-test, researchers should always report all the statistics: the value of t, the degrees of freedom, the means and SDs, and the confidence interval around the mean difference, as well as the p-value. This not only helps readers understand the magnitude and reliability of the effect of interest: it also allows for the study to readily be incorporated in a meta-analysis. Results from a t-test are shown in the following Table. [Note that with a one-tailed test, the confidence interval on one side will extend to infinity: this is because a one-tailed test assumes that the true result is greater than a specified mean value, and disregards results that go in the opposite direction].
myt1 <- t.test(dsdat$posttest~dsdat$group,var.equal=T,alternative='greater')
mytabt <- c(round(myt1$statistic,3),round(myt1$parameter,0), round(myt1$p.value,3),round(myt1$estimate[1]-myt1$estimate[2],3),round(myt1$conf.int,3))
mytabt <- as.matrix(mytabt,ncol=6)
mytabt <- as.data.frame(t(mytabt))
colnames(mytabt)<-c('t','df','p','mean diff.','lowerCI','upperCI')
flextable(mytabt)
t | df | p | mean diff. | lowerCI | upperCI |
---|---|---|---|---|---|
2.602 | 52 | 0.006 | 5.94 | 2.117 | Inf |
Interpreting this table, we can conclude that the data offer evidence for rejecting the null hypothesis, i.e. in the direction of a substantial effect of the intervention. The mean difference here can be interpreted as an estimate of the ATE (Average Treatment Effect) in causal language.
The t-test on outcomes is easy to do, but it misses an opportunity to control for one unwanted source of variation, namely individual differences in the initial level of the language measure. For this reason, researchers often prefer to take difference scores (the difference between outcome and baseline measures) and apply a t-test to these. While this has some advantages over reliance on raw outcome measures, it also has disadvantages, because the amount of change that is possible from baseline to outcome is not the same for everyone. A child with a very low score at baseline has more “room for improvement” than one who has an average score. For this reason, analysis of difference scores is not generally recommended.
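For reference, the difference-score analysis just described would look like this (a sketch; as noted, it is not the recommended approach):
# t-test on gain scores (posttest minus pretest)
dsdat$gain <- dsdat$posttest - dsdat$pretest
t.test(gain ~ group, data = dsdat, var.equal = TRUE)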
Rather than taking difference scores, it is preferable to analyse
differences in outcome measures after making a statistical adjustment
that takes into account the initial baseline scores, using a method
known as analysis of covariance or ANCOVA. In practice, this method
usually gives results that are very similar to those you would obtain
from an analysis of difference scores, but the precision, and hence the
statistical power is greater.
However, the data do need to meet certain assumptions of the method. We
can for example start with a plot to check if there is a linear
relationship between pretest vs posttest scores in both groups -
i.e. the points cluster around a straight line.
# plot to check linearity by eye (ggscatter and stat_regline_equation are from ggpubr)
library(ggpubr)
ggscatter(
dsdat, x = "pretest", y = "posttest",
color = "group", add = "reg.line"
)+
stat_regline_equation(
aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~~"), color = group)
)
Inspection of the plot confirms that the relationship between pretest and posttest looks reasonably linear in both groups. Note that it also shows slightly more children with very low pretest scores in the control group. This is just a chance finding, the kind of thing that can easily happen when relatively small numbers of children are randomly assigned to groups. Randomization does not always guarantee “perfect” balance, especially with small sample sizes; what matters is balance on average. On average, pretest scores are not significantly different between the two groups, as is expected from randomization:
mytcheck <- t.test(dsdat$pretest~dsdat$group)
mytcheck
##
## Welch Two Sample t-test
##
## data: dsdat$pretest by dsdat$group
## t = 0.94182, df = 49.914, p-value = 0.3508
## alternative hypothesis: true difference in means between group Intervention and group Control is not equal to 0
## 95 percent confidence interval:
## -2.539301 7.022817
## sample estimates:
## mean in group Intervention mean in group Control
## 15.35714 13.11538
Now, let’s estimate the ATE (Average Treatment Effect) with only group as a factor, and then add the pretest level. We use different methods (ANOVA and linear regression) that are substantially equivalent:
library(dplyr)    # provides the %>% pipe
library(rstatix)  # provides anova_test() and get_anova_table()
aov.1 <- dsdat %>% anova_test(posttest ~ group)
aov.1.full <- aov(posttest ~ group, data=dsdat)
aov.1.full$coefficients # the same as the simple linear regression model
## (Intercept) groupControl
## 22.28571 -5.93956
res.aov <- dsdat %>% anova_test(posttest ~ pretest +group)
resaov.1.full <- aov(posttest ~ pretest +group, data=dsdat)
resaov.1.full$coefficients # the same as the bivariable linear regression model
## (Intercept) pretest groupControl
## 10.3645402 0.7762625 -4.1993676
mylm1 <- lm(posttest~group,data=dsdat) # is equivalent to run aov.1
mylm <- lm(posttest~pretest+group,data=dsdat) #is equivalent to run res.aov
To better visualize results:
tab1 <- as.data.frame(get_anova_table(aov.1))
tab2 <- as.data.frame(get_anova_table(res.aov))
tab3 <- as.data.frame(summary(mylm1)$coefficients)
tab4 <- as.data.frame(summary(mylm)$coefficients)
source <-c('Intercept','Pretest','Group')
tab4 <- cbind(source,tab4)
colnames(tab4)[5]<-'p'
colnames(tab3)[4]<-'p'
tab1$p <- round(tab1$p,3)
tab2$p <- round(tab2$p,3)
tab3$p <- round(tab3$p,3)
tab4$p <- round(tab4$p,3)
tab1<-tab1[,-6]
tab2<-tab2[,-6]
ft1<-flextable(tab1)
ft1<-set_caption(ft1,'Analysis of posttest only (ANOVA)')
ft1
Effect | DFn | DFd | F | p | ges |
---|---|---|---|---|---|
group | 1 | 52 | 6.773 | 0.012 | 0.115 |
ft2 <- flextable(tab2)
ft2<-set_caption(ft2,'Analysis adjusted for pretest scores (ANCOVA)')
ft2
Effect | DFn | DFd | F | p | ges |
---|---|---|---|---|---|
pretest | 1 | 51 | 94.313 | 0.000 | 0.649 |
group | 1 | 51 | 9.301 | 0.004 | 0.154 |
# ft3 shows the equivalent results from linear regression
ft3<-flextable(tab3)
ft3<-set_caption(ft3,'Linear regression with group as predictor' )
ft3
Estimate | Std. Error | t value | p |
---|---|---|---|
22.28571 | 1.583656 | 14.072320 | 0.000 |
-5.93956 | 2.282291 | -2.602455 | 0.012 |
ft4<-flextable(tab4)
ft4<-set_caption(ft4,'Linear regression with pretest and group as predictor' )
ft4
source | Estimate | Std. Error | t value | p |
---|---|---|---|---|
Intercept | 10.3645402 | 1.55058301 | 6.684286 | 0.000 |
Pretest | 0.7762625 | 0.07993237 | 9.711491 | 0.000 |
Group | -4.1993676 | 1.37698465 | -3.049684 | 0.004 |
The table shows the same data analysed first by using ANOVA or simple linear regression to compare only the post-test scores (i.e. the ATE), and then using ANCOVA or the bivariable linear regression model to adjust the scores for the baseline (pretest) values (i.e. the CATE: the effect conditional on the pre-test value). The effect size in the ANOVA/ANCOVA approaches is shown as ges, which stands for “generalised eta squared”. Notice that the ATE estimated by the regression coefficient of the simple linear regression model, with only group as an independent variable, is the same as the mean difference reported by the t-test. Of note: if you want to go deeper into ANCOVA, see for example https://www.datanovia.com/en/lessons/ancova-in-r/.
As an example of an observational study, consider the UCBAdmissions dataset, one of the standard R datasets, which refers to the outcome of applications to the six largest departments at the University of California, Berkeley, by gender. The raw original data (4,526 observations on 3 variables) have already been cross-tabulated into a 3-dimensional array: for each of the six departments, admission results are cross-tabulated by sex.
UCBAdmissions
## , , Dept = A
##
## Gender
## Admit Male Female
## Admitted 512 89
## Rejected 313 19
##
## , , Dept = B
##
## Gender
## Admit Male Female
## Admitted 353 17
## Rejected 207 8
##
## , , Dept = C
##
## Gender
## Admit Male Female
## Admitted 120 202
## Rejected 205 391
##
## , , Dept = D
##
## Gender
## Admit Male Female
## Admitted 138 131
## Rejected 279 244
##
## , , Dept = E
##
## Gender
## Admit Male Female
## Admitted 53 94
## Rejected 138 299
##
## , , Dept = F
##
## Gender
## Admit Male Female
## Admitted 22 24
## Rejected 351 317
For convenience in plotting we abbreviate the outcome text:
dimnames(UCBAdmissions)[[1]] <- c("Adm", "Rej")
The best way to get an overview of multi-way structures is by using the ftable function:
ftable(UCBAdmissions)
## Dept A B C D E F
## Admit Gender
## Adm Male 512 353 120 138 53 22
## Female 89 17 202 131 94 24
## Rej Male 313 207 205 279 138 351
## Female 19 8 391 244 299 317
Thus, in Department A, 512 males were admitted while 313 were rejected, and so on.
The question of interest is whether there is any bias against admitting female applicants.
The following coerces the contingency table to a data frame:
ucb <- as.data.frame(UCBAdmissions)
dim(ucb)
## [1] 24 4
head(ucb)
## Admit Gender Dept Freq
## 1 Adm Male A 512
## 2 Rej Male A 313
## 3 Adm Female A 89
## 4 Rej Female A 19
## 5 Adm Male B 353
## 6 Rej Male B 207
The relationship between the contingency table and the data frame is that each entry in the contingency table becomes a record in the data frame, with the variable Freq holding the count and the classifications as variables.
The function effx calculates the effect of an exposure on a response, possibly stratified by a stratifying variable and/or controlled for one or more confounding variables. The function is a wrapper for glm.
Effects are calculated as differences in means for a numerical response, odds ratios/relative risks for a binary response, and rate ratios/rate differences for a failure or count response. The k-1 effects for a categorical exposure with k levels are relative to a baseline which, by default, is the first level. The effect of a quantitative numerical exposure is calculated per unit of exposure. The exposure variable can be numeric or a factor, but if it is an ordered factor the order will be ignored.
library(Epi)
effx(response=Admit=="Rej", type="binary", exposure=Gender, weight=Freq, data=ucb)
## ---------------------------------------------------------------------------
## response : Admit == "Rej"
## type : binary
## exposure : Gender
##
## Gender is a factor with levels: Male / Female
## baseline is Male
## effects are measured as odds ratios
## ---------------------------------------------------------------------------
##
## effect of Gender on Admit == "Rej"
## number of observations 24
##
## Effect 2.5% 97.5%
## 1.84 1.62 2.09
##
## Test for no effects of exposure on 1 df: p-value= <2e-16
Thus, at first glance, it seems that women are 80% more likely to be rejected; but if we take department (Dept) into account by adding control=Dept, this no longer seems to be so:
effx(response=Admit=="Rej", type="binary", exposure=Gender, control=Dept,weight=Freq, data=ucb)
## ---------------------------------------------------------------------------
## response : Admit == "Rej"
## type : binary
## exposure : Gender
## control vars : Dept
##
## Gender is a factor with levels: Male / Female
## baseline is Male
## effects are measured as odds ratios
## ---------------------------------------------------------------------------
##
## effect of Gender on Admit == "Rej"
## controlled for Dept
##
## number of observations 24
##
## Effect 2.5% 97.5%
## 0.905 0.772 1.060
##
## Test for no effects of exposure on 1 df: p-value= 0.216
In fact now it looks like women are 10% less likely to be rejected - though not significantly so (see the confidence interval).
What effx does here is to fit a statistical model (logistic regression) for the rejection odds that allows each department its own odds, but assumes that the rejection odds ratio between women and men within each department is the same (i.e. no interaction between Gender and Department). It is this common odds ratio that is reported. What we see here is a classical example of confounding: department is associated with both sex and admission probability, so ignoring department in the analysis gives a distorted picture of the effect of sex per se on admission rates.
We can recover the original data format as follows:
ucb_disagg = ucb[rep(1:nrow(ucb), ucb$Freq),
-grep("Freq", names(ucb))]
ucb_disagg$outcome = ifelse(ucb_disagg$Admit=="Rej", 1,0)
dim(ucb_disagg)
## [1] 4526 4
head(ucb_disagg)
## Admit Gender Dept outcome
## 1 Adm Male A 0
## 1.1 Adm Male A 0
## 1.2 Adm Male A 0
## 1.3 Adm Male A 0
## 1.4 Adm Male A 0
## 1.5 Adm Male A 0
With this format, we can easily apply the glm function:
glm.one <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg)
summary(glm.one)
##
## Call:
## glm(formula = outcome ~ Gender, family = binomial(link = "logit"),
## data = ucb_disagg)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22013 0.03879 5.675 1.38e-08 ***
## GenderFemale 0.61035 0.06389 9.553 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6044.3 on 4525 degrees of freedom
## Residual deviance: 5950.9 on 4524 degrees of freedom
## AIC: 5954.9
##
## Number of Fisher Scoring iterations: 4
As you can see, the exponential of the Gender coefficient (exp(0.610) ≈ 1.84) corresponds to the odds ratio previously estimated. Now, if we also add Department to the model:
glm.two <- glm(outcome ~ Gender+Dept, binomial(link = "logit"), data=ucb_disagg)
summary(glm.two)
##
## Call:
## glm(formula = outcome ~ Gender + Dept, family = binomial(link = "logit"),
## data = ucb_disagg)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.58205 0.06899 -8.436 <2e-16 ***
## GenderFemale -0.09987 0.08085 -1.235 0.217
## DeptB 0.04340 0.10984 0.395 0.693
## DeptC 1.26260 0.10663 11.841 <2e-16 ***
## DeptD 1.29461 0.10582 12.234 <2e-16 ***
## DeptE 1.73931 0.12611 13.792 <2e-16 ***
## DeptF 3.30648 0.16998 19.452 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6044.3 on 4525 degrees of freedom
## Residual deviance: 5187.5 on 4519 degrees of freedom
## AIC: 5201.5
##
## Number of Fisher Scoring iterations: 5
Again, the exponential of the Gender coefficient is 0.90, as before. Note that since Department is a factor with 6 levels, one level is fixed as the reference and 5 coefficients are estimated for the other levels versus the reference.
To recover confidence intervals around the odds ratio we use the standard error of the coefficient’s estimate:
se.gender <- summary(glm.two)$coefficients["GenderFemale", "Std. Error"]  # avoid hard-coding 0.08085
lower.b = round(exp(glm.two$coefficients[2]-1.96*se.gender),3)
upper.b = round(exp(glm.two$coefficients[2]+1.96*se.gender),3)
c(round(exp(glm.two$coefficients[2]),3), lower.b, upper.b)
## GenderFemale GenderFemale GenderFemale
## 0.905 0.772 1.060
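A more compact equivalent is to use R’s Wald confidence intervals directly, which should agree with the hand computation up to rounding:
# Wald 95% CI on the odds-ratio scale
round(exp(confint.default(glm.two)["GenderFemale", ]), 3)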
If we further look at rejection fractions for women versus men in the different departments, replacing control=Dept with strata=Dept, it seems as if the only department with a substantial sex difference is A, where women are less likely to be rejected:
effx(response=Admit=="Rej", type="binary", exposure=Gender, strata=Dept,weight=Freq, data=ucb)
## ---------------------------------------------------------------------------
## response : Admit == "Rej"
## type : binary
## exposure : Gender
## stratified by : Dept
##
## Gender is a factor with levels: Male / Female
## baseline is Male
## Dept is a factor with levels: A/B/C/D/E/F
## effects are measured as odds ratios
## ---------------------------------------------------------------------------
##
## effect of Gender on Admit == "Rej"
## stratified by Dept
##
## number of observations 24
##
## Effect 2.5% 97.5%
## strata A level Female vs Male 0.349 0.209 0.584
## strata B level Female vs Male 0.803 0.340 1.890
## strata C level Female vs Male 1.130 0.855 1.500
## strata D level Female vs Male 0.921 0.686 1.240
## strata E level Female vs Male 1.220 0.825 1.810
## strata F level Female vs Male 0.828 0.455 1.510
##
## Test for effect modification on 5 df: p-value= 0.00114
Using glm, this corresponds to fitting a separate Gender-only model within each department:
glm.s1 <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg, subset=Dept=="A")
glm.s2 <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg, subset=Dept=="B")
glm.s3 <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg, subset=Dept=="C")
glm.s4 <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg, subset=Dept=="D")
glm.s5 <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg, subset=Dept=="E")
glm.s6 <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg, subset=Dept=="F")
round(exp(glm.s1$coefficients[2]),3)
## GenderFemale
## 0.349
round(exp(glm.s2$coefficients[2]),3)
## GenderFemale
## 0.803
round(exp(glm.s3$coefficients[2]),3)
## GenderFemale
## 1.133
round(exp(glm.s4$coefficients[2]),3)
## GenderFemale
## 0.921
round(exp(glm.s5$coefficients[2]),3)
## GenderFemale
## 1.222
round(exp(glm.s6$coefficients[2]),3)
## GenderFemale
## 0.828
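The six nearly identical calls above can be condensed into a single loop (a sketch that should reproduce the stratum-specific odds ratios):
# fit the same Gender-only model within each department and extract the OR
sapply(levels(ucb_disagg$Dept), function(d) {
  fit <- glm(outcome ~ Gender, binomial(link = "logit"),
             data = subset(ucb_disagg, Dept == d))
  round(exp(coef(fit)["GenderFemale"]), 3)
})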
The final analysis shows that the women/men odds ratio of rejection differs between departments, although the only department with a significantly different rejection rate between the sexes is A. The stratified analysis also clarifies the source of the crude association: women applied in greater proportion to the departments with the highest rejection rates.
We can visually appreciate the distribution of rejection rates across Departments by gender using the mosaicplot:
mosaicplot(UCBAdmissions, sort=3:1, col=TRUE, main="", ylab="", xlab="", off=c(0,1,10))
The function draws rectangles whose areas are proportional to the entries of the UCBAdmissions table.
Finally, we can make an ad hoc analysis of admissions, excluding Department A to get an indication of whether the other Departments differ with respect to admission odds ratio between men and women:
effx(response=Admit=="Rej", type="binary", exposure=Gender, strata=Dept,weight=Freq,
data=transform(subset(ucb, Dept!="A"), Dept=factor(Dept)))
## ---------------------------------------------------------------------------
## response : Admit == "Rej"
## type : binary
## exposure : Gender
## stratified by : Dept
##
## Gender is a factor with levels: Male / Female
## baseline is Male
## Dept is a factor with levels: B/C/D/E/F
## effects are measured as odds ratios
## ---------------------------------------------------------------------------
##
## effect of Gender on Admit == "Rej"
## stratified by Dept
##
## number of observations 20
##
## Effect 2.5% 97.5%
## strata B level Female vs Male 0.803 0.340 1.89
## strata C level Female vs Male 1.130 0.855 1.50
## strata D level Female vs Male 0.921 0.686 1.24
## strata E level Female vs Male 1.220 0.825 1.81
## strata F level Female vs Male 0.828 0.455 1.51
##
## Test for effect modification on 4 df: p-value= 0.635
Note that it is not enough simply to restrict the data to departments other than A, because the variable Dept would still be a factor with A among its levels. Therefore we must redefine Dept as a factor with only the levels actually occurring in the subset. We see no indication of differential rejection odds ratios across departments B-F. However, this p-value is biased, since it is based on data from which we deliberately excluded a department on the grounds that it was an outlier in a specific direction.
To conclude: when we estimate this kind of association measure from observational data, much attention must be devoted to disentangling possible confounders and to correctly accounting for effect modifiers.
When a confounder is in fact also an effect modifier, the stratified analysis is the most appropriate approach, even though by dividing the study population into strata we lose sample size (and therefore statistical power) within each stratum.
In the regression model approach, we can instead explicitly add an interaction term to the equation:
glm.int <- glm(outcome ~ Gender+Dept+Gender*Dept, binomial(link = "logit"), data=ucb_disagg)
summary(glm.int)
##
## Call:
## glm(formula = outcome ~ Gender + Dept + Gender * Dept, family = binomial(link = "logit"),
## data = ucb_disagg)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.49212 0.07175 -6.859 6.94e-12 ***
## GenderFemale -1.05208 0.26271 -4.005 6.21e-05 ***
## DeptB -0.04163 0.11319 -0.368 0.71304
## DeptC 1.02764 0.13550 7.584 3.34e-14 ***
## DeptD 1.19608 0.12641 9.462 < 2e-16 ***
## DeptE 1.44908 0.17681 8.196 2.49e-16 ***
## DeptF 3.26187 0.23119 14.109 < 2e-16 ***
## GenderFemale:DeptB 0.83205 0.51039 1.630 0.10306
## GenderFemale:DeptC 1.17700 0.29956 3.929 8.53e-05 ***
## GenderFemale:DeptD 0.97009 0.30262 3.206 0.00135 **
## GenderFemale:DeptE 1.25226 0.33032 3.791 0.00015 ***
## GenderFemale:DeptF 0.86318 0.40266 2.144 0.03206 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6044.3 on 4525 degrees of freedom
## Residual deviance: 5167.3 on 4514 degrees of freedom
## AIC: 5191.3
##
## Number of Fisher Scoring iterations: 5
In this case (using all data), we see that a significant interaction is present between Gender and Department. The reference is Dept A, so the interaction coefficients indicate that in Departments C, D, E and F the women/men rejection odds ratio is significantly higher than in Dept A, while for Dept B the interaction term is not significant. The negative main effect for females, exp(-1.05), therefore only reflects the “protective” effect for women in Dept A.
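A global likelihood-ratio test of the interaction (a sketch) compares the models with and without the interaction terms; the deviance difference (5187.5 - 5167.3 = 20.2 on 5 df) reproduces the effect-modification test reported by effx above (p = 0.00114):
anova(glm.two, glm.int, test = "Chisq")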
If we exclude Dept A from the dataset (as we did before):
glm.intNoA <- glm(outcome ~ Gender+Dept+Gender*Dept, binomial(link = "logit"), data=ucb_disagg, subset=Dept!="A")
summary(glm.intNoA)
##
## Call:
## glm(formula = outcome ~ Gender + Dept + Gender * Dept, family = binomial(link = "logit"),
## data = ucb_disagg, subset = Dept != "A")
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.53375 0.08754 -6.097 1.08e-09 ***
## GenderFemale -0.22002 0.43759 -0.503 0.615
## DeptC 1.06927 0.14448 7.401 1.35e-13 ***
## DeptD 1.23771 0.13599 9.101 < 2e-16 ***
## DeptE 1.49071 0.18379 8.111 5.02e-16 ***
## DeptF 3.30349 0.23657 13.964 < 2e-16 ***
## GenderFemale:DeptC 0.34494 0.46066 0.749 0.454
## GenderFemale:DeptD 0.13804 0.46266 0.298 0.765
## GenderFemale:DeptE 0.42021 0.48123 0.873 0.383
## GenderFemale:DeptF 0.03113 0.53349 0.058 0.953
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4511.1 on 3592 degrees of freedom
## Residual deviance: 3971.6 on 3583 degrees of freedom
## AIC: 3991.6
##
## Number of Fisher Scoring iterations: 5
Now, as expected, we observe no significant effect of gender at all. The only significant effects on the rejection odds are the department contrasts: Departments C, D, E and F all have higher rejection odds than the reference, Dept B.
Suppose that an estimate is desired of the average price of tablets of a tranquilizer. A random sample of pharmacies is selected. The estimate is required to be within 10 cents of the true average price with 95% confidence. Based on a small pilot study, the standard deviation in price can be estimated as 85 cents. How many pharmacies should be randomly selected ?
We can do this simple calculation without using any particular R library:
z_alpha <- 1.96
sigma <- 0.85
prec <- 0.10
n <- (z_alpha^2*sigma^2)/prec^2
sample_size <- ceiling(n)
sample_size
## [1] 278
As a result, a sample of 278 pharmacies should be taken.
It is important to note that, to calculate n, an anticipated value of the population standard deviation is required. In the absence of such knowledge, a rough guide is given by the largest minus the smallest anticipated value of the measurement of concern, divided by 4.
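For instance (hypothetical numbers, purely to illustrate the rule), if tablet prices were anticipated to range between 0.50 and 3.90 dollars, the rough guide would give the value of sigma used above:
(3.90 - 0.50) / 4  # rough guess for the standard deviation: 0.85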
If instead we want to determine the sample size necessary to be 95% confident of estimating the average price of the tablets within 5% of the true value, and we know, based on pilot survey data, that the true price should be about 1 dollar:
z_alpha <- 1.96
sigma <- 0.85
eps <- 0.05
hyp_mu <- 1
n <- (z_alpha^2*sigma^2)/(eps^2*hyp_mu^2)
sample_size <- ceiling(n)
sample_size
## [1] 1111
Hence, 1111 pharmacies should be sampled in order to be 95% confident that the resulting estimate will fall between 0.95 and 1.05 dollars, if the true average price is 1 dollar.
The formula below provides the sample size needed to estimate a population proportion at the 0.95 confidence level, with margin of error E and planned proportion estimate p: n = z^2 * p * (1 - p) / E^2.
Example: Using a 0.50 planned proportion estimate, find the sample size needed to achieve 0.05 margin of error for the estimate at 0.95 confidence level.
zstar <- qnorm(.975)
p = 0.5
E = 0.05
n <- zstar^2*p*(1-p) / E^2
minsamp <- ceiling(n)
minsamp
## [1] 385
Another similar example: a district medical officer seeks to estimate the proportion of children in the district receiving appropriate childhood vaccinations. Assuming a simple random sample of a community is to be selected, how many children must be studied if the resulting estimate is to fall within 5% of the true proportion with 95% confidence? Note that here the precision is relative (eps = 0.05 as a fraction of p), not absolute as in the previous example. In the following, we assume a value of 0.5 for the unknown proportion:
zstar <- qnorm(.975)
p = 0.5
eps = 0.05
n <- zstar^2*((1-p)/(p* (eps)^2))
minsamp <- ceiling(n)
minsamp
## [1] 1537
A sample of 1537 children would be needed.
It should be noted that this approach is valid if simple random sampling is used, which is rarely the case in an actual observational study. With a different sampling scheme, a design effect should be considered. This is a more advanced topic, not covered in this course.
The diet data frame has 337 rows and 14 columns. The data concern a subsample of subjects drawn from larger cohort studies of the incidence of coronary heart disease (CHD). These subjects had all completed a 7-day weighed dietary survey while taking part in validation studies of dietary questionnaire methods. Upon the closure of the MRC Social Medicine Unit, from where these studies were directed, it was found that 46 CHD events had occurred in this group, thus allowing a study of the relationship between diet and the incidence of CHD. We now load the R library required and the dataset.
library(Epi)
data(diet)
First of all, we want to estimate the overall incidence rate of CHD: we must compute the follow-up time in years for each subject in the study.
attach(diet)
y <- cal.yr(dox)-cal.yr(doe)
Then we compute the total follow up of the study and the number of incident cases:
Y <- sum(y)
D <- sum(chd)
Finally, assuming a constant rate, we estimate the incidence rate as follows:
rate <- D/Y
rate
## [1] 0.009992031
And we can use the log(rate) to derive the 95% confidence interval: since the standard error of the log rate is 1/sqrt(D), the 95% error factor is exp(1.96/sqrt(D)):
erf <- exp(1.96/sqrt(D))
c(rate, rate/erf, rate*erf)
## [1] 0.009992031 0.007484256 0.013340094
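As an alternative (a sketch using base R, with D and Y as computed above), poisson.test() gives an exact confidence interval for the same rate, which should be close to the log-based interval:
poisson.test(D, T = as.numeric(Y))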
Of note: as we have seen, the likelihood for a constant rate based on the number of events D and the risk time Y is proportional to a Poisson likelihood for the observation D with mean rate*Y. Hence we can also estimate the rate using a Poisson regression model; in the model we need the log of the follow-up time for each person as the offset variable:
m1 <- glm(chd~1,offset=log(y), family=poisson, data=diet)
ci.exp(m1)
## exp(Est.) 2.5% 97.5%
## (Intercept) 0.009992031 0.007484355 0.01333992
The function glm with family=poisson fits a Poisson regression model (see Block 3 for other examples), here with one parameter, the intercept (denoted ‘1’), while including the log person-time as a covariate with a fixed coefficient of 1; this is what is called an offset.
Another example, using the rate on the original scale: suppose 15 events are observed during 5532 person-years in a given study cohort. Let’s now estimate the underlying incidence rate λ (per 1000 person-years, hence Y = 5.532) and get an approximate confidence interval:
D <- 15
Y <- 5.532
rate <- D / Y
SE.rate <- rate/sqrt(D)
c(rate, SE.rate, rate + c(-1.96, 1.96)*SE.rate )
## [1] 2.7114967 0.7001054 1.3392901 4.0837034
Now, if we want to estimate the sample size required to estimate an incidence rate with a pre-specified precision, let’s follow this example: based on data from previously conducted studies, we expect the rate to be about 50 per 10,000 person-years. We want to determine the size of the sample required to estimate the incidence rate in that population within ±5 per 10,000 person-years.
We are here imposing that the margin of precision should be E=1.96*S.e.(rate), so we can derive the standard error of the rate as:
se.rate <- (5/1.96)
then, we can derive the number of cases needed by:
number.cases <- (50/se.rate)**2
number.cases
## [1] 384.16
and finally, we need to derive the person-time to observe that number of cases as:
person.years <- number.cases/50
person.years*10000
## [1] 76832
Therefore, we could follow 76832 subjects for one year (or 38416 for two years, etc.) in order to observe about 384 events and obtain a 95% confidence interval of the required precision.
We can use the epiR library to do sample size calculations:
library(epiR)
Suppose we wish to test, at the 5% level of significance, the hypothesis that cholesterol means in a population are equal in two study years against the one-sided alternative that the mean is higher in the second of the two years.
Suppose that equal-sized samples will be taken in each year, but that these will not be from the same individuals (i.e. the two samples are drawn independently).
Our test is to have a power of 0.95 at detecting a difference of 0.5 mmol/L. The standard deviation of serum cholesterol is assumed to be 1.4 mmol/L.
Here we are in a very standard situation of random sampling with two independent samples: the worthwhile difference is 0.5 and, since we don’t know the actual means in the groups, we can substitute any values for ‘mu1’ and ‘mu2’ in the input of the function, so long as their difference is 0.5:
epi.sscompc(treat = 5, control = 4.5, n = NA, sigma = 1.4, power = 0.95,
r = 1, design = 1, sided.test = 1, conf.level = 0.95)
## $n.total
## [1] 340
##
## $n.treat
## [1] 170
##
## $n.control
## [1] 170
##
## $power
## [1] 0.95
##
## $delta
## [1] 0.5
To satisfy the study requirements 340 individuals need to be tested: 170 in the first year and 170 in the second year.
This is equivalent to asking for the power of the test, given n in each group, using the power.t.test function:
power.t.test(n = 170, delta = 0.5, sd = 1.4,sig.level = 0.05,
type = "two.sample",alternative = "one.sided")
##
## Two-sample t test power calculation
##
## n = 170
## delta = 0.5
## sd = 1.4
## sig.level = 0.05
## power = 0.9496262
## alternative = one.sided
##
## NOTE: n is number in *each* group
Now, instead of using random sampling of two independent groups, we measure the same subjects twice, for example 5 years apart. For the sake of simplicity, we assume the same standard deviation as before, even though in this case the standard deviation of the distribution of within-pair differences should be used:
power.t.test(power=0.95, delta = 0.5, sd = 1.4,sig.level = 0.05,
type = "paired",alternative = "one.sided")
##
## Paired t test power calculation
##
## n = 86.21841
## delta = 0.5
## sd = 1.4
## sig.level = 0.05
## power = 0.95
## alternative = one.sided
##
## NOTE: n is number of *pairs*, sd is std.dev. of *differences* within pairs
We now need 87 pairs of measurements (the same subject measured twice).
Note that the sample size requirement is well below that of 170 (per sample) of the unpaired case.
Provided that the correlation between the before and after measurements is high and positive (as is likely in practice), pairing will give a considerable saving in sample numbers.
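To make the saving explicit (a sketch with an assumed correlation, not a value estimated from these data), recall that the standard deviation of the within-pair differences equals sqrt(2*(1 - rho)) times the common standard deviation:
rho <- 0.7  # hypothetical before/after correlation
sd.diff <- sqrt(2 * (1 - rho)) * 1.4
power.t.test(power = 0.95, delta = 0.5, sd = sd.diff, sig.level = 0.05,
             type = "paired", alternative = "one.sided")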
However, there may be costs elsewhere. It may be difficult to keep track of the 87 individuals over the 5-year period. Furthermore, bias may also be present because the individuals concerned may be more health conscious simply by virtue of being studied. This may lead them to (say) consume less saturated fat than other members of the population. Sometimes, moreover, no data can be found from which to calculate the standard deviation of the differences.
Let’s consider a randomized clinical trial (RCT) to evaluate comparative efficacy of two HIV treatments:
Load the pwr library: note that here we are considering the difference between the proportions.
library(pwr)
Call the function with the required parameters. The effect size is calculated using a transformation of the difference between the two proportions, here assumed to be 0.8 and 0.6 in the two arms (see, if interested in the details of the formulas used: Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.)):
p1=0.8
p2=0.6
h.calc = 2*asin(sqrt(p1))-2*asin(sqrt(p2))
pwr.2p.test(h =h.calc , n =NULL, sig.level =0.05, power =0.80)
##
## Difference of proportion power calculation for binomial distribution (arcsine transformation)
##
## h = 0.4421432
## n = 80.29912
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: same sample sizes
81 patients per arm should be randomized.
Note that if you use the one-sided test the required sample size is lower (this is a general finding):
pwr.2p.test(h =h.calc , n =NULL, sig.level =0.05, power =0.80, alternative="greater")
##
## Difference of proportion power calculation for binomial distribution (arcsine transformation)
##
## h = 0.4421432
## n = 63.25171
## sig.level = 0.05
## power = 0.8
## alternative = greater
##
## NOTE: same sample sizes
Of note, again: samples are sometimes drawn using stratification and/or clustering, in which case a complex sampling design has been used. This has a fundamental effect on sample size requirements, compared with the simple random sampling (SRS). A full discussion is well beyond the scope of this course, but note that in general we should expect sensible stratification to decrease the required sample size (lower variance), but clustering often to increase it, to maintain the same power.
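As a rough sketch of how a design effect enters a calculation (the deff value here is purely hypothetical), the SRS sample size is simply inflated:
n.srs <- 81   # per-arm requirement from the SRS calculation above
deff <- 1.5   # hypothetical design effect for a clustered design
ceiling(n.srs * deff)   # inflated per-arm requirement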
For paired binary data (for example, pre-post treatment outcomes) the statistical test to be used is McNemar’s test.
The McNemar sample size calculation is not included in the library, so here is the R code:
pwr.mcnemar <- function(p10, p01, alpha, n, power) {
  pdisc <- p10 + p01   # probability of a discordant pair
  pdiff <- p10 - p01   # difference between the two discordant probabilities
  if (missing(power) && !missing(n)) {
    # compute power for a given number of pairs n
    x1 <- (pdiff * sqrt(n) - qnorm(1 - alpha / 2) * sqrt(pdisc)) /
      sqrt(pdisc - pdiff ^ 2)
    x2 <- (-pdiff * sqrt(n) - qnorm(1 - alpha / 2) * sqrt(pdisc)) /
      sqrt(pdisc - pdiff ^ 2)
    power <- pnorm(x1) + pnorm(x2)
  } else if (missing(n) && !missing(power)) {
    # compute the number of pairs n for a given power
    n <- ((qnorm(1 - alpha / 2) * sqrt(pdisc) + qnorm(power) *
             sqrt(pdisc - pdiff ^ 2)) / pdiff) ^ 2
  } else {
    stop("Must supply one of `n` or `power`, but not both.")
  }
  c("n" = n, "power" = power)
}
Computes sample size or power for McNemar’s test for paired categorical data.
p10 : Probability of pre-test success and post-test failure.
p01 : Probability of pre-test failure and post-test success.
alpha : Specified significance level
n : Sample size. Cannot be left blank if power is missing.
power: Statistical power. Cannot be left blank if n is missing.
H0: Both groups have the same success probability. H1: The success probability is not equal between the Groups.
 | Post success | Post failure |
---|---|---|
Pre success | p11 | p10 |
Pre failure | p01 | p00 |
Reference for details: Connor R. J. 1987. Sample size for testing differences in proportions for the paired-sample design. Biometrics 43(1):207-211. page 209.
pwr.mcnemar(p10=0.20, p01=0.30,alpha=0.05,power=0.90)
## n power
## 521.2043 0.9000
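Hence about 522 pairs would be required. As a quick check (a sketch), we can plug the rounded sample size back into the same function; the power should come out just above 0.90:
pwr.mcnemar(p10 = 0.20, p01 = 0.30, alpha = 0.05, n = 522)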
We want to see if there’s an association between gender and flossing teeth among college students. We randomly sample 100 students (males and females) and ask whether or not they floss daily. We want to carry out a chi-square test of association to determine if there’s an association between these two variables. We set our significance level to 0.01. To determine effect size we need to propose an alternative hypothesis, which in this case is a table of proportions. We propose the following:
prob <- matrix(c(0.1,0.2,0.4,0.3), ncol=2,
dimnames = list(c("M","F"),c("Floss","No Floss")))
prob
## Floss No Floss
## M 0.1 0.4
## F 0.2 0.3
This says we sample equal proportions of males and females, but believe that flossing is 10 percentage points more common among females.
We now use the ES.w2 function to calculate effect size for chi-square tests of association:
ES.w2(prob)
## [1] 0.2182179
Of note: the degrees of freedom of the test are the product of (number of levels - 1) for each variable: df = (2 - 1) * (2 - 1) = 1.
And now let’s calculate the power:
pwr.chisq.test(w = ES.w2(prob), N = 100, df = 1, sig.level = 0.01)
##
## Chi squared power calculation
##
## w = 0.2182179
## N = 100
## df = 1
## sig.level = 0.01
## power = 0.3469206
##
## NOTE: N is the number of observations
At only about 35%, this is not a very powerful experiment.
We can ask: how many students should I survey if I wish to achieve 90% power?
pwr.chisq.test(w = ES.w2(prob), power = 0.9, df = 1, sig.level = 0.01)
##
## Chi squared power calculation
##
## w = 0.2182179
## N = 312.4671
## df = 1
## sig.level = 0.01
## power = 0.9
##
## NOTE: N is the number of observations
About 313 students.
If you don’t suspect association in either direction, or you don’t feel like building a matrix with a specific hypothesis, you can try to use a conventional effect size.
For example, how many students should we sample to detect a small effect? [This range of values for the effect size was proposed by Cohen]:
cohen.ES(test = "chisq", size = "small")
##
## Conventional effect size from Cohen (1982)
##
## test = chisq
## size = small
## effect.size = 0.1
Therefore:
pwr.chisq.test(w = 0.1, power = 0.9, df = 1, sig.level = 0.01)
##
## Chi squared power calculation
##
## w = 0.1
## N = 1487.939
## df = 1
## sig.level = 0.01
## power = 0.9
##
## NOTE: N is the number of observations
1488 students need to be enrolled in our study. Perhaps more than we thought we might need!
We could consider reframing the question as a simpler two-sample proportion test. What sample size do we need to detect a small effect in gender on the proportion of students who floss with 90% power and a significance level of 0.01?
pwr.2p.test(h = 0.2, sig.level = 0.01, power = 0.9)
##
## Difference of proportion power calculation for binomial distribution (arcsine transformation)
##
## h = 0.2
## n = 743.9694
## sig.level = 0.01
## power = 0.9
## alternative = two.sided
##
## NOTE: same sample sizes
About 744 per group.
A graduate student is investigating the effectiveness of a fitness program. She wants to see if there is a correlation between the weight of a participant at the beginning of the program and the participant’s weight change after 6 months. She suspects there is a small positive linear relationship between these two quantities (say 0.1). She will measure this relationship with the correlation coefficient, r, and conduct a correlation test to determine if the estimated correlation is statistically greater than 0. How many subjects does she need to sample to detect this small positive (i.e., r > 0) relationship with 80% power and 0.01 significance level?
Here there is nothing tricky about the effect size argument, r. It is simply the hypothesized correlation. It can take values ranging from -1 to 1.
cohen.ES(test = "r", size = "small")
##
## Conventional effect size from Cohen (1982)
##
## test = r
## size = small
## effect.size = 0.1
pwr.r.test(r = 0.1, sig.level = 0.01, power = 0.8, alternative = "greater")
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 999.2054
## r = 0.1
## sig.level = 0.01
## power = 0.8
## alternative = greater
She needs to observe about 1000 subjects. Note that the default is a two-sided test; we specified alternative = “greater” since we believe there is a small positive effect.
If she just wants to detect a small effect in either direction (positive or negative correlation), we use the default two-sided test, which we can do by removing the alternative argument from the function:
pwr.r.test(r = 0.1, sig.level = 0.01, power = 0.8)
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 1162.564
## r = 0.1
## sig.level = 0.01
## power = 0.8
## alternative = two.sided
Now she needs to observe 1163 subjects: as always, two-sided tests are more “expensive” in terms of sample size.
A case-control study of the relationship between smoking and CHD (coronary heart disease) is planned. A sample of men with newly diagnosed CHD will be compared for smoking status with a sample of controls. Assuming an equal number of cases and controls, how many study subjects are required to detect an odds ratio of 2.0 with 0.90 power using a two-sided 0.05 test? Previous surveys have shown that around 0.30 of males without CHD are smokers.
We use the epi.sscc function, which requires the following arguments:
*OR = scalar, the expected study odds ratio.
*p0 = scalar, the prevalence of exposure among the controls.
*n = scalar, the total number of subjects in the study (i.e. the number of cases plus the number of controls).
*power = scalar, the required study power.
*r = scalar, the number in the control group divided by the number in the case group.
*phi.coef = scalar, the correlation between case and control exposures for matched pairs. Ignored when method = “unmatched”.
*design = scalar, the design effect. The design effect is used to take into account the possible presence of clustering in the random sampling of the data. It is a measure of the variability between clusters and is generally calculated as the ratio of the variance calculated assuming a complex sample design to the variance calculated assuming simple random sampling.
*sided.test = Use a two-sided test if you wish to evaluate whether or not the odds of exposure in cases is greater than or less than the odds of exposure in controls. Use a one-sided test to evaluate whether or not the odds of exposure in cases is greater than the odds of exposure in controls.
*conf.level = scalar, the level of confidence in the computed result.
*method = a character string defining the method to be used. Options are unmatched or matched.
*fleiss = logical, indicating whether or not the Fleiss correction should be applied (a continuity correction factor is used when you use a continuous probability distribution to approximate a discrete probability distribution). This argument is ignored when method = “matched”.
library(epiR)
epi.sscc(OR = 2.0, p0 = 0.30, n = NA, power = 0.90, r = 1, phi.coef = 0,
design = 1, sided.test = 2, conf.level = 0.95, method = "unmatched")
## $n.total
## [1] 376
##
## $n.case
## [1] 188
##
## $n.control
## [1] 188
##
## $power
## [1] 0.9
##
## $OR
## [1] 2
A total of 376 men need to be sampled: 188 cases and 188 controls.
Suppose we wish to determine the power to detect an odds ratio of 2.0 using a two-sided 0.05 test when 188 cases and 940 controls are available (that is, the ratio of controls to cases is 5:1). Assume the prevalence of smoking in males without CHD is 0.30. Here we use the Fleiss correction because we are dealing with an inverse problem (i.e. estimating power given the sample size). See for technical details: Fleiss JL et al., (2003). Statistical Methods for Rates and Proportions. Wiley, New York, 3rd edition.
n <- 188 + 940
epi.sscc(OR = 2.0, p0 = 0.30, n = n, power = NA, r = 5, phi.coef = 0,
design = 1, sided.test = 2, conf.level = 0.95, method = "unmatched",
fleiss = TRUE)
## $n.total
## [1] 1128
##
## $n.case
## [1] 188
##
## $n.control
## [1] 940
##
## $power
## [1] 0.9880212
##
## $OR
## [1] 2
The power of this study, with the given sample size allocation is 0.99.
We wish to conduct a case-control study to assess whether bladder cancer may be associated with past exposure to cigarette smoking. Cases will be patients with bladder cancer and controls will be patients hospitalized for injury. It is assumed that 20% of controls will be smokers or past smokers, and we wish to detect an odds ratio of 2 with power 90%. Three controls will be recruited for every case.
How many subjects need to be enrolled in the study?
epi.sscc(OR = 2.0, p0 = 0.20, n = NA, power = 0.90, r = 3, phi.coef = 0,
design = 1, sided.test = 2, conf.level = 0.95, method = "unmatched")
## $n.total
## [1] 619
##
## $n.case
## [1] 155
##
## $n.control
## [1] 464
##
## $power
## [1] 0.9
##
## $OR
## [1] 2
A total of 619 subjects need to be enrolled in the study: 155 cases and 464 controls.
For cohort studies we use the epi.sscohortc function, whose arguments are analogous to those of epi.sscc described above.
A cohort (exposure-based) study of smoking and coronary heart disease (CHD) in middle aged men is planned. A sample of men will be selected at random from the population and those that agree to participate will be asked to complete a questionnaire. The follow-up period will be 5 years. The investigators would like to be 0.90 confident of being able to detect when the relative risk of CHD is 1.4 for smokers, using a one-sided 0.05 significance test. Previous evidence suggests that the incidence rate of death in non-smokers is 413 per 100000 person-years. Assuming equal numbers of smokers and non-smokers are sampled, how many men should be sampled overall?
irexp1 = 1.4 * (5 * 413)/100000
irexp0 = (5 * 413)/100000
epi.sscohortc(irexp1 = irexp1, irexp0 = irexp0, n = NA, power = 0.90, r = 1, design = 1, sided.test = 1, conf.level = 0.95)
## $n.total
## [1] 12130
##
## $n.exp1
## [1] 6065
##
## $n.exp0
## [1] 6065
##
## $power
## [1] 0.9
##
## $irr
## [1] 1.4
##
## $or
## [1] 1.411908
Over a 5-year period the estimated chance of death in the non exposed (irexp0) is 0.02065. Thus, rounding up to the next highest even number, 12130 men should be sampled (6065 smokers and 6065 nonsmokers).
Since in this design we extract a simple random sample from the overall population, we may know in advance the expected prevalence of the exposure, and we can therefore be more flexible, allowing imbalance between exposed and unexposed.
The imbalance is set through the parameter r.
If the proportion of men smoking is one-third, then in a random sample we would expect to find n1 = n/3. Hence n2 = 2n/3 and thus r = n1/n2 = 0.5. Using this new value for r:
epi.sscohortc(irexp1 = irexp1, irexp0 = irexp0, n = NA, power = 0.90, r = 0.50, design = 1, sided.test = 1, conf.level = 0.95)
## $n.total
## [1] 13543.5
##
## $n.exp1
## [1] 4514.5
##
## $n.exp0
## [1] 9029
##
## $power
## [1] 0.9
##
## $irr
## [1] 1.4
##
## $or
## [1] 1.411908
Rounding up gives a total sample size requirement of 13544. We would expect that, after random sampling, about a third of these would be smokers. Notice that 13544 is not exactly divisible by 3, but any further rounding up would not be justifiable because we cannot guarantee the exact proportion of smokers.
Say, for example, we are only able to enroll 5000 subjects into the study described above. What are the minimum and maximum detectable relative risks?
irexp0 = (5 * 413)/100000
epi.sscohortc(irexp1 = NA, irexp0 = irexp0, n = 5000, power = 0.90,
r = 1, design = 1, sided.test = 1, conf.level = 0.95)
## $n.total
## [1] 5000
##
## $n.exp1
## [1] 2500
##
## $n.exp0
## [1] 2500
##
## $power
## [1] 0.9
##
## $irr
## [1] 0.5047104 1.6537812
##
## $or
## numeric(0)
The minimum detectable relative risk > 1 is 1.65. The maximum detectable relative risk < 1 is 0.50.
A study is to be carried out to assess the effect of a new treatment for the reproductive period in dairy cattle. What is the required sample size if we expect the proportion of cows responding in the treatment (exposed) group to be 0.30 and the proportion of cows responding in the control (unexposed) group to be 0.15? The required power for this study is 0.80 using a two-sided 0.05 test.
epi.sscohortc(irexp1 = 0.30, irexp0 = 0.15, n = NA, power = 0.80,
r = 1, design = 1, sided.test = 2, conf.level = 0.95)
## $n.total
## [1] 242
##
## $n.exp1
## [1] 121
##
## $n.exp0
## [1] 121
##
## $power
## [1] 0.8
##
## $irr
## [1] 2
##
## $or
## [1] 2.428571
A total of 242 cows are required: 121 in the treatment (exposed) group and 121 in the control (unexposed) group.
Calculate the maximum sample size required to estimate the prevalence of respiratory tract infection, with a precision of 5%, in a target population consisting of children in a particular region of a developing country. N.B: An estimate of the population prevalence is not known. However, we can obtain a range of sample sizes required corresponding to a wide range of values for p, say from 0.1 to 0.9.
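A possible sketch, reusing the proportion formula from above over the suggested range of p (the maximum, n = 385, occurs at p = 0.5):
zstar <- qnorm(.975)
E <- 0.05
p <- seq(0.1, 0.9, by = 0.1)
cbind(p, n = ceiling(zstar^2 * p * (1 - p) / E^2))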
A case-control study is carried out to determine the efficacy of a vaccine for the prevention of childhood tuberculosis. Assume that 50% of the controls are not vaccinated. If the number of cases and controls are equal, what sample size is needed to detect, with 80% power and 5% type I error, an odds ratio of at least 2 in the target population ?
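One possible approach (a sketch mirroring the earlier epi.sscc calls, with p0 = 0.50 for the unvaccinated controls):
epi.sscc(OR = 2.0, p0 = 0.50, n = NA, power = 0.80, r = 1, phi.coef = 0,
         design = 1, sided.test = 2, conf.level = 0.95, method = "unmatched")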
A randomised trial is to be conducted comparing two new treatments aimed at increasing the weights of malnourished children with a control group. The minimal worthwhile benefit is an increase in mean weight of 2.5kg, with respect to the controls, and the standard deviations of weight changes are believed to be 3.5kg. What are the required sample sizes, assuming that the control group is twice as large as each of the two treatment groups and an 80% power is required for each comparison?
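A sketch using the standard two-sample formula for means with unequal allocation (assuming a two-sided 5% test; k is the control-to-treatment allocation ratio, so the variance factor is 1 + 1/k):
z.alpha <- qnorm(0.975)
z.beta <- qnorm(0.80)
delta <- 2.5   # minimal worthwhile difference in mean weight gain (kg)
sigma <- 3.5   # sd of weight changes (kg)
k <- 2         # control group twice as large as each treatment group
n.treat <- (1 + 1/k) * (sigma * (z.alpha + z.beta) / delta)^2
c(treat = ceiling(n.treat), control = ceiling(k * n.treat))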
Initial Data Analysis (IDA) is a crucial first step on the long way to the final result, be it a statistical inference in a scientific paper or a commercial report if, for example, you are working for a pharma company. This long way is often bumpy, highly iterative and time consuming. However, IDA may be the most important part of data analysis, because it also helps to generate new or alternative hypotheses, which then determine the final result. Thus, we now present some simple and effective ways to explore data in R. Moreover, we will also show how to compare groups with simple statistical tests (WARNING: there is no a priori defined effect size here, as there is in randomized clinical trials; we use these tools for descriptive purposes only). Note that the IDA phase depends heavily on the analyst’s experience and knowledge about the “nature” of the dataset; in the following I just put together a series of unrelated examples in order to give you an overview of the basic tools.
Let’s first of all load some useful libraries:
library(DataExplorer) # IDA
library(tidyverse) # for everything good in R ;)
library(SmartEDA) # IDA
library(dlookr) # IDA
library(funModeling) # IDA
library(ISLR) # for the Wage dataset
library(ggstatsplot) # publication ready visualizations with statistical details
library(flextable) # beautifying tables
library(summarytools) # IDA
library(psych) # psychological research: descr. stats, FA, PCA etc.
library(skimr) # summary stats
library(gtsummary) # publication ready summary tables
library(moments) # skewness, kurtosis and related tests
library(ggpubr) # publication ready data visualization in R
library(PerformanceAnalytics) # econometrics for performance and risk analysis
library(performance) # Assessment of Regression Models Performance (just for outliers here)
library(PMCMRplus) # Calculate Pairwise Multiple Comparisons
library(see) # Model Visualization Toolbox
library(ggcorrplot) # correlation plots
library(ggside) # add metadata to ggplots
and then set your working directory:
library(here) # provides the here() function
setwd(here())
We now consider the built-in dataset airquality, which reports daily air quality measurements in New York, from May to September 1973. The data were obtained from the New York State Department of Conservation (ozone data) and the National Weather Service (meteorological data). This is obviously an observational study.
introduce(airquality) %>% t()
## [,1]
## rows 153
## columns 6
## discrete_columns 0
## continuous_columns 6
## all_missing_columns 0
## total_missing_values 44
## complete_rows 111
## total_observations 918
## memory_usage 6376
names(airquality)
## [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
summary(airquality)
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
This is a data frame with 153 observations on 6 variables: “Ozone”, “Solar.R”, “Wind”, “Temp”, “Month” and “Day”. We can also plot the basic information (from introduce) for the data.
plot_intro(airquality)
Another useful function is:
ft <- status(airquality) %>% flextable()
ft <- colformat_double(
x = ft,
digits = 2)
ft
variable | q_zeros | p_zeros | q_na | p_na | q_inf | p_inf | type | unique |
---|---|---|---|---|---|---|---|---|
Ozone | 0 | 0.00 | 37 | 0.24 | 0 | 0.00 | integer | 67 |
Solar.R | 0 | 0.00 | 7 | 0.05 | 0 | 0.00 | integer | 117 |
Wind | 0 | 0.00 | 0 | 0.00 | 0 | 0.00 | numeric | 31 |
Temp | 0 | 0.00 | 0 | 0.00 | 0 | 0.00 | integer | 40 |
Month | 0 | 0.00 | 0 | 0.00 | 0 | 0.00 | integer | 5 |
Day | 0 | 0.00 | 0 | 0.00 | 0 | 0.00 | integer | 31 |
For each variable it returns the quantity and percentage of zeros (q_zeros and p_zeros respectively), the same metrics for NA values (q_na/p_na) and for infinite values, if any (q_inf/p_inf). The last two columns indicate the data type and the number of unique values.
Simple bar plots with the frequency distribution of all categorical variables in a dataset are already quite useful, because they provide a quick overview of the meaningfulness of the categorization, and of whether there are typing mistakes in the data. The {DataExplorer} package provides a simple plot_bar() function which does just that. However, plotting a target discrete variable by another discrete variable is even more useful: it is a sort of visual contingency table (see the second plot below). For this, use the by = argument and give it the second categorical variable.
Let’s use for this example the diamonds dataset, which contains the prices and other attributes of almost 54,000 diamonds (see the summary below for its variables).
detach("package:yarrr", unload = TRUE)
# it contains another dataset called 'diamonds', but we want the ggplot2 one
introduce(diamonds) %>% t()
## [,1]
## rows 53940
## columns 10
## discrete_columns 3
## continuous_columns 7
## all_missing_columns 0
## total_missing_values 0
## complete_rows 53940
## total_observations 539400
## memory_usage 3457760
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
plot_bar(diamonds)
This function produces the bar plots of the 3 discrete variables in our dataset.
plot_bar(diamonds, by = "cut")
As explained above, this call produces stratified bar plots, which help us visualize a possible relationship of the type of cut with color and clarity. The ExpCatViz function from the {SmartEDA} package also plots each categorical variable with a bar plot, but displays proportions instead of counts.
ExpCatViz(diamonds, Page=c(1,3))
## $`0`
Now we load the Wage dataset, which reports wage and other data for a group of 3000 male workers in the Mid-Atlantic region.
introduce(Wage) %>% t()
## [,1]
## rows 3000
## columns 11
## discrete_columns 7
## continuous_columns 4
## all_missing_columns 0
## total_missing_values 0
## complete_rows 3000
## total_observations 33000
## memory_usage 164120
summary(Wage)
## year age maritl race
## Min. :2003 Min. :18.00 1. Never Married: 648 1. White:2480
## 1st Qu.:2004 1st Qu.:33.75 2. Married :2074 2. Black: 293
## Median :2006 Median :42.00 3. Widowed : 19 3. Asian: 190
## Mean :2006 Mean :42.41 4. Divorced : 204 4. Other: 37
## 3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
## Max. :2009 Max. :80.00
##
## education region jobclass
## 1. < HS Grad :268 2. Middle Atlantic :3000 1. Industrial :1544
## 2. HS Grad :971 1. New England : 0 2. Information:1456
## 3. Some College :650 3. East North Central: 0
## 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## health health_ins logwage wage
## 1. <=Good : 858 1. Yes:2083 Min. :3.000 Min. : 20.09
## 2. >=Very Good:2142 2. No : 917 1st Qu.:4.447 1st Qu.: 85.38
## Median :4.653 Median :104.92
## Mean :4.654 Mean :111.70
## 3rd Qu.:4.857 3rd Qu.:128.68
## Max. :5.763 Max. :318.34
##
There is a specific research question to be checked here, i.e. whether education level is associated with the job class: the more educated we get, the more likely we are to end up working with information (e.g. with data…) and the less likely we are to end up working in a factory. Let’s first visualize this possible relationship:
ExpCatViz(
Wage %>%
select(education, jobclass),
target="education")
## [[1]]
The plot indeed suggests that education level is associated with the job class. However, taking into account that we are working with a sample of data, without a proper statistical test and a p-value this hypothesis cannot be assessed and remains … well … a speculation. Fortunately, the ggbarstats() function from the {ggstatsplot} package does all of the above in one line of code and even goes one step further. Namely:
ggbarstats(
data = Wage,
x = jobclass,
y = education,
label = "both")
How do you interpret the various statistical tests reported?
[Note that I’m here interested just in the frequentist tests approaches, i.e. the global Pearson’s chi-squared test and the proportion test for x variable (jobclass) for each level of y (education)].
Descriptive statistics (location, dispersion and shape parameters) are usually needed to describe numerical variables, either overall or separately for groups defined by some categorical variable, like control & treatment. Three functions from the {dlookr} package, namely describe(), univar_numeric() and diagnose_numeric(), are able to do this. Be careful with the describe() function though, because it also exists in the {Hmisc} and {psych} packages. Thus, in order to avoid confusion, simply write dlookr:: in front of the describe() function, which then provides the most common descriptive stats: counts, number of missing values, mean, standard deviation, standard error of the mean, IQR, skewness, kurtosis and 17 quantiles.
Now we explore the dataset iris that is the famous (Fisher’s) iris data set with the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
introduce(iris) %>% t()
## [,1]
## rows 150
## columns 5
## discrete_columns 1
## continuous_columns 4
## all_missing_columns 0
## total_missing_values 0
## complete_rows 150
## total_observations 750
## memory_usage 7976
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
ft1 <- dlookr::describe(iris) %>% flextable()
ft1 <- colformat_double(
x = ft1,
digits = 2)
ft1
described_variables | n | na | mean | sd | se_mean | IQR | skewness | kurtosis | p00 | p01 | p05 | p10 | p20 | p25 | p30 | p40 | p50 | p60 | p70 | p75 | p80 | p90 | p95 | p99 | p100 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sepal.Length | 150 | 0 | 5.84 | 0.83 | 0.07 | 1.30 | 0.31 | -0.55 | 4.30 | 4.40 | 4.60 | 4.80 | 5.00 | 5.10 | 5.27 | 5.60 | 5.80 | 6.10 | 6.30 | 6.40 | 6.52 | 6.90 | 7.25 | 7.70 | 7.90 |
Sepal.Width | 150 | 0 | 3.06 | 0.44 | 0.04 | 0.50 | 0.32 | 0.23 | 2.00 | 2.20 | 2.34 | 2.50 | 2.70 | 2.80 | 2.80 | 3.00 | 3.00 | 3.10 | 3.20 | 3.30 | 3.40 | 3.61 | 3.80 | 4.15 | 4.40 |
Petal.Length | 150 | 0 | 3.76 | 1.77 | 0.14 | 3.50 | -0.27 | -1.40 | 1.00 | 1.15 | 1.30 | 1.40 | 1.50 | 1.60 | 1.70 | 3.90 | 4.35 | 4.64 | 5.00 | 5.10 | 5.32 | 5.80 | 6.10 | 6.70 | 6.90 |
Petal.Width | 150 | 0 | 1.20 | 0.76 | 0.06 | 1.50 | -0.10 | -1.34 | 0.10 | 0.10 | 0.20 | 0.20 | 0.20 | 0.30 | 0.40 | 1.16 | 1.30 | 1.50 | 1.80 | 1.80 | 1.90 | 2.20 | 2.30 | 2.50 | 2.50 |
#save_as_docx(ft1, path="ft1.docx")
hist(iris$Sepal.Length)
Here we can also see how useful the collaboration of {dlookr} with the {tidyverse} packages can be, for example {dplyr} and its group_by() function, which makes it possible to calculate descriptive statistics per group. And if you don’t need such a monstrous table, but only want the median instead of 17 quantiles, use the univar_numeric() function.
ft2 <- iris %>%
group_by(Species) %>%
univar_numeric()
## Adding missing grouping variables: `Species`
knitr::kable(ft2, digits=2)
The diagnose_numeric() function reports the usual 5-number-summary (which is actually a box-plot in a table form) and the number of zeros, negative values and outliers:
ft3 <- iris %>%
diagnose_numeric() %>%
flextable()
ft3 <- colformat_double(
x = ft3,
digits = 2)
ft3
variables | min | Q1 | mean | median | Q3 | max | zero | minus | outlier |
---|---|---|---|---|---|---|---|---|---|
Sepal.Length | 4.30 | 5.10 | 5.84 | 5.80 | 6.40 | 7.90 | 0 | 0 | 0 |
Sepal.Width | 2.00 | 2.80 | 3.06 | 3.00 | 3.30 | 4.40 | 0 | 0 | 4 |
Petal.Length | 1.00 | 1.60 | 3.76 | 4.35 | 5.10 | 6.90 | 0 | 0 | 0 |
Petal.Width | 0.10 | 0.30 | 1.20 | 1.30 | 1.80 | 2.50 | 0 | 0 | 0 |
{SmartEDA}, with its ExpNumStat() function, also provides a very comprehensive descriptive statistics table. Moreover, we can choose to describe all variables, grouped variables, or both at the same time. If we call the argument by = with a capital ‘A’, we get statistics for every numeric variable in the dataset. A capital ‘G’ delivers descriptive stats per group, but we then need to specify the grouping variable in the gp = argument. Using ‘GA’ gives both. We can also specify the quantiles we need and identify the lower hinge, upper hinge and number of outliers, if we want to.
ft4 <- ExpNumStat(iris, by="A", Outlier=TRUE, Qnt = c(.25, .75), round = 2) %>% flextable()
ft4 <- colformat_double(
x = ft4,
digits = 2)
ft4
Vname | Group | TN | nNeg | nZero | nPos | NegInf | PosInf | NA_Value | Per_of_Missing | sum | min | max | mean | median | SD | CV | IQR | Skewness | Kurtosis | 25% | 75% | LB.25% | UB.75% | nOutliers |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Petal.Length | All | 150.00 | 0.00 | 0.00 | 150.00 | 0.00 | 0.00 | 0.00 | 0.00 | 563.70 | 1.00 | 6.90 | 3.76 | 4.35 | 1.77 | 0.47 | 3.50 | -0.27 | -1.40 | 1.60 | 5.10 | -3.65 | 10.35 | 0.00 |
Petal.Width | All | 150.00 | 0.00 | 0.00 | 150.00 | 0.00 | 0.00 | 0.00 | 0.00 | 179.90 | 0.10 | 2.50 | 1.20 | 1.30 | 0.76 | 0.64 | 1.50 | -0.10 | -1.34 | 0.30 | 1.80 | -1.95 | 4.05 | 0.00 |
Sepal.Length | All | 150.00 | 0.00 | 0.00 | 150.00 | 0.00 | 0.00 | 0.00 | 0.00 | 876.50 | 4.30 | 7.90 | 5.84 | 5.80 | 0.83 | 0.14 | 1.30 | 0.31 | -0.57 | 5.10 | 6.40 | 3.15 | 8.35 | 0.00 |
Sepal.Width | All | 150.00 | 0.00 | 0.00 | 150.00 | 0.00 | 0.00 | 0.00 | 0.00 | 458.60 | 2.00 | 4.40 | 3.06 | 3.00 | 0.44 | 0.14 | 0.50 | 0.32 | 0.18 | 2.80 | 3.30 | 2.05 | 4.05 | 4.00 |
ft5 <- ExpNumStat(iris, by="G", gp="Species", Outlier=TRUE, Qnt = c(.25, .75), round = 2) %>% flextable()
ft5 <- colformat_double(
x = ft5,
digits = 2)
ft5
Vname | Group | TN | nNeg | nZero | nPos | NegInf | PosInf | NA_Value | Per_of_Missing | sum | min | max | mean | median | SD | CV | IQR | Skewness | Kurtosis | 25% | 75% | LB.25% | UB.75% | nOutliers |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Petal.Length | Species:setosa | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 73.10 | 1.00 | 1.90 | 1.46 | 1.50 | 0.17 | 0.12 | 0.18 | 0.10 | 0.80 | 1.40 | 1.58 | 1.14 | 1.84 | 4.00 |
Petal.Length | Species:versicolor | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 213.00 | 3.00 | 5.10 | 4.26 | 4.35 | 0.47 | 0.11 | 0.60 | -0.59 | -0.07 | 4.00 | 4.60 | 3.10 | 5.50 | 1.00 |
Petal.Length | Species:virginica | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 277.60 | 4.50 | 6.90 | 5.55 | 5.55 | 0.55 | 0.10 | 0.78 | 0.53 | -0.26 | 5.10 | 5.88 | 3.94 | 7.04 | 0.00 |
Petal.Width | Species:setosa | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 12.30 | 0.10 | 0.60 | 0.25 | 0.20 | 0.11 | 0.43 | 0.10 | 1.22 | 1.43 | 0.20 | 0.30 | 0.05 | 0.45 | 2.00 |
Petal.Width | Species:versicolor | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 66.30 | 1.00 | 1.80 | 1.33 | 1.30 | 0.20 | 0.15 | 0.30 | -0.03 | -0.49 | 1.20 | 1.50 | 0.75 | 1.95 | 0.00 |
Petal.Width | Species:virginica | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 101.30 | 1.40 | 2.50 | 2.03 | 2.00 | 0.27 | 0.14 | 0.50 | -0.13 | -0.66 | 1.80 | 2.30 | 1.05 | 3.05 | 0.00 |
Sepal.Length | Species:setosa | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 250.30 | 4.30 | 5.80 | 5.01 | 5.00 | 0.35 | 0.07 | 0.40 | 0.12 | -0.35 | 4.80 | 5.20 | 4.20 | 5.80 | 0.00 |
Sepal.Length | Species:versicolor | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 296.80 | 4.90 | 7.00 | 5.94 | 5.90 | 0.52 | 0.09 | 0.70 | 0.10 | -0.60 | 5.60 | 6.30 | 4.55 | 7.35 | 0.00 |
Sepal.Length | Species:virginica | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 329.40 | 4.90 | 7.90 | 6.59 | 6.50 | 0.64 | 0.10 | 0.67 | 0.11 | -0.09 | 6.23 | 6.90 | 5.21 | 7.91 | 1.00 |
Sepal.Width | Species:setosa | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 171.40 | 2.30 | 4.40 | 3.43 | 3.40 | 0.38 | 0.11 | 0.48 | 0.04 | 0.74 | 3.20 | 3.68 | 2.49 | 4.39 | 2.00 |
Sepal.Width | Species:versicolor | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 138.50 | 2.00 | 3.40 | 2.77 | 2.80 | 0.31 | 0.11 | 0.48 | -0.35 | -0.45 | 2.52 | 3.00 | 1.81 | 3.71 | 0.00 |
Sepal.Width | Species:virginica | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 148.70 | 2.20 | 3.80 | 2.97 | 3.00 | 0.32 | 0.11 | 0.38 | 0.35 | 0.52 | 2.80 | 3.18 | 2.24 | 3.74 | 3.00 |
ft6 <- ExpNumStat(iris, by="GA", gp="Species", Outlier=TRUE, Qnt = c(.25, .75), round = 2) %>% flextable()
ft6 <- colformat_double(
x = ft6,
digits = 2)
ft6
Vname | Group | TN | nNeg | nZero | nPos | NegInf | PosInf | NA_Value | Per_of_Missing | sum | min | max | mean | median | SD | CV | IQR | Skewness | Kurtosis | 25% | 75% | LB.25% | UB.75% | nOutliers |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Petal.Length | Species:All | 150.00 | 0.00 | 0.00 | 150.00 | 0.00 | 0.00 | 0.00 | 0.00 | 563.70 | 1.00 | 6.90 | 3.76 | 4.35 | 1.77 | 0.47 | 3.50 | -0.27 | -1.40 | 1.60 | 5.10 | -3.65 | 10.35 | 0.00 |
Petal.Length | Species:setosa | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 73.10 | 1.00 | 1.90 | 1.46 | 1.50 | 0.17 | 0.12 | 0.18 | 0.10 | 0.80 | 1.40 | 1.58 | 1.14 | 1.84 | 4.00 |
Petal.Length | Species:versicolor | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 213.00 | 3.00 | 5.10 | 4.26 | 4.35 | 0.47 | 0.11 | 0.60 | -0.59 | -0.07 | 4.00 | 4.60 | 3.10 | 5.50 | 1.00 |
Petal.Length | Species:virginica | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 277.60 | 4.50 | 6.90 | 5.55 | 5.55 | 0.55 | 0.10 | 0.78 | 0.53 | -0.26 | 5.10 | 5.88 | 3.94 | 7.04 | 0.00 |
Petal.Width | Species:All | 150.00 | 0.00 | 0.00 | 150.00 | 0.00 | 0.00 | 0.00 | 0.00 | 179.90 | 0.10 | 2.50 | 1.20 | 1.30 | 0.76 | 0.64 | 1.50 | -0.10 | -1.34 | 0.30 | 1.80 | -1.95 | 4.05 | 0.00 |
Petal.Width | Species:setosa | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 12.30 | 0.10 | 0.60 | 0.25 | 0.20 | 0.11 | 0.43 | 0.10 | 1.22 | 1.43 | 0.20 | 0.30 | 0.05 | 0.45 | 2.00 |
Petal.Width | Species:versicolor | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 66.30 | 1.00 | 1.80 | 1.33 | 1.30 | 0.20 | 0.15 | 0.30 | -0.03 | -0.49 | 1.20 | 1.50 | 0.75 | 1.95 | 0.00 |
Petal.Width | Species:virginica | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 101.30 | 1.40 | 2.50 | 2.03 | 2.00 | 0.27 | 0.14 | 0.50 | -0.13 | -0.66 | 1.80 | 2.30 | 1.05 | 3.05 | 0.00 |
Sepal.Length | Species:All | 150.00 | 0.00 | 0.00 | 150.00 | 0.00 | 0.00 | 0.00 | 0.00 | 876.50 | 4.30 | 7.90 | 5.84 | 5.80 | 0.83 | 0.14 | 1.30 | 0.31 | -0.57 | 5.10 | 6.40 | 3.15 | 8.35 | 0.00 |
Sepal.Length | Species:setosa | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 250.30 | 4.30 | 5.80 | 5.01 | 5.00 | 0.35 | 0.07 | 0.40 | 0.12 | -0.35 | 4.80 | 5.20 | 4.20 | 5.80 | 0.00 |
Sepal.Length | Species:versicolor | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 296.80 | 4.90 | 7.00 | 5.94 | 5.90 | 0.52 | 0.09 | 0.70 | 0.10 | -0.60 | 5.60 | 6.30 | 4.55 | 7.35 | 0.00 |
Sepal.Length | Species:virginica | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 329.40 | 4.90 | 7.90 | 6.59 | 6.50 | 0.64 | 0.10 | 0.67 | 0.11 | -0.09 | 6.23 | 6.90 | 5.21 | 7.91 | 1.00 |
Sepal.Width | Species:All | 150.00 | 0.00 | 0.00 | 150.00 | 0.00 | 0.00 | 0.00 | 0.00 | 458.60 | 2.00 | 4.40 | 3.06 | 3.00 | 0.44 | 0.14 | 0.50 | 0.32 | 0.18 | 2.80 | 3.30 | 2.05 | 4.05 | 4.00 |
Sepal.Width | Species:setosa | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 171.40 | 2.30 | 4.40 | 3.43 | 3.40 | 0.38 | 0.11 | 0.48 | 0.04 | 0.74 | 3.20 | 3.68 | 2.49 | 4.39 | 2.00 |
Sepal.Width | Species:versicolor | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 138.50 | 2.00 | 3.40 | 2.77 | 2.80 | 0.31 | 0.11 | 0.48 | -0.35 | -0.45 | 2.52 | 3.00 | 1.81 | 3.71 | 0.00 |
Sepal.Width | Species:virginica | 50.00 | 0.00 | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 148.70 | 2.20 | 3.80 | 2.97 | 3.00 | 0.32 | 0.11 | 0.38 | 0.35 | 0.52 | 2.80 | 3.18 | 2.24 | 3.74 | 3.00 |
{summarytools} and {psych} packages also provide useful tables with descriptive stats:
iris %>%
group_by(Species) %>%
descr()
## Descriptive Statistics
## iris
## Group: Species = setosa
## N: 50
##
## Petal.Length Petal.Width Sepal.Length Sepal.Width
## ----------------- -------------- ------------- -------------- -------------
## Mean 1.46 0.25 5.01 3.43
## Std.Dev 0.17 0.11 0.35 0.38
## Min 1.00 0.10 4.30 2.30
## Q1 1.40 0.20 4.80 3.20
## Median 1.50 0.20 5.00 3.40
## Q3 1.60 0.30 5.20 3.70
## Max 1.90 0.60 5.80 4.40
## MAD 0.15 0.00 0.30 0.37
## IQR 0.18 0.10 0.40 0.48
## CV 0.12 0.43 0.07 0.11
## Skewness 0.10 1.18 0.11 0.04
## SE.Skewness 0.34 0.34 0.34 0.34
## Kurtosis 0.65 1.26 -0.45 0.60
## N.Valid 50.00 50.00 50.00 50.00
## Pct.Valid 100.00 100.00 100.00 100.00
##
## Group: Species = versicolor
## N: 50
##
## Petal.Length Petal.Width Sepal.Length Sepal.Width
## ----------------- -------------- ------------- -------------- -------------
## Mean 4.26 1.33 5.94 2.77
## Std.Dev 0.47 0.20 0.52 0.31
## Min 3.00 1.00 4.90 2.00
## Q1 4.00 1.20 5.60 2.50
## Median 4.35 1.30 5.90 2.80
## Q3 4.60 1.50 6.30 3.00
## Max 5.10 1.80 7.00 3.40
## MAD 0.52 0.22 0.52 0.30
## IQR 0.60 0.30 0.70 0.48
## CV 0.11 0.15 0.09 0.11
## Skewness -0.57 -0.03 0.10 -0.34
## SE.Skewness 0.34 0.34 0.34 0.34
## Kurtosis -0.19 -0.59 -0.69 -0.55
## N.Valid 50.00 50.00 50.00 50.00
## Pct.Valid 100.00 100.00 100.00 100.00
##
## Group: Species = virginica
## N: 50
##
## Petal.Length Petal.Width Sepal.Length Sepal.Width
## ----------------- -------------- ------------- -------------- -------------
## Mean 5.55 2.03 6.59 2.97
## Std.Dev 0.55 0.27 0.64 0.32
## Min 4.50 1.40 4.90 2.20
## Q1 5.10 1.80 6.20 2.80
## Median 5.55 2.00 6.50 3.00
## Q3 5.90 2.30 6.90 3.20
## Max 6.90 2.50 7.90 3.80
## MAD 0.67 0.30 0.59 0.30
## IQR 0.78 0.50 0.67 0.38
## CV 0.10 0.14 0.10 0.11
## Skewness 0.52 -0.12 0.11 0.34
## SE.Skewness 0.34 0.34 0.34 0.34
## Kurtosis -0.37 -0.75 -0.20 0.38
## N.Valid 50.00 50.00 50.00 50.00
## Pct.Valid 100.00 100.00 100.00 100.00
describeBy(iris, iris$Species)
##
## Descriptive statistics by group
## group: setosa
## vars n mean sd median trimmed mad min max range skew kurtosis
## Sepal.Length 1 50 5.01 0.35 5.0 5.00 0.30 4.3 5.8 1.5 0.11 -0.45
## Sepal.Width 2 50 3.43 0.38 3.4 3.42 0.37 2.3 4.4 2.1 0.04 0.60
## Petal.Length 3 50 1.46 0.17 1.5 1.46 0.15 1.0 1.9 0.9 0.10 0.65
## Petal.Width 4 50 0.25 0.11 0.2 0.24 0.00 0.1 0.6 0.5 1.18 1.26
## Species* 5 50 1.00 0.00 1.0 1.00 0.00 1.0 1.0 0.0 NaN NaN
## se
## Sepal.Length 0.05
## Sepal.Width 0.05
## Petal.Length 0.02
## Petal.Width 0.01
## Species* 0.00
## ------------------------------------------------------------
## group: versicolor
## vars n mean sd median trimmed mad min max range skew kurtosis
## Sepal.Length 1 50 5.94 0.52 5.90 5.94 0.52 4.9 7.0 2.1 0.10 -0.69
## Sepal.Width 2 50 2.77 0.31 2.80 2.78 0.30 2.0 3.4 1.4 -0.34 -0.55
## Petal.Length 3 50 4.26 0.47 4.35 4.29 0.52 3.0 5.1 2.1 -0.57 -0.19
## Petal.Width 4 50 1.33 0.20 1.30 1.32 0.22 1.0 1.8 0.8 -0.03 -0.59
## Species* 5 50 2.00 0.00 2.00 2.00 0.00 2.0 2.0 0.0 NaN NaN
## se
## Sepal.Length 0.07
## Sepal.Width 0.04
## Petal.Length 0.07
## Petal.Width 0.03
## Species* 0.00
## ------------------------------------------------------------
## group: virginica
## vars n mean sd median trimmed mad min max range skew kurtosis
## Sepal.Length 1 50 6.59 0.64 6.50 6.57 0.59 4.9 7.9 3.0 0.11 -0.20
## Sepal.Width 2 50 2.97 0.32 3.00 2.96 0.30 2.2 3.8 1.6 0.34 0.38
## Petal.Length 3 50 5.55 0.55 5.55 5.51 0.67 4.5 6.9 2.4 0.52 -0.37
## Petal.Width 4 50 2.03 0.27 2.00 2.03 0.30 1.4 2.5 1.1 -0.12 -0.75
## Species* 5 50 3.00 0.00 3.00 3.00 0.00 3.0 3.0 0.0 NaN NaN
## se
## Sepal.Length 0.09
## Sepal.Width 0.05
## Petal.Length 0.08
## Petal.Width 0.04
## Species* 0.00
Why do we need to explore the distribution of numerical variables?
Well, many statistical tests assume symmetric and (more or less) normally distributed data.
Histograms and density plots give us a first glimpse of the data. For example, the {DataExplorer} package provides very intuitive functions for getting histograms and density plots of all continuous variables at once, namely plot_histogram() and plot_density(). Moreover, they both work seamlessly with the {dplyr} package, which is always a good thing!
plot_histogram(Wage)
plot_density(Wage)
# works perfectly with dplyr!
airquality %>%
  select(Ozone, Wind) %>%
  plot_density()
So, looking at the two variables displayed above, we can see that Wind is distributed roughly symmetrically while Ozone is not. But how can we measure the symmetry of data? And when is data symmetric enough?
Skewness measures the lack of symmetry. Data are symmetric if they look the same to the left and to the right of the central point (usually the mean). The skewness of a perfectly normal distribution is zero, so any symmetric data should have a skewness near zero. Positive values of skewness indicate data that are skewed to the right, which means that most of the data actually sit on the left side of the plot, as in our Ozone plot. Negative values indicate skewness to the left, with most of the data on the right side of the plot. The skewness() function from the {moments} package shows that the skewness of Ozone is indeed positive and far from zero, which suggests that Ozone is not normally distributed.
General guidelines for interpreting the skewness measure are as follows: values between -0.5 and 0.5 suggest an approximately symmetric distribution, absolute values between 0.5 and 1 suggest moderate skewness, and absolute values above 1 suggest high skewness.
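As a quick sanity check, here is a minimal sketch of what skewness() actually computes, assuming the moment-based definition used by {moments} (third central moment scaled by the 3/2 power of the second):
library(moments)
x <- na.omit(airquality$Ozone)
m2 <- mean((x - mean(x))^2)  # second central moment
m3 <- mean((x - mean(x))^3)  # third central moment
m3 / m2^(3 / 2)              # moment-based skewness, about 1.23 for Ozone
skewness(x)                  # the packaged version should agree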
But here again, how far from zero is far enough to say that the data are significantly skewed and therefore not normally distributed?
Well, the D’Agostino skewness test from the {moments} package provides a p-value for that. For instance, the p-value for Ozone is small, so we reject the null hypothesis of zero skewness: the Ozone data are significantly skewed. In contrast, the p-value for Wind is above the usual significance threshold of 0.05, so we can treat the Wind data as not skewed.
Always pay attention to the fact that p-values depend heavily on the sample size: small deviations in large samples can be significant, and large deviations in small samples can be not!
skewness(airquality$Ozone, na.rm = T)
## [1] 1.225681
skewness(airquality$Wind, na.rm = T)
## [1] 0.3443985
agostino.test(airquality$Ozone)
##
## D'Agostino skewness test
##
## data: airquality$Ozone
## skew = 1.2257, z = 4.6564, p-value = 3.219e-06
## alternative hypothesis: data have a skewness
agostino.test(airquality$Wind)
##
## D'Agostino skewness test
##
## data: airquality$Wind
## skew = 0.3444, z = 1.7720, p-value = 0.07639
## alternative hypothesis: data have a skewness
So, there is strong evidence of skewness for Ozone, while for Wind the p-value is above the standard 5% threshold: there is not enough evidence to reject the null hypothesis of zero skewness, so this test gives no evidence against gaussianity for Wind.
Kurtosis is a measure of heavy tails, or of outliers present in the distribution. The kurtosis value of a normal distribution is three. Here again, we need a proper statistical test giving us a p-value that says whether the kurtosis is significantly far from three; the {moments} package provides the Anscombe-Glynn kurtosis test for that. For instance, Ozone has a kurtosis of 4.18, which is significantly far from 3, indicating non-normally distributed data and the probable presence of outliers. In contrast, the kurtosis of Wind is around 3, and the p-value tells us that the Wind distribution is fine.
anscombe.test(airquality$Ozone)
##
## Anscombe-Glynn kurtosis test
##
## data: airquality$Ozone
## kurt = 4.1841, z = 2.2027, p-value = 0.02762
## alternative hypothesis: kurtosis is not equal to 3
anscombe.test(airquality$Wind)
##
## Anscombe-Glynn kurtosis test
##
## data: airquality$Wind
## kurt = 3.0688, z = 0.4478, p-value = 0.6543
## alternative hypothesis: kurtosis is not equal to 3
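The same idea works for kurtosis; a minimal sketch of the moment-based definition (fourth central moment over the squared second), which should agree with the value reported by the test above:
x <- na.omit(airquality$Ozone)
m2 <- mean((x - mean(x))^2)  # second central moment
m4 <- mean((x - mean(x))^4)  # fourth central moment
m4 / m2^2                    # about 4.18 for Ozone, i.e. heavier tails than 3
moments::kurtosis(x)         # the packaged version should agree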
Now, finally, the normality of the distribution itself can be checked. This is useful because it helps us choose the correct statistical test. For instance, if the data are normally distributed, we should use parametric tests, such as the t-test or ANOVA. If, however, the data are not normally distributed, we should use non-parametric tests, such as Mann-Whitney or Kruskal-Wallis. So, the normality check is not just another strange statistical paranoia; it is actually helpful. There are two main ways to check normality: a quantile-quantile plot and a proper statistical test. And… we need them both!
The {DataExplorer} package provides a simple and elegant plot_qq() function, which produces quantile-quantile plots either for all continuous variables in the dataset or even for every group of a categorical variable, if the argument by = is specified. The QQ-plot shows on the x-axis the theoretical quantiles you would expect from a standard normal distribution; the y-axis shows the observed quantiles. If both sets of quantiles came from the same (standardized gaussian) distribution, we should see the points forming a line that is roughly straight.
plot_qq(iris)
plot_qq(iris, by = "Species")
Cool, right? But the plot_normality() function from the {dlookr} package visualizes not only the quantile-quantile plot but also the histogram of the original data and the histograms of two of the most common normalizing transformations, namely the log and square root transformations, in case the normality assumption is not met. This allows us to see whether a transformation actually improves something or not, because it is not always the case. Here we can also use {dplyr} syntax in order to quickly visualize several groups.
iris %>%
  group_by(Species) %>%
  plot_normality(Petal.Length)
However, we still don’t know when our data are normally distributed. As we have said, the QQ-plot can be interpreted in the following way: if the points sit close to the diagonal line, the data are probably normally distributed. But here we go again! How close is close enough? It’s actually very subjective! That is why I like to explore QQ-plots using the {ggpubr} package, which goes one step further and shows confidence intervals, which help to decide whether the deviation from normality is big or not. For example, if all or most of the data fall into these confidence intervals, we can conclude that the data are normally distributed. However, in order to be sure, we’d need to actually run a statistical test, which is in most cases the Shapiro-Wilk normality test (but always remember the sample size issue!).
ggqqplot(iris, "Sepal.Length", facet.by = "Species")
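For comparison, a minimal base-R sketch of the same plot for one group: qqnorm() plots the observed quantiles against the theoretical normal ones, and qqline() adds the reference line.
x <- subset(iris, Species == "setosa")$Sepal.Length
qqnorm(x, main = "setosa: Sepal.Length")  # observed vs. theoretical quantiles
qqline(x)                                 # reference line through the quartiles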
The very intuitive normality() function from the {dlookr} package performs the Shapiro-Wilk normality test on every numeric variable in the dataset. For example, we have seen that the variable Wind in the airquality dataset has nice skewness and kurtosis values, so it is supposed to be normally distributed, while the variable Ozone is supposed to be not normally distributed, right? And indeed, the normality() function confirms exactly that.
normality(airquality) %>%
  mutate_if(is.numeric, ~round(., 3)) %>%
  flextable()
vars | statistic | p_value | sample |
---|---|---|---|
Ozone | 0.879 | 0.000 | 153 |
Solar.R | 0.942 | 0.000 | 153 |
Wind | 0.986 | 0.118 | 153 |
Temp | 0.976 | 0.009 | 153 |
Month | 0.888 | 0.000 | 153 |
Day | 0.953 | 0.000 | 153 |
Moreover, via the collaboration with the {dplyr} package and its group_by() function, we can conduct around 2000 normality tests in seconds and with only a few lines of code:
diamonds %>%
  group_by(cut, color, clarity) %>%
  normality()
## # A tibble: 1,932 × 7
## variable cut color clarity statistic p_value sample
## <chr> <ord> <ord> <ord> <dbl> <dbl> <dbl>
## 1 carat Fair D I1 0.888 0.373 4
## 2 carat Fair D SI2 0.867 0.0000183 56
## 3 carat Fair D SI1 0.849 0.00000379 58
## 4 carat Fair D VS2 0.931 0.0920 25
## 5 carat Fair D VS1 0.973 0.893 5
## 6 carat Fair D VVS2 0.871 0.127 9
## 7 carat Fair D VVS1 0.931 0.493 3
## 8 carat Fair D IF 0.990 0.806 3
## 9 carat Fair E I1 0.941 0.596 9
## 10 carat Fair E SI2 0.867 0.000000807 78
## # ℹ 1,922 more rows
So, why don’t we just run Shapiro-Wilk tests all the time and forget all those skewnesses and visualizations? Well, because, given enough data, as we said before, due to the relationship between test power and sample size, the Shapiro-Wilk test will always find some non-normality even in perfectly symmetric bell-shaped data… Here is an example of a vector with fewer than 300 values, where the Shapiro-Wilk test shows a highly significant deviation from normality, while a density plot shows a bell-shaped data distribution. Moreover, the tests of skewness and kurtosis and the quantile-quantile plot all indicate normally distributed data. Thus, it’s always better to check several options before drawing a final conclusion about the normality of the data.
bla <- Wage %>%
  filter(education == "1. < HS Grad") %>%
  select(age)
normality(bla) %>% flextable()
vars | statistic | p_value | sample |
---|---|---|---|
age | 0.9855188 | 0.008259363 | 268 |
plot_density(bla)
summary(bla)
## age
## Min. :18.00
## 1st Qu.:33.00
## Median :41.50
## Mean :41.79
## 3rd Qu.:50.25
## Max. :75.00
agostino.test(bla$age)
##
## D'Agostino skewness test
##
## data: bla$age
## skew = 0.24361, z = 1.64889, p-value = 0.09917
## alternative hypothesis: data have a skewness
anscombe.test(bla$age)
##
## Anscombe-Glynn kurtosis test
##
## data: bla$age
## kurt = 2.6282, z = -1.3503, p-value = 0.1769
## alternative hypothesis: kurtosis is not equal to 3
ggqqplot(bla$age)
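To see the sample-size effect in isolation, here is a small simulation (purely illustrative, not from the Wage data): the same mild deviation from normality, a t distribution with 10 degrees of freedom, is usually missed by the Shapiro-Wilk test at n = 50 but flagged at n = 4000.
set.seed(123)
shapiro.test(rt(50, df = 10))$p.value    # usually well above 0.05
shapiro.test(rt(4000, df = 10))$p.value  # usually far below 0.05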
When numerical variables are not normally distributed, it is useful to visualize their box-plots. Using the intuitive plot_boxplot() function from the {DataExplorer} package with the argument by =, which specifies a grouping variable, will put all groups of all numeric variables into boxes. Such a visualization, however, immediately raises the next question: do these groups differ significantly? We cannot tell that from just staring at the picture…
plot_boxplot(iris, by = "Species")
or:
ExpNumViz(iris, target = "Species", Page = c(2,2))
## $`0`
If we use the ggbetweenstats() function from the {ggstatsplot} package, we can check such hypotheses using only a few intuitive arguments.
The simple code below not only provides a p-value telling you whether there are significant differences between the groups, but also conducts correct pairwise multiple comparisons to see exactly between which groups these differences lie. ggbetweenstats() even adjusts the p-values for multiple comparisons with the Holm method automatically and produces a bunch of other statistical details on top of the amazing visualization. Violin plots included!
ggbetweenstats(
  data = iris,
  x = Species,
  y = Sepal.Length,
  type = "np")
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
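For reference, roughly the same non-parametric analysis can be sketched in base R (a sketch, not the exact internals of ggbetweenstats()): a global Kruskal-Wallis test followed by Holm-adjusted pairwise comparisons.
kruskal.test(Sepal.Length ~ Species, data = iris)  # global test across the 3 species
pairwise.wilcox.test(iris$Sepal.Length, iris$Species,
                     p.adjust.method = "holm")     # Holm-adjusted pairwise tests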
This topic is really very complex! Usually, when I prepare data for regression modelling, I use another R library called mice, but here, just for IDA purposes, we can highlight the usage of some simpler tools. The first function, plot_na_intersect(), shows which variables have missing values and how many. The visualization consists of four parts. The bottom-left part, which is the most basic, visualizes the cross (intersection) combinations of missingness: the x-axis lists the variables containing missing values, and the y-axis represents the combinations of variables in which missing values occur. On the margins of the two axes, the frequency of each case is shown as a bar graph. Finally, the visualization at the top right reports the number of variables containing missing values in the data set, and the number of observations with missing values versus complete cases.
The second function, imputate_na(), imputes missing values with several methods, including machine-learning ones. For instance, using the K nearest neighbors algorithm we could impute the 37 missing values in the Ozone variable, and even visually check the quality of our imputation, in only one line of code (below, for illustration, the simple median method is shown). With imputate_na() we only need to specify 4 arguments: the dataset, the variable to impute, a target variable that guides the imputation, and the method.
plot_na_intersect(airquality)
index1 <- is.na(airquality$Ozone)
index2 <- is.na(airquality$Solar.R)
table(index1,index2)
## index2
## index1 FALSE TRUE
## FALSE 111 5
## TRUE 35 2
This is the explanation of the above plot: we have 111 complete observations, 2 observations with both values missing, 5 with only Solar.R missing, and 35 with only Ozone missing.
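Since {mice} was mentioned above for serious imputation work, note that its md.pattern() function gives the same breakdown, as a missing-data pattern matrix plus a plot (a sketch, assuming {mice} is installed):
library(mice)
md.pattern(airquality[, c("Ozone", "Solar.R")])  # 111 complete; 35, 5 and 2 incomplete patterns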
plot(imputate_na(airquality, Ozone, Temp, method = "median"))
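The KNN variant mentioned above works the same way; a sketch ("knn" is among the methods accepted by imputate_na()):
plot(imputate_na(airquality, Ozone, Temp, method = "knn"))  # K nearest neighbors imputation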
The check_outliers() function from the {performance} package provides an easy way to identify and visualize outliers with different methods. If you want an aggressive method that cleans out a lot of outliers, go with the zscore method; if you don’t have much data, go with a more robust method, for example the interquartile range. check_outliers() is a general function that checks for and locates influential observations (i.e., “outliers”) via several distance and/or clustering methods. If several methods are selected, the returned “Outlier” vector is a composite outlier score, made of the average of the binary (0 or 1) results of each method; it represents the probability that each observation is classified as an outlier by at least one method. The default decision rule classifies as outliers the observations whose composite score is greater than or equal to 0.5 (i.e., that were classified as outliers by at least half of the methods).

The Z-score, or standard score, describes a data point by its deviation from a central value, in terms of standard deviations from the mean. The default threshold to classify outliers is 1.96 (threshold = list("zscore" = 1.96)), corresponding to the 2.5% gaussian percentile (assuming the data are normally distributed). Importantly, the Z-score method is univariate: it is computed column by column. If a data frame is passed, the Z-score is calculated for each variable separately, and the maximum (absolute) Z-score is kept for each observation. Thus, all observations that are extreme on at least one variable may be detected as outliers, and the method is therefore not suitable for high-dimensional data (with many columns), where it returns too liberal results (detecting many outliers).
plot(check_outliers(airquality$Wind, method = "zscore"))
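By hand, the zscore method boils down to standardizing and flagging; a minimal sketch with the default 1.96 threshold:
z <- as.vector(scale(airquality$Wind))  # center and scale to standard scores
which(abs(z) > 1.96)                    # rows that exceed the default threshold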
The IQR (interquartile range) is a robust method developed by the statistician John Tukey, which often appears in box-and-whisker plots. The interquartile range is the range between the first and the third quartiles. Tukey considered as outliers any data points falling more than 1.5 times the IQR below the first quartile or above the third quartile. Similar to the Z-score method, this is a univariate method for outlier detection, returning the outliers detected on at least one column, and it may thus not be suited to high-dimensional data. The distance score for the IQR is the absolute deviation from the median of the upper and lower IQR thresholds; this value is then divided by the IQR threshold, to “standardize” it and ease interpretation.
check_outliers(airquality$Wind, method = "iqr")
## 2 outliers detected: cases 9, 48.
## - Based on the following method and threshold: iqr (2).
## - For variable: airquality$Wind.
##
## -----------------------------------------------------------------------------
## Outliers per variable (iqr):
##
## $`airquality$Wind`
## Row Distance_IQR
## 9 9 1.527977
## 48 48 1.614060
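The same rule can be reproduced by hand; a sketch using the 2 x IQR threshold reported above (rather than Tukey's classic 1.5):
x <- airquality$Wind
q <- quantile(x, c(0.25, 0.75))                      # first and third quartiles
fences <- c(q[1] - 2 * IQR(x), q[2] + 2 * IQR(x))    # lower and upper fences
which(x < fences[1] | x > fences[2])                 # rows 9 and 48, as above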
The diagnose_outlier() function from {dlookr} not only counts the outliers in every variable using the interquartile range method, but also reports their percentages. Moreover, it calculates three different averages: the mean of every variable with outliers, the mean without outliers, and the mean of the outliers themselves. In this way we can see how strong the influence of the outliers on each variable actually is. For instance, the variable depth in the diamonds data has over 2500 outliers. That’s a lot! However, the means with and without outliers are almost identical, and the average of the outliers themselves is very similar to the overall average. In contrast, the variable price, with over 3500 outliers, is heavily influenced by them: the average of the outliers is almost 5 times higher than the average without them.
ft7 <- diagnose_outlier(diamonds) %>% flextable()
ft7 <- colformat_double(x = ft7, digits = 2)
ft7
variables | outliers_cnt | outliers_ratio | outliers_mean | with_mean | without_mean |
---|---|---|---|---|---|
carat | 32 | 0.06 | 7.24 | 5.73 | 5.73 |
depth | 32 | 0.06 | 7.24 | 5.73 | 5.73 |
table | 32 | 0.06 | 7.24 | 5.73 | 5.73 |
price | 32 | 0.06 | 7.24 | 5.73 | 5.73 |
x | 32 | 0.06 | 7.24 | 5.73 | 5.73 |
y | 32 | 0.06 | 7.24 | 5.73 | 5.73 |
z | 32 | 0.06 | 7.24 | 5.73 | 5.73 |
Besides, {dlookr} can visualize the distribution of the data with and without outliers, and, thanks to the collaboration with {dplyr}, we can choose to visualize only the variables with over 5% of their values being outliers:
airquality %>%
  dplyr::select(Ozone, Wind) %>%
  plot_outlier()
# Visualize variables with a ratio of outliers greater than 5%
diamonds %>%
  plot_outlier(diamonds %>%
                 diagnose_outlier() %>%
                 filter(outliers_ratio > 5) %>%
                 select(variables) %>%
                 pull())
Similar to the imputate_na() function, the {dlookr} package also provides an imputate_outlier() function, which allows us to impute outliers with several simple methods: mean, median, mode and capping. The last one, capping, is the fanciest: it imputes the upper outliers with the 95th percentile and the bottom outliers with the 5th percentile. Wrapping a simple plot() command around the result gives us the opportunity to check the quality of the imputation.
bla <- imputate_outlier(diamonds, carat, method = "capping")
plot(bla)
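For intuition, a simplified sketch of what capping does by hand; note that this winsorises every value beyond the 5th/95th percentiles, whereas imputate_outlier() replaces only the IQR-flagged outliers:
x <- diamonds$carat
q <- quantile(x, c(0.05, 0.95))      # 5th and 95th percentiles
capped <- pmin(pmax(x, q[1]), q[2])  # clamp to the 5th-95th percentile band
summary(capped)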
Remember that we should always investigate the reason why data are missing before trying to impute them; we will briefly discuss this topic in block 3.
In order to quickly check the relationships between numerical variables we can use the correlate() function from the {dlookr} package, which delivers correlation coefficients. If we don’t specify any target variable or method, Pearson’s correlation between ALL variables is calculated pairwise. {dlookr}’s plot_correlate() function is a bit more useful, because it visualizes these relationships. We can of course choose the method of calculation if we need to, be it the default "pearson" or the non-parametric "kendall" or "spearman" correlations, which are more appropriate for non-normally distributed or not-very-linearly related values with some outliers. The shape of each subplot shows the strength of the correlation, while the color shows the direction, with blue for positive and red for negative correlations. This is fine for a purely initial/exploratory analysis, but, as always, the next logical step would be to test hypotheses and figure out which correlations are significant.
cc <- correlate(airquality, method="spearman")
cc
## # A tibble: 30 × 3
## var1 var2 coef_corr
## <fct> <fct> <dbl>
## 1 Solar.R Ozone 0.348
## 2 Wind Ozone -0.590
## 3 Temp Ozone 0.774
## 4 Month Ozone 0.138
## 5 Day Ozone -0.0562
## 6 Ozone Solar.R 0.348
## 7 Wind Solar.R -0.000977
## 8 Temp Solar.R 0.207
## 9 Month Solar.R -0.128
## 10 Day Solar.R -0.152
## # ℹ 20 more rows
plot(cc)
zz <- diamonds %>%
  filter(cut %in% c("Premium", "Ideal")) %>%
  group_by(cut) %>%
  correlate()
plot(zz)
The ggcorrmat() function from {ggstatsplot} displays:
- the correlation coefficients,
- a colored heatmap showing positive and negative correlations,
- whether each correlation is significant or not (significantly different from zero; again, pay attention to the sample size!), with non-significant correlations simply crossed out.
Moreover, we can get the results in table form, with p-values and confidence intervals for the correlation coefficients, by simply using the output = "dataframe" argument.
ggcorrmat(data = iris)
ft8 <- correlation::correlation(data = iris,
                                method = "pearson")
ft8
## # Correlation Matrix (pearson-method)
##
## Parameter1 | Parameter2 | r | 95% CI | t(148) | p
## -------------------------------------------------------------------------
## Sepal.Length | Sepal.Width | -0.12 | [-0.27, 0.04] | -1.44 | 0.152
## Sepal.Length | Petal.Length | 0.87 | [ 0.83, 0.91] | 21.65 | < .001***
## Sepal.Length | Petal.Width | 0.82 | [ 0.76, 0.86] | 17.30 | < .001***
## Sepal.Width | Petal.Length | -0.43 | [-0.55, -0.29] | -5.77 | < .001***
## Sepal.Width | Petal.Width | -0.37 | [-0.50, -0.22] | -4.79 | < .001***
## Petal.Length | Petal.Width | 0.96 | [ 0.95, 0.97] | 43.39 | < .001***
##
## p-value adjustment method: Holm (1979)
## Observations: 150
If any particular correlation catches your attention during the Initial Data Analysis and you want to display it, use the ggscatterstats() function from the {ggstatsplot} package, which delivers the statistical details that matter, namely the correlation coefficient with its confidence interval, the p-value and the sample size, together with the marginal distributions of the two variables:
ggscatterstats(
  data = airquality,
  x = Ozone,
  y = Temp,
  type = "np"
)
## `stat_xsidebin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_ysidebin()` using `bins = 30`. Pick better value with `binwidth`.
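The corresponding single test in base R is a one-liner; a sketch with the Spearman method, matching the non-parametric type = "np" used above:
cor.test(airquality$Ozone, airquality$Temp,
         method = "spearman", exact = FALSE)  # exact = FALSE avoids the ties warning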
Once you have inspected the data, and eventually cleaned them of errors/outliers, manipulated some categorical variables in order to collapse categories, possibly transformed continuous variables, etc., then you can begin to create the descriptive tables for your report. First of all, the tbl_summary() function from the {gtsummary} package summarizes all categorical variables by counts and percentages, and all numeric variables by median and IQR. The argument by = inside tbl_summary() specifies a grouping variable. The add_p() function then conducts statistical tests on all variables and provides p-values. Remember: these statistical tests are used just for descriptive purposes and should be interpreted with caution! For numeric variables it uses the non-parametric Wilcoxon rank sum test when comparing two groups, and the non-parametric Kruskal-Wallis rank sum test for more than two groups. Categorical variables are checked with Fisher’s exact test, if the number of observations in any of the groups is below 5, or with Pearson’s Chi-squared test otherwise. We use as an example the dataset mtcars. The data were extracted from the 1974 Motor Trend US magazine, and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
mtcars %>%
  select(mpg, hp, am, gear, cyl) %>%
  tbl_summary(by = am) %>%
  add_p()
Characteristic | 0, N = 19¹ | 1, N = 13¹ | p-value² |
---|---|---|---|
mpg | 17.3 (15.0, 19.2) | 22.8 (21.0, 30.4) | 0.002 |
hp | 175 (117, 193) | 109 (66, 113) | 0.046 |
gear | <0.001 | ||
3 | 15 (79%) | 0 (0%) | |
4 | 4 (21%) | 8 (62%) | |
5 | 0 (0%) | 5 (38%) | |
cyl | 0.009 | ||
4 | 3 (16%) | 8 (62%) | |
6 | 4 (21%) | 3 (23%) | |
8 | 12 (63%) | 2 (15%) | |
¹ Median (IQR); n (%) | | |
² Wilcoxon rank sum test; Fisher’s exact test |
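As a sketch, one of these p-values can be cross-checked by hand: the Wilcoxon rank sum test that add_p() applies to mpg should reproduce the 0.002 in the table (up to how ties are handled).
wilcox.test(mpg ~ am, data = mtcars)  # p-value approximately 0.002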
As another example, for the dataset Wage:
Wage %>%
  select(age, wage, education, jobclass) %>%
  tbl_summary(by = education) %>%
  add_p()
Characteristic | 1. < HS Grad, N = 268¹ | 2. HS Grad, N = 971¹ | 3. Some College, N = 650¹ | 4. College Grad, N = 685¹ | 5. Advanced Degree, N = 426¹ | p-value² |
---|---|---|---|---|---|---|
age | 42 (33, 50) | 42 (33, 50) | 40 (32, 49) | 43 (34, 51) | 44 (38, 53) | <0.001 |
wage | 81 (70, 97) | 94 (78, 110) | 105 (89, 121) | 119 (100, 143) | 142 (117, 171) | <0.001 |
jobclass | <0.001 | |||||
1. Industrial | 190 (71%) | 636 (65%) | 342 (53%) | 274 (40%) | 102 (24%) | |
2. Information | 78 (29%) | 335 (35%) | 308 (47%) | 411 (60%) | 324 (76%) | |
¹ Median (IQR); n (%) | | | | | |
² Kruskal-Wallis rank sum test; Pearson’s Chi-squared test |
A fundamental first step in every scientific project involving data analysis is to explore the dataset at hand. In particular, in epidemiological and clinical research it is often of interest to compare two or more groups with respect to a specific outcome. In observational studies this step is especially important, in order to explore, for example, which variables could act as confounders in the setting under study.
library(gtsummary)
library(tidyverse)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
## The following object is masked from 'package:tigerstats':
##
## tips
library(ggplot2)
library(forcats)
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:dlookr':
##
## extract
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(haven)
data_hf <- read_sav(here("datasets","datiHF.sav"))
This dataset (simulated) is inspired by a real dataset of 784 patients diagnosed with Heart Failure (HF). This is an observational cohort study. The index date corresponds to the first visit of the patient in a specific cardiological department. We have a series of variables measured at the index visit by the cardiologist, and then we also have an “Outcome” binary variable that represents 1=death, 0=alive (after a certain “follow up” time; we do not focus on this aspect now). We have binary variables, categorical [ordered/unordered] variables, discrete variables and continuous variables.
Are you able to define which is which?
Some clues …
binary variables: SEX (1=Male, 0=Female); all other binary variables should be interpreted as 1=presence, 0=absence of the factor
categorical unordered variables: SMOKER_BAS (0=no, 1=ex, 2=current); RYTHM_BAS (1=Normal, 2=Atrial Fibrillation, 3=Pace Maker, 4=Others)
categorical ordered variables: NYHA CLASS_BAS (1, 2, 3, 4 are increasing levels of HF severity); MR_BAS (0=absence; 1, 2, 3, 4 are increasing levels of Mitral Regurgitation); TR_BAS (0=absence; 1, 2, 3, 4 are increasing levels of Tricuspid Regurgitation); see the sketch below
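As a sketch of how these codings could be declared explicitly in R (the labels follow the clues above; done on a copy, data_hf_typed, so the analysis object is untouched):
data_hf_typed <- data_hf %>%
  mutate(SMOKER_BAS = factor(SMOKER_BAS, levels = c(0, 1, 2),
                             labels = c("no", "ex", "current")),        # unordered
         RYTHM_BAS = factor(RYTHM_BAS, levels = 1:4,
                            labels = c("Normal", "Atrial Fibrillation",
                                       "Pace Maker", "Others")),        # unordered
         NYHACLASS_BAS = factor(NYHACLASS_BAS, levels = 1:4,
                                ordered = TRUE))                        # ordered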
First, we can make a table with some descriptive statistics of the entire cohort to look at:
data_hf %>%
  select(-ID) %>%
  mutate_if(is.factor, function(x) fct_explicit_na(x, na_level = "Unknown")) %>%
  tbl_summary(type = all_continuous() ~ "continuous2",
              statistic = all_continuous() ~ c("{median} ({p25}, {p75})",
                                               "{min}, {max}", "{p_miss}"),
              missing = "no")
Characteristic | N = 784¹ |
---|---|
AGE_BAS | |
Median (IQR) | 52 (42, 63) |
Range | 12, 86 |
% missing | 0 |
SEX | 545 (70%) |
WEIGHT_BAS | |
Median (IQR) | 77 (67, 87) |
Range | 37, 172 |
% missing | 0.9 |
HEIGHT_BAS | |
Median (IQR) | 172 (167, 178) |
Range | 115, 200 |
% missing | 1.0 |
BMI_BAS | |
Median (IQR) | 25.5 (23.1, 28.4) |
Range | 15.6, 49.1 |
% missing | 1.0 |
SMOKER_BAS | |
0 | 400 (52%) |
1 | 169 (22%) |
2 | 198 (26%) |
DIAGNOSISHF | 102 (13%) |
SYMPTOMS DURATION_BAS | |
Median (IQR) | 5 (1, 31) |
Range | 0, 365 |
% missing | 0.1 |
NYHA CLASS_BAS | |
1 | 328 (42%) |
2 | 286 (36%) |
3 | 133 (17%) |
4 | 37 (4.7%) |
REST DYSPNOEA_BAS | 37 (4.7%) |
SBP_BAS | |
Median (IQR) | 120 (110, 140) |
Range | 10, 220 |
% missing | 1.3 |
DBP_BAS | |
Median (IQR) | 80 (70, 85) |
Range | 50, 125 |
% missing | 1.3 |
THIRD SOUND_BAS | 188 (24%) |
DIABETES_BAS | 71 (9.1%) |
INSULIN_BAS | 7 (0.9%) |
COPD_BAS | 41 (5.2%) |
PERIPHERAL VASCULAR DISEASE_BAS | 43 (5.6%) |
DEPENDENT OEDEMA_BAS | 82 (10%) |
PULMONARY CRACKLES_BAS | 85 (11%) |
PULMONARY OEDEMA_BAS | 7 (0.9%) |
HEPATOMEGALY_BAS | 65 (8.7%) |
ALT_BAS | |
Median (IQR) | 24 (17, 38) |
Range | 0, 626 |
% missing | 39 |
AST_BAS | |
Median (IQR) | 22 (18, 29) |
Range | 7, 307 |
% missing | 39 |
GGT_BAS | |
Median (IQR) | 31 (18, 63) |
Range | 6, 473 |
% missing | 66 |
BILTOT_BAS | |
Median (IQR) | 0.90 (0.66, 1.21) |
Range | 0.00, 6.50 |
% missing | 52 |
ABNORMAL LIVER FUNCTION_BAS | 55 (7.0%) |
PRIOR STROKE_BAS | 25 (3.4%) |
PRIOR MI_BAS | 14 (1.8%) |
CARDIOMEGALY_BAS | 530 (68%) |
LVEF_BAS | |
Median (IQR) | 33 (25, 42) |
Range | 0, 74 |
% missing | 0 |
AORTIC STENOSIS_BAS | 3 (0.4%) |
MR_BAS | |
0 | 170 (22%) |
1 | 365 (47%) |
2 | 159 (20%) |
3 | 57 (7.3%) |
4 | 33 (4.2%) |
TR_BAS | |
0 | 332 (42%) |
1 | 404 (52%) |
2 | 32 (4.1%) |
3 | 12 (1.5%) |
4 | 4 (0.5%) |
SEVERE HEART VALVE DISEASE_BAS | 34 (4.3%) |
BSA_BAS | |
Median (IQR) | 1.91 (1.76, 2.06) |
Range | 0.00, 3.09 |
% missing | 1.3 |
LVESd_BAS | |
Median (IQR) | 51 (44, 59) |
Range | 21, 86 |
% missing | 8.4 |
LVEDd_BAS | |
Median (IQR) | 63 (58, 69) |
Range | 39, 95 |
% missing | 4.3 |
LVEDd.BAS_BAS | |
Median (IQR) | 33.1 (30.0, 36.4) |
Range | 0.0, 66.5 |
% missing | 6.0 |
LVESv_BAS | |
Median (IQR) | 104 (75, 145) |
Range | 24, 373 |
% missing | 6.3 |
LVEDV_BAS | |
Median (IQR) | 155 (125, 198) |
Range | 0, 478 |
% missing | 6.0 |
LVEDv.bNAS_BAS | |
Median (IQR) | 82 (66, 103) |
Range | 0, 247 |
% missing | 7.4 |
LVSF (FACC)_BAS | |
Median (IQR) | 19 (14, 25) |
Range | 0, 100 |
% missing | 4.0 |
HR_BAS | |
Median (IQR) | 68 (60, 79) |
Range | 0, 141 |
% missing | 0 |
RYTHM_BAS | |
1 | 678 (88%) |
2 | 57 (7.4%) |
3 | 33 (4.3%) |
4 | 2 (0.3%) |
QRS DURATION_BAS | |
Median (IQR) | 111 (97, 144) |
Range | 4, 320 |
% missing | 35 |
NA_BAS | |
Median (IQR) | 140.00 (138.00, 141.00) |
Range | 126.00, 148.00 |
% missing | 38 |
HB_BAS | |
Median (IQR) | 14.10 (13.00, 15.00) |
Range | 1.50, 149.00 |
% missing | 21 |
CREATININE_BAS | |
Median (IQR) | 0.99 (0.85, 1.15) |
Range | 0.04, 2.72 |
% missing | 18 |
EGFR_BAS | |
Median (IQR) | 74 (62, 89) |
Range | 23, 3,230 |
% missing | 18 |
OUTCOME | 191 (24%) |
¹ n (%) |
We can also plot the distribution of the numerical variables in the group of patients who had the outcome vs. those who did not, just to have a visual check on the distribution of values:
melted_data <- data_hf %>%
  select(where(function(x) length(unique(x)) > 7), -ID, OUTCOME) %>%
  melt(id.var = "OUTCOME")
## Warning: attributes are not identical across measure variables; they will be
## dropped
ggplot(melted_data) +
  geom_boxplot(aes(x = variable, y = value, fill = as.factor(OUTCOME))) +
  facet_wrap(~ variable, scales = "free") +
  scale_fill_discrete("Outcome")
## Warning: Removed 2944 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
We notice that there are some values in the continuous variables which are probably errors/outliers in the data collection, so now (for simplicity) we force them to missing values:
# case_when() returns NA whenever no condition matches, so values outside the
# plausible ranges below become missing
data_hf %<>% mutate(HB_BAS = case_when(HB_BAS < 50 ~ HB_BAS),
                    EGFR_BAS = case_when(EGFR_BAS < 1000 ~ EGFR_BAS),
                    LVEF_BAS = case_when(LVEF_BAS > 0 ~ LVEF_BAS))
Check whether there are other problems with the check_outliers() function! In this kind of dataset, however, it is always better to discuss with the medical doctors in order to find which values are clearly errors…
data_hf %>%
  select(-ID) %>%
  mutate_if(is.factor, function(x) fct_explicit_na(x, na_level = "Unknown")) %>%
  tbl_summary(type = all_continuous() ~ "continuous2",
              statistic = all_continuous() ~ c("{median} ({p25}, {p75})",
                                               "{min}, {max}", "{p_miss}"),
              missing = "no")
Characteristic | N = 784¹ |
---|---|
AGE_BAS | |
Median (IQR) | 52 (42, 63) |
Range | 12, 86 |
% missing | 0 |
SEX | 545 (70%) |
WEIGHT_BAS | |
Median (IQR) | 77 (67, 87) |
Range | 37, 172 |
% missing | 0.9 |
HEIGHT_BAS | |
Median (IQR) | 172 (167, 178) |
Range | 115, 200 |
% missing | 1.0 |
BMI_BAS | |
Median (IQR) | 25.5 (23.1, 28.4) |
Range | 15.6, 49.1 |
% missing | 1.0 |
SMOKER_BAS | |
0 | 400 (52%) |
1 | 169 (22%) |
2 | 198 (26%) |
DIAGNOSISHF | 102 (13%) |
SYMPTOMS DURATION_BAS | |
Median (IQR) | 5 (1, 31) |
Range | 0, 365 |
% missing | 0.1 |
NYHA CLASS_BAS | |
1 | 328 (42%) |
2 | 286 (36%) |
3 | 133 (17%) |
4 | 37 (4.7%) |
REST DYSPNOEA_BAS | 37 (4.7%) |
SBP_BAS | |
Median (IQR) | 120 (110, 140) |
Range | 10, 220 |
% missing | 1.3 |
DBP_BAS | |
Median (IQR) | 80 (70, 85) |
Range | 50, 125 |
% missing | 1.3 |
THIRD SOUND_BAS | 188 (24%) |
DIABETES_BAS | 71 (9.1%) |
INSULIN_BAS | 7 (0.9%) |
COPD_BAS | 41 (5.2%) |
PERIPHERAL VASCULAR DISEASE_BAS | 43 (5.6%) |
DEPENDENT OEDEMA_BAS | 82 (10%) |
PULMONARY CRACKLES_BAS | 85 (11%) |
PULMONARY OEDEMA_BAS | 7 (0.9%) |
HEPATOMEGALY_BAS | 65 (8.7%) |
ALT_BAS | |
Median (IQR) | 24 (17, 38) |
Range | 0, 626 |
% missing | 39 |
AST_BAS | |
Median (IQR) | 22 (18, 29) |
Range | 7, 307 |
% missing | 39 |
GGT_BAS | |
Median (IQR) | 31 (18, 63) |
Range | 6, 473 |
% missing | 66 |
BILTOT_BAS | |
Median (IQR) | 0.90 (0.66, 1.21) |
Range | 0.00, 6.50 |
% missing | 52 |
ABNORMAL LIVER FUNCTION_BAS | 55 (7.0%) |
PRIOR STROKE_BAS | 25 (3.4%) |
PRIOR MI_BAS | 14 (1.8%) |
CARDIOMEGALY_BAS | 530 (68%) |
LVEF_BAS | |
Median (IQR) | 33 (25, 42) |
Range | 7, 74 |
% missing | 0.5 |
AORTIC STENOSIS_BAS | 3 (0.4%) |
MR_BAS | |
0 | 170 (22%) |
1 | 365 (47%) |
2 | 159 (20%) |
3 | 57 (7.3%) |
4 | 33 (4.2%) |
TR_BAS | |
0 | 332 (42%) |
1 | 404 (52%) |
2 | 32 (4.1%) |
3 | 12 (1.5%) |
4 | 4 (0.5%) |
SEVERE HEART VALVE DISEASE_BAS | 34 (4.3%) |
BSA_BAS | |
Median (IQR) | 1.91 (1.76, 2.06) |
Range | 0.00, 3.09 |
% missing | 1.3 |
LVESd_BAS | |
Median (IQR) | 51 (44, 59) |
Range | 21, 86 |
% missing | 8.4 |
LVEDd_BAS | |
Median (IQR) | 63 (58, 69) |
Range | 39, 95 |
% missing | 4.3 |
LVEDd.BAS_BAS | |
Median (IQR) | 33.1 (30.0, 36.4) |
Range | 0.0, 66.5 |
% missing | 6.0 |
LVESv_BAS | |
Median (IQR) | 104 (75, 145) |
Range | 24, 373 |
% missing | 6.3 |
LVEDV_BAS | |
Median (IQR) | 155 (125, 198) |
Range | 0, 478 |
% missing | 6.0 |
LVEDv.bNAS_BAS | |
Median (IQR) | 82 (66, 103) |
Range | 0, 247 |
% missing | 7.4 |
LVSF (FACC)_BAS | |
Median (IQR) | 19 (14, 25) |
Range | 0, 100 |
% missing | 4.0 |
HR_BAS | |
Median (IQR) | 68 (60, 79) |
Range | 0, 141 |
% missing | 0 |
RYTHM_BAS | |
1 | 678 (88%) |
2 | 57 (7.4%) |
3 | 33 (4.3%) |
4 | 2 (0.3%) |
QRS DURATION_BAS | |
Median (IQR) | 111 (97, 144) |
Range | 4, 320 |
% missing | 35 |
NA_BAS | |
Median (IQR) | 140.00 (138.00, 141.00) |
Range | 126.00, 148.00 |
% missing | 38 |
HB_BAS | |
Median (IQR) | 14.10 (13.00, 15.00) |
Range | 1.50, 19.50 |
% missing | 21 |
CREATININE_BAS | |
Median (IQR) | 0.99 (0.85, 1.15) |
Range | 0.04, 2.72 |
% missing | 18 |
EGFR_BAS | |
Median (IQR) | 74 (62, 89) |
Range | 23, 165 |
% missing | 18 |
OUTCOME | 191 (24%) |
¹ n (%) |
melted_data <- data_hf %>%
  select(where(function(x) length(unique(x)) > 7), -ID, OUTCOME) %>%
  melt(id.var = "OUTCOME")
## Warning: attributes are not identical across measure variables; they will be
## dropped
ggplot(melted_data) +
  geom_boxplot(aes(x = variable, y = value, fill = as.factor(OUTCOME))) +
  facet_wrap(~ variable, scales = "free") +
  scale_fill_discrete("Outcome")
## Warning: Removed 2951 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
From the last plots, we can now decide for which variable it makes sense to use the mean (sd) and for which the quartiles are more appropriate…
Check the normality of the numerical variables also with the plot_qq() and plot_normality() functions.
There are many missing values, especially for some continuous variables… check whether you can replace them, at least for the variables with up to 25% missing (the others will probably be excluded from further analyses).
Let’s prepare a first table of descriptive statistics between groups of patients stratified by the outcome, to be then checked with the medical doctor that gave us the data. Remember that this is just a descriptive analysis, meant to find some indication in the data about differences in parameters that could be associated with the outcome; each finding should be discussed with the expert. These are univariable analyses, therefore a lot of confounding could be present here! Moreover, the statistical tests and the p-values are reported just for completeness, and their interpretation should be very cautious: there is no a priori expected effect size, as there generally is in RCTs, and p-values are related to sample size: for large sample sizes the probability of observing a statistically significant difference is high, even if, from a clinical point of view, the difference may be irrelevant!
data_hf %>%
  select(-ID) %>%
  tbl_summary(by = OUTCOME,
              statistic = list(AGE_BAS ~ "{mean} ({sd})",
                               LVEF_BAS ~ "{mean} ({sd})"),
              missing = "no") %>%
  add_overall() %>% # column overall
  add_stat_label() %>%
  add_p(test = list(AGE_BAS ~ "t.test", # default is the Mann-Whitney test for numeric variables
                    LVEF_BAS ~ "t.test")) %>%
  bold_labels() %>%
  modify_spanning_header(all_stat_cols() ~ "**Outcome**")
Characteristic | Outcome | p-value¹ | ||
---|---|---|---|---|
Overall, N = 784 | 0, N = 593 | 1, N = 191 | ||
AGE_BAS, Mean (SD) | 52 (15) | 50 (14) | 58 (15) | <0.001 |
SEX, n (%) | 545 (70%) | 411 (69%) | 134 (70%) | 0.8 |
WEIGHT_BAS, Median (IQR) | 77 (67, 87) | 77 (68, 87) | 76 (65, 87) | 0.2 |
HEIGHT_BAS, Median (IQR) | 172 (167, 178) | 173 (167, 178) | 171 (165, 178) | 0.092 |
BMI_BAS, Median (IQR) | 25.5 (23.1, 28.4) | 25.6 (23.1, 28.7) | 25.5 (23.1, 28.4) | 0.6 |
SMOKER_BAS, n (%) | 0.8 | |||
0 | 400 (52%) | 300 (52%) | 100 (53%) | |
1 | 169 (22%) | 130 (23%) | 39 (21%) | |
2 | 198 (26%) | 147 (25%) | 51 (27%) | |
DIAGNOSISHF, n (%) | 102 (13%) | 57 (9.9%) | 45 (24%) | <0.001 |
SYMPTOMS DURATION_BAS, Median (IQR) | 5 (1, 31) | 5 (1, 27) | 8 (1, 48) | 0.047 |
NYHA CLASS_BAS, n (%) | <0.001 | |||
1 | 328 (42%) | 274 (46%) | 54 (28%) | |
2 | 286 (36%) | 217 (37%) | 69 (36%) | |
3 | 133 (17%) | 83 (14%) | 50 (26%) | |
4 | 37 (4.7%) | 19 (3.2%) | 18 (9.4%) | |
REST DYSPNOEA_BAS, n (%) | 37 (4.7%) | 19 (3.2%) | 18 (9.4%) | <0.001 |
SBP_BAS, Median (IQR) | 120 (110, 140) | 120 (110, 140) | 120 (110, 140) | 0.6 |
DBP_BAS, Median (IQR) | 80 (70, 85) | 80 (70, 85) | 80 (70, 80) | 0.019 |
THIRD SOUND_BAS, n (%) | 188 (24%) | 118 (20%) | 70 (37%) | <0.001 |
DIABETES_BAS, n (%) | 71 (9.1%) | 42 (7.1%) | 29 (15%) | <0.001 |
INSULIN_BAS, n (%) | 7 (0.9%) | 4 (0.7%) | 3 (1.6%) | 0.4 |
COPD_BAS, n (%) | 41 (5.2%) | 18 (3.0%) | 23 (12%) | <0.001 |
PERIPHERAL VASCULAR DISEASE_BAS, n (%) | 43 (5.6%) | 25 (4.3%) | 18 (9.4%) | 0.008 |
DEPENDENT OEDEMA_BAS, n (%) | 82 (10%) | 42 (7.1%) | 40 (21%) | <0.001 |
PULMONARY CRACKLES_BAS, n (%) | 85 (11%) | 51 (8.6%) | 34 (18%) | <0.001 |
PULMONARY OEDEMA_BAS, n (%) | 7 (0.9%) | 4 (0.7%) | 3 (1.6%) | 0.4 |
HEPATOMEGALY_BAS, n (%) | 65 (8.7%) | 38 (6.8%) | 27 (14%) | 0.002 |
ALT_BAS, Median (IQR) | 24 (17, 38) | 25 (17, 38) | 22 (17, 36) | 0.5 |
AST_BAS, Median (IQR) | 22 (18, 29) | 21 (18, 28) | 22 (16, 34) | 0.5 |
GGT_BAS, Median (IQR) | 31 (18, 63) | 30 (17, 50) | 38 (23, 90) | 0.006 |
BILTOT_BAS, Median (IQR) | 0.90 (0.66, 1.21) | 0.87 (0.60, 1.20) | 0.97 (0.72, 1.30) | 0.031 |
ABNORMAL LIVER FUNCTION_BAS, n (%) | 55 (7.0%) | 39 (6.6%) | 16 (8.4%) | 0.4 |
PRIOR STROKE_BAS, n (%) | 25 (3.4%) | 12 (2.1%) | 13 (7.1%) | 0.001 |
PRIOR MI_BAS, n (%) | 14 (1.8%) | 9 (1.5%) | 5 (2.6%) | 0.3 |
CARDIOMEGALY_BAS, n (%) | 530 (68%) | 390 (66%) | 140 (73%) | 0.053 |
LVEF_BAS, Mean (SD) | 33 (11) | 35 (11) | 30 (10) | <0.001 |
AORTIC STENOSIS_BAS, n (%) | 3 (0.4%) | 2 (0.3%) | 1 (0.5%) | 0.6 |
MR_BAS, n (%) | 0.002 | |||
0 | 170 (22%) | 141 (24%) | 29 (15%) | |
1 | 365 (47%) | 280 (47%) | 85 (45%) | |
2 | 159 (20%) | 117 (20%) | 42 (22%) | |
3 | 57 (7.3%) | 37 (6.2%) | 20 (10%) | |
4 | 33 (4.2%) | 18 (3.0%) | 15 (7.9%) | |
TR_BAS, n (%) | 0.004 | |||
0 | 332 (42%) | 260 (44%) | 72 (38%) | |
1 | 404 (52%) | 307 (52%) | 97 (51%) | |
2 | 32 (4.1%) | 19 (3.2%) | 13 (6.8%) | |
3 | 12 (1.5%) | 6 (1.0%) | 6 (3.1%) | |
4 | 4 (0.5%) | 1 (0.2%) | 3 (1.6%) | |
SEVERE HEART VALVE DISEASE_BAS, n (%) | 34 (4.3%) | 18 (3.0%) | 16 (8.4%) | 0.002 |
BSA_BAS, Median (IQR) | 1.91 (1.76, 2.06) | 1.91 (1.77, 2.06) | 1.91 (1.75, 2.05) | 0.2 |
LVESd_BAS, Median (IQR) | 51 (44, 59) | 50 (44, 57) | 55 (47, 62) | <0.001 |
LVEDd_BAS, Median (IQR) | 63 (58, 69) | 63 (58, 68) | 65 (60, 71) | 0.005 |
LVEDd.BAS_BAS, Median (IQR) | 33.1 (30.0, 36.4) | 33.0 (29.8, 36.0) | 33.9 (30.8, 38.0) | 0.002 |
LVESv_BAS, Median (IQR) | 104 (75, 145) | 101 (73, 137) | 119 (82, 163) | <0.001 |
LVEDV_BAS, Median (IQR) | 155 (125, 198) | 153 (125, 193) | 169 (127, 213) | 0.017 |
LVEDv.bNAS_BAS, Median (IQR) | 82 (66, 103) | 80 (66, 99) | 89 (68, 112) | 0.004 |
LVSF (FACC)_BAS, Median (IQR) | 19 (14, 25) | 20 (15, 26) | 16 (11, 22) | <0.001 |
HR_BAS, Median (IQR) | 68 (60, 79) | 67 (60, 78) | 71 (60, 80) | 0.2 |
RYTHM_BAS, n (%) | <0.001 | |||
1 | 678 (88%) | 530 (91%) | 148 (80%) | |
2 | 57 (7.4%) | 31 (5.3%) | 26 (14%) | |
3 | 33 (4.3%) | 22 (3.8%) | 11 (5.9%) | |
4 | 2 (0.3%) | 2 (0.3%) | 0 (0%) | |
QRS DURATION_BAS, Median (IQR) | 111 (97, 144) | 110 (96, 140) | 121 (106, 160) | <0.001 |
NA_BAS, Median (IQR) | 140.00 (138.00, 141.00) | 140.00 (138.00, 141.00) | 139.00 (138.00, 141.00) | 0.3 |
HB_BAS, Median (IQR) | 14.10 (13.00, 15.00) | 14.10 (13.20, 15.10) | 14.05 (12.70, 14.88) | 0.2 |
CREATININE_BAS, Median (IQR) | 0.99 (0.85, 1.15) | 0.97 (0.84, 1.11) | 1.09 (0.89, 1.24) | <0.001 |
EGFR_BAS, Median (IQR) | 74 (62, 89) | 77 (64, 90) | 67 (55, 82) | <0.001 |
¹ Welch Two Sample t-test; Pearson’s Chi-squared test; Wilcoxon rank sum test; Fisher’s exact test |
Very often in medical studies there is a specific parameter of interest, and the association between this parameter and the risk of the outcome is explored. But as a first step we need to investigate whether subjects classified on the basis of this parameter show relevant differences in other variables. This is important, since these differences can act as confounders in the subsequent steps. Let’s now prepare a second table of descriptive statistics, stratifying subjects in 3 groups based on the initial level of NYHA class (we will collapse the last two categories into one), always to be discussed with the medical doctor that gave us the data. Remember the problem of multiple comparisons here for the p-values: in principle they should be corrected to take the alpha-inflation into account.
t0 <- data_hf %>%
  select(-ID, -OUTCOME) %>%
  mutate(NYHACLASS_BAS = as.factor(ifelse(NYHACLASS_BAS != "1" & NYHACLASS_BAS != "2",
                                          "3-4", NYHACLASS_BAS))) %>%
  tbl_summary(by = NYHACLASS_BAS,
              statistic = list(AGE_BAS ~ "{mean} ({sd})",
                               LVEF_BAS ~ "{mean} ({sd})"),
              missing = "no") %>%
  add_overall() %>% # column overall
  add_stat_label() %>%
  add_p(list(AGE_BAS ~ "aov", # default is the Kruskal-Wallis test for numeric variables
             LVEF_BAS ~ "aov"),
        include = c(-TR_BAS)) %>% # this categorical variable is too sparse, categories 3 and 4 should be collapsed in one in case!
  bold_labels() %>%
  modify_spanning_header(all_stat_cols() ~ "**NYHA class**") %>%
  modify_header(p.value ~ "**Global p-value**")
# table comparing class I and class II
t1 <- data_hf %>%
  select(-ID, -OUTCOME) %>%
  filter(NYHACLASS_BAS %in% c("1", "2")) %>%
  tbl_summary(by = NYHACLASS_BAS, missing = "no",
              statistic = list(AGE_BAS ~ "{mean} ({sd})",
                               LVEF_BAS ~ "{mean} ({sd})")) %>%
  add_p(test = list(AGE_BAS ~ "t.test", # default is the Mann-Whitney test for numeric variables
                    LVEF_BAS ~ "t.test"),
        include = c(-MR_BAS, -RESTDYSPNOEA_BAS)) %>% # these categorical variables are too sparse, and RESTDYSPNOEA_BAS is linked to NYHA, so useless to crosstab
  add_stat_label() %>%
  modify_header(p.value ~ "**1 vs. 2**") %>%
  # hide summary stat columns
  modify_column_hide(all_stat_cols())
# table comparing grade I and III-IV
t2 <- data_hf %>%
  mutate(NYHACLASS_BAS = as.factor(ifelse(NYHACLASS_BAS != "1" & NYHACLASS_BAS != "2",
                                          "3-4", NYHACLASS_BAS))) %>%
  select(-ID, -OUTCOME) %>%
  filter(NYHACLASS_BAS %in% c("1", "3-4")) %>%
  tbl_summary(by = NYHACLASS_BAS, missing = "no",
              statistic = list(AGE_BAS ~ "{mean} ({sd})",
                               LVEF_BAS ~ "{mean} ({sd})")) %>%
  add_stat_label() %>%
  add_p(test = list(AGE_BAS ~ "t.test", # default is the Mann-Whitney test for numeric variables
                    LVEF_BAS ~ "t.test"),
        include = c(-MR_BAS)) %>%
  modify_header(p.value ~ "**1 vs. 3-4**") %>%
  # hide summary stat columns
  modify_column_hide(all_stat_cols())
# table comparing grade II and III-IV
t3 <- data_hf %>%
  mutate(NYHACLASS_BAS = as.factor(ifelse(NYHACLASS_BAS != "1" & NYHACLASS_BAS != "2",
                                          "3-4", NYHACLASS_BAS))) %>%
  select(-ID, -OUTCOME) %>%
  filter(NYHACLASS_BAS %in% c("2", "3-4")) %>%
  tbl_summary(by = NYHACLASS_BAS, missing = "no",
              statistic = list(AGE_BAS ~ "{mean} ({sd})",
                               LVEF_BAS ~ "{mean} ({sd})")) %>%
  add_stat_label() %>%
  add_p(test = list(AGE_BAS ~ "t.test", # default is the Mann-Whitney test for numeric variables
                    LVEF_BAS ~ "t.test")) %>%
  modify_header(p.value ~ "**2 vs. 3-4**") %>%
  # hide summary stat columns
  modify_column_hide(all_stat_cols())
# merging the 4 tables together, and adding additional gt formatting
tbl_merge(list(t0, t1, t2, t3)) %>%
  modify_spanning_header(
    list(
      all_stat_cols() ~ "**NYHA class**",
      starts_with("p.value") ~ "**p-values**"
    )
  )
Characteristic | NYHA class | p-values | ||||||
---|---|---|---|---|---|---|---|---|
Overall, N = 784 | 1, N = 328 | 2, N = 286 | 3-4, N = 170 | Global p-value¹ | 1 vs. 2² | 1 vs. 3-4³ | 2 vs. 3-4³ |
AGE_BAS, Mean (SD) | 52 (15) | 48 (16) | 54 (14) | 55 (13) | <0.001 | <0.001 | <0.001 | 0.4 |
SEX, n (%) | 545 (70%) | 248 (76%) | 186 (65%) | 111 (65%) | 0.007 | 0.004 | 0.020 | >0.9 |
WEIGHT_BAS, Median (IQR) | 77 (67, 87) | 76 (67, 85) | 78 (67, 90) | 76 (66, 88) | 0.2 | 0.10 | 0.7 | 0.3 |
HEIGHT_BAS, Median (IQR) | 172 (167, 178) | 174 (168, 180) | 171 (166, 178) | 170 (163, 178) | <0.001 | 0.013 | <0.001 | 0.11 |
BMI_BAS, Median (IQR) | 25.5 (23.1, 28.4) | 24.9 (22.9, 27.7) | 26.1 (23.5, 29.3) | 25.8 (23.2, 29.7) | 0.003 | 0.002 | 0.013 | 0.9 |
SMOKER_BAS, n (%) | 0.2 | 0.074 | 0.3 | 0.6 | ||||
0 | 400 (52%) | 183 (57%) | 135 (49%) | 82 (49%) | ||||
1 | 169 (22%) | 67 (21%) | 60 (22%) | 42 (25%) | ||||
2 | 198 (26%) | 72 (22%) | 83 (30%) | 43 (26%) | ||||
DIAGNOSISHF, n (%) | 102 (13%) | 14 (4.3%) | 51 (18%) | 37 (23%) | <0.001 | <0.001 | <0.001 | 0.3 |
SYMPTOMS DURATION_BAS, Median (IQR) | 5 (1, 31) | 4 (0, 29) | 7 (2, 33) | 4 (1, 28) | 0.015 | 0.009 | 0.5 | 0.027 |
REST DYSPNOEA_BAS, n (%) | 37 (4.7%) | 0 (0%) | 0 (0%) | 37 (22%) | <0.001 | <0.001 | <0.001 | |
SBP_BAS, Median (IQR) | 120 (110, 140) | 120 (115, 140) | 125 (110, 139) | 120 (110, 140) | 0.025 | 0.2 | 0.006 | 0.12 |
DBP_BAS, Median (IQR) | 80 (70, 85) | 80 (70, 80) | 80 (70, 85) | 80 (70, 90) | 0.9 | 0.7 | 0.7 | 0.8 |
THIRD SOUND_BAS, n (%) | 188 (24%) | 35 (11%) | 72 (25%) | 81 (48%) | <0.001 | <0.001 | <0.001 | <0.001 |
DIABETES_BAS, n (%) | 71 (9.1%) | 21 (6.4%) | 30 (10%) | 20 (12%) | 0.081 | 0.067 | 0.057 | 0.8 |
INSULIN_BAS, n (%) | 7 (0.9%) | 2 (0.6%) | 1 (0.3%) | 4 (2.4%) | 0.11 | >0.9 | 0.2 | 0.067 |
COPD_BAS, n (%) | 41 (5.2%) | 7 (2.1%) | 18 (6.3%) | 16 (9.4%) | 0.002 | 0.009 | <0.001 | 0.3 |
PERIPHERAL VASCULAR DISEASE_BAS, n (%) | 43 (5.6%) | 8 (2.5%) | 21 (7.4%) | 14 (8.3%) | 0.007 | 0.005 | 0.005 | 0.7 |
DEPENDENT OEDEMA_BAS, n (%) | 82 (10%) | 10 (3.0%) | 24 (8.4%) | 48 (28%) | <0.001 | 0.004 | <0.001 | <0.001 |
PULMONARY CRACKLES_BAS, n (%) | 85 (11%) | 10 (3.0%) | 22 (7.7%) | 53 (31%) | <0.001 | 0.010 | <0.001 | <0.001 |
PULMONARY OEDEMA_BAS, n (%) | 7 (0.9%) | 0 (0%) | 1 (0.3%) | 6 (3.5%) | <0.001 | 0.5 | 0.001 | 0.012 |
HEPATOMEGALY_BAS, n (%) | 65 (8.7%) | 5 (1.6%) | 24 (8.7%) | 36 (22%) | <0.001 | <0.001 | <0.001 | <0.001 |
ALT_BAS, Median (IQR) | 24 (17, 38) | 22 (16, 31) | 23 (16, 38) | 27 (19, 51) | 0.004 | 0.2 | <0.001 | 0.028 |
AST_BAS, Median (IQR) | 22 (18, 29) | 20 (17, 25) | 22 (17, 29) | 24 (19, 36) | <0.001 | 0.068 | <0.001 | 0.026 |
GGT_BAS, Median (IQR) | 31 (18, 63) | 22 (16, 39) | 33 (21, 67) | 39 (23, 87) | <0.001 | 0.002 | <0.001 | 0.3 |
BILTOT_BAS, Median (IQR) | 0.90 (0.66, 1.21) | 0.80 (0.60, 1.09) | 0.90 (0.60, 1.25) | 1.00 (0.72, 1.60) | <0.001 | 0.2 | <0.001 | 0.013 |
ABNORMAL LIVER FUNCTION_BAS, n (%) | 55 (7.0%) | 11 (3.4%) | 18 (6.3%) | 26 (15%) | <0.001 | 0.085 | <0.001 | 0.003 |
PRIOR STROKE_BAS, n (%) | 25 (3.4%) | 5 (1.6%) | 13 (4.8%) | 7 (4.5%) | 0.066 | 0.023 | 0.069 | >0.9 |
PRIOR MI_BAS, n (%) | 14 (1.8%) | 5 (1.5%) | 4 (1.4%) | 5 (2.9%) | 0.4 | >0.9 | 0.3 | 0.3 |
CARDIOMEGALY_BAS, n (%) | 530 (68%) | 182 (55%) | 216 (76%) | 132 (78%) | <0.001 | <0.001 | <0.001 | 0.6 |
LVEF_BAS, Mean (SD) | 33 (11) | 39 (10) | 32 (9) | 26 (9) | <0.001 | <0.001 | <0.001 | <0.001 |
AORTIC STENOSIS_BAS, n (%) | 3 (0.4%) | 0 (0%) | 2 (0.7%) | 1 (0.6%) | 0.3 | 0.2 | 0.3 | >0.9 |
MR_BAS, n (%) | <0.001 | 0.012 | ||||||
0 | 170 (22%) | 97 (30%) | 53 (19%) | 20 (12%) | ||||
1 | 365 (47%) | 178 (54%) | 123 (43%) | 64 (38%) | ||||
2 | 159 (20%) | 37 (11%) | 75 (26%) | 47 (28%) | ||||
3 | 57 (7.3%) | 14 (4.3%) | 23 (8.0%) | 20 (12%) | ||||
4 | 33 (4.2%) | 2 (0.6%) | 12 (4.2%) | 19 (11%) | ||||
TR_BAS, n (%) | 0.001 | <0.001 | 0.062 | |||||
0 | 332 (42%) | 162 (49%) | 113 (40%) | 57 (34%) | ||||
1 | 404 (52%) | 162 (49%) | 154 (54%) | 88 (52%) | ||||
2 | 32 (4.1%) | 3 (0.9%) | 13 (4.5%) | 16 (9.4%) | ||||
3 | 12 (1.5%) | 1 (0.3%) | 4 (1.4%) | 7 (4.1%) | ||||
4 | 4 (0.5%) | 0 (0%) | 2 (0.7%) | 2 (1.2%) | ||||
SEVERE HEART VALVE DISEASE_BAS, n (%) | 34 (4.3%) | 2 (0.6%) | 13 (4.5%) | 19 (11%) | <0.001 | 0.002 | <0.001 | 0.012 |
BSA_BAS, Median (IQR) | 1.91 (1.76, 2.06) | 1.91 (1.76, 2.05) | 1.92 (1.77, 2.09) | 1.90 (1.75, 2.05) | 0.5 | 0.4 | 0.7 | 0.3 |
LVESd_BAS, Median (IQR) | 51 (44, 59) | 46 (41, 52) | 53 (48, 60) | 55 (50, 63) | <0.001 | <0.001 | <0.001 | 0.020 |
LVEDd_BAS, Median (IQR) | 63 (58, 69) | 61 (56, 65) | 65 (60, 70) | 66 (60, 73) | <0.001 | <0.001 | <0.001 | 0.2 |
LVEDd.BAS_BAS, Median (IQR) | 33.1 (30.0, 36.4) | 31.5 (29.3, 34.6) | 33.7 (30.6, 37.0) | 34.9 (31.7, 38.6) | <0.001 | <0.001 | <0.001 | 0.006 |
LVESv_BAS, Median (IQR) | 104 (75, 145) | 85 (66, 111) | 112 (85, 149) | 137 (101, 187) | <0.001 | <0.001 | <0.001 | <0.001 |
LVEDV_BAS, Median (IQR) | 155 (125, 198) | 142 (118, 170) | 163 (129, 206) | 186 (136, 233) | <0.001 | <0.001 | <0.001 | 0.015 |
LVEDv.bNAS_BAS, Median (IQR) | 82 (66, 103) | 74 (62, 87) | 84 (68, 106) | 96 (78, 120) | <0.001 | <0.001 | <0.001 | 0.001 |
LVSF (FACC)_BAS, Median (IQR) | 19 (14, 25) | 23 (18, 30) | 17 (13, 23) | 15 (11, 21) | <0.001 | <0.001 | <0.001 | 0.017 |
HR_BAS, Median (IQR) | 68 (60, 79) | 64 (58, 74) | 69 (60, 80) | 75 (68, 90) | <0.001 | <0.001 | <0.001 | <0.001 |
RYTHM_BAS, n (%) | 0.045 | 0.086 | 0.011 | 0.7 | ||||
1 | 678 (88%) | 297 (92%) | 242 (86%) | 139 (83%) | ||||
2 | 57 (7.4%) | 14 (4.3%) | 25 (8.9%) | 18 (11%) | ||||
3 | 33 (4.3%) | 11 (3.4%) | 12 (4.3%) | 10 (6.0%) | ||||
4 | 2 (0.3%) | 1 (0.3%) | 1 (0.4%) | 0 (0%) | ||||
QRS DURATION_BAS, Median (IQR) | 111 (97, 144) | 109 (97, 140) | 112 (96, 154) | 115 (104, 134) | 0.2 | 0.2 | 0.10 | 0.8 |
NA_BAS, Median (IQR) | 140.00 (138.00, 141.00) | 140.00 (138.00, 141.00) | 140.00 (138.00, 141.00) | 139.00 (137.00, 141.00) | 0.014 | 0.14 | 0.003 | 0.13 |
HB_BAS, Median (IQR) | 14.10 (13.00, 15.00) | 14.40 (13.40, 15.20) | 14.00 (12.80, 14.90) | 14.20 (12.90, 15.00) | 0.008 | 0.002 | 0.066 | 0.3 |
CREATININE_BAS, Median (IQR) | 0.99 (0.85, 1.15) | 0.95 (0.82, 1.10) | 1.01 (0.88, 1.17) | 1.00 (0.86, 1.18) | 0.007 | 0.002 | 0.035 | 0.7 |
EGFR_BAS, Median (IQR) | 74 (62, 89) | 80 (68, 95) | 72 (60, 86) | 70 (59, 85) | <0.001 | <0.001 | <0.001 | 0.8 |
¹ One-way ANOVA; Pearson’s Chi-squared test; Kruskal-Wallis rank sum test; Fisher’s exact test | | | | | | | |
² Welch Two Sample t-test; Pearson’s Chi-squared test; Wilcoxon rank sum test; Fisher’s exact test | | | | | | | |
³ Welch Two Sample t-test; Fisher’s exact test; Wilcoxon rank sum test |
As you can see from the table, there are several differences between the groups with respect to the NYHA class. As before, take the p-values of the statistical tests with caution: this is just an initial data analysis.
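As a sketch of the correction mentioned above, the three pairwise p-values of any row can be adjusted with p.adjust(); for example, plugging in the SEX row (with >0.9 read as 0.9):
p.adjust(c(0.004, 0.020, 0.9), method = "bonferroni")  # Bonferroni-adjusted pairwise p-values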
The next step should be to define the scientific question of interest to be answered using this data:
Based on the discussion with medical doctors we will focus on the primary scientific question of interest and consequently we will apply the correct statistical approach.