First of all, install and load the required libraries: rms, Hmisc, epiDisplay, ggplot2, psych and here.
library(rms)
library(Hmisc)
library(epiDisplay)
library(ggplot2)
library(psych)
library(here)
Let's simulate data from a sample of n=100 points along with the population linear regression line. The conditional distribution of y|x can be thought of as a vertical slice at x. The unconditional distribution of y is shown on the y-axis. To envision the conditional normal distributions assumed for the underlying population, think of a bell-shaped curve coming out of the page, with its base along one of the vertical lines of points. The equal-variance assumption (homoscedasticity) dictates that the series of Gaussian curves for all the different x values have equal variances.
n <- 100
set.seed(13)
x <- round(rnorm(n, .5, .25), 1)
y <- x + rnorm(n, 0, .1)
r <- c(-.2, 1.2)
Plot:
plot(x, y, axes=FALSE, xlim=r, ylim=r, xlab=expression(x), ylab=expression(y))
axis(1, at=r, labels=FALSE)
axis(2, at=r, labels=FALSE)
abline(a=0,b=1)
histSpike(y, side=2, add=TRUE)
abline(v=.6, lty=2)
Simple linear regression is used when we model a continuous outcome y as a linear function of a single predictor x.
Estimation of the confidence intervals (CI) for predictions depends on what you want to predict: an individual value or the population mean.
x1 <- c( 1, 3, 5, 6, 7, 9, 11)
y <- c( 5, 10, 70, 58, 85, 89, 135)
dd <- datadist(x1, n.unique=5); options(datadist='dd')
f <- ols(y ~ x1)
p1 <- Predict(f, x1=seq(1,11,length=100), conf.type='mean')
p2 <- Predict(f, x1=seq(1,11,length=100), conf.type='individual')
p <- rbind(Mean=p1, Individual=p2)
ggplot(p, legend.position='none') +
geom_point(aes(x1, y), data=data.frame(x1, y, .set.=''))
Example usages:
Is a child of age x smaller than predicted for her age? Use the individual level, p2 (wider bands)
What is the best estimate of the population mean blood pressure for patients on treatment A? Use the mean population level, p1 (narrower bands)
It is crucial to verify the assumptions underlying a linear regression model:
In a scatterplot the spread of y about the fitted line should be constant as x increases, and y vs. x should appear linear
It is easier to see this with a plot of residuals vs. fitted values
In this plot there should be no systematic patterns (no trend in central tendency, no change in spread of points with x)
Trend in central tendency indicates failure of linearity
A qqnorm plot of the residuals is a useful tool to check their normality
Here is an example: we fit a linear regression model where x and y should instead have been log-transformed:
n <- 50
set.seed(2)
res <- rnorm(n, sd=.25)
x <- runif(n)
y <- exp(log(x) + res)
f <- ols(y ~ x)
plot(fitted(f), resid(f))
This plot depicts non-constant variance of the residuals, which might call for transforming y.
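As a quick check (a sketch, not part of the original fits), refitting on the log scale, which matches how these data were generated, removes the pattern:
flog <- ols(log(y) ~ log(x))
plot(fitted(flog), resid(flog))   # residuals now look like white noise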
Now, we fit a linear model that should have been quadratic (functional form of X):
x <- runif(n, -1, 1)
y <- x ^ 2 + res
f <- ols(y ~ x)
plot(fitted(f), resid(f))
Finally, we fit a correct model:
y <- x + res
f <- ols(y ~ x)
plot(fitted(f), resid(f))
qqnorm(resid(f)); qqline(resid(f))
These plots show the ideal situation of white noise (no trend, constant variance). The qqnorm plot demonstrates approximate normality of the residuals for a sample of size n = 50.
The dataset concerns the relationship between hookworm and blood loss from a study conducted in 1970.
data(Suwit)
summary(Suwit)
## id worm bloss
## Min. : 1.0 Min. : 32.0 Min. : 5.03
## 1st Qu.: 4.5 1st Qu.: 167.0 1st Qu.:14.02
## Median : 8.0 Median : 525.0 Median :33.80
## Mean : 8.0 Mean : 552.4 Mean :33.45
## 3rd Qu.:11.5 3rd Qu.: 796.5 3rd Qu.:41.70
## Max. :15.0 Max. :1929.0 Max. :86.65
des(Suwit)
## HW and Blood loss SEAJTMH 1970;
## No. of observations = 15
## Variable Class Description
## 1 id numeric
## 2 worm numeric No. of worms
## 3 bloss numeric Blood loss/day
summ(Suwit)
## HW and Blood loss SEAJTMH 1970;
## No. of observations = 15
##
## Var. name obs. mean median s.d. min. max.
## 1 id 15 8 8 4.47 1 15
## 2 worm 15 552.4 525 513.9 32 1929
## 3 bloss 15 33.45 33.8 24.85 5.03 86.65
attach(Suwit)
The file is clean and ready for analysis (this happens here only for didactic purposes: in real life you will usually spend at least a couple of hours, if not days, cleaning a dataset). For example, with this small sample size it is straightforward to verify that there is no repetition of ‘id’ and no missing values. The records have been sorted in ascending order of ‘worm’ (number of worms), ranging from 32 in the first subject to 1929 in the last one. Blood loss (‘bloss’), however, is not sorted. The 13th record has the highest blood loss, 86.65 ml per day, which is very high. The objective of this analysis is to predict blood loss using worm.
First of all, take a look at the data:
plot(worm, bloss, xlab="No. of worms", ylab="ml. per day",
main = "Blood loss by number of hookworms in the bowel")
A linear model using the above two variables seems reasonable.
lm1 <- lm(bloss ~ worm, data=Suwit)
lm1
##
## Call:
## lm(formula = bloss ~ worm, data = Suwit)
##
## Coefficients:
## (Intercept) worm
## 10.84733 0.04092
Displaying the model by typing ‘lm1’ gives limited information (essentially, the estimated regression line coefficients). To get more information, one can look at the attributes of this model, its summary and attributes of its summary.
attr(lm1, "names")
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
summary(lm1)
##
## Call:
## lm(formula = bloss ~ worm, data = Suwit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.846 -10.812 0.750 4.356 34.390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.847327 5.308569 2.043 0.0618 .
## worm 0.040922 0.007147 5.725 6.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.74 on 13 degrees of freedom
## Multiple R-squared: 0.716, Adjusted R-squared: 0.6942
## F-statistic: 32.78 on 1 and 13 DF, p-value: 6.99e-05
The first section of the summary shows the formula that was ‘called’. The second section gives the distribution of the residuals. The pattern is clearly not symmetric: the maximum is too far to the right (34.39) compared to the minimum (-15.85), and the first quartile (-10.81) is further to the left of the median (0.75) than the third quartile (4.36) is to the right. Otherwise, the median is close to zero.
The third section gives the coefficients for the intercept and the effect of ‘worm’ on blood loss. The intercept is 10.8, meaning that when there are no worms the blood loss is estimated to be 10.8 ml per day; this is, however, not significantly different from zero, as the P value is 0.0618. The coefficient of ‘worm’ is 0.04, indicating that each worm is associated with an average increase of 0.04 ml of blood loss per day. Although the value is small, it is highly significantly different from zero, and when there are many worms the blood loss can be very substantial.
The multiple R-squared value of 0.716 indicates that 71.6% of the variation in the data is explained by the model; the adjusted value is 0.6942. (The calculation of R-squared is discussed in the analysis of variance section below.) The last section reports the residual standard error and the hypothesis test on the effect of ‘worm’ using the F-statistic. The P value from this section (0.0000699) is equal to that given by the t-test in the coefficients section. This F-test more commonly appears in the analysis of variance table.
summary(aov(lm1))
## Df Sum Sq Mean Sq F value Pr(>F)
## worm 1 6192 6192 32.78 6.99e-05 ***
## Residuals 13 2455 189
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The above analysis of variance (aov) table breaks down the degrees of freedom, sum of squares and mean square of the outcome (blood loss) by source (in this case there are only two: worm and residuals).
The so-called ‘square’ is actually the squared difference between the value and the mean. The total sum of squares of blood loss is therefore:
SST <- sum((bloss-mean(bloss))^2)
SST
## [1] 8647.044
The sum of squares from residuals is:
SSR <- sum(residuals(lm1)^2)
SSR
## [1] 2455.468
See also the analysis of variance table. The sum of squares of worm, i.e. the sum of squared differences between the fitted values and the grand mean, is:
SSW <- sum((fitted(lm1)-mean(bloss))^2)
SSW
## [1] 6191.577
The latter two sums add up to the first one. The R-squared is the ratio of the sum of squares of the fitted values to the total sum of squares.
SSW/SST
## [1] 0.7160339
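As a quick check that the decomposition mentioned above holds:
SSW + SSR   # approximately 8647.04, equal to SST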
Instead of sum of squares, one may consider the mean square as the level of variation. In such a case, the number of worms can reduce the total mean square (or variance) by: (total mean square - residual mean square) / total mean square, or (variance - residual mean square) / variance.
resid.msq <- sum(residuals(lm1)^2)/lm1$df.residual
Radj <- (var(bloss)- resid.msq)/var(bloss)
Radj
## [1] 0.6941903
This is the adjusted R-squared shown in summary(lm1) in the above section.
When the mean square of ‘worm’ is divided by the mean square of residuals, the result is:
F <- SSW/resid.msq; F
## [1] 32.78011
Using this F value with the two corresponding degrees of freedom (from ‘worm’ and residuals) the P value for testing the effect of ‘worm’ can be computed.
pf(F, df1=1, df2=13, lower.tail=FALSE)
## [1] 6.990401e-05
The function pf is used to compute a P value from a given F value together with the two degrees of freedom. The last argument ‘lower.tail’ is set to FALSE to obtain the area in the right tail of the F distribution. In summary, both the regression and the analysis of variance give the same conclusion: the number of worms has a significant linear relationship with blood loss.
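As an aside (a small check, not part of the original output): with a single predictor, the F statistic is simply the square of the t statistic for ‘worm’, so the two tests are equivalent.
t.worm <- 5.725   # t value for 'worm' from summary(lm1) above
t.worm^2          # approximately 32.78, the F statistic
pf(t.worm^2, df1=1, df2=13, lower.tail=FALSE)   # essentially the same P value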
A regression line can be added to the scatter plot with the following command:
plot(worm, bloss, xlab="No. of worms", ylab="ml. per day",
main = "Blood loss by number of hookworms in the bowel", type="n")
abline(lm1)
points(worm, fitted(lm1), pch=18, col="blue")
segments(worm, bloss, worm, fitted(lm1), col="red")
The regression line has an intercept of 10.8 and a slope of 0.04. The expected value is the value of blood loss estimated from the regression line at a specific value of ‘worm’. A residual is the difference between the observed and the expected value. The residuals can be drawn by adding the red segments.
The actual values of the residuals can be checked from the specific attribute of the defined linear model.
residuals(lm1) -> lm1.res
hist(lm1.res)
The histogram of the residuals is not entirely convincing about their normality. However, with such a small sample size it is difficult to draw any conclusion. A better way to check normality is to plot the standardized residuals, (residual - mean) / standard deviation, against the expected normal scores. A reasonably straight line would indicate normality. Moreover, a test for normality can be calculated.
a <- qqnorm(lm1.res)
shapiro.qqnorm(lm1.res, type="n")
text(a$x, a$y, labels=as.character(id))
If the residuals were perfectly normally distributed, the text symbols would fall along the straight dotted line. The graph suggests that the largest residual (13th) is too high (positive), whereas the smallest one (7th) is not negative enough. However, the P value from the Shapiro-Wilk test is 0.08, suggesting that the hypothesis of normally distributed residuals cannot be rejected.
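The same test can also be run directly in base R if one only needs the numeric result:
shapiro.test(lm1.res)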
Finally, the residuals are plotted against the fitted values to see if there is a pattern.
plot(fitted(lm1), lm1.res, xlab="Fitted values")
plot(fitted(lm1), lm1.res, xlab="Fitted values", type="n")
text(fitted(lm1), lm1.res, labels=as.character(id))
abline(h=0, col="blue")
There is no obvious pattern. The residuals are quite independent of the expected values. With this and the above findings from the qqnorm command we may conclude that the residuals are randomly and normally distributed.
The above two diagnostic plots for the model ‘lm1’ can also be obtained from:
par(mfrow=c(1,2))
plot(lm1, which=1:2)
detach(Suwit)
From the analysis, it is clear that blood loss is associated with number of hookworms. On average, each additional worm is associated with an increase of 0.04 ml of blood loss. The remaining uncertainty of blood loss, apart from hookworm, is explained by random variation or other factors that were not measured.
Datasets usually contain many variables collected during a study. It is often useful to see the relationship between two variables within the levels of a third, categorical variable, i.e. to check for the presence of an interaction. (This is just a didactic example; usually more than 3 variables are involved in MLR models.)
A small survey on blood pressure was carried out. The objective is to assess the hypertensive effect of subjects adding table salt to their meals. Gender of the subjects was also recorded.
data(BP)
attach(BP)
des(BP)
## cross-sectional survey on BP & risk factors
## No. of observations = 100
## Variable Class Description
## 1 id integer id
## 2 sex factor sex
## 3 sbp integer Systolic BP
## 4 dbp integer Diastolic BP
## 5 saltadd factor Salt added on table
## 6 birthdate Date
Note that the maximum systolic and diastolic blood pressures are quite high. There are 20 missing values in saltadd. The frequencies of the categorical variables sex and saltadd are now inspected.
describe(data.frame(sex, saltadd))
## vars n mean sd median trimmed mad min max range skew kurtosis se
## sex* 1 100 1.55 0.5 2 1.56 0 1 2 1 -0.20 -1.98 0.05
## saltadd* 2 80 1.54 0.5 2 1.55 0 1 2 1 -0.15 -2.00 0.06
The next step is to create a new age variable from birthdate. The calculation is based on 12th March 2001, the date of the survey (date of entry in the study).
age.in.days <- as.Date("2001-03-12") - birthdate
There is a leap year every four years; therefore, an average year has 365.25 days.
class(age.in.days)
## [1] "difftime"
age <- as.numeric(age.in.days)/365.25
The function as.numeric is needed to convert the difftime object (in days) into a plain number; otherwise modelling would not be possible.
describeBy(sbp,saltadd)
##
## Descriptive statistics by group
## group: no
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 37 137.46 29.62 132 136.39 23.72 80 201 121 0.44 -0.42 4.87
## ------------------------------------------------------------
## group: yes
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 43 163.02 39.39 171 163.94 45.96 80 224 144 -0.25 -1.13 6.01
The missing value group has the highest median and average systolic blood pressure. In order to create a new variable with three levels type:
saltadd1 <- saltadd
levels(saltadd1) <- c("no", "yes", "missing")
saltadd1[is.na(saltadd)] <- "missing"
summary(saltadd1)
## no yes missing
## 37 43 20
summary(aov(age ~ saltadd1))
## Df Sum Sq Mean Sq F value Pr(>F)
## saltadd1 2 115 57.42 0.448 0.64
## Residuals 97 12422 128.06
Since there is not enough evidence that the missing group is important, and for reasons of simplicity, we will assume MCAR (missing completely at random), ignore this group, and continue the analysis with the original saltadd variable consisting of only two levels. Before doing this, however, a simple regression model and regression line are first fitted.
lm1 <- lm(sbp ~ age)
summary(lm1)
##
## Call:
## lm(formula = sbp ~ age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.231 -21.801 -3.679 15.127 105.395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.1465 14.8942 4.374 3.05e-05 ***
## age 1.8422 0.2997 6.147 1.71e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.56 on 98 degrees of freedom
## Multiple R-squared: 0.2782, Adjusted R-squared: 0.2709
## F-statistic: 37.78 on 1 and 98 DF, p-value: 1.712e-08
Although the R-squared is not very high, the p-value is small, indicating an important influence of age on systolic blood pressure. A scatterplot of age against systolic blood pressure is now shown with the regression line added using the abline function. This function can accept many different argument forms, including a regression object: if this object has a coef method and it returns a vector of length 1, the value is taken to be the slope of a line through the origin; otherwise the first two values are taken to be the intercept and slope, as is the case for lm1.
plot(age, sbp, main = "Systolic BP by age", xlab = "Years", ylab = "mm.Hg")
abline(lm1)
Subsequent exploration of the residuals suggests no significant deviation from normality and no pattern; the details can be adapted from the techniques already discussed and are omitted here. The next step is to adjust for salt habit and to provide different plot symbols for the different groups. Note that here the sample size is lower (we omit the missing values of saltadd, reducing the analysis to the complete cases):
lm2 <- lm(sbp ~ age + saltadd)
summary(lm2)
##
## Call:
## lm(formula = sbp ~ age + saltadd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.369 -24.972 -0.125 22.456 64.805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.1291 15.7645 4.005 0.000142 ***
## age 1.5526 0.3118 4.979 3.81e-06 ***
## saltaddyes 22.9094 6.9340 3.304 0.001448 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.83 on 77 degrees of freedom
## (20 observations deleted due to missingness)
## Multiple R-squared: 0.3331, Adjusted R-squared: 0.3158
## F-statistic: 19.23 on 2 and 77 DF, p-value: 1.679e-07
On average, a one-year increment in age is associated with an increase in systolic blood pressure of about 1.5 mmHg. Adding table salt is associated with a significantly higher systolic blood pressure, by approximately 23 mmHg.
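If desired, confidence intervals for these coefficients can be obtained with the base R confint function:
confint(lm2)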
plot(age, sbp, main="Systolic BP by age", xlab="Years", ylab="mm.Hg", type="n")
points(age[saltadd=="no"], sbp[saltadd=="no"], col="blue")
points(age[saltadd=="yes"], sbp[saltadd=="yes"], col="red",pch = 18)
Note that the red dots corresponding to those who added table salt are higher than the blue circles.
The final task is to draw two separate regression lines, one for each group. The intercept for non-salt users is the first coefficient, and for salt users it is the first plus the third; the slope is the same for both groups.
a0 <- coef(lm2)[1]
a1 <- coef(lm2)[1] + coef(lm2)[3]
b <- coef(lm2)[2]
plot(age, sbp, main="Systolic BP by age", xlab="Years", ylab="mm.Hg", type="n")
points(age[saltadd=="no"], sbp[saltadd=="no"], col="blue")
points(age[saltadd=="yes"], sbp[saltadd=="yes"], col="red",pch = 18)
abline(a = a0, b, col = "blue")
abline(a = a1, b, col = "red")
Note that the X-axis does not start at zero, so the intercepts are outside the plot frame. The red line is for the red points of salt adders and the blue line is for the blue points of non-adders. In this model, age is assumed to have a constant effect on systolic blood pressure independently of added salt.
But look at the distributions of the points of the two colours: the red points are higher than the blue ones, but mainly in the right half of the graph. To fit lines with different slopes (i.e. different ‘b’ values for the abline arguments), the next step is to estimate a new model with an interaction term between saltadd and age.
lm3 <- lm(sbp ~ age * saltadd)
summary(lm3)
##
## Call:
## lm(formula = sbp ~ age * saltadd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.885 -24.891 -1.142 21.228 66.867
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.0066 20.3981 3.824 0.000267 ***
## age 1.2419 0.4128 3.009 0.003558 **
## saltaddyes -12.2540 31.4574 -0.390 0.697965
## age:saltaddyes 0.7199 0.6282 1.146 0.255441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.77 on 76 degrees of freedom
## (20 observations deleted due to missingness)
## Multiple R-squared: 0.3445, Adjusted R-squared: 0.3186
## F-statistic: 13.31 on 3 and 76 DF, p-value: 4.528e-07
For the intercept of the salt users, the second and fourth terms are zero (since age is zero), but the third must be added. This term is negative, so the intercept of the salt users is lower than that of the non-users.
a0 <- coef(lm3)[1]
a1 <- coef(lm3)[1] + coef(lm3)[3]
For the slope of the non-salt users, the second coefficient alone is enough, since the first and third terms do not change with age and the fourth term is multiplied by saltaddyes = 0. The slope for the salt users includes the second and fourth coefficients, since saltaddyes = 1.
b0 <- coef(lm3)[2]
b1 <- coef(lm3)[2] + coef(lm3)[4]
plot(age, sbp, main="Systolic BP by age", xlab="Years",ylab="mm.Hg", pch=18, col=as.numeric(saltadd))
abline(a = a0, b = b0, col = 1)
abline(a = a1, b = b1, col = 2)
legend("topleft", legend = c("Salt added", "No salt added"),lty=1, col=c("red","black"))
Note that as.numeric(saltadd) converts the factor levels into the integers 1 (black) and 2 (red), representing the non-salt adders and the salt adders, respectively. These colour codes come from the R colour palette.
This model suggests that at young ages the systolic blood pressures of the two groups are not very different, as the two lines are close together on the left of the plot. For example, at the age of 25 the difference is about 5.7 mmHg. Increasing age increases the difference between the two groups: at 70 years of age the difference is as great as 38 mmHg. In this sense, age modifies the effect of adding table salt on blood pressure.
On the other hand, the slope of age is 1.24 mmHg per year among those who did not add salt, but becomes 1.24 + 0.72 = 1.96 mmHg per year among the salt adders. Thus, adding salt modifies the effect of age. Note that interaction is a statistical term, whereas effect modification is the equivalent epidemiological term.
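As a quick check of the numbers quoted above (using the coefficient names as printed in summary(lm3)):
d25 <- coef(lm3)["saltaddyes"] + coef(lm3)["age:saltaddyes"] * 25   # about 5.7 mmHg
d70 <- coef(lm3)["saltaddyes"] + coef(lm3)["age:saltaddyes"] * 70   # about 38 mmHg
b.salt <- coef(lm3)["age"] + coef(lm3)["age:saltaddyes"]            # about 1.96 mmHg per year
c(d25, d70, b.salt)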
Note also that the coefficient of the interaction term age:saltaddyes is not statistically significant!
This means that the two slopes may differ just by chance (or that we are underpowered to detect the interaction).
This was in fact just a didactic example to show how to introduce an interaction term in a regression model.
Conceptually, in prediction we compare outcomes across different combinations of values of the input variables. In causal inference, we ask what would happen to an outcome y as a result of a treatment or intervention. Predictive inference relates to comparisons between units (different groups/subjects); causal inference addresses comparisons of different treatments applied to the same unit.
FEV, an acronym for Forced Expiratory Volume, is a measure of how much air a person can exhale (in litres) during a forced breath. In this dataset, the FEV of 606 children between the ages of 6 and 17 was measured. The dataset also provides additional information on these children: their age, their height, their gender and, most importantly, whether the child is a smoker or a non-smoker (the exposure of interest). This is an observational cross-sectional study.
The goal of this study was to find out whether or not smoking has an effect on the FEV of children.
Load the required libraries
library(tidyverse)
library(DataExplorer)
library(SmartEDA)
library(ggplot2)
library(ggstatsplot)
Import the data (the file path is built with the here package):
fev <- read_tsv(here("datasets","fev.txt"))
head(fev)
## # A tibble: 6 × 5
## age fev height gender smoking
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 9 1.71 57 f 0
## 2 8 1.72 67.5 f 0
## 3 7 1.72 54.5 f 0
## 4 9 1.56 53 m 0
## 5 9 1.90 57 m 0
## 6 8 2.34 61 f 0
There are a few things in the formatting of the data that can be improved upon:
gender and smoking can be transformed to factors.
The height variable is recorded in inches, which are hard to interpret. Let's add a new column, height_cm, with the values converted to centimetres using the mutate function. (We will not use this variable in this example, however.)
fev <- fev %>%
mutate(gender = as.factor(gender)) %>%
mutate(smoking = as.factor(smoking)) %>%
mutate(height_cm = height*2.54)
head(fev)
## # A tibble: 6 × 6
## age fev height gender smoking height_cm
## <dbl> <dbl> <dbl> <fct> <fct> <dbl>
## 1 9 1.71 57 f 0 145.
## 2 8 1.72 67.5 f 0 171.
## 3 7 1.72 54.5 f 0 138.
## 4 9 1.56 53 m 0 135.
## 5 9 1.90 57 m 0 145.
## 6 8 2.34 61 f 0 155.
Now, let’s make a first explorative boxplot, showing only the FEV for both smoking categories.
fev %>%
ggplot(aes(x=smoking,y=fev,fill=smoking)) +
scale_fill_manual(values=c("dimgrey","firebrick")) +
theme_bw() +
geom_boxplot(outlier.shape=NA) +
geom_jitter(width = 0.2, size=0.1) +
ggtitle("Boxplot of FEV versus smoking") +
ylab("fev (l)") +
xlab("smoking status")
Did you expect these results??
It appears that children who smoke have a higher median FEV than children who do not smoke. Should we change legislation worldwide and make smoking compulsory for children??
Maybe there is something else going on in the data. Now, we will generate a similar plot, but we will stratify the data based on age (age as factor).
fev %>%
ggplot(aes(x=as.factor(age),y=fev,fill=smoking)) +
geom_boxplot(outlier.shape=NA) +
geom_point(size = 0.1, position = position_jitterdodge(jitter.width = 0.2)) +
theme_bw() +
scale_fill_manual(values=c("dimgrey","firebrick")) +
ggtitle("Boxplot of FEV versus smoking, stratified on age") +
ylab("fev (l)") +
xlab("Age (years)")
This plot seems to already give us a more plausible picture. First, it seems that we do not have any smoking children of ages 6, 7 or 8. Second, when looking at the results per age “category”, it seems no longer the case that smokers have a much higher FEV than non-smokers; for the higher ages, the contrary seems true.
This shows that taking confounders (age, in this case) into account is crucial! If we simply analyse the dataset based on the smoking status and FEV values only, our inference might be incorrect:
fit1 <- lm(fev~smoking, data=fev) # wrong beta0star, beta1star
fit2 <- lm(fev~age+smoking, data=fev) # "true" beta0, beta2, beta1
fitage <- lm(age~smoking, data=fev) # gamma0, gamma1
fit1
##
## Call:
## lm(formula = fev ~ smoking, data = fev)
##
## Coefficients:
## (Intercept) smoking1
## 2.6346 0.6054
fit2
##
## Call:
## lm(formula = fev ~ age + smoking, data = fev)
##
## Coefficients:
## (Intercept) age smoking1
## 0.2040 0.2479 -0.2358
fitage
##
## Call:
## lm(formula = age ~ smoking, data = fev)
##
## Coefficients:
## (Intercept) smoking1
## 9.804 3.393
beta0 <- coef(fit2)[1]
beta2 <- coef(fit2)[2]
beta1 <- coef(fit2)[3]
gamma0 <- coef(fitage)[1]
gamma1 <- coef(fitage)[2]
beta0star <- coef(fit1)[1]
beta1star <- coef(fit1)[2]
Check that beta0star = beta0 + beta2*gamma0:
beta0 + beta2*gamma0
## (Intercept)
## 2.634628
beta0star
## (Intercept)
## 2.634628
and also check that the (biased) beta1star = beta1 + beta2*gamma1:
beta1 + beta2*gamma1
## smoking1
## 0.6054053
beta1star
## smoking1
## 0.6054053
Therefore, an apparently beneficial smoking effect is estimated when age is ignored. If the causal inference assumptions hold (see slides), we can interpret beta1 as the average smoking effect in the population under study.
Imbalance and lack of complete overlap can make causal inference difficult. Recall: imbalance occurs when treatment groups differ with respect to an important covariate; lack of complete overlap occurs when some combination of treatment level and covariate level has no observations at all (a violation of the positivity assumption).
To explain fev, sex seems to matter, especially among older individuals:
plot(fev~age, col=gender, data=fev)
legend("topleft", pch=1, col=1:2, levels(fev$gender))
Another way to plot the same thing is:
fev %>%
ggplot(aes(x=as.factor(age),y=fev,fill=smoking)) +
geom_boxplot(outlier.shape=NA) +
geom_point(size = 0.1, position = position_jitterdodge(jitter.width = 0.2)) +
theme_bw() +
scale_fill_manual(values=c("dimgrey","firebrick")) +
ggtitle("Boxplot of FEV versus smoking, stratified on age and gender") +
ylab("fev (l)") +
xlab("smoking status") +
facet_grid(rows = vars(gender))
Especially at higher ages, the median FEV is higher for males than for females. [This could suggest an interaction between gender and age, which could be explored in the regression model, even if the interpretation of such an interaction could be quite tricky (many levels of age!); see the sketch below.]
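A minimal sketch of how such an interaction could be explored (this is not part of the main analysis; age is kept numeric here to avoid the many-levels problem mentioned above):
fit.int <- lm(fev ~ age * gender + smoking, data = fev)
summary(fit.int)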
Moreover, there is a slight gender imbalance among age categories:
counts <- table(fev$gender, as.factor(fev$age))
percentages <- round(prop.table(table(fev$gender, as.factor(fev$age)),2),digits=3)
barplot(percentages, main="Gender distribution",
xlab="ages", col=c("pink", "darkblue"),
legend = rownames(counts))
For imbalanced samples, simple comparisons of sample means between groups are not good estimates of treatment/risk factor effects. Model adjustment, where we add the covariate to the model, is of course one way to better estimate a treatment effect. In this case, for example, we can also add gender to the regression model:
fit3 <- lm(fev~age+smoking+gender, data=fev)
summary(fit3)
##
## Call:
## lm(formula = fev ~ age + smoking + gender, data = fev)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.39229 -0.35677 -0.03623 0.33644 1.91571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.088007 0.097635 0.901 0.3677
## age 0.243707 0.009523 25.591 < 2e-16 ***
## smoking1 -0.180489 0.080950 -2.230 0.0261 *
## genderm 0.295794 0.044683 6.620 7.97e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5465 on 602 degrees of freedom
## Multiple R-squared: 0.5681, Adjusted R-squared: 0.566
## F-statistic: 264 on 3 and 602 DF, p-value: < 2.2e-16
So now the effect of smoking is estimated conditional on both age and gender and, as expected, it is negative on FEV. One possible problem with these estimates, however, is that for some combinations of age, gender and smoking we do not have any observed data, so the extrapolation made by the regression model may not be very reliable. Also in this case (no interaction), if the causal assumptions hold we can interpret the effect of smoking as a marginal effect as well.
index <- fev$smoking==0
counts.noS <- table(fev[index,]$gender, as.factor(fev[index,]$age))
percentages.noS <- round(prop.table(table(fev[index,]$gender, as.factor(fev[index,]$age)),2),digits=3)
index1 <- fev$smoking==1
counts.S <- table(fev[index1,]$gender, as.factor(fev[index1,]$age))
percentages.S <- round(prop.table(table(fev[index1,]$gender, as.factor(fev[index1,]$age)),2),digits=3)
par(mfrow=c(1,2))
barplot(percentages.noS, main="Gender distribution among non-smokers",cex.main=0.8,
xlab="ages", col=c("pink", "darkblue"),
legend = rownames(percentages.noS))
barplot(percentages.S, main="Gender distribution among smokers",cex.main=0.8,
xlab="ages", col=c("pink", "darkblue"),
legend = rownames(percentages.S))
Observe that in fact there is also a lack of “smoking” children below age 9: lack of complete overlap is when there are no observations at all for some combination(s) of treatment levels / covariate levels.
With lack of complete overlap, no data are available for some comparisons, so making them requires extrapolation from a model. This is in fact the job that the regression model does, but extrapolation is always risky in regions where no data are available. This is an even more serious problem than imbalance.
Matching is a possible strategy in these situations to overcome (avoid) imbalance, even if some data will be discarded (see the propensity score methods examples).
First of all, we will load the required libraries:
library(twang)
library(magrittr)
library(tidyverse)
library(gtsummary)
library(stddiff)
library(ggplot2)
library(data.table)
library(boot)
library(splines)
library(PSAgraphics)
library(Matching)
library(sandwich)
library(survey)
library(rms)
We will use a dataset coming from an observational study of 996 patients receiving an initial Percutaneous Coronary Intervention (PCI) at Ohio Heart Health, Christ Hospital, Cincinnati in 1997 and followed for at least 6 months by the staff of the Lindner Center.
The patients thought to be more severely diseased were assigned to treatment with abciximab (an expensive, high-molecular-weight IIb/IIIa cascade blocker); in fact, only 298 (29.9 percent) of patients received usual-care-alone with their initial PCI.
Our research question aims at estimating the treatment effect of abciximab+PCI (abcix) vs standard care on the probability of being deceased at 6 months.
In this practical, we will apply the methods based on the propensity score.
Measured pre-treatment characteristics that could confound the treatment-outcome relationship are: acutemi, ejecfrac, ves1proc, stent, diabetic, female and height.
data("lindner",package="twang")
set.seed(123)
summary(lindner)
## lifepres cardbill abcix stent
## Min. : 0.0 Min. : 2216 Min. :0.0000 Min. :0.0000
## 1st Qu.:11.6 1st Qu.: 10219 1st Qu.:0.0000 1st Qu.:0.0000
## Median :11.6 Median : 12458 Median :1.0000 Median :1.0000
## Mean :11.3 Mean : 15674 Mean :0.7008 Mean :0.6687
## 3rd Qu.:11.6 3rd Qu.: 16660 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :11.6 Max. :178534 Max. :1.0000 Max. :1.0000
## height female diabetic acutemi
## Min. :108.0 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:165.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :173.0 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :171.4 Mean :0.3474 Mean :0.2239 Mean :0.1436
## 3rd Qu.:178.0 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :196.0 Max. :1.0000 Max. :1.0000 Max. :1.0000
## ejecfrac ves1proc sixMonthSurvive
## Min. : 0.00 Min. :0.000 Mode :logical
## 1st Qu.:45.00 1st Qu.:1.000 FALSE:26
## Median :55.00 Median :1.000 TRUE :970
## Mean :50.97 Mean :1.386
## 3rd Qu.:56.00 3rd Qu.:2.000
## Max. :90.00 Max. :5.000
How is the treatment variable distributed in the population?
# Exposure
lindner %>%
dplyr::select(abcix) %>%
tbl_summary()
Characteristic | N = 996¹ |
---|---|
abcix | 698 (70%) |
¹ n (%) |
ggplot(lindner)+
geom_bar(aes(x=abcix,fill=as.factor(abcix)),stat="count")+
scale_fill_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
theme_classic()
Is the outcome rare?
# Outcome
lindner$sixMonthDeath <- 1-lindner$sixMonthSurvive
lindner %>%
dplyr::select(sixMonthDeath) %>%
tbl_summary()
Characteristic | N = 996¹ |
---|---|
sixMonthDeath | 26 (2.6%) |
¹ n (%) |
What is the crude odds ratio for mortality?
# Crude OR
fit.crude <- glm(sixMonthDeath~abcix,family = binomial,data=lindner)
tbl_regression(fit.crude,exponentiate=T)
Characteristic | OR¹ | 95% CI¹ | p-value |
---|---|---|---|
abcix | 0.30 | 0.13, 0.66 | 0.003 |
¹ OR = Odds Ratio, CI = Confidence Interval |
Let’s now summarize the confounding variables by treatment group and by outcome, to have a general idea about the observed associations:
# Possible confounders
lindner %>%
dplyr::select(acutemi,
ejecfrac,
ves1proc,
stent,
diabetic,
female,
height) %>%
tbl_summary()
Characteristic | N = 996¹ |
---|---|
acutemi | 143 (14%) |
ejecfrac | 55 (45, 56) |
ves1proc | |
0 | 4 (0.4%) |
1 | 680 (68%) |
2 | 252 (25%) |
3 | 45 (4.5%) |
4 | 14 (1.4%) |
5 | 1 (0.1%) |
stent | 666 (67%) |
diabetic | 223 (22%) |
female | 346 (35%) |
height | 173 (165, 178) |
¹ n (%); Median (IQR) |
# Descriptive statistics of patients' characteristics by treatment group
lindner %>%
dplyr::select(acutemi,
ejecfrac,
ves1proc,
stent,
diabetic,
female,
height,
abcix) %>%
tbl_summary(by=abcix) %>%
add_overall() %>%
add_p()
Characteristic | Overall, N = 996¹ | 0, N = 298¹ | 1, N = 698¹ | p-value² |
---|---|---|---|---|
acutemi | 143 (14%) | 18 (6.0%) | 125 (18%) | <0.001 |
ejecfrac | 55 (45, 56) | 55 (50, 60) | 55 (45, 56) | 0.004 |
ves1proc | <0.001 | |||
0 | 4 (0.4%) | 1 (0.3%) | 3 (0.4%) | |
1 | 680 (68%) | 243 (82%) | 437 (63%) | |
2 | 252 (25%) | 47 (16%) | 205 (29%) | |
3 | 45 (4.5%) | 6 (2.0%) | 39 (5.6%) | |
4 | 14 (1.4%) | 1 (0.3%) | 13 (1.9%) | |
5 | 1 (0.1%) | 0 (0%) | 1 (0.1%) | |
stent | 666 (67%) | 174 (58%) | 492 (70%) | <0.001 |
diabetic | 223 (22%) | 80 (27%) | 143 (20%) | 0.027 |
female | 346 (35%) | 115 (39%) | 231 (33%) | 0.10 |
height | 173 (165, 178) | 173 (165, 178) | 173 (165, 180) | 0.8 |
¹ n (%); Median (IQR) |
² Pearson’s Chi-squared test; Wilcoxon rank sum test; Fisher’s exact test |
# Descriptive statistics of patients' characteristics by outcome
lindner %>%
dplyr::select(acutemi,
ejecfrac,
ves1proc,
stent,
diabetic,
female,
height,
sixMonthDeath) %>%
tbl_summary(by=sixMonthDeath) %>%
add_overall() %>%
add_p()
Characteristic | Overall, N = 996¹ | 0, N = 970¹ | 1, N = 26¹ | p-value² |
---|---|---|---|---|
acutemi | 143 (14%) | 137 (14%) | 6 (23%) | 0.2 |
ejecfrac | 55 (45, 56) | 55 (45, 56) | 50 (33, 54) | 0.002 |
ves1proc | 0.10 | |||
0 | 4 (0.4%) | 4 (0.4%) | 0 (0%) | |
1 | 680 (68%) | 661 (68%) | 19 (73%) | |
2 | 252 (25%) | 249 (26%) | 3 (12%) | |
3 | 45 (4.5%) | 41 (4.2%) | 4 (15%) | |
4 | 14 (1.4%) | 14 (1.4%) | 0 (0%) | |
5 | 1 (0.1%) | 1 (0.1%) | 0 (0%) | |
stent | 666 (67%) | 649 (67%) | 17 (65%) | 0.9 |
diabetic | 223 (22%) | 212 (22%) | 11 (42%) | 0.014 |
female | 346 (35%) | 335 (35%) | 11 (42%) | 0.4 |
height | 173 (165, 178) | 173 (165, 178) | 168 (163, 177) | 0.10 |
¹ n (%); Median (IQR) |
² Fisher’s exact test; Wilcoxon rank sum test; Pearson’s Chi-squared test |
We can now calculate the standardized difference, which can be used as a measure of balance between the treatment groups. It is a measure of the difference between groups that is independent of statistical testing (remember that p-values always depend on sample size!). It is very similar to the definition of effect size that we discussed in Block 2.
It can be defined for a continuous covariate as:
\(SD_{c}=\frac{\overline{x_{1}}-\overline{x_{0}}} {\sqrt{\frac{s_1^2+s_0^2}{2}}}\)
and for a dichotomous covariate as:
\(SD_{d}=\frac{\overline{p_{1}}-\overline{p_{0}}}{\sqrt{\frac{\overline{p_{1}}(1-\overline{p_{1}})+\overline{p_{0}}(1-\overline{p_{0}})}{2}}}\)
The rough interpretation is that imbalance is present if the standardized difference is greater than 0.1 or 0.2.
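As an illustration of the continuous-covariate formula above, the standardized difference for ejecfrac can also be computed by hand (a base R sketch; compare with the stddiff output below):
m1 <- mean(lindner$ejecfrac[lindner$abcix == 1]); s1 <- sd(lindner$ejecfrac[lindner$abcix == 1])
m0 <- mean(lindner$ejecfrac[lindner$abcix == 0]); s0 <- sd(lindner$ejecfrac[lindner$abcix == 0])
abs(m1 - m0) / sqrt((s1^2 + s0^2) / 2)   # about 0.18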
s1 <- stddiff.numeric(vcol="height",gcol="abcix",data=lindner)
s2 <- stddiff.numeric(vcol="ejecfrac",gcol="abcix",data=lindner)
s3 <- stddiff.numeric(vcol="ves1proc",gcol="abcix",data=lindner)
s4 <- stddiff.binary(vcol="stent",gcol="abcix",data=lindner)
s5 <- stddiff.binary(vcol="female",gcol="abcix",data=lindner)
s6 <- stddiff.binary(vcol="diabetic",gcol="abcix",data=lindner)
s7 <- stddiff.binary(vcol="acutemi",gcol="abcix",data=lindner)
cont.var <- as.data.frame(rbind(s1,s2,s3))
rownames(cont.var) <- c("height", "ejecfrac", "ves1proc")
cont.var
## mean.c sd.c mean.t sd.t missing.c missing.t stddiff stddiff.l
## height 171.446 10.589 171.443 10.695 0 0 0.000 -0.135
## ejecfrac 52.289 10.297 50.403 10.419 0 0 0.182 0.046
## ves1proc 1.205 0.480 1.463 0.706 0 0 0.427 0.290
## stddiff.u
## height 0.136
## ejecfrac 0.318
## ves1proc 0.564
bin.var <- as.data.frame(rbind(s4,s5,s6, s7))
rownames(bin.var) <- c("stent", "female", "diabetic", "acutemi")
bin.var
## p.c p.t missing.c missing.t stddiff stddiff.l stddiff.u
## stent 0.584 0.705 0 0 0.255 0.119 0.391
## female 0.386 0.331 0 0 0.115 -0.021 0.251
## diabetic 0.268 0.205 0 0 0.150 0.014 0.286
## acutemi 0.060 0.179 0 0 0.372 0.235 0.508
Now fit a propensity score model: a logistic regression model with abciximab+PCI (vs. PCI alone) as the outcome, and the confounders listed in the table above as covariates. We exclude the variable height from the list, since there was no relevant difference between the groups.
# Fit a propensity score model
fit.ps<- glm(abcix~
acutemi+
ejecfrac+
ves1proc+
stent+
diabetic+
female,
data=lindner,family = binomial)
summary(fit.ps)
##
## Call:
## glm(formula = abcix ~ acutemi + ejecfrac + ves1proc + stent +
## diabetic + female, family = binomial, data = lindner)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.272481 0.440711 0.618 0.536394
## acutemi 1.178158 0.269479 4.372 1.23e-05 ***
## ejecfrac -0.015076 0.007391 -2.040 0.041374 *
## ves1proc 0.757365 0.138279 5.477 4.32e-08 ***
## stent 0.571165 0.150208 3.803 0.000143 ***
## diabetic -0.409191 0.170371 -2.402 0.016316 *
## female -0.133346 0.151377 -0.881 0.378379
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1215.5 on 995 degrees of freedom
## Residual deviance: 1127.0 on 989 degrees of freedom
## AIC: 1141
##
## Number of Fisher Scoring iterations: 4
# Save the estimated propensity score
lindner$ps <- fitted(fit.ps)
# Plot estimated ps
ggplot(lindner) +
geom_boxplot(aes(y = ps,group = as.factor(abcix),col = as.factor(abcix))) +
scale_y_continuous("Estimated PS") +
scale_color_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
theme_classic()
ggplot(lindner) +
geom_histogram(aes(x = ps,group = as.factor(abcix),fill = as.factor(abcix))) +
scale_y_continuous("Estimated PS") +
facet_grid(cols=vars(abcix))+
scale_fill_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Assess whether there are non-overlapping scores (positivity violation) in the two exposure groups:
lindner %>%
dplyr::select(ps,abcix) %>%
tbl_summary(type = list(ps~"continuous2"),by=abcix,
statistic = all_continuous2() ~c(
"{median} ({p25}, {p75})",
"{min}, {max}"))
Characteristic | 0, N = 298 | 1, N = 698 |
---|---|---|
ps | ||
Median (IQR) | 0.65 (0.55, 0.70) | 0.71 (0.65, 0.82) |
Range | 0.25, 0.96 | 0.28, 0.98 |
Investigate overlap:
lindner %<>% mutate(overlap=ifelse(ps>=min(ps[abcix==1]) &
ps<=max(ps[abcix==0]),1,0))
# non-overlap: treated subjects with a higher ps than any control, and
# controls with a lower ps than any treated subject
with(lindner,table(overlap,abcix))
## abcix
## overlap 0 1
## 0 1 8
## 1 297 690
with(lindner, prop.table(table(overlap, abcix)))
## abcix
## overlap 0 1
## 0 0.001004016 0.008032129
## 1 0.298192771 0.692771084
In the subsequent steps, we remove subjects that do not overlap. This step reduces the original sample size, but we must respect the positivity assumption in order to estimate a reasonable causal effect: it makes no sense to include subjects that have a near-zero probability of receiving the treatment, or of having a “match” in the subsequent analyses.
We can use the estimated PS as a covariate in a logistic regression model for the outcome:
# Model 1: Linear relationship between ps and outcome
fit.out<- glm(sixMonthDeath~abcix+ps,
data=lindner,
family = binomial,
subset=overlap==1)
# Model summary
summary(fit.out)
##
## Call:
## glm(formula = sixMonthDeath ~ abcix + ps, family = binomial,
## data = lindner, subset = overlap == 1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.211 1.165 -3.616 0.000300 ***
## abcix -1.445 0.439 -3.293 0.000992 ***
## ps 1.947 1.698 1.147 0.251522
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 233.15 on 986 degrees of freedom
## Residual deviance: 222.01 on 984 degrees of freedom
## AIC: 228.01
##
## Number of Fisher Scoring iterations: 7
The second step is to save the predicted probabilities for the treated and the untreated and estimate the causal effects of interest in the population:
# fitted values (probabilities)
lindner$predY0<-fit.out$family$linkinv(coef(fit.out)[1]+coef(fit.out)[3]*lindner$ps) # PCI subjects
lindner$predY1<-fit.out$family$linkinv(coef(fit.out)[1]+coef(fit.out)[2]+coef(fit.out)[3]*lindner$ps) # PCI+abciximab subjects
# ATE effect
Y1<-mean(lindner$predY1)
Y0<-mean(lindner$predY0)
# ATT effect
Y1_1<-mean(lindner$predY1[lindner$abcix==1])
Y0_1<-mean(lindner$predY0[lindner$abcix==1])
# ATE effect
Y1-Y0
## [1] -0.04246731
# ATT effect
Y1_1-Y0_1
## [1] -0.04436082
# Estimate odds ratios related to the "ATE" and the "ATT"
(Y1/(1-Y1))/(Y0/(1-Y0))
## [1] 0.2363187
(Y1_1/(1-Y1_1))/(Y0_1/(1-Y0_1))
## [1] 0.236305
The ATE is quite similar to the ATT, and both indicate a protective effect of PCI+abciximab vs PCI alone.
To obtain the corresponding confidence intervals we can use a bootstrap approach. We do not outline this procedure here; see the supplementary code at the end of this practical.
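As an indication only, a minimal nonparametric bootstrap for the ATE risk difference could look like the sketch below (ate.boot is a hypothetical helper, not the supplementary code referred to above; for simplicity the overlap restriction is not re-derived in each resample):
ate.boot <- function(data, indices) {
  d <- data[indices, ]
  # re-estimate the propensity score in each bootstrap sample
  d$ps.b <- fitted(glm(abcix ~ acutemi + ejecfrac + ves1proc + stent + diabetic + female,
                       data = d, family = binomial))
  out.b <- glm(sixMonthDeath ~ abcix + ps.b, data = d, family = binomial)
  y1 <- mean(out.b$family$linkinv(coef(out.b)[1] + coef(out.b)[2] + coef(out.b)[3] * d$ps.b))
  y0 <- mean(out.b$family$linkinv(coef(out.b)[1] + coef(out.b)[3] * d$ps.b))
  y1 - y0
}
set.seed(123)
boot.out <- boot(lindner, ate.boot, R = 500)
boot.ci(boot.out, type = "perc")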
This model relies on two additional assumptions: no interaction between the propensity score and the treatment, and a linear relationship (on the logit scale) between the propensity score and the outcome.
Do these assumptions appear reasonable here?
We can try to fit different models, and then compare the AIC:
# Model 2: Non-linear relationship between ps and outcome
fit.out2<- glm(sixMonthDeath~abcix+ps+I(ps^2),
data=lindner,
family = binomial,
subset=overlap==1)
summary(fit.out2)
##
## Call:
## glm(formula = sixMonthDeath ~ abcix + ps + I(ps^2), family = binomial,
## data = lindner, subset = overlap == 1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.3787 3.8937 0.611 0.541255
## abcix -1.5088 0.4493 -3.358 0.000784 ***
## ps -17.7559 11.5272 -1.540 0.123475
## I(ps^2) 14.1968 8.3131 1.708 0.087681 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 233.15 on 986 degrees of freedom
## Residual deviance: 219.46 on 983 degrees of freedom
## AIC: 227.46
##
## Number of Fisher Scoring iterations: 7
# Model 3: Non-linear relationship between ps and outcome and interaction between ps and treatment
fit.out3<- glm(sixMonthDeath~abcix*ps+I(ps^2),
data=lindner,
family = binomial,
subset=overlap==1)
summary(fit.out3)
##
## Call:
## glm(formula = sixMonthDeath ~ abcix * ps + I(ps^2), family = binomial,
## data = lindner, subset = overlap == 1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.361196 3.786408 0.624 0.5329
## abcix 0.009567 2.075020 0.005 0.9963
## ps -18.641792 11.183815 -1.667 0.0955 .
## I(ps^2) 15.513326 8.167099 1.899 0.0575 .
## abcix:ps -2.130967 2.862505 -0.744 0.4566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 233.15 on 986 degrees of freedom
## Residual deviance: 218.93 on 982 degrees of freedom
## AIC: 228.93
##
## Number of Fisher Scoring iterations: 7
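The AICs reported in the three summaries can also be extracted and compared directly:
AIC(fit.out, fit.out2, fit.out3)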
It seems that the relationship could be partially non-linear, but neither the quadratic term nor the interaction reaches strong statistical significance. So probably the most parsimonious model to keep is Model 1.
Create propensity score strata: this could be an iterative process, since we should verify that we have enough subjects/events in each stratum.
lindner %<>% mutate(strata=cut(ps,quantile(ps,c(0,0.25,0.5,0.75,1)),include.lowest=T,labels=c(1:4)))
#Check they have been created correctly
summary(lindner$strata)
## 1 2 3 4
## 252 246 257 241
tapply(lindner$ps,lindner$strata,summary)
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2499 0.5168 0.5463 0.5321 0.5713 0.6051
##
## $`2`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6078 0.6505 0.6674 0.6605 0.6807 0.6846
##
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6877 0.7036 0.7307 0.7406 0.7728 0.8106
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.8106 0.8305 0.8706 0.8759 0.9102 0.9765
#Look at numbers of events and patients in each strata/exposure group
table(lindner$sixMonthDeath,lindner$strata,lindner$abcix)
## , , = 0
##
##
## 1 2 3 4
## 0 105 89 64 25
## 1 7 1 4 3
##
## , , = 1
##
##
## 1 2 3 4
## 0 138 156 185 208
## 1 2 0 4 5
These strata seem quite sparse in terms of the number of events. Another possibility is:
#Create propensity score strata
lindner %<>% mutate(strata=cut(ps,quantile(ps,c(0,0.33,0.66,1)),include.lowest=T,labels=c(1:3)))
#Check they have been created correctly
summary(lindner$strata)
## 1 2 3
## 356 302 338
#Look at numbers of events and patients in each strata/exposure group
table(lindner$sixMonthDeath,lindner$strata,lindner$abcix)
## , , = 0
##
##
## 1 2 3
## 0 146 96 41
## 1 7 4 4
##
## , , = 1
##
##
## 1 2 3
## 0 201 201 285
## 1 2 1 8
Recall: we should also check the balance of the confounders in each stratum! See the supplementary material for that. For now, let's just estimate the OR in each stratum:
beta.treat<-numeric(3)
nstrata<-table(lindner$strata)
treated.strata<-table(lindner$strata,lindner$abcix)[,2]
for (i in 1:3){
ms<-glm(sixMonthDeath~abcix,data=lindner,subset = strata==i,family="binomial")
beta.treat[i]<-coef(ms)[2]
print(summary(ms))
}
##
## Call:
## glm(formula = sixMonthDeath ~ abcix, family = "binomial", data = lindner,
## subset = strata == i)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.0377 0.3869 -7.851 4.13e-15 ***
## abcix -1.5725 0.8091 -1.943 0.052 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 83.969 on 355 degrees of freedom
## Residual deviance: 79.319 on 354 degrees of freedom
## AIC: 83.319
##
## Number of Fisher Scoring iterations: 7
##
##
## Call:
## glm(formula = sixMonthDeath ~ abcix, family = "binomial", data = lindner,
## subset = strata == i)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.1781 0.5103 -6.228 4.73e-10 ***
## abcix -2.1253 1.1249 -1.889 0.0589 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 50.927 on 301 degrees of freedom
## Residual deviance: 46.200 on 300 degrees of freedom
## AIC: 50.2
##
## Number of Fisher Scoring iterations: 8
##
##
## Call:
## glm(formula = sixMonthDeath ~ abcix, family = "binomial", data = lindner,
## subset = strata == i)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3273 0.5238 -4.443 8.88e-06 ***
## abcix -1.2458 0.6347 -1.963 0.0497 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 103.68 on 337 degrees of freedom
## Residual deviance: 100.39 on 336 degrees of freedom
## AIC: 104.39
##
## Number of Fisher Scoring iterations: 6
And, finally, let's estimate the ORs related to the ATE and the ATT by taking weighted averages of the stratum-specific log-ORs (weighted by stratum size and by the number of treated subjects per stratum, respectively) and exponentiating:
exp(sum(beta.treat*nstrata)/nrow(lindner))
## [1] 0.1960846
exp(sum(beta.treat*treated.strata)/sum(treated.strata))
## [1] 0.2028472
Also here, we should use a bootstrap approach to estimate the corresponding confidence intervals.
Here we create a reduced dataset retaining only the patients within the overlap region:
lindner.overlap <- lindner %>% filter(overlap==1)
Now we proceed with the matching algorithm: there are plenty of different matching algorithms in R; here we use the Match function from the Matching package.
library(Matching)
match <- Match(Y=lindner.overlap$sixMonthDeath,
Tr=lindner.overlap$abcix,
X=lindner.overlap$ps,
caliper=0.2, # matches farther apart than 0.2 standard deviations of the ps are dropped
M=1,         # 1:1 matching
ties=FALSE,
replace=TRUE # matching with replacement
)
# Number of pairs
nn <- length(match$index.treated)
# Create matched dataset
lindnerMatched <- cbind(rbind(lindner.overlap[match$index.treated,],
lindner.overlap[match$index.control,]),
pair=c(1:nn,1:nn))
table(lindner.overlap$abcix)
##
## 0 1
## 297 690
#Check number of treated patients
table(lindnerMatched$abcix)
##
## 0 1
## 689 689
#Look at people being used multiple times in the matched sample
summary(as.factor(table(match$index.treated)))
## 1
## 689
summary(as.factor(table(match$index.control)))
## 1 2 3 4 5 6 7 8 9 10 12 13 15 17 19
## 69 45 30 20 11 9 9 5 2 2 1 1 2 1 2
#Look at the propensity score distribution in the matched dataset
ggplot(lindnerMatched) +
geom_boxplot(aes(y = ps,group = as.factor(abcix),col = as.factor(abcix))) +
scale_y_continuous("Estimated PS") +
scale_color_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
theme_classic()
ggplot(lindnerMatched) +
geom_histogram(aes(x = ps,group = as.factor(abcix),fill = as.factor(abcix))) +
scale_y_continuous("Estimated PS") +
facet_grid(cols=vars(abcix))+
scale_fill_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Let’s check now the balance:
# Balance Diagnostics before and after matching
bal <- MatchBalance(abcix~
stent+
female+
diabetic+
acutemi+
ejecfrac+
ves1proc,
data=lindner.overlap,
match.out = match)
##
## ***** (V1) stent *****
## Before Matching After Matching
## mean treatment........ 0.70435 0.70537
## mean control.......... 0.58586 0.65602
## std mean diff......... 25.947 10.817
##
## mean raw eQQ diff..... 0.11785 0.049347
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.059245 0.024673
## med eCDF diff........ 0.059245 0.024673
## max eCDF diff........ 0.11849 0.049347
##
## var ratio (Tr/Co)..... 0.85663 0.92097
## T-test p-value........ 0.0004398 0.0057027
##
##
## ***** (V2) female *****
## Before Matching After Matching
## mean treatment........ 0.33333 0.33382
## mean control.......... 0.38384 0.30479
## std mean diff......... -10.706 6.151
##
## mean raw eQQ diff..... 0.050505 0.029028
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.025253 0.014514
## med eCDF diff........ 0.025253 0.014514
## max eCDF diff........ 0.050505 0.029028
##
## var ratio (Tr/Co)..... 0.9378 1.0495
## T-test p-value........ 0.13211 0.09767
##
##
## ***** (V3) diabetic *****
## Before Matching After Matching
## mean treatment........ 0.20725 0.2061
## mean control.......... 0.26599 0.21771
## std mean diff......... -14.483 -2.8684
##
## mean raw eQQ diff..... 0.057239 0.011611
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.029373 0.0058055
## med eCDF diff........ 0.029373 0.0058055
## max eCDF diff........ 0.058747 0.011611
##
## var ratio (Tr/Co)..... 0.83988 0.96072
## T-test p-value........ 0.050489 0.47958
##
##
## ***** (V4) acutemi *****
## Before Matching After Matching
## mean treatment........ 0.17101 0.17126
## mean control.......... 0.060606 0.16255
## std mean diff......... 29.302 2.3098
##
## mean raw eQQ diff..... 0.11111 0.0087083
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.055204 0.0043541
## med eCDF diff........ 0.055204 0.0043541
## max eCDF diff........ 0.11041 0.0087083
##
## var ratio (Tr/Co)..... 2.4853 1.0426
## T-test p-value........ 4.1797e-08 0.5361
##
##
## ***** (V5) ejecfrac *****
## Before Matching After Matching
## mean treatment........ 50.559 50.553
## mean control.......... 52.279 51.209
## std mean diff......... -16.899 -6.4414
##
## mean raw eQQ diff..... 1.8687 0.80406
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 20 15
##
## mean eCDF diff........ 0.033253 0.013763
## med eCDF diff........ 0.0087396 0.0043541
## max eCDF diff........ 0.10771 0.046444
##
## var ratio (Tr/Co)..... 0.97403 1.0808
## T-test p-value........ 0.016162 0.13002
## KS Bootstrap p-value.. 0.002 0.216
## KS Naive p-value...... 0.016165 0.44722
## KS Statistic.......... 0.10771 0.046444
##
##
## ***** (V6) ves1proc *****
## Before Matching After Matching
## mean treatment........ 1.4435 1.4456
## mean control.......... 1.2088 1.5109
## std mean diff......... 34.534 -9.6339
##
## mean raw eQQ diff..... 0.24242 0.065312
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.048684 0.013062
## med eCDF diff........ 0.014024 0.0043541
## max eCDF diff........ 0.1805 0.049347
##
## var ratio (Tr/Co)..... 2.0392 0.93506
## T-test p-value........ 9.0068e-10 0.0050611
## KS Bootstrap p-value.. < 2.22e-16 0.062
## KS Naive p-value...... 2.6627e-06 0.37114
## KS Statistic.......... 0.1805 0.049347
##
##
## Before Matching Minimum p.value: < 2.22e-16
## Variable Name(s): ves1proc Number(s): 6
##
## After Matching Minimum p.value: 0.0050611
## Variable Name(s): ves1proc Number(s): 6
lindnerMatched %>%
dplyr::select(stent,female,diabetic,acutemi,ejecfrac,ves1proc,abcix) %>%
tbl_summary(by=abcix) %>%
add_overall() %>%
add_p()
| Characteristic | Overall, N = 1,378¹ | 0, N = 689¹ | 1, N = 689¹ | p-value² |
|---|---|---|---|---|
| stent | 938 (68%) | 452 (66%) | 486 (71%) | 0.049 |
| female | 440 (32%) | 210 (30%) | 230 (33%) | 0.2 |
| diabetic | 292 (21%) | 150 (22%) | 142 (21%) | 0.6 |
| acutemi | 230 (17%) | 112 (16%) | 118 (17%) | 0.7 |
| ejecfrac | 55 (45, 56) | 55 (45, 60) | 55 (45, 56) | 0.3 |
| ves1proc |  |  |  | 0.3 |
| 0 | 2 (0.1%) | 0 (0%) | 2 (0.3%) |  |
| 1 | 842 (61%) | 405 (59%) | 437 (63%) |  |
| 2 | 434 (31%) | 231 (34%) | 203 (29%) |  |
| 3 | 73 (5.3%) | 38 (5.5%) | 35 (5.1%) |  |
| 4 | 27 (2.0%) | 15 (2.2%) | 12 (1.7%) |  |

¹ n (%); Median (IQR)
² Pearson’s Chi-squared test; Wilcoxon rank sum test; Fisher’s exact test
Note that stent has not been well balanced by the matching procedure.
For this reason, we include this covariate in the regression model for the outcome.
Now we estimate the causal effect:
fit.out4<-glm(sixMonthDeath~abcix+stent, data=lindnerMatched,family=binomial)
summary(fit.out4)
##
## Call:
## glm(formula = sixMonthDeath ~ abcix + stent, family = binomial,
## data = lindnerMatched)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.4089 0.3429 -9.941 < 2e-16 ***
## abcix -1.5529 0.3562 -4.359 1.3e-05 ***
## stent 0.9438 0.3727 2.532 0.0113 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 449.29 on 1377 degrees of freedom
## Residual deviance: 418.68 on 1375 degrees of freedom
## AIC: 424.68
##
## Number of Fisher Scoring iterations: 7
We can see that stent is statistically significant in the model, so it was a good idea to control for it, since it was not well balanced by the matching procedure. As we have already discussed, covariates that are not confounders of the treatment-outcome relationship can sometimes also be included in the final model in order to obtain more precise estimates of the effect.
Definition of the weights:
# Definition of weights for ATE
lindner.overlap %<>%mutate(w_ATE=case_when(abcix==1~1/ps,
abcix==0~1/(1-ps)))
#Definition of weights for ATT
lindner.overlap %<>%mutate(w_ATT=case_when(abcix==1~1,
abcix==0~ps/(1-ps)))
Check the extreme weights: sometimes it is useful to use truncated or stabilized weights in order to reduce the variance of the final estimates, but we do not cover this aspect in detail here (a minimal sketch is given after the quantile checks below).
#Check extremes
quantile(lindner.overlap$w_ATE[lindner.overlap$abcix==1],c(0,0.01,0.05,0.95,0.99,1))
## 0% 1% 5% 95% 99% 100%
## 1.041578 1.052785 1.076734 1.882227 2.339352 3.627077
quantile(lindner.overlap$w_ATE[lindner.overlap$abcix==0],c(0,0.01,0.05,0.95,0.99,1))
## 0% 1% 5% 95% 99% 100%
## 1.566665 1.658869 1.791224 6.137759 14.832595 25.687628
quantile(lindner.overlap$w_ATT[lindner.overlap$abcix==1],c(0,0.01,0.05,0.95,0.99,1))
## 0% 1% 5% 95% 99% 100%
## 1 1 1 1 1 1
quantile(lindner.overlap$w_ATT[lindner.overlap$abcix==0],c(0,0.01,0.05,0.95,0.99,1))
## 0% 1% 5% 95% 99% 100%
## 0.5666647 0.6588687 0.7912238 5.1377591 13.8325951 24.6876276
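As an aside, here is a minimal sketch of how truncated and stabilized ATE weights could be computed; the 1st/99th percentile cut-points and the new column names (w_ATE_trunc, w_ATE_stab) are arbitrary, illustrative choices, and these weights are not used in the analyses that follow.
# Truncate the ATE weights at the 1st and 99th percentiles (illustrative cut-points)
cuts <- quantile(lindner.overlap$w_ATE, c(0.01, 0.99))
lindner.overlap$w_ATE_trunc <- pmin(pmax(lindner.overlap$w_ATE, cuts[1]), cuts[2])
summary(lindner.overlap$w_ATE_trunc)
# Stabilized ATE weights: multiply by the marginal probability of the treatment actually received
pA <- mean(lindner.overlap$abcix)
lindner.overlap$w_ATE_stab <- ifelse(lindner.overlap$abcix == 1,
                                     pA / lindner.overlap$ps,
                                     (1 - pA) / (1 - lindner.overlap$ps))
summary(lindner.overlap$w_ATE_stab)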
# Balance diagnostics
#ATE
bal_IPTW_ATE <- dx.wts(x=lindner.overlap$w_ATE,
data=lindner.overlap,
vars=colnames(lindner.overlap)[4:10],
treat.var = colnames(lindner.overlap)[3],
estimand = "ATE")
#ATT
bal_IPTW_ATT <- dx.wts(x=lindner.overlap$w_ATT,
data=lindner.overlap,
x.as.weights = T,
vars=colnames(lindner.overlap)[4:10],
treat.var = colnames(lindner.overlap)[3],
estimand = "ATT")
bal.table(bal_IPTW_ATE)
## $unw
## tx.mn tx.sd ct.mn ct.sd std.eff.sz stat p ks ks.pval
## stent 0.704 0.457 0.586 0.493 0.252 3.541 0.000 0.118 0.005
## height 171.419 10.728 171.458 10.605 -0.004 -0.053 0.958 0.024 1.000
## female 0.333 0.472 0.384 0.487 -0.106 -1.509 0.132 0.051 0.641
## diabetic 0.207 0.406 0.266 0.443 -0.141 -1.962 0.050 0.059 0.450
## acutemi 0.171 0.377 0.061 0.239 0.320 5.537 0.000 0.110 0.012
## ejecfrac 50.559 10.179 52.279 10.313 -0.168 -2.415 0.016 0.108 0.015
## ves1proc 1.443 0.680 1.209 0.476 0.370 6.207 0.000 0.181 0.000
##
## [[2]]
## tx.mn tx.sd ct.mn ct.sd std.eff.sz stat p ks ks.pval
## stent 0.669 0.471 0.668 0.472 0.000 0.005 0.996 0.000 1.000
## height 171.123 10.734 172.398 10.965 -0.119 -1.351 0.177 0.064 0.532
## female 0.347 0.476 0.330 0.471 0.035 0.468 0.640 0.017 1.000
## diabetic 0.225 0.418 0.242 0.429 -0.040 -0.461 0.645 0.017 1.000
## acutemi 0.137 0.344 0.149 0.357 -0.035 -0.307 0.759 0.012 1.000
## ejecfrac 51.077 9.910 51.093 10.182 -0.002 -0.020 0.984 0.044 0.909
## ves1proc 1.368 0.641 1.427 0.690 -0.093 -0.783 0.434 0.032 0.996
bal.table(bal_IPTW_ATT)
## $unw
## tx.mn tx.sd ct.mn ct.sd std.eff.sz stat p ks ks.pval
## stent 0.704 0.457 0.586 0.493 0.259 3.541 0.000 0.118 0.005
## height 171.419 10.728 171.458 10.605 -0.004 -0.053 0.958 0.024 1.000
## female 0.333 0.472 0.384 0.487 -0.107 -1.509 0.132 0.051 0.641
## diabetic 0.207 0.406 0.266 0.443 -0.145 -1.962 0.050 0.059 0.450
## acutemi 0.171 0.377 0.061 0.239 0.293 5.537 0.000 0.110 0.012
## ejecfrac 50.559 10.179 52.279 10.313 -0.169 -2.415 0.016 0.108 0.015
## ves1proc 1.443 0.680 1.209 0.476 0.345 6.207 0.000 0.181 0.000
##
## [[2]]
## tx.mn tx.sd ct.mn ct.sd std.eff.sz stat p ks ks.pval
## stent 0.704 0.457 0.703 0.458 0.003 0.034 0.973 0.001 1.000
## height 171.419 10.728 172.793 11.090 -0.128 -1.264 0.206 0.068 0.597
## female 0.333 0.472 0.307 0.462 0.055 0.683 0.495 0.026 1.000
## diabetic 0.207 0.406 0.232 0.423 -0.061 -0.598 0.550 0.025 1.000
## acutemi 0.171 0.377 0.186 0.390 -0.041 -0.313 0.754 0.015 1.000
## ejecfrac 50.559 10.179 50.594 10.085 -0.003 -0.041 0.968 0.035 0.997
## ves1proc 1.443 0.680 1.519 0.743 -0.111 -0.827 0.409 0.043 0.966
Finally, let’s estimate the ATE causal effect on the weighted dataset:
# Estimate ATE
design.lindnerATE <- svydesign(ids=~1,
weights = ~w_ATE,
data=lindner.overlap)
fit_itpw_ATE <- svyglm(sixMonthDeath~abcix,
family=binomial,
design=design.lindnerATE)
tbl_regression(fit_itpw_ATE,exponentiate = T)
| Characteristic | OR¹ | 95% CI¹ | p-value |
|---|---|---|---|
| abcix | 0.17 | 0.06, 0.49 | <0.001 |

¹ OR = Odds Ratio, CI = Confidence Interval
And the ATT:
design.lindnerATT <- svydesign(ids=~1,
weights = ~w_ATT,
data=lindner.overlap)
fit_iptw_ATT <- svyglm(sixMonthDeath~abcix,
family=binomial,
design=design.lindnerATT)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
tbl_regression(fit_iptw_ATT,exponentiate = T)
| Characteristic | OR¹ | 95% CI¹ | p-value |
|---|---|---|---|
| abcix | 0.15 | 0.05, 0.46 | <0.001 |

¹ OR = Odds Ratio, CI = Confidence Interval
Here too it is possible to estimate the 95% CIs using bootstrap methods (which are in general more robust).
results <- data.frame(ATE=rep(NA,4),ATT=rep(NA,4))
f_PSadj <- function(data, indices,outcome.formula) {
d <- data[indices,] # allows boot to select sample
# estimation of ps
m1<-glm(abcix~
acutemi+
ejecfrac+
ves1proc+
stent+
diabetic+
female,
data=d,
family = binomial)
d$ps<-fitted.values(m1)
# overlap
d %<>% mutate(overlap=ifelse(ps>=min(ps[abcix==1]) &
ps<=max(ps[abcix==0]),1,0))
# outcome model
m2<-glm(outcome.formula,data=d,family="binomial",subset = overlap==1)
if(!m2$converged) print("Model did not converge")
d$predY0<-m2$family$linkinv(coef(m2)[1]+coef(m2)[3]*d$ps)
d$predY1<-m2$family$linkinv(coef(m2)[1]+coef(m2)[2]+coef(m2)[3]*d$ps)
Y1<-mean(d$predY1)
Y0<-mean(d$predY0)
Y1_1<-mean(d$predY1[d$abcix==1]) # use the treatment indicator from the bootstrap sample d
Y0_1<-mean(d$predY0[d$abcix==1])
ATE_PSadj<-(Y1/(1-Y1))/(Y0/(1-Y0))
ATT_PSadj<-(Y1_1/(1-Y1_1))/(Y0_1/(1-Y0_1))
return(c(ATE_PSadj,ATT_PSadj))
}
res_boot <- function(obj.boot,type="percent",digits=3){
suppressWarnings({
orig <- round(obj.boot$t0,digits)
ciATE <- paste(round(boot.ci(obj.boot,index=1)[[type]][4:5],digits),collapse = "-")
ciATT <- paste(round(boot.ci(obj.boot,index=2)[[type]][4:5],digits),collapse = "-")
})
res <- paste(orig,rbind(ciATE,ciATT),sep="(")
res.u <- paste(res,rep(")",2),sep = "")
return(res.u)
}
boot.out <- boot(data=lindner, statistic=f_PSadj,R=1000,outcome.formula=fit.out$formula)
print(boot.out)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = lindner, statistic = f_PSadj, R = 1000, outcome.formula = fit.out$formula)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.2363187 0.02393070 0.1318934
## t2* 0.2363050 0.02394347 0.1318932
# Get 95% confidence interval
results$ATE[1]<-res_boot(boot.out,digits=4)[1]
results$ATT[1]<-res_boot(boot.out,digits=4)[2]
results
## ATE ATT
## 1 0.2363(0.0929-0.5899) 0.2363(0.0929-0.5899)
## 2 <NA> <NA>
## 3 <NA> <NA>
## 4 <NA> <NA>
Very often with the stratification method there are convergence problems for the regression algorithm, since within the strata there are very few events!
f_PSstrat <- function(data, indices) {
d <- data[indices,] # allows boot to select sample
m1<-glm(abcix~
acutemi+
ejecfrac+
ves1proc+
stent+
diabetic+
female,
data=d,family = binomial)
d$ps<-fitted.values(m1)
# propensity score tertiles define three strata (include.lowest keeps the minimum ps)
tert_PS<-quantile(d$ps,c(0,0.33,0.66,1))
d$strata<-cut(d$ps, tert_PS, labels=c(1:3), include.lowest=TRUE)
# stratum-specific log-ORs and stratum sizes (overall and among the treated),
# computed on the bootstrap sample d
beta.treat<-numeric(3)
nstrata<-as.numeric(table(d$strata))
treated.strata<-as.numeric(table(d$strata[d$abcix==1]))
for (i in 1:3){
ms<-glm(sixMonthDeath~abcix,data=d,subset = strata==i,family="binomial")
beta.treat[i]<-coef(ms)[2]
}
ATE <- exp(sum(beta.treat*nstrata)/nrow(d))
ATT <- exp(sum(beta.treat*treated.strata)/sum(treated.strata))
return(c(ATE,ATT))
}
boot.out4 <- boot(data=lindner, statistic=f_PSstrat,R=1000)
# Get 95% confidence interval
results$ATE[2]<-res_boot(boot.out4,digits=3)[1]
results$ATT[2]<-res_boot(boot.out4,digits=3)[2]
We now estimate the robust standard errors related to the estimate on the matched dataset:
library(sandwich) # vcovHC() provides the robust (sandwich) covariance matrix
cov <- vcovHC(fit.out4, type = "HC0")
std.err <- sqrt(diag(cov))
q.val <- qnorm(0.975)
r <- cbind(
Estimate = coef(fit.out4)
, "Robust SE" = std.err
, z = (coef(fit.out4)/std.err)
, "Pr(>|z|) "= 2 * pnorm(abs(coef(fit.out4)/std.err), lower.tail = FALSE)
, LL = coef(fit.out4) - q.val * std.err
, UL = coef(fit.out4) + q.val * std.err
)
#Exponential to get the OR
results$ATT[3]<- paste0(round(exp(r[2,1]),4),"(",round(exp(r[2,5]),4),"-",round(exp(r[2,6]),4),")")
f_IPTW <- function(data, indices) {
d <- data[indices,] # allows boot to select sample
# estimation of ps
m1<-glm(abcix~
acutemi+
ejecfrac+
ves1proc+
stent+
diabetic+
female,
data=d,
family = binomial)
d$ps<-fitted.values(m1)
# overlap
d %<>% mutate(overlap=ifelse(ps>=min(ps[abcix==1]) &
ps<=max(ps[abcix==0]),1,0)) %>%
filter(overlap==1)
# Definition of weights for ATE
d %<>%mutate(w_ATE=case_when(abcix==1~1/ps,
abcix==0~1/(1-ps)))
#Definition of weights for ATT
d %<>%mutate(w_ATT=case_when(abcix==1~1,
abcix==0~ps/(1-ps)))
# Estimate ATE
design.lindnerATE <- svydesign(ids=~1,
weights = ~w_ATE,
data=d)
fit_itpw_ATE <- svyglm(sixMonthDeath~abcix,
family=binomial,
design=design.lindnerATE)
# Estimate ATT
design.lindnerATT <- svydesign(ids=~1,
weights = ~w_ATT,
data=d)
fit_iptw_ATT <- svyglm(sixMonthDeath~abcix,
family=binomial,
design=design.lindnerATT)
ATE <- exp(fit_itpw_ATE$coefficients[2])
ATT <- exp(fit_iptw_ATT$coefficients[2])
return(c(ATE,ATT))
}
boot.out5 <- boot(data=lindner, statistic=f_IPTW,R=1000)
results$ATE[4]<-res_boot(boot.out5,digits=4)[1]
results$ATT[4]<-res_boot(boot.out5,digits=4)[2]
# recall the crude OR from the original dataset
tbl_regression(fit.crude,exponentiate=T)
| Characteristic | OR¹ | 95% CI¹ | p-value |
|---|---|---|---|
| abcix | 0.30 | 0.13, 0.66 | 0.003 |

¹ OR = Odds Ratio, CI = Confidence Interval
# PS results
row.names(results) <- c("PS Adj Linear","Stratification","Matching","IPTW")
results
## ATE ATT
## PS Adj Linear 0.2363(0.0929-0.5899) 0.2363(0.0929-0.5899)
## Stratification 0.196(0-43.913) 0.202(0.001-35.486)
## Matching <NA> 0.2116(0.1046-0.4283)
## IPTW 0.1749(0.0592-0.6322) 0.1525(0.0474-0.6406)
We can observe that using the propensity score based methods we obtain a stronger estimate of the treatment effect than the crude odds ratio. We can also observe that stratification produces very unstable results, due to the small sample size within the strata. In conclusion, adjusting for confounders was important in this context!
Load the required libraries
library(haven)
library(tidyverse)
library(magrittr)
Data comes from SHARE, the Survey of Health, Ageing and Retirement in Europe (https://share-eric.eu/data/). It is a multidisciplinary and cross-national panel database of data on health, socio-economic status and social and family networks of about 140000 individuals aged 50 or older (around 380000 interviews). SHARE covers 27 European countries and Israel.
The dataset we use here contains only part of the data the survey produced.
The research question is whether a stressful period could be
associated with the occurrence of muscular weakness in the over 50
population: therefore we can consider it from the point of view of an
explanatory (causal) model, not a predictive one.
This question could be of interest since muscular weakness can be
considered a proxy of poor physical health. Muscular strength was
defined as the hand grip measured with a dynamometer, and weakness
was defined as having a grip strength below a threshold calculated
according to BMI and sex.
The exposure group was defined as subjects who underwent a stressful period between the first and the last time they had been interviewed; all other subjects were defined as non-exposed.
Of note, an inclusion criterion was having normal hand strength at the start of the study. Once the subjects were divided into the exposed and non-exposed groups, the hand grip measured two years after the exposure definition was used to define the outcome.
In this example we will consider the following variables:
load("Logistic Regression examples.RData")
Before starting to fit any regression model we have to properly code categorical variables. This is usually done in the initial data analysis (IDA) and the descriptive statistics phase.
The first step consists in identifying which variables are categorical. This may seem like an easy step, but it isn’t always. In fact, the choice for some variables (e.g. indexes and scales) is not obvious, and it requires either knowledge of the data we are analysing or a bit more effort on the data modelling.
The important message is that we can’t assume that it is correct to model a variable as numerical only because it was of type numerical/integer in the dataset imported in R.
Once we have identified the variables that we think are categorical we can process them. We can have three cases that lead to slightly different procedures according to the R type of the variable:
Binary variables can either be treated as numerical, usually as 0/1 dummy variables, or as factors. It makes no difference to the results obtained from the model, but we have to be aware of how they are coded for a correct interpretation of the model.
In general, the model will always have 1 parameter for a binary variable. However, we have to choose which level the parameter should represent by deciding which group to code as 1. The choice should be made according to how we want the model to be interpreted. For example, for a treatment variable, 0 is usually the placebo or control treatment and 1 is usually the experimental treatment. Alternatively, in the case of the covariates/exposure variables, we could simply use a statistical criterion: for the stability of the model it is always better to use the most frequent category as the reference.
For example, it makes more sense to use as the reference level of paid_job the subjects who have had a paid job. This will help with the stability of the model and will also ease the interpretation of the results. Therefore, we create a new variable no_paid_job:
datiSHAREStress$no_paid_job <- ifelse(datiSHAREStress$paid_job==0,1,0)
For categorical variables with more than two levels we must transform the variable into a factor, assign labels to the factor levels and choose a reference level. The reference level is the group all the other levels will be compared to. In fact, when a factor variable with c levels is added to a model, R internally transforms it into c-1 dummy variables and a model parameter is estimated for each of them. Again, it is important to choose the reference level. In this case we have ses, which we have to transform into a factor:
datiSHAREStress$ses <- as.factor(datiSHAREStress$ses)
We can now check how R has coded the variable:
str(datiSHAREStress$ses)
## Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 2 4 4 ...
contrasts(datiSHAREStress$ses)
## 2 3 4
## 1 0 0 0
## 2 1 0 0
## 3 0 1 0
## 4 0 0 1
By default, R uses group 1 as the reference level. With the contrasts() function we can see how the variable will be parametrized in the regression model: the rows are the variable levels and the columns are the parameters generated for this variable.
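As a cross-check, we can also look at the dummy columns that R will actually build in the design matrix for this factor (only the first rows shown; model.matrix() is base R):
# Design-matrix columns created for the ses factor (intercept plus c-1 dummies)
head(model.matrix(~ ses, data = datiSHAREStress))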
It is also useful to assign more meaningful labels:
datiSHAREStress$ses=factor(datiSHAREStress$ses,
labels = c("Great Difficulties",
"Some Difficulties",
"Fairly Easily",
"Easily"))
Last but not least we can set the reference category of our choice. In this case we choose “Easily” which is the most frequent category.
table(datiSHAREStress$ses)
##
## Great Difficulties Some Difficulties Fairly Easily Easily
## 191 813 1997 3148
datiSHAREStress$ses=relevel(datiSHAREStress$ses,ref="Easily")
If the variable is of type character the steps are similar to the previous ones with the exception that we won’t need to explicitly set the labels. As a reminder, by default R uses as a reference level the first level in alphabetical order.
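A minimal, purely hypothetical example (no such character variable exists in our dataset) of how a character variable would be handled:
# Hypothetical character variable: convert to a factor, then set the reference level
smoke_chr <- c("never", "former", "current", "never", "former")
smoke <- factor(smoke_chr)              # levels in alphabetical order: current, former, never
smoke <- relevel(smoke, ref = "never")  # explicitly choose the reference level
levels(smoke)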
We skip this part here, since we already discussed how to describe a dataset in the various IDA examples. Remember that you should pay particular attention to outliers, missing values, the distribution shapes of continuous variables, and rare categories in categorical ones!
A common method used in the analysis of health data for variable selection is fitting many models with one covariate at a time. This is also a way to begin exploring the dataset, even if, as discussed, it is not the suggested method for selecting the variables in the model. In this case, we can start with the exposure to stress, our variable of interest. In the medical literature this kind of analysis is often referred to as “univariable analysis”.
So we start by fitting the model
fit_uni_stress <- glm(low_grip~stress,family = binomial,data=datiSHAREStress)
As a reminder, the glm function automatically uses the logit as link function unless we state otherwise.
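For completeness, the same model could be fitted with the link function stated explicitly; fit_uni_stress_logit is just an illustrative name and the call is equivalent to the one above:
fit_uni_stress_logit <- glm(low_grip ~ stress,
                            family = binomial(link = "logit"),
                            data = datiSHAREStress)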
The results of the model can be obtained with
summary(fit_uni_stress)
##
## Call:
## glm(formula = low_grip ~ stress, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.5660 0.0501 -51.217 <2e-16 ***
## stress -0.5174 0.4204 -1.231 0.218
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 3141.1 on 6147 degrees of freedom
## AIC: 3145.1
##
## Number of Fisher Scoring iterations: 5
It seems that we can’t reject the null hypothesis for \(\beta_{stress}\).
We can also obtain the 95% Confidence Interval
confint.default(fit_uni_stress)
## 2.5 % 97.5 %
## (Intercept) -2.664221 -2.4678285
## stress -1.341440 0.3066133
We can fit more univariable models to look for possible confounders (previously discussed with an expert, if possible) of the exposure-outcome relationship.
fit_uni_age <- glm(low_grip~age,family = binomial,data=datiSHAREStress)
summary(fit_uni_age)
##
## Call:
## glm(formula = low_grip ~ age, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.505544 0.419383 -22.67 <2e-16 ***
## age 0.101366 0.005822 17.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 2817.0 on 6147 degrees of freedom
## AIC: 2821
##
## Number of Fisher Scoring iterations: 6
fit_uni_job <- glm(low_grip~no_paid_job,family = binomial,data=datiSHAREStress)
summary(fit_uni_job)
##
## Call:
## glm(formula = low_grip ~ no_paid_job, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.61992 0.05134 -51.030 < 2e-16 ***
## no_paid_job 1.13184 0.21544 5.254 1.49e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 3120.8 on 6147 degrees of freedom
## AIC: 3124.8
##
## Number of Fisher Scoring iterations: 5
fit_uni_sex <- glm(low_grip~female,family = binomial,data=datiSHAREStress)
summary(fit_uni_sex)
##
## Call:
## glm(formula = low_grip ~ female, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.56532 0.07178 -35.738 <2e-16 ***
## female -0.01918 0.09955 -0.193 0.847
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 3142.8 on 6147 degrees of freedom
## AIC: 3146.8
##
## Number of Fisher Scoring iterations: 5
fit_uni_move <- glm(low_grip~house_move,family = binomial,data=datiSHAREStress)
summary(fit_uni_move)
##
## Call:
## glm(formula = low_grip ~ house_move, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3735 0.1307 -18.161 <2e-16 ***
## house_move -0.2329 0.1413 -1.648 0.0993 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 3140.2 on 6147 degrees of freedom
## AIC: 3144.2
##
## Number of Fisher Scoring iterations: 5
fit_uni_ses <- glm(low_grip~ses,family = binomial,data=datiSHAREStress)
summary(fit_uni_ses)
##
## Call:
## glm(formula = low_grip ~ ses, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.69592 0.07324 -36.809 < 2e-16 ***
## sesGreat Difficulties 0.70745 0.23408 3.022 0.00251 **
## sesSome Difficulties 0.37973 0.14288 2.658 0.00787 **
## sesFairly Easily 0.11084 0.11422 0.970 0.33182
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 3129.9 on 6145 degrees of freedom
## AIC: 3137.9
##
## Number of Fisher Scoring iterations: 5
When we have covariates with many levels, such as ses, we can use a global test for the whole variable (here a likelihood ratio test via drop1()), which avoids relying on the separate level-specific tests.
drop1(fit_uni_ses,test="Chisq")
## Single term deletions
##
## Model:
## low_grip ~ ses
## Df Deviance AIC LRT Pr(>Chi)
## <none> 3129.8 3137.8
## ses 3 3142.8 3144.8 12.988 0.004663 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can now fit a multivariable logistic model with the variables that were found to be associated with the outcome in the previous analysis and that we suspect could act as confounders of the main exposure of interest.
Since we do not have the possibility to discuss with an expert, we decide to include in the multivariable regression model all covariates with a p-value < 0.1. This criterion may not be the best one, but it is widely applied in clinical and epidemiological research.
We obviously include stress in the model, since it is our exposure of interest.
We will treat the other variables as possible confounders in the model, and we will explore whether significant associations with the exposure of interest are present. In principle, in the explanatory setting, we should start from a DAG and state the causal assumptions (in order to estimate a causal effect!), but we do not have subject-matter experts here, so we will limit ourselves in the end to a cautious interpretation of the estimated association.
fit_multi <- glm(low_grip~stress+no_paid_job+age+house_move+ses,family = binomial,data=datiSHAREStress)
summary(fit_multi)
##
## Call:
## glm(formula = low_grip ~ stress + no_paid_job + age + house_move +
## ses, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.486409 0.448437 -21.154 < 2e-16 ***
## stress -0.149441 0.438194 -0.341 0.733075
## no_paid_job 0.408889 0.233225 1.753 0.079568 .
## age 0.101474 0.005924 17.129 < 2e-16 ***
## house_move -0.196669 0.147181 -1.336 0.181472
## sesGreat Difficulties 0.939131 0.249104 3.770 0.000163 ***
## sesSome Difficulties 0.489696 0.149577 3.274 0.001061 **
## sesFairly Easily 0.066057 0.118419 0.558 0.576962
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 2790.2 on 6141 degrees of freedom
## AIC: 2806.2
##
## Number of Fisher Scoring iterations: 6
This model estimates the coefficient of each variable while adjusting for all the others. This is a key concept of multivariable regression and it is what allows us to remove (measured) confounding.
At first sight it seems that our exposure of interest is not associated with the outcome.
Knowledge of the context the data come from can be used to make hypotheses about possible interactions between confounders and the exposure.
In this case we want to test for an interaction between no_paid_job and stress. In other words, we want to see whether the association between the outcome and the exposure varies across the levels of no_paid_job.
We then fit the model:
fit_multi2 <- glm(low_grip~stress*no_paid_job+house_move+age+ses,family = binomial,data=datiSHAREStress)
summary(fit_multi2)
##
## Call:
## glm(formula = low_grip ~ stress * no_paid_job + house_move +
## age + ses, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.544923 0.450526 -21.186 < 2e-16 ***
## stress -0.812176 0.603047 -1.347 0.178049
## no_paid_job 0.267759 0.243706 1.099 0.271901
## house_move -0.181885 0.147844 -1.230 0.218604
## age 0.102246 0.005946 17.197 < 2e-16 ***
## sesGreat Difficulties 0.927537 0.249828 3.713 0.000205 ***
## sesSome Difficulties 0.490455 0.149577 3.279 0.001042 **
## sesFairly Easily 0.060946 0.118563 0.514 0.607225
## stress:no_paid_job 3.189101 1.054703 3.024 0.002497 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 2781.6 on 6140 degrees of freedom
## AIC: 2799.6
##
## Number of Fisher Scoring iterations: 6
As a reminder, using * introduces both the main effect and the interaction effect between variables in the model.
Alternatively, we can also fit the same model as follows:
fit_multi2 <- glm(low_grip~stress+no_paid_job+age+ses+stress:no_paid_job+house_move,family = binomial,data=datiSHAREStress)
This second way of adding an interaction term is useful in case we have multiple interactions involving the same variable.
It seems that a significant interaction is present between stress and the no_paid_job confounder. Note that in any case, the main effect of stress is still not significant in the model.
Now that we have fitted these two models, which one should we choose?
We can use a formal statistical test to compare the two nested models:
anova(fit_multi,fit_multi2,test="Chisq")
## Analysis of Deviance Table
##
## Model 1: low_grip ~ stress + no_paid_job + age + house_move + ses
## Model 2: low_grip ~ stress + no_paid_job + age + ses + stress:no_paid_job +
## house_move
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 6141 2790.2
## 2 6140 2781.6 1 8.5938 0.003373 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis of this test is that the simpler model (i.e. the one with the smaller number of regression parameters) is no different from the more complex model. In this case we reject that hypothesis at the 5% significance level, so we would keep the model with the interaction.
Even if we are estimating an explanatory model here, this does not mean that we are not interested in evaluating whether the predicted probabilities are in line with the observed event rates. A basic procedure to evaluate this aspect in logistic regression is the Hosmer-Lemeshow test (see for details: https://onlinelibrary.wiley.com/doi/book/10.1002/0471722146). In brief, the null hypothesis of the test is that the predicted probabilities (split into ordinal groups) are in line with the observed event rates in each group.
library(generalhoslem)
## Loading required package: reshape
##
## Attaching package: 'reshape'
## The following object is masked from 'package:Matrix':
##
## expand
## The following object is masked from 'package:data.table':
##
## melt
## The following object is masked from 'package:lubridate':
##
## stamp
## The following object is masked from 'package:dplyr':
##
## rename
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
logitgof(datiSHAREStress$low_grip,fitted(fit_multi2))
##
## Hosmer and Lemeshow test (binary model)
##
## data: datiSHAREStress$low_grip, fitted(fit_multi2)
## X-squared = 5.6329, df = 8, p-value = 0.6883
The model doesn’t show evidence of poor goodness of fit/calibration. Remember that when we are instead working with models used specifically for prediction, we should use more refined analyses, such as bootstrap overfitting-corrected calibration curves.
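For reference, a minimal sketch of such a curve with the rms package (loaded again below) could look like the following; fit_cal is just an illustrative refit of the same model with lrm(), storing the design matrix and response as required by calibrate():
library(rms)
fit_cal <- lrm(low_grip ~ stress * no_paid_job + age + ses + house_move,
               data = datiSHAREStress, x = TRUE, y = TRUE)
cal <- calibrate(fit_cal, method = "boot", B = 200)  # bootstrap overfitting correction
plot(cal)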
The next step is interpreting the results obtained from the model, which is one of the most important parts of the analysis to be discussed with experts. Of course we are keeping things simple here: in reality we would have had to consider many other possible candidate confounders for the outcome in order to properly control for confounding in this observational study, at least for the measured confounders.
We obtain the OR, which is the measure of association selected when reporting and interpreting the results of a logistic model.
We start with the OR estimated for age:
exp(fit_multi2$coefficients)[4]
## age
## 1.107656
In general, an OR>1 indicates an increase in the probability of the outcome, whereas an OR<1 indicates a decrease. But the question is: with respect to what? We know that the OR is a relative measure of association, so we must always ask ourselves what comparison we are making when we report an estimated OR. The above output does not mean anything by itself.
The first important thing to remember is that age is a continuous variable (measured in years) and in the model we have assumed it has a linear effect on the log-odds of the outcome. When we exponentiate the coefficient for age we obtain the estimated OR for an increase of one year of age. Since \(\hat{OR}>1\), the probability of developing low grip increases as age increases. Specifically, a subject has odds of having low grip 1.11 times those of a subject who is 1 year younger.
For continuous variables it is important to report an OR for a difference that is relevant in the application at hand; a proper choice will depend on the scale on which the variable is measured and on the magnitude of the estimated coefficient.
For example, here 1 year may not be so relevant from an epidemiological point of view, so we may want to report the OR for a 5-year difference in age:
exp(fit_multi2$coefficients*5)[4]
## age
## 1.66734
What if we want to interpret the OR of a binary variable? Let’s consider the variable house_move. If it is coded as numerical, the coefficient always refers to level 1. In this case 1 stands for having moved country in the past, so if the estimated coefficient were significant we could say that subjects who moved seem to have odds lower by 17% (\(1-\hat{OR}\)) compared to people who have never moved country in their life. However, when we look at the 95% CI we observe that it contains 1, so the association does not seem to be statistically significant.
#OR
exp(fit_multi2$coefficients)[8]
## house_move
## 0.8336973
exp(confint.default(fit_multi2))[8,]
## 2.5 % 97.5 %
## 0.6239695 1.1139185
The interpretation of the OR for categorical variables is similar: we have to keep in mind that we are always comparing each of the variable levels to the reference level. Here, it seems (interestingly!) that the odds of the outcome are essentially the same for subjects with good or fairly good self-perceived socio-economic status, while they increase as the economic situation gets worse.
# OR
exp(fit_multi2$coefficients)[5:7]
## sesGreat Difficulties sesSome Difficulties sesFairly Easily
## 2.528273 1.633059 1.062842
# 95% CI
exp(confint.default(fit_multi2))[5:7,]
## 2.5 % 97.5 %
## sesGreat Difficulties 1.549423 4.125513
## sesSome Difficulties 1.218097 2.189385
## sesFairly Easily 0.842456 1.340880
Finally, we will obtain the \(\hat{OR}s\) for stress and no_paid_job. Since there is an interaction involved, we have to be a bit more careful and we have to consider the two variables together. So first of all we have to know which is our reference group. In our case this is the group who have not experienced a stressful period and have done some paid job in their life (the reference levels for both variables involved in the interaction). Then, we can consider all combinations of the levels of the variables.
If we want the \(\hat{OR}\) for undergoing a stressful period while having had a paid job, with respect to not having had a period of stress and having had a paid job, we simply exponentiate the estimated main-effect coefficient for stress (keeping the other covariates in the model at the same values):
exp(fit_multi2$coefficients)[2]
## stress
## 0.4438912
and we do not reject the null hypothesis that it is equal to 1 (no significant effect):
exp(confint.default(fit_multi2))[2,]
## 2.5 % 97.5 %
## 0.1361326 1.4474079
On the other hand, if we want the \(\hat{OR}\) for never having had a paid job and not being exposed to stress with respect to having had a paid job and not being exposed to stress we simply exponentiate the coefficient for the main effect of the variable no_paid_job:
exp(fit_multi2$coefficients)[3]
## no_paid_job
## 1.307032
This effect is also not statistically significant, which means that there does not seem to be a difference in the risk of low hand grip with regard to job experience in the non-exposed (no stress) group, keeping all other covariates fixed.
exp(confint.default(fit_multi2))[3,]
## 2.5 % 97.5 %
## 0.8106679 2.1073152
Last, we can obtain the OR for subjects who were stressed and had never had a paid job with respect to the reference group (no stressful period and having had a paid job):
exp(fit_multi2$coefficients[9]+fit_multi2$coefficients[2]+fit_multi2$coefficients[3])
## stress:no_paid_job
## 14.07899
So in this subgroup it seems that the probability of developing low grip increases.
However, we should also evaluate the 95% CI for this \(\hat{OR}\).
One possibility is to re-fit the multivariable logistic model using the rms R package:
library(rms)
dd <- datadist(datiSHAREStress)
## Warning in datadist(datiSHAREStress): intervallo is constant
options(datadist='dd') #define ranges of the covariates
fit_multi2b <- lrm(low_grip~stress*no_paid_job+age+ses+house_move,data=datiSHAREStress)
The code is quite similar; the main difference is that first we have to run the datadist function, which stores the distribution summaries of the variables.
The print function returns an output very similar to that of summary() for the glm object.
print(fit_multi2b)
## Logistic Regression Model
##
## lrm(formula = low_grip ~ stress * no_paid_job + age + ses + house_move,
## data = datiSHAREStress)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 6149 LR chi2 361.20 R2 0.143 C 0.752
## 0 5714 d.f. 8 R2(8,6149)0.056 Dxy 0.504
## 1 435 Pr(> chi2) <0.0001 R2(8,1212.7)0.253 gamma 0.505
## max |deriv| 2e-06 Brier 0.061 tau-a 0.066
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -9.5449 0.4505 -21.19 <0.0001
## stress -0.8122 0.6031 -1.35 0.1781
## no_paid_job 0.2678 0.2437 1.10 0.2719
## age 0.1022 0.0059 17.20 <0.0001
## ses=Great Difficulties 0.9275 0.2498 3.71 0.0002
## ses=Some Difficulties 0.4905 0.1496 3.28 0.0010
## ses=Fairly Easily 0.0609 0.1186 0.51 0.6072
## house_move -0.1819 0.1478 -1.23 0.2186
## stress * no_paid_job 3.1891 1.0547 3.02 0.0025
The summary, on the other hand, gives the estimates for the coefficients as well as for the ORs. Of note, by default, for continuous variables such as age the function calculates the OR for the difference between the \(1^{st}\) and the \(3^{rd}\) quartile.
summary(fit_multi2b)
## Effects Response : low_grip
##
## Factor Low High Diff. Effect S.E. Lower 0.95
## stress 0 1.0 1.0 -0.812180 0.603110 -1.99430
## Odds Ratio 0 1.0 1.0 0.443890 NA 0.13612
## no_paid_job 0 1.0 1.0 0.267760 0.243710 -0.20990
## Odds Ratio 0 1.0 1.0 1.307000 NA 0.81067
## age 58 70.8 12.8 1.308700 0.076105 1.15960
## Odds Ratio 58 70.8 12.8 3.701500 NA 3.18860
## house_move 0 1.0 1.0 -0.181880 0.147840 -0.47165
## Odds Ratio 0 1.0 1.0 0.833700 NA 0.62397
## ses - Great Difficulties:Easily 1 2.0 NA 0.927540 0.249830 0.43788
## Odds Ratio 1 2.0 NA 2.528300 NA 1.54940
## ses - Some Difficulties:Easily 1 3.0 NA 0.490460 0.149580 0.19729
## Odds Ratio 1 3.0 NA 1.633100 NA 1.21810
## ses - Fairly Easily:Easily 1 4.0 NA 0.060946 0.118560 -0.17143
## Odds Ratio 1 4.0 NA 1.062800 NA 0.84246
## Upper 0.95
## 0.36990
## 1.44760
## 0.74541
## 2.10730
## 1.45790
## 4.29700
## 0.10788
## 1.11390
## 1.41720
## 4.12550
## 0.78362
## 2.18940
## 0.29333
## 1.34090
##
## Adjusted to: stress=0 no_paid_job=0
So far, the ORs for stress and job experience are the ones for the main effects. We can easily obtain the OR for their joint effect (involving the interaction) with the built-in contrast function:
c <- contrast(fit_multi2b,
list(stress=1,no_paid_job=1), list(stress=0,no_paid_job=0),type="average")
c
## Contrast S.E. Lower Upper Z Pr(>|z|)
## 11 2.644684 0.8328779 1.012273 4.277095 3.18 0.0015
##
## Confidence intervals are 0.95 individual intervals
#OR and 95% CI
exp(c$Contrast)
## 1
## 14.07899
exp(c$Lower)
## 1
## 2.75185
exp(c$Upper)
## 1
## 72.03085
So it seems that there is a significant interaction between stress and no paid job on the outcome.
Recall our initial research question: whether a stressful period could be associated with the occurrence of muscular weakness in the over-50 population. Based on our analysis, it seems that it is not the stressful period alone that increases the occurrence of muscular weakness, but rather the joint impact of a stressful period coupled with never having had a paid job that is associated with the occurrence of muscular weakness, adjusting for (or independently of) age, house moving history and socio-economic status.
In nature, an event usually takes place in a very small amount of time. At any given point of time, the probability of encountering such an event is small. Instead of the probability of the single event, we now focus on the frequency of events as a density, which means the incidence or ‘count’ of events over a period of time. (While time is one dimension, the same concept applies to the density of counts of small objects in a two-dimensional area or three-dimensional space.) Moreover, we can assume that one event is independent of another and that the densities in different units of time vary with a variance equal to the average density. We can approximate this kind of random process using the Poisson random variable.
When the probability of having an event is affected by some factors, a model is needed to explain and predict the density. Variation among different strata of a population can be explained by the various combinations of factors. Within each stratum (defined by a covariate combination), the distribution of the events is assumed to be random. Poisson regression deals with outcome variables that are counts in nature (whole numbers or integers). Independent covariates have the same role as those encountered in linear and logistic regression.
In epidemiology, Poisson regression is very often used for analysing grouped population-based or cohort data, looking at incidence density among the person-time contributed by subjects sharing similar characteristics of interest. Poisson regression is one of three common regression models used in epidemiological studies; the other two, linear regression and logistic regression, have already been covered. The last family comprises survival methods, which we will explain in block 4.
There are two main assumptions for Poisson regression:
risk is homogeneous among person-times contributed by different subjects who share the same characteristics of interest (e.g. sex, age-group) and the same period.
asymptotically, or as the sample size becomes larger and larger, the mean of the counts is equal to the variance.
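As a quick illustration of the second assumption, we can simulate Poisson counts and check that the sample mean and variance are close (the rate of 4 is an arbitrary choice):
set.seed(1)
counts <- rpois(10000, lambda = 4)  # simulated Poisson counts with rate 4
c(mean = mean(counts), variance = var(counts))  # both should be close to 4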
Straightforward linear regression methods (assuming constant variance and normal errors) are not appropriate for count data for four main reasons:
the linear model might lead to the prediction of negative counts
the variance of the response may increase with the mean
the errors will not be normally distributed
zero counts are difficult to handle in transformations
Moreover, in studies that involve the time dimension, different subjects may have different person-times of exposure. Analysing risk factors while ignoring differences in person-time is wrong.
Poisson regression overcomes these limitations.
Note that in survival analysis, using for example Cox regression (see block 4), a hazard ratio is estimated for each covariate in the model, not the incidence density in each subgroup; in the Cox model the interest is in “how long until an event occurs” (time to event), whereas in the Poisson regression model the focus is on “how many events occur in a given interval”.
The dataset Montana was extracted from an occupational cohort study conducted to test the association between respiratory deaths (outcome) and exposure to arsenic in the industry, after adjusting for various other risk factors/confounders. The main outcome variable is respdeath, the count of deaths within the person-years (personyrs) contributed by the subjects in each category. The other variables are independent covariates, including age group (agegr), period of employment (period), starting time of employment (start) and the level of exposure to arsenic during the study period (arsenic, the exposure of interest). Read in the data first and examine the variables.
data(Montana)
summary(Montana)
## respdeath personyrs agegr period
## Min. : 0.000 Min. : 4.2 Min. :1.000 Min. :1.000
## 1st Qu.: 0.000 1st Qu.: 127.5 1st Qu.:2.000 1st Qu.:1.000
## Median : 1.000 Median : 335.1 Median :3.000 Median :2.000
## Mean : 2.421 Mean : 1096.4 Mean :2.605 Mean :2.404
## 3rd Qu.: 3.000 3rd Qu.: 925.7 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :19.000 Max. :12451.3 Max. :4.000 Max. :4.000
## starting arsenic
## Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.250
## Median :1.000 Median :2.000
## Mean :1.456 Mean :2.474
## 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :2.000 Max. :4.000
The last four variables are classed as integers. We need to tell R to interpret them as categorical variables, or factors, and attach labels to each of the levels. This can be done using the factor command with a ‘labels’ argument included.
Montana$agegr <- factor(Montana$agegr, labels=c("40-49","50-59","60-69","70-79"))
Montana$period <- factor(Montana$period, labels=c("1938-1949", "1950-1959","1960-1969", "1970-1977"))
Montana$start <- factor(Montana$start, labels=c("pre-1925", "1925 & after"))
Montana$arsenic1 <- factor(Montana$arsenic, labels=c("<1 year", "1-4years","5-14 years", "15+ years"))
summary(Montana)
## respdeath personyrs agegr period starting
## Min. : 0.000 Min. : 4.2 40-49:24 1938-1949:30 Min. :1.000
## 1st Qu.: 0.000 1st Qu.: 127.5 50-59:28 1950-1959:32 1st Qu.:1.000
## Median : 1.000 Median : 335.1 60-69:31 1960-1969:28 Median :1.000
## Mean : 2.421 Mean : 1096.4 70-79:31 1970-1977:24 Mean :1.456
## 3rd Qu.: 3.000 3rd Qu.: 925.7 3rd Qu.:2.000
## Max. :19.000 Max. :12451.3 Max. :2.000
## arsenic start arsenic1
## Min. :1.000 pre-1925 :62 <1 year :29
## 1st Qu.:1.250 1925 & after:52 1-4years :29
## Median :2.000 5-14 years:29
## Mean :2.474 15+ years :27
## 3rd Qu.:3.000
## Max. :4.000
We keep the original arsenic variable unchanged for use later on.
Let us explore the person-years breakdown by age and period. Firstly, create a table for total person-years:
tapply(Montana$personyrs, list(Montana$period, Montana$agegr), sum) -> table.pyears
Carry out the same procedure for number of deaths, and compute the table of incidence per 10,000 person years for each cell.
tapply(Montana$respdeath, list(Montana$period, Montana$agegr), sum) -> table.deaths
table.inc10000 <- table.deaths/table.pyears*10000
table.inc10000
## 40-49 50-59 60-69 70-79
## 1938-1949 5.424700 17.13102 34.95107 26.53928
## 1950-1959 3.344638 23.47556 49.01961 64.82632
## 1960-1969 4.341516 20.49375 58.23803 55.06608
## 1970-1977 4.408685 14.77747 44.09949 80.81413
Now, create a time-series plot of the incidence:
plot.ts(table.inc10000, plot.type="single", xlab=" ",ylab="#/10,000 person-years", xaxt="n", col=c("black",
"blue","red","green"), lty=c(2,1,1,2), las=1)
points(rep(1:4,4), table.inc10000, pch=22, cex=table.pyears/sum(table.pyears) * 20)
title(main = "Incidence by age and period")
axis(side = 1, at = 1:4, labels = levels(Montana$period))
legend(3.2,40, legend=levels(Montana$agegr)[4:1], col=c("green","red", "blue", "black"), bg = "white", lty=c(2,1,1,2))
The above graph shows that the older age group is generally associated with a higher risk. On the other hand, the sample size (reflected by the size of the squares at each point) decreases with age. The possibility of a confounding effect of age on the exposure of interest can better be examined by using Poisson regression.
Let’s estimate a Poisson regression model taking into account only period as a covariate:
mode11 <- glm(respdeath ~ period, offset = log(personyrs),family = poisson, data=Montana)
summary(mode11)
##
## Call:
## glm(formula = respdeath ~ period, family = poisson, data = Montana,
## offset = log(personyrs))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.4331 0.1715 -37.511 <2e-16 ***
## period1950-1959 0.2365 0.2117 1.117 0.2638
## period1960-1969 0.3781 0.2001 1.889 0.0588 .
## period1970-1977 0.4830 0.2036 2.372 0.0177 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 376.02 on 113 degrees of freedom
## Residual deviance: 369.27 on 110 degrees of freedom
## AIC: 596.05
##
## Number of Fisher Scoring iterations: 6
The option offset = log(personyrs) makes the variable personyrs the denominator for the counts of respdeath. A logarithmic transformation is needed because the default link function for the Poisson family is the natural log, so the offset must enter the linear predictor on the log scale.
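Equivalently, the offset can be written inside the model formula; the sketch below (mode11b is just an illustrative name) should give the same coefficients as mode11:
mode11b <- glm(respdeath ~ period + offset(log(personyrs)),
               family = poisson, data = Montana)
all.equal(coef(mode11), coef(mode11b))  # the two specifications are equivalent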
Remember: an important criterion in the choice of a link function for the various families of distributions is to ensure that the fitted values from the model stay within reasonable bounds. Specifying a log link (the default for Poisson) ensures that the fitted counts are all greater than or equal to zero. For more details on the default links for the various families of distributions in generalized linear modelling, see help(family) in R.
The first model above, with period as the only independent variable, suggests that the death rate increased with time. The model can be tested for goodness of fit to check whether the Poisson assumptions mentioned earlier have been violated.
To test the goodness of fit of the Poisson model, type:
poisgof(mode11)
## $results
## [1] "Goodness-of-fit test for Poisson assumption"
##
## $chisq
## [1] 369.2654
##
## $df
## [1] 110
##
## $p.value
## [1] 9.578443e-30
The component ‘$chisq’ is actually computed from the model deviance, a parameter reflecting the level of errors. A large chi-squared value with small degrees of freedom results in a significant violation of the Poisson assumption (p < 0.05). If only the P value is wanted, the command can be shortened.
poisgof(mode11)$p.value
## [1] 9.578443e-30
The P value is very small indicating a poor fit.
Note: this method works under the assumption of a large sample size. An alternative is to fit a negative binomial regression model (not covered in the slides!).
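A minimal sketch of that alternative, using glm.nb() from the MASS package (mode11.nb is just an illustrative name), could look like this:
library(MASS)  # provides glm.nb() for negative binomial regression
mode11.nb <- glm.nb(respdeath ~ period + offset(log(personyrs)), data = Montana)
summary(mode11.nb)  # adds a dispersion parameter, relaxing the mean = variance assumption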
We now add the second independent variable ‘agegr’ to the model:
mode12 <- glm(respdeath~agegr+period, offset=log(personyrs), family = poisson, data=Montana)
AIC(mode12)
## [1] 396.6407
The AIC (Akaike Information Criterion) has decreased remarkably from ‘model1’ to ‘model2’ indicating a poor fit of the first model.
poisgof(mode12)$p.value
## [1] 0.0003381328
But ‘model2’ still violates the Poisson assumption.
mode13 <- glm(respdeath ~ agegr, offset = log(personyrs), family = poisson, data=Montana)
AIC(mode13)
## [1] 394.4699
poisgof(mode13)$p.value
## [1] 0.0003295142
Removal of ‘period’ further reduces the AIC but still violates the Poisson assumption to the same extent as the previous model. The next step is to add the exposure of interest: ‘arsenic1’.
mode14 <- glm(respdeath ~ agegr + arsenic1, offset=log(personyrs), family = poisson, data=Montana)
summary(mode14)
##
## Call:
## glm(formula = respdeath ~ agegr + arsenic1, family = poisson,
## data = Montana, offset = log(personyrs))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.9947 0.2237 -35.739 < 2e-16 ***
## agegr50-59 1.4624 0.2454 5.960 2.53e-09 ***
## agegr60-69 2.3502 0.2380 9.873 < 2e-16 ***
## agegr70-79 2.5987 0.2564 10.136 < 2e-16 ***
## arsenic11-4years 0.8041 0.1577 5.099 3.42e-07 ***
## arsenic15-14 years 0.5957 0.2059 2.893 0.00381 **
## arsenic115+ years 0.9976 0.1758 5.673 1.40e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 376.02 on 113 degrees of freedom
## Residual deviance: 122.25 on 107 degrees of freedom
## AIC: 355.04
##
## Number of Fisher Scoring iterations: 5
poisgof(mode14)$p.value
## [1] 0.1486902
Fortunately, ‘model4’ has a much lower AIC than model3 and it now does not violate the assumption.
If we change the reference level for arsenic and we use 1-4 years vs others:
Montana$arsenic.b <- relevel(Montana$arsenic1,ref="1-4years")
mode15 <- glm(respdeath ~ agegr + arsenic.b, offset=log(personyrs), family = poisson, data=Montana)
summary(mode15)
##
## Call:
## glm(formula = respdeath ~ agegr + arsenic.b, family = poisson,
## data = Montana, offset = log(personyrs))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.1905 0.2466 -29.156 < 2e-16 ***
## agegr50-59 1.4624 0.2454 5.960 2.53e-09 ***
## agegr60-69 2.3502 0.2380 9.873 < 2e-16 ***
## agegr70-79 2.5987 0.2564 10.136 < 2e-16 ***
## arsenic.b<1 year -0.8041 0.1577 -5.099 3.42e-07 ***
## arsenic.b5-14 years -0.2085 0.2326 -0.896 0.37
## arsenic.b15+ years 0.1935 0.2071 0.935 0.35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 376.02 on 113 degrees of freedom
## Residual deviance: 122.25 on 107 degrees of freedom
## AIC: 355.04
##
## Number of Fisher Scoring iterations: 5
There does not appear to be any increase in the risk of death with more than 4 years of exposure to arsenic, so it may be worth combining the exposure into just two levels:
Montana$arsenic2 <- Montana$arsenic1
levels(Montana$arsenic2) <- c("<1 year", rep("1+ years", 3))
model6 <- glm(respdeath ~ agegr + arsenic2,offset=log(personyrs), family=poisson, data=Montana)
summary(model6)
##
## Call:
## glm(formula = respdeath ~ agegr + arsenic2, family = poisson,
## data = Montana, offset = log(personyrs))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.0086 0.2233 -35.859 < 2e-16 ***
## agegr50-59 1.4702 0.2453 5.994 2.04e-09 ***
## agegr60-69 2.3661 0.2372 9.976 < 2e-16 ***
## agegr70-79 2.6238 0.2548 10.297 < 2e-16 ***
## arsenic21+ years 0.8109 0.1210 6.699 2.09e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 376.02 on 113 degrees of freedom
## Residual deviance: 125.02 on 109 degrees of freedom
## AIC: 353.8
##
## Number of Fisher Scoring iterations: 5
At this stage, we would accept ‘model6’ as the final model, since it has the smallest AIC among all the models that we have tried. We conclude that exposure to arsenic for at least one year is associated with a statistically significant increase in the risk of respiratory death by a factor of exp(0.8109) = 2.25, independently of age.
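The incidence rate ratios (IRRs) and their 95% CIs for the final model can be extracted by exponentiating the coefficients; epiDisplay also offers idr.display() for a formatted table, shown here as an optional alternative:
exp(cbind(IRR = coef(model6), confint.default(model6)))  # IRRs with Wald 95% CIs
idr.display(model6)                                      # formatted incidence density ratios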
We now want to compute the minimum sample size required for the development of a new multivariable model using the criteria proposed by Riley et al. The required sample size aims to minimize model overfitting and to ensure key parameters (such as the model intercept) are estimated precisely. As for any sample size calculation, the approach requires the user to specify anticipated values for key parameters.
The package pmsampsize can be used to calculate the minimum sample size for the development of models with continuous, binary or survival (time-to-event) outcomes. Riley et al. lay out a series of criteria the sample size should meet. These aim to minimise the overfitting and to ensure precise estimation of key parameters in the prediction model.
For continuous outcomes, there are four criteria:
The sample size calculation requires the user to pre-specify (e.g. based on previous evidence) the anticipated R-squared of the model, and the average outcome value and standard deviation of outcome values in the population of interest.
For binary or survival (time-to-event) outcomes, there are three major criteria:
library(pmsampsize)
Arguments of the function:
type specifies the type of analysis for which sample size is being calculated
“c” specifies sample size calculation for a prediction model with a continuous outcome
“b” specifies sample size calculation for a prediction model with a binary outcome
“s” specifies sample size calculation for a prediction model with a survival (time-to-event) outcome
rsquared specifies the expected value of the (Cox-Snell) R-squared of the new model, where R-squared is the percentage of variation in outcome values explained by the model. For example, the user may input the value of the (Cox-Snell) R-squared reported for a previous prediction model study in the same field. If taking a value from a previous prediction model development study, users should input the model’s adjusted R-squared value, not the apparent R-squared value, as the latter is optimistic (biased). However, if taking the R-squared value from an external validation of a previous model, the apparent R-squared can be used (as the validation data was not used for development, and so the apparent R-squared is then unbiased). Note that for binary and survival outcome models, the Cox-Snell R-squared value is required; this is the generalised version of the well-known R-squared for continuous outcomes, based on the likelihood. The papers by Riley et al. (see references) outline how to obtain the Cox-Snell R-squared value from published studies if it is not reported, using other information (such as the C-statistic [see cstatistic() option below] or Nagelkerke’s R-squared). Users should be conservative with their chosen R-squared value; for example, by taking the R-squared value from a previous model, even if they hope their new model will improve performance.
parameters specifies the number of candidate predictor parameters for potential inclusion in the new prediction model. Note that this may be larger than the number of candidate predictors, as categorical and continuous predictors often require two or more parameters to be estimated.
shrinkage specifies the level of shrinkage desired at internal validation after developing the new model. Shrinkage is a measure of overfitting, and can range from 0 to 1, with higher values denoting less overfitting. A shrinkage = 0.9 is recommended (the default in pmsampsize), which indicates that the predictor effect (beta coefficients) in the model would need to be shrunk by 10% to adjust for overfitting. See references below for further information.
prevalence (binary outcome option) specifies the overall outcome proportion (for a prognostic model) or overall prevalence (for a diagnostic model) expected within the model development dataset. This should be derived based on previous studies in the same population.
cstatistic (binary outcome option) specifies the C-statistic reported in an existing prediction model study to be used in conjunction with the expected prevalence to approximate the Cox-Snell R-squared using the approach of Riley et al. 2020. Ideally, this should be an optimism-adjusted C-statistic. The approximate Cox-Snell R-squared value is used as described above for the rsquared option, and so is treated as a baseline for the expected performance of the new model.
seed (binary outcome option) specifies the initial value of the random-number seed used by the random-number functions when simulating data to approximate the Cox-Snell R-squared based on the reported C-statistic and expected prevalence, as described by Riley et al. 2020.
rate (survival outcome option) specifies the overall event rate in the population of interest, for example as obtained from a previous study, for the survival outcome of interest. NB: rate must be given in time units used for meanfup and timepoint options.
timepoint (survival outcome option) specifies the timepoint of interest for prediction. NB: time units must be the same as given for meanfup option (e.g. years, months).
meanfup (survival outcome option) specifies the average (mean) follow-up time anticipated for individuals in the model development dataset, for example as taken from a previous study in the population of interest. NB: time units must be the same as given for timepoint option. An illustrative call using these survival options is sketched just after this list of arguments.
intercept (continuous outcome options) specifies the average outcome value in the population of interest e.g. the average blood pressure, or average pain score. This could be based on a previous study, or on clinical knowledge.
sd (continuous outcome options) specifies the standard deviation (SD) of outcome values in the population e.g. the SD for blood pressure in patients with all other predictors set to the average. This could again be based on a previous study, or on clinical knowledge.
mmoe (continuous outcome options) specifies the multiplicative margin of error (MMOE) acceptable for the calculation of the intercept. The default is an MMOE of 10%. A confidence interval for the intercept is displayed in the output for reference. See references below for further information.
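The continuous and binary options are illustrated with worked examples below, but the survival options (rate, timepoint, meanfup) are not, so here is a minimal illustrative call. All input values (a Cox-Snell R-squared of 0.051, 30 candidate parameters, an event rate of 0.065 per person-year, a 2-year prediction horizon and a mean follow-up of about 2 years) are assumptions chosen only to show the syntax; in practice they should come from previous studies in your population of interest.
pmsampsize(type = "s",          # survival (time-to-event) outcome
           rsquared = 0.051,    # anticipated Cox-Snell R-squared of the new model (assumed)
           parameters = 30,     # number of candidate predictor parameters (assumed)
           rate = 0.065,        # overall event rate per person-year (assumed)
           timepoint = 2,       # time point of interest for prediction, in years
           meanfup = 2.07)      # anticipated mean follow-up, in years (assumed)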
Self-identified race or ethnic group is used to determine normal reference standards in the prediction of pulmonary function. A study conducted in 2010 investigated whether the genetically determined percentage of African ancestry is associated with lung function and whether its use could improve predictions of lung function among persons who identified themselves as African American (see https://www.nejm.org/doi/full/10.1056/NEJMoa0907897). The authors assessed the ancestry of 777 participants self-identified as African American and evaluated the relation between pulmonary function and ancestry by means of linear regression. African ancestry was inversely related to forced expiratory volume in 1 second (FEV1) and forced vital capacity. Assuming 25 candidate parameters, an intercept of 1.9, a standard deviation of 0.6 (from the published paper) and a lower bound of 0.20 for the R-squared:
pmsampsize(type = "c", rsquared = 0.2, parameters = 25, intercept=1.9, sd=0.6)
## NB: Assuming 0.05 acceptable difference in apparent & adjusted R-squared
## NB: Assuming MMOE <= 1.1 in estimation of intercept & residual standard deviation
## SPP - Subjects per Predictor Parameter
##
## Samp_size Shrinkage Parameter Rsq SPP
## Criteria 1 918 0.900 25 0.2 36.72
## Criteria 2 401 0.801 25 0.2 16.04
## Criteria 3 259 0.727 25 0.2 10.36
## Criteria 4* 918 0.900 25 0.2 36.72
## Final 918 0.900 25 0.2 36.72
##
## Minimum sample size required for new model development based on user inputs = 918
##
## * 95% CI for intercept = (1.87, 1.93), for sample size n = 918
SPP indicates the “Subjects per Predictor Parameter”; as you can see, this number varies according to the criterion used (from about 10 to 37). The confidence interval reported for the intercept is based on a 10% margin of error.
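As a rough check on these numbers, Criteria 2 (an apparent minus adjusted R-squared of at most 0.05) can be reproduced by hand, and the calculation can be rerun with a more permissive shrinkage target to see how the inputs drive the result. The sketch below is purely illustrative: the first line is a back-of-the-envelope check, and the second call simply swaps the default shrinkage of 0.9 for 0.8.
# Back-of-the-envelope check of Criteria 2: an apparent minus adjusted
# R-squared of at most 0.05 corresponds to n = 1 + p * (1 - R2) / 0.05
1 + 25 * (1 - 0.2) / 0.05   # 401, matching Criteria 2 in the output above
# Illustrative sensitivity analysis: accepting more overfitting (shrinkage 0.8
# instead of the default 0.9) relaxes Criteria 1; the other criteria are unchanged
pmsampsize(type = "c", rsquared = 0.2, parameters = 25,
           intercept = 1.9, sd = 0.6, shrinkage = 0.8)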
Chagas disease is a tropical parasitic disease. It is spread mostly by insects known as Triatominae, or “kissing bugs”. The symptoms change over the course of the infection. In the early stage, symptoms are typically either absent or mild, and may include fever, swollen lymph nodes, headaches, or swelling at the site of the bite. After four to eight weeks, untreated individuals enter the chronic phase of the disease, which in most cases does not result in further symptoms. Up to 45% of people with chronic infection develop heart disease 10-30 years after the initial illness, which can lead to heart failure. Digestive complications, including an enlarged esophagus or an enlarged colon, may also occur in up to 21% of people, and up to 10% of people may experience nerve damage. With the globalization of Chagas disease, inexperienced health care providers may have difficulties in identifying which patients should be examined for this condition. This study, published in 2016, aimed to develop and validate a diagnostic clinical model for chronic Chagas disease: https://www.scielo.br/j/rsbmt/a/WMwS4xvKGxMBMzybxwkbvVv/?lang=en
We use pmsampsize to calculate the minimum sample size required to develop a multivariable prediction model for a binary outcome using 24 candidate predictor parameters. Based on the published evidence, the outcome prevalence is anticipated to be 0.174 (17.4%) and a lower bound (taken from the adjusted Cox-Snell R-squared of an existing prediction model) for the new model’s R-squared value is 0.288.
pmsampsize(type = "b", rsquared = 0.288, parameters = 24, prevalence = 0.174)
EPP here indicates the “Events per Predictor Parameter”; as you can see, this number varies according to the criterion used (from about 1.6 to 4.8).
Here we are interested in developing a diagnostic model for the presence of DVT (deep venous thrombosis). DVT is a blood clot that forms in a leg vein and may migrate to the lungs, leading to blockage of arterial flow, preventing oxygenation of the blood and potentially causing death. Multivariable diagnostic prediction models have been proposed during the past decades to safely exclude DVT without having to refer patients for further burdensome (reference standard) testing. The reference http://dx.doi.org/10.1016/j.jclinepi.2014.06.018 reported a C-statistic of 0.79 for a model with 8 parameters and an outcome prevalence estimated at 22%.
pmsampsize(type = "b", cstatistic=0.79, parameters = 9, prevalence = 0.22)
## Given input C-statistic = 0.79 & prevalence = 0.22
## Cox-Snell R-sq = 0.1802
##
## NB: Assuming 0.05 acceptable difference in apparent & adjusted R-squared
## NB: Assuming 0.05 margin of error in estimation of intercept
## NB: Events per Predictor Parameter (EPP) assumes prevalence = 0.22
##
## Samp_size Shrinkage Parameter CS_Rsq Max_Rsq Nag_Rsq EPP
## Criteria 1 403 0.900 9 0.1802 0.651 0.277 9.85
## Criteria 2 246 0.847 9 0.1802 0.651 0.277 6.01
## Criteria 3 264 0.900 9 0.1802 0.651 0.277 6.45
## Final 403 0.900 9 0.1802 0.651 0.277 9.85
##
## Minimum sample size required for new model development based on user inputs = 403,
## with 89 events (assuming an outcome prevalence = 0.22) and an EPP = 9.85
##
##
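To see where these numbers come from, the three criteria can be reproduced by hand following the formulas described by Riley et al. (Part II). The sketch below is a back-of-the-envelope check using the inputs of the call above (9 candidate parameters, Cox-Snell R-squared 0.1802, prevalence 0.22), not the pmsampsize internals.
p <- 9          # candidate predictor parameters
r2cs <- 0.1802  # Cox-Snell R-squared approximated from C = 0.79 and prevalence 0.22
phi <- 0.22     # anticipated outcome prevalence
# Criteria 1: expected shrinkage of at least 0.9
n1 <- p / ((0.9 - 1) * log(1 - r2cs / 0.9))
# Criteria 2: apparent minus adjusted Nagelkerke R-squared of at most 0.05,
# using the maximum possible Cox-Snell R-squared for this prevalence (Max_Rsq)
max_r2cs <- 1 - (phi^phi * (1 - phi)^(1 - phi))^2
s2 <- r2cs / (r2cs + 0.05 * max_r2cs)
n2 <- p / ((s2 - 1) * log(1 - r2cs / s2))
# Criteria 3: estimate the overall outcome proportion to within +/- 0.05
n3 <- (1.96 / 0.05)^2 * phi * (1 - phi)
ceiling(c(n1, n2, n3))          # 403, 246, 264, as in the output above
n <- ceiling(max(n1, n2, n3))   # final minimum sample size = 403
round(n * phi / p, 2)           # events per predictor parameter, about 9.85
The final sample size is simply the maximum over the criteria, and the EPP reported by pmsampsize is the implied number of events divided by the number of candidate predictor parameters.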
References:
Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: from utopia to empirical data. Journal of Clinical Epidemiology. 2016;74:167-176.
Riley RD, Ensor J, Snell KIE, Harrell FE, Martin GP, Reitsma JB, et al. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020.
Riley RD, Snell KIE, Ensor J, Burke DL, Harrell FE Jr, Moons KG, Collins GS. Minimum sample size required for developing a multivariable prediction model: Part I - continuous outcomes. Statistics in Medicine. 2019. doi: 10.1002/sim.7993.
Riley RD, Snell KIE, Ensor J, Burke DL, Harrell FE Jr, Moons KG, Collins GS. Minimum sample size required for developing a multivariable prediction model: Part II - binary and time-to-event outcomes. Statistics in Medicine. 2019. doi: 10.1002/sim.7992.
van Smeden M, Moons KG, de Groot JA, et al. Sample size for binary logistic prediction models: beyond events per variable criteria. Statistical Methods in Medical Research. 2019;28(8):2455-74.
Riley RD, Van Calster B, Collins GS. A note on estimating the Cox-Snell R2 from a reported C statistic (AUROC) to inform sample size calculations for developing a prediction model with a binary outcome. Statistics in Medicine. 2020.