---
title: "R applications & Exercises - Block 3"
subtitle: "STATISTICAL LEARNING IN EPIDEMIOLOGY 2023/2024"
author: "Prof. Giulia Barbati & Paolo Dalena"
date: "13/05/2024"
output:
  html_document:
    toc: true
    number_sections: true
    theme: united
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Example of prediction model: dataset “Prostate” Case Study

First of all, install and load the required libraries:

```{r warning=FALSE,message=FALSE}
library(rms)
library(Hmisc)
library(gtsummary)
library(magrittr)
library(tidyverse)
library(haven)
library(ggplot2)
library(ggpubr)
library(pROC)
library(here)
library(psych)
```

This practical contains a case study on developing, describing, and validating a regression prediction model. The original dataset of the study also recorded the follow-up time to the events, but we do not analyse it as a time-to-event outcome here (for didactic purposes only, we will treat it as another candidate predictor in the model).

The original dataset included 506 patients with prostate cancer from the Byar and Green paper (if interested, see: D. P. Byar and S. B. Green. The choice of treatment for cancer patients based on covariate information: Application to prostate cancer. Bulletin Cancer, Paris, 67:477-488, 1980). These data come originally from a randomized trial (RCT) comparing four treatments for stage 3 and 4 prostate cancer, with almost equal numbers of patients on placebo and each of three doses of estrogen. In this trial, larger doses of estrogen reduced the effect of prostate cancer but at the cost of increased risk of cardiovascular death. (Four patients had missing values on many variables. These patients have been excluded from consideration.)

Our goal here is to predict the probability of cardiovascular-cerebrovascular death among the patients who died either of this cause or of prostate cancer.
We will therefore subset the original dataset to those patients who died from prostate cancer, heart or vascular disease, or cerebrovascular disease.

Here is the description of the variables contained in the dataset:

![](/Users/PaoloMacbook/Documents/PHD/esercitazioni_stat_learn_epi/images/metadata1.png)

![](/Users/PaoloMacbook/Documents/PHD/esercitazioni_stat_learn_epi/images/metadata2.png)

Some a priori information from clinicians: *stage* is defined by *ap* as well as X-ray results. Of the patients in stage 3, 92% have *ap* ≤ 0.8. Of those in stage 4, 93% have *ap* > 0.8. Since stage can be predicted almost certainly from *ap*, we do not consider stage in the analyses. (You can check this relationship on the data.)

## Initial Data Analysis (IDA) on the original dataset

This part is an exercise that you will do on your own. First of all, let's load the original dataset:

```{r message=FALSE,warning=FALSE}
prostate <- read_sav(here("datasets", "prostate.sav"))
prostate <- labelled::to_factor(prostate, labelled_only = TRUE, strict = TRUE, unclass = TRUE)
summary.data.frame(prostate)
```

Here are the suggestions for the initial data analysis:

1. Create a descriptive summary of all the variables contained in the dataset (except for patient number!). Do you think that some variables could be re-coded by combining an *infrequent* category with the next category? For example, pay attention to the frequencies of *ekg* and *pf*. Hint from a discussion with clinicians: in the *ekg* variable we could collapse the *normal* and *benign* conditions. For *pf*, you can combine *in bed > 50% daytime* with *confined to bed* in one category.
2. Here is another question for you: based on the study design (RCT), do you expect to find statistically significant differences between the baseline characteristics of the groups defined by the levels of the treatment *rx*? Build a descriptive table comparing the four groups and reporting the global p value across groups.
(Hint: for numerical variables you could use the Kruskal-Wallis test and for categorical ones the Chi-square test.)
3. Look at the shape of the distribution of the continuous variables: do you think that some scale transformation (for example taking the logarithm) could help in improving the *normality* of some of them?
4. Now subset the original dataset, selecting only the patients who died of cardiovascular-cerebrovascular causes or of prostate cancer. Create the outcome variable *cvd*, coding as “1” the patients who died of cardiovascular-cerebrovascular causes and as “0” the patients who died of prostate cancer. (We will then use logistic regression for the prediction model.)

The reduced dataset “prostate_rid” is created. Here we begin with the prediction model development part.

## Let's begin with the prediction model construction

First of all, we load the reduced dataset with the recoded variables that have been suggested in the IDA phase:

```{r message=FALSE,warning=FALSE}
prostate_rid <- read_sav(here("datasets", "prostate_rid.sav"))
prostate_rid <- labelled::to_factor(prostate_rid, labelled_only = TRUE, strict = TRUE, unclass = TRUE)
summary.data.frame(prostate_rid)
```

Here we define the ranges of the variables contained in the dataset (this is a useful function both for data description and for the subsequent interpretation of the regression coefficients):

```{r message=FALSE,warning=FALSE}
options(datadist='dd')
dd <- datadist(prostate_rid)
dd
```

## Functional forms of continuous variables

Now we visually explore the assumption of linearity of the effect for the continuous candidate predictors by means of nonparametric smoothed regression:

```{r message=FALSE,warning=FALSE}
par(mfrow=c(2,2))
plsmo(prostate_rid$age, prostate_rid$cvd)
plsmo(prostate_rid$wt, prostate_rid$cvd)
plsmo(prostate_rid$sbp, prostate_rid$cvd)
plsmo(prostate_rid$dbp, prostate_rid$cvd)
```

A strange behaviour is observed for *wt*, i.e.
the Weight Index [calculated as wt = wt(kg) - ht(cm) + 200]; the other variables could be *roughly* approximated by a linear effect.

```{r message=FALSE,warning=FALSE}
par(mfrow=c(2,3))
plsmo(prostate_rid$hg, prostate_rid$cvd)
plsmo(prostate_rid$sz, prostate_rid$cvd)
plsmo(prostate_rid$sg, prostate_rid$cvd)
plsmo(prostate_rid$ap, prostate_rid$cvd)
plsmo(prostate_rid$dtime, prostate_rid$cvd)
```

Again, a strange behaviour is observed for *dtime* (months of follow-up, as is obviously expected...); the others could be *roughly* approximated by a linear effect. Remember that this dataset is quite small, so we have to be parsimonious in the number of coefficients to be estimated!

For the same reason, we will use the whole dataset for the *training* phase of the regression model, and we will then apply bootstrap techniques to evaluate the performance of the model. This is obviously not the optimal way to proceed, but with a low-to-moderate sample size (as often happens in medical data collected prospectively) we cannot afford an initial split of the study sample into separate training and test subgroups. This is also a great obstacle to applying machine learning techniques to small-to-moderate medical datasets.

## Univariable analysis

Even if univariable analysis is generally discouraged as a way to select candidate predictors (it works only with perfectly uncorrelated variables, which is nearly impossible in biological problems), it is another exploratory step that can be useful for discussing with clinicians the associations that are present in the data.
So first of all we can create a table with univariable results, assuming for the moment a linear effect for all the continuous variables:

```{r message=FALSE,warning=FALSE}
prostate_rid %>%
  select(-patno, -stage, -pf, -ekg, -status, -status_num, -subset) %>%
  tbl_uvregression(
    method = glm,                          # glm function
    method.args = list(family = binomial), # logistic model
    exponentiate = TRUE,                   # report OR
    y = cvd,                               # outcome variable
    conf.level = 0.95,
    pvalue_fun = function(x) style_pvalue(x, digits = 3)
  )
```

Note that here all the odds ratios (OR) express the variation associated with a 1-unit increase on the scale of the independent variable. This table should be discussed with the clinicians in order to check whether the observed statistically significant associations are in line with what is expected from the biological perspective.

Now we can come back to the functional forms for the continuous variables: let's try a nonlinear effect for *wt* and for *dtime*.

```{r message=FALSE,warning=FALSE}
fit.splines.wt <- lrm(cvd ~ rcs(wt,4), data=prostate_rid)
print(fit.splines.wt)
summary(fit.splines.wt)
```

We observe that there is no significant effect for the nonlinear components of the spline.

```{r message=FALSE,warning=FALSE}
fit.splines.dtime <- lrm(cvd ~ rcs(dtime,4), data=prostate_rid)
print(fit.splines.dtime)
summary(fit.splines.dtime)
```

Here instead there is a significant nonlinear effect for the *dtime* variable (as expected, being the follow-up time...). In this case, the odds ratio is reported as a contrast between two specific values (37 vs 11), i.e. the third and first quartile of the variable. This is done in order to help in the interpretation of the nonlinear effect modelled by the spline function.
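The quartile contrast described above can be sketched by hand. This is only an illustration on simulated data, with two swaps that are assumptions and not part of the case study: it uses base `glm()` with a natural spline from `splines::ns()` instead of `lrm()` with `rcs()`, and all the `*_sim` names and the true effect are hypothetical.

```r
library(splines)  # ships with base R

set.seed(42)
n_sim <- 400
x_sim <- runif(n_sim, 0, 60)                         # hypothetical follow-up-like variable
p_sim <- plogis(-1 + 0.08 * x_sim - 0.0012 * x_sim^2) # assumed nonlinear true effect
y_sim <- rbinom(n_sim, 1, p_sim)

# Logistic model with a natural cubic spline for x_sim
fit_sim <- glm(y_sim ~ ns(x_sim, df = 3), family = binomial)

# Linear predictor (log-odds) at the first and third quartile of x_sim
q_sim <- quantile(x_sim, c(0.25, 0.75))
lp_sim <- unname(predict(fit_sim,
                         newdata = data.frame(x_sim = unname(q_sim)),
                         type = "link"))

# Odds ratio Q3 vs Q1: exponential of the difference in log-odds
or_q3_q1 <- exp(lp_sim[2] - lp_sim[1])
or_q3_q1
```

Because the effect is nonlinear, this odds ratio depends on the two values being compared, which is why a single "per 1-unit" odds ratio is not reported for a spline term.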
Let's visualize the effect of this candidate predictor modelled in a nonlinear way:

```{r message=FALSE,warning=FALSE}
ggplot(Predict(fit.splines.dtime))
```

This is quite similar to the nonparametric regression previously used:

```{r message=FALSE,warning=FALSE}
plsmo(prostate_rid$dtime, prostate_rid$cvd)
```

## Multivariable full model

We decide to start with a *full* (no selection of predictors) binary logistic model, assuming linearity for the continuous predictors except for *dtime*:

```{r message=FALSE,warning=FALSE}
fit.multi <- lrm(cvd ~ sz + sg + log(ap) + sbp + dbp + age + wt + hg +
                   ekg_rec + pf_rec + bm + hx + rx + rcs(dtime,4),
                 data = prostate_rid)
```

Note that there is usually an upper bound on the number of candidate predictors that can be used in a [logistic/Poisson/survival] regression model, dictated by the number of events occurring in the population (we will take this aspect into account when the sample size issue in multivariable modelling is discussed).

Let's see the results:

```{r message=FALSE,warning=FALSE}
print(fit.multi)
summary(fit.multi)
```

```{r message=FALSE,warning=FALSE}
an <- anova(fit.multi)
an
plot(an)
```

In prediction modelling we are mostly interested in evaluating the accuracy of the predictions, more than in interpreting the single regression coefficients and their statistical significance (even if in the last phase, i.e. when presenting results to the clinicians, this aspect is important too!).

Now we want to evaluate the performance of the full model:

```{r message=FALSE,warning=FALSE}
s <- fit.multi$stats
round(s, digits=3)
```

C is the AUC and Dxy is a rescaling of C (Dxy = 2*(C-0.5)), so they are both discrimination measures. They measure how well the model performs in the relative ranking of the estimated risks.
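As a minimal sketch of what these two indexes measure, C and Dxy can be computed by hand from predicted risks and observed outcomes. The data below are simulated and the names `risk_toy`, `event_toy`, `c_index` and `dxy_toy` are hypothetical; the rank-based formula is the standard Wilcoxon/Mann-Whitney identity for the AUC, not a reproduction of the internals of `lrm()`.

```r
# Hypothetical toy data: predicted risks and observed binary outcomes
set.seed(1)
risk_toy  <- runif(200)
event_toy <- rbinom(200, 1, risk_toy)

# C = probability that a random event has a higher predicted risk than a
# random non-event (ties count 1/2), obtained from the rank-sum statistic
n1 <- sum(event_toy == 1)
n0 <- sum(event_toy == 0)
rk <- rank(risk_toy)
c_index <- (sum(rk[event_toy == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Somers' Dxy rescales C so that 0 means no discrimination and 1 perfect ranking
dxy_toy <- 2 * (c_index - 0.5)
c(C = c_index, Dxy = dxy_toy)
```

Both quantities depend only on the ordering of the predicted risks, so any monotone transformation of the predictions leaves them unchanged.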
```{r message=FALSE,warning=FALSE}
gamma.hat <- (s['Model L.R.'] - s['d.f.'])/s['Model L.R.']
gamma.hat
```

Estimation of the *shrinkage coefficient* gamma allows the quantification of the amount of overfitting present, and it also allows one to estimate the likelihood that the model will *reliably* predict new observations. The van Houwelingen-Le Cessie heuristic shrinkage estimate is 0.85, indicating that this model will validate on *new data* about 15% worse than on this dataset. This is also related to the concept of calibration (the agreement between the predicted risks and the observed event frequencies).

## Model visualization: a useful interim step to be discussed with clinicians!

Let's now visualize the effects of the predictors on the logit scale:

```{r message=FALSE,warning=FALSE}
ggplot(Predict(fit.multi))
```

Now let's see them on the odds ratio scale:

```{r message=FALSE,warning=FALSE}
plot(summary(fit.multi), log=TRUE)
```

The general function is *plot(summary(fit.multi))*, but it is less readable; for this reason we use the option log=TRUE.

The estimates of the associations are interquartile-range odds ratios for continuous predictors and odds ratios with respect to a reference level for categorical predictors. Numbers at left are upper quartile vs lower quartile, or current group vs reference group. The bars represent confidence limits. The intervals are drawn on the log odds ratio scale and labelled on the odds ratio scale. Ranges are on the original scale.

Discussing these results with clinicians is also useful for thinking about possible interaction effects between some of the candidate predictors; in this example we will not examine this issue.

## Using more splines!
Now, forgetting for a moment our sample size limitation, we want to check whether it is worth expanding all the other continuous variables with splines in a full model approach:

```{r message=FALSE,warning=FALSE}
fit.all.splines <- lrm(cvd ~ rcs(sz,4) + rcs(sg,4) + rcs(log(ap),4) + rcs(sbp,4) +
                         rcs(dbp,4) + rcs(age,4) + rcs(wt,4) + rcs(hg,4) +
                         ekg_rec + pf_rec + bm + hx + rx + rcs(dtime,4),
                       data = prostate_rid)
print(fit.all.splines)
summary(fit.all.splines)
```

```{r message=FALSE,warning=FALSE}
an.s <- anova(fit.all.splines)
an.s
plot(an.s)
```

Let's check what happens to the model's performance:

```{r message=FALSE,warning=FALSE}
s.splines <- fit.all.splines$stats  # performance of the estimated model
round(s.splines, digits=3)
# C is the AUC and Dxy is related to C (Dxy=2*(C-0.5)); both are discrimination measures
gamma.hat.splines <- (s.splines['Model L.R.'] - s.splines['d.f.'])/s.splines['Model L.R.']
gamma.hat.splines
```

The van Houwelingen-Le Cessie heuristic shrinkage estimate is 0.79, indicating that this model will validate on new data about 21% worse than on this dataset (it is more complex than the previous model, so more overfitting is expected).

Let's compare the performance of the two models in terms of AIC, the Akaike Information Criterion (the smaller, the better):

```{r message=FALSE,warning=FALSE}
AIC.models <- c(AIC(fit.multi), AIC(fit.all.splines))
AIC.models
```

The simpler model, fitted to the raw data and assuming linearity for all the continuous predictors except *dtime*, has a lower AIC and a lower optimism, so we would prefer that one as a predictive tool.

## Selecting candidate predictors using backwards step-down selection

Now we use fast backward step-down (with total residual AIC as the stopping rule, which is a reasonable way to use stepwise selection) to identify the variables that explain the bulk of the cause of death.
This algorithm performs a slightly inefficient but numerically stable version of fast backward elimination on factors; it uses the fitted full model and computes approximate Wald statistics by computing conditional (restricted) maximum likelihood estimates (assuming multivariate normality of the estimates). The function prints the deletion statistics for each variable in turn, and prints approximate parameter estimates for the model after deleting variables. The approximation is better when the number of factors deleted is not large.

```{r message=FALSE,warning=FALSE}
fastbw(fit.multi)
```

In the final step the following covariates are retained: sz, ap, age and hx.

Let's estimate the reduced model:

```{r message=FALSE,warning=FALSE}
fred <- lrm(cvd ~ sz + log(ap) + age + hx, data = prostate_rid)
print(fred)
summary(fred)
```

Again, we want to evaluate the performance of the reduced model:

```{r message=FALSE,warning=FALSE}
s.red <- fred$stats
round(s.red, digits=3)
gamma.hat.red <- (s.red['Model L.R.'] - s.red['d.f.'])/s.red['Model L.R.']
gamma.hat.red
```

The van Houwelingen-Le Cessie heuristic shrinkage estimate is now 0.96, indicating that this model will validate on new data about 4% worse than on this dataset (it is simpler than the full model, so there is less overfitting).

Let us now use a bootstrap approach to further evaluate the calibration and discrimination of this reduced model:

```{r message=FALSE,warning=FALSE}
fred <- update(fred, x=TRUE, y=TRUE)
vred <- validate(fred, B=200)
vred
```

The slope shrinkage, as already seen, is around 4%. There is a slight drop-off in all indexes. The estimated likely future predictive discrimination of the model, as measured by Somers' Dxy, changes from 0.68 to 0.66. The latter estimate is the one that should be claimed when describing model performance.
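To make the idea of an optimism-corrected index more concrete, here is a hand-written sketch of the optimism bootstrap on simulated data. It is an assumption-laden illustration, not the code behind `validate()`: it uses base `glm()`, a hypothetical three-predictor model, and the helper names `dxy_from_risk` and `fit_boot_model` are invented for this example.

```r
set.seed(2024)
n_obs <- 150
dat_sim <- data.frame(x1 = rnorm(n_obs), x2 = rnorm(n_obs), x3 = rnorm(n_obs))
# Assumed truth: only x1 is related to the outcome
dat_sim$y <- rbinom(n_obs, 1, plogis(-0.5 + 0.8 * dat_sim$x1))

# Somers' Dxy from predicted risks, via the rank-based C index
dxy_from_risk <- function(risk, event) {
  n1 <- sum(event == 1)
  n0 <- sum(event == 0)
  c_index <- (sum(rank(risk)[event == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  2 * (c_index - 0.5)
}

fit_boot_model <- function(d) glm(y ~ x1 + x2 + x3, family = binomial, data = d)

# Apparent performance: model fitted and evaluated on the same data
dxy_apparent <- dxy_from_risk(fitted(fit_boot_model(dat_sim)), dat_sim$y)

# Optimism: performance in each bootstrap sample minus the performance
# of that same bootstrap model when applied to the original data
B <- 200
optimism <- replicate(B, {
  boot <- dat_sim[sample(n_obs, replace = TRUE), ]
  fb <- fit_boot_model(boot)
  dxy_boot <- dxy_from_risk(fitted(fb), boot$y)
  dxy_orig <- dxy_from_risk(predict(fb, newdata = dat_sim, type = "response"), dat_sim$y)
  dxy_boot - dxy_orig
})

dxy_corrected <- dxy_apparent - mean(optimism)
c(apparent = dxy_apparent, optimism = mean(optimism), corrected = dxy_corrected)
```

The corrected value plays the same role as the "index.corrected" column printed by `validate()`: it is the apparent index minus the average optimism estimated over the bootstrap replicates.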
A nearly unbiased estimate of the future calibration of the stepwise-derived model is given below:

```{r message=FALSE,warning=FALSE}
fred <- update(fred, x=TRUE, y=TRUE)
cal <- calibrate(fred, B=200)
plot(cal)
```

The plot shows the bootstrap overfitting-corrected calibration curve estimate for the backwards step-down cause of death logistic model, along with a rug plot showing the distribution of predicted risks. The smooth nonparametric calibration estimator (loess) is used. The estimated mean absolute error between estimated probabilities and observed events is 0.013: quite good!

For comparison, let us now consider a bootstrap validation of the full model, without any variable selection:

```{r message=FALSE,warning=FALSE}
fit.multi <- update(fit.multi, x=TRUE, y=TRUE)
vfull <- validate(fit.multi, B=200)
vfull
```

Compared to the validation of the full model, the step-down model has much less optimism, as expected, even if it has a smaller Dxy (discrimination) due to the loss of information from removing moderately important variables.

Let's now explore the calibration of the full model:

```{r message=FALSE,warning=FALSE}
cal.full <- calibrate(fit.multi, B=200)
plot(cal.full)
```

The estimated mean absolute error between estimated probabilities and observed events is 0.04: the bias-corrected performance is worse than for the reduced model.

We could also test the difference between the two models in terms of AUC (i.e. the discrimination power). Note that the reduced model has fewer missing values (in any case, the percentage of missing values in this dataset is negligible).
Let's extract the probabilities predicted by the models:

```{r}
full.pred <- predict(fit.multi, type="fitted")
rid.pred <- predict(fred, type="fitted")
dati.pp <- data.frame(prostate_rid, full.pred, rid.pred)
```

And perform the test:

```{r message=FALSE,warning=FALSE}
roc.test(response = dati.pp$cvd, predictor1 = dati.pp$full.pred, predictor2 = dati.pp$rid.pred)
```

We observe, as expected, a significantly better discrimination for the full model, but remember that this model is more prone to overfitting (the bootstrap-corrected calibration is worse...) and that these measures are derived from the "training-all" dataset, so they would also need to be re-assessed on *new* data.

```{r message=FALSE,warning=FALSE}
roc.full <- roc(dati.pp$cvd, dati.pp$full.pred)
roc.red <- roc(dati.pp$cvd, dati.pp$rid.pred)
plot(roc.full)
plot(roc.red, add=TRUE, col="red")
legend("bottomright", legend=c("Full model", "Reduced model"),
       lty=c(1,1), col=c("black", "red"), bty="n")
title("Discrimination on the development sample")
```

## Another threshold for selecting candidate predictors

We can try to be less conservative in estimating a reduced model, in order to increase the discrimination, by using a different significance level for a variable to stay in the model and by using individual approximate Wald tests rather than tests combining all deleted variables:

```{r message=FALSE,warning=FALSE}
vrid2 <- validate(fit.multi, bw=TRUE, sls=0.5, type='individual', B=200)
```

Now let's estimate a second reduced model:

```{r message=FALSE,warning=FALSE}
fred2 <- lrm(cvd ~ sz + sg + log(ap) + dbp + age + hx + pf_rec + rx + rcs(dtime,4),
             data = prostate_rid)
fred2
print(fred2)
summary(fred2)
```

Let's look at the performance of this alternative reduced model:

```{r message=FALSE,warning=FALSE}
fred2 <- update(fred2, x=TRUE, y=TRUE)
vred2 <- validate(fred2, B=200)
vred2
```

The performance statistics are now midway between the full model and the smaller stepwise model.
```{r message=FALSE,warning=FALSE}
cal.red2 <- calibrate(fred2, B=200)
plot(cal.red2)
```

The mean absolute error is 0.03. Let's compare the AUCs:

```{r message=FALSE,warning=FALSE}
rid2.pred <- predict(fred2, type="fitted")
dati.pp2 <- data.frame(dati.pp, rid2.pred)
roc.test(response = dati.pp2$cvd, predictor1 = dati.pp2$full.pred, predictor2 = dati.pp2$rid2.pred)
```

On the development sample, the discrimination power is comparable to that of the full model, but with a better calibration.

```{r message=FALSE,warning=FALSE}
roc.full <- roc(dati.pp2$cvd, dati.pp2$full.pred)
roc.red2 <- roc(dati.pp2$cvd, dati.pp2$rid2.pred)
plot(roc.full)
plot(roc.red2, add=TRUE, col="red")
legend("bottomright", legend=c("Full model", "Reduced 2 model"),
       lty=c(1,1), col=c("black", "red"), bty="n")
title("Discrimination on the development sample")
```

## Final decision!

Finally, we decide to present our estimated prediction model to the clinicians. Let's prepare a visualization that could be helpful for the discussion with them; we rename some levels of the categorical variables:

```{r message=FALSE,warning=FALSE}
prostate_rid$hxf <- as.factor(prostate_rid$hx)
levels(prostate_rid$hxf) <- c("No", "Yes")
label(prostate_rid$hxf) <- "History of CV disease"
options(datadist='dd')
dd <- datadist(prostate_rid)
dd
```

We estimate the final model selected:

```{r message=FALSE,warning=FALSE}
fredfin <- lrm(cvd ~ sz + sg + log(ap) + dbp + age + hxf + pf_rec + rx + rcs(dtime,4),
               data = prostate_rid)
fredfin
```

```{r message=FALSE,warning=FALSE}
nom <- nomogram(fredfin, ap=c(.1, .5, 1, 5, 10, 20, 30, 40),
                fun=plogis, funlabel="Probability ",
                lp=TRUE,
                fun.at=c(.01,.1,.25,.5,.75,.95))
nom
```

This is a useful tool to visualize a prediction model and discuss it with clinicians:

```{r message=FALSE,warning=FALSE}
plot(nom, cex.axis=.75)
```

# Simple Linear Regression (a useful recap)

First of all, install and load the required libraries: *rms*, *Hmisc* and *epiDisplay*.
```{r warning=FALSE,message=FALSE}
library(rms)
library(Hmisc)
library(epiDisplay)
```

Let's simulate a sample of n=100 points and plot it along with the population linear regression line. The conditional distribution of y|x can be thought of as a vertical slice at x. The unconditional distribution of y is shown on the y-axis. To envision the conditional normal distributions assumed for the underlying population, think of a bell-shaped curve *coming out* of the page, with its base along one of the vertical lines of points. The equal variance assumption (*homoscedasticity*) dictates that the series of Gaussian curves for all the different x values have equal variances.

```{r}
n <- 100
set.seed(13)
x <- round(rnorm(n, .5, .25), 1)
y <- x + rnorm(n, 0, .1)
r <- c(-.2, 1.2)
```

Plot:

```{r}
plot(x, y, axes=FALSE, xlim=r, ylim=r, xlab=expression(x), ylab=expression(y))
axis(1, at=r, labels=FALSE)
axis(2, at=r, labels=FALSE)
abline(a=0, b=1)
histSpike(y, side=2, add=TRUE)
abline(v=.6, lty=2)
```

Simple linear regression is used when:

* Only 2 variables are of interest
* One variable is a response (continuous scale) and one is a predictor
* The mean of the dependent variable is a quantity of interest [otherwise explore, for example, quantile regression]
* No adjustment is needed for confounding or other between-subject variation
* The investigator is interested in assessing the strength of the relationship between x and y in real data units, or in predicting y from x
* A linear relationship is assumed (visual inspection is strongly recommended...)
* It is not the tool of choice when one only needs to test for association (in that case use Pearson's or Spearman's correlation)

## Interval Estimation: evaluating the uncertainty about predictions

The estimation of the confidence intervals (CI) for predictions depends on what you want to predict: a value at the *individual* level or a mean at the *population* level.
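The difference between the two kinds of interval can also be sketched with base R alone, using `predict()` on an `lm` fit with `interval = "confidence"` (mean level) or `interval = "prediction"` (individual level). The data are simulated and the names `x_ci`, `y_ci`, `fit_ci` and `new_x` are hypothetical.

```r
set.seed(7)
x_ci <- runif(30, 0, 10)
y_ci <- 2 + 3 * x_ci + rnorm(30, sd = 4)   # assumed true line and noise level
fit_ci <- lm(y_ci ~ x_ci)

new_x <- data.frame(x_ci = c(2, 5, 8))
ci_mean <- predict(fit_ci, new_x, interval = "confidence")  # uncertainty about the mean of y at x
ci_ind  <- predict(fit_ci, new_x, interval = "prediction")  # uncertainty about a single new y at x

# The individual-level interval also includes the residual variance,
# so it is wider than the mean-level interval at every x
cbind(x = new_x$x_ci,
      width_mean = ci_mean[, "upr"] - ci_mean[, "lwr"],
      width_individual = ci_ind[, "upr"] - ci_ind[, "lwr"])
```

The point estimates are identical in the two cases; only the width of the band changes.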
```{r}
x1 <- c( 1,  3,  5,  6,  7,  9,  11)
y  <- c( 5, 10, 70, 58, 85, 89, 135)
dd <- datadist(x1, n.unique=5); options(datadist='dd')
f <- ols(y ~ x1)
p1 <- Predict(f, x1=seq(1,11,length=100), conf.type='mean')
p2 <- Predict(f, x1=seq(1,11,length=100), conf.type='individual')
p <- rbind(Mean=p1, Individual=p2)
ggplot(p, legend.position='none') +
  geom_point(aes(x1, y), data=data.frame(x1, y, .set.=''))
```

Example usages:

* Is a child of age x smaller than predicted for her age? Use the *individual level*, p2 (wider bands)
* What is the best estimate of the *population mean* blood pressure for patients on treatment A? Use the *mean population level*, p1 (narrower bands)

## Assessing the Goodness of Fit

It is crucial to verify the assumptions underlying a linear regression model:

* In a scatterplot the spread of y about the fitted line should be constant as x increases, and y vs. x should appear linear
* This is easier to see with a plot of residuals vs estimated values
* In this plot there should be no systematic patterns (no trend in central tendency, no change in spread of points with x)
* A trend in central tendency indicates failure of linearity
* A qqnorm plot of the residuals is a useful tool

Here is an example: we fit a linear regression model where x and y should instead have been log transformed:

```{r}
n <- 50
set.seed(2)
res <- rnorm(n, sd=.25)
x <- runif(n)
y <- exp(log(x) + res)
f <- ols(y ~ x)
plot(fitted(f), resid(f))
```

This plot depicts non-constant variance of the residuals, which might call for transforming y.

Now, we fit a linear model that should have been quadratic (functional form of X):

```{r}
x <- runif(n, -1, 1)
y <- x ^ 2 + res
f <- ols(y ~ x)
plot(fitted(f), resid(f))
```

Finally, we fit a correct model:

```{r}
y <- x + res
f <- ols(y ~ x)
plot(fitted(f), resid(f))
qqnorm(resid(f)); qqline(resid(f))
```

These plots show the ideal situation of white noise (no trend, constant variance).
The qq plot demonstrates approximate normality of the residuals, for a sample of size n = 50.

## Application example of simple linear regression: Hookworm & blood loss

The dataset concerns the relationship between hookworm and blood loss from a study conducted in 1970.

```{r}
data(Suwit)
summary(Suwit)
des(Suwit)
```

```{r}
summ(Suwit)
attach(Suwit)
```

The file is clean and ready for analysis (this happens here only for didactic purposes: in real life, you will usually spend a couple of hours at minimum, if not days, *cleaning* datasets). For example, with this small sample size it is straightforward to verify that there is no repetition of 'id' and no missing values. The records have been sorted in ascending order of 'worm' (number of worms), ranging from 32 in the first subject to 1929 in the last one. Blood loss ('bloss') is, however, not sorted. The 13th record has the highest blood loss of 86 ml per day, which is very high. The objective of this analysis is to predict blood loss using worm.

First of all, take a look at the data:

```{r}
plot(worm, bloss, xlab="No. of worms", ylab="ml. per day",
     main = "Blood loss by number of hookworms in the bowel")
```

A linear model using the above two variables seems reasonable.

```{r}
lm1 <- lm(bloss ~ worm, data=Suwit)
lm1
```

Displaying the model by typing 'lm1' gives limited information (essentially, the estimated regression line coefficients). To get more information, one can look at the attributes of this model, its summary and the attributes of its summary.

```{r}
attr(lm1, "names")
```

```{r}
summary(lm1)
```

The first section of the summary shows the formula that was 'called'. The second section gives the distribution of the residuals. The pattern is clearly not symmetric. The maximum is too far to the right (34.38) compared to the minimum (-15.84), and the first quartile (-10.81) is further from the median (0.75) than the third quartile (4.35) is. Otherwise, the median is close to zero.
The third section gives the coefficients of the intercept and the effect of 'worm' on blood loss. The intercept is 10.8, meaning that when there are no worms, the blood loss is estimated to be 10.8 ml per day. This is, however, not significantly different from zero, as the P value is 0.0618. The coefficient of 'worm' is 0.04, indicating that each worm is associated with an average increase of 0.04 ml of blood loss per day. Although the value is small, it is highly significantly different from zero. When there are many worms, the level of blood loss can be very substantial.

The multiple R-squared value of 0.716 indicates that 71.6% of the variation in the data is explained by the model. The adjusted value is 0.6942. (The calculation of R-squared is discussed in the analysis of variance section below.) The last section describes more details of the residuals and the hypothesis test on the effect of 'worm' using the F-statistic. The P value from this section (0.0000699) is equal to that obtained from the t-distribution in the coefficient section. This F-test more commonly appears in the analysis of variance table.

### Analysis of variance table, R-squared and adjusted R-squared

```{r}
summary(aov(lm1))
```

The above analysis of variance (aov) table breaks down the degrees of freedom, sum of squares and mean square of the outcome (blood loss) by source (in this case there are only two: worm and residuals). The so-called 'square' is actually the square of the difference between a value and the mean. The total sum of squares of blood loss is therefore:

```{r}
SST <- sum((bloss-mean(bloss))^2)
SST
```

The sum of squares from residuals is:

```{r}
SSR <- sum(residuals(lm1)^2)
SSR
```

See also the analysis of variance table. The sum of squares of worm, or sum of squares of the differences between the fitted values and the grand mean, is:

```{r}
SSW <- sum((fitted(lm1)-mean(bloss))^2)
SSW
```

The latter two sums add up to the first one.
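This additivity holds for any least-squares fit that includes an intercept, and it can be checked numerically. The sketch below uses simulated data with hypothetical names (`x_toy`, `y_toy`, `fit_toy`), so it does not depend on the Suwit dataset being attached.

```r
set.seed(3)
x_toy <- rnorm(40)
y_toy <- 1 + 2 * x_toy + rnorm(40)   # assumed true line plus noise
fit_toy <- lm(y_toy ~ x_toy)

ss_total <- sum((y_toy - mean(y_toy))^2)            # total sum of squares
ss_resid <- sum(residuals(fit_toy)^2)               # residual sum of squares
ss_model <- sum((fitted(fit_toy) - mean(y_toy))^2)  # sum of squares explained by the model

# Total = residual + model, up to floating-point error
all.equal(ss_total, ss_resid + ss_model)
```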
The R-squared is the ratio of the sum of squares of the fitted values to the total sum of squares.

```{r}
SSW/SST
```

Instead of the sum of squares, one may consider the mean square as the level of variation. In such a case, the number of worms can reduce the total mean square (or variance) by: (total mean square - residual mean square) / total mean square, or (variance - residual mean square) / variance.

```{r}
resid.msq <- sum(residuals(lm1)^2)/lm1$df.residual
Radj <- (var(bloss) - resid.msq)/var(bloss)
Radj
```

This is the adjusted R-squared shown in *summary(lm1)* in the above section.

### F-test

When the mean square of 'worm' (equal to SSW, since 'worm' has 1 degree of freedom) is divided by the mean square of the residuals, the result is:

```{r}
F <- SSW/resid.msq; F
```

Using this F value with the two corresponding degrees of freedom (from 'worm' and residuals), the P value for testing the effect of 'worm' can be computed.

```{r}
pf(F, df1=1, df2=13, lower.tail=FALSE)
```

The function *pf* is used to compute a P value from a given F value together with the two values of the degrees of freedom. The last argument 'lower.tail' is set to FALSE to obtain the area in the right tail of the F distribution.

In summary, both the regression and the analysis of variance give the same conclusion: the number of worms has a significant linear relationship with blood loss. Now the regression line can be drawn.

### Regression line, fitted values and residuals

A regression line can be added to the scatter plot with the following commands:

```{r}
plot(worm, bloss, xlab="No. of worms", ylab="ml. per day",
     main = "Blood loss by number of hookworms in the bowel", type="n")
abline(lm1)
points(worm, fitted(lm1), pch=18, col="blue")
segments(worm, bloss, worm, fitted(lm1), col="red")
```

The regression line has an intercept of 10.8 and a slope of 0.04.
The expected value is the value of blood loss estimated from the regression line for a specific value of 'worm'. A residual is the difference between the observed and the expected value. The residuals are drawn in the plot as the red segments. The actual values of the residuals can be extracted from the fitted linear model.

```{r}
lm1.res <- residuals(lm1)
hist(lm1.res)
```

### Checking the normality of residuals

The histogram of the residuals is not entirely convincing about the normality of their distribution. However, with such a small sample size, it is difficult to draw any conclusion. A better way to check normality is to plot the residuals against the expected normal score, or (residual - mean) / standard deviation. A reasonably straight line would indicate normality. Moreover, a test for normality can be computed.

```{r}
a <- qqnorm(lm1.res)
shapiro.qqnorm(lm1.res, type="n")
text(a$x, a$y, labels=as.character(id))
```

If the residuals were perfectly normally distributed, the text symbols would lie along the straight dotted line. The graph suggests that the largest residual (13th) is too high (positive), whereas the smallest value (7th) is not large enough (negative). However, the P value from the Shapiro-Wilk test is 0.08, suggesting that the hypothesis of normally distributed residuals cannot be rejected.

Finally, the residuals are plotted against the fitted values to see if there is a pattern.

```{r}
plot(fitted(lm1), lm1.res, xlab="Fitted values")
plot(fitted(lm1), lm1.res, xlab="Fitted values", type="n")
text(fitted(lm1), lm1.res, labels=as.character(id))
abline(h=0, col="blue")
```

There is no obvious pattern. The residuals are quite independent of the expected values. With this and the above findings from the *qqnorm* command we may conclude that the residuals are randomly and normally distributed.
The above two diagnostic plots for the model 'lm1' can also be obtained from: ```{r} par(mfrow=c(1,2)) plot(lm1, which=1:2) detach(Suwit) ``` ### Final conclusion From the analysis, it is clear that blood loss is associated with number of hookworms. On average, each additional worm is associated with an increase of 0.04 ml of blood loss. The remaining uncertainty of blood loss, apart from hookworm, is explained by random variation or other factors that were not measured. ## Visualizing relationships/functional forms Remember that *linear regression* always means linearity in the parameters, irrespective of linearity in explanatory variables (i.e. the functional forms selected for the covariates). Y is *linearly* related to X (in the parameters) if the rate of change of Y with respect to X (dY/dX) is independent of the value of X. A function Y=BX=$b_{1}$$x_{1}$ + $b_{2}$$x_{2}$ is said to be linear (in the parameters), say, $b_{1}$, if $b_{1}$ appears with a power of 1 only and is not multiplied or divided by any other parameter (for eg $b_{1}$ x $b_{2}$ , or $b_{2}$ / $b_{1}$). This is a different concept with respect to the linearity in the functional form of the covariates, that is not instead required. Moreover, some regression models may look non linear in the parameters but are inherently or intrinsically linear. This is because with *suitable transformations* they can be made linear in parameters. Just as an example, imagine that we want to investigate the effect of total CSF polymorph count on blood glucose ratio in patients with either bacterial or viral meningitis. ```{r} require(Hmisc) getHdata(abm) ``` As a first step in every statistical analysis, we should *visualize* data of interest. In this example, we use a nonparametric regression approach to explore the relationship between the two variables: *plsmo* is a loess nonparametric smoother. 
```{r}
with(ABM, {
  glratio <- gl / bloodgl
  tpolys <- polys * whites / 100
  plsmo(tpolys, glratio, xlab='Total Polymorphs in CSF',
        ylab='CSF/Blood Glucose Ratio',
        xlim=quantile(tpolys, c(.05,.95), na.rm=TRUE),
        ylim=quantile(glratio, c(.05,.95), na.rm=TRUE))
  scat1d(tpolys); scat1d(glratio, side=4)
})
```

Moreover, we can also use a *Super smoother* relating age to the probability of bacterial meningitis given that a patient has bacterial or viral meningitis (a binary outcome, see later in this block for logistic regression examples), with a rug plot showing the age distribution:

```{r}
with(ABM, {
  plsmo(age, abm, 'supsmu', bass=7,
        xlab='Age at Admission, Years',
        ylab='Proportion Bacterial Meningitis')
  scat1d(age)
})
```

We can use these nonparametric approaches to help find a suitable *functional form* for the candidate predictor (when switching to parametric regression models), which could be very different from a linear effect. For example, we can decide to use *splines* to model the effect of a continuous predictor on an outcome. For those interested in splines see: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-019-0666-3

# Multiple linear regression (MLR)

Datasets usually contain many variables collected during a study. It is often useful to see the relationship between two variables within the different levels of a third, categorical variable, i.e. to check for the presence of an interaction. (This is just a didactic example; usually more than 3 variables are involved in MLR models.)

## Example: Systolic blood pressure

A small survey on blood pressure was carried out. The objective is to assess the hypertensive effect of adding table salt to meals. The gender of the subjects was also recorded.

```{r}
data(BP)
attach(BP)
des(BP)
```

Note that the maximum systolic and diastolic blood pressures are quite high. There are 20 missing values in *saltadd*.
The frequencies of the categorical variables *sex* and *saltadd* are now inspected. ```{r} describe(data.frame(sex, saltadd)) ``` The next step is to create a new age variable from birthdate. The calculation is based on 12th March 2001, the date of the survey (date of entry in the study). ```{r} age.in.days <- as.Date("2001-03-12") - birthdate ``` There is a leap year in every four years. Therefore, an average year will have 365.25 days. ```{r} class(age.in.days) age <- as.numeric(age.in.days)/365.25 ``` The function *as.numeric* is needed to transform the units of age (difftime); otherwise modelling would not be possible. ```{r} describeBy(sbp,saltadd) ``` ### Recoding missing values into another category The missing value group has the highest median and average systolic blood pressure. In order to create a new variable with three levels type: ```{r} saltadd1 <- saltadd levels(saltadd1) <- c("no", "yes", "missing") saltadd1[is.na(saltadd)] <- "missing" summary(saltadd1) ``` ```{r} summary(aov(age ~ saltadd1)) ``` Since there is not enough evidence that the missing group is important and for additional reasons of simplicity, we will assume MCAR (*missing completely at random*) and we will ignore this group and continue the analysis with the original *saltadd* variable consisting of only two levels. Before doing this however, a simple regression model and regression line are first fitted. ```{r} lm1 <- lm(sbp ~ age) summary(lm1) ``` Although the R-squared is not very high, the p value is small indicating important influence of age on systolic blood pressure. A scatterplot of age against systolic blood pressure is now shown with the regression line added using the *abline* function. This function can accept many different argument forms, including a regression object. 
If this object has a *coef* method, and it returns a vector of length 1, then the value is taken to be the slope of a line through the origin, otherwise the first two values are taken to be the intercept and slope, as is the case for *lm1*. ```{r} plot(age, sbp, main = "Systolic BP by age", xlab = "Years", ylab = "mm.Hg") abline(lm1) ``` Subsequent exploration of residuals suggests a non-significant deviation from normality and no pattern. Details of this can be adopted from the techniques already discussed and are omitted here. The next step is to provide different plot patterns for different groups of salt habits. Note that here the sample size is lower (we decided to omit missing values for *saltadd*, reducing the analysis to the complete cases): ```{r} lm2 <- lm(sbp ~ age + saltadd) summary(lm2) ``` On the average, a one year increment of age is associated with an increase in systolic blood pressure by 1.5 mmHg. Adding table salt increases systolic blood pressure significantly by approximately 23 mmHg. ```{r} plot(age, sbp, main="Systolic BP by age", xlab="Years", ylab="mm.Hg", type="n") points(age[saltadd=="no"], sbp[saltadd=="no"], col="blue") points(age[saltadd=="yes"], sbp[saltadd=="yes"], col="red",pch = 18) ``` Note that the red dots corresponding to those who added table salt are higher than the blue circles. The final task is to draw two separate regression lines for each group. We now have two regression lines to draw, one for each group. The intercept for non-salt users will be the first coefficient and for salt users will be the first plus the third. The slope for both groups is the same. 
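The intercepts just described can be read off the model equation. As a sketch, writing $\beta_0$, $\beta_1$ and $\beta_2$ for `coef(lm2)[1]`, `coef(lm2)[2]` and `coef(lm2)[3]` (these symbols are notation introduced here for illustration, they do not appear in the R output):

$$E[\text{sbp}] = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{saltadd}_{\text{yes}}$$

$$\text{non-adders: } E[\text{sbp}] = \beta_0 + \beta_1\,\text{age} \qquad \text{adders: } E[\text{sbp}] = (\beta_0 + \beta_2) + \beta_1\,\text{age}$$

The two lines are parallel: they share the slope $\beta_1$ and differ only by the vertical shift $\beta_2$.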
```{r}
a0 <- coef(lm2)[1]
a1 <- coef(lm2)[1] + coef(lm2)[3]
b <- coef(lm2)[2]
```

```{r}
plot(age, sbp, main="Systolic BP by age", xlab="Years", ylab="mm.Hg", type="n")
points(age[saltadd=="no"], sbp[saltadd=="no"], col="blue")
points(age[saltadd=="yes"], sbp[saltadd=="yes"], col="red",pch = 18)
abline(a = a0, b = b, col = "blue")
abline(a = a1, b = b, col = "red")
```

Note that the X-axis does not start at zero, so the intercepts are outside the plot frame. The red line is for the red points of salt adders and the blue line is for the blue points of non-adders. In this model, age is assumed to have a constant effect on systolic blood pressure, independently of added salt. But look at the distributions of the points of the two colours: the red points are higher than the blue ones, but mainly on the right half of the graph. To fit lines with different slopes (that is, a different 'b' in the *abline* arguments for each line), the model needs an *interaction term* between *saltadd* and *age*.

```{r}
lm3 <- lm(sbp ~ age * saltadd)
summary(lm3)
```

For the intercept of the salt users, the second and the fourth terms are both zero (since age is zero) but the third must be kept. This term is negative. The intercept of salt users is therefore lower than that of the non-users.

```{r}
a0 <- coef(lm3)[1]
a1 <- coef(lm3)[1] + coef(lm3)[3]
```

For the slope of the non-salt users, the second coefficient alone is enough, since the first and the third do not change with each unit increment of age and the fourth term has *saltadd* equal to 0. The slope for the salt users group is the sum of the second and the fourth coefficients, since *saltaddyes* is 1.
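The intercepts and slopes described above follow from writing out the interaction model. As a sketch, let $\beta_0,\dots,\beta_3$ denote `coef(lm3)[1]` to `coef(lm3)[4]` (notation introduced here for illustration only):

$$E[\text{sbp}] = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{saltadd}_{\text{yes}} + \beta_3\,(\text{age}\times\text{saltadd}_{\text{yes}})$$

$$\text{non-adders: } \beta_0 + \beta_1\,\text{age} \qquad \text{adders: } (\beta_0+\beta_2) + (\beta_1+\beta_3)\,\text{age}$$

Subtracting the two lines gives the expected difference between adders and non-adders at a given age $a$, namely $\beta_2 + \beta_3\,a$, which is why the gap between the groups changes with age whenever $\beta_3 \neq 0$.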
```{r}
b0 <- coef(lm3)[2]
b1 <- coef(lm3)[2] + coef(lm3)[4]
```

```{r}
plot(age, sbp, main="Systolic BP by age", xlab="Years",ylab="mm.Hg", pch=18, col=as.numeric(saltadd))
abline(a = a0, b = b0, col = 1)
abline(a = a1, b = b1, col = 2)
legend("topleft", legend = c("Salt added", "No salt added"),lty=1, col=c("red","black"))
```

Note that *as.numeric(saltadd)* converts the factor levels into the integers 1 (black) and 2 (red), representing the non-salt adders and the salt adders, respectively. These colour codes come from the R colour palette.

This model suggests that at young ages the systolic blood pressure of the two groups is not much different, as the two lines are close together on the left of the plot. For example, at the age of 25, the difference is 5.7mmHg. Increasing age increases the difference between the two groups. At 70 years of age, the difference is as great as 38mmHg. In this respect, age *modifies* the effect of adding table salt on blood pressure.

On the other hand, the slope of age is 1.24mmHg per year among those who did not add salt but becomes 1.24+0.72 = 1.96mmHg among the salt adders. Thus, salt adding *modifies* the effect of age. Note that interaction is a statistical term whereas effect modification is the equivalent epidemiological term.

Note also that the coefficient of the interaction term *age:saltaddyes* is not statistically significant! This means that the two slopes may differ just *by chance* (or that we are under-powered to detect a significant interaction). This was in fact just a didactic example to show how to introduce an interaction term in a regression model.

# Causal inference versus prediction

Conceptually, in prediction we make *comparisons* between outcomes across different combinations of values of the input variables to predict the probability of an outcome. In causal inference, we ask *what would happen* to an outcome y *as a result of a treatment or intervention*.
Predictive inference relates to comparisons between units (different groups/subjects). Causal inference addresses comparisons of *different treatments* when applied to the *same* unit. ## The FEV dataset The FEV, which is an acronym for Forced Expiratory Volume, is a measure of how much air a person can exhale (in liters) during a forced breath. In this dataset, the FEV of 606 children, between the ages of 6 and 17, were measured. The dataset also provides additional information on these children: their age, their height, their gender and, most importantly, whether the child is a smoker or a non-smoker (the exposure of interest). This is an observational cross-sectional study. The goal of this study was to find out whether or not smoking has an effect on the FEV of children. Load the required libraries ```{r, message = FALSE} library(tidyverse) library(DataExplorer) library(SmartEDA) library(ggplot2) library(ggstatsplot) ``` Set the working directory and import data ```{r message=FALSE} fev <- read_tsv(here("datasets","fev.txt")) head(fev) ``` There are a few things in the formatting of the data that can be improved upon: 1. Both the `gender` and `smoking` can be transformed to factors. 2. The `height` variable is written in inches. Inches are hard to interpret. Let's add a new column, `height_cm`, with the values converted to centimeter using the `mutate` function. (For this example we will not use this variable however). ```{r} fev <- fev %>% mutate(gender = as.factor(gender)) %>% mutate(smoking = as.factor(smoking)) %>% mutate(height_cm = height*2.54) head(fev) ``` ## Data Exploration Now, let's make a first explorative boxplot, showing only the FEV for both smoking categories. 
```{r message=FALSE, warning=FALSE}
fev %>%
  ggplot(aes(x=smoking,y=fev,fill=smoking)) +
  scale_fill_manual(values=c("dimgrey","firebrick")) +
  theme_bw() +
  geom_boxplot(outlier.shape=NA) +
  geom_jitter(width = 0.2, size=0.1) +
  ggtitle("Boxplot of FEV versus smoking") +
  ylab("fev (l)") +
  xlab("smoking status")
```

Did you expect these results? It appears that children who smoke have a higher median FEV than children who do not smoke. Should we change legislation worldwide and make smoking obligatory for children?

Maybe there is something else going on in the data. Now, we will generate a similar plot, but we will stratify the data based on age (age as factor).

```{r message=FALSE, warning=FALSE}
fev %>%
  ggplot(aes(x=as.factor(age),y=fev,fill=smoking)) +
  geom_boxplot(outlier.shape=NA) +
  # the jitter width belongs to the position adjustment, not to geom_point()
  geom_point(size = 0.1, position = position_jitterdodge(jitter.width = 0.2)) +
  theme_bw() +
  scale_fill_manual(values=c("dimgrey","firebrick")) +
  ggtitle("Boxplot of FEV versus smoking, stratified on age") +
  ylab("fev (l)") +
  xlab("Age (years)")
```

This plot already gives us a more plausible picture. First, it seems that we do not have any smoking children of ages 6, 7 or 8. Second, when looking at the results per age "category", it no longer seems to be the case that smokers have a much higher FEV than non-smokers; for the higher ages, the contrary seems true.

This shows that taking confounders into account (age, in this case) is crucial!
If we simply analyse the dataset based on the smoking status and FEV values only, our inference might be incorrect:

```{r}
fit1 <- lm(fev~smoking, data=fev) # wrong beta0star, beta1star
fit2 <- lm(fev~age+smoking, data=fev) # "true" beta0, beta2, beta1
fitage <- lm(age~smoking, data=fev) # gamma0, gamma1
fit1
fit2
fitage
```

```{r}
beta0 <- coef(fit2)[1]
beta2 <- coef(fit2)[2]
beta1 <- coef(fit2)[3]
gamma0 <- coef(fitage)[1]
gamma1 <- coef(fitage)[2]
beta0star <- coef(fit1)[1]
beta1star <- coef(fit1)[2]
```

Check that `beta0star = beta0 + beta2*gamma0`:

```{r}
beta0 + beta2*gamma0
beta0star
```

and also check that the (wrong) `beta1star = beta1 + beta2*gamma1`:

```{r}
beta1 + beta2*gamma1
beta1star
```

Therefore, an apparently beneficial smoking effect is estimated when age is ignored. *If* the causal inference assumptions hold (see slides) we can consider *beta1* as the average smoking effect in the population under study.

## Factors that affect causal inference estimates

Imbalance and lack of complete overlap can make causal inference difficult. Recall: imbalance is when treatment groups differ with respect to an important covariate. Lack of complete overlap is when some combination of treatment level and covariate level is missing (no observations, a violation of the positivity assumption).
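A lack of overlap can be spotted with a simple cross-tabulation of exposure by covariate level. The following is a minimal sketch on a small made-up dataset (the data frame `toy` and its columns are hypothetical and only mimic the structure of the problem; it is not the FEV data):

```{r}
# Hypothetical toy data (not the FEV dataset): 20 children per age,
# with smokers present only from age 9 onwards
toy <- expand.grid(age = 6:12, rep = 1:20)
toy$smoker <- ifelse(toy$age >= 9 & toy$rep <= 6, 1, 0)

# Cross-tabulate exposure by covariate level
tab <- table(smoker = toy$smoker, age = toy$age)
tab

# Covariate levels where at least one exposure group has no observations
no_overlap <- colnames(tab)[apply(tab == 0, 2, any)]
no_overlap
```

Any age listed in `no_overlap` is a level at which exposed and unexposed subjects cannot be compared directly, so a model has to extrapolate there.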
To explain fev, sex seems to matter, especially among older individuals:

```{r}
plot(fev~age, col=gender, data=fev)
legend("topleft", pch=1, col=1:2, levels(fev$gender))
```

Another way to plot the same thing is:

```{r message=FALSE, warning=FALSE}
fev %>%
  ggplot(aes(x=as.factor(age),y=fev,fill=smoking)) +
  geom_boxplot(outlier.shape=NA) +
  # the jitter width belongs to the position adjustment, not to geom_point()
  geom_point(size = 0.1, position = position_jitterdodge(jitter.width = 0.2)) +
  theme_bw() +
  scale_fill_manual(values=c("dimgrey","firebrick")) +
  ggtitle("Boxplot of FEV versus smoking, stratified on age and gender") +
  ylab("fev (l)") +
  xlab("Age (years)") +
  facet_grid(rows = vars(gender))
```

Especially for higher ages, the median FEV is higher for males than for females. [This could suggest a kind of *interaction* between gender and age, which could be explored in the regression model even if the interpretation of such an interaction could be quite tricky (many levels for age!)]. Moreover, there is a slight gender imbalance among age categories:

```{r}
counts <- table(fev$gender, as.factor(fev$age))
percentages <- round(prop.table(table(fev$gender, as.factor(fev$age)),2),digits=3)
barplot(percentages, main="Gender distribution", xlab="ages", col=c("pink", "darkblue"), legend = rownames(counts))
```

For imbalanced samples, simple comparisons of sample means between groups are not good estimates of the effects of treatments or risk factors. Model adjustment, where we add the covariate to the model, is of course one way to better estimate a treatment effect. In this case, for example, we can also add gender to the regression model:

```{r}
fit3 <- lm(fev~age+smoking+gender, data=fev)
summary(fit3)
```

So now the effect of smoking is estimated conditional on both age and gender and, as expected, it is a negative effect on FEV. One possible problem with these estimates, however, is that for some combinations of age, gender and smoking we do not have *any* observed data, so the extrapolation of the regression model may not be very reliable.
Also in this case (no interaction) if causal assumptions hold we can interpret the effect of smoking also as a marginal effect. ```{r} index <- fev$smoking==0 counts.noS <- table(fev[index,]$gender, as.factor(fev[index,]$age)) percentages.noS <- round(prop.table(table(fev[index,]$gender, as.factor(fev[index,]$age)),2),digits=3) index1 <- fev$smoking==1 counts.S <- table(fev[index1,]$gender, as.factor(fev[index1,]$age)) percentages.S <- round(prop.table(table(fev[index1,]$gender, as.factor(fev[index1,]$age)),2),digits=3) par(mfrow=c(1,2)) barplot(percentages.noS, main="Gender distribution among non-smokers",cex.main=0.8, xlab="ages", col=c("pink", "darkblue"), legend = rownames(percentages.noS)) barplot(percentages.S, main="Gender distribution among smokers",cex.main=0.8, xlab="ages", col=c("pink", "darkblue"), legend = rownames(percentages.S)) ``` Observe that in fact there is also a lack of “smoking” children below age 9: lack of complete overlap is when there are no observations at all for some combination(s) of treatment levels / covariate levels. For lack of complete overlap, there is no data available for some comparisons. This requires extrapolation using a model to make comparisons. This is in fact the job that the regression model does, but again extrapolation is always a risky business for the model in regions where no data are available. This is an even more serious problem than imbalance. Matching is a possible strategy in these situations to overcome (avoid) imbalance, even if some data will be discarded (see in the propensity score methods examples). 
## Supplementary material: estimating a marginal effect from a regression model with an interaction Load the required libraries ```{r, message = FALSE} library(tidyverse) library(broom) library(gtsummary) library(MatchIt) library(geepack) library(boot) ``` Let's simulate the data : first of all set the parameters of the data generating mechanism: ```{r} set.seed(0) n.obs = 10000 #set sample size #---- True parameters in outcome model b0 = 60 b1 = 5 b2 = -0.3 b3 = -0.1 b4 = 8 b5 = 3 b6 = 2 #---- True parameters in exposure odds model g0 = log(0.20/(1-0.20)) g1 = log(1.01) g2 = log(1.005) g3 = log(0.6) g4 = log(0.5) g5 = log(0.8) #Function to compute outcome values ## Use the parameters specified above mean_out <- function(C1, C2, C3, exposure){ b0 + b1*exposure + b2*C1 + b3*I(C1^2) + b4*C2 + b5*C3 + b6*exposure*C2 + rnorm(n = n.obs, mean = 0, sd = 5) } # Function to compute exposure probabilities ## Use the parameters specified above prob_exp <- function(C1, C2, C3){ exp(g0 + g1*C1 + g2*I(C1^2) + g3*C2 + g4*C3 + g5*C2*C3)/(1 + exp(g0 + g1*C1+ g2*I(C1^2) + g3*C2 + g4*C3 + g5*C2*C3)) } ``` Now simulate the dataset : ```{r} df.sim <- tibble("ID" = seq(from = 1, to = n.obs, by = 1), "C1" = rnorm(n = n.obs, mean = 0, sd = 5), "C2" = rbinom(n = n.obs, size = 1, p = 0.4), "C3" = rbinom(n = n.obs, size = 1, p = 0.3), "Pexposure" = prob_exp(C1, C2, C3), "Exposure" = rbinom(n = n.obs, size = 1,prob = Pexposure), "Outcome" = as.numeric(mean_out(C1,C2,C3, Exposure))) head(df.sim) ``` In this simulated data, A is exposure, C1 is a continuous covariate, and C2 and C3 are binary covariates. The true outcome model is specified as follows: Y=b0+b1xA+b2xC1+b3x(C1)^2+b4xC2+b5xC3+b6xAxC2+eps where eps is the error term normally distributed with variance set at 5^2. The true causal effect of exposure A among those with C2 = 0 is b1=5. The true causal effect of exposure A among those with C2 = 1 is b1+b6=7. 
The marginal effect of A is 5 x 0.6 + 7 x 0.4 = 5.8, because 40% of the total population has C2 = 1.

```{r}
# Correctly specified model
df.sim %>%
  lm(Outcome ~ Exposure*C2 + C1 + I(C1^2) + C3, data = .) %>%
  tidy(conf.int = TRUE)
```

The estimate for Exposure is 5.02. Note that this is an estimate of the conditional effect for C2 = 0, and it is nearly identical to the true value b1 = 5 because the model is correctly specified. The conditional effect for C2 = 1 is estimated to be 5.02 + 1.93 = 6.95, which again is nearly identical to the true parameter b1+b6 = 7.

One can now *standardize* the conditional effect estimates from the correctly specified multivariable regression model to get an estimate of a marginal effect:

```{r}
# Make copies of original data
df.sim.a1 <- df.sim %>% mutate(Outcome = NA, Exposure = 1) # Assign Exposure = 1 to everyone
df.sim.a0 <- df.sim %>% mutate(Outcome = NA, Exposure = 0) # Assign Exposure = 0 to everyone
df.sim.combined <- bind_rows(df.sim.a1,df.sim.a0)
#mean(df.sim.combined$"Exposure")
```

```{r}
# Fit an outcome model to the original data
## Correctly specified model
gcomp.fit <- df.sim %>%
  lm(Outcome ~ Exposure*C2 + C1 + I(C1^2) + C3, data = .)
# Predict outcome values using now the copied datasets
df.sim.combined$pred <- predict(gcomp.fit, newdata = df.sim.combined)
```

```{r}
# ATE estimate: difference between the mean predicted values for rows with A = 1
# and the mean predicted values for rows with A = 0
df.sim.combined %>%
  group_by(Exposure) %>%
  summarise(mean.Y = mean(pred)) %>%
  pivot_wider(names_from = Exposure, names_glue = "mean.Y.{Exposure}", values_from = mean.Y) %>%
  mutate(ATE = mean.Y.1-mean.Y.0)
```

The resulting estimate of the marginal effect is 5.78, which is a consistent estimate of the true marginal effect of 5.8.
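The standardization step can be written compactly. As a sketch in the notation of the data generating mechanism above (A is the exposure; because the effect of A depends only on C2, the terms involving C1 and C3 cancel in the difference):

$$\text{ATE} = \sum_{c \in \{0,1\}} \Big( E[Y \mid A=1, C_2=c] - E[Y \mid A=0, C_2=c] \Big)\, P(C_2=c)$$

$$\text{ATE} = b_1\,P(C_2=0) + (b_1+b_6)\,P(C_2=1) = 5 \times 0.6 + 7 \times 0.4 = 5.8$$

Averaging the predictions over the two copies of the dataset is the empirical version of this weighted sum, with the observed distribution of the covariates playing the role of the weights.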
Confidence intervals for the standardized estimate can be obtained via bootstrapping: ```{r} standardization.boot <- function(data, indices){ df <- data[indices,] df.a1 <- df %>% mutate(Outcome = NA, Exposure = 1) df.a0 <- df %>% mutate(Outcome = NA, Exposure = 0) df.combined <- bind_rows(df.a1,df.a0) gcomp.fit <- df %>% lm(Outcome ~ Exposure*C2 + C1 + I(C1^2) + C3, data = .) df.combined$pred <- predict(gcomp.fit, newdata = df.combined) output <- df.combined %>% group_by(Exposure) %>% summarise( mean.Y = mean(pred)) %>% pivot_wider( names_from = Exposure, names_glue = "mean.Y.{Exposure}", values_from = mean.Y ) %>% mutate( ATE = mean.Y.1-mean.Y.0 ) return(output$ATE) } # bootstrap standardization.results <- boot(data=df.sim, statistic=standardization.boot, R=100) # 100 bootstrapped samples # generating confidence intervals empirical.se <- sd(standardization.results$t) # get empirical standard error estimate estimate <- standardization.results$t0 ll <- estimate - qnorm(0.975)*empirical.se # normal approximation ul <- estimate + qnorm(0.975)*empirical.se data.frame(cbind(estimate, empirical.se, ll, ul)) ``` Of note: if you are interested in more details about these topics, you can explore the R library *lmw* at https://cran.r-project.org/web/packages/lmw/index.html and also the R library *arm* at https://cran.r-project.org/web/packages/arm/index.html. 
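As an alternative to the normal approximation, a percentile interval can be read directly from the bootstrap replicates with `boot.ci()` from the *boot* package. The following is a minimal sketch on made-up data (the data frame `toy` and the statistic `diff.means` are hypothetical and only stand in for the standardization function used above):

```{r}
library(boot)

# Hypothetical toy data: two groups with a true mean difference of 2
set.seed(42)
toy <- data.frame(group = rep(0:1, each = 100))
toy$y <- rnorm(200, mean = 10 + 2 * toy$group, sd = 3)

# Statistic: difference in group means computed on the resampled rows
diff.means <- function(data, indices) {
  d <- data[indices, ]
  mean(d$y[d$group == 1]) - mean(d$y[d$group == 0])
}

toy.boot <- boot(data = toy, statistic = diff.means, R = 1000)

# Percentile interval: based on the 2.5th and 97.5th percentiles of the replicates
toy.ci <- boot.ci(toy.boot, conf = 0.95, type = "perc")
toy.ci$percent[4:5]  # lower and upper limits
```

Percentile limits are taken from the tails of the replicate distribution, so they usually need more resamples than a standard-error estimate; with only 100 replicates they can be unstable.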
# Estimating causal effects from observational studies using the propensity score approach

First of all, we will load the required libraries:

```{r warning=FALSE,message=FALSE}
library(twang)
library(magrittr)
library(tidyverse)
library(gtsummary)
library(stddiff)
library(ggplot2)
library(data.table)
library(boot)
library(splines)
library(PSAgraphics)
library(Matching)
library(sandwich)
library(survey)
library(rms)
```

We will use a dataset coming from an observational study of 996 patients receiving an initial Percutaneous Coronary Intervention (PCI) at Ohio Heart Health, Christ Hospital, Cincinnati in 1997 and followed for at least 6 months by the staff of the Lindner Center. The patients thought to be more severely diseased were assigned to treatment with abciximab (an expensive, high-molecular-weight IIb/IIIa cascade blocker); in fact, only 298 (29.9 percent) of patients received usual-care-alone with their initial PCI.

Our research question aims at estimating the *treatment effect* of abciximab+PCI (abcix) vs standard care on the probability of being deceased at 6 months. In this practical, we will apply the methods based on the propensity score.

Measured pre-treatment characteristics that could *confound* the treatment-outcome relationship are:

- acutemi: Recent acute myocardial infarction (0 No, 1 Yes)
- ejecfrac: Left Ejection Fraction (%)
- ves1proc: number of vessels involved (from 1 to 5)
- stent: 1 indicates coronary stent inserted
- diabetic: 1 indicates the subject is diabetic
- height: height of the subject in cm
- female: 1 indicates female subjects

## Exploring the data

```{r}
data("lindner",package="twang")
set.seed(123)
```

```{r}
summary(lindner)
```

How is the treatment variable distributed in the population?
```{r}
# Exposure
lindner %>%
  dplyr::select(abcix) %>%
  tbl_summary()
```

```{r}
ggplot(lindner)+
  geom_bar(aes(x=abcix,fill=as.factor(abcix)),stat="count")+
  scale_fill_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
  theme_classic()
```

Is the outcome rare?

```{r}
# Outcome
lindner$sixMonthDeath <- 1-lindner$sixMonthSurvive
lindner %>%
  dplyr::select(sixMonthDeath) %>%
  tbl_summary()
```

What is the *crude* odds ratio for mortality?

```{r}
# Crude OR
fit.crude <- glm(sixMonthDeath~abcix,family = binomial,data=lindner)
tbl_regression(fit.crude,exponentiate=TRUE)
```

Let's now summarize the confounding variables by treatment group and by outcome, to have a general idea about the observed associations:

```{r}
# Possible confounders
lindner %>%
  dplyr::select(acutemi, ejecfrac, ves1proc, stent, diabetic, female, height) %>%
  tbl_summary()
```

```{r}
# Descriptive statistics of patients' characteristics by treatment group
lindner %>%
  dplyr::select(acutemi, ejecfrac, ves1proc, stent, diabetic, female, height, abcix) %>%
  tbl_summary(by=abcix) %>%
  add_overall() %>%
  add_p()
```

```{r}
# Descriptive statistics of patients' characteristics by outcome
lindner %>%
  dplyr::select(acutemi, ejecfrac, ves1proc, stent, diabetic, female, height, sixMonthDeath) %>%
  tbl_summary(by=sixMonthDeath) %>%
  add_overall() %>%
  add_p()
```

We can now calculate the *Standardized Difference*, which can be used as a measure of balance between the treatment groups. It is a measure of difference between groups that is *independent* of statistical testing (remember that p values always depend on sample size!). It is very similar to the definition of *effect size* that we discussed in Block 2.
It can be defined for a continuous covariate as:

$SD_{c}=\frac{\overline{x_{1}}-\overline{x_{0}}} {\sqrt{\frac{s_1^2+s_0^2}{2}}}$

and for a dichotomous covariate as:

$SD_{d}=\frac{\overline{p_{1}}-\overline{p_{0}}} {\sqrt{\frac{\overline{p_{1}}(1-\overline{p_{1}})+\overline{p_{0}}(1-\overline{p_{0}})}{2}}}$

where the subscripts 1 and 0 denote the treated and the control group, respectively. The rough interpretation is that imbalance is present if the standardized difference is greater, in absolute value, than 0.1 or 0.2.

```{r}
s1 <- stddiff.numeric(vcol="height",gcol="abcix",data=lindner)
s2 <- stddiff.numeric(vcol="ejecfrac",gcol="abcix",data=lindner)
s3 <- stddiff.numeric(vcol="ves1proc",gcol="abcix",data=lindner)
s4 <- stddiff.binary(vcol="stent",gcol="abcix",data=lindner)
s5 <- stddiff.binary(vcol="female",gcol="abcix",data=lindner)
s6 <- stddiff.binary(vcol="diabetic",gcol="abcix",data=lindner)
s7 <- stddiff.binary(vcol="acutemi",gcol="abcix",data=lindner)
cont.var <- as.data.frame(rbind(s1,s2,s3))
rownames(cont.var) <- c("height", "ejecfrac", "ves1proc")
cont.var
bin.var <- as.data.frame(rbind(s4,s5,s6, s7))
rownames(bin.var) <- c("stent", "female", "diabetic", "acutemi")
bin.var
```

## Estimating the propensity score

Now fit a propensity score model: a logistic regression model with abciximab+PCI (vs. PCI) as the outcome, and the confounders listed in the table above included as covariates. We exclude the variable *height* from the list, since there was no relevant difference between the groups.
```{r}
# Fit a propensity score model
fit.ps<- glm(abcix~ acutemi+ ejecfrac+ ves1proc+ stent+ diabetic+ female, data=lindner,family = binomial)
summary(fit.ps)
```

```{r}
# Save the estimated propensity score
lindner$ps <- fitted(fit.ps)
```

```{r}
# Plot estimated ps
ggplot(lindner) +
  geom_boxplot(aes(y = ps,group = as.factor(abcix),col = as.factor(abcix))) +
  scale_y_continuous("Estimated PS") +
  scale_color_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
  theme_classic()
```

```{r}
# Histogram of the estimated ps by treatment group (ps is on the x axis)
ggplot(lindner) +
  geom_histogram(aes(x = ps,group = as.factor(abcix),fill = as.factor(abcix))) +
  scale_x_continuous("Estimated PS") +
  facet_grid(cols=vars(abcix))+
  scale_fill_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
  theme_classic()
```

Assess whether there are non-overlapping scores (positivity violation) in the two exposure groups:

```{r}
lindner %>%
  dplyr::select(ps,abcix) %>%
  tbl_summary(type = list(ps~"continuous2"),by=abcix,
              statistic = all_continuous2() ~c( "{median} ({p25}, {p75})", "{min}, {max}"))
```

Investigate overlap:

```{r}
lindner %<>% mutate(overlap=ifelse(ps>=min(ps[abcix==1]) & ps<=max(ps[abcix==0]),1,0))
# non-overlap: treated subjects with a higher ps than any non-abciximab user and
# control subjects with a smaller ps than any abciximab user
with(lindner,table(overlap,abcix))
# column percentages: the margin argument belongs to prop.table(), not to with()
with(lindner,prop.table(table(overlap,abcix),2))
```

In the following steps, we remove the subjects who do not overlap. This step reduces the original sample size, but we must respect the positivity assumption in order to estimate a reasonable causal effect. It makes no sense to include subjects who have a "near-zero" probability of receiving the treatment or of finding a "match" in the subsequent analyses.
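The overlap rule used above assumes that treated subjects tend to have the higher scores. A more general version takes the larger of the two group minima and the smaller of the two group maxima as the limits of the common support. Here is a minimal sketch on simulated data (the variables `x`, `tr` and `ps` are hypothetical and are not the Lindner data):

```{r}
# Hypothetical toy data: one confounder x drives treatment assignment
set.seed(7)
n <- 500
x <- rnorm(n)
tr <- rbinom(n, 1, plogis(-0.5 + 1.5 * x))
ps <- fitted(glm(tr ~ x, family = binomial))

# Common support: from the larger of the two group minima
# to the smaller of the two group maxima
lower <- max(min(ps[tr == 1]), min(ps[tr == 0]))
upper <- min(max(ps[tr == 1]), max(ps[tr == 0]))
in.support <- ps >= lower & ps <= upper

# Subjects inside (TRUE) and outside (FALSE) the common support, by group
table(in.support, tr)
```

When the treated group has the higher scores at both ends of the distribution, the limits `lower` and `upper` coincide with the ones used for the `overlap` variable.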
## First option: "adjusting" for the propensity score

We can use the estimated PS as a covariate in a logistic regression model for the outcome:

```{r}
# Model 1: Linear relationship between ps and outcome
fit.out<- glm(sixMonthDeath~abcix+ps, data=lindner, family = binomial, subset=overlap==1)
# Model summary
summary(fit.out)
```

The second step is to save the predicted probabilities for the treated and the untreated and to estimate the causal effects of interest in the population:

```{r}
# fitted values (probabilities)
lindner$predY0<-fit.out$family$linkinv(coef(fit.out)[1]+coef(fit.out)[3]*lindner$ps) # PCI subjects
lindner$predY1<-fit.out$family$linkinv(coef(fit.out)[1]+coef(fit.out)[2]+coef(fit.out)[3]*lindner$ps) # PCI+abciximab subjects
# ATE effect
Y1<-mean(lindner$predY1)
Y0<-mean(lindner$predY0)
# ATT effect
Y1_1<-mean(lindner$predY1[lindner$abcix==1])
Y0_1<-mean(lindner$predY0[lindner$abcix==1])
# ATE effect
Y1-Y0
# ATT effect
Y1_1-Y0_1
# Estimate odds ratios related to the "ATE" and the "ATT"
(Y1/(1-Y1))/(Y0/(1-Y0))
(Y1_1/(1-Y1_1))/(Y0_1/(1-Y0_1))
```

The ATE is quite similar to the ATT, and both indicate a protective effect of PCI+abciximab vs PCI alone. To obtain the corresponding confidence intervals we can use the bootstrap approach. We do not outline this procedure here; see the supplementary code at the end of this practical.

This model relies on two additional assumptions: no interaction between the propensity score and the treatment, and a linear relationship between the propensity score and the outcome (on the logit scale). Do these assumptions appear reasonable here?
We can try to fit different models, and then compare the AIC:

```{r}
# Model 2: Non-linear relationship between ps and outcome
fit.out2 <- glm(sixMonthDeath ~ abcix + ps + I(ps^2), data = lindner, family = binomial, subset = overlap==1)
summary(fit.out2)
# Model 3: Non-linear relationship between ps and outcome and interaction between ps and treatment
fit.out3 <- glm(sixMonthDeath ~ abcix*ps + I(ps^2), data = lindner, family = binomial, subset = overlap==1)
summary(fit.out3)
```

It seems that the relationship could be partially non-linear, but the statistical evidence for this is not very strong, and the same holds for the interaction. So the most parsimonious model to keep is probably Model 1.

## Second option: Stratification

Create propensity score strata: this can be an iterative process, since we should verify that we have enough subjects/events in each stratum.

```{r}
lindner %<>% mutate(strata = cut(ps, quantile(ps, c(0,0.25,0.5,0.75,1)), include.lowest = T, labels = c(1:4)))
```

```{r}
# Check they have been created correctly
summary(lindner$strata)
tapply(lindner$ps, lindner$strata, summary)
```

```{r}
# Look at numbers of events and patients in each stratum/exposure group
table(lindner$sixMonthDeath, lindner$strata, lindner$abcix)
```

These strata seem quite "sparse" in terms of number of events. Another possibility is:

```{r}
# Create propensity score strata
lindner %<>% mutate(strata = cut(ps, quantile(ps, c(0,0.33,0.66,1)), include.lowest = T, labels = c(1:3)))
# Check they have been created correctly
summary(lindner$strata)
# Look at numbers of events and patients in each stratum/exposure group
table(lindner$sixMonthDeath, lindner$strata, lindner$abcix)
```

Remember: we should also check the balance of the confounders in each stratum! See the supplementary material for that.
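As a rough numerical complement to the graphical checks in the supplementary material, a within-stratum balance check could look like the sketch below. It runs on simulated data rather than on the `lindner` dataset, and the helper `smd()` together with all the variable names are hypothetical, introduced only for illustration.

```r
# Simulated example (illustration only): one confounder x driving treatment
set.seed(1)
n     <- 600
x     <- rnorm(n)
treat <- rbinom(n, 1, plogis(0.8 * x))
ps    <- fitted(glm(treat ~ x, family = binomial))

# Three PS strata, built as in the practical
strata <- cut(ps, quantile(ps, c(0, 1/3, 2/3, 1)),
              include.lowest = TRUE, labels = 1:3)

# Hypothetical helper: standardized mean difference between treated and untreated
smd <- function(v, t) {
  (mean(v[t == 1]) - mean(v[t == 0])) /
    sqrt((var(v[t == 1]) + var(v[t == 0])) / 2)
}

smd_overall    <- smd(x, treat)   # imbalance before stratification
smd_by_stratum <- sapply(split(data.frame(x, treat), strata),
                         function(d) smd(d$x, d$treat))
smd_overall
smd_by_stratum
```

If stratification works as intended, the within-stratum differences should generally be closer to zero than the overall one; a common rule of thumb treats absolute values below 0.1 as acceptable balance.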
For now, let's just estimate the OR in each stratum:

```{r}
beta.treat <- numeric(3)
nstrata <- table(lindner$strata)
treated.strata <- table(lindner$strata, lindner$abcix)[,2]
for (i in 1:3){
  ms <- glm(sixMonthDeath ~ abcix, data = lindner, subset = strata==i, family = "binomial")
  beta.treat[i] <- coef(ms)[2]
  print(summary(ms))
}
```

And, finally, let's estimate the *weighted* OR related to the ATE and the ATT as a weighted average of the stratum-specific log-ORs, back-transformed to the OR scale:

```{r}
exp(sum(beta.treat*nstrata)/nrow(lindner))
exp(sum(beta.treat*treated.strata)/sum(treated.strata))
```

Here too, we should use a bootstrap approach to estimate the corresponding confidence intervals.

## Third option: Matching

Here we create a reduced dataset retaining only the patients within the overlap region:

```{r}
lindner.overlap <- lindner %>% filter(overlap==1)
```

Now we proceed with the matching algorithm: there are plenty of different algorithms in R that produce matching; here we use one from the *library(Matching)*.

```{r}
library(Matching)
match <- Match(Y = lindner.overlap$sixMonthDeath,
               Tr = lindner.overlap$abcix,
               X = lindner.overlap$ps,
               caliper = 0.2, # all matches not equal to or within 0.2 standard deviations of ps are dropped
               M = 1,         # 1:1
               ties = FALSE,
               replace = TRUE)
```

```{r}
# Number of pairs
nn <- length(match$index.treated)
# Create matched dataset
lindnerMatched <- cbind(rbind(lindner.overlap[match$index.treated,],
                              lindner.overlap[match$index.control,]),
                        pair = c(1:nn, 1:nn))
table(lindner.overlap$abcix)
```

```{r}
# Check number of treated patients
table(lindnerMatched$abcix)
```

```{r}
# Look at people being used multiple times in the matched sample
summary(as.factor(table(match$index.treated)))
summary(as.factor(table(match$index.control)))
```

```{r}
# Look at the propensity score distribution in the matched dataset
ggplot(lindnerMatched) +
  geom_boxplot(aes(y = ps, group = as.factor(abcix), col = as.factor(abcix))) +
  scale_y_continuous("Estimated PS") +
  scale_color_discrete("Treatment", labels = c("PCI", "PCI+abciximab")) +
  theme_classic()
```

```{r}
# The PS is on the x axis here (the y axis shows the counts)
ggplot(lindnerMatched) +
  geom_histogram(aes(x = ps, group = as.factor(abcix), fill = as.factor(abcix))) +
  scale_x_continuous("Estimated PS") +
  facet_grid(cols = vars(abcix)) +
  scale_fill_discrete("Treatment", labels = c("PCI", "PCI+abciximab")) +
  theme_classic()
```

Let's now check the balance:

```{r warning=FALSE}
# Balance Diagnostics before and after matching
bal <- MatchBalance(abcix ~ stent + female + diabetic + acutemi + ejecfrac + ves1proc,
                    data = lindner.overlap, match.out = match)
```

```{r}
lindnerMatched %>%
  dplyr::select(stent, female, diabetic, acutemi, ejecfrac, ves1proc, abcix) %>%
  tbl_summary(by = abcix) %>%
  add_overall() %>%
  add_p()
```

Note that the number of stents is not well balanced after the matching procedure. For this reason, we include this covariate in the regression model for the outcome.

Now we estimate the causal effect:

```{r}
fit.out4 <- glm(sixMonthDeath ~ abcix + stent, data = lindnerMatched, family = binomial)
summary(fit.out4)
```

We can see that the number of stents is statistically significant in the model, so it was a good idea to control for it, since it was not well balanced by the matching procedure. As we have already discussed, covariates that are not confounders for the effect of the treatment on the outcome can sometimes also be included in the final model in order to obtain more accurate estimates of the effect.

## Fourth option: IPTW

Definition of the weights:

```{r}
# Definition of weights for ATE
lindner.overlap %<>% mutate(w_ATE = case_when(abcix==1 ~ 1/ps,
                                              abcix==0 ~ 1/(1-ps)))
# Definition of weights for ATT
lindner.overlap %<>% mutate(w_ATT = case_when(abcix==1 ~ 1,
                                              abcix==0 ~ ps/(1-ps)))
```

Check the extreme weights: sometimes it is useful to use truncated or stabilized weights in order to reduce the variance of the final estimates, but we do not cover this aspect here.
```{r}
# Check extremes
quantile(lindner.overlap$w_ATE[lindner.overlap$abcix==1], c(0,0.01,0.05,0.95,0.99,1))
quantile(lindner.overlap$w_ATE[lindner.overlap$abcix==0], c(0,0.01,0.05,0.95,0.99,1))
quantile(lindner.overlap$w_ATT[lindner.overlap$abcix==1], c(0,0.01,0.05,0.95,0.99,1))
quantile(lindner.overlap$w_ATT[lindner.overlap$abcix==0], c(0,0.01,0.05,0.95,0.99,1))
```

```{r}
# Balance diagnostics
# ATE
bal_IPTW_ATE <- dx.wts(x = lindner.overlap$w_ATE,
                       data = lindner.overlap,
                       vars = colnames(lindner.overlap)[4:10],
                       treat.var = colnames(lindner.overlap)[3],
                       estimand = "ATE")
# ATT
bal_IPTW_ATT <- dx.wts(x = lindner.overlap$w_ATT,
                       data = lindner.overlap,
                       x.as.weights = T,
                       vars = colnames(lindner.overlap)[4:10],
                       treat.var = colnames(lindner.overlap)[3],
                       estimand = "ATT")
bal.table(bal_IPTW_ATE)
bal.table(bal_IPTW_ATT)
```

Finally, let's estimate the ATE causal effect on the weighted dataset:

```{r warning=FALSE, message=FALSE}
# Estimate ATE
design.lindnerATE <- svydesign(ids = ~1, weights = ~w_ATE, data = lindner.overlap)
fit_itpw_ATE <- svyglm(sixMonthDeath ~ abcix, family = binomial, design = design.lindnerATE)
tbl_regression(fit_itpw_ATE, exponentiate = T)
```

And the ATT:

```{r}
design.lindnerATT <- svydesign(ids = ~1, weights = ~w_ATT, data = lindner.overlap)
fit_iptw_ATT <- svyglm(sixMonthDeath ~ abcix, family = binomial, design = design.lindnerATT)
tbl_regression(fit_iptw_ATT, exponentiate = T)
```

Here too it is possible to estimate the 95% CI using bootstrap methods (which are in general more robust).
## Bootstrap confidence intervals for METHOD 1: covariate adjustment

```{r}
results <- data.frame(ATE = rep(NA,4), ATT = rep(NA,4))
```

```{r}
f_PSadj <- function(data, indices, outcome.formula) {
  d <- data[indices,] # allows boot to select sample
  # estimation of ps
  m1 <- glm(abcix ~ acutemi + ejecfrac + ves1proc + stent + diabetic + female,
            data = d, family = binomial)
  d$ps <- fitted.values(m1)
  # overlap
  d %<>% mutate(overlap = ifelse(ps >= min(ps[abcix==1]) & ps <= max(ps[abcix==0]), 1, 0))
  # outcome model
  m2 <- glm(outcome.formula, data = d, family = "binomial", subset = overlap==1)
  if(!m2$converged) print("Model did not converge")
  d$predY0 <- m2$family$linkinv(coef(m2)[1] + coef(m2)[3]*d$ps)
  d$predY1 <- m2$family$linkinv(coef(m2)[1] + coef(m2)[2] + coef(m2)[3]*d$ps)
  Y1 <- mean(d$predY1)
  Y0 <- mean(d$predY0)
  # treated subjects must be selected in the bootstrap sample d, not in the original data
  Y1_1 <- mean(d$predY1[d$abcix==1])
  Y0_1 <- mean(d$predY0[d$abcix==1])
  ATE_PSadj <- (Y1/(1-Y1))/(Y0/(1-Y0))
  ATT_PSadj <- (Y1_1/(1-Y1_1))/(Y0_1/(1-Y0_1))
  return(c(ATE_PSadj, ATT_PSadj))
}

# Format the original estimate and its bootstrap CI as "estimate(lower-upper)"
res_boot <- function(obj.boot, type = "percent", digits = 3){
  suppressWarnings({
    orig <- round(obj.boot$t0, digits)
    ciATE <- paste(round(boot.ci(obj.boot, index = 1)[[type]][4:5], digits), collapse = "-")
    ciATT <- paste(round(boot.ci(obj.boot, index = 2)[[type]][4:5], digits), collapse = "-")
  })
  res <- paste(orig, rbind(ciATE, ciATT), sep = "(")
  res.u <- paste(res, rep(")", 2), sep = "")
  return(res.u)
}
```

```{r warning=FALSE}
boot.out <- boot(data = lindner, statistic = f_PSadj, R = 1000, outcome.formula = fit.out$formula)
print(boot.out)
# Get 95% confidence interval
results$ATE[1] <- res_boot(boot.out, digits = 4)[1]
results$ATT[1] <- res_boot(boot.out, digits = 4)[2]
results
```

## Bootstrap confidence intervals for METHOD 2: stratification

With the stratification method the regression algorithm very often has convergence problems, since the strata contain very few events!
```{r}
f_PSstrat <- function(data, indices) {
  d <- data[indices,] # allows boot to select sample
  m1 <- glm(abcix ~ acutemi + ejecfrac + ves1proc + stent + diabetic + female,
            data = d, family = binomial)
  d$ps <- fitted.values(m1)
  quart_PS <- quantile(d$ps, c(0,0.33,0.66,1))
  # include.lowest = TRUE keeps the subject with the smallest ps in the first stratum
  d$strata <- cut(d$ps, quart_PS, include.lowest = TRUE, labels = c(1:3))
  # stratum sizes and log-ORs are recomputed on the bootstrap sample d
  beta.treat <- numeric(3)
  nstrata <- table(d$strata)
  treated.strata <- table(d$strata, d$abcix)[,2]
  for (i in 1:3){
    ms <- glm(sixMonthDeath ~ abcix, data = d, subset = strata==i, family = "binomial")
    beta.treat[i] <- coef(ms)[2]
  }
  ATE <- exp(sum(beta.treat*nstrata)/nrow(d))
  ATT <- exp(sum(beta.treat*treated.strata)/sum(treated.strata))
  return(c(ATE, ATT))
}
```

```{r warning=FALSE}
boot.out4 <- boot(data = lindner, statistic = f_PSstrat, R = 1000)
# Get 95% confidence interval
results$ATE[2] <- res_boot(boot.out4, digits = 3)[1]
results$ATT[2] <- res_boot(boot.out4, digits = 3)[2]
```

## Robust confidence intervals for METHOD 3: matching

We now estimate the robust standard errors for the estimate obtained on the matched dataset:

```{r warning=FALSE}
cov <- vcovHC(fit.out4, type = "HC0")
std.err <- sqrt(diag(cov))
q.val <- qnorm(0.975)
r <- cbind(
  Estimate = coef(fit.out4),
  "Robust SE" = std.err,
  z = (coef(fit.out4)/std.err),
  "Pr(>|z|) " = 2 * pnorm(abs(coef(fit.out4)/std.err), lower.tail = FALSE),
  LL = coef(fit.out4) - q.val * std.err,
  UL = coef(fit.out4) + q.val * std.err
)
# Exponentiate to get the OR
results$ATT[3] <- paste0(round(exp(r[2,1]),4), "(", round(exp(r[2,5]),4), "-", round(exp(r[2,6]),4), ")")
```

## Bootstrap confidence intervals for METHOD 4: IPTW

```{r warning=FALSE}
f_IPTW <- function(data, indices) {
  d <- data[indices,] # allows boot to select sample
  # estimation of ps
  m1 <- glm(abcix ~ acutemi + ejecfrac + ves1proc + stent + diabetic + female,
            data = d, family = binomial)
  d$ps <- fitted.values(m1)
  # overlap
  d %<>%
    mutate(overlap = ifelse(ps >= min(ps[abcix==1]) & ps <= max(ps[abcix==0]), 1, 0)) %>%
    filter(overlap==1)
  # Definition of weights for ATE
  d %<>% mutate(w_ATE = case_when(abcix==1 ~ 1/ps,
                                  abcix==0 ~ 1/(1-ps)))
  # Definition of weights for ATT
  d %<>% mutate(w_ATT = case_when(abcix==1 ~ 1,
                                  abcix==0 ~ ps/(1-ps)))
  # Estimate ATE
  design.lindnerATE <- svydesign(ids = ~1, weights = ~w_ATE, data = d)
  fit_itpw_ATE <- svyglm(sixMonthDeath ~ abcix, family = binomial, design = design.lindnerATE)
  # Estimate ATT
  design.lindnerATT <- svydesign(ids = ~1, weights = ~w_ATT, data = d)
  fit_iptw_ATT <- svyglm(sixMonthDeath ~ abcix, family = binomial, design = design.lindnerATT)
  ATE <- exp(fit_itpw_ATE$coefficients[2])
  ATT <- exp(fit_iptw_ATT$coefficients[2])
  return(c(ATE, ATT))
}
```

```{r warning=FALSE}
boot.out5 <- boot(data = lindner, statistic = f_IPTW, R = 1000)
results$ATE[4] <- res_boot(boot.out5, digits = 4)[1]
results$ATT[4] <- res_boot(boot.out5, digits = 4)[2]
```

## Final comparison across methods!

```{r}
# recall the crude OR from the original dataset
tbl_regression(fit.crude, exponentiate = T)
```

```{r}
# PS results
row.names(results) <- c("PS Adj Linear", "Stratification", "Matching", "IPTW")
results
```

We can observe that the propensity score based methods yield a stronger estimate of the treatment effect than the crude odds ratio. We can also observe that stratification produces very unstable results, due to the small sample size in the different strata. In conclusion, adjusting for confounders was important in this context!
## Appendix 1: METHOD 2 "Visually" checking balance across strata

```{r}
# Function for graphical diagnostic to check balance
balance.Strat <- function(x){
  if(length(unique(lindner[,x])) < 10){
    # few distinct values: treat the covariate as categorical
    cat.psa(categorical = lindner[,x], treatment = lindner$abcix, strata = lindner$strata,
            ylab = "Relative Frequency")
    title(x)
  } else {
    # otherwise treat it as continuous
    box.psa(continuous = lindner[,x], treatment = lindner$abcix, strata = lindner$strata)
    title(x)
  }
}
```

```{r}
var <- lindner %>%
  dplyr::select(stent, female, diabetic, acutemi, ejecfrac, ves1proc) %>%
  colnames()
```

```{r}
balance.Strat(var[1])
```

```{r}
balance.Strat(var[2])
```

```{r}
balance.Strat(var[3])
```

```{r}
balance.Strat(var[4])
```

```{r}
balance.Strat(var[5])
```

```{r}
balance.Strat(var[6])
```

## Appendix 2: Methods to estimate weights

There are many R libraries that can be used to estimate weights; see for example: https://cran.r-project.org/web/packages/PSweight/index.html

This library enables the estimation and inference of average causal effects with binary and multiple treatments using overlap weights (ATO), inverse probability of treatment weights (ATE), average treatment effect among the treated weights (ATT), matching weights (ATM) and entropy weights (ATEN), with and without propensity score trimming.

Reference: https://arxiv.org/pdf/2010.08893

# Logistic regression

## Introduction

In epidemiological data, outcomes are often binary or dichotomous. For example, in the investigation of the cause of a disease, the status of the outcome, the disease, is diseased vs non-diseased. For a mortality study, the outcome is usually died or survived.

For a continuous variable such as weight or height, we can think that the single *representative number* for the population or sample is the mean or median. For dichotomous data, the *representative number* is the proportion or percentage of one type of the outcome.
For example, as you should remember from Block 1, *prevalence* is the proportion of the population with the disease of interest. Case-fatality is the proportion of deaths among the people with the disease. The other related term is *probability*. Proportion is a simple, straightforward term. Probability denotes likelihood, which is more theoretical (it assumes a random mechanism that generates the observed data). In the case of a dichotomous variable, the proportion is used as the *estimated* probability, for example assuming that the variable behaves as a binomial random variable.

For computation, the outcome is often coded as 1 when the event is present and 0 otherwise. The prevalence is then the mean of this 0/1 variable in the study sample.

Assuming the binomial mechanism, the standard regression model is logistic regression, which uses the log(odds) as the dependent variable in the model. If P is the probability of having a disease, 1-P is the probability of not having the disease. The odds is thus P/(1-P).

As you may recall, the relationship between probability and odds, and in particular log(odds), can be plotted as follows:

```{r}
p <- seq(from=0, to=1, by=.01)
odds <- p/(1-p)
plot(log(odds), p, type="l", col="blue", ylab="Probability",
     main="Relationship between log odds and probability", las=1)
abline(h=.5)
abline(v=0)
```

Being unbounded and symmetric around 0 (which corresponds to a probability of 0.5), the logit is a more appropriate scale for a regression model for a binary outcome than the probability itself.

Similarly to the conversion between probabilities and odds, we can devise functions that convert both ways between probability and log-odds:

```{r}
logit <- function(p) log(p/(1-p))
tigol <- function(t) 1/(1+exp(-t))
```

(tigol is just the letters of logit reversed).

Modelling logit(Y|X) ~ $\beta$X is the general form of logistic regression. It means that we are modelling the logit of Y given X (or *conditioning* on X), where X denotes one or more independent variables.

Suppose there are two independent or exposure variables: X1 and X2.
$\beta$X would be $\beta_{0}$ + $\beta_{1}$X1 + $\beta_{2}$X2, where $\beta_{0}$ is the intercept.

The X can be age, sex, and other prognostic variables. In the explanatory/causal setting, one *specific* X among these variables is usually the focus of the study. The others are potential confounders and covariates.

Using logistic regression it turns out that the probability of observing Y given X, Pr(Y|X), is equal to $\exp(\beta X)/(1 + \exp(\beta X))$.

Hence, logistic regression is often used to compute the probability of an outcome under a given set of exposures. For example, the probability of getting a disease can be predicted for a given combination of age, sex, behaviour group, etc.

## Example: Tooth decay

The dataset Decay is a simple dataset containing two variables: 'decay', which is binary, and 'strep', which is a continuous variable.

```{r}
data(Decay)
summary(Decay)
```

The outcome variable is 'decay', which indicates whether a person has at least one decayed tooth (1) or not (0). The exposure variable is 'strep', the number of colony forming units (CFU) of streptococci, a group of bacteria suspected to cause tooth decay. The prevalence of having decayed teeth is equal to the mean of the 'decay' variable, i.e. 0.63. To look at the 'strep' variable type:

```{r}
summary(Decay$strep)
hist(Decay$strep)
```

The plot shows that the vast majority of subjects have values below about 150. Since the natural distribution of bacterial counts is logarithmic, a log-transformed variable is created and used as the independent variable.

Let's now estimate a univariable logistic regression model:

```{r}
Decay$log10.strep <- log10(Decay$strep)
glm0 <- glm(decay~log10.strep, family=binomial, data=Decay)
```

Ask for the summary of the estimated model:

```{r}
summary(glm0)
```

Both the intercept and the coefficient of 'log10.strep' are statistically significant. Pr(>|z|) for 'log10.strep' is the P value from Wald's test. This tests whether the coefficient, 1.681, is significantly different from 0.
In this case it is. The estimated intercept is -2.554. This means that when log10.strep is 0 (i.e. strep equals 1 CFU), the logit of having at least one decayed tooth is -2.55. We can then calculate the related baseline odds and probability.

```{r}
exp(-2.554) -> baseline.odds
baseline.odds
```

```{r}
baseline.odds/(1+baseline.odds) -> baseline.prob
baseline.prob
```

The odds of having at least one decayed tooth are 0.077, corresponding to a probability of 7.2%, when the count of mutans streptococci is 1 CFU. The coefficient of log10.strep is 1.681. For every unit increment of log10(strep), i.e. for every tenfold increase in CFU, the logit increases by 1.681. This increment in the logit is constant, but the increment in probability is not, because the latter is not on a linear scale.

The probability at each CFU value is computed by plugging both coefficients obtained from the model into the linear predictor and back-transforming the resulting logit. For example, at 100 CFU, the probability is:

```{r}
# linear predictor (logit) at 100 CFU, converted to a probability with the inverse logit
prob.100 <- plogis(coef(glm0)[1] + log10(100)*coef(glm0)[2])
prob.100
```

To see the relationship for the whole dataset:

```{r}
plot(Decay$log10.strep, fitted(glm0), ylim=c(0,0.80))
```

The logistic shape of the curve is only partly visible. To make it clearer, the ranges of the X and Y axes are both expanded to allow a more extensive curve fitting.

```{r}
plot(Decay$log10.strep, fitted(glm0), xlim = c(-2,4), ylim=c(0,1))
```

A new vector with the same name, 'log10.strep', is created inside a data frame in order to plot a fitted line on the same graph.

```{r}
newdata <- data.frame(log10.strep=seq(from=-2, to=4, by=.01))
predicted.line <- predict.glm(glm0,newdata,type="response")
```

The values of the predicted line in the above command must be on the same scale as the 'response' variable. Since the response is either 0 or 1, the predicted line lies in between, i.e. it is the predicted probability for each value of log10(strep).
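To make the point about constant logit increments and non-constant probability increments concrete, the sketch below plugs the rounded coefficients quoted in the text (-2.554 and 1.681) into the model equation. It is only an illustration: the names `b0`, `b1` and `cfu` are introduced here and are not part of the original analysis, and the results differ slightly from those obtained with the unrounded `coef(glm0)`.

```r
b0 <- -2.554   # intercept as reported in the text (rounded)
b1 <-  1.681   # coefficient of log10.strep as reported in the text (rounded)

cfu    <- c(1, 10, 100, 1000)
logits <- b0 + b1 * log10(cfu)   # linear predictor on the logit scale
probs  <- plogis(logits)         # inverse logit: exp(x) / (1 + exp(x))

# Each tenfold increase in CFU adds b1 to the logit,
# but the corresponding change in probability differs from step to step
data.frame(cfu,
           logit       = logits,
           probability = round(probs, 3),
           increment   = c(NA, round(diff(probs), 3)))
```

`plogis()` is the base R inverse-logit function, so it plays the same role as the `tigol()` helper defined earlier.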
```{r}
plot(Decay$log10.strep, fitted(glm0), xlim = c(-2,4), ylim=c(0,1),
     xlab=" ", ylab=" ", xaxt="n", las=1)
lines(newdata$log10.strep, predicted.line, col="blue")
# relabel the x axis on the original CFU scale
axis(side=1, at=-2:4, labels=as.character(10^(-2:4)))
title(main="Relationship between mutan streptococci \n and probability of tooth decay",
      xlab="CFU", ylab="Probability of having decayed teeth")
```

## Another example of logistic regression

The above example of caries data has a continuous variable, 'log10.strep', as the key independent variable. In most epidemiological datasets, the independent variables are often categorical. We will examine an example coming from an outbreak investigation.

On 25 August 1990, the local health officer in Supan Buri Province of Thailand reported the occurrence of an outbreak of acute gastrointestinal illness on a national handicapped sports day. Epidemiologists went to investigate. The dataset is called Outbreak. Most variable names are self-explanatory. Variables are coded as 0 = no, 1 = yes and 9 = missing/unknown for three food items consumed by participants: 'beefcurry' (beef curry), 'saltegg' (salted eggs) and 'water'.

Also on the menu were eclairs, a finger-shaped iced cake of choux pastry filled with cream. The variable 'eclair' records the number of pieces eaten by each participant. Missing values were coded as follows: 80 = "ate but do not remember how much", while code 90 represents totally missing information. Some participants experienced gastrointestinal symptoms, such as nausea, vomiting, abdominal pain and diarrhea. The age of each participant is recorded in years, with 99 representing a missing value.

The variables 'exptime' and 'onset' are the exposure and onset times, which are in character format. Let's look at the data.

```{r}
data(Outbreak)
summary(Outbreak)
```

First of all, we define the cases.
It was agreed among the investigators that a case should be defined as a person who had any of the four symptoms: 'nausea', 'vomiting', 'abdpain' or 'diarrhea'. A case can then be computed as follows:

```{r}
Outbreak$case <- (Outbreak$nausea==1)|(Outbreak$vomiting==1)|(Outbreak$abdpain==1)|(Outbreak$diarrhea==1)
```

The variable 'case' is now incorporated into the data.

```{r}
summary(Outbreak)
```

### Recoding missing values

We now recode missing values. We do not impute them, so subjects with missing values will be removed from the analyses.

```{r}
Outbreak = Outbreak %>%
  mutate(
    age=na_if(age, 99),
    beefcurry=na_if(beefcurry, 9),
    saltegg=na_if(saltegg, 9),
    water=na_if(water, 9),
    eclair=na_if(eclair, 90))
summary(Outbreak)
```

The three food variables can also be changed to factors:

```{r}
Outbreak = Outbreak %>%
  mutate(
    beefcurry=as.factor(beefcurry),
    saltegg=as.factor(saltegg),
    water=as.factor(water))
summary(Outbreak)
```

All variables now look fine except 'eclair', which still contains the value 80 representing "ate but do not remember how much". We will analyse its relationship with 'case' by considering it as an ordered categorical variable.

```{r}
tabpct(Outbreak$eclair, Outbreak$case)
```

The width of the columns of the mosaic graph denotes the relative frequency of that category. The highest frequency is 2 pieces, followed by 0 and 1 piece. The other values have relatively low frequencies, particularly the 5 records where 'eclair' was coded as 80.

There is a tendency of increasing red area, or attack rate, from left to right, indicating that the risk increased as more pieces of eclair were consumed. We will use the distribution of these proportions to guide our grouping of eclair consumption. The first column, zero consumption, has a very low attack rate, so it should be a separate category. Only a few took half a piece, and these can be combined with those who took only one piece.
Persons consuming 2 pieces should be kept as one category, as their frequency is very high. Those who ate more than two pieces should be grouped into another category. Finally, those coded as '80' will be dropped, because the amount consumed is unknown and their frequency is low. Recall that collapsing categories with low counts helps to increase the statistical power when estimating the regression coefficients.

```{r}
Outbreak$eclairgr <- cut(Outbreak$eclair, breaks = c(0, 0.4, 1, 2, 79),
                         include.lowest = TRUE, labels=c("0","1","2",">2"))
```

The argument 'include.lowest=TRUE' indicates that 0 eclair must be included in the lowest category.

```{r}
tabpct(Outbreak$eclairgr, Outbreak$case)
```

The attack rate, or percentage of diseased in each category of exposure, as shown in brackets in the column TRUE, increases from 5.1% among those who did not eat any eclairs to 70.1% among the heavy eaters of eclair. The graph output is similar to the preceding one, except that the groups are now more concise.

We now have a continuous variable, 'eclair', and a categorical variable, 'eclairgr'. The next step is to create a binary exposure for eclair.

```{r}
Outbreak$eclair.eat <- Outbreak$eclair > 0
```

This binary exposure variable is now similar to the others, i.e. 'beefcurry', 'saltegg' and 'water'.

We now model 'case' as the binary outcome variable and take 'eclair.eat' as the only explanatory variable:

```{r}
glm0 <- glm(case ~ eclair.eat, family=binomial, data=Outbreak)
summary(glm0)
```

```{r}
logistic.display(glm0)
```

The odds ratio from the logistic regression is derived by exponentiating the estimate, i.e. 23.75 is obtained from:

```{r}
exp(coef(summary(glm0))[2,1])
```

The 95% confidence interval of the odds ratio is obtained from:

```{r}
exp(coef(summary(glm0))[2,1] + c(-1,1) * 1.96 *coef(summary(glm0))[2,2])
```

These values are close to those from a simple calculation on the 2-by-2 table.
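The link with the 2-by-2 table can be checked by hand. The sketch below uses hypothetical counts (they are NOT taken from the Outbreak data) and the names `n11`, `n10`, `n01`, `n00` are introduced only for illustration: it computes the cross-product odds ratio with the usual Woolf (log-based) confidence interval, and then refits the same table with `glm()` using the counts as frequency weights.

```r
# Hypothetical counts, for illustration only
#              case   non-case
# exposed       n11      n10
# unexposed     n01      n00
n11 <- 40; n10 <- 60; n01 <- 10; n00 <- 90

# Cross-product odds ratio and Woolf 95% CI
or_hat <- (n11 * n00) / (n10 * n01)
se_log <- sqrt(1/n11 + 1/n10 + 1/n01 + 1/n00)
ci_tab <- exp(log(or_hat) + c(-1, 1) * 1.96 * se_log)

# Same table analysed with logistic regression (counts used as frequency weights)
tab <- data.frame(exposed = c(1, 1, 0, 0),
                  case    = c(1, 0, 1, 0),
                  n       = c(n11, n10, n01, n00))
fit <- glm(case ~ exposed, family = binomial, weights = n, data = tab)
or_glm <- exp(coef(fit)[2])
ci_glm <- exp(coef(fit)[2] + c(-1, 1) * 1.96 * sqrt(diag(vcov(fit)))[2])

rbind(table = c(OR = or_hat, lower = ci_tab[1], upper = ci_tab[2]),
      glm   = c(OR = unname(or_glm), lower = ci_glm[1], upper = ci_glm[2]))
```

With a single binary exposure the logistic model is saturated, which is why the two approaches agree up to numerical tolerance; differences appear only when other interval methods (e.g. exact ones) are used for the table.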
The output from *logistic.display* also contains the 'LR-test' (likelihood ratio test) result, which checks whether the likelihood of the given model, 'glm0', is significantly different from that of the model without 'eclair.eat', which in this case is the "null" model. For an independent variable with two levels, the LR-test does not add further important information, because Wald's test has already tested the hypothesis. When the independent variable has more than two levels, the LR-test is more important than Wald's test, as the following example demonstrates.

```{r}
glm1 <- glm(case ~ eclairgr, family=binomial, data=Outbreak)
logistic.display(glm1)
```

Interpreting Wald's test alone, one would conclude that all levels of eclair eaten are significant. However, this depends on the reference level. By default, R assumes that the first level of an independent factor is the reference level. If we *relevel* the factor so that the reference level is 2 pieces of eclair, Wald's test gives a different impression.

```{r}
Outbreak$eclairgr <- relevel(Outbreak$eclairgr, ref="2")
glm2 <- glm(case ~ eclairgr, family=binomial, data=Outbreak)
logistic.display(glm2)
```

The results show that eating only one piece of eclair does not reduce the risk significantly compared to eating two pieces. While results from Wald's test depend on the reference level of the explanatory variable, the LR-test is concerned only with the contribution of the variable as a whole and ignores the reference level.

### Try to evaluate the relationship between variables (when discussion with an expert is not possible)

Imagine that we want to get an idea of possible confounders of the effect of eclair, and that we do not have the possibility to talk with an expert, so that we can only investigate the relationships by observing the data.
In this case, one possibility is to examine the univariable effects and then build multivariable models, paying attention to possible changes in the estimated effect of the variable of interest when the other variables are introduced. From a causal inference point of view, this procedure is not considered as valid as defining a DAG *a priori* with the help of an expert, but it gives at least some indications.

Therefore, try 'saltegg' now as the only explanatory variable:

```{r}
glm3 <- glm(case ~ saltegg, family = binomial, data=Outbreak)
logistic.display(glm3)
```

The odds ratio for 'saltegg' is statistically significant. The number of valid records is also higher than in the model containing 'eclairgr'.

Note: One should always be very careful when analysing data that contain missing values. Advanced methods to handle missing values are beyond the scope of this course and, for reasons of simplicity, are ignored here. Data scientists are advised to deal with missing values *properly* prior to conducting their analysis.

To check whether the odds ratio of 'saltegg' is confounded by 'eclairgr', the two explanatory variables are put together in the next model:

```{r}
glm4 <- glm(case ~ eclairgr + saltegg, family=binomial, data=Outbreak)
logistic.display(glm4, crude.p.value=TRUE)
```

The odds ratios of the explanatory variables in glm4 are adjusted for each other. The crude odds ratios are exactly the same as those from the previous models with only a single variable.

The adjusted odds ratios of 'eclairgr' do not change, suggesting that it is not confounded by 'saltegg', whereas the odds ratio of 'saltegg' clearly moves towards unity and now has a very large P value. The difference between the adjusted odds ratio and the crude odds ratio is an indication that 'saltegg' is confounded by 'eclairgr', which could be considered as an independent risk factor.

Now that we have a model containing two explanatory variables, we can compare models 'glm4' and 'glm2' using the lrtest command.
```{r}
lrtest(glm4, glm2)
```

The P value of 0.975 is the same as that from 'P(LR-test)' of 'saltegg' obtained from the preceding command. The test determines whether removing 'saltegg' from the model makes a significant difference compared with keeping it. When there is more than one explanatory variable, 'P(LR-test)' from logistic.display is actually obtained from the lrtest command, which compares the current model against one in which the particular variable is removed, while keeping all the remaining variables.

Let's add further covariates to the model:

```{r}
glm5 <- glm(case~eclairgr+saltegg+sex, family=binomial, data=Outbreak)
logistic.display(glm5)
```

The third explanatory variable, 'sex', is another independent risk factor. Since females are the reference level, males have an increased risk compared to females. This variable is not considered a confounder of either of the preceding variables, because it has not substantially changed the odds ratios of any of them (from 'glm4'). It cannot confound because it is not associated with either of the preceding explanatory variables. In other words, males and females did not differ in terms of eating eclairs and salted eggs.

# Survey of Health, Ageing and Retirement in Europe: example of an analysis of a real dataset

Load the required libraries:

```{r}
library(haven)
library(tidyverse)
library(magrittr)
```

The data come from SHARE, the *Survey of Health, Ageing and Retirement in Europe* (https://share-eric.eu/data/). It is a multidisciplinary and cross-national panel database of data on health, socio-economic status and social and family networks of about 140000 individuals aged 50 or older (around 380000 interviews). SHARE covers 27 European countries and Israel. The dataset we use here contains only part of the data the survey produced.
The research question is whether a stressful period could be associated with the occurrence of muscular weakness in the over 50 population: therefore we can consider it from the pint of view of an *explanatory* (causal) model, not a predictive one. This question could be of interest since muscular weakness can be considered a proxy of poor physical health. Muscular strength was defined as the hand grip measured with a dynamo-meter and the weakness was defined as having a grip strength below a threshold calculated according to BMI and sex. The exposure group was defined as subjects who underwent a stressful period between the first and the last time they had been interviewed; all other subject were defined as non-exposed. Of note, an inclusion criteria was to have a normal hand strength at the start of the study. Once the subjects were divided in the exposed and non-exposed group, the hand grip measurement after two years from the exposure definition was used to define the outcome. In this example we will consider the following variables: + *low_grip*: our binary outcome variable (0: No; 1: Yes) + *stress*: our binary exposure variable (0: No; 1: Yes) + *age*: age of the subject in years + *female*: sex (1: females; 0: males) + *ses*: is household able to make ends meet? (1: Great Difficulties; 2: Some Difficulties; 3: Fairly Good ; 4 Good) + *paid_job*: has the subject ever done some paid job in their life? (0: No; 1: Yes) + *move_house*: has the subject ever moved country in their life? (0: No; 1: Yes) ```{r} load("Logistic Regression examples.RData") ``` ## IDA phase ### Categorical Variables Before starting to fit any regression model we have to properly code categorical variables. This is usually done in the initial data analysis (IDA) and the descriptive statistics phase. The first step consists in identifying which variables are categorical. This may seem as an easy step but it isn't always. If fact the choice for some variables (i.e. 
indexes and scales) is not obvious and requires either knowledge of the data we are analysing or a bit more effort in the data modelling. The important message is that we can't assume it is correct to model a variable as numerical only because it was of type numeric/integer in the dataset imported into R. Once we have identified the variables that we think are categorical, we can process them. There are three cases, each leading to a slightly different procedure according to the R type of the variable:

+ **Binary variables**
+ **Variables of type numerical or integer**
+ **Variables of type character**

### Binary Variables

Binary variables can be treated either as numerical (usually dummy 0/1 variables) or as factors. It does not make any difference in terms of the results obtained from the model, but we have to be aware of how they are coded for a correct interpretation of the model. In general, the model will always have one parameter for a binary variable. However, we have to choose which level the parameter should represent by deciding which group to code with 1. The choice should be made according to how we want the model to be interpreted. For example, for a treatment variable, 0 is usually the placebo or control treatment and 1 is usually the experimental treatment. Alternatively, in the case of covariates/exposure variables, we could simply use a statistical criterion: for better stability of the model it is generally preferable to use the *most frequent* category as the reference. For example, it makes more sense to take subjects who have had a paid job as the reference level of *paid_job*. This will help with the stability of the model and will also ease the interpretation of the results.
Therefore, we create a new variable *no_paid_job*:

```{r}
datiSHAREStress$no_paid_job <- ifelse(datiSHAREStress$paid_job == 0, 1, 0)
```

### Categorical variables with numerical labels

In this case we must transform the variable into a factor, assign labels to the factor levels and choose a reference level. The reference level is the group all the other levels will be compared to. In fact, when a factor variable with *c* levels is added to a model, R internally transforms it into *c-1* dummy variables and a model parameter is estimated for each of them. Again, it is important that we choose the reference level. In this case we have *ses*, which we have to transform into a factor:

```{r}
datiSHAREStress$ses <- as.factor(datiSHAREStress$ses)
```

We can now check how R has coded the variable:

```{r}
str(datiSHAREStress$ses)
contrasts(datiSHAREStress$ses)
```

By default, R uses the first level (here group 1) as the reference level. With the *contrasts()* function we can see how the variable will be parametrized in the regression model. In the rows we have the variable levels and in the columns the parameters generated for this variable. It is also useful to assign more meaningful labels:

```{r}
datiSHAREStress$ses <- factor(datiSHAREStress$ses,
                              labels = c("Great Difficulties", "Some Difficulties", "Fairly Easily", "Easily"))
```

Last but not least, we can set the reference category of our choice. In this case we choose "Easily", which is the most frequent category.

```{r}
table(datiSHAREStress$ses)
datiSHAREStress$ses <- relevel(datiSHAREStress$ses, ref = "Easily")
```

### Character variables

If the variable is of type *character*, the steps are similar to the previous ones, with the exception that we won't need to explicitly set the labels. As a reminder, by default R uses the first level in alphabetical order as the reference level.

## General descriptive statistics and plots

We skip this part here, since we already discussed how to describe a dataset in the various IDA examples.
Remember that you should pay particular attention to outliers, missing values, the shape of the distribution of continuous variables, and rare categories in categorical ones!

## Univariable Logistic Regression (univariable filtering)

A common method used in the analysis of health data for variable selection is fitting many models with one covariate at a time. This is also a way to begin to explore the dataset, even if, as discussed, it is not the suggested method to *select* variables in the model. In this case, we can start with the exposure to stress, our variable of interest. In the medical literature this kind of analysis is often referred to as "univariable analysis". So we start by fitting the model:

```{r}
fit_uni_stress <- glm(low_grip ~ stress, family = binomial, data = datiSHAREStress)
```

As a reminder, for the binomial family the glm function uses the logit as link function unless we state otherwise. The results of the model can be obtained with:

```{r}
summary(fit_uni_stress)
```

It seems that we can't reject the null hypothesis $\beta_{stress}=0$. We can also obtain the 95% confidence interval:

```{r}
confint.default(fit_uni_stress)
```

We can fit more univariable models to look for *possible confounders* (previously discussed with an expert, if possible) of the exposure-outcome relationship.

```{r}
fit_uni_age <- glm(low_grip ~ age, family = binomial, data = datiSHAREStress)
summary(fit_uni_age)
fit_uni_job <- glm(low_grip ~ no_paid_job, family = binomial, data = datiSHAREStress)
summary(fit_uni_job)
fit_uni_sex <- glm(low_grip ~ female, family = binomial, data = datiSHAREStress)
summary(fit_uni_sex)
fit_uni_move <- glm(low_grip ~ house_move, family = binomial, data = datiSHAREStress)
summary(fit_uni_move)
fit_uni_ses <- glm(low_grip ~ ses, family = binomial, data = datiSHAREStress)
summary(fit_uni_ses)
```

When we have covariates with many levels, such as *ses*, we can use a global likelihood-ratio test, which tests all the levels of the variable jointly instead of one level at a time.
```{r}
drop1(fit_uni_ses, test = "Chisq")
```

## Multivariable Logistic Model

We can now fit a multivariable logistic model with the variables that were found to be associated with the outcome in the previous analysis and that we suspect could act as confounders of the main exposure of interest. Since we do not have the possibility to discuss this with an expert, we decide to include in the multivariable regression model all covariates with a p-value < 0.1. This criterion may not be the best one, but it is widely applied in clinical and epidemiological research. We obviously include *stress* in the model, since it is our exposure of interest. We will treat the other variables as possible confounders, and we will explore whether a significant association with the exposure of interest is present. In principle, in the explanatory setting, we should start from a DAG and make the causal assumptions explicit (in order to estimate a causal effect!), but here we do not have subject-matter experts, so in the end we will limit ourselves to a cautious interpretation of the estimated association.

```{r}
fit_multi <- glm(low_grip ~ stress + no_paid_job + age + house_move + ses,
                 family = binomial, data = datiSHAREStress)
summary(fit_multi)
```

This model estimates the coefficient of each variable adjusted for all the others (i.e. holding them fixed). This is a key concept of multivariable regression and it is what allows us to *remove* confounding by the measured covariates. At first sight it seems that our exposure of interest is not associated with the outcome. Knowledge of the context the data come from can be used to make hypotheses about possible interactions between confounders and the exposure. In this case we want to test for an interaction between *no_paid_job* and *stress*. In other words, we want to see whether the association between the outcome and the exposure varies across levels of *no_paid_job*.
We then fit the model:

```{r}
fit_multi2 <- glm(low_grip ~ stress * no_paid_job + house_move + age + ses,
                  family = binomial, data = datiSHAREStress)
summary(fit_multi2)
```

As a reminder, using `*` introduces both the main effects and the interaction between the variables in the model. Alternatively, we can fit the same model as follows:

```{r}
fit_multi2 <- glm(low_grip ~ stress + no_paid_job + age + ses + stress:no_paid_job + house_move,
                  family = binomial, data = datiSHAREStress)
```

This second way of adding an interaction term is useful when we have multiple interactions involving the same variable. It seems that a significant interaction is present between *stress* and the *no_paid_job* confounder. Note that, in any case, the main effect of stress is still not significant in the model. Now that we have fitted these two models, which one should we choose? We can use a formal statistical test to compare the two nested models:

```{r}
anova(fit_multi, fit_multi2, test = "Chisq")
```

The null hypothesis of this test is that the simpler model (i.e. the one with the smaller number of regression parameters) fits the data as well as the more complex model. In this case we reject that hypothesis at the 5% significance level and we keep the model with the interaction.

## Calibration for the logistic regression model (simplest basic method)

Even if we are estimating an explanatory model here, this does not mean that we are not at all interested in evaluating whether the predicted probabilities are in line with the observed event rates. A basic procedure to evaluate this aspect in logistic regression is the Hosmer-Lemeshow test (for details see: https://onlinelibrary.wiley.com/doi/book/10.1002/0471722146). In brief, the null hypothesis of the test is that the predicted probabilities (split into ordered groups) are in line with the observed rates of the events in each group.
```{r}
library(generalhoslem)
logitgof(datiSHAREStress$low_grip, fitted(fit_multi2))
```

The model doesn't show evidence of poor goodness of fit/calibration. Remember that when we are instead working with models used specifically for prediction, we should use more refined analyses, such as bootstrap overfitting-corrected calibration curves.

## Interpretation of the logistic regression model

The next step is interpreting the results obtained from the model, which is one of the most important parts of the analysis to be discussed with experts. Of course we are keeping things simple here: in reality we would have had to consider many other candidate confounders in order to properly *control for confounding* in this observational study, at least for the measured ones.

### Continuous covariates

We obtain the OR, which is the measure of association usually reported when interpreting the results of a logistic model. We start with the OR estimated for age:

```{r}
exp(fit_multi2$coefficients)[4]
```

In general, an OR>1 indicates an increase in the odds (and hence in the probability) of the outcome, whereas an OR<1 indicates a decrease. But the question is: with respect to what? We know that the OR is a *relative measure* of association, so we must always ask ourselves what comparison we are making when we report an estimated OR. The above output does not mean anything by itself. The first important thing to remember is that age is a continuous variable (measured in years) and that in the model we have assumed it has a linear effect on the log-odds of the outcome. When we obtain the $\hat{OR}$ by exponentiating the coefficient for age, we obtain the OR estimate for an increase of one year of age. Since $\hat{OR}>1$, the probability of developing low grip increases as age increases. Specifically, a subject has odds of having low grip 1.11 times greater than a subject who is 1 year younger.
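
As a worked step (a sketch in generic notation, where $\beta_{age}$ denotes the age coefficient and all the other covariates are held fixed), the one-year OR follows directly from the assumed linear effect of age on the log-odds:

$$
\log\frac{\text{odds}(age+1)}{\text{odds}(age)}
= \left[\beta_0 + \beta_{age}\,(age+1) + \dots\right] - \left[\beta_0 + \beta_{age}\,age + \dots\right]
= \beta_{age}
\quad\Longrightarrow\quad
OR_{1\ year} = e^{\beta_{age}}
$$

The same algebra with a difference of $k$ years gives $OR_{k\ years} = e^{k\,\beta_{age}}$, which does not depend on the starting age because of the linearity assumption.
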
For continuous variables it is important to report an OR for a difference that is relevant in the application at hand; a proper choice will depend on the scale on which the variable is measured and on the magnitude of the estimated coefficient. For example, here 1 year may not be so relevant from an epidemiological point of view, so we may want to report an OR for a 5-year difference in age:

```{r}
exp(fit_multi2$coefficients * 5)[4]
```

### Binary Covariates

What if we want to interpret the OR of a binary variable? Let's consider the variable *house_move*. If it is coded as numerical, the coefficient always refers to level 1. In this case 1 stands for having moved country in the past, so if the estimated coefficient were significant we could say that subjects who moved seem to have odds lower by 17% ($1-\hat{OR}$) compared to people who have never moved country in their life. However, when we look at the 95% CI we observe that it contains 1, so the association does not seem to be statistically significant.

```{r}
# OR
exp(fit_multi2$coefficients)[8]
# 95% CI
exp(confint.default(fit_multi2))[8,]
```

### Categorical Covariates

The interpretation of the OR for categorical variables is similar: we have to keep in mind that we are always comparing each of the variable levels to the reference level. Here, it seems (interestingly!) that the odds of health worsening are the same for subjects with a good or fairly good self-perceived socio-economic status, while they increase as the economic situation gets worse.

```{r}
# OR
exp(fit_multi2$coefficients)[5:7]
# 95% CI
exp(confint.default(fit_multi2))[5:7,]
```

### Interactions

Finally, we will obtain the $\hat{OR}$s for *stress* and *no_paid_job*. Since there is an interaction involved, we have to be a bit more careful and consider the two variables together. So first of all we have to know which is our reference group.
In our case this is the group of subjects who have not experienced a stressful period and have done some paid job in their life (the reference levels of both variables involved in the interaction). Then, we can consider all combinations of the levels of the two variables. If we want the $\hat{OR}$ for undergoing a stressful period and having had a paid job with respect to not having had a period of stress and having had a paid job, we simply exponentiate the estimated main-effect coefficient for stress (keeping the other covariates in the model at the same values):

```{r}
exp(fit_multi2$coefficients)[2]
```

and we don't reject the null hypothesis that it is equal to 1 (no significant effect):

```{r}
exp(confint.default(fit_multi2))[2,]
```

On the other hand, if we want the $\hat{OR}$ for never having had a paid job and not being exposed to stress with respect to having had a paid job and not being exposed to stress, we simply exponentiate the coefficient for the main effect of the variable *no_paid_job*:

```{r}
exp(fit_multi2$coefficients)[3]
```

This effect is not statistically significant either, which means that there does not seem to be a difference in the risk of low hand grip with regard to job experience in the non-exposed group (no stress), keeping all the other covariates fixed.

```{r}
exp(confint.default(fit_multi2))[3,]
```

Last, we can obtain the OR for subjects who were stressed and have never had a paid job with respect to the reference group (no stressful period and some paid job):

```{r}
exp(fit_multi2$coefficients[9] + fit_multi2$coefficients[2] + fit_multi2$coefficients[3])
```

So in this subgroup it seems that the probability of developing low grip increases. However, we should also evaluate the 95% CI for this $\hat{OR}$.
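
For reference, the interval needed here is a Wald interval for a *sum* of coefficients, whose standard error also involves their covariances. The chunk below is only a minimal sketch on simulated data (not the SHARE dataset): every object with the `sim_` prefix is hypothetical, and the true coefficients are arbitrary values chosen for illustration.

```{r}
# Illustrative sketch on simulated data (NOT the SHARE dataset);
# all "sim_" names and the true coefficients are hypothetical.
set.seed(123)
n <- 2000
sim_dat <- data.frame(
  stress      = rbinom(n, 1, 0.3),
  no_paid_job = rbinom(n, 1, 0.2),
  age         = round(runif(n, 50, 90))
)
# True log-odds, with an interaction between the two binary variables
sim_lp <- -6 + 0.1 * sim_dat$stress + 0.1 * sim_dat$no_paid_job +
  0.06 * sim_dat$age + 0.8 * sim_dat$stress * sim_dat$no_paid_job
sim_dat$low_grip <- rbinom(n, 1, plogis(sim_lp))

sim_fit <- glm(low_grip ~ stress * no_paid_job + age,
               family = binomial, data = sim_dat)

# Weights selecting stress + no_paid_job + stress:no_paid_job
sim_w <- setNames(rep(0, length(coef(sim_fit))), names(coef(sim_fit)))
sim_w[c("stress", "no_paid_job", "stress:no_paid_job")] <- 1

sim_est <- sum(sim_w * coef(sim_fit))                        # log(OR) of the joint exposure
sim_se  <- sqrt(drop(t(sim_w) %*% vcov(sim_fit) %*% sim_w))  # SE of the sum (uses covariances)

# OR with Wald 95% confidence interval
sim_or_ci <- exp(sim_est + c(OR = 0, lower = -1, upper = 1) * qnorm(0.975) * sim_se)
sim_or_ci
```

Selecting the coefficients by name rather than by position avoids mistakes when the order of the terms in the formula changes.
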
One way to obtain it is to re-fit the multivariable logistic model using the *rms* R package:

```{r,eval=F,echo=F}
datiSHAREStress$stress <- as.factor(datiSHAREStress$stress)
datiSHAREStress$house_move <- as.factor(datiSHAREStress$house_move)
datiSHAREStress$no_paid_job <- as.factor(datiSHAREStress$no_paid_job)
```

```{r}
library(rms)
dd <- datadist(datiSHAREStress)
options(datadist='dd') # define ranges of the covariates
fit_multi2b <- lrm(low_grip ~ stress * no_paid_job + age + ses + house_move, data = datiSHAREStress)
```

The code is quite similar; the main difference is that we first have to run the *datadist* function, which stores the distribution summaries of the variables. The print function returns output very similar to that of summary() for a glm object.

```{r}
print(fit_multi2b)
```

The summary, on the other hand, gives the estimates of the coefficients as well as those of the ORs. Of note, for continuous variables such as age the function by default calculates the OR for the difference between the $1^{st}$ and the $3^{rd}$ quartile.

```{r}
summary(fit_multi2b)
```

So far, the ORs for stress and job experience are the ones for the main effects. We can easily obtain the OR for the joint exposure (stress and no paid job versus the reference group) with the built-in *contrast* function:

```{r}
ctr <- contrast(fit_multi2b,
                list(stress = 1, no_paid_job = 1),
                list(stress = 0, no_paid_job = 0), type = "average")
ctr
# OR and 95% CI
exp(ctr$Contrast)
exp(ctr$Lower)
exp(ctr$Upper)
```

So it seems that the joint exposure to stress and no paid job is significantly associated with the outcome.

## Conclusions

Recall our initial research question: whether a stressful period could be associated with the occurrence of muscular weakness in the over-50 population.
Based on our analysis, it seems that it is not the single event of a stressful period that increases the occurrence of muscular weakness; rather, it is the joint impact of a stressful period coupled with never having had a paid job that is associated with the occurrence of muscular weakness, adjusting for (or independently from) age, having moved country and socio-economic status.

# Regression on count data: Poisson regression

## Introduction

In nature, an event usually takes place in a *very small* amount of time. At any given point in time, the probability of encountering such an event is small. Instead of the probability of the *single* event, we now focus on the *frequency* of the events as a density, which means the incidence or 'count' of events over a period of time. (While time is one dimension, the same concept applies to the density of counts of small objects in a two-dimensional area or three-dimensional space.) Moreover, we can assume that one event is independent of another and that the *densities* in different units of time vary with a variance equal to the average density. We can approximate this kind of random process using the Poisson random variable. When the probability of having an event is affected by some factors, a model is needed to explain and predict the density. Variation among different strata of a population could be explained by the various combinations of factors. *Within each stratum* (defined by a combination of covariates), the distribution of the events is assumed random. Poisson regression deals with outcome variables that are counts in nature (whole numbers or integers). Independent covariates have the same role as those encountered in linear and logistic regression. In epidemiology, Poisson regression is very often used for analysing *grouped* population-based or cohort data, looking at the incidence density among person-time contributed by subjects that share similar characteristics of interest.
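
For reference, the Poisson random variable mentioned above can be written as follows (a sketch in assumed notation: $Y$ is the count of events in a unit of time and $\mu$ the average density):

$$
P(Y = y) = \frac{e^{-\mu}\,\mu^{y}}{y!}, \qquad y = 0, 1, 2, \dots, \qquad E(Y) = Var(Y) = \mu
$$

The equality of mean and variance is exactly the property "variance equal to the average density" described in the text.
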
Poisson regression is one of the common families of regression models used in epidemiological studies. The two that are more commonly used, linear regression and logistic regression, have already been covered. The last family, survival methods, will be explained in block 4. There are two main assumptions for Poisson regression:

1) risk is homogeneous among person-times contributed by different subjects who share the same characteristics of interest (e.g. sex, age-group) and the same period.
2) asymptotically, or as the sample size becomes larger and larger, the mean of the counts is equal to the variance.

Straightforward linear regression methods (assuming constant variance and normal errors) are not appropriate for count data for four main reasons:

1. the linear model might lead to the prediction of negative counts
2. the variance of the response may increase with the mean
3. the errors will not be normally distributed
4. zero counts are difficult to handle in transformations

Moreover, in studies that involve the time dimension, different subjects may have different person-times of exposure. Analysing risk factors while ignoring differences in person-times is wrong. Poisson regression overcomes these limitations. Note that in survival analysis using, for example, Cox regression (see block 4), the *hazard ratio* is estimated for each covariate in the model, not the incidence density in each subgroup; in the Cox model the interest is focused on "how long until an event occurs" (time to event), whereas in the Poisson regression model the focus is on "how many events occur in a given interval".

## Example of Poisson model: the Montana smelter study

The dataset Montana was extracted from an occupational cohort study conducted to test the association between respiratory deaths (outcome) and exposure to arsenic in the industry, after adjusting for various other risk factors/confounders. The main outcome variable is *respdeath*.
This is the count of the number of deaths among *personyrs*, the person-years of subjects in each category. The other variables are independent covariates including age group *agegr*, period of employment *period*, starting time of employment *start* and the level of exposure to arsenic during the study period *arsenic* (the exposure of interest). Read in the data first and examine the variables.

```{r}
data(Montana)
summary(Montana)
```

The last four variables are classed as integers. We need to tell R to interpret them as categorical variables, or factors, and attach labels to each of the levels. This can be done using the factor command with a 'labels' argument included.

```{r}
Montana$agegr <- factor(Montana$agegr, labels=c("40-49","50-59","60-69","70-79"))
Montana$period <- factor(Montana$period, labels=c("1938-1949", "1950-1959","1960-1969", "1970-1977"))
Montana$start <- factor(Montana$start, labels=c("pre-1925", "1925 & after"))
Montana$arsenic1 <- factor(Montana$arsenic, labels=c("<1 year", "1-4years","5-14 years", "15+ years"))
summary(Montana)
```

We keep the original *arsenic* variable unchanged for use later on.

### Descriptive analyses : breakdown of incidence by age and period

Let us explore the person-years breakdown by age and period. Firstly, create a table for total person-years:

```{r}
table.pyears <- tapply(Montana$personyrs, list(Montana$period, Montana$agegr), sum)
```

Carry out the same procedure for the number of deaths, and compute the table of incidence per 10,000 person-years for each cell.
```{r}
table.deaths <- tapply(Montana$respdeath, list(Montana$period, Montana$agegr), sum)
table.inc10000 <- table.deaths/table.pyears*10000
table.inc10000
```

Now, create a time-series plot of the incidence:

```{r}
plot.ts(table.inc10000, plot.type="single", xlab=" ", ylab="#/10,000 person-years",
        xaxt="n", col=c("black", "blue","red","green"), lty=c(2,1,1,2), las=1)
# square size is proportional to the person-years in each cell
points(rep(1:4,4), table.inc10000, pch=22, cex=table.pyears/sum(table.pyears) * 20)
title(main = "Incidence by age and period")
axis(side = 1, at = 1:4, labels = levels(Montana$period))
legend(3.2,40, legend=levels(Montana$agegr)[4:1], col=c("green","red", "blue", "black"),
       bg = "white", lty=c(2,1,1,2))
```

The above graph shows that the older age groups are generally associated with a higher risk. On the other hand, the sample size (reflected by the size of the squares at each point) decreases with age. The possibility of a confounding effect of age on the exposure of interest can better be examined by using Poisson regression.

### Modelling with Poisson regression

Let's estimate a Poisson regression model taking into account only period as a covariate:

```{r}
mode11 <- glm(respdeath ~ period, offset = log(personyrs), family = poisson, data=Montana)
summary(mode11)
```

The option *offset = log(personyrs)* allows the variable *personyrs* to be the denominator for the counts of *respdeath*. The logarithmic transformation is needed because the default link function for the Poisson family is the natural log, so the person-years must enter the linear predictor on the log scale. Recall: an important criterion in the choice of a link function for the various families of distributions is to ensure that the fitted values from the model stay within reasonable bounds. Specifying a log link (the default for Poisson) ensures that the fitted counts are all greater than or equal to zero.
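
To see what the offset does in practice, here is a minimal sketch on simulated data (not the Montana dataset): every object with the `sim_` prefix is hypothetical, and the true rates are arbitrary values chosen for illustration. With a single grouping factor, the fitted rates should coincide with the crude rates (events divided by person-years).

```{r}
# Illustrative sketch on simulated data (NOT the Montana dataset);
# all "sim_" names and the true rates are hypothetical.
set.seed(42)
sim_rate <- data.frame(
  group  = factor(rep(c("A", "B"), each = 50)),
  pyears = runif(100, 100, 5000)
)
# True incidence rates: 0.002 per person-year in A, 0.004 in B (rate ratio = 2)
sim_true <- ifelse(sim_rate$group == "A", 0.002, 0.004)
sim_rate$events <- rpois(100, lambda = sim_true * sim_rate$pyears)

sim_pois <- glm(events ~ group, offset = log(pyears),
                family = poisson, data = sim_rate)

# exp(intercept): estimated rate in group A; exp(groupB): rate ratio B vs A
exp(coef(sim_pois))

# Crude rates computed by hand: total events / total person-years
sim_crude <- with(sim_rate, tapply(events, group, sum) / tapply(pyears, group, sum))
sim_crude
```

Without the offset, the model would compare raw counts and ignore that the rows contribute very different amounts of person-time.
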
For more details on default links for the various families of distributions related to generalized linear modelling, see the help in R under *help(family)*. The first model above, with *period* as the only independent variable, suggests that the death rate increased with time. The model can be tested for goodness of fit, and we can check whether the Poisson assumptions mentioned earlier have been violated.

### Goodness of fit test

To test the goodness of fit of the Poisson model, type:

```{r}
poisgof(mode11)
```

The component '$chisq' is actually computed from the model deviance, a parameter reflecting the level of errors. A large chi-squared value with small degrees of freedom results in a significant violation of the Poisson assumption (p < 0.05). If only the P value is wanted, the command can be shortened.

```{r}
poisgof(mode11)$p.value
```

The P value is very small, indicating a poor fit. Note that this method works under the assumption of a *large* sample size. An alternative method is to fit a negative binomial regression model (but this is not covered in the slides!). We now add the second independent variable 'agegr' to the model:

```{r}
mode12 <- glm(respdeath~agegr+period, offset=log(personyrs), family = poisson, data=Montana)
AIC(mode12)
```

The AIC (Akaike Information Criterion) has decreased remarkably from 'model1' to 'model2', indicating a poor fit of the first model.

```{r}
poisgof(mode12)$p.value
```

But 'model2' still violates the Poisson assumption.

```{r}
mode13 <- glm(respdeath ~ agegr, offset = log(personyrs), family = poisson, data=Montana)
AIC(mode13)
poisgof(mode13)$p.value
```

Removal of 'period' further reduces the AIC, but the model still violates the Poisson assumption to the same extent as the previous one. The next step is to add the exposure of interest: 'arsenic1'.
```{r}
mode14 <- glm(respdeath ~ agegr + arsenic1, offset=log(personyrs), family = poisson, data=Montana)
summary(mode14)
```

```{r}
poisgof(mode14)$p.value
```

Fortunately, 'model4' has a much lower AIC than 'model3' and it no longer violates the assumption. If we change the reference level for arsenic and use *1-4 years* vs the others:

```{r}
Montana$arsenic.b <- relevel(Montana$arsenic1, ref="1-4years")
mode15 <- glm(respdeath ~ agegr + arsenic.b, offset=log(personyrs), family = poisson, data=Montana)
summary(mode15)
```

There does not appear to be any further increase in the risk of death with more than 4 years of exposure to arsenic, so it may be worth combining the exposure into just two levels:

```{r}
Montana$arsenic2 <- Montana$arsenic1
levels(Montana$arsenic2) <- c("<1 year", rep("1+ years", 3))
model6 <- glm(respdeath ~ agegr + arsenic2, offset=log(personyrs), family=poisson, data=Montana)
summary(model6)
```

At this stage, we would accept 'model6' as the final model, since it has the smallest AIC among all the models that we have tried. We conclude that exposure to arsenic for at least one year is associated with a statistically significant increase in the rate of respiratory death by a factor of exp(0.8109), or 2.25 times, independently of age.

## Another usage of the model: prediction of the incidence rate

In the Poisson model, the outcome is a count. In the generalized linear model, the relationship between the values of the outcome (as measured in the data and predicted by the model in the fitted values) and the linear predictor is determined by the link function. This link function relates the mean value of the outcome to its linear predictor. By default, the link function for the Poisson distribution is the natural logarithm. With the offset being log(person-time), the linear predictor of the covariates models the log(incidence rate). The matrix 'table.inc10000' (created previously) gives the crude incidence rate by age group and period.
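
In formulas (a sketch in assumed notation: $Y_i$ is the count of events in group $i$, $t_i$ its person-time and $x_{i1},\dots,x_{ip}$ its covariates), the log link with the offset reads:

$$
\log E(Y_i) = \log(t_i) + \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}
\quad\Longleftrightarrow\quad
\log\frac{E(Y_i)}{t_i} = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}
$$

so that $\exp(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip})$ is the incidence rate per unit of person-time, and the expected count for any chosen amount of person-time $t$ is $t$ times that rate.
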
Each of the Poisson regression models above can be used to compute the *predicted* incidence rate when the values of the variables in the model are given. For example, to compute the incidence rate from a population of 100,000 people aged between 40-49 years who were exposed to arsenic for less than one year using 'model6', type:

```{r}
newdata <- as.data.frame(list(agegr="40-49", arsenic2="<1 year", personyrs=100000))
predict(model6, newdata, type="response")
```

This population would have an estimated incidence rate of 33.26 per 100,000 person-years.

## Interpretation of the regression coefficients : the incidence rate ratio (IRR)

In this example, all subjects pool their follow-up times and this number is called 'person-time', which is then used as the denominator for the events, resulting in the 'incidence rate'. Comparing the incidence rates of two groups of subjects by their exposure status is fairer than comparing the crude risks (which consider the population at baseline). The ratio between the incidence rates of two groups is called the incidence rate ratio (IRR), which is an *improved form* of the relative risk. In 'model6', for example, we want to compute the incidence rate ratio between the subjects exposed to arsenic for one or more years and those exposed for less than one year. The quickest way to obtain this IRR is to exponentiate the coefficient of the specific variable 'arsenic2', which is the fifth coefficient in the model.

```{r}
coef(model6)
exp(coef(model6)[5])
```

The following steps explain how the 95% confidence interval for all variables can be obtained.

```{r message=FALSE}
coeff <- coef(model6)
coeff.95ci <- cbind(coeff, confint(model6))
coeff.95ci
```

Note that confint(model6) provides a 95% confidence interval for the model coefficients.

```{r}
IRR.95ci <- round(exp(coeff.95ci), 1)[-1,]
```

The required values are obtained by exponentiating the last matrix with the first row (the intercept) removed. The display is rounded to 1 decimal place for better viewing.
Then the matrix columns are labelled and the 95% CI is displayed.

```{r}
colnames(IRR.95ci) <- c("IRR", "lower95ci", "upper95ci")
IRR.95ci
```

A simpler way is to use the command idr.display in epiDisplay (note that it uses the term IDR, incidence density ratio, instead of IRR).

```{r}
idr.display(model6, decimal=1)
```

The command idr.display gives results to 3 decimal places by default. This can easily be changed by the user.

## Optional material: negative binomial regression

Recall that for Poisson regression one of the assumptions for a valid model is that the mean and variance of the count variable are equal. The negative binomial distribution is a more general distribution for 'count' response data, allowing for greater dispersion or variance of the counts. In practice, it is quite common for the variance of the outcome to be larger than the mean. This is called overdispersion. If a count variable is overdispersed, Poisson regression underestimates the standard errors of the regression coefficients. When overdispersion is evident, one solution is to specify that the errors have a negative binomial distribution. Negative binomial regression usually gives coefficients close to those from Poisson regression, but with larger standard errors. The interpretation of the results is the same as that from Poisson regression. Take an example of counts of water containers infested with mosquito larvae in a field survey. The data are contained in the dataset DHF99.

```{r}
library(MASS)
data(DHF99)
summary(DHF99)
```

```{r}
describeBy(DHF99$containers, group=DHF99$viltype)
```

The function for performing a negative binomial glm is glm.nb. This function is located in the MASS library. In addition, a very helpful function for selecting the best model based on the AIC value is the *step* function, which is located in the *stats* library (a default library loaded on start-up).
```{r} model.poisson <- step(glm(containers ~ education + viltype, family=poisson, data=DHF99)) model.nb <- step(glm.nb(containers ~ education + viltype, data=DHF99)) ``` ```{r} coef(model.poisson) ``` ```{r} coef(model.nb) ``` Both models end up with only 'viltype' being selected. The coefficients are very similar. The Poisson model has significant overdispersion but not the negative binomial model. ```{r} poisgof(model.poisson)$p.value ``` ```{r} poisgof(model.nb)$p.value ``` The AIC of the negative binomial model is also better (smaller) than that of the Poisson model. ```{r} model.poisson$aic ``` ```{r} model.nb$aic ``` Finally, the main differences to be examined are their standard errors, the 95% confidence intervals and P values. ```{r} summary(model.poisson)$coefficients ``` ```{r} summary(model.nb)$coefficients ``` ```{r} idr.display(model.poisson) ``` ```{r} attach(DHF99) idr.display(model.nb) ``` The standard errors from the negative binomial model are slightly larger than those from the Poisson model resulting in wider 95% confidence intervals and larger P values. From the Poisson regression, both urban community and slum area had a significantly lower risk (around 14% and a half reduction, respectively) for infestation. However, from the negative binomial regression, only the urban community had a significantly lower risk.
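
A quick way to see overdispersion and its effect on the standard errors is sketched below on simulated data (not the DHF99 dataset): every object with the `sim_` prefix is hypothetical, the counts are generated from a negative binomial distribution with arbitrary parameters, and the Pearson dispersion statistic is used here as an informal check in place of `poisgof`.

```{r}
# Illustrative sketch on simulated data (NOT the DHF99 dataset);
# all "sim_" names and the true parameters are hypothetical.
library(MASS)
set.seed(2024)
sim_od <- data.frame(x = rbinom(300, 1, 0.5))
# Overdispersed counts: negative binomial with mean exp(1 + 0.3 * x) and size = 1
sim_od$y <- rnbinom(300, mu = exp(1 + 0.3 * sim_od$x), size = 1)

sim_p  <- glm(y ~ x, family = poisson, data = sim_od)
sim_nb <- glm.nb(y ~ x, data = sim_od)

# Pearson dispersion statistic: values well above 1 suggest overdispersion
sim_disp <- sum(residuals(sim_p, type = "pearson")^2) / df.residual(sim_p)
sim_disp

# Standard errors side by side: those from the Poisson model are too small here
sim_se <- cbind(poisson = coef(summary(sim_p))[, "Std. Error"],
                negbin  = coef(summary(sim_nb))[, "Std. Error"])
sim_se
```
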