---
title: "R applications & Exercises - Block 3 (reduced)"
subtitle: "STATISTICAL LEARNING IN EPIDEMIOLOGY 2023/2024"
author: "Prof. Giulia Barbati & Paolo Dalena"
date: "13/05/2024"
output:
  html_document:
    toc: true
    number_sections: true
    toc_float: true
    theme: united
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Simple Linear Regression (a brief recap)

First of all, install (if needed) and load the required libraries, including *rms*, *Hmisc* and *epiDisplay*.

```{r warning=FALSE,message=FALSE}
library(rms)
library(Hmisc)
library(epiDisplay)
library(ggplot2)
library(psych)
library(here)
```

Let's simulate a sample of n = 100 points and plot it along with the population regression line. The conditional distribution of y|x can be thought of as a vertical slice at x. The unconditional distribution of y is shown on the y-axis. To envision the conditional normal distributions assumed for the underlying population, think of a bell-shaped curve *coming out* of the page, with its base along one of the vertical lines of points. The equal variance assumption (*homoscedasticity*) dictates that the series of Gaussian curves for all the different x values have equal variances.

```{r}
n <- 100
set.seed(13)
x <- round(rnorm(n, .5, .25), 1)
y <- x + rnorm(n, 0, .1)
r <- c(-.2, 1.2)
```

Plot:

```{r}
plot(x, y, axes=FALSE, xlim=r, ylim=r, xlab=expression(x), ylab=expression(y))
axis(1, at=r, labels=FALSE)
axis(2, at=r, labels=FALSE)
abline(a=0,b=1)
histSpike(y, side=2, add=TRUE)
abline(v=.6, lty=2)
```

Simple linear regression is used when:

* Only 2 variables are of interest
* One variable is a response (continuous scale) and one is a predictor
* The mean of the dependent variable is a quantity of interest [otherwise explore, for example, quantile regression]
* No adjustment is needed for confounding or other between-subject variation
* The investigator is interested in assessing the strength of the relationship between x and y in real data units, or in predicting y from x
* A linear relationship is assumed (visual inspection is strongly recommended...)

It is not needed when one only wants to test for association (in that case use Pearson's or Spearman's correlation).

## Interval Estimation: evaluating the uncertainty about predictions

The confidence interval (CI) for a prediction depends on what you want to predict: a value at the *individual* level or a mean at the *population* level.

```{r}
x1 <- c( 1, 3, 5, 6, 7, 9, 11)
y <- c( 5, 10, 70, 58, 85, 89, 135)
dd <- datadist(x1, n.unique=5); options(datadist='dd')
f <- ols(y ~ x1)
p1 <- Predict(f, x1=seq(1,11,length=100), conf.type='mean')
p2 <- Predict(f, x1=seq(1,11,length=100), conf.type='individual')
p <- rbind(Mean=p1, Individual=p2)
ggplot(p, legend.position='none') + geom_point(aes(x1, y), data=data.frame(x1, y, .set.=''))
```

Example usages:

* Is a child of age x smaller than predicted for her age? Use the *individual level*, p2 (wider bands)
* What is the best estimate of the *population mean* blood pressure for patients on treatment A?
Use the *mean population level*, p1 (narrower bands)

## Assessing the Goodness of Fit

It is crucial to verify the assumptions underlying a linear regression model:

* In a scatterplot the spread of y about the fitted line should be constant as x increases, and y vs. x should appear linear
* This is easier to see with a plot of residuals vs estimated values
* In this plot there should be no systematic patterns (no trend in central tendency, no change in spread of points with x)
* A trend in central tendency indicates failure of linearity
* The qqnorm plot of residuals is a useful tool

Here is an example: we fit a linear regression model where x and y should instead have been log transformed:

```{r}
n <- 50
set.seed(2)
res <- rnorm(n, sd=.25)
x <- runif(n)
y <- exp(log(x) + res)
f <- ols(y ~ x)
plot(fitted(f), resid(f))
```

This plot depicts non-constant variance of the residuals, which might call for transforming y. Now, we fit a linear model that should have been quadratic (functional form of X):

```{r}
x <- runif(n, -1, 1)
y <- x ^ 2 + res
f <- ols(y ~ x)
plot(fitted(f), resid(f))
```

Finally, we fit a correct model:

```{r}
y <- x + res
f <- ols(y ~ x)
plot(fitted(f), resid(f))
qqnorm(resid(f)); qqline(resid(f))
```

These plots show the ideal situation of white noise (no trend, constant variance). The qq plot demonstrates approximate normality of residuals, for a sample of size n = 50.

## Application example of simple linear regression: Hookworm & blood loss

The dataset concerns the relationship between hookworm and blood loss from a study conducted in 1970.

```{r}
data(Suwit)
summary(Suwit)
des(Suwit)
```

```{r}
summ(Suwit)
attach(Suwit)
```

The file is clean and ready for analysis (this happens here only for didactic purposes: in real life, you will usually spend at least a couple of hours, if not days, *cleaning* datasets). For example, with this small sample size it is fairly straightforward to verify that there is no repetition of 'id' and no missing values.
The records have been sorted in ascending order of 'worm' (number of worms), ranging from 32 in the first subject to 1929 in the last one. Blood loss ('bloss') is, however, not sorted. The 13th record has the highest blood loss of 86 ml per day, which is very high. The objective of this analysis is to predict blood loss using worm.

First of all, take a look at the data:

```{r}
plot(worm, bloss, xlab="No. of worms", ylab="ml. per day",
     main = "Blood loss by number of hookworms in the bowel")
```

A linear model using the above two variables seems reasonable.

```{r}
lm1 <- lm(bloss ~ worm, data=Suwit)
lm1
```

Displaying the model by typing 'lm1' gives limited information (essentially, the estimated regression line coefficients). To get more information, one can look at the attributes of this model, its summary and attributes of its summary.

```{r}
attr(lm1, "names")
```

```{r}
summary(lm1)
```

The first section of summary shows the formula that was 'called'. The second section gives the distribution of residuals. The pattern is clearly not symmetric. The maximum is too far on the right (34.38) compared to the minimum (-15.84) and the first quartile is further left (-10.81) of the median (0.75) than the third quartile (4.35) is. Otherwise, the median is close to zero.

The third section gives coefficients of the intercept and the effect of 'worm' on blood loss. The intercept is 10.8, meaning that when there are no worms, the blood loss is estimated to be 10.8 ml per day. This is, however, not significantly different from zero as the P value is 0.0618. The coefficient of 'worm' is 0.04, indicating that each worm is associated with an average increase of 0.04 ml of blood loss per day. Although the value is small, it is highly significantly different from zero. When there are many worms, the level of blood loss can be very substantial.

The multiple R-squared value of 0.716 indicates that 71.6% of the variation in the data is explained by the model. The adjusted value is 0.6942.
(The calculation of R-squared is discussed in the analysis of variance section below).

The last section describes more details of the residuals and hypothesis testing on the effect of 'worm' using the F-statistic. The P value from this section (0.0000699) is equal to that tested by the t-distribution in the coefficient section. This F-test more commonly appears in the analysis of variance table.

### Analysis of variance table, R-squared and adjusted R-squared

```{r}
summary(aov(lm1))
```

The above analysis of variance (aov) table breaks down the degrees of freedom, sum of squares and mean square of the outcome (blood loss) by sources (in this case there are only two: worm + residuals).

The so-called 'square' is actually the square of the difference between the value and the mean. The total sum of squares of blood loss is therefore:

```{r}
SST <- sum((bloss-mean(bloss))^2)
SST
```

The sum of squares from residuals is:

```{r}
SSR <- sum(residuals(lm1)^2)
SSR
```

See also the analysis of variance table. The sum of squares of worm, or sum of squares of the difference between the fitted values and the grand mean, is:

```{r}
SSW <- sum((fitted(lm1)-mean(bloss))^2)
SSW
```

The latter two sums add up to the first one. The R-squared is the proportion of sum of squares of the fitted values to the total sum of squares.

```{r}
SSW/SST
```

Instead of sum of squares, one may consider the mean square as the level of variation. In such a case, the number of worms can reduce the total mean square (or variance) by: (total mean square - residual mean square) / total mean square, or (variance - residual mean square) / variance.

```{r}
resid.msq <- sum(residuals(lm1)^2)/lm1$df.residual
Radj <- (var(bloss)- resid.msq)/var(bloss)
Radj
```

This is the adjusted R-squared shown in *summary(lm1)* in the above section.
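
As a cross-check of the decomposition above, the following sketch reproduces R-squared and adjusted R-squared by hand and compares them with the values reported by `summary()`. It runs on simulated data: the objects `x.sim`, `y.sim`, `fit.sim` and the seed are illustrative assumptions, not part of the Suwit dataset.

```{r}
# Illustrative sketch on simulated data (not the Suwit data)
set.seed(1)
n.sim <- 30
x.sim <- runif(n.sim, 0, 100)
y.sim <- 5 + 0.3 * x.sim + rnorm(n.sim, sd = 8)
fit.sim <- lm(y.sim ~ x.sim)

SST.sim <- sum((y.sim - mean(y.sim))^2)            # total sum of squares
SSE.sim <- sum(residuals(fit.sim)^2)               # residual sum of squares
SSM.sim <- sum((fitted(fit.sim) - mean(y.sim))^2)  # model sum of squares

# R-squared as a ratio of sums of squares
R2.sim <- SSM.sim / SST.sim
# Adjusted R-squared as a ratio of mean squares
R2adj.sim <- 1 - (SSE.sim / fit.sim$df.residual) / (SST.sim / (n.sim - 1))

# Each comparison should return TRUE (equality up to rounding error)
all.equal(SSM.sim + SSE.sim, SST.sim)
all.equal(R2.sim, summary(fit.sim)$r.squared)
all.equal(R2adj.sim, summary(fit.sim)$adj.r.squared)
```

The adjusted version penalises the residual sum of squares by its degrees of freedom, which is why it is always slightly smaller than the unadjusted one.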

### F-test

When the mean square of 'worm' is divided by the mean square of residuals, the result is:

```{r}
# named F.stat to avoid masking F, R's shorthand for FALSE
F.stat <- SSW/resid.msq; F.stat
```

Using this F value with the two corresponding degrees of freedom (from 'worm' and residuals) the P value for testing the effect of 'worm' can be computed.

```{r}
pf(F.stat, df1=1, df2=13, lower.tail=FALSE)
```

The function *pf* is used to compute a P value from a given F value together with the two values of the degrees of freedom. The last argument 'lower.tail' is set to FALSE to obtain the right margin of the area under the curve of the F distribution. In summary, both the regression and analysis of variance give the same conclusion: the number of worms has a significant linear relationship with blood loss. Now the regression line can be drawn.

### Regression line, fitted values and residuals

A regression line can be added to the scatter plot with the following command:

```{r}
plot(worm, bloss, xlab="No. of worms", ylab="ml. per day",
     main = "Blood loss by number of hookworms in the bowel", type="n")
abline(lm1)
points(worm, fitted(lm1), pch=18, col="blue")
segments(worm, bloss, worm, fitted(lm1), col="red")
```

The regression line has an intercept of 10.8 and a slope of 0.04. The expected value is the value of blood loss estimated from the regression line with a specific value of 'worm'. A residual is the difference between the observed and expected value. The residuals can be drawn by adding the red segments. The actual values of the residuals can be checked from the specific attribute of the defined linear model.

```{r}
lm1.res <- residuals(lm1)
hist(lm1.res)
```

### Checking the normality of residuals

The histogram of the residuals is not entirely convincing about their normality. However, with such a small sample size, it is difficult to draw any conclusion. A better way to check normality is to plot the residuals against the expected normal scores (a normal Q-Q plot).
A reasonably straight line would indicate normality. Moreover, a test for normality could be calculated.

```{r}
a <- qqnorm(lm1.res)
shapiro.qqnorm(lm1.res, type="n")
text(a$x, a$y, labels=as.character(id))
```

If the residuals were perfectly normally distributed, the text symbols would have formed along the straight dotted line. The graph suggests that the largest residual (13th) is too high (positive) whereas the smallest value (7th) is not large enough (negative). However, the P value from the Shapiro-Wilk test is 0.08, suggesting that the possibility of residuals being normally distributed cannot be rejected.

Finally, the residuals are plotted against the fitted values to see if there is a pattern.

```{r}
plot(fitted(lm1), lm1.res, xlab="Fitted values")
plot(fitted(lm1), lm1.res, xlab="Fitted values", type="n")
text(fitted(lm1), lm1.res, labels=as.character(id))
abline(h=0, col="blue")
```

There is no obvious pattern. The residuals are quite independent of the expected values. With this and the above findings from the *qqnorm* command we may conclude that the residuals are randomly and normally distributed.

The above two diagnostic plots for the model 'lm1' can also be obtained from:

```{r}
par(mfrow=c(1,2))
plot(lm1, which=1:2)
detach(Suwit)
```

### Final conclusion

From the analysis, it is clear that blood loss is associated with the number of hookworms. On average, each additional worm is associated with an increase of 0.04 ml of blood loss per day. The remaining variation in blood loss, apart from hookworm, is attributed to random variation or to other factors that were not measured.

# Multiple linear regression (MLR)

Datasets usually contain many variables collected during a study. It is often useful to see the relationship between two variables within the different levels of a third, categorical variable, i.e. to check for the presence of an interaction. (This is just a didactic example; usually more than 3 variables are involved in MLR models).

## Example: Systolic blood pressure

A small survey on blood pressure was carried out. The objective is to see the hypertensive effect of subjects putting additional table salt on their meal. Gender of subjects is also measured.

```{r}
data(BP)
attach(BP)
des(BP)
```

Note that the maximum systolic and diastolic blood pressures are quite high. There are 20 missing values in *saltadd*. The frequencies of the categorical variables *sex* and *saltadd* are now inspected.

```{r}
describe(data.frame(sex, saltadd))
```

The next step is to create a new age variable from birthdate. The calculation is based on 12th March 2001, the date of the survey (date of entry in the study).

```{r}
age.in.days <- as.Date("2001-03-12") - birthdate
```

There is a leap year every four years. Therefore, an average year will have 365.25 days.

```{r}
class(age.in.days)
age <- as.numeric(age.in.days)/365.25
```

The function *as.numeric* is needed to transform the units of age (difftime); otherwise modelling would not be possible.

```{r}
describeBy(sbp,saltadd)
```

### Recoding missing values into another category

The missing value group has the highest median and average systolic blood pressure. In order to create a new variable with three levels, type:

```{r}
saltadd1 <- saltadd
levels(saltadd1) <- c("no", "yes", "missing")
saltadd1[is.na(saltadd)] <- "missing"
summary(saltadd1)
```

```{r}
summary(aov(age ~ saltadd1))
```

Since there is not enough evidence that the missing group is important, and for additional reasons of simplicity, we will assume MCAR (*missing completely at random*), ignore this group and continue the analysis with the original *saltadd* variable consisting of only two levels. Before doing this, however, a simple regression model and regression line are first fitted.

```{r}
lm1 <- lm(sbp ~ age)
summary(lm1)
```

Although the R-squared is not very high, the p value is small, indicating an important influence of age on systolic blood pressure.

A scatterplot of age against systolic blood pressure is now shown with the regression line added using the *abline* function. This function can accept many different argument forms, including a regression object. If this object has a *coef* method, and it returns a vector of length 1, then the value is taken to be the slope of a line through the origin; otherwise the first two values are taken to be the intercept and slope, as is the case for *lm1*.

```{r}
plot(age, sbp, main = "Systolic BP by age", xlab = "Years", ylab = "mm.Hg")
abline(lm1)
```

Subsequent exploration of residuals suggests a non-significant deviation from normality and no pattern. Details of this can be adopted from the techniques already discussed and are omitted here.

The next step is to provide different plot patterns for different groups of salt habits. Note that here the sample size is lower (we decided to omit missing values for *saltadd*, reducing the analysis to the complete cases):

```{r}
lm2 <- lm(sbp ~ age + saltadd)
summary(lm2)
```

On average, a one year increment of age is associated with an increase in systolic blood pressure of 1.5 mmHg. Adding table salt is associated with a significantly higher systolic blood pressure, by approximately 23 mmHg.

```{r}
plot(age, sbp, main="Systolic BP by age", xlab="Years", ylab="mm.Hg", type="n")
points(age[saltadd=="no"], sbp[saltadd=="no"], col="blue")
points(age[saltadd=="yes"], sbp[saltadd=="yes"], col="red",pch = 18)
```

Note that the red dots corresponding to those who added table salt are higher than the blue circles.

The final task is to draw two regression lines, one for each group. The intercept for non-salt users will be the first coefficient and for salt users will be the first plus the third. The slope for both groups is the same.

```{r}
a0 <- coef(lm2)[1]
a1 <- coef(lm2)[1] + coef(lm2)[3]
b <- coef(lm2)[2]
```

```{r}
plot(age, sbp, main="Systolic BP by age", xlab="Years", ylab="mm.Hg", type="n")
points(age[saltadd=="no"], sbp[saltadd=="no"], col="blue")
points(age[saltadd=="yes"], sbp[saltadd=="yes"], col="red",pch = 18)
abline(a = a0, b = b, col = "blue")
abline(a = a1, b = b, col = "red")
```

Note that the X-axis does not start at zero. Thus the intercepts are out of the plot frame. The red line is for the red points of salt adders and the blue line is for the blue points of non-adders. In this model, age is assumed to have a constant effect on systolic blood pressure independently of added salt. But look at the distributions of the points of the two colours: the red points are higher than the blue ones, but mainly on the right half of the graph.

To fit lines with different slopes (or different 'b' for the abline arguments), a new model is needed, with an *interaction term* between *saltadd* and *age*.

```{r}
lm3 <- lm(sbp ~ age * saltadd)
summary(lm3)
```

For the intercept of the salt users, the second term and the fourth are both zero (since age is zero) but the third should be kept as such. This term is negative. The intercept of salt users is therefore lower than that of the non-users.

```{r}
a0 <- coef(lm3)[1]
a1 <- coef(lm3)[1] + coef(lm3)[3]
```

For the slope of the non-salt users, the second coefficient alone is enough since the first and the third are not involved with each unit of increment of age and the fourth term has *saltadd* being 0. The slope for the salt users group includes the second and the fourth coefficients since *saltaddyes* is 1.

```{r}
b0 <- coef(lm3)[2]
b1 <- coef(lm3)[2] + coef(lm3)[4]
```

```{r}
plot(age, sbp, main="Systolic BP by age", xlab="Years",ylab="mm.Hg", pch=18, col=as.numeric(saltadd))
abline(a = a0, b = b0, col = 1)
abline(a = a1, b = b1, col = 2)
legend("topleft", legend = c("Salt added", "No salt added"),lty=1, col=c("red","black"))
```

Note that *as.numeric(saltadd)* converts the factor levels into the integers 1 (black) and 2 (red), representing the non-salt adders and the salt adders, respectively. These colour codes come from the R colour palette.

This model suggests that at a young age the systolic blood pressure of the two groups is not much different, as the two lines are close together on the left of the plot. For example, at the age of 25, the difference is 5.7 mmHg. Increasing age increases the difference between the two groups. At 70 years of age, the difference is as great as 38 mmHg. In this respect, age *modifies* the effect of adding table salt on blood pressure.

On the other hand, the slope of age is 1.24 mmHg per year among those who did not add salt but becomes 1.24 + 0.72 = 1.96 mmHg per year among the salt adders. Thus, salt adding *modifies* the effect of age. Note that interaction is a statistical term whereas effect modification is the equivalent epidemiological term.

Note also that the coefficient of the interaction term *age:saltaddyes* is not statistically significant! This means that the observed difference between the two slopes is compatible with chance (or that the study is under-powered to detect an interaction). This was in fact just a didactic example to show how to introduce an interaction term in a regression model.

# Causal inference versus prediction

Conceptually, in prediction we make *comparisons* between outcomes across different combinations of values of input variables to predict the probability of an outcome. In causal inference, we ask *what would happen* to an outcome y *as a result of a treatment or intervention*.
Predictive inference relates to comparisons between units (different groups/subjects). Causal inference addresses comparisons of *different treatments* when applied to the *same* unit.

## The FEV dataset

The FEV, which is an acronym for Forced Expiratory Volume, is a measure of how much air a person can exhale (in liters) during a forced breath. In this dataset, the FEV of 606 children, between the ages of 6 and 17, was measured. The dataset also provides additional information on these children: their age, their height, their gender and, most importantly, whether the child is a smoker or a non-smoker (the exposure of interest). This is an observational cross-sectional study. The goal of this study was to find out whether or not smoking has an effect on the FEV of children.

Load the required libraries:

```{r, message = FALSE}
library(tidyverse)
library(DataExplorer)
library(SmartEDA)
library(ggplot2)
library(ggstatsplot)
```

Import the data (the file path is built relative to the project root with *here*):

```{r message=FALSE}
fev <- read_tsv(here("datasets","fev.txt"))
head(fev)
```

There are a few things in the formatting of the data that can be improved upon:

1. Both `gender` and `smoking` can be transformed to factors.
2. The `height` variable is recorded in inches, which are hard to interpret. Let's add a new column, `height_cm`, with the values converted to centimetres using the `mutate` function. (We will not use this variable in this example, however).

```{r}
fev <- fev %>%
  mutate(gender = as.factor(gender)) %>%
  mutate(smoking = as.factor(smoking)) %>%
  mutate(height_cm = height*2.54)
head(fev)
```

## Data Exploration

Now, let's make a first explorative boxplot, showing only the FEV for both smoking categories.

```{r message=FALSE, warning=FALSE}
fev %>%
  ggplot(aes(x=smoking,y=fev,fill=smoking)) +
  scale_fill_manual(values=c("dimgrey","firebrick")) +
  theme_bw() +
  geom_boxplot(outlier.shape=NA) +
  geom_jitter(width = 0.2, size=0.1) +
  ggtitle("Boxplot of FEV versus smoking") +
  ylab("fev (l)") +
  xlab("smoking status")
```

Did you expect these results? It appears that children who smoke have a higher median FEV than children who do not smoke. Should we change legislation worldwide and make smoking obligatory for children?

Maybe there is something else going on in the data. Now, we will generate a similar plot, but we will stratify the data based on age (age as factor).

```{r message=FALSE, warning=FALSE}
fev %>%
  ggplot(aes(x=as.factor(age),y=fev,fill=smoking)) +
  geom_boxplot(outlier.shape=NA) +
  # the jitter width belongs to the position, not to geom_point()
  geom_point(size = 0.1, position = position_jitterdodge(jitter.width = 0.2)) +
  theme_bw() +
  scale_fill_manual(values=c("dimgrey","firebrick")) +
  ggtitle("Boxplot of FEV versus smoking, stratified on age") +
  ylab("fev (l)") +
  xlab("Age (years)")
```

This plot already seems to give us a more plausible picture. First, it seems that we do not have any smoking children of ages 6, 7 or 8. Second, when looking at the results per age "category", it no longer seems to be the case that smokers have a much higher FEV than non-smokers; for the higher ages, the contrary seems true.

This shows that taking confounders into account (age, in this case) is crucial!
If we simply analyse the dataset based on the smoking status and FEV values only, our inference might be incorrect:

```{r}
fit1 <- lm(fev~smoking, data=fev)       # wrong beta0star, beta1star
fit2 <- lm(fev~age+smoking, data=fev)   # "true" beta0, beta2, beta1
fitage <- lm(age~smoking, data=fev)     # gamma0, gamma1
fit1
fit2
fitage
```

```{r}
beta0 <- coef(fit2)[1]
beta2 <- coef(fit2)[2]
beta1 <- coef(fit2)[3]
gamma0 <- coef(fitage)[1]
gamma1 <- coef(fitage)[2]
beta0star <- coef(fit1)[1]
beta1star <- coef(fit1)[2]
```

Check that `beta0star = beta0 + beta2*gamma0`:

```{r}
beta0 + beta2*gamma0
beta0star
```

and also check that the (wrong) `beta1star = beta1 + beta2*gamma1`:

```{r}
beta1 + beta2*gamma1
beta1star
```

Therefore, when age is ignored the estimated smoking effect appears (wrongly) beneficial. *If* the causal inference assumptions hold (see slides) we can consider *beta1* as the average smoking effect in the population under study.

## Factors that affect causal inference estimates

Imbalance and lack of complete overlap can make causal inference difficult. Recall: imbalance is when treatment groups differ with respect to an important covariate. Lack of complete overlap is when some combination of treatment level and covariate level is lacking (no observations, violation of the positivity assumption).
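
The identity checked numerically above can be read as a statement about imbalance. As a short sketch, write the age-adjusted model and the regression of age on smoking in the notation of the code comments (`beta0`, `beta1`, `beta2`, `gamma0`, `gamma1`); substituting the second into the first gives:

$$
\begin{aligned}
E[\text{fev} \mid \text{smoking}] &= \beta_0 + \beta_1\,\text{smoking} + \beta_2\,E[\text{age} \mid \text{smoking}] \\
&= \beta_0 + \beta_1\,\text{smoking} + \beta_2\,(\gamma_0 + \gamma_1\,\text{smoking}) \\
&= \underbrace{(\beta_0 + \beta_2\gamma_0)}_{\beta_0^{*}} + \underbrace{(\beta_1 + \beta_2\gamma_1)}_{\beta_1^{*}}\,\text{smoking}
\end{aligned}
$$

The term $\beta_2\gamma_1$ is the bias in the crude coefficient: it vanishes only if age has no effect on FEV ($\beta_2 = 0$) or if mean age is balanced between smokers and non-smokers ($\gamma_1 = 0$). The same identity holds exactly for the least-squares estimates, which is why the pairs of numbers printed above coincide.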

To explain fev, sex seems to matter, especially among older individuals:

```{r}
plot(fev~age, col=gender, data=fev)
legend("topleft", pch=1, col=1:2, levels(fev$gender))
```

Another way to plot the same thing is:

```{r message=FALSE, warning=FALSE}
fev %>%
  ggplot(aes(x=as.factor(age),y=fev,fill=smoking)) +
  geom_boxplot(outlier.shape=NA) +
  # the jitter width belongs to the position, not to geom_point()
  geom_point(size = 0.1, position = position_jitterdodge(jitter.width = 0.2)) +
  theme_bw() +
  scale_fill_manual(values=c("dimgrey","firebrick")) +
  ggtitle("Boxplot of FEV versus smoking, stratified on age and gender") +
  ylab("fev (l)") +
  xlab("Age (years)") +
  facet_grid(rows = vars(gender))
```

Especially for higher ages, the median FEV is higher for males as compared to females. [This could suggest a kind of *interaction* between gender and age, which could be explored in the regression model even if the interpretation of such an interaction could be quite tricky (many levels for age!)].

Moreover, there is a slight gender imbalance among age categories:

```{r}
counts <- table(fev$gender, as.factor(fev$age))
percentages <- round(prop.table(table(fev$gender, as.factor(fev$age)),2),digits=3)
barplot(percentages, main="Gender distribution", xlab="ages",
        col=c("pink", "darkblue"), legend = rownames(counts))
```

For imbalanced samples, simple comparisons of sample means between groups are not good estimates of treatment or risk factor effects. Model adjustment, where we add the covariate to the model, is one way to better estimate a treatment effect. In this case, for example, we can also add gender to the regression model:

```{r}
fit3 <- lm(fev~age+smoking+gender, data=fev)
summary(fit3)
```

So now the effect of smoking is estimated conditional on both age and gender and, as expected, it is a negative effect on FEV. One possible problem with these estimates, however, is that for some combinations of age-gender-smoking we do not have *any* observed data, so that the extrapolation of the regression model may not be very reliable.
Also in this case (no interaction), if the causal assumptions hold, we can interpret the effect of smoking as a marginal effect as well.

```{r}
index <- fev$smoking==0
counts.noS <- table(fev[index,]$gender, as.factor(fev[index,]$age))
percentages.noS <- round(prop.table(table(fev[index,]$gender, as.factor(fev[index,]$age)),2),digits=3)
index1 <- fev$smoking==1
counts.S <- table(fev[index1,]$gender, as.factor(fev[index1,]$age))
percentages.S <- round(prop.table(table(fev[index1,]$gender, as.factor(fev[index1,]$age)),2),digits=3)
par(mfrow=c(1,2))
barplot(percentages.noS, main="Gender distribution among non-smokers",cex.main=0.8,
        xlab="ages", col=c("pink", "darkblue"), legend = rownames(percentages.noS))
barplot(percentages.S, main="Gender distribution among smokers",cex.main=0.8,
        xlab="ages", col=c("pink", "darkblue"), legend = rownames(percentages.S))
```

Observe that there are in fact no "smoking" children below age 9: lack of complete overlap is when there are no observations at all for some combination(s) of treatment levels / covariate levels.

With lack of complete overlap, there are no data available for some comparisons, so making them requires extrapolation using a model. This is in fact the job that the regression model does, but extrapolation is always a risky business in regions where no data are available. This is an even more serious problem than imbalance.

Matching is a possible strategy in these situations to overcome (avoid) imbalance, even if some data will be discarded (see the propensity score methods examples).
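
One simple way to screen for lack of complete overlap is to cross-tabulate the exposure against the covariate and list the covariate levels with an empty cell. The sketch below does this on a small simulated dataset: the data frame `toy`, its columns and the exposure probabilities are hypothetical stand-ins chosen for illustration, not the FEV data.

```{r}
# Illustrative sketch on simulated data (hypothetical stand-in for the FEV data)
set.seed(42)
toy <- data.frame(age = sample(6:17, 300, replace = TRUE))

# Assumption for illustration: nobody younger than 9 is exposed
p.exposed <- ifelse(toy$age < 9, 0, 0.3)
toy$smoking <- factor(rbinom(nrow(toy), 1, p.exposed), levels = c(0, 1))

# Cross-tabulation of exposure by covariate level
tab <- table(smoking = toy$smoking, age = toy$age)
tab

# Covariate levels with at least one empty exposure cell (no overlap there)
no.overlap.ages <- colnames(tab)[apply(tab == 0, 2, any)]
no.overlap.ages
```

Any age listed in `no.overlap.ages` is one where a smoker versus non-smoker comparison rests entirely on model extrapolation.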

# Estimating causal effects from observational studies using the propensity score approach

First of all, we will load the required libraries:

```{r warning=FALSE,message=FALSE}
library(twang)
library(magrittr)
library(tidyverse)
library(gtsummary)
library(stddiff)
library(ggplot2)
library(data.table)
library(boot)
library(splines)
library(PSAgraphics)
library(Matching)
library(sandwich)
library(survey)
library(rms)
```

We will use a dataset coming from an observational study of 996 patients receiving an initial Percutaneous Coronary Intervention (PCI) at Ohio Heart Health, Christ Hospital, Cincinnati in 1997 and followed for at least 6 months by the staff of the Lindner Center. The patients thought to be more severely diseased were assigned to treatment with abciximab (an expensive, high-molecular-weight IIb/IIIa cascade blocker); in fact, only 298 (29.9 percent) of patients received usual-care-alone with their initial PCI.

Our research question aims at estimating the *treatment effect* of abciximab+PCI (abcix) vs the standard care on the probability of being deceased at 6 months. In this practical, we will apply the methods based on the propensity score.

Measured pre-treatment characteristics that could *confound* the treatment-outcome relationship are:

- acutemi: Recent acute myocardial infarction (0 No, 1 Yes)
- ejecfrac: Left Ejection Fraction (%)
- ves1proc: number of vessels involved (from 1 to 5)
- stent: 1 indicates coronary stent inserted
- diabetic: 1 indicates the subject is diabetic
- height: height of the subject in cm
- female: 1 indicates female subjects

## Exploring the data

```{r}
data("lindner",package="twang")
set.seed(123)
```

```{r}
summary(lindner)
```

How is the treatment variable distributed in the population?

```{r}
# Exposure
lindner %>%
  dplyr::select(abcix) %>%
  tbl_summary()
```

```{r}
ggplot(lindner)+
  geom_bar(aes(x=abcix,fill=as.factor(abcix)),stat="count")+
  scale_fill_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
  theme_classic()
```

Is the outcome rare?

```{r}
# Outcome
lindner$sixMonthDeath <- 1-lindner$sixMonthSurvive
lindner %>%
  dplyr::select(sixMonthDeath) %>%
  tbl_summary()
```

What is the *crude* odds ratio for mortality?

```{r}
# Crude OR
fit.crude <- glm(sixMonthDeath~abcix,family = binomial,data=lindner)
tbl_regression(fit.crude,exponentiate=TRUE)
```

Let's now summarize the confounding variables by treatment group and by outcome, to get a general idea of the observed associations:

```{r}
# Possible confounders
lindner %>%
  dplyr::select(acutemi, ejecfrac, ves1proc, stent, diabetic, female, height) %>%
  tbl_summary()
```

```{r}
# Descriptive statistics of patients' characteristics by treatment group
lindner %>%
  dplyr::select(acutemi, ejecfrac, ves1proc, stent, diabetic, female, height, abcix) %>%
  tbl_summary(by=abcix) %>%
  add_overall() %>%
  add_p()
```

```{r}
# Descriptive statistics of patients' characteristics by outcome
lindner %>%
  dplyr::select(acutemi, ejecfrac, ves1proc, stent, diabetic, female, height, sixMonthDeath) %>%
  tbl_summary(by=sixMonthDeath) %>%
  add_overall() %>%
  add_p()
```

We can now calculate the *Standardized Difference*, which can be used as a measure of balance in the treatment groups. It is a measure of difference between groups that is *independent* of statistical testing (remember that p values always depend on sample size!). It is very similar to the definition of *effect size* that we discussed in Block 2.
It can be defined for a continuous covariate as:

$$SD_{c}=\frac{\overline{x_{1}}-\overline{x_{0}}} {\sqrt{\frac{s_1^2+s_0^2}{2}}}$$

and for a dichotomous covariate as:

$$SD_{d}=\frac{\overline{p_{1}}-\overline{p_{0}}} {\sqrt{\frac{\overline{p_{1}}(1-\overline{p_{1}})+\overline{p_{0}}(1-\overline{p_{0}})}{2}}}$$

As a rough rule of thumb, imbalance is considered present if the absolute standardized difference is greater than 0.1 or 0.2.

```{r}
s1 <- stddiff.numeric(vcol="height",gcol="abcix",data=lindner)
s2 <- stddiff.numeric(vcol="ejecfrac",gcol="abcix",data=lindner)
s3 <- stddiff.numeric(vcol="ves1proc",gcol="abcix",data=lindner)
s4 <- stddiff.binary(vcol="stent",gcol="abcix",data=lindner)
s5 <- stddiff.binary(vcol="female",gcol="abcix",data=lindner)
s6 <- stddiff.binary(vcol="diabetic",gcol="abcix",data=lindner)
s7 <- stddiff.binary(vcol="acutemi",gcol="abcix",data=lindner)

cont.var <- as.data.frame(rbind(s1,s2,s3))
rownames(cont.var) <- c("height", "ejecfrac", "ves1proc")
cont.var

bin.var <- as.data.frame(rbind(s4,s5,s6, s7))
rownames(bin.var) <- c("stent", "female", "diabetic", "acutemi")
bin.var
```

## Estimating the propensity score

Now fit a propensity score model: a logistic regression model with abciximab+PCI (vs. PCI) as the outcome, and the confounders listed in the table above included as covariates. We exclude the variable *height* from the list, since there was no relevant difference between the groups.


```{r}
# Fit a propensity score model
fit.ps <- glm(abcix~ acutemi+ ejecfrac+ ves1proc+ stent+ diabetic+ female,
              data=lindner,family = binomial)
summary(fit.ps)
```

```{r}
# Save the estimated propensity score
lindner$ps <- fitted(fit.ps)
```

```{r}
# Plot estimated ps
ggplot(lindner) +
  geom_boxplot(aes(y = ps,group = as.factor(abcix),col = as.factor(abcix))) +
  scale_y_continuous("Estimated PS") +
  scale_color_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
  theme_classic()
```

```{r}
# The PS is on the x axis of the histogram, so the label goes there
ggplot(lindner) +
  geom_histogram(aes(x = ps,group = as.factor(abcix),fill = as.factor(abcix))) +
  scale_x_continuous("Estimated PS") +
  facet_grid(cols=vars(abcix)) +
  scale_fill_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
  theme_classic()
```

Assess whether there are non-overlapping scores (positivity violation) in the two exposure groups:

```{r}
lindner %>%
  dplyr::select(ps,abcix) %>%
  tbl_summary(type = list(ps~"continuous2"),by=abcix,
              statistic = all_continuous2() ~c( "{median} ({p25}, {p75})", "{min}, {max}"))
```

Investigate overlap:

```{r}
lindner %<>% mutate(overlap=ifelse(ps>=min(ps[abcix==1]) & ps<=max(ps[abcix==0]),1,0))
# non-overlap: treated subjects with a higher ps than any non-abciximab user and
# control subjects with a smaller ps than any abciximab user
with(lindner,table(overlap,abcix))
# column proportions: the margin argument belongs to prop.table(), not to with()
with(lindner,prop.table(table(overlap,abcix),2))
```

In the following steps, we remove the subjects that do not overlap. This reduces the original sample size, but we have to respect the positivity assumption in order to estimate a reasonable causal effect. It makes no sense to include subjects that have a "near-zero" probability of receiving the treatment or of having a "match" in the subsequent analyses.


## First option: "adjusting" for the propensity score

We can use the estimated PS as a covariate in a logistic regression model for the outcome:

```{r}
# Model 1: Linear relationship between ps and outcome
fit.out <- glm(sixMonthDeath~abcix+ps, data=lindner, family = binomial, subset=overlap==1)
# Model summary
summary(fit.out)
```

The second step is to save the predicted probabilities for the treated and the untreated and estimate the causal effects of interest in the population:

```{r}
# fitted values (probabilities)
lindner$predY0 <- fit.out$family$linkinv(coef(fit.out)[1]+coef(fit.out)[3]*lindner$ps) # PCI subjects
lindner$predY1 <- fit.out$family$linkinv(coef(fit.out)[1]+coef(fit.out)[2]+coef(fit.out)[3]*lindner$ps) # PCI+abciximab subjects
# ATE effect
Y1 <- mean(lindner$predY1)
Y0 <- mean(lindner$predY0)
# ATT effect
Y1_1 <- mean(lindner$predY1[lindner$abcix==1])
Y0_1 <- mean(lindner$predY0[lindner$abcix==1])
# ATE effect
Y1-Y0
# ATT effect
Y1_1-Y0_1
# Estimate odds ratios related to the "ATE" and the "ATT"
(Y1/(1-Y1))/(Y0/(1-Y0))
(Y1_1/(1-Y1_1))/(Y0_1/(1-Y0_1))
```

The ATE is quite similar to the ATT, and both indicate a protective effect of PCI+abciximab vs PCI alone. To obtain the corresponding confidence intervals we can use the bootstrap approach. We do not outline this procedure here; see the supplementary code at the end of this practical. This model relies on two additional assumptions: no interaction between propensity score and treatment, and a linear relationship between the propensity score and the outcome (on the logit scale). Do these assumptions appear reasonable here?
We can try to fit different models, and then compare the AIC (reported at the bottom of each summary):

```{r}
# Model 2: Non-linear relationship between ps and outcome
fit.out2 <- glm(sixMonthDeath~abcix+ps+I(ps^2), data=lindner, family = binomial, subset=overlap==1)
summary(fit.out2)
# Model 3: Non-linear relationship between ps and outcome and interaction between ps and treatment
fit.out3 <- glm(sixMonthDeath~abcix*ps+I(ps^2), data=lindner, family = binomial, subset=overlap==1)
summary(fit.out3)
```

It seems that the relationship could be partially non-linear, but the statistical evidence is not very strong, and the same holds for the interaction. So the most parsimonious model to keep is probably Model 1.

## Second option: Stratification

Create propensity score strata: this could be an iterative process, since we should verify that we have enough subjects/events in each stratum.

```{r}
lindner %<>% mutate(strata=cut(ps,quantile(ps,c(0,0.25,0.5,0.75,1)),include.lowest=TRUE,labels=c(1:4)))
```

```{r}
# Check they have been created correctly
summary(lindner$strata)
tapply(lindner$ps,lindner$strata,summary)
```

```{r}
# Look at numbers of events and patients in each stratum/exposure group
table(lindner$sixMonthDeath,lindner$strata,lindner$abcix)
```

These strata seem quite "sparse" in terms of number of events. Another possibility is:

```{r}
# Create propensity score strata
lindner %<>% mutate(strata=cut(ps,quantile(ps,c(0,0.33,0.66,1)),include.lowest=TRUE,labels=c(1:3)))
# Check they have been created correctly
summary(lindner$strata)
# Look at numbers of events and patients in each stratum/exposure group
table(lindner$sixMonthDeath,lindner$strata,lindner$abcix)
```

Remember: we should also check the balance of the confounders within each stratum! See the supplementary material for that.

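
To give an idea of what such a within-stratum balance check could look like, here is a minimal sketch in base R. It runs on simulated toy data, not on *lindner*: the data frame `toy`, its columns `age`, `diab`, `treat` and the helper functions `sd_cont()` and `sd_bin()` are all hypothetical names introduced for illustration. The helpers simply implement the two standardized difference formulas given earlier.

```r
# Toy data (simulated, NOT the lindner data): a binary treatment,
# one continuous and one binary confounder.
set.seed(1)
n <- 600
toy <- data.frame(age = rnorm(n, 60, 10), diab = rbinom(n, 1, 0.3))
toy$treat <- rbinom(n, 1, plogis(-3 + 0.05 * toy$age + 0.5 * toy$diab))

# Propensity score and tertile strata, mirroring the steps above
toy$ps <- fitted(glm(treat ~ age + diab, family = binomial, data = toy))
toy$strata <- cut(toy$ps, quantile(toy$ps, c(0, 1/3, 2/3, 1)),
                  include.lowest = TRUE, labels = 1:3)

# Standardized difference for a continuous covariate (hypothetical helper)
sd_cont <- function(x, g) {
  (mean(x[g == 1]) - mean(x[g == 0])) /
    sqrt((var(x[g == 1]) + var(x[g == 0])) / 2)
}
# Standardized difference for a binary covariate (hypothetical helper)
sd_bin <- function(x, g) {
  p1 <- mean(x[g == 1])
  p0 <- mean(x[g == 0])
  (p1 - p0) / sqrt((p1 * (1 - p1) + p0 * (1 - p0)) / 2)
}

# One row per stratum, one column per confounder
bal_strata <- t(sapply(split(toy, toy$strata), function(d)
  c(age = sd_cont(d$age, d$treat), diab = sd_bin(d$diab, d$treat))))
round(bal_strata, 3)
```

Values well above 0.1 or 0.2 in absolute value within a stratum would suggest residual imbalance, in which case the number of strata or the propensity score model could be reconsidered.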
For now, let's just estimate the OR in each stratum:

```{r}
beta.treat <- numeric(3)
nstrata <- table(lindner$strata)
treated.strata <- table(lindner$strata,lindner$abcix)[,2]
for (i in 1:3){
  ms <- glm(sixMonthDeath~abcix,data=lindner,subset = strata==i,family="binomial")
  beta.treat[i] <- coef(ms)[2]
  print(summary(ms))
}
```

And, finally, let's estimate the *weighted* OR related to the ATE and the ATT by exponentiating a weighted average of the stratum-specific log-ORs:

```{r}
exp(sum(beta.treat*nstrata)/nrow(lindner))
exp(sum(beta.treat*treated.strata)/sum(treated.strata))
```

Here too, we should use a bootstrap approach to estimate the corresponding confidence intervals.

## Third option: Matching

Here we should create a reduced dataset retaining only the patients in the region of overlap:

```{r}
lindner.overlap <- lindner %>% filter(overlap==1)
```

Now we proceed with the matching algorithm: there are plenty of different matching algorithms available in R; here we use the one from *library(Matching)*.

```{r}
library(Matching)
match <- Match(Y=lindner.overlap$sixMonthDeath,
               Tr=lindner.overlap$abcix,
               X=lindner.overlap$ps,
               caliper=0.2, # all matches not equal to or within 0.2 standard deviations of ps are dropped
               M=1, # 1:1
               ties=FALSE,
               replace=TRUE
)
```

```{r}
# Number of pairs
nn <- length(match$index.treated)
# Create matched dataset
lindnerMatched <- cbind(rbind(lindner.overlap[match$index.treated,],
                              lindner.overlap[match$index.control,]),
                        pair=c(1:nn,1:nn))
table(lindner.overlap$abcix)
```

```{r}
# Check number of treated patients
table(lindnerMatched$abcix)
```

```{r}
# Look at people being used multiple times in the matched sample
summary(as.factor(table(match$index.treated)))
summary(as.factor(table(match$index.control)))
```

```{r}
# Look at the propensity score distribution in the matched dataset
ggplot(lindnerMatched) +
  geom_boxplot(aes(y = ps,group = as.factor(abcix),col = as.factor(abcix))) +
  scale_y_continuous("Estimated PS") +
  scale_color_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
  theme_classic()
```

```{r}
# The PS is on the x axis of the histogram, so the label goes there
ggplot(lindnerMatched) +
  geom_histogram(aes(x = ps,group = as.factor(abcix),fill = as.factor(abcix))) +
  scale_x_continuous("Estimated PS") +
  facet_grid(cols=vars(abcix)) +
  scale_fill_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
  theme_classic()
```

Let's now check the balance:

```{r warning=FALSE}
# Balance Diagnostics before and after matching
bal <- MatchBalance(abcix~ stent+ female+ diabetic+ acutemi+ ejecfrac+ ves1proc,
                    data=lindner.overlap, match.out = match)
```

```{r}
lindnerMatched %>%
  dplyr::select(stent,female,diabetic,acutemi,ejecfrac,ves1proc,abcix) %>%
  tbl_summary(by=abcix) %>%
  add_overall() %>%
  add_p()
```

Note that the *stent* variable has not been well balanced by the matching procedure. For this reason, we include this covariate in the regression model for the outcome. Now we estimate the causal effect:

```{r}
fit.out4 <- glm(sixMonthDeath~abcix+stent, data=lindnerMatched,family=binomial)
summary(fit.out4)
```

We can see that *stent* is statistically significant in the model, so it was a good idea to control for it, since it was not well balanced by the matching procedure. As we have already discussed, covariates that are not confounders of the effect of the treatment on the outcome can sometimes be included in the final model as well, in order to obtain more accurate estimates of the effect.

## Fourth option: IPTW

Definition of the weights:

```{r}
# Definition of weights for ATE
lindner.overlap %<>% mutate(w_ATE=case_when(abcix==1~1/ps,
                                            abcix==0~1/(1-ps)))
# Definition of weights for ATT
lindner.overlap %<>% mutate(w_ATT=case_when(abcix==1~1,
                                            abcix==0~ps/(1-ps)))
```

Check the extreme weights: sometimes it is useful to use truncated or stabilized weights, in order to reduce the variance of the final estimates, but we do not cover this aspect here.

```{r}
# Check extremes
quantile(lindner.overlap$w_ATE[lindner.overlap$abcix==1],c(0,0.01,0.05,0.95,0.99,1))
quantile(lindner.overlap$w_ATE[lindner.overlap$abcix==0],c(0,0.01,0.05,0.95,0.99,1))
quantile(lindner.overlap$w_ATT[lindner.overlap$abcix==1],c(0,0.01,0.05,0.95,0.99,1))
quantile(lindner.overlap$w_ATT[lindner.overlap$abcix==0],c(0,0.01,0.05,0.95,0.99,1))
```

```{r}
# Balance diagnostics
# ATE
bal_IPTW_ATE <- dx.wts(x=lindner.overlap$w_ATE,
                       data=lindner.overlap,
                       vars=colnames(lindner.overlap)[4:10],
                       treat.var = colnames(lindner.overlap)[3],
                       estimand = "ATE")
# ATT
bal_IPTW_ATT <- dx.wts(x=lindner.overlap$w_ATT,
                       data=lindner.overlap,
                       x.as.weights = TRUE,
                       vars=colnames(lindner.overlap)[4:10],
                       treat.var = colnames(lindner.overlap)[3],
                       estimand = "ATT")
bal.table(bal_IPTW_ATE)
bal.table(bal_IPTW_ATT)
```

Finally, let's estimate the ATE causal effect on the weighted dataset:

```{r warning=FALSE, message=FALSE}
# Estimate ATE
design.lindnerATE <- svydesign(ids=~1, weights = ~w_ATE, data=lindner.overlap)
fit_iptw_ATE <- svyglm(sixMonthDeath~abcix, family=binomial, design=design.lindnerATE)
tbl_regression(fit_iptw_ATE,exponentiate = TRUE)
```

And the ATT:

```{r}
design.lindnerATT <- svydesign(ids=~1, weights = ~w_ATT, data=lindner.overlap)
fit_iptw_ATT <- svyglm(sixMonthDeath~abcix, family=binomial, design=design.lindnerATT)
tbl_regression(fit_iptw_ATT,exponentiate = TRUE)
```

Here too it is possible to estimate the 95% CI using bootstrap methods (which are in general more robust).

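
The truncated and stabilized weights mentioned earlier in this section are not used in this practical. As a rough illustration only, the sketch below shows one common way to build them on simulated data (not on *lindner*). The object names (`w_ate`, `w_stab`, `w_trunc`) and the 1st/99th percentile cut-offs are assumptions chosen for the example, not recommendations.

```r
# Toy data (simulated): one confounder x, binary treatment, estimated ps
set.seed(2)
n <- 500
x <- rnorm(n)
treat <- rbinom(n, 1, plogis(0.8 * x))
ps <- fitted(glm(treat ~ x, family = binomial))

# Unstabilized ATE weights (same definition used for w_ATE above)
w_ate <- ifelse(treat == 1, 1 / ps, 1 / (1 - ps))

# Stabilized weights: the numerator is the marginal probability of the
# treatment actually received, so the weights average approximately 1
p_treat <- mean(treat)
w_stab <- ifelse(treat == 1, p_treat / ps, (1 - p_treat) / (1 - ps))

# Truncated weights: values beyond the 1st/99th percentiles are set to
# those percentiles (the cut-offs are an arbitrary choice)
lims <- quantile(w_ate, c(0.01, 0.99))
w_trunc <- pmin(pmax(w_ate, lims[1]), lims[2])

# Compare mean and maximum of the three sets of weights
sapply(list(ATE = w_ate, stabilized = w_stab, truncated = w_trunc),
       function(w) c(mean = mean(w), max = max(w)))
```

Either set of weights could then be passed to `svydesign()` in place of the original ones; truncation trades a little bias for a smaller variance.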

## Bootstrap confidence intervals for METHOD 1: covariate adjustment

```{r}
results <- data.frame(ATE=rep(NA,4),ATT=rep(NA,4))
```

```{r}
f_PSadj <- function(data, indices,outcome.formula) {
  d <- data[indices,] # allows boot to select sample
  # estimation of ps
  m1 <- glm(abcix~ acutemi+ ejecfrac+ ves1proc+ stent+ diabetic+ female, data=d, family = binomial)
  d$ps <- fitted.values(m1)
  # overlap
  d %<>% mutate(overlap=ifelse(ps>=min(ps[abcix==1]) & ps<=max(ps[abcix==0]),1,0))
  # outcome model
  m2 <- glm(outcome.formula,data=d,family="binomial",subset = overlap==1)
  if(!m2$converged) print("Model did not converge")
  d$predY0 <- m2$family$linkinv(coef(m2)[1]+coef(m2)[3]*d$ps)
  d$predY1 <- m2$family$linkinv(coef(m2)[1]+coef(m2)[2]+coef(m2)[3]*d$ps)
  Y1 <- mean(d$predY1)
  Y0 <- mean(d$predY0)
  # treated subjects of the bootstrap sample (d), not of the original data
  Y1_1 <- mean(d$predY1[d$abcix==1])
  Y0_1 <- mean(d$predY0[d$abcix==1])
  ATE_PSadj <- (Y1/(1-Y1))/(Y0/(1-Y0))
  ATT_PSadj <- (Y1_1/(1-Y1_1))/(Y0_1/(1-Y0_1))
  return(c(ATE_PSadj,ATT_PSadj))
}

res_boot <- function(obj.boot,type="percent",digits=3){
  suppressWarnings({
    orig <- round(obj.boot$t0,digits)
    ciATE <- paste(round(boot.ci(obj.boot,index=1)[[type]][4:5],digits),collapse = "-")
    ciATT <- paste(round(boot.ci(obj.boot,index=2)[[type]][4:5],digits),collapse = "-")
  })
  res <- paste(orig,rbind(ciATE,ciATT),sep="(")
  res.u <- paste(res,rep(")",2),sep = "")
  return(res.u)
}
```

```{r warning=FALSE}
boot.out <- boot(data=lindner, statistic=f_PSadj,R=1000,outcome.formula=fit.out$formula)
print(boot.out)
# Get 95% confidence interval
results$ATE[1] <- res_boot(boot.out,digits=4)[1]
results$ATT[1] <- res_boot(boot.out,digits=4)[2]
results
```

## Bootstrap confidence intervals for METHOD 2: stratification

With the stratification method the regression algorithm very often has convergence problems, since we have very few events in the strata!

```{r}
f_PSstrat <- function(data, indices) {
  d <- data[indices,] # allows boot to select sample
  m1 <- glm(abcix~ acutemi+ ejecfrac+ ves1proc+ stent+ diabetic+ female, data=d,family = binomial)
  d$ps <- fitted.values(m1)
  quart_PS <- quantile(d$ps,c(0,0.33,0.66,1))
  # include.lowest=TRUE keeps the subject with the minimum ps in stratum 1
  d$strata <- cut(d$ps, quart_PS, include.lowest=TRUE, labels=c(1:3))
  # stratum sizes and log-ORs are recomputed within each bootstrap sample
  beta.treat <- numeric(3)
  nstrata <- table(d$strata)
  treated.strata <- table(d$strata,d$abcix)[,2]
  for (i in 1:3){
    ms <- glm(sixMonthDeath~abcix,data=d,subset = strata==i,family="binomial")
    beta.treat[i] <- coef(ms)[2]
  }
  ATE <- exp(sum(beta.treat*nstrata)/nrow(d))
  ATT <- exp(sum(beta.treat*treated.strata)/sum(treated.strata))
  return(c(ATE,ATT))
}
```

```{r warning=FALSE}
boot.out4 <- boot(data=lindner, statistic=f_PSstrat,R=1000)
# Get 95% confidence interval
results$ATE[2] <- res_boot(boot.out4,digits=3)[1]
results$ATT[2] <- res_boot(boot.out4,digits=3)[2]
```

## Robust confidence intervals for METHOD 3: matching

We now estimate the robust standard errors related to the estimate on the matched dataset:

```{r warning=FALSE}
cov <- vcovHC(fit.out4, type = "HC0")
std.err <- sqrt(diag(cov))
q.val <- qnorm(0.975)
r <- cbind(
  Estimate = coef(fit.out4)
  , "Robust SE" = std.err
  , z = (coef(fit.out4)/std.err)
  , "Pr(>|z|) "= 2 * pnorm(abs(coef(fit.out4)/std.err), lower.tail = FALSE)
  , LL = coef(fit.out4) - q.val * std.err
  , UL = coef(fit.out4) + q.val * std.err
)
# Exponentiate to get the OR
results$ATT[3] <- paste0(round(exp(r[2,1]),4),"(",round(exp(r[2,5]),4),"-",round(exp(r[2,6]),4),")")
```

## Bootstrap confidence intervals for METHOD 4: IPTW

```{r warning=FALSE}
f_IPTW <- function(data, indices) {
  d <- data[indices,] # allows boot to select sample
  # estimation of ps
  m1 <- glm(abcix~ acutemi+ ejecfrac+ ves1proc+ stent+ diabetic+ female, data=d, family = binomial)
  d$ps <- fitted.values(m1)
  # overlap
  d %<>% mutate(overlap=ifelse(ps>=min(ps[abcix==1]) & ps<=max(ps[abcix==0]),1,0)) %>%
    filter(overlap==1)
  # Definition of weights for ATE
  d %<>% mutate(w_ATE=case_when(abcix==1~1/ps,
                                abcix==0~1/(1-ps)))
  # Definition of weights for ATT
  d %<>% mutate(w_ATT=case_when(abcix==1~1,
                                abcix==0~ps/(1-ps)))
  # Estimate ATE
  design.lindnerATE <- svydesign(ids=~1, weights = ~w_ATE, data=d)
  fit_iptw_ATE <- svyglm(sixMonthDeath~abcix, family=binomial, design=design.lindnerATE)
  # Estimate ATT
  design.lindnerATT <- svydesign(ids=~1, weights = ~w_ATT, data=d)
  fit_iptw_ATT <- svyglm(sixMonthDeath~abcix, family=binomial, design=design.lindnerATT)
  ATE <- exp(fit_iptw_ATE$coefficients[2])
  ATT <- exp(fit_iptw_ATT$coefficients[2])
  return(c(ATE,ATT))
}
```

```{r warning=FALSE}
boot.out5 <- boot(data=lindner, statistic=f_IPTW,R=1000)
results$ATE[4] <- res_boot(boot.out5,digits=4)[1]
results$ATT[4] <- res_boot(boot.out5,digits=4)[2]
```

## Final comparison across methods

```{r}
# recall the crude OR from the original dataset
tbl_regression(fit.crude,exponentiate=TRUE)
```

```{r}
# PS results
row.names(results) <- c("PS Adj Linear","Stratification","Matching","IPTW")
results
```

We can observe that the propensity score based methods give a stronger estimate of the treatment effect than the crude odds ratio. We can also observe that stratification produces very unstable results, due to the low sample size in the different strata. In conclusion, adjusting for confounders was important in this context!

# Survey of Health, Ageing and Retirement in Europe: example of an analysis of a real dataset

Load the required libraries:

```{r}
library(haven)
library(tidyverse)
library(magrittr)
```

The data come from SHARE, the *Survey of Health, Ageing and Retirement in Europe* (https://share-eric.eu/data/). It is a multidisciplinary and cross-national panel database of data on health, socio-economic status and social and family networks of about 140000 individuals aged 50 or older (around 380000 interviews). SHARE covers 27 European countries and Israel. The dataset we use here contains only part of the data the survey produced.

The research question is whether a stressful period could be associated with the occurrence of muscular weakness in the over 50 population: therefore we can consider it from the point of view of an *explanatory* (causal) model, not a predictive one. This question is of interest since muscular weakness can be considered a proxy of poor physical health. Muscular strength was defined as the hand grip measured with a dynamometer, and weakness was defined as having a grip strength below a threshold calculated according to BMI and sex. The exposed group was defined as the subjects who underwent a stressful period between the first and the last time they were interviewed; all other subjects were defined as non-exposed. Of note, an inclusion criterion was to have a normal hand strength at the start of the study. Once the subjects were divided into the exposed and non-exposed groups, the hand grip measurement taken two years after the exposure definition was used to define the outcome.

In this example we will consider the following variables:

+ *low_grip*: our binary outcome variable (0: No; 1: Yes)
+ *stress*: our binary exposure variable (0: No; 1: Yes)
+ *age*: age of the subject in years
+ *female*: sex (1: females; 0: males)
+ *ses*: is the household able to make ends meet? (1: Great Difficulties; 2: Some Difficulties; 3: Fairly Good; 4: Good)
+ *paid_job*: has the subject ever done some paid job in their life? (0: No; 1: Yes)
+ *house_move*: has the subject ever moved country in their life? (0: No; 1: Yes)

```{r}
load("Logistic Regression examples.RData")
```

## IDA phase

### Categorical Variables

Before starting to fit any regression model we have to properly code the categorical variables. This is usually done in the initial data analysis (IDA) and descriptive statistics phase. The first step consists in identifying which variables are categorical. This may seem an easy step, but it isn't always. In fact the choice for some variables (e.g.
indexes and scales) is not obvious, and it requires either knowledge of the data we are analysing or a bit more effort in the data modelling. The important message is that we can't assume that it is correct to model a variable as numerical only because it was of type numerical/integer in the dataset imported in R. Once we have identified the variables that we think are categorical, we can process them. There are three cases, each leading to a slightly different procedure according to the R type of the variable:

+ **Binary variables**
+ **Variables of type numerical or integer**
+ **Variables of type character**

### Binary Variables

Binary variables can be treated either as numerical, usually dummy 0/1 variables, or as factors. It does not make any difference in terms of the results obtained from the model, but we have to be aware of how they are coded for a correct interpretation of the model. In general, the model will always have 1 parameter for a binary variable. However, we have to choose which level the parameter should represent by deciding which group to code with 1. The choice should be made according to how we want the model to be interpreted. For example, for the treatment variable, 0 is usually the placebo or the control treatment and 1 is usually the experimental treatment. Alternatively, in the case of the covariates/exposure variables, we could simply use a statistical criterion: for better stability of the model it is preferable to use the *most frequent* category as the reference. For example, it makes more sense to take subjects who have had a paid job as the reference level of *paid_job*. This will help with the stability of the model, and it will also ease the interpretation of the results.
Therefore, we create a new variable *no_paid_job*:

```{r}
datiSHAREStress$no_paid_job <- ifelse(datiSHAREStress$paid_job==0,1,0)
```

### Categorical variables with numerical labels

In this case we must transform the variable into a factor, assign labels to the factor levels and choose a reference level. The reference level is the group that all the other levels will be compared to. In fact, when a factor variable with *c* levels is added to a model, R internally transforms it into *c-1* dummy variables and a model parameter is estimated for each of them. Again, it is important that we choose the reference level. In this case we have *ses*, which we have to transform into a factor:

```{r}
datiSHAREStress$ses <- as.factor(datiSHAREStress$ses)
```

We can now check how R has coded the variable:

```{r}
str(datiSHAREStress$ses)
contrasts(datiSHAREStress$ses)
```

By default, R uses group 1 as the reference level. With the *contrasts()* function we can see how the variable will be parametrized in the regression model. The rows are the variable levels and the columns are the parameters generated for this variable. It is also useful to assign more meaningful labels:

```{r}
datiSHAREStress$ses <- factor(datiSHAREStress$ses, labels = c("Great Difficulties", "Some Difficulties", "Fairly Easily", "Easily"))
```

Last but not least, we can set the reference category of our choice. In this case we choose "Easily", which is the most frequent category.

```{r}
table(datiSHAREStress$ses)
datiSHAREStress$ses <- relevel(datiSHAREStress$ses,ref="Easily")
```

### Character variables

If the variable is of type *character* the steps are similar to the previous ones, with the exception that we won't need to explicitly set the labels. As a reminder, by default R uses the first level in alphabetical order as the reference level.

## General descriptive statistics and plots

We skip this part here, since we already discussed how to describe a dataset in the various IDA examples.
Remember that you should pay particular attention to outliers, missing values, the shapes of the distributions of continuous variables, and rare categories in categorical ones!

## Univariable Logistic Regression (univariable filtering)

A common method for variable selection in the analysis of health data is to fit many models with one covariate at a time. This is also a way to begin to explore the dataset, even if, as discussed, it is not the suggested method to *select* the variables in the model. In this case, we can start with the exposure to stress, our variable of interest. In the medical literature this kind of analysis is often referred to as "univariable analysis". So we start by fitting the model:

```{r}
fit_uni_stress <- glm(low_grip~stress,family = binomial,data=datiSHAREStress)
```

As a reminder, with the binomial family the glm function automatically uses the logit as link function unless we state otherwise. The results of the model can be obtained with:

```{r}
summary(fit_uni_stress)
```

It seems that we can't reject the null hypothesis for $\beta_{stress}$. We can also obtain the 95% Confidence Interval:

```{r}
confint.default(fit_uni_stress)
```

We can fit more univariable models to look for *possible confounders* (previously discussed with an expert, if possible) of the exposure-outcome relationship.

```{r}
fit_uni_age <- glm(low_grip~age,family = binomial,data=datiSHAREStress)
summary(fit_uni_age)
fit_uni_job <- glm(low_grip~no_paid_job,family = binomial,data=datiSHAREStress)
summary(fit_uni_job)
fit_uni_sex <- glm(low_grip~female,family = binomial,data=datiSHAREStress)
summary(fit_uni_sex)
fit_uni_move <- glm(low_grip~house_move,family = binomial,data=datiSHAREStress)
summary(fit_uni_move)
fit_uni_ses <- glm(low_grip~ses,family = binomial,data=datiSHAREStress)
summary(fit_uni_ses)
```

When we have covariates with many levels, such as *ses*, we can use a global test for the variable as a whole (here a likelihood ratio test obtained with *drop1*), which avoids the multiple testing of the single levels.

```{r}
drop1(fit_uni_ses,test="Chisq")
```

## Multivariable Logistic Model

We can now fit a multivariable logistic model with the variables that were found to be associated with the outcome in the previous analysis and that we suspect could act as confounders of the main exposure of interest. Since we do not have the possibility to discuss with an expert, we decide to include in the multivariable regression model all covariates with a p-value < 0.1. This criterion may not be the best one, but it is widely applied in clinical and epidemiological research. We obviously include *stress* in the model, since it is our exposure of interest. We will treat the other variables as possible confounders in the model, and we will explore whether a significant association is present for the exposure of interest. In principle, in the explanatory setting, we should start from a DAG and make the causal assumptions explicit (in order to estimate a causal effect!), but we do not have subject-matter experts here, so in the end we will limit ourselves to a cautious interpretation of the estimated association.

```{r}
fit_multi <- glm(low_grip~stress+no_paid_job+age+house_move+ses,family = binomial,data=datiSHAREStress)
summary(fit_multi)
```

This model estimates the coefficient of each variable adjusted for all the others. This is a key concept of multivariable regression, and it is what allows us to *remove* confounding. At first sight it seems that our exposure of interest is not associated with the outcome. Knowledge of the context the data come from can be used to make hypotheses about possible interactions between the confounders and the exposure. In this case we want to test for an interaction between *no_paid_job* and *stress*. In other words, we want to see if the association between the outcome and the exposure could vary across the levels of *no_paid_job*.
We then fit the model:

```{r}
fit_multi2 <- glm(low_grip~stress*no_paid_job+house_move+age+ses,family = binomial,data=datiSHAREStress)
summary(fit_multi2)
```

As a reminder, using * introduces both the main effects and the interaction between the variables in the model. Alternatively, we can also fit the same model as follows:

```{r}
fit_multi2 <- glm(low_grip~stress+no_paid_job+age+ses+stress:no_paid_job+house_move,family = binomial,data=datiSHAREStress)
```

This second way of adding an interaction term is useful when we have multiple interactions involving the same variable. It seems that a significant interaction is present between stress and the no_paid_job confounder. Note that, in any case, the main effect of stress is still not significant in the model. Now that we have fitted these two models, which one should we choose? We can use a formal statistical test to compare the two nested models:

```{r}
anova(fit_multi,fit_multi2,test="Chisq")
```

The null hypothesis of this test is that the simpler model (i.e. the one with the smaller number of regression parameters) fits the data as well as the more complex model. In this case we reject that hypothesis at the 5% significance level and we keep the model with the interaction.

## Calibration for the logistic regression model (simplest basic method)

Even if we are estimating an explanatory model here, this does not mean that we are not at all interested in evaluating whether the predicted probabilities are in line with the observed event rates. A basic procedure to evaluate this aspect in logistic regression is the Hosmer-Lemeshow test (for details see: https://onlinelibrary.wiley.com/doi/book/10.1002/0471722146). In brief, the null hypothesis of the test is that the predicted probabilities (split into ordered groups) are in line with the observed event rates in each group.

```{r}
library(generalhoslem)
logitgof(datiSHAREStress$low_grip,fitted(fit_multi2))
```

The model doesn't show evidence of poor goodness of fit/calibration. Remember that when we are instead working with models used specifically for prediction, we should use more refined analyses, such as the bootstrap overfitting-corrected calibration curves.

## Interpretation of the logistic regression model

The next step is interpreting the results obtained from the model, which is one of the most important parts of the analysis to be discussed with experts. Of course we are keeping things simple here: in reality we would have had to consider many other candidate confounders in order to properly *control for confounding* in this observational study, at least for the measured ones.

### Continuous covariates

We obtain the OR, which is the measure of association usually reported when interpreting the results of a logistic model. We start with the OR estimated for age:

```{r}
exp(fit_multi2$coefficients)[4]
```

In general, an OR>1 indicates an increase in the probability of the outcome, whereas an OR<1 indicates a decrease. But the question is: with respect to what? We know that the OR is a *relative measure* of association, so we must always ask ourselves what comparison we are making when we report an OR we have estimated. The above output does not mean anything by itself. The first important thing to remember is that age is a continuous variable (measured in years) and that in the model we have assumed it has a linear effect on the log-odds of the outcome. When we obtain the $\hat{OR}$ by exponentiating the coefficient for age, we obtain the OR estimate for an increase of one year of age. Since $\hat{OR}>1$, the probability of developing low grip increases as age increases. Specifically, the odds of having low grip for a subject are 1.11 times those of a subject who is 1 year younger.
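
To make the "one year older vs one year younger" comparison concrete, the following minimal sketch checks it numerically. It uses simulated data rather than the SHARE dataset, and the objects `age`, `y`, `fit` and the helper `odds()` are hypothetical names used only for this illustration.

```r
# Toy data (simulated): binary outcome whose log-odds are linear in age
set.seed(3)
n <- 2000
age <- round(runif(n, 50, 90))
y <- rbinom(n, 1, plogis(-8 + 0.1 * age))
fit <- glm(y ~ age, family = binomial)

# Odds from a probability (hypothetical helper)
odds <- function(p) p / (1 - p)

# Predicted probabilities for two subjects one year apart
p70 <- predict(fit, data.frame(age = 70), type = "response")
p71 <- predict(fit, data.frame(age = 71), type = "response")

# The ratio of their odds coincides with the exponentiated coefficient
or_pred <- unname(odds(p71) / odds(p70))
or_coef <- unname(exp(coef(fit)["age"]))
c(from_predictions = or_pred, from_coefficient = or_coef)
```

The same ratio would be obtained for any other pair of ages one year apart, even though the predicted probabilities themselves differ: this is what the linearity assumption on the log-odds scale implies.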
For continuous variables it is important to report an OR for a difference that is relevant in the application at hand; a proper choice will depend on the scale on which the variable is measured and on the magnitude of the estimated coefficient. For example, here 1 year may not be so relevant from an epidemiological point of view, so we may want to report an OR for a 5-year difference in age:

```{r}
exp(fit_multi2$coefficients*5)[4]
```

### Binary Covariates

What if we want to interpret the OR of a binary variable? Let's consider the variable *house_move*. If it is coded as numerical, then the coefficient always refers to level 1. In this case 1 stands for having moved country in the past, so if the estimated coefficient were significant we could say that subjects who moved seem to have odds lower by 17% ($1-\hat{OR}$) compared to people who have never moved country in their life. However, when we look at the 95% CI we observe that it contains 1, so the association does not seem to be statistically significant.

```{r}
# OR
exp(fit_multi2$coefficients)[8]
exp(confint.default(fit_multi2))[8,]
```

### Categorical Covariates

The interpretation of the OR for categorical variables is similar: we have to keep in mind that we are always comparing each of the variable levels to the reference level. Here, it seems (interestingly!) that the odds of low grip are the same for subjects with good or fairly good self-perceived socio-economic status, while they increase as the economic situation gets worse.

```{r}
# OR
exp(fit_multi2$coefficients)[5:7]
# 95% CI
exp(confint.default(fit_multi2))[5:7,]
```

### Interactions

Finally, we obtain the $\hat{OR}$s for *stress* and *no_paid_job*. Since there is an interaction involved, we have to be a bit more careful and consider the two variables together. So first of all we have to know which is our reference group.
In our case this is the group of subjects who have not experienced a stressful period and have done some paid job in their life (the reference levels for both variables involved in the interaction). Then, we can consider all the combinations of the levels of the two variables. If we want the $\hat{OR}$ for undergoing a stressful period and having had a paid job with respect to not having had a period of stress and having had a paid job, we simply exponentiate the estimated main coefficient for stress (keeping the other covariates in the model at the same values):

```{r}
exp(fit_multi2$coefficients)[2]
```

and we don't reject the null hypothesis that it is equal to 1 (no significant effect):

```{r}
exp(confint.default(fit_multi2))[2,]
```

On the other hand, if we want the $\hat{OR}$ for never having had a paid job and not being exposed to stress with respect to having had a paid job and not being exposed to stress, we simply exponentiate the coefficient for the main effect of the variable no_paid_job:

```{r}
exp(fit_multi2$coefficients)[3]
```

This effect is not statistically significant either, which means that there does not seem to be a difference in the risk of low hand grip with regard to job experience in the non-exposed group (no stress), keeping all the other covariates fixed.

```{r}
exp(confint.default(fit_multi2))[3,]
```

Last, we can obtain the OR for subjects who were stressed and have never had a paid job with respect to the reference group (no stressful period and some paid job):

```{r}
exp(fit_multi2$coefficients[9]+fit_multi2$coefficients[2]+fit_multi2$coefficients[3])
```

So in this subgroup it seems that the probability of developing low grip increases. However, we should also evaluate the 95% CI for this $\hat{OR}$.
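
Since this OR is the exponential of a sum of three coefficients, its Wald confidence interval has to take their covariances into account. The sketch below shows the idea on simulated data (not the SHARE dataset); the variables `a` and `b`, the contrast vector `k` and the object `or_ci` are hypothetical names, and the approach is a plain Wald interval built from `vcov()`.

```r
# Toy data (simulated): two binary covariates with an interaction
set.seed(4)
n <- 3000
a <- rbinom(n, 1, 0.4)
b <- rbinom(n, 1, 0.3)
y <- rbinom(n, 1, plogis(-2 + 0.1 * a + 0.1 * b + 0.7 * a * b))
fit <- glm(y ~ a * b, family = binomial)

# Contrast vector selecting a + b + a:b (the intercept is excluded);
# the order must match names(coef(fit))
k <- c("(Intercept)" = 0, a = 1, b = 1, "a:b" = 1)

# Point estimate and standard error of the linear combination
est <- sum(k * coef(fit))
se <- sqrt(drop(t(k) %*% vcov(fit) %*% k))

# OR with 95% Wald confidence interval
or_ci <- exp(est + c(OR = 0, lower = -1, upper = 1) * qnorm(0.975) * se)
or_ci
```

Summing the three separate confidence limits instead would be incorrect, because it ignores the correlation between the estimated coefficients.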
One possibility is to re-fit the multivariable logistic model using the *rms* R package:

```{r,eval=F,echo=F}
datiSHAREStress$stress <- as.factor(datiSHAREStress$stress)
datiSHAREStress$house_move <- as.factor(datiSHAREStress$house_move)
datiSHAREStress$no_paid_job <- as.factor(datiSHAREStress$no_paid_job)
```

```{r}
library(rms)
dd <- datadist(datiSHAREStress)
options(datadist='dd') # define ranges of the covariates
fit_multi2b <- lrm(low_grip~stress*no_paid_job+age+ses+house_move,data=datiSHAREStress)
```

The code is quite similar; the main difference is that we first have to run the *datadist* function, which stores the distribution summaries of the variables. The print function returns output very similar to the summary of the glm() object.

```{r}
print(fit_multi2b)
```

The summary, on the other hand, gives the estimates of the coefficients as well as the corresponding ORs. Of note, by default for continuous variables such as age the function calculates the OR for the difference between the $1^{st}$ and the $3^{rd}$ quartile.

```{r}
summary(fit_multi2b)
```

So far, the ORs for stress and job experience are the ones for the main effects. Here we can easily obtain the OR for the joint effect of the two variables (both main effects plus the interaction term) with the built-in *contrast* function:

```{r}
c <- contrast(fit_multi2b, list(stress=1,no_paid_job=1), list(stress=0,no_paid_job=0),type="average")
c
# OR and 95% CI
exp(c$Contrast)
exp(c$Lower)
exp(c$Upper)
```

So it seems that there is a significant joint effect of stress and no paid job on the outcome.

## Conclusions

Recall our initial research question: whether a stressful period could be associated with the occurrence of muscular weakness in the over 50 population.
Based on our analysis, it seems that it is not the single event of a stressful period that increases the occurrence of muscular weakness, but rather the joint impact of having a stressful period coupled with no paid job, adjusting for (or independently of) age, house condition and socio-economic status.

# Regression on count data: Poisson regression

## Introduction

In nature, an event usually takes place in a *very small* amount of time. At any given point in time, the probability of encountering such an event is small. Instead of the probability of the *single* event, we now focus on the *frequency* of the events as a density, which means the incidence or 'count' of events over a period of time. (While time is one dimension, the same concept applies to the density of counts of small objects in a two-dimensional area or three-dimensional space.) Moreover, we can assume that one event is independent of another and that the *densities* in different units of time vary with a variance equal to the average density. We can approximate this kind of random process using the Poisson random variable.

When the probability of having an event is affected by some factors, a model is needed to explain and predict the density. Variation among different strata of a population could be explained by the various combinations of factors. *Within each stratum* (defined by a combination of covariates), the distribution of the events is assumed to be random.

Poisson regression deals with outcome variables that are counts in nature (whole numbers or integers). Independent covariates have the same role as those encountered in linear and logistic regression. In epidemiology, Poisson regression is very often used for analysing *grouped* population-based or cohort data, looking at incidence density among person-time contributed by subjects that share similar characteristics of interest.
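
The equality of mean and variance mentioned above can be checked with a small simulation. This is only an illustrative sketch: the rate `lambda` and the object names are arbitrary choices, not part of the course data.

```{r}
set.seed(123)
lambda <- 3                      # assumed average count per unit of time
counts <- rpois(10000, lambda)   # simulated counts in 10,000 units of time
c(mean = mean(counts), variance = var(counts))
```

Both values should be close to `lambda`; a variance clearly larger than the mean (overdispersion) would suggest that the Poisson assumption is questionable.
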
Poisson regression is one of the common regression models used in epidemiological studies. Two others, which are more commonly used, are linear regression and logistic regression, both already covered. A further family is survival methods, which we will explain in the last block (block 4).

There are two main assumptions for Poisson regression:

1) risk is homogeneous among person-times contributed by different subjects who share the same characteristics of interest (e.g. sex, age group) and the same period.
2) asymptotically, or as the sample size becomes larger and larger, the mean of the counts is equal to the variance.

Straightforward linear regression methods (assuming constant variance and normal errors) are not appropriate for count data for four main reasons:

1. the linear model might lead to the prediction of negative counts
2. the variance of the response may increase with the mean
3. the errors will not be normally distributed
4. zero counts are difficult to handle in transformations

Moreover, in studies that involve the time dimension, different subjects may have different person-times of exposure. Analysing risk factors while ignoring differences in person-times is wrong. Poisson regression overcomes these limitations.

Note that in survival analysis using, for example, Cox regression (see block 4), the *hazard ratio* is estimated for each covariate in the model, not the incidence density in each subgroup; in the Cox model the interest is in "how long until an event occurs" (time to event), whereas in the Poisson regression model the focus is on "how many events occur in a given interval".

## Example of Poisson model: the Montana smelter study

The dataset Montana was extracted from an occupational cohort study conducted to test the association between respiratory deaths (outcome) and exposure to arsenic in the industry, after adjusting for various other risk factors/confounders. The main outcome variable is *respdeath*.
This is the number of deaths among the person-years (*personyrs*) contributed by subjects in each category. The other variables are independent covariates, including age group *agegr*, period of employment *period*, starting time of employment *start* and the level of exposure to arsenic during the study period *arsenic* (the exposure of interest).

Read in the data first and examine the variables.

```{r}
data(Montana)
summary(Montana)
```

The last four variables are classed as integers. We need to tell R to interpret them as categorical variables, or factors, and attach labels to each of the levels. This can be done using the factor command with a 'labels' argument included.

```{r}
Montana$agegr <- factor(Montana$agegr, labels=c("40-49","50-59","60-69","70-79"))
Montana$period <- factor(Montana$period, labels=c("1938-1949", "1950-1959","1960-1969", "1970-1977"))
Montana$start <- factor(Montana$start, labels=c("pre-1925", "1925 & after"))
Montana$arsenic1 <- factor(Montana$arsenic, labels=c("<1 year", "1-4years","5-14 years", "15+ years"))
summary(Montana)
```

We keep the original *arsenic* variable unchanged for use later on.

### Descriptive analyses: breakdown of incidence by age and period

Let us explore the person-years breakdown by age and period. Firstly, create a table for total person-years:

```{r}
table.pyears <- tapply(Montana$personyrs, list(Montana$period, Montana$agegr), sum)
```

Carry out the same procedure for the number of deaths, and compute the table of incidence per 10,000 person-years for each cell.


```{r}
table.deaths <- tapply(Montana$respdeath, list(Montana$period, Montana$agegr), sum)
table.inc10000 <- table.deaths/table.pyears*10000
table.inc10000
```

Now, create a time-series plot of the incidence:

```{r}
plot.ts(table.inc10000, plot.type="single", xlab=" ", ylab="#/10,000 person-years",
        xaxt="n", col=c("black", "blue","red","green"), lty=c(2,1,1,2), las=1)
# square size is proportional to the person-years in each cell
points(rep(1:4,4), table.inc10000, pch=22, cex=table.pyears/sum(table.pyears) * 20)
title(main = "Incidence by age and period")
axis(side = 1, at = 1:4, labels = levels(Montana$period))
legend(3.2,40, legend=levels(Montana$agegr)[4:1], col=c("green","red", "blue", "black"),
       bg = "white", lty=c(2,1,1,2))
```

The above graph shows that older age groups are generally associated with a higher risk. On the other hand, the sample size (reflected by the size of the squares at each point) decreases with age. The possibility of a confounding effect of age on the exposure of interest can be better examined by using Poisson regression.

### Modelling with Poisson regression

Let's estimate a Poisson regression model taking into account only period as a covariate:

```{r}
mode11 <- glm(respdeath ~ period, offset = log(personyrs), family = poisson, data=Montana)
summary(mode11)
```

The option *offset = log(personyrs)* allows the variable *personyrs* to be the denominator for the counts of *respdeath*. The logarithmic transformation is needed because the default link function for the Poisson family is the natural log, so the person-years must enter the linear predictor on the log scale.

Recall: an important criterion in the choice of a link function for the various families of distributions is to ensure that the fitted values from the model stay within reasonable bounds. Specifying a log link (the default for Poisson) ensures that the fitted counts are never negative.
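
To make the role of the offset concrete, here is a small sketch on made-up grouped data (the data frame `toy` and its values are hypothetical). With a single two-level covariate the model is saturated, so the exponentiated coefficients reproduce the crude rate in the reference group and the crude rate ratio.

```{r}
# Hypothetical grouped data: deaths and person-years in two exposure groups
toy <- data.frame(group = factor(c("unexposed", "exposed"),
                                 levels = c("unexposed", "exposed")),
                  deaths = c(10, 30),
                  pyears = c(5000, 6000))

fit_toy <- glm(deaths ~ group, offset = log(pyears),
               family = poisson, data = toy)

# Rate in the reference group and rate ratio (exposed vs unexposed)
exp(coef(fit_toy))

# The same quantities computed by hand from the crude rates
crude <- toy$deaths / toy$pyears
c(crude[1], crude[2] / crude[1])
```

Without the offset the model would compare raw counts, ignoring that the two groups contributed different amounts of person-time.
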
For more details on default links for various families of distributions related to generalized linear modelling, see the help in R under *help(family)*.

The first model above, with *period* as the only independent variable, suggests that the death rate increased with time. The model can be tested for goodness of fit and checked for violations of the Poisson assumptions mentioned earlier.

### Goodness of fit test

To test the goodness of fit of the Poisson model, type:

```{r}
poisgof(mode11)
```

The component '$chisq' is actually computed from the model deviance, a quantity reflecting the level of errors. A large chi-squared value with small degrees of freedom indicates a significant violation of the Poisson assumption (p < 0.05). If only the P value is wanted, the command can be shortened.

```{r}
poisgof(mode11)$p.value
```

The P value is very small, indicating a poor fit.

Note: this method works under the assumption of a *large* sample size. An alternative method is to fit a negative binomial regression model (but this is not covered in the slides!).

We now add the second independent variable 'agegr' to the model:

```{r}
mode12 <- glm(respdeath~agegr+period, offset=log(personyrs), family = poisson, data=Montana)
AIC(mode12)
```

The AIC (Akaike Information Criterion) has decreased remarkably from 'model1' to 'model2', indicating a poor fit of the first model.

```{r}
poisgof(mode12)$p.value
```

But 'model2' still violates the Poisson assumption.

```{r}
mode13 <- glm(respdeath ~ agegr, offset = log(personyrs), family = poisson, data=Montana)
AIC(mode13)
poisgof(mode13)$p.value
```

Removal of 'period' further reduces the AIC, but the model still violates the Poisson assumption to the same extent as the previous one. The next step is to add the exposure of interest: 'arsenic1'.


```{r}
mode14 <- glm(respdeath ~ agegr + arsenic1, offset=log(personyrs), family = poisson, data=Montana)
summary(mode14)
```

```{r}
poisgof(mode14)$p.value
```

Fortunately, 'model4' has a much lower AIC than 'model3' and it no longer violates the assumption.

If we change the reference level for arsenic and use *1-4 years* vs the others:

```{r}
Montana$arsenic.b <- relevel(Montana$arsenic1,ref="1-4years")
mode15 <- glm(respdeath ~ agegr + arsenic.b, offset=log(personyrs), family = poisson, data=Montana)
summary(mode15)
```

There does not appear to be any increase in the risk of death from more than 4 years of exposure to arsenic, so it may be worth combining the exposure categories into just two levels:

```{r}
Montana$arsenic2 <- Montana$arsenic1
levels(Montana$arsenic2) <- c("<1 year", rep("1+ years", 3))
model6 <- glm(respdeath ~ agegr + arsenic2, offset=log(personyrs), family=poisson, data=Montana)
summary(model6)
```

At this stage, we would accept 'model6' as the final model, since it has the smallest AIC among all the models that we have tried. We conclude that exposure to arsenic for at least one year is associated with a statistically significant increase in the risk of the disease, by a factor of exp(0.8109) = 2.25, independently of age.

# Minimum Sample Size Required for Developing a Multivariable Model

We now want to compute the minimum sample size required for the development of a new multivariable model using the criteria proposed by Riley et al. The required sample size aims to minimize model overfitting and to ensure that key parameters (such as the model intercept) are estimated precisely. As for any sample size calculation, the approach requires the user to specify anticipated values for key parameters.

The package *pmsampsize* can be used to calculate the minimum sample size for the development of models with continuous, binary or survival (time-to-event) outcomes. Riley et al. lay out a series of criteria the sample size should meet.
These aim to minimise overfitting and to ensure precise estimation of key parameters in the prediction model.

For continuous outcomes, there are four criteria:

1. small overfitting, defined by an expected shrinkage of predictor effects by 10% or less
2. a small absolute difference (0.05 or less) between the model's apparent and adjusted R-squared values
3. precise estimation of the residual standard deviation
4. precise estimation of the average outcome value

The sample size calculation requires the user to pre-specify (e.g. based on previous evidence) the anticipated R-squared of the model, and the average outcome value and standard deviation of outcome values in the population of interest.

For binary or survival (time-to-event) outcomes, there are three major criteria:

1. small overfitting, defined by an expected shrinkage of predictor effects by 10% or less
2. a small absolute difference (0.05 or less) between the model's apparent and adjusted Nagelkerke's R-squared values
3. precise estimation (within +/- 0.05) of the average outcome risk in the population [for a key timepoint of interest for prediction in the case of survival data, see later in block 4].

```{r warning=FALSE,message=FALSE}
library(pmsampsize)
```

Arguments of the function:

**type** specifies the type of analysis for which the sample size is being calculated:

* "c" specifies sample size calculation for a prediction model with a continuous outcome
* "b" specifies sample size calculation for a prediction model with a binary outcome
* "s" specifies sample size calculation for a prediction model with a survival (time-to-event) outcome

**rsquared** specifies the expected value of the (Cox-Snell) R-squared of the new model, where R-squared is the percentage of variation in outcome values explained by the model. For example, the user may input the value of the (Cox-Snell) R-squared reported for a previous prediction model study in the same field.
If taking a value from a previous prediction model development study, users should input the model's adjusted R-squared value, not the apparent R-squared value, as the latter is optimistic (biased). However, if taking the R-squared value from an external validation of a previous model, the apparent R-squared can be used (as the validation data were not used for development, and so the apparent R-squared is then unbiased). Note that for binary and survival outcome models, the Cox-Snell R-squared value is required; this is the generalised version of the well-known R-squared for continuous outcomes, based on the likelihood. The papers by Riley et al. (see references) outline how to obtain the Cox-Snell R-squared value from published studies if it is not reported, using other information (such as the C-statistic [see the cstatistic option below] or Nagelkerke's R-squared). Users should be conservative with their chosen R-squared value; for example, by taking the R-squared value from a previous model, even if they hope their new model will improve performance.

**parameters** specifies the number of candidate predictor parameters for potential inclusion in the new prediction model. Note that this may be larger than the number of candidate predictors, as categorical and continuous predictors often require two or more parameters to be estimated.

**shrinkage** specifies the level of shrinkage desired at internal validation after developing the new model. Shrinkage is a measure of overfitting, and can range from 0 to 1, with higher values denoting less overfitting. A shrinkage of 0.9 is recommended (the default in pmsampsize), which indicates that the predictor effects (beta coefficients) in the model would need to be shrunk by 10% to adjust for overfitting. See references below for further information.

**prevalence** (binary outcome option) specifies the overall outcome proportion (for a prognostic model) or overall prevalence (for a diagnostic model) expected within the model development dataset. This should be derived from previous studies in the same population.

**cstatistic** (binary outcome option) specifies the C-statistic reported in an existing prediction model study, to be used in conjunction with the expected prevalence to approximate the Cox-Snell R-squared using the approach of Riley et al. 2020. Ideally, this should be an optimism-adjusted C-statistic. The approximate Cox-Snell R-squared value is used as described above for the rsquared option, and so is treated as a baseline for the expected performance of the new model.

**seed** (binary outcome option) specifies the initial value of the random-number seed used by the random-number functions when simulating data to approximate the Cox-Snell R-squared based on the reported C-statistic and expected prevalence, as described by Riley et al. 2020.

**rate** (survival outcome option) specifies the overall event rate in the population of interest, for example as obtained from a previous study, for the survival outcome of interest. NB: rate must be given in the time units used for the meanfup and timepoint options.

**timepoint** (survival outcome option) specifies the timepoint of interest for prediction. NB: time units must be the same as given for the meanfup option (e.g. years, months).

**meanfup** (survival outcome option) specifies the average (mean) follow-up time anticipated for individuals in the model development dataset, for example as taken from a previous study in the population of interest. NB: time units must be the same as given for the timepoint option.

**intercept** (continuous outcome option) specifies the average outcome value in the population of interest, e.g. the average blood pressure or average pain score. This could be based on a previous study, or on clinical knowledge.

**sd** (continuous outcome option) specifies the standard deviation (SD) of outcome values in the population, e.g. the SD for blood pressure in patients with all other predictors set to the average. This could again be based on a previous study, or on clinical knowledge.

**mmoe** (continuous outcome option) specifies the multiplicative margin of error (MMOE) acceptable for calculation of the intercept. The default is an MMOE of 10%. The confidence interval for the intercept will be displayed in the output for reference. See references below for further information.

## Continuous outcome (Linear model)

Self-identified race or ethnic group is used to determine normal reference standards in the prediction of pulmonary function. A study was conducted in 2010 to determine whether the genetically determined percentage of African ancestry is associated with lung function and whether its use could improve predictions of lung function among persons who identified themselves as African American (see: https://www.nejm.org/doi/full/10.1056/NEJMoa0907897).

The authors assessed the ancestry of 777 participants self-identified as African American and evaluated the relation between pulmonary function and ancestry by means of linear regression. African ancestry was inversely related to forced expiratory volume in 1 second (FEV1) and forced vital capacity.

Assuming we use 25 candidate parameters, an intercept of 1.9, a standard deviation of 0.6 (from the published paper) and a lower bound for the R-squared of 0.20:

```{r warning=FALSE,message=FALSE}
pmsampsize(type = "c", rsquared = 0.2, parameters = 25, intercept=1.9, sd=0.6)
```

SPP indicates the "Subjects per Predictor Parameter" and, as you can observe, this number varies according to the criterion used (from 10 to 37). The confidence interval reported for the intercept is based on a 10% margin of error.


## Binary outcome (Logistic model)

Chagas disease is a tropical parasitic disease. It is spread mostly by insects known as Triatominae, or "kissing bugs". The symptoms change over the course of the infection. In the early stage, symptoms are typically either not present or mild, and may include fever, swollen lymph nodes, headaches, or swelling at the site of the bite. After four to eight weeks, untreated individuals enter the chronic phase of disease, which in most cases does not result in further symptoms. Up to 45% of people with chronic infection develop heart disease 10-30 years after the initial illness, which can lead to heart failure. Digestive complications, including an enlarged esophagus or an enlarged colon, may also occur in up to 21% of people, and up to 10% of people may experience nerve damage.

With the globalization of Chagas disease, inexperienced health care providers may have difficulties in identifying which patients should be examined for this condition. This study, published in 2016, aimed to develop and validate a diagnostic clinical model for chronic Chagas disease: https://www.scielo.br/j/rsbmt/a/WMwS4xvKGxMBMzybxwkbvVv/?lang=en

We use *pmsampsize* to calculate the minimum sample size required to develop a multivariable prediction model for a binary outcome using 24 candidate predictor parameters. Based on the published evidence, the outcome prevalence is anticipated to be 0.174 (17.4%) and a lower bound (taken from the adjusted Cox-Snell R-squared of an existing prediction model) for the new model's R-squared value is 0.288.

```{r warning=FALSE,message=FALSE}
pmsampsize(type = "b", rsquared = 0.288, parameters = 24, prevalence = 0.174)
```

EPP here indicates the "Events per Predictor Parameter" and, as you can observe, this number varies according to the criterion used (from 1.6 to 4.8).

## Binary outcome (Logistic model) using C-statistic

Here we are interested in developing a diagnostic model for the presence of DVT (deep venous thrombosis).
DVT is a blood clot that forms in a leg vein and may migrate to the lungs, leading to blockage of arterial flow, preventing oxygenation of the blood and potentially causing death. Multivariable diagnostic prediction models have been proposed during the past decades to safely exclude DVT without having to refer patients for further burdensome (reference standard) testing. The reference http://dx.doi.org/10.1016/j.jclinepi.2014.06.018 reported a C-statistic of 0.79 for a model with 8 parameters and an outcome prevalence estimated at 22%.

```{r warning=FALSE,message=FALSE}
pmsampsize(type = "b", cstatistic=0.79, parameters = 9, prevalence = 0.22)
```

## References

Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: from utopia to empirical data. Journal of Clinical Epidemiology. 2016;74:167-176

Riley RD, Ensor J, Snell KIE, Harrell FE, Martin GP, Reitsma JB, et al. Calculating the sample size required for developing a clinical prediction model. BMJ (Clinical research ed). 2020

Riley RD, Snell KIE, Ensor J, Burke DL, Harrell FE Jr, Moons KG, Collins GS. Minimum sample size required for developing a multivariable prediction model: Part I continuous outcomes. Statistics in Medicine. doi: 10.1002/sim.7993

Riley RD, Snell KIE, Ensor J, Burke DL, Harrell FE Jr, Moons KG, Collins GS. Minimum sample size required for developing a multivariable prediction model: Part II binary and time-to-event outcomes. Statistics in Medicine. doi: 10.1002/sim.7992

van Smeden M, Moons KG, de Groot JA, et al. Sample size for binary logistic prediction models: Beyond events per variable criteria. Stat Methods Med Res. 2019;28(8):2455-74

Riley RD, Van Calster B, Collins GS. A note on estimating the Cox-Snell R2 from a reported C statistic (AUROC) to inform sample size calculations for developing a prediction model with a binary outcome. Statistics in Medicine. 2020