First of all, install and load the required libraries :
library(rms)
library(Hmisc)
library(epiDisplay)
library(tidyverse)
library(dplyr)
In epidemiological data, most of the outcomes are often binary or dichotomous. For example, in the investigation of the cause of a disease, the status of the outcome, the disease, is diseased vs non-diseased. For a mortality study, the outcome is usually died or survived. For a continuous variable such as weight or height, we can think that the single representative number for the population or sample is the mean or median. For dichotomous data, the representative number is the proportion or percentage of one type of the outcome. For example, as you should remember from Block 1, prevalence is the proportion of the population with the disease of interest. Case-fatality is the proportion of deaths among the people with the disease. The other related term is probability. Proportion is a simple straightforward term. Probability denotes the likeliness, which is more theoretical (assuming a random mechanism that generate the observed data). In the case of a dichotomous variable, the proportion is used as the estimated probability, for example assuming a binomial random variable behaviour. For computation, the outcome is often represented with 1 and 0 otherwise. The prevalence is then the mean of diseased values among the study sample. Assuming the binomial mechanism, the standard regression model is the logistic regression, using the log(odds) as a dependent variable in the model. If P is the probability of having a disease, 1-P is probability of not having the disease. The odds is thus P/(1-P). As you remind, the relationship between probability and odds, mainly log(odds) can be plotted as follows:
p <- seq(from=0, to=1, by=.01)
odds <- p/(1-p)
plot(log(odds), p, type="l", col="blue", ylab="Probability",
main="Relationship between log odds and probability", las=1)
abline(h=.5)
abline(v=0)
Being on a linear and well balanced scale, the logit is a more appropriate scale for regression model for a binary outcome than the probability itself.
Similarly to the conversion between probabilities and odds, we can devise functions that convert both ways between probability and log-odds:
logit <- function(p) log(p/(1-p))
tigol <- function(t) 1/(1+exp(-t))
(tigol is just the letters of logit reversed).
Modelling logit(Y|X) ~ \(\beta\)X is the general form of logistic regression. It means that we are modelling the logit of Y given X (or conditioning on X), where X denotes one or more independent variables.
Suppose there are independent or exposure variables: X1 and X2. \(\beta\)X would be \(\beta_{0}\)+ \(\beta_{1}\)X1 + \(\beta_{2}\)X2, where \(\beta_{0}\) is the intercept.
The X can be age, sex, and other prognostic variables. In the explanatory/causal setting, among these X variables, one specific X is usually the focus of the study. Others are potential confounders and covariates.
Using logistic regression it turns out that the probability of observing Y given X Pr(Y|X) is equal to exp\(\beta\)X/(1 + exp\(\beta\)X). Hence, logistic regression is often used to compute the probability of an outcome under a given set of exposures. For example, prediction of probability of getting a disease under a given set of age, sex, and behaviour groups, etc.
The dataset Decay is a simple dataset containing two variables: ‘decay’, which is binary and ‘strep’, which is a continuous variable.
data(Decay)
summary(Decay)
## decay strep
## Min. :0.0000 Min. : 0.50
## 1st Qu.:0.0000 1st Qu.: 43.00
## Median :1.0000 Median :105.00
## Mean :0.6307 Mean : 95.25
## 3rd Qu.:1.0000 3rd Qu.:150.00
## Max. :1.0000 Max. :152.50
The outcome variable is ‘decay’, which indicates whether a person has at least one decayed tooth (1) or not (0). The exposure variable is ‘strep’, the number of colony forming units (CFU) of streptococci, a group of bacteria suspected to cause tooth decay. The prevalence of having decayed teeth is equal to the mean of the ‘decay’ variable, i.e. 0.63. To look at the ‘strep’ variable type:
summary(Decay$strep)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.50 43.00 105.00 95.25 150.00 152.50
hist(Decay$strep)
The plot shows that the vast majority have the value at about 150. Since the natural distribution of bacteria is logarithmic, a transformed variable is created and used as the independent variable. Let’s now estimate an univariable logistic regression model:
Decay$log10.strep <- log10(Decay$strep)
glm0 <- glm(decay~log10.strep, family=binomial, data=Decay)
Ask for the summary of the estimated model:
summary(glm0)
##
## Call:
## glm(formula = decay ~ log10.strep, family = binomial, data = Decay)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6674 -1.1951 0.7568 0.8705 1.6369
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.5539 0.5184 -4.927 8.36e-07 ***
## log10.strep 1.6808 0.2764 6.082 1.19e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 574.27 on 435 degrees of freedom
## Residual deviance: 531.83 on 434 degrees of freedom
## AIC: 535.83
##
## Number of Fisher Scoring iterations: 4
Both the coefficients of the intercept and ‘log10.strep’ are statistically significant. Pr(>|z|) for ‘log10.strep’ is the P value from Wald’s test. This tests whether the coefficient, 1.681, is significantly different from 0. In this case it is. The estimated intercept is -2.554. This means that when log10.strep is 0 (or strep equals 1 CFU), the logit of having at least a decayed tooth is -2.55. We can then calculate the related baseline odds and probability.
exp(-2.554) -> baseline.odds
baseline.odds
## [1] 0.07776996
baseline.odds/(1+baseline.odds) -> baseline.prob
baseline.prob
## [1] 0.07215822
There is an odds of 0.077 or a probability of 7.2% of having at least one decayed tooth if the number of CFU of the mutan strep is at 1 CFU.
The coefficient of log10.strep is 1.681. For every unit increment of log10(strep), or an increment of 10 CFU, the logit will increase by 1.681.
This increment of logit is constant but not the increment of probability because the latter is not on a linear scale.
The probability at each point of CFU is computed by replacing both coefficients obtained from the model. For example, at 100 CFU, the probability is:
prob.100 <- coef(glm0)[1] + log10(100)*coef(glm0)[2]
prob.100
## (Intercept)
## 0.8078015
To see the relationship for the whole dataset:
plot(Decay$log10.strep, fitted(glm0), ylim=c(0,0.80))
A logistic nature of the curve is partly demonstrated. To make it clearer, the ranges of X and Y axes are both expanded to allow a more extensive curve fitting.
plot(Decay$log10.strep, fitted(glm0), xlim = c(-2,4),
ylim=c(0,1))
Another vector of the same name ‘log10.strep’ is created in the form of a data frame for plotting a fitted line on the same graph.
newdata <- data.frame(log10.strep=seq(from=-2, to=4, by=.01))
predicted.line <- predict.glm(glm0,newdata,type="response")
The values for predicted line on the above command must be on the same scale as the ‘response’ variable. Since the response is either 0 or 1, the predicted line would be in between, ie. the predicted probability for each value of log10(strep).
plot(Decay$log10.strep, fitted(glm0), xlim = c(-2,4),
ylim=c(0,1), xlab=" ", ylab=" ", xaxt="n", las=1)
lines(newdata$log10.strep, predicted.line, col="blue")
axis(side=1, at=-2:4, labels=as.character(10^(-2:4)))
title(main="Relationship between mutan streptococci \n and probability of tooth decay", xlab="CFU",
ylab="Probability of having decayed teeth")
The above example of caries data has a continuous variable ‘log10.strep’ as the key independent variable.
In most epidemiological datasets, the independent variables are often categorical.
We will examine an example coming from an outbreak investigation.
On 25 August 1990, the local health officer in Supan Buri Province of Thailand reported the occurrence of an outbreak of acute gastrointestinal illness on a national handicapped sports day. Epidemiologists went to investigate. The dataset is called Outbreak. Most variable names are self explanatory. Variables are coded as 0 = no, 1 = yes and 9 = missing/unknown for three food items consumed by participants: ‘beefcurry’ (beef curry), ‘saltegg’ (salted eggs) and ‘water’.
Also on the menu were eclairs, a finger-shaped iced cake of choux pastry filled with cream. This variable records the number of pieces eaten by each participant. Missing values were coded as follows: 88 = “ate but do not remember how much”, while code 90 represents totally missing information.
Some participants experienced gastrointestinal symptoms, such as: nausea, vomiting, abdominal pain and diarrhea. The ages of each participant are recorded in years with 99 representing a missing value.
The variables ‘exptime’ and ‘onset’ are the exposure and onset times, which are in character format.
Let’s look at the data.
data(Outbreak)
summary(Outbreak)
## id sex age exptime
## Min. : 1.0 Min. :0.000 Min. : 1.00 Length:1094
## 1st Qu.: 274.2 1st Qu.:0.000 1st Qu.:14.00 Class :AsIs
## Median : 547.5 Median :1.000 Median :18.00 Mode :character
## Mean : 547.5 Mean :0.659 Mean :23.69
## 3rd Qu.: 820.8 3rd Qu.:1.000 3rd Qu.:24.00
## Max. :1094.0 Max. :1.000 Max. :99.00
## beefcurry saltegg eclair water
## Min. :0.0000 Min. :0.000 Min. : 0.00 Min. :0.000
## 1st Qu.:1.0000 1st Qu.:1.000 1st Qu.: 0.00 1st Qu.:1.000
## Median :1.0000 Median :1.000 Median : 2.00 Median :1.000
## Mean :0.9534 Mean :0.957 Mean :11.48 Mean :1.021
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.: 2.00 3rd Qu.:1.000
## Max. :9.0000 Max. :9.000 Max. :90.00 Max. :9.000
## onset nausea vomiting abdpain
## Length:1094 Min. :0.0000 Min. :0.0000 Min. :0.0000
## Class :AsIs 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Mode :character Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.3985 Mean :0.3793 Mean :0.3501
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## diarrhea
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2148
## 3rd Qu.:0.0000
## Max. :1.0000
We will first of all define the cases. It was agreed among the investigators that a case should be defined as a person who had any of the four symptoms: ‘nausea’, ‘vomiting’, ‘abdpain’ or ‘diarrhea’. A case can then by computed as follows:
Outbreak$case <- (Outbreak$nausea==1)|(Outbreak$vomiting==1)|(Outbreak$abdpain==1)|(Outbreak$diarrhea==1)
The variable ‘case’ is now incorporated into the data.
summary(Outbreak)
## id sex age exptime
## Min. : 1.0 Min. :0.000 Min. : 1.00 Length:1094
## 1st Qu.: 274.2 1st Qu.:0.000 1st Qu.:14.00 Class :AsIs
## Median : 547.5 Median :1.000 Median :18.00 Mode :character
## Mean : 547.5 Mean :0.659 Mean :23.69
## 3rd Qu.: 820.8 3rd Qu.:1.000 3rd Qu.:24.00
## Max. :1094.0 Max. :1.000 Max. :99.00
## beefcurry saltegg eclair water
## Min. :0.0000 Min. :0.000 Min. : 0.00 Min. :0.000
## 1st Qu.:1.0000 1st Qu.:1.000 1st Qu.: 0.00 1st Qu.:1.000
## Median :1.0000 Median :1.000 Median : 2.00 Median :1.000
## Mean :0.9534 Mean :0.957 Mean :11.48 Mean :1.021
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.: 2.00 3rd Qu.:1.000
## Max. :9.0000 Max. :9.000 Max. :90.00 Max. :9.000
## onset nausea vomiting abdpain
## Length:1094 Min. :0.0000 Min. :0.0000 Min. :0.0000
## Class :AsIs 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Mode :character Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.3985 Mean :0.3793 Mean :0.3501
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## diarrhea case
## Min. :0.0000 Mode :logical
## 1st Qu.:0.0000 FALSE:625
## Median :0.0000 TRUE :469
## Mean :0.2148
## 3rd Qu.:0.0000
## Max. :1.0000
We now recode missing values.We do not impute them, so subjects with missing values will be removed from the analyses.
Outbreak = Outbreak %>%
mutate(
age=na_if(age, 99),
beefcurry=na_if(beefcurry, 9),
saltegg=na_if(saltegg, 9),
water=na_if(water, 9),
eclair=na_if(eclair, 90))
summary(Outbreak)
## id sex age exptime
## Min. : 1.0 Min. :0.000 Min. : 1.00 Length:1094
## 1st Qu.: 274.2 1st Qu.:0.000 1st Qu.:14.00 Class :AsIs
## Median : 547.5 Median :1.000 Median :17.00 Mode :character
## Mean : 547.5 Mean :0.659 Mean :19.32
## 3rd Qu.: 820.8 3rd Qu.:1.000 3rd Qu.:22.00
## Max. :1094.0 Max. :1.000 Max. :58.00
## NA's :60
## beefcurry saltegg eclair water
## Min. :0.0000 Min. :0.0000 Min. : 0.000 Min. :0.000
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.: 0.000 1st Qu.:1.000
## Median :1.0000 Median :1.0000 Median : 2.000 Median :1.000
## Mean :0.9164 Mean :0.9201 Mean : 2.072 Mean :0.977
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 2.000 3rd Qu.:1.000
## Max. :1.0000 Max. :1.0000 Max. :80.000 Max. :1.000
## NA's :5 NA's :5 NA's :117 NA's :6
## onset nausea vomiting abdpain
## Length:1094 Min. :0.0000 Min. :0.0000 Min. :0.0000
## Class :AsIs 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Mode :character Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.3985 Mean :0.3793 Mean :0.3501
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## diarrhea case
## Min. :0.0000 Mode :logical
## 1st Qu.:0.0000 FALSE:625
## Median :0.0000 TRUE :469
## Mean :0.2148
## 3rd Qu.:0.0000
## Max. :1.0000
##
The three variables can also be changed to factors:
Outbreak= Outbreak %>%
mutate(
beefcurry=as.factor(beefcurry),
saltegg=as.factor(saltegg),
water=as.factor(water))
summary(Outbreak)
## id sex age exptime beefcurry
## Min. : 1.0 Min. :0.000 Min. : 1.00 Length:1094 0 : 91
## 1st Qu.: 274.2 1st Qu.:0.000 1st Qu.:14.00 Class :AsIs 1 :998
## Median : 547.5 Median :1.000 Median :17.00 Mode :character NA's: 5
## Mean : 547.5 Mean :0.659 Mean :19.32
## 3rd Qu.: 820.8 3rd Qu.:1.000 3rd Qu.:22.00
## Max. :1094.0 Max. :1.000 Max. :58.00
## NA's :60
## saltegg eclair water onset nausea
## 0 : 87 Min. : 0.000 0 : 25 Length:1094 Min. :0.0000
## 1 :1002 1st Qu.: 0.000 1 :1063 Class :AsIs 1st Qu.:0.0000
## NA's: 5 Median : 2.000 NA's: 6 Mode :character Median :0.0000
## Mean : 2.072 Mean :0.3985
## 3rd Qu.: 2.000 3rd Qu.:1.0000
## Max. :80.000 Max. :1.0000
## NA's :117
## vomiting abdpain diarrhea case
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Mode :logical
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 FALSE:625
## Median :0.0000 Median :0.0000 Median :0.0000 TRUE :469
## Mean :0.3793 Mean :0.3501 Mean :0.2148
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
##
All variables now look fine except ‘eclair’ which still contains the value 80 representing “ate but not remember how much”. We will analyse its relationship with ‘case’ by considering it as an ordered categorical variable.
tabpct(Outbreak$eclair, Outbreak$case)
##
## Original table
## Outbreak$case
## Outbreak$eclair FALSE TRUE Total
## 0 279 15 294
## 0.5 5 10 15
## 1 49 41 90
## 2 203 243 446
## 3 9 22 31
## 4 14 33 47
## 5 2 5 7
## 6 8 20 28
## 8 1 5 6
## 10 1 3 4
## 12 1 1 2
## 19 1 0 1
## 20 1 0 1
## 80 5 0 5
## Total 579 398 977
##
## Row percent
## Outbreak$case
## Outbreak$eclair FALSE TRUE Total
## 0 279 15 294
## (94.9) (5.1) (100)
## 0.5 5 10 15
## (33.3) (66.7) (100)
## 1 49 41 90
## (54.4) (45.6) (100)
## 2 203 243 446
## (45.5) (54.5) (100)
## 3 9 22 31
## (29) (71) (100)
## 4 14 33 47
## (29.8) (70.2) (100)
## 5 2 5 7
## (28.6) (71.4) (100)
## 6 8 20 28
## (28.6) (71.4) (100)
## 8 1 5 6
## (16.7) (83.3) (100)
## 10 1 3 4
## (25) (75) (100)
## 12 1 1 2
## (50) (50) (100)
## 19 1 0 1
## (100) (0) (100)
## 20 1 0 1
## (100) (0) (100)
## 80 5 0 5
## (100) (0) (100)
##
## Column percent
## Outbreak$case
## Outbreak$eclair FALSE % TRUE %
## 0 279 (48.2) 15 (3.8)
## 0.5 5 (0.9) 10 (2.5)
## 1 49 (8.5) 41 (10.3)
## 2 203 (35.1) 243 (61.1)
## 3 9 (1.6) 22 (5.5)
## 4 14 (2.4) 33 (8.3)
## 5 2 (0.3) 5 (1.3)
## 6 8 (1.4) 20 (5.0)
## 8 1 (0.2) 5 (1.3)
## 10 1 (0.2) 3 (0.8)
## 12 1 (0.2) 1 (0.3)
## 19 1 (0.2) 0 (0.0)
## 20 1 (0.2) 0 (0.0)
## 80 5 (0.9) 0 (0.0)
## Total 579 (100) 398 (100)
The width of the columns of the mosaic graph denotes the relative frequency of that category. The highest frequency is 2 pieces followed by 0 and 1 piece. The other numbers have relatively low frequencies; particularly the 5 records where ‘eclair’ was coded as 80.
There is a tendency of increasing red area or attack rate from left to right indicating that the risk was increased when more pieces of eclair were consumed.
We will use the distribution of these proportions to guide our grouping of eclair consumption.
The first column of zero consumption has a very low attack rate, therefore it should be a separate category. Only a few took half a piece and this could be combined with those who took only one piece.
Persons consuming 2 pieces should be kept as one category as their frequency is very high. Others who ate more than two pieces should be grouped into another category. Finally, those coded as ‘80’ will be dropped due to the unknown amount of consumption as well as its low frequency.
Remind that collapsing categories with low numbers help in increasing the power of the statistical estimate of the regression coefficients.
Outbreak$eclairgr <- cut(Outbreak$eclair, breaks = c(0, 0.4, 1, 2, 79),
include.lowest = TRUE, labels=c("0","1","2",">2"))
The argument ‘include.lowest=TRUE’ indicates that 0 eclair must be included in the lowest category.
tabpct(Outbreak$eclairgr, Outbreak$case)
##
## Original table
## Outbreak$case
## Outbreak$eclairgr FALSE TRUE Total
## 0 279 15 294
## 1 54 51 105
## 2 203 243 446
## >2 38 89 127
## Total 574 398 972
##
## Row percent
## Outbreak$case
## Outbreak$eclairgr FALSE TRUE Total
## 0 279 15 294
## (94.9) (5.1) (100)
## 1 54 51 105
## (51.4) (48.6) (100)
## 2 203 243 446
## (45.5) (54.5) (100)
## >2 38 89 127
## (29.9) (70.1) (100)
##
## Column percent
## Outbreak$case
## Outbreak$eclairgr FALSE % TRUE %
## 0 279 (48.6) 15 (3.8)
## 1 54 (9.4) 51 (12.8)
## 2 203 (35.4) 243 (61.1)
## >2 38 (6.6) 89 (22.4)
## Total 574 (100) 398 (100)
The attack rate or percentage of diseased in each category of exposure, as shown in the bracket of the column TRUE, increases from 5.1% among those who did not eat any eclairs to 70.1% among those heavy eaters of eclair.
The graph output is similar to the preceding one except that the groups now are more concise. We now have a continuous variable of ‘eclair’ and a categorical variable of ‘eclairgr’. The next step is to create a binary exposure for eclair.
Outbreak$eclair.eat <- Outbreak$eclair > 0
This binary exposure variable is now similar to the others, i.e. ‘beefcurry’, ‘saltegg’ and ‘water’.
We now model ‘case’ as the binary outcome variable and take ‘eclair.eat’ as the only explanatory variable:
glm0 <- glm(case ~ eclair.eat, family=binomial, data=Outbreak)
summary(glm0)
##
## Call:
## glm(formula = case ~ eclair.eat, family = binomial, data = Outbreak)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2827 -1.2827 -0.3236 1.0756 2.4395
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.923 0.265 -11.03 <2e-16 ***
## eclair.eatTRUE 3.167 0.276 11.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1320.7 on 976 degrees of freedom
## Residual deviance: 1055.2 on 975 degrees of freedom
## (117 osservazioni eliminate a causa di valori mancanti)
## AIC: 1059.2
##
## Number of Fisher Scoring iterations: 5
logistic.display(glm0)
##
## Logistic regression predicting case
##
## OR(95%CI) P(Wald's test) P(LR-test)
## eclair.eat 23.75 (13.82,40.79) < 0.001 < 0.001
##
## Log-likelihood = -527.6075
## No. of observations = 977
## AIC value = 1059.2149
The odds ratio from the logistic regression is derived from exponentiation of the estimate, i.e. 23.75 is obtained from:
exp(coef(summary(glm0))[2,1])
## [1] 23.746
The 95% confidence interval of the odds ratio is obtained from:
exp(coef(summary(glm0))[2,1] + c(-1,1) * 1.96 *coef(summary(glm0))[2,2])
## [1] 13.82402 40.78933
These values are close to simple calculation of the 2-by-2 table.
The output from logistic.display also contains the ‘LR-test’ (likelihood ratio test) result, which checks whether the likelihood of the given model, ‘glm0’, would be significantly different from the model without ‘eclair.eat’, which in this case would be the “null” model. For an independent variable with two levels, the LR-test does not add further important information because Wald’s test has already tested the hypothesis. When the independent variable has more than two levels, the LR-test is more important than Wald’s test as the following example demonstrates.
glm1 <- glm(case ~ eclairgr, family=binomial, data=Outbreak)
logistic.display(glm1)
##
## Logistic regression predicting case
##
## OR(95%CI) P(Wald's test) P(LR-test)
## eclairgr: ref.=0 < 0.001
## 1 17.57 (9.21,33.49) < 0.001
## 2 22.27 (12.82,38.66) < 0.001
## >2 43.56 (22.89,82.91) < 0.001
##
## Log-likelihood = -516.8236
## No. of observations = 972
## AIC value = 1041.6471
Interpreting Wald’s test alone, one would conclude that all levels of eclair eaten would be significant. However, this depends on the reference level. By default, R assumes that the first level of an independent factor is the reference level. If we relevel the reference level to be 2 pieces of eclair, Wald’s test gives a different impression.
Outbreak$eclairgr <- relevel(Outbreak$eclairgr, ref="2")
glm2 <- glm(case ~ eclairgr, family=binomial, data=Outbreak)
logistic.display(glm2)
##
## Logistic regression predicting case
##
## OR(95%CI) P(Wald's test) P(LR-test)
## eclairgr: ref.=2 < 0.001
## 0 0.04 (0.03,0.08) < 0.001
## 1 0.79 (0.52,1.21) 0.275
## >2 1.96 (1.28,2.99) 0.002
##
## Log-likelihood = -516.8236
## No. of observations = 972
## AIC value = 1041.6471
The results show that eating only one piece of eclair does not reduce the risk significantly compared to eating two pieces. While results from Wald’s test depend on the reference level of the explanatory variable, the LR-test is concerned only with the contribution of the variable as a whole and ignores the reference level.
Imagine that we want to have an idea of the presence of possible confounders with respect to the effect of eclair and we have not the possibility to talk with an expert, so that we can investigate relationship observing the data. In this case a possibility is to examine univariable effects and then building multivariable models paying attention to the possible change in the effects estimation for the variable of interest when introducing the other. This procedure is not considered as valid as defining a DAG a priori with the help of an expert, from a causal inference point of view,but it give some indications at least.
Therefore, try ‘saltegg’ now as the only explanatory variable:
glm3 <- glm(case ~ saltegg, family = binomial, data=Outbreak)
logistic.display(glm3)
##
## Logistic regression predicting case
##
## OR(95%CI) P(Wald's test) P(LR-test)
## saltegg: 1 vs 0 2.54 (1.53,4.22) < 0.001 < 0.001
##
## Log-likelihood = -736.998
## No. of observations = 1089
## AIC value = 1477.996
The odds ratio for ‘saltegg’ is statistically significant. The number of valid records is also higher than the model containing ‘eclairgr’.
Note: One should always be very careful when analysing data that contain missing values. Advanced methods to handle missing values are beyond the scope of this course and for reasons of simplicity are ignored here. Data scientists should be advised to deal with missing values properly prior to conducting their analysis.
To check whether the odds ratio of ‘saltegg’ is confounded by ‘eclairgr’, the two explanatory variables are put together in the next model:
glm4 <- glm(case ~ eclairgr + saltegg, family=binomial, data=Outbreak)
logistic.display(glm4, crude.p.value=TRUE)
##
## Logistic regression predicting case
##
## crude OR(95%CI) crude P value adj. OR(95%CI)
## eclairgr: ref.=2
## 0 0.04 (0.03,0.08) < 0.001 0.04 (0.03,0.08)
## 1 0.79 (0.52,1.21) 0.275 0.79 (0.51,1.21)
## >2 1.96 (1.28,2.99) 0.002 1.96 (1.28,2.99)
##
## saltegg: 1 vs 0 2.37 (1.41,3.99) 0.001 1.01 (0.53,1.93)
##
## P(Wald's test) P(LR-test)
## eclairgr: ref.=2 < 0.001
## 0 < 0.001
## 1 0.279
## >2 0.002
##
## saltegg: 1 vs 0 0.975 0.975
##
## Log-likelihood = -516.8231
## No. of observations = 972
## AIC value = 1043.6461
The odds ratios of the explanatory variables in glm4 are adjusted for each other. The crude odds ratios are exactly the same as from the previous models with only single variable.
The adjusted odds ratios of ‘eclairgr’ do not change suggesting that it is not confounded by ‘saltegg’, whereas the odds ratio of ‘saltegg’ is clearly changed towards unity, and now has a very large P value.
The difference between the adjusted odds ratio and the crude odds ratio is an indication that ‘saltegg’ is confounded by ‘eclairgr’, which could be considered as an independent risk factor.
Now that we have a model containing two explanatory variables, we can compare models ‘glm4’ and ‘glm2’ using the lrtest command.
lrtest(glm4, glm2)
## Likelihood ratio test for MLE method
## Chi-squared 1 d.f. = 0.0009809013 , P value = 0.9750149
The P value of 0.975 is the same as that from ‘P(LR-test)’ of ‘saltegg’ obtained from the preceding command. The test determines whether removal of ‘saltegg’ in a model would make a significant difference than if it were kept.
When there is more than one explanatory variable, ‘P(LR-test)’ from logistic.display is actually obtained from the lrtest command, which compares the current model against one in which the particular variable is removed, while keeping all remaining variables.
Let’s further add covariates in the model:
glm5 <- glm(case~eclairgr+saltegg+sex, family=binomial, data=Outbreak)
logistic.display(glm5)
##
## Logistic regression predicting case
##
## crude OR(95%CI) adj. OR(95%CI) P(Wald's test) P(LR-test)
## eclairgr: ref.=2 < 0.001
## 0 0.04 (0.03,0.08) 0.04 (0.02,0.07) < 0.001
## 1 0.79 (0.52,1.21) 0.75 (0.49,1.16) 0.2
## >2 1.96 (1.28,2.99) 1.82 (1.19,2.8) 0.006
##
## saltegg: 1 vs 0 2.37 (1.41,3.99) 0.92 (0.48,1.76) 0.807 0.808
##
## sex: 1 vs 0 1.58 (1.19,2.08) 1.85 (1.35,2.53) < 0.001 < 0.001
##
## Log-likelihood = -509.5181
## No. of observations = 972
## AIC value = 1031.0361
The third explanatory variable ‘sex’ is another independent risk factor. Since females are the reference level, males have an increased risk compared to females. This variable is not considered a confounder to either of the preceding variables because it has not substantially changed the odds ratios of any of them (from ‘glm4’). The reason for not being able to confound is its lack of association with either of the preceding explanatory variables. In other words, males and females were not different in terms of eating eclairs and salted eggs.