First of all, install and load the required libraries :

library(rms)
library(Hmisc)
library(epiDisplay)
library(tidyverse)
library(dplyr)

Introduction

In epidemiological data, most of the outcomes are often binary or dichotomous. For example, in the investigation of the cause of a disease, the status of the outcome, the disease, is diseased vs non-diseased. For a mortality study, the outcome is usually died or survived. For a continuous variable such as weight or height, we can think that the single representative number for the population or sample is the mean or median. For dichotomous data, the representative number is the proportion or percentage of one type of the outcome. For example, as you should remember from Block 1, prevalence is the proportion of the population with the disease of interest. Case-fatality is the proportion of deaths among the people with the disease. The other related term is probability. Proportion is a simple straightforward term. Probability denotes the likeliness, which is more theoretical (assuming a random mechanism that generate the observed data). In the case of a dichotomous variable, the proportion is used as the estimated probability, for example assuming a binomial random variable behaviour. For computation, the outcome is often represented with 1 and 0 otherwise. The prevalence is then the mean of diseased values among the study sample. Assuming the binomial mechanism, the standard regression model is the logistic regression, using the log(odds) as a dependent variable in the model. If P is the probability of having a disease, 1-P is probability of not having the disease. The odds is thus P/(1-P). As you remind, the relationship between probability and odds, mainly log(odds) can be plotted as follows:

p <- seq(from=0, to=1, by=.01)
odds <- p/(1-p)
plot(log(odds), p, type="l", col="blue", ylab="Probability",
main="Relationship between log odds and probability", las=1)
abline(h=.5)
abline(v=0)

Being on a linear and well balanced scale, the logit is a more appropriate scale for regression model for a binary outcome than the probability itself.

Similarly to the conversion between probabilities and odds, we can devise functions that convert both ways between probability and log-odds:

logit <- function(p) log(p/(1-p))
tigol <- function(t) 1/(1+exp(-t))

(tigol is just the letters of logit reversed).

Modelling logit(Y|X) ~ \(\beta\)X is the general form of logistic regression. It means that we are modelling the logit of Y given X (or conditioning on X), where X denotes one or more independent variables.

Suppose there are independent or exposure variables: X1 and X2. \(\beta\)X would be \(\beta_{0}\)+ \(\beta_{1}\)X1 + \(\beta_{2}\)X2, where \(\beta_{0}\) is the intercept.

The X can be age, sex, and other prognostic variables. In the explanatory/causal setting, among these X variables, one specific X is usually the focus of the study. Others are potential confounders and covariates.

Using logistic regression it turns out that the probability of observing Y given X Pr(Y|X) is equal to exp\(\beta\)X/(1 + exp\(\beta\)X). Hence, logistic regression is often used to compute the probability of an outcome under a given set of exposures. For example, prediction of probability of getting a disease under a given set of age, sex, and behaviour groups, etc.

Example: Tooth decay

The dataset Decay is a simple dataset containing two variables: ‘decay’, which is binary and ‘strep’, which is a continuous variable.

data(Decay)
summary(Decay)

##      decay            strep       
##  Min.   :0.0000   Min.   :  0.50  
##  1st Qu.:0.0000   1st Qu.: 43.00  
##  Median :1.0000   Median :105.00  
##  Mean   :0.6307   Mean   : 95.25  
##  3rd Qu.:1.0000   3rd Qu.:150.00  
##  Max.   :1.0000   Max.   :152.50

The outcome variable is ‘decay’, which indicates whether a person has at least one decayed tooth (1) or not (0). The exposure variable is ‘strep’, the number of colony forming units (CFU) of streptococci, a group of bacteria suspected to cause tooth decay. The prevalence of having decayed teeth is equal to the mean of the ‘decay’ variable, i.e. 0.63. To look at the ‘strep’ variable type:

summary(Decay$strep)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.50   43.00  105.00   95.25  150.00  152.50

hist(Decay$strep)

The plot shows that the vast majority have the value at about 150. Since the natural distribution of bacteria is logarithmic, a transformed variable is created and used as the independent variable. Let’s now estimate an univariable logistic regression model:

Decay$log10.strep <- log10(Decay$strep)
glm0 <- glm(decay~log10.strep, family=binomial, data=Decay)

Ask for the summary of the estimated model:

summary(glm0)

## 
## Call:
## glm(formula = decay ~ log10.strep, family = binomial, data = Decay)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6674  -1.1951   0.7568   0.8705   1.6369  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.5539     0.5184  -4.927 8.36e-07 ***
## log10.strep   1.6808     0.2764   6.082 1.19e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 574.27  on 435  degrees of freedom
## Residual deviance: 531.83  on 434  degrees of freedom
## AIC: 535.83
## 
## Number of Fisher Scoring iterations: 4

Both the coefficients of the intercept and ‘log10.strep’ are statistically significant. Pr(>|z|) for ‘log10.strep’ is the P value from Wald’s test. This tests whether the coefficient, 1.681, is significantly different from 0. In this case it is. The estimated intercept is -2.554. This means that when log10.strep is 0 (or strep equals 1 CFU), the logit of having at least a decayed tooth is -2.55. We can then calculate the related baseline odds and probability.

exp(-2.554) -> baseline.odds
baseline.odds

## [1] 0.07776996

baseline.odds/(1+baseline.odds) -> baseline.prob
baseline.prob

## [1] 0.07215822

There is an odds of 0.077 or a probability of 7.2% of having at least one decayed tooth if the number of CFU of the mutan strep is at 1 CFU.

The coefficient of log10.strep is 1.681. For every unit increment of log10(strep), or an increment of 10 CFU, the logit will increase by 1.681.

This increment of logit is constant but not the increment of probability because the latter is not on a linear scale.

The probability at each point of CFU is computed by replacing both coefficients obtained from the model. For example, at 100 CFU, the probability is:

prob.100 <- coef(glm0)[1] + log10(100)*coef(glm0)[2]
prob.100

## (Intercept) 
##   0.8078015

To see the relationship for the whole dataset:

plot(Decay$log10.strep, fitted(glm0), ylim=c(0,0.80))

A logistic nature of the curve is partly demonstrated. To make it clearer, the ranges of X and Y axes are both expanded to allow a more extensive curve fitting.

plot(Decay$log10.strep, fitted(glm0), xlim = c(-2,4),
ylim=c(0,1))

Another vector of the same name ‘log10.strep’ is created in the form of a data frame for plotting a fitted line on the same graph.

newdata        <- data.frame(log10.strep=seq(from=-2, to=4, by=.01))
predicted.line <- predict.glm(glm0,newdata,type="response")

The values for predicted line on the above command must be on the same scale as the ‘response’ variable. Since the response is either 0 or 1, the predicted line would be in between, ie. the predicted probability for each value of log10(strep).

plot(Decay$log10.strep, fitted(glm0), xlim = c(-2,4),
ylim=c(0,1), xlab=" ", ylab=" ", xaxt="n", las=1)
lines(newdata$log10.strep, predicted.line, col="blue")
axis(side=1, at=-2:4, labels=as.character(10^(-2:4)))
title(main="Relationship between mutan streptococci \n and probability of tooth decay", xlab="CFU",
ylab="Probability of having decayed teeth")

Another example of logistic regression

The above example of caries data has a continuous variable ‘log10.strep’ as the key independent variable.

In most epidemiological datasets, the independent variables are often categorical.

We will examine an example coming from an outbreak investigation.

On 25 August 1990, the local health officer in Supan Buri Province of Thailand reported the occurrence of an outbreak of acute gastrointestinal illness on a national handicapped sports day. Epidemiologists went to investigate. The dataset is called Outbreak. Most variable names are self explanatory. Variables are coded as 0 = no, 1 = yes and 9 = missing/unknown for three food items consumed by participants: ‘beefcurry’ (beef curry), ‘saltegg’ (salted eggs) and ‘water’.

Also on the menu were eclairs, a finger-shaped iced cake of choux pastry filled with cream. This variable records the number of pieces eaten by each participant. Missing values were coded as follows: 88 = “ate but do not remember how much”, while code 90 represents totally missing information.

Some participants experienced gastrointestinal symptoms, such as: nausea, vomiting, abdominal pain and diarrhea. The ages of each participant are recorded in years with 99 representing a missing value.

The variables ‘exptime’ and ‘onset’ are the exposure and onset times, which are in character format.

Let’s look at the data.

data(Outbreak)
summary(Outbreak)

##        id              sex             age          exptime         
##  Min.   :   1.0   Min.   :0.000   Min.   : 1.00   Length:1094       
##  1st Qu.: 274.2   1st Qu.:0.000   1st Qu.:14.00   Class :AsIs       
##  Median : 547.5   Median :1.000   Median :18.00   Mode  :character  
##  Mean   : 547.5   Mean   :0.659   Mean   :23.69                     
##  3rd Qu.: 820.8   3rd Qu.:1.000   3rd Qu.:24.00                     
##  Max.   :1094.0   Max.   :1.000   Max.   :99.00                     
##    beefcurry         saltegg          eclair          water      
##  Min.   :0.0000   Min.   :0.000   Min.   : 0.00   Min.   :0.000  
##  1st Qu.:1.0000   1st Qu.:1.000   1st Qu.: 0.00   1st Qu.:1.000  
##  Median :1.0000   Median :1.000   Median : 2.00   Median :1.000  
##  Mean   :0.9534   Mean   :0.957   Mean   :11.48   Mean   :1.021  
##  3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.: 2.00   3rd Qu.:1.000  
##  Max.   :9.0000   Max.   :9.000   Max.   :90.00   Max.   :9.000  
##     onset               nausea          vomiting         abdpain      
##  Length:1094        Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  Class :AsIs        1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :0.0000   Median :0.0000  
##                     Mean   :0.3985   Mean   :0.3793   Mean   :0.3501  
##                     3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##     diarrhea     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2148  
##  3rd Qu.:0.0000  
##  Max.   :1.0000

We will first of all define the cases. It was agreed among the investigators that a case should be defined as a person who had any of the four symptoms: ‘nausea’, ‘vomiting’, ‘abdpain’ or ‘diarrhea’. A case can then by computed as follows:

Outbreak$case <- (Outbreak$nausea==1)|(Outbreak$vomiting==1)|(Outbreak$abdpain==1)|(Outbreak$diarrhea==1)

The variable ‘case’ is now incorporated into the data.

summary(Outbreak)

##        id              sex             age          exptime         
##  Min.   :   1.0   Min.   :0.000   Min.   : 1.00   Length:1094       
##  1st Qu.: 274.2   1st Qu.:0.000   1st Qu.:14.00   Class :AsIs       
##  Median : 547.5   Median :1.000   Median :18.00   Mode  :character  
##  Mean   : 547.5   Mean   :0.659   Mean   :23.69                     
##  3rd Qu.: 820.8   3rd Qu.:1.000   3rd Qu.:24.00                     
##  Max.   :1094.0   Max.   :1.000   Max.   :99.00                     
##    beefcurry         saltegg          eclair          water      
##  Min.   :0.0000   Min.   :0.000   Min.   : 0.00   Min.   :0.000  
##  1st Qu.:1.0000   1st Qu.:1.000   1st Qu.: 0.00   1st Qu.:1.000  
##  Median :1.0000   Median :1.000   Median : 2.00   Median :1.000  
##  Mean   :0.9534   Mean   :0.957   Mean   :11.48   Mean   :1.021  
##  3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.: 2.00   3rd Qu.:1.000  
##  Max.   :9.0000   Max.   :9.000   Max.   :90.00   Max.   :9.000  
##     onset               nausea          vomiting         abdpain      
##  Length:1094        Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  Class :AsIs        1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :0.0000   Median :0.0000  
##                     Mean   :0.3985   Mean   :0.3793   Mean   :0.3501  
##                     3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##     diarrhea         case        
##  Min.   :0.0000   Mode :logical  
##  1st Qu.:0.0000   FALSE:625      
##  Median :0.0000   TRUE :469      
##  Mean   :0.2148                  
##  3rd Qu.:0.0000                  
##  Max.   :1.0000

Recoding missing values

We now recode missing values.We do not impute them, so subjects with missing values will be removed from the analyses.

Outbreak = Outbreak %>%
  mutate(
    age=na_if(age, 99),
    beefcurry=na_if(beefcurry, 9),
    saltegg=na_if(saltegg, 9),
    water=na_if(water, 9),
    eclair=na_if(eclair, 90))
summary(Outbreak)

##        id              sex             age          exptime         
##  Min.   :   1.0   Min.   :0.000   Min.   : 1.00   Length:1094       
##  1st Qu.: 274.2   1st Qu.:0.000   1st Qu.:14.00   Class :AsIs       
##  Median : 547.5   Median :1.000   Median :17.00   Mode  :character  
##  Mean   : 547.5   Mean   :0.659   Mean   :19.32                     
##  3rd Qu.: 820.8   3rd Qu.:1.000   3rd Qu.:22.00                     
##  Max.   :1094.0   Max.   :1.000   Max.   :58.00                     
##                                   NA's   :60                        
##    beefcurry         saltegg           eclair           water      
##  Min.   :0.0000   Min.   :0.0000   Min.   : 0.000   Min.   :0.000  
##  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.: 0.000   1st Qu.:1.000  
##  Median :1.0000   Median :1.0000   Median : 2.000   Median :1.000  
##  Mean   :0.9164   Mean   :0.9201   Mean   : 2.072   Mean   :0.977  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 2.000   3rd Qu.:1.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :80.000   Max.   :1.000  
##  NA's   :5        NA's   :5        NA's   :117      NA's   :6      
##     onset               nausea          vomiting         abdpain      
##  Length:1094        Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  Class :AsIs        1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :0.0000   Median :0.0000  
##                     Mean   :0.3985   Mean   :0.3793   Mean   :0.3501  
##                     3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                       
##     diarrhea         case        
##  Min.   :0.0000   Mode :logical  
##  1st Qu.:0.0000   FALSE:625      
##  Median :0.0000   TRUE :469      
##  Mean   :0.2148                  
##  3rd Qu.:0.0000                  
##  Max.   :1.0000                  
##

The three variables can also be changed to factors:

Outbreak= Outbreak %>%
  mutate(
    beefcurry=as.factor(beefcurry),
     saltegg=as.factor(saltegg),
  water=as.factor(water))
summary(Outbreak)

##        id              sex             age          exptime          beefcurry 
##  Min.   :   1.0   Min.   :0.000   Min.   : 1.00   Length:1094        0   : 91  
##  1st Qu.: 274.2   1st Qu.:0.000   1st Qu.:14.00   Class :AsIs        1   :998  
##  Median : 547.5   Median :1.000   Median :17.00   Mode  :character   NA's:  5  
##  Mean   : 547.5   Mean   :0.659   Mean   :19.32                                
##  3rd Qu.: 820.8   3rd Qu.:1.000   3rd Qu.:22.00                                
##  Max.   :1094.0   Max.   :1.000   Max.   :58.00                                
##                                   NA's   :60                                   
##  saltegg         eclair        water         onset               nausea      
##  0   :  87   Min.   : 0.000   0   :  25   Length:1094        Min.   :0.0000  
##  1   :1002   1st Qu.: 0.000   1   :1063   Class :AsIs        1st Qu.:0.0000  
##  NA's:   5   Median : 2.000   NA's:   6   Mode  :character   Median :0.0000  
##              Mean   : 2.072                                  Mean   :0.3985  
##              3rd Qu.: 2.000                                  3rd Qu.:1.0000  
##              Max.   :80.000                                  Max.   :1.0000  
##              NA's   :117                                                     
##     vomiting         abdpain          diarrhea         case        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Mode :logical  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   FALSE:625      
##  Median :0.0000   Median :0.0000   Median :0.0000   TRUE :469      
##  Mean   :0.3793   Mean   :0.3501   Mean   :0.2148                  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000                  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000                  
##

All variables now look fine except ‘eclair’ which still contains the value 80 representing “ate but not remember how much”. We will analyse its relationship with ‘case’ by considering it as an ordered categorical variable.

tabpct(Outbreak$eclair, Outbreak$case)

## 
## Original table 
##                Outbreak$case
## Outbreak$eclair  FALSE  TRUE  Total
##           0        279    15    294
##           0.5        5    10     15
##           1         49    41     90
##           2        203   243    446
##           3          9    22     31
##           4         14    33     47
##           5          2     5      7
##           6          8    20     28
##           8          1     5      6
##           10         1     3      4
##           12         1     1      2
##           19         1     0      1
##           20         1     0      1
##           80         5     0      5
##           Total    579   398    977
## 
## Row percent 
##                Outbreak$case
## Outbreak$eclair   FALSE    TRUE  Total
##             0       279      15    294
##                  (94.9)   (5.1)  (100)
##             0.5       5      10     15
##                  (33.3)  (66.7)  (100)
##             1        49      41     90
##                  (54.4)  (45.6)  (100)
##             2       203     243    446
##                  (45.5)  (54.5)  (100)
##             3         9      22     31
##                    (29)    (71)  (100)
##             4        14      33     47
##                  (29.8)  (70.2)  (100)
##             5         2       5      7
##                  (28.6)  (71.4)  (100)
##             6         8      20     28
##                  (28.6)  (71.4)  (100)
##             8         1       5      6
##                  (16.7)  (83.3)  (100)
##             10        1       3      4
##                    (25)    (75)  (100)
##             12        1       1      2
##                    (50)    (50)  (100)
##             19        1       0      1
##                   (100)     (0)  (100)
##             20        1       0      1
##                   (100)     (0)  (100)
##             80        5       0      5
##                   (100)     (0)  (100)
## 
## Column percent 
##                Outbreak$case
## Outbreak$eclair  FALSE       %  TRUE       %
##           0        279  (48.2)    15   (3.8)
##           0.5        5   (0.9)    10   (2.5)
##           1         49   (8.5)    41  (10.3)
##           2        203  (35.1)   243  (61.1)
##           3          9   (1.6)    22   (5.5)
##           4         14   (2.4)    33   (8.3)
##           5          2   (0.3)     5   (1.3)
##           6          8   (1.4)    20   (5.0)
##           8          1   (0.2)     5   (1.3)
##           10         1   (0.2)     3   (0.8)
##           12         1   (0.2)     1   (0.3)
##           19         1   (0.2)     0   (0.0)
##           20         1   (0.2)     0   (0.0)
##           80         5   (0.9)     0   (0.0)
##           Total    579   (100)   398   (100)

The width of the columns of the mosaic graph denotes the relative frequency of that category. The highest frequency is 2 pieces followed by 0 and 1 piece. The other numbers have relatively low frequencies; particularly the 5 records where ‘eclair’ was coded as 80.

There is a tendency of increasing red area or attack rate from left to right indicating that the risk was increased when more pieces of eclair were consumed.

We will use the distribution of these proportions to guide our grouping of eclair consumption.

The first column of zero consumption has a very low attack rate, therefore it should be a separate category. Only a few took half a piece and this could be combined with those who took only one piece.

Persons consuming 2 pieces should be kept as one category as their frequency is very high. Others who ate more than two pieces should be grouped into another category. Finally, those coded as ‘80’ will be dropped due to the unknown amount of consumption as well as its low frequency.

Remind that collapsing categories with low numbers help in increasing the power of the statistical estimate of the regression coefficients.

Outbreak$eclairgr <- cut(Outbreak$eclair, breaks = c(0, 0.4, 1, 2, 79),
include.lowest = TRUE, labels=c("0","1","2",">2"))

The argument ‘include.lowest=TRUE’ indicates that 0 eclair must be included in the lowest category.

tabpct(Outbreak$eclairgr, Outbreak$case)

## 
## Original table 
##                  Outbreak$case
## Outbreak$eclairgr  FALSE  TRUE  Total
##             0        279    15    294
##             1         54    51    105
##             2        203   243    446
##             >2        38    89    127
##             Total    574   398    972
## 
## Row percent 
##                  Outbreak$case
## Outbreak$eclairgr   FALSE    TRUE  Total
##                0      279      15    294
##                    (94.9)   (5.1)  (100)
##                1       54      51    105
##                    (51.4)  (48.6)  (100)
##                2      203     243    446
##                    (45.5)  (54.5)  (100)
##                >2      38      89    127
##                    (29.9)  (70.1)  (100)
## 
## Column percent 
##                  Outbreak$case
## Outbreak$eclairgr  FALSE       %  TRUE       %
##             0        279  (48.6)    15   (3.8)
##             1         54   (9.4)    51  (12.8)
##             2        203  (35.4)   243  (61.1)
##             >2        38   (6.6)    89  (22.4)
##             Total    574   (100)   398   (100)

The attack rate or percentage of diseased in each category of exposure, as shown in the bracket of the column TRUE, increases from 5.1% among those who did not eat any eclairs to 70.1% among those heavy eaters of eclair.

The graph output is similar to the preceding one except that the groups now are more concise. We now have a continuous variable of ‘eclair’ and a categorical variable of ‘eclairgr’. The next step is to create a binary exposure for eclair.

Outbreak$eclair.eat <- Outbreak$eclair > 0

This binary exposure variable is now similar to the others, i.e. ‘beefcurry’, ‘saltegg’ and ‘water’.

We now model ‘case’ as the binary outcome variable and take ‘eclair.eat’ as the only explanatory variable:

glm0 <- glm(case ~ eclair.eat, family=binomial, data=Outbreak)
summary(glm0)

## 
## Call:
## glm(formula = case ~ eclair.eat, family = binomial, data = Outbreak)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2827  -1.2827  -0.3236   1.0756   2.4395  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.923      0.265  -11.03   <2e-16 ***
## eclair.eatTRUE    3.167      0.276   11.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1320.7  on 976  degrees of freedom
## Residual deviance: 1055.2  on 975  degrees of freedom
##   (117 osservazioni eliminate a causa di valori mancanti)
## AIC: 1059.2
## 
## Number of Fisher Scoring iterations: 5

logistic.display(glm0)

## 
## Logistic regression predicting case 
##  
##            OR(95%CI)            P(Wald's test) P(LR-test)
## eclair.eat 23.75 (13.82,40.79)  < 0.001        < 0.001   
##                                                          
## Log-likelihood = -527.6075
## No. of observations = 977
## AIC value = 1059.2149

The odds ratio from the logistic regression is derived from exponentiation of the estimate, i.e. 23.75 is obtained from:

exp(coef(summary(glm0))[2,1])

## [1] 23.746

The 95% confidence interval of the odds ratio is obtained from:

exp(coef(summary(glm0))[2,1] + c(-1,1) * 1.96 *coef(summary(glm0))[2,2])

## [1] 13.82402 40.78933

These values are close to simple calculation of the 2-by-2 table.

The output from logistic.display also contains the ‘LR-test’ (likelihood ratio test) result, which checks whether the likelihood of the given model, ‘glm0’, would be significantly different from the model without ‘eclair.eat’, which in this case would be the “null” model. For an independent variable with two levels, the LR-test does not add further important information because Wald’s test has already tested the hypothesis. When the independent variable has more than two levels, the LR-test is more important than Wald’s test as the following example demonstrates.

glm1 <- glm(case ~ eclairgr, family=binomial, data=Outbreak)
logistic.display(glm1)

## 
## Logistic regression predicting case 
##  
##                  OR(95%CI)            P(Wald's test) P(LR-test)
## eclairgr: ref.=0                                     < 0.001   
##    1             17.57 (9.21,33.49)   < 0.001                  
##    2             22.27 (12.82,38.66)  < 0.001                  
##    >2            43.56 (22.89,82.91)  < 0.001                  
##                                                                
## Log-likelihood = -516.8236
## No. of observations = 972
## AIC value = 1041.6471

Interpreting Wald’s test alone, one would conclude that all levels of eclair eaten would be significant. However, this depends on the reference level. By default, R assumes that the first level of an independent factor is the reference level. If we relevel the reference level to be 2 pieces of eclair, Wald’s test gives a different impression.

Outbreak$eclairgr <- relevel(Outbreak$eclairgr, ref="2")
glm2 <- glm(case ~ eclairgr, family=binomial, data=Outbreak)
logistic.display(glm2)

## 
## Logistic regression predicting case 
##  
##                  OR(95%CI)         P(Wald's test) P(LR-test)
## eclairgr: ref.=2                                  < 0.001   
##    0             0.04 (0.03,0.08)  < 0.001                  
##    1             0.79 (0.52,1.21)  0.275                    
##    >2            1.96 (1.28,2.99)  0.002                    
##                                                             
## Log-likelihood = -516.8236
## No. of observations = 972
## AIC value = 1041.6471

The results show that eating only one piece of eclair does not reduce the risk significantly compared to eating two pieces. While results from Wald’s test depend on the reference level of the explanatory variable, the LR-test is concerned only with the contribution of the variable as a whole and ignores the reference level.

Try to evaluate the relationship between variables (when discussion with an expert is not possible)

Imagine that we want to have an idea of the presence of possible confounders with respect to the effect of eclair and we have not the possibility to talk with an expert, so that we can investigate relationship observing the data. In this case a possibility is to examine univariable effects and then building multivariable models paying attention to the possible change in the effects estimation for the variable of interest when introducing the other. This procedure is not considered as valid as defining a DAG a priori with the help of an expert, from a causal inference point of view,but it give some indications at least.

Therefore, try ‘saltegg’ now as the only explanatory variable:

glm3 <- glm(case ~ saltegg, family = binomial, data=Outbreak)
logistic.display(glm3)

## 
## Logistic regression predicting case 
##  
##                 OR(95%CI)         P(Wald's test) P(LR-test)
## saltegg: 1 vs 0 2.54 (1.53,4.22)  < 0.001        < 0.001   
##                                                            
## Log-likelihood = -736.998
## No. of observations = 1089
## AIC value = 1477.996

The odds ratio for ‘saltegg’ is statistically significant. The number of valid records is also higher than the model containing ‘eclairgr’.

Note: One should always be very careful when analysing data that contain missing values. Advanced methods to handle missing values are beyond the scope of this course and for reasons of simplicity are ignored here. Data scientists should be advised to deal with missing values properly prior to conducting their analysis.

To check whether the odds ratio of ‘saltegg’ is confounded by ‘eclairgr’, the two explanatory variables are put together in the next model:

glm4 <- glm(case ~ eclairgr + saltegg, family=binomial, data=Outbreak)
logistic.display(glm4, crude.p.value=TRUE)

## 
## Logistic regression predicting case 
##  
##                  crude OR(95%CI)   crude P value adj. OR(95%CI)   
## eclairgr: ref.=2                                                  
##    0             0.04 (0.03,0.08)  < 0.001       0.04 (0.03,0.08) 
##    1             0.79 (0.52,1.21)  0.275         0.79 (0.51,1.21) 
##    >2            1.96 (1.28,2.99)  0.002         1.96 (1.28,2.99) 
##                                                                   
## saltegg: 1 vs 0  2.37 (1.41,3.99)  0.001         1.01 (0.53,1.93) 
##                                                                   
##                  P(Wald's test) P(LR-test)
## eclairgr: ref.=2                < 0.001   
##    0             < 0.001                  
##    1             0.279                    
##    >2            0.002                    
##                                           
## saltegg: 1 vs 0  0.975          0.975     
##                                           
## Log-likelihood = -516.8231
## No. of observations = 972
## AIC value = 1043.6461

The odds ratios of the explanatory variables in glm4 are adjusted for each other. The crude odds ratios are exactly the same as from the previous models with only single variable.

The adjusted odds ratios of ‘eclairgr’ do not change suggesting that it is not confounded by ‘saltegg’, whereas the odds ratio of ‘saltegg’ is clearly changed towards unity, and now has a very large P value.

The difference between the adjusted odds ratio and the crude odds ratio is an indication that ‘saltegg’ is confounded by ‘eclairgr’, which could be considered as an independent risk factor.

Now that we have a model containing two explanatory variables, we can compare models ‘glm4’ and ‘glm2’ using the lrtest command.

lrtest(glm4, glm2)

## Likelihood ratio test for MLE method 
## Chi-squared 1 d.f. =  0.0009809013 , P value =  0.9750149

The P value of 0.975 is the same as that from ‘P(LR-test)’ of ‘saltegg’ obtained from the preceding command. The test determines whether removal of ‘saltegg’ in a model would make a significant difference than if it were kept.

When there is more than one explanatory variable, ‘P(LR-test)’ from logistic.display is actually obtained from the lrtest command, which compares the current model against one in which the particular variable is removed, while keeping all remaining variables.

Let’s further add covariates in the model:

glm5 <- glm(case~eclairgr+saltegg+sex, family=binomial, data=Outbreak)
logistic.display(glm5)

## 
## Logistic regression predicting case 
##  
##                  crude OR(95%CI)   adj. OR(95%CI)    P(Wald's test) P(LR-test)
## eclairgr: ref.=2                                                    < 0.001   
##    0             0.04 (0.03,0.08)  0.04 (0.02,0.07)  < 0.001                  
##    1             0.79 (0.52,1.21)  0.75 (0.49,1.16)  0.2                      
##    >2            1.96 (1.28,2.99)  1.82 (1.19,2.8)   0.006                    
##                                                                               
## saltegg: 1 vs 0  2.37 (1.41,3.99)  0.92 (0.48,1.76)  0.807          0.808     
##                                                                               
## sex: 1 vs 0      1.58 (1.19,2.08)  1.85 (1.35,2.53)  < 0.001        < 0.001   
##                                                                               
## Log-likelihood = -509.5181
## No. of observations = 972
## AIC value = 1031.0361

The third explanatory variable ‘sex’ is another independent risk factor. Since females are the reference level, males have an increased risk compared to females. This variable is not considered a confounder to either of the preceding variables because it has not substantially changed the odds ratios of any of them (from ‘glm4’). The reason for not being able to confound is its lack of association with either of the preceding explanatory variables. In other words, males and females were not different in terms of eating eclairs and salted eggs.

Logistic Regression (I)

Introduction

Example: Tooth decay

Another example of logistic regression

Recoding missing values

Try to evaluate the relationship between variables (when discussion with an expert is not possible)