First of all, install and load the required libraries:

library(rms)
library(Hmisc)
library(epiDisplay)
library(tidyverse)
library(dplyr)

The relevance of the logistic regression model in health research

In epidemiological data, outcomes are often binary or dichotomous. For example, in the investigation of the cause of a disease, the outcome status is diseased vs non-diseased; in a mortality study, the outcome is usually died or survived.

For a continuous variable such as weight or height, the single representative number for the population or sample is the mean or median. For dichotomous data, the representative number is the proportion or percentage of one type of the outcome. For example, as you should remember from Block 1, prevalence is the proportion of the population with the disease of interest, and case-fatality is the proportion of deaths among the people with the disease. A related term is probability. Proportion is a simple, straightforward term; probability denotes likeliness and is more theoretical (it assumes a random mechanism that generates the observed data). In the case of a dichotomous variable, the proportion is used as the estimated probability, for example assuming a binomial random-variable behaviour. For computation, the outcome of interest is represented with 1 and the other outcome with 0. The prevalence is then the mean of these 0/1 values among the study sample.

Assuming the binomial mechanism, the standard regression model is logistic regression, which uses the log(odds) as the dependent variable in the model. If P is the probability of having the disease, 1-P is the probability of not having it, so the odds is P/(1-P). As you know, the relationship between probability and odds, in particular log(odds), can be plotted as follows:

p <- seq(from=0, to=1, by=.01)   # grid of probabilities
odds <- p/(1-p)                  # corresponding odds
plot(log(odds), p, type="l", col="blue", ylab="Probability",
     main="Relationship between log odds and probability", las=1)
abline(h=.5)   # probability 0.5 corresponds to log odds 0
abline(v=0)

Being on a linear and well-balanced scale, the logit is a more appropriate scale than the probability itself for a regression model with a binary outcome.

Similarly to the conversion between probabilities and odds, we can devise functions that convert both ways between probability and log-odds:

logit <- function(p) log(p/(1-p))   # probability -> log odds
tigol <- function(t) 1/(1+exp(-t))  # log odds -> probability

(tigol is just the letters of logit reversed).
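
A quick check, using the two functions just defined, confirms that they are inverses of each other:

logit(0.2)         # log(0.2/0.8) = -1.386
tigol(logit(0.2))  # returns 0.2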

Modelling logit(Y|X) ~ \(\beta\)X is the general form of logistic regression. It means that we are modelling the logit of Y given X (or conditioning on X), where X denotes one or more independent variables.

Suppose there are two independent or exposure variables, X1 and X2. Then \(\beta\)X would be \(\beta_{0}\) + \(\beta_{1}\)X1 + \(\beta_{2}\)X2, where \(\beta_{0}\) is the intercept.

The X can be age, sex, and other prognostic variables. In the explanatory/causal setting, among these X variables, one specific X is usually the focus of the study. Others are potential confounders and covariates.

Using logistic regression, it turns out that the probability of observing Y given X, Pr(Y|X), is equal to exp(\(\beta\)X)/(1 + exp(\(\beta\)X)). Hence, logistic regression is often used to compute the probability of an outcome under a given set of exposures, for example the probability of getting a disease for a given combination of age, sex, behaviour group, etc.
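
As a small sketch of this computation (the coefficient values below are invented purely for illustration, not taken from any model in this section), the probability under a given set of exposures is obtained by applying the inverse logit to the linear predictor:

b <- c(-3, 0.05, 0.7)  # hypothetical coefficients: intercept, age, sex (male=1)
x <- c(1, 40, 1)       # covariate values: intercept term, age 40 years, male
tigol(sum(b * x))      # Pr(Y=1|X) = exp(bX)/(1 + exp(bX)), about 0.43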

Example: Tooth decay

The dataset Decay is a simple dataset containing two variables: ‘decay’, which is binary, and ‘strep’, which is continuous.

data(Decay)
summary(Decay)
##      decay            strep       
##  Min.   :0.0000   Min.   :  0.50  
##  1st Qu.:0.0000   1st Qu.: 43.00  
##  Median :1.0000   Median :105.00  
##  Mean   :0.6307   Mean   : 95.25  
##  3rd Qu.:1.0000   3rd Qu.:150.00  
##  Max.   :1.0000   Max.   :152.50

The outcome variable is ‘decay’, which indicates whether a person has at least one decayed tooth (1) or not (0). The exposure variable is ‘strep’, the number of colony forming units (CFU) of streptococci, a group of bacteria suspected to cause tooth decay. The prevalence of having decayed teeth is equal to the mean of the ‘decay’ variable, i.e. 0.63. To look at the ‘strep’ variable type:

summary(Decay$strep)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.50   43.00  105.00   95.25  150.00  152.50
hist(Decay$strep)

The plot shows that the vast majority of values are at about 150. Since the natural distribution of bacteria is logarithmic, a log-transformed variable is created and used as the independent variable. Let’s now estimate a univariable logistic regression model:

Decay$log10.strep <- log10(Decay$strep)
glm0 <- glm(decay~log10.strep, family=binomial, data=Decay)

Ask for the summary of the estimated model:

summary(glm0)
## 
## Call:
## glm(formula = decay ~ log10.strep, family = binomial, data = Decay)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.5539     0.5184  -4.927 8.36e-07 ***
## log10.strep   1.6808     0.2764   6.082 1.19e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 574.27  on 435  degrees of freedom
## Residual deviance: 531.83  on 434  degrees of freedom
## AIC: 535.83
## 
## Number of Fisher Scoring iterations: 4

Both the intercept and the coefficient of ‘log10.strep’ are statistically significant. Pr(>|z|) for ‘log10.strep’ is the P value from Wald’s test, which tests whether the coefficient, 1.681, is significantly different from 0; in this case it is. The estimated intercept is -2.554. This means that when log10.strep is 0 (i.e. strep equals 1 CFU), the logit of having at least one decayed tooth is -2.55. We can then calculate the corresponding baseline odds and probability.

exp(-2.554) -> baseline.odds
baseline.odds
## [1] 0.07776996
baseline.odds/(1+baseline.odds) -> baseline.prob
baseline.prob
## [1] 0.07215822

There is an odds of 0.078, or a probability of 7.2%, of having at least one decayed tooth when the count of mutan streptococci is 1 CFU.

The coefficient of log10.strep is 1.681. For every unit increase in log10(strep), i.e. a tenfold increase in the CFU count, the logit increases by 1.681.
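
The same logic gives the logit change for any multiplicative change in CFU; for example, a doubling of the CFU count increases the logit by the coefficient multiplied by log10(2):

coef(glm0)[2] * log10(2)  # 1.681 * 0.301, about 0.51 per doubling of CFU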

This increment in the logit is constant, but the increment in probability is not, because probability is not on a linear scale.

The logit at each level of CFU is computed by plugging both coefficients from the model into the linear predictor. For example, at 100 CFU (i.e. log10(strep) = 2), the logit is:

logit.100 <- coef(glm0)[1] + log10(100)*coef(glm0)[2]
logit.100
## (Intercept) 
##   0.8078015
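
This value is still on the logit scale. Converting it with the tigol() function defined earlier gives the predicted probability of tooth decay at 100 CFU:

tigol(logit.100)  # 1/(1 + exp(-0.8078)), approximately 0.69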

To see the relationship for the whole dataset:

plot(Decay$log10.strep, fitted(glm0), ylim=c(0,0.80))

The logistic nature of the curve is partly apparent. To make it clearer, the ranges of the X and Y axes are both expanded to allow a more extensive curve fit.

plot(Decay$log10.strep, fitted(glm0), xlim = c(-2,4),
ylim=c(0,1))

A new data frame containing a variable with the same name, ‘log10.strep’, is created for plotting a fitted line on the same graph.

newdata        <- data.frame(log10.strep=seq(from=-2, to=4, by=.01))
predicted.line <- predict.glm(glm0,newdata,type="response")

The values for the predicted line in the above command must be on the same scale as the ‘response’ variable. Since the response is either 0 or 1, the predicted line lies in between, i.e. it is the predicted probability for each value of log10(strep).

plot(Decay$log10.strep, fitted(glm0), xlim = c(-2,4),
ylim=c(0,1), xlab=" ", ylab=" ", xaxt="n", las=1)
lines(newdata$log10.strep, predicted.line, col="blue")
axis(side=1, at=-2:4, labels=as.character(10^(-2:4)))
title(main="Relationship between mutan streptococci \n and probability of tooth decay", xlab="CFU",
ylab="Probability of having decayed teeth")

Another example of logistic regression

The above example of caries data has a continuous variable ‘log10.strep’ as the key independent variable.

In most epidemiological datasets, the independent variables are often categorical.

We will examine an example coming from an outbreak investigation.

On 25 August 1990, the local health officer in Supan Buri Province of Thailand reported the occurrence of an outbreak of acute gastrointestinal illness on a national handicapped sports day. Epidemiologists went to investigate. The dataset is called Outbreak. Most variable names are self explanatory. Variables are coded as 0 = no, 1 = yes and 9 = missing/unknown for three food items consumed by participants: ‘beefcurry’ (beef curry), ‘saltegg’ (salted eggs) and ‘water’.

Also on the menu were eclairs, a finger-shaped iced cake of choux pastry filled with cream. The variable ‘eclair’ records the number of pieces eaten by each participant. Missing values were coded as follows: 80 = “ate but do not remember how much”, while code 90 represents totally missing information.

Some participants experienced gastrointestinal symptoms: nausea, vomiting, abdominal pain and diarrhea. The age of each participant is recorded in years, with 99 representing a missing value.

The variables ‘exptime’ and ‘onset’ are the exposure and onset times, which are in character format.

Let’s look at the data.

data(Outbreak)
summary(Outbreak)
##        id              sex             age          exptime         
##  Min.   :   1.0   Min.   :0.000   Min.   : 1.00   Length:1094       
##  1st Qu.: 274.2   1st Qu.:0.000   1st Qu.:14.00   Class :AsIs       
##  Median : 547.5   Median :1.000   Median :18.00   Mode  :character  
##  Mean   : 547.5   Mean   :0.659   Mean   :23.69                     
##  3rd Qu.: 820.8   3rd Qu.:1.000   3rd Qu.:24.00                     
##  Max.   :1094.0   Max.   :1.000   Max.   :99.00                     
##    beefcurry         saltegg          eclair          water      
##  Min.   :0.0000   Min.   :0.000   Min.   : 0.00   Min.   :0.000  
##  1st Qu.:1.0000   1st Qu.:1.000   1st Qu.: 0.00   1st Qu.:1.000  
##  Median :1.0000   Median :1.000   Median : 2.00   Median :1.000  
##  Mean   :0.9534   Mean   :0.957   Mean   :11.48   Mean   :1.021  
##  3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.: 2.00   3rd Qu.:1.000  
##  Max.   :9.0000   Max.   :9.000   Max.   :90.00   Max.   :9.000  
##     onset               nausea          vomiting         abdpain      
##  Length:1094        Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  Class :AsIs        1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :0.0000   Median :0.0000  
##                     Mean   :0.3985   Mean   :0.3793   Mean   :0.3501  
##                     3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##     diarrhea     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2148  
##  3rd Qu.:0.0000  
##  Max.   :1.0000

We will first of all define the cases. It was agreed among the investigators that a case should be defined as a person who had any of the four symptoms: ‘nausea’, ‘vomiting’, ‘abdpain’ or ‘diarrhea’. A case can then be computed as follows:

Outbreak$case <- (Outbreak$nausea==1)|(Outbreak$vomiting==1)|(Outbreak$abdpain==1)|(Outbreak$diarrhea==1)
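
Equivalently, the same case definition can be written in tidyverse style (a stylistic alternative only, consistent with the mutate() calls used later in this section):

Outbreak <- Outbreak %>%
  mutate(case = nausea==1 | vomiting==1 | abdpain==1 | diarrhea==1)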

The variable ‘case’ is now incorporated into the data.

summary(Outbreak)
##        id              sex             age          exptime         
##  Min.   :   1.0   Min.   :0.000   Min.   : 1.00   Length:1094       
##  1st Qu.: 274.2   1st Qu.:0.000   1st Qu.:14.00   Class :AsIs       
##  Median : 547.5   Median :1.000   Median :18.00   Mode  :character  
##  Mean   : 547.5   Mean   :0.659   Mean   :23.69                     
##  3rd Qu.: 820.8   3rd Qu.:1.000   3rd Qu.:24.00                     
##  Max.   :1094.0   Max.   :1.000   Max.   :99.00                     
##    beefcurry         saltegg          eclair          water      
##  Min.   :0.0000   Min.   :0.000   Min.   : 0.00   Min.   :0.000  
##  1st Qu.:1.0000   1st Qu.:1.000   1st Qu.: 0.00   1st Qu.:1.000  
##  Median :1.0000   Median :1.000   Median : 2.00   Median :1.000  
##  Mean   :0.9534   Mean   :0.957   Mean   :11.48   Mean   :1.021  
##  3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.: 2.00   3rd Qu.:1.000  
##  Max.   :9.0000   Max.   :9.000   Max.   :90.00   Max.   :9.000  
##     onset               nausea          vomiting         abdpain      
##  Length:1094        Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  Class :AsIs        1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :0.0000   Median :0.0000  
##                     Mean   :0.3985   Mean   :0.3793   Mean   :0.3501  
##                     3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##     diarrhea         case        
##  Min.   :0.0000   Mode :logical  
##  1st Qu.:0.0000   FALSE:625      
##  Median :0.0000   TRUE :469      
##  Mean   :0.2148                  
##  3rd Qu.:0.0000                  
##  Max.   :1.0000

Recoding missing values

We now recode missing values. We do not impute them, so subjects with missing values will be removed from the analyses.

Outbreak <- Outbreak %>%
  mutate(
    age       = na_if(age, 99),
    beefcurry = na_if(beefcurry, 9),
    saltegg   = na_if(saltegg, 9),
    water     = na_if(water, 9),
    eclair    = na_if(eclair, 90))
summary(Outbreak)
##        id              sex             age          exptime         
##  Min.   :   1.0   Min.   :0.000   Min.   : 1.00   Length:1094       
##  1st Qu.: 274.2   1st Qu.:0.000   1st Qu.:14.00   Class :AsIs       
##  Median : 547.5   Median :1.000   Median :17.00   Mode  :character  
##  Mean   : 547.5   Mean   :0.659   Mean   :19.32                     
##  3rd Qu.: 820.8   3rd Qu.:1.000   3rd Qu.:22.00                     
##  Max.   :1094.0   Max.   :1.000   Max.   :58.00                     
##                                   NA's   :60                        
##    beefcurry         saltegg           eclair           water      
##  Min.   :0.0000   Min.   :0.0000   Min.   : 0.000   Min.   :0.000  
##  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.: 0.000   1st Qu.:1.000  
##  Median :1.0000   Median :1.0000   Median : 2.000   Median :1.000  
##  Mean   :0.9164   Mean   :0.9201   Mean   : 2.072   Mean   :0.977  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 2.000   3rd Qu.:1.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :80.000   Max.   :1.000  
##  NA's   :5        NA's   :5        NA's   :117      NA's   :6      
##     onset               nausea          vomiting         abdpain      
##  Length:1094        Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  Class :AsIs        1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :0.0000   Median :0.0000  
##                     Mean   :0.3985   Mean   :0.3793   Mean   :0.3501  
##                     3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                       
##     diarrhea         case        
##  Min.   :0.0000   Mode :logical  
##  1st Qu.:0.0000   FALSE:625      
##  Median :0.0000   TRUE :469      
##  Mean   :0.2148                  
##  3rd Qu.:0.0000                  
##  Max.   :1.0000                  
## 

The three variables can also be changed to factors:

Outbreak <- Outbreak %>%
  mutate(
    beefcurry = as.factor(beefcurry),
    saltegg   = as.factor(saltegg),
    water     = as.factor(water))
summary(Outbreak)
##        id              sex             age          exptime          beefcurry 
##  Min.   :   1.0   Min.   :0.000   Min.   : 1.00   Length:1094        0   : 91  
##  1st Qu.: 274.2   1st Qu.:0.000   1st Qu.:14.00   Class :AsIs        1   :998  
##  Median : 547.5   Median :1.000   Median :17.00   Mode  :character   NA's:  5  
##  Mean   : 547.5   Mean   :0.659   Mean   :19.32                                
##  3rd Qu.: 820.8   3rd Qu.:1.000   3rd Qu.:22.00                                
##  Max.   :1094.0   Max.   :1.000   Max.   :58.00                                
##                                   NA's   :60                                   
##  saltegg         eclair        water         onset               nausea      
##  0   :  87   Min.   : 0.000   0   :  25   Length:1094        Min.   :0.0000  
##  1   :1002   1st Qu.: 0.000   1   :1063   Class :AsIs        1st Qu.:0.0000  
##  NA's:   5   Median : 2.000   NA's:   6   Mode  :character   Median :0.0000  
##              Mean   : 2.072                                  Mean   :0.3985  
##              3rd Qu.: 2.000                                  3rd Qu.:1.0000  
##              Max.   :80.000                                  Max.   :1.0000  
##              NA's   :117                                                     
##     vomiting         abdpain          diarrhea         case        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Mode :logical  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   FALSE:625      
##  Median :0.0000   Median :0.0000   Median :0.0000   TRUE :469      
##  Mean   :0.3793   Mean   :0.3501   Mean   :0.2148                  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000                  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000                  
## 

All variables now look fine except ‘eclair’, which still contains the value 80 representing “ate but do not remember how much”. We will analyse its relationship with ‘case’ by considering ‘eclair’ as an ordered categorical variable.

tabpct(Outbreak$eclair, Outbreak$case)
## 
## Original table 
##                Outbreak$case
## Outbreak$eclair  FALSE  TRUE  Total
##           0        279    15    294
##           0.5        5    10     15
##           1         49    41     90
##           2        203   243    446
##           3          9    22     31
##           4         14    33     47
##           5          2     5      7
##           6          8    20     28
##           8          1     5      6
##           10         1     3      4
##           12         1     1      2
##           19         1     0      1
##           20         1     0      1
##           80         5     0      5
##           Total    579   398    977
## 
## Row percent 
##                Outbreak$case
## Outbreak$eclair   FALSE    TRUE  Total
##             0       279      15    294
##                  (94.9)   (5.1)  (100)
##             0.5       5      10     15
##                  (33.3)  (66.7)  (100)
##             1        49      41     90
##                  (54.4)  (45.6)  (100)
##             2       203     243    446
##                  (45.5)  (54.5)  (100)
##             3         9      22     31
##                    (29)    (71)  (100)
##             4        14      33     47
##                  (29.8)  (70.2)  (100)
##             5         2       5      7
##                  (28.6)  (71.4)  (100)
##             6         8      20     28
##                  (28.6)  (71.4)  (100)
##             8         1       5      6
##                  (16.7)  (83.3)  (100)
##             10        1       3      4
##                    (25)    (75)  (100)
##             12        1       1      2
##                    (50)    (50)  (100)
##             19        1       0      1
##                   (100)     (0)  (100)
##             20        1       0      1
##                   (100)     (0)  (100)
##             80        5       0      5
##                   (100)     (0)  (100)
## 
## Column percent 
##                Outbreak$case
## Outbreak$eclair  FALSE       %  TRUE       %
##           0        279  (48.2)    15   (3.8)
##           0.5        5   (0.9)    10   (2.5)
##           1         49   (8.5)    41  (10.3)
##           2        203  (35.1)   243  (61.1)
##           3          9   (1.6)    22   (5.5)
##           4         14   (2.4)    33   (8.3)
##           5          2   (0.3)     5   (1.3)
##           6          8   (1.4)    20   (5.0)
##           8          1   (0.2)     5   (1.3)
##           10         1   (0.2)     3   (0.8)
##           12         1   (0.2)     1   (0.3)
##           19         1   (0.2)     0   (0.0)
##           20         1   (0.2)     0   (0.0)
##           80         5   (0.9)     0   (0.0)
##           Total    579   (100)   398   (100)

The width of the columns of the mosaic graph denotes the relative frequency of that category. The highest frequency is 2 pieces followed by 0 and 1 piece. The other numbers have relatively low frequencies; particularly the 5 records where ‘eclair’ was coded as 80.

There is a tendency of increasing red area, i.e. attack rate, from left to right, indicating that the risk increased with the number of pieces of eclair consumed.

We will use the distribution of these proportions to guide our grouping of eclair consumption.

The first column, zero consumption, has a very low attack rate and should therefore be a separate category. Only a few took half a piece, and these can be combined with those who took only one piece.

Persons consuming 2 pieces should be kept as one category as their frequency is very high. Others who ate more than two pieces should be grouped into another category. Finally, those coded as ‘80’ will be dropped due to the unknown amount of consumption as well as its low frequency.

Recall that collapsing categories with low counts helps increase the power of the statistical estimation of the regression coefficients.

Outbreak$eclairgr <- cut(Outbreak$eclair, breaks = c(0, 0.4, 1, 2, 79),
include.lowest = TRUE, labels=c("0","1","2",">2"))

The argument ‘include.lowest=TRUE’ indicates that 0 eclair must be included in the lowest category.

tabpct(Outbreak$eclairgr, Outbreak$case)
## 
## Original table 
##                  Outbreak$case
## Outbreak$eclairgr  FALSE  TRUE  Total
##             0        279    15    294
##             1         54    51    105
##             2        203   243    446
##             >2        38    89    127
##             Total    574   398    972
## 
## Row percent 
##                  Outbreak$case
## Outbreak$eclairgr   FALSE    TRUE  Total
##                0      279      15    294
##                    (94.9)   (5.1)  (100)
##                1       54      51    105
##                    (51.4)  (48.6)  (100)
##                2      203     243    446
##                    (45.5)  (54.5)  (100)
##                >2      38      89    127
##                    (29.9)  (70.1)  (100)
## 
## Column percent 
##                  Outbreak$case
## Outbreak$eclairgr  FALSE       %  TRUE       %
##             0        279  (48.6)    15   (3.8)
##             1         54   (9.4)    51  (12.8)
##             2        203  (35.4)   243  (61.1)
##             >2        38   (6.6)    89  (22.4)
##             Total    574   (100)   398   (100)

The attack rate, or percentage of diseased in each category of exposure (shown in brackets in the TRUE column), increases from 5.1% among those who did not eat any eclairs to 70.1% among heavy eaters.

The graph output is similar to the preceding one, except that the groups are now more concise. We now have a continuous variable, ‘eclair’, and a categorical variable, ‘eclairgr’. The next step is to create a binary exposure for eclair.

Outbreak$eclair.eat <- Outbreak$eclair > 0

This binary exposure variable is now similar to the others, i.e. ‘beefcurry’, ‘saltegg’ and ‘water’.

We now model ‘case’ as the binary outcome variable and take ‘eclair.eat’ as the only explanatory variable:

glm0 <- glm(case ~ eclair.eat, family=binomial, data=Outbreak)
summary(glm0)
## 
## Call:
## glm(formula = case ~ eclair.eat, family = binomial, data = Outbreak)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.923      0.265  -11.03   <2e-16 ***
## eclair.eatTRUE    3.167      0.276   11.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1320.7  on 976  degrees of freedom
## Residual deviance: 1055.2  on 975  degrees of freedom
##   (117 observations deleted due to missingness)
## AIC: 1059.2
## 
## Number of Fisher Scoring iterations: 5
logistic.display(glm0)
## 
## Logistic regression predicting case 
##  
##            OR(95%CI)            P(Wald's test) P(LR-test)
## eclair.eat 23.75 (13.82,40.79)  < 0.001        < 0.001   
##                                                          
## Log-likelihood = -527.6075
## No. of observations = 977
## AIC value = 1059.2149

The odds ratio from the logistic regression is derived from exponentiation of the estimate, i.e. 23.75 is obtained from:

exp(coef(summary(glm0))[2,1])
## [1] 23.746

The 95% confidence interval of the odds ratio is obtained from:

exp(coef(summary(glm0))[2,1] + c(-1,1) * 1.96 *coef(summary(glm0))[2,2])
## [1] 13.82402 40.78933
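
The same limits can be obtained more compactly with base R’s confint.default(), which computes Wald-type intervals (estimate ± 1.96 standard errors) for all coefficients (a quick cross-check):

exp(confint.default(glm0))  # Wald 95% CIs for both coefficients, on the OR scale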

These values are close to those from a simple calculation based on the 2-by-2 table.
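
As a sketch of that simple calculation, the crude odds ratio is the cross-product ratio of the 2-by-2 table; the cell counts below are derived from the tabpct output above (among those who ate no eclair, 279 non-cases and 15 cases; among those who ate any, 300 non-cases and 383 cases):

(383/300) / (15/279)  # = 23.75, essentially the odds ratio from the model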

The output from logistic.display also contains the ‘LR-test’ (likelihood ratio test) result, which checks whether the likelihood of the given model, ‘glm0’, is significantly different from that of the model without ‘eclair.eat’, which in this case is the “null” model. For an independent variable with two levels, the LR-test does not add further important information because Wald’s test has already tested the hypothesis. When the independent variable has more than two levels, the LR-test is more informative than Wald’s test, as the following example demonstrates.

glm1 <- glm(case ~ eclairgr, family=binomial, data=Outbreak)
logistic.display(glm1)
## 
## Logistic regression predicting case 
##  
##                  OR(95%CI)            P(Wald's test) P(LR-test)
## eclairgr: ref.=0                                     < 0.001   
##    1             17.57 (9.21,33.49)   < 0.001                  
##    2             22.27 (12.82,38.66)  < 0.001                  
##    >2            43.56 (22.89,82.91)  < 0.001                  
##                                                                
## Log-likelihood = -516.8236
## No. of observations = 972
## AIC value = 1041.6471

Interpreting Wald’s test alone, one would conclude that all levels of eclair eaten are significant. However, this depends on the reference level. By default, R takes the first level of a factor as the reference level. If we relevel so that 2 pieces of eclair is the reference, Wald’s test gives a different impression.

Outbreak$eclairgr <- relevel(Outbreak$eclairgr, ref="2")
glm2 <- glm(case ~ eclairgr, family=binomial, data=Outbreak)
logistic.display(glm2)
## 
## Logistic regression predicting case 
##  
##                  OR(95%CI)         P(Wald's test) P(LR-test)
## eclairgr: ref.=2                                  < 0.001   
##    0             0.04 (0.03,0.08)  < 0.001                  
##    1             0.79 (0.52,1.21)  0.275                    
##    >2            1.96 (1.28,2.99)  0.002                    
##                                                             
## Log-likelihood = -516.8236
## No. of observations = 972
## AIC value = 1041.6471

The results show that eating only one piece of eclair does not reduce the risk significantly compared to eating two pieces. While results from Wald’s test depend on the reference level of the explanatory variable, the LR-test is concerned only with the contribution of the variable as a whole and ignores the reference level.

From univariable to multivariable logistic model

If we now use ‘saltegg’ as the only explanatory variable:

glm3 <- glm(case ~ saltegg, family = binomial, data=Outbreak)
logistic.display(glm3)
## 
## Logistic regression predicting case 
##  
##                 OR(95%CI)         P(Wald's test) P(LR-test)
## saltegg: 1 vs 0 2.54 (1.53,4.22)  < 0.001        < 0.001   
##                                                            
## Log-likelihood = -736.998
## No. of observations = 1089
## AIC value = 1477.996

The odds ratio for ‘saltegg’ is statistically significant. The number of valid records is also higher than in the model containing ‘eclairgr’ (1089 versus 972).

Note: One should always be very careful when analysing data that contain missing values! Advanced methods for handling missing values are beyond the scope of this course and, for simplicity, are ignored here. Data scientists are advised to deal with missing values properly before conducting their analysis.

To check whether the odds ratio of ‘saltegg’ changes when ‘eclairgr’ is also included, the two explanatory variables are put together in the next multivariable model:

glm4 <- glm(case ~ eclairgr + saltegg, family=binomial, data=Outbreak)
logistic.display(glm4, crude.p.value=TRUE)
## 
## Logistic regression predicting case 
##  
##                  crude OR(95%CI)   crude P value adj. OR(95%CI)   
## eclairgr: ref.=2                                                  
##    0             0.04 (0.03,0.08)  < 0.001       0.04 (0.03,0.08) 
##    1             0.79 (0.52,1.21)  0.275         0.79 (0.51,1.21) 
##    >2            1.96 (1.28,2.99)  0.002         1.96 (1.28,2.99) 
##                                                                   
## saltegg: 1 vs 0  2.37 (1.41,3.99)  0.001         1.01 (0.53,1.93) 
##                                                                   
##                  P(Wald's test) P(LR-test)
## eclairgr: ref.=2                < 0.001   
##    0             < 0.001                  
##    1             0.279                    
##    >2            0.002                    
##                                           
## saltegg: 1 vs 0  0.975          0.975     
##                                           
## Log-likelihood = -516.8231
## No. of observations = 972
## AIC value = 1043.6461

The odds ratios of the explanatory variables in glm4 are now adjusted for each other. The adjusted odds ratios of ‘eclairgr’ do not change, suggesting that it is not confounded by ‘saltegg’, whereas the odds ratio of ‘saltegg’ clearly shifts towards unity and now has a very large P value.

The difference between the adjusted odds ratio and the crude odds ratio indicates that ‘saltegg’ is confounded by ‘eclairgr’, which can be considered the true independent risk factor.

Now that we have a model containing two explanatory variables, we can compare models ‘glm4’ and ‘glm2’ using the lrtest (likelihood ratio) command.

lrtest(glm4, glm2)
## Likelihood ratio test for MLE method 
## Chi-squared 1 d.f. =  0.0009809013 , P value =  0.9750149

The P value of 0.975 is the same as that from ‘P(LR-test)’ of ‘saltegg’ obtained from the preceding command. The test determines whether removing ‘saltegg’ from the model makes a significant difference compared with keeping it.

When there is more than one explanatory variable, ‘P(LR-test)’ from logistic.display is actually obtained from the lrtest command, which compares the current model against one in which the particular variable is removed, while keeping all remaining variables.

Let’s now further add covariates in the model:

glm5 <- glm(case~eclairgr+saltegg+sex, family=binomial, data=Outbreak)
logistic.display(glm5)
## 
## Logistic regression predicting case 
##  
##                  crude OR(95%CI)   adj. OR(95%CI)    P(Wald's test) P(LR-test)
## eclairgr: ref.=2                                                    < 0.001   
##    0             0.04 (0.03,0.08)  0.04 (0.02,0.07)  < 0.001                  
##    1             0.79 (0.52,1.21)  0.75 (0.49,1.16)  0.2                      
##    >2            1.96 (1.28,2.99)  1.82 (1.19,2.8)   0.006                    
##                                                                               
## saltegg: 1 vs 0  2.37 (1.41,3.99)  0.92 (0.48,1.76)  0.807          0.808     
##                                                                               
## sex: 1 vs 0      1.58 (1.19,2.08)  1.85 (1.35,2.53)  < 0.001        < 0.001   
##                                                                               
## Log-likelihood = -509.5181
## No. of observations = 972
## AIC value = 1031.0361

The third explanatory variable, ‘sex’, is another significant independent risk factor. Since females are the reference level, males have an increased risk compared with females. This variable is not considered a confounder for either of the preceding variables because it has not substantially changed their odds ratios (compared with ‘glm4’). The reason it cannot confound is its lack of association with either of the preceding explanatory variables: in other words, males and females did not differ in their consumption of eclairs and salted eggs.

Poisson regression in healthcare research

Load the required libraries:

library(epiDisplay)
library(psych)

In nature, an event usually takes place in a very small amount of time, and at any given point in time the probability of encountering such an event is small. Instead of the probability of a single event, we now focus on the frequency of events as a density: the incidence or ‘count’ of events over a period of time. (While time is one dimension, the same concept applies to the density of counts of small objects in a two-dimensional area or three-dimensional space.) Moreover, we can assume that one event is independent of another and that the densities in different units of time vary with a variance equal to the average density. We can approximate this kind of random process with the Poisson random variable.

When the probability of having an event is affected by some factors, a model is needed to explain and predict the density. Variation among different strata of a population can be explained by various combinations of factors; within each stratum (defined by a covariate combination), the distribution of the events is assumed random. Poisson regression deals with outcome variables that are counts in nature (whole numbers or integers). Independent covariates have the same role as those encountered in linear and logistic regression. In epidemiology, Poisson regression is very often used for analysing grouped population-based or cohort data, looking at incidence density among the person-time contributed by subjects sharing similar characteristics of interest.

Poisson regression is one of three common regression families used in epidemiological studies. The other two, linear regression and logistic regression, have already been covered. The last family comprises survival methods, which we will explain in Block 4 of the course.

There are two main assumptions for Poisson regression:

  1. risk is homogeneous among person-times contributed by different subjects who share the same characteristics of interest (e.g. sex, age group, i.e. conditional on the covariates) and the same period.

  2. asymptotically, i.e. as the sample size becomes larger and larger, the mean of the counts is equal to the variance (see the short simulation after this list).
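
A minimal simulation (unrelated to the study data, purely illustrating assumption 2) shows that Poisson counts have a variance close to their mean:

set.seed(1)
y <- rpois(10000, lambda = 4)  # 10,000 simulated Poisson counts with mean 4
mean(y)  # close to 4
var(y)   # also close to 4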

Straightforward linear regression methods (assuming constant variance and normal errors) are not appropriate for count data for three main reasons:

  1. the linear model might lead to the prediction of negative counts

  2. the variance of the response may increase with the mean

  3. the errors will not be normally distributed

Moreover, in studies that involve the time dimension, different subjects may have different person-times of exposure. Analysing risk factors while ignoring differences in person-time is wrong.

Poisson regression overcomes these limitations.

Note that in survival analysis, for example with Cox regression (see Block 4), a hazard ratio is estimated for each covariate in the model, not the incidence density in each subgroup. In the Cox model the interest is in “how long until an event occurs” (time to event), whereas in the Poisson regression model the focus is on “how many events occur in a given interval”.

Example of Poisson model: the Montana smelter study

The dataset Montana was extracted from an occupational cohort study conducted to test the association between respiratory deaths (the outcome) and exposure to arsenic in industry, adjusting for various other variables. The main outcome variable is respdeath: the count of deaths among the personyrs (person-years) contributed by subjects in each category. The other variables are independent covariates: age group (agegr), period of employment (period), starting time of employment (starting) and the level of exposure to arsenic during the study period (arsenic), the exposure of interest. Read in the data first and examine the variables.

data(Montana)
summary(Montana)
##    respdeath        personyrs           agegr           period     
##  Min.   : 0.000   Min.   :    4.2   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 0.000   1st Qu.:  127.5   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 1.000   Median :  335.1   Median :3.000   Median :2.000  
##  Mean   : 2.421   Mean   : 1096.4   Mean   :2.605   Mean   :2.404  
##  3rd Qu.: 3.000   3rd Qu.:  925.7   3rd Qu.:4.000   3rd Qu.:3.000  
##  Max.   :19.000   Max.   :12451.3   Max.   :4.000   Max.   :4.000  
##     starting        arsenic     
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.250  
##  Median :1.000   Median :2.000  
##  Mean   :1.456   Mean   :2.474  
##  3rd Qu.:2.000   3rd Qu.:3.000  
##  Max.   :2.000   Max.   :4.000

The last four variables are classed as integers. We need to tell R to interpret them as categorical variables, or factors, and attach labels to each of the levels. This can be done using the factor command with a ‘labels’ argument included.

Montana$agegr    <- factor(Montana$agegr, labels=c("40-49","50-59","60-69","70-79"))
Montana$period   <- factor(Montana$period, labels=c("1938-1949", "1950-1959","1960-1969", "1970-1977"))
Montana$start    <- factor(Montana$starting, labels=c("pre-1925", "1925 & after"))
Montana$arsenic1 <- factor(Montana$arsenic, labels=c("<1 year", "1-4years","5-14 years", "15+ years"))
summary(Montana)
##    respdeath        personyrs         agegr          period      starting    
##  Min.   : 0.000   Min.   :    4.2   40-49:24   1938-1949:30   Min.   :1.000  
##  1st Qu.: 0.000   1st Qu.:  127.5   50-59:28   1950-1959:32   1st Qu.:1.000  
##  Median : 1.000   Median :  335.1   60-69:31   1960-1969:28   Median :1.000  
##  Mean   : 2.421   Mean   : 1096.4   70-79:31   1970-1977:24   Mean   :1.456  
##  3rd Qu.: 3.000   3rd Qu.:  925.7                             3rd Qu.:2.000  
##  Max.   :19.000   Max.   :12451.3                             Max.   :2.000  
##     arsenic               start          arsenic1 
##  Min.   :1.000   pre-1925    :62   <1 year   :29  
##  1st Qu.:1.250   1925 & after:52   1-4years  :29  
##  Median :2.000                     5-14 years:29  
##  Mean   :2.474                     15+ years :27  
##  3rd Qu.:3.000                                    
##  Max.   :4.000

We keep the original arsenic variable unchanged for use later on.

Descriptive analyses: breakdown of incidence by age and period

Let us explore the person-years breakdown by age and period. Firstly, create a table for total person-years:

tapply(Montana$personyrs, list(Montana$period, Montana$agegr), sum) -> table.pyears

Carry out the same procedure for number of deaths, and compute the table of incidence per 10,000 person years for each cell.

tapply(Montana$respdeath, list(Montana$period, Montana$agegr), sum) -> table.deaths
table.inc10000 <- table.deaths/table.pyears*10000
table.inc10000
##              40-49    50-59    60-69    70-79
## 1938-1949 5.424700 17.13102 34.95107 26.53928
## 1950-1959 3.344638 23.47556 49.01961 64.82632
## 1960-1969 4.341516 20.49375 58.23803 55.06608
## 1970-1977 4.408685 14.77747 44.09949 80.81413

Now, create a time-series plot of the incidence:

plot.ts(table.inc10000, plot.type="single", xlab=" ",ylab="#/10,000 person-years", xaxt="n", col=c("black",
"blue","red","green"), lty=c(2,1,1,2), las=1)
points(rep(1:4,4), table.inc10000, pch=22, cex=table.pyears/sum(table.pyears) * 20)
title(main = "Incidence by age and period")
axis(side = 1, at = 1:4, labels = levels(Montana$period))
legend(3.2,40, legend=levels(Montana$agegr)[4:1], col=c("green","red", "blue", "black"), bg = "white", lty=c(2,1,1,2))

The above graph shows that the older age groups are generally associated with a higher risk. On the other hand, the sample size (reflected by the size of the squares at each point) decreases with age. The possible confounding effect of age on the exposure of interest can be better examined using Poisson regression.

Modelling with Poisson regression

Let’s estimate a univariable Poisson regression model taking into account only period as a covariate:

model1 <- glm(respdeath ~ period, offset = log(personyrs), family = poisson, data=Montana)
summary(model1)
## 
## Call:
## glm(formula = respdeath ~ period, family = poisson, data = Montana, 
##     offset = log(personyrs))
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -6.4331     0.1715 -37.511   <2e-16 ***
## period1950-1959   0.2365     0.2117   1.117   0.2638    
## period1960-1969   0.3781     0.2001   1.889   0.0588 .  
## period1970-1977   0.4830     0.2036   2.372   0.0177 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 376.02  on 113  degrees of freedom
## Residual deviance: 369.27  on 110  degrees of freedom
## AIC: 596.05
## 
## Number of Fisher Scoring iterations: 6

The option offset = log(personyrs) makes the variable personyrs the denominator for the counts of respdeath. A logarithmic transformation is needed since the default link function for the Poisson family is the natural log.
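
To make the offset concrete (a small sketch based on the model just fitted), exponentiating the intercept gives the baseline incidence rate per person-year in the reference period, 1938-1949, which can be rescaled to a rate per 10,000 person-years:

exp(coef(model1)[1])          # baseline deaths per person-year
exp(coef(model1)[1]) * 10000  # roughly 16 deaths per 10,000 person-years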

Recall: an important criterion in the choice of a link function for the various families of distributions is to ensure that the fitted values from the modelling stay within reasonable bounds. Specifying a log link (the default for Poisson) ensures that the fitted counts are all greater than or equal to zero. For more details on the default links for the various distribution families used in generalized linear modelling, see help(family) in R.

The first model above, with period as the only independent variable, suggests that the death rate increased with time. The model can be tested for goodness of fit and checked for violations of the Poisson assumptions mentioned earlier.

Goodness of fit test

To test the goodness of fit of the Poisson model, type:

poisgof(model1)
## $results
## [1] "Goodness-of-fit test for Poisson assumption"
## 
## $chisq
## [1] 369.2654
## 
## $df
## [1] 110
## 
## $p.value
## [1] 9.578443e-30

The component ‘$chisq’ is actually computed from the model deviance, a quantity reflecting the level of error. A chi-squared value that is large relative to its degrees of freedom indicates a significant violation of the Poisson assumption (p < 0.05). If only the P value is wanted, the command can be shortened.

poisgof(model1)$p.value
## [1] 9.578443e-30
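
The same P value can be reproduced directly from the residual deviance and its degrees of freedom with base R (a quick cross-check of what poisgof computes):

pchisq(deviance(model1), df.residual(model1), lower.tail=FALSE)  # same P value as above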

The P value is very small indicating a poor fit.

Note: this method relies on the assumption of a large sample size. An alternative method is to fit a negative binomial regression model (this topic has not been covered in the slides!).

We now add a second variable ‘agegr’ to the model:

model2 <- glm(respdeath ~ agegr + period, offset=log(personyrs), family = poisson, data=Montana)
AIC(model2)
## [1] 396.6407

The AIC (Akaike Information Criterion) has decreased remarkably from ‘model1’ to ‘model2’, indicating a poorer fit of the first model. (AIC is a criterion often used to select the most parsimonious multivariable model.)

poisgof(model2)$p.value
## [1] 0.0003381328

But ‘model2’ still violates the Poisson assumption.

model3 <- glm(respdeath ~ agegr, offset = log(personyrs), family = poisson, data=Montana)
AIC(model3)
## [1] 394.4699
poisgof(model3)$p.value
## [1] 0.0003295142

Removal of ‘period’ further reduces the AIC but still violates the Poisson assumption to the same extent as the previous model. The next step is to add the variable: ‘arsenic1’.

model4 <- glm(respdeath ~ agegr + arsenic1, offset=log(personyrs), family = poisson, data=Montana)
summary(model4)
## 
## Call:
## glm(formula = respdeath ~ agegr + arsenic1, family = poisson, 
##     data = Montana, offset = log(personyrs))
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -7.9947     0.2237 -35.739  < 2e-16 ***
## agegr50-59           1.4624     0.2454   5.960 2.53e-09 ***
## agegr60-69           2.3502     0.2380   9.873  < 2e-16 ***
## agegr70-79           2.5987     0.2564  10.136  < 2e-16 ***
## arsenic11-4years     0.8041     0.1577   5.099 3.42e-07 ***
## arsenic15-14 years   0.5957     0.2059   2.893  0.00381 ** 
## arsenic115+ years    0.9976     0.1758   5.673 1.40e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 376.02  on 113  degrees of freedom
## Residual deviance: 122.25  on 107  degrees of freedom
## AIC: 355.04
## 
## Number of Fisher Scoring iterations: 5
poisgof(model4)$p.value
## [1] 0.1486902

Fortunately, ‘model4’ has a much lower AIC than ‘model3’, and it no longer violates the Poisson assumption.

If we change the reference level for arsenic to ‘1-4years’ and compare the other exposure levels against it:

Montana$arsenic.b <- relevel(Montana$arsenic1, ref="1-4years")
model5 <- glm(respdeath ~ agegr + arsenic.b, offset=log(personyrs), family = poisson, data=Montana)
summary(model5)
## 
## Call:
## glm(formula = respdeath ~ agegr + arsenic.b, family = poisson, 
##     data = Montana, offset = log(personyrs))
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -7.1905     0.2466 -29.156  < 2e-16 ***
## agegr50-59            1.4624     0.2454   5.960 2.53e-09 ***
## agegr60-69            2.3502     0.2380   9.873  < 2e-16 ***
## agegr70-79            2.5987     0.2564  10.136  < 2e-16 ***
## arsenic.b<1 year     -0.8041     0.1577  -5.099 3.42e-07 ***
## arsenic.b5-14 years  -0.2085     0.2326  -0.896     0.37    
## arsenic.b15+ years    0.1935     0.2071   0.935     0.35    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 376.02  on 113  degrees of freedom
## Residual deviance: 122.25  on 107  degrees of freedom
## AIC: 355.04
## 
## Number of Fisher Scoring iterations: 5

There does not appear to be any increase in the risk of death beyond 4 years of exposure to arsenic, so it may be worth combining the exposure into just two levels:

Montana$arsenic2 <- Montana$arsenic1
levels(Montana$arsenic2) <- c("<1 year", rep("1+ years", 3))

model6 <- glm(respdeath ~ agegr + arsenic2,offset=log(personyrs), family=poisson, data=Montana)
summary(model6)
## 
## Call:
## glm(formula = respdeath ~ agegr + arsenic2, family = poisson, 
##     data = Montana, offset = log(personyrs))
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -8.0086     0.2233 -35.859  < 2e-16 ***
## agegr50-59         1.4702     0.2453   5.994 2.04e-09 ***
## agegr60-69         2.3661     0.2372   9.976  < 2e-16 ***
## agegr70-79         2.6238     0.2548  10.297  < 2e-16 ***
## arsenic21+ years   0.8109     0.1210   6.699 2.09e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 376.02  on 113  degrees of freedom
## Residual deviance: 125.02  on 109  degrees of freedom
## AIC: 353.8
## 
## Number of Fisher Scoring iterations: 5

At this stage, we would accept ‘model6’ as the final model, since it has the smallest AIC among all the models that we have tried. We conclude that exposure to arsenic for at least one year is associated with an increased risk of the disease by exp(0.8109), or 2.25 times, with statistical significance, independently of age.

Interpretation of the regression coefficients in the Poisson model: the incidence rate ratio (IRR)

In this example, all subjects pool their follow-up times, and this total is called ‘person-time’, which is then used as the denominator for the events, giving an ‘incidence rate’. Comparing the incidence rates of two groups defined by exposure status is fairer than comparing crude risks (based on the population at baseline). The ratio between the incidence rates of two groups is called the incidence rate ratio (IRR), an improved form of the relative risk.
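
Before turning to the model, a crude (age-unadjusted) rate ratio can be sketched directly from the data by summing deaths and person-years within each exposure group:

rate <- tapply(Montana$respdeath, Montana$arsenic2, sum) /
  tapply(Montana$personyrs, Montana$arsenic2, sum)
rate["1+ years"] / rate["<1 year"]  # crude IRR, not adjusted for age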

In ‘model6’, for example, we want to compute the incidence rate ratio between the subjects exposed to arsenic for one or more years against those exposed for less than one year.

The shorter way to obtain this IRR is to exponentiate the coefficient of the specific variable ‘arsenic2’, which is the fifth coefficient in the model.

coef(model6)
##      (Intercept)       agegr50-59       agegr60-69       agegr70-79 
##       -8.0086495        1.4701505        2.3661111        2.6237532 
## arsenic21+ years 
##        0.8108698
exp(coef(model6)[5])
## arsenic21+ years 
##         2.249864

The following steps explain how the 95% confidence interval for all variables can be obtained.

coeff <- coef(model6)
coeff.95ci <- cbind(coeff, confint(model6))
coeff.95ci
##                       coeff      2.5 %    97.5 %
## (Intercept)      -8.0086495 -8.4780952 -7.598323
## agegr50-59        1.4701505  1.0094838  1.976026
## agegr60-69        2.3661111  1.9239329  2.858522
## agegr70-79        2.6237532  2.1411250  3.145495
## arsenic21+ years  0.8108698  0.5724818  1.047494

Note that confint(model6) provides a 95% confidence interval for the model coefficients.

IRR.95ci <- round(exp(coeff.95ci), 1)[-1,]

The required values are obtained by exponentiating the last matrix with the first row (the intercept) removed. The display is rounded to 1 decimal place for better viewing. The matrix columns are then labelled and the 95% CI displayed.

colnames(IRR.95ci) <- c("IRR", "lower95ci", "upper95ci")
IRR.95ci
##                   IRR lower95ci upper95ci
## agegr50-59        4.3       2.7       7.2
## agegr60-69       10.7       6.8      17.4
## agegr70-79       13.8       8.5      23.2
## arsenic21+ years  2.2       1.8       2.9

A simpler way is to use the command idr.display in epiDisplay (note that it uses the term IDR: incidence density ratio).

idr.display(model6, decimal=1)
## 
## Poisson regression predicting respdeath with offset = log(personyrs) 
##  
##                               crude IDR(95%CI) adj. IDR(95%CI)  P(Wald's test)
## agegr: ref.=40-49                                                             
##    50-59                      4.5 (2.8,7.3)    4.3 (2.7,7)      < 0.001       
##    60-69                      11.3 (7.1,17.9)  10.7 (6.7,17)    < 0.001       
##    70-79                      14.5 (8.8,23.8)  13.8 (8.4,22.7)  < 0.001       
##                                                                               
## arsenic2: 1+ years vs <1 year 2.5 (2,3.1)      2.2 (1.8,2.9)    < 0.001       
##                                                                               
##                               P(LR-test)
## agegr: ref.=40-49             < 0.001   
##    50-59                                
##    60-69                                
##    70-79                                
##                                         
## arsenic2: 1+ years vs <1 year < 0.001   
##                                         
## Log-likelihood = -171.9
## No. of observations = 114
## AIC value = 353.8

The command idr.display gives results to 3 decimal places by default; this can easily be changed by the user, as was done above with the decimal=1 argument.

Optional material: negative binomial regression

Recall that one of the assumptions for a valid Poisson model is that the mean and variance of the count variable are equal. The negative binomial distribution is a more generalized distribution for ‘count’ response data, allowing for greater dispersion (variance) of the counts. In practice, it is quite common for the variance of the outcome to be larger than the mean; this is called overdispersion. If a count variable is overdispersed, Poisson regression underestimates the standard errors of the predictor variables.

When overdispersion is evident, one solution is to specify that the errors have a negative binomial distribution. Negative binomial regression gives the same coefficients as Poisson regression but larger standard errors; the interpretation of the results is the same. Take as an example counts of water containers infested with mosquito larvae in a field survey. The data are contained in the dataset DHF99.

library(MASS)
data(DHF99)
summary(DHF99)
##     houseid          village             education     containers     
##  Min.   :  1.00   Min.   :  1.00   Primary    :168   Min.   : 0.0000  
##  1st Qu.: 77.75   1st Qu.: 23.00   Secondary  : 36   1st Qu.: 0.0000  
##  Median :154.50   Median : 51.00   High school: 34   Median : 0.0000  
##  Mean   :174.27   Mean   : 48.56   Bachelor   : 25   Mean   : 0.3512  
##  3rd Qu.:269.25   3rd Qu.: 75.00   Other      : 37   3rd Qu.: 0.0000  
##  Max.   :385.00   Max.   :105.00                     Max.   :11.0000  
##                                                      NA's   :1        
##   viltype   
##  rural:180  
##  urban: 72  
##  slum : 48  
##             
##             
##             
## 
describeBy(DHF99$containers, group=DHF99$viltype)
## 
##  Descriptive statistics by group 
## group: rural
##    vars   n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 179 0.49 1.25      0     0.2   0   0  11    11 4.58     29.1 0.09
## ------------------------------------------------------------ 
## group: urban
##    vars  n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 72 0.07 0.26      0       0   0   0   1     1 3.32     9.13 0.03
## ------------------------------------------------------------ 
## group: slum
##    vars  n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 48 0.25 0.53      0    0.15   0   0   2     2 1.93     2.84 0.08
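
Consistent with the group-level standard deviations above, a quick overall comparison of the mean and variance of the outcome suggests overdispersion (a rough screen only, since covariates are ignored):

mean(DHF99$containers, na.rm=TRUE)  # about 0.35
var(DHF99$containers, na.rm=TRUE)   # noticeably larger than the mean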

The function for performing a negative binomial GLM is glm.nb, which is located in the MASS library. In addition, a very helpful function for selecting the best model based on the AIC value is the step function, which is located in the stats library (a default library loaded on start-up).

model.poisson <- step(glm(containers ~ education + viltype, family=poisson, data=DHF99))
## Start:  AIC=513.3
## containers ~ education + viltype
## 
##             Df Deviance    AIC
## - education  4   363.67 505.92
## <none>           363.05 513.30
## - viltype    2   397.54 543.79
## 
## Step:  AIC=505.92
## containers ~ viltype
## 
##           Df Deviance    AIC
## <none>         363.67 505.92
## - viltype  2   398.52 536.76
model.nb      <- step(glm.nb(containers ~ education + viltype, data=DHF99))
## Start:  AIC=432.05
## containers ~ education + viltype
## 
##             Df Deviance    AIC
## - education  4   156.64 424.23
## <none>           156.46 432.05
## - viltype    2   174.78 446.37
## 
## Step:  AIC=424.23
## containers ~ viltype
## 
##           Df Deviance    AIC
## <none>         156.37 424.23
## - viltype  2   174.95 438.80
coef(model.poisson)
##  (Intercept) viltypeurban  viltypeslum 
##   -0.7100490   -1.9571792   -0.6762454
coef(model.nb)
##  (Intercept) viltypeurban  viltypeslum 
##   -0.7100490   -1.9571792   -0.6762454

Both models end up with only ‘viltype’ selected, and the coefficients are identical. The Poisson model shows a significant lack of fit, consistent with overdispersion, but the negative binomial model does not.

poisgof(model.poisson)$p.value
## [1] 0.004387831
poisgof(model.nb)$p.value
## [1] 1

The AIC of the negative binomial model is also better (smaller) than that of the Poisson model.

model.poisson$aic
## [1] 505.9154
model.nb$aic
## [1] 426.2283

Finally, the main differences to be examined are their standard errors, the 95% confidence intervals and P values.

summary(model.poisson)$coefficients
##                Estimate Std. Error   z value     Pr(>|z|)
## (Intercept)  -0.7100490  0.1066000 -6.660873 2.722059e-11
## viltypeurban -1.9571792  0.4597429 -4.257117 2.070800e-05
## viltypeslum  -0.6762454  0.3077286 -2.197538 2.798202e-02
summary(model.nb)$coefficients
##                Estimate Std. Error   z value     Pr(>|z|)
## (Intercept)  -0.7100490  0.1731160 -4.101578 4.103414e-05
## viltypeurban -1.9571792  0.5255707 -3.723912 1.961591e-04
## viltypeslum  -0.6762454  0.4274174 -1.582166 1.136116e-01
idr.display(model.poisson)
## 
## Poisson regression predicting # infested vessels 
##  
##                          IDR(95%CI)        P(Wald's test) P(LR-test)
## Village type: ref.=rural                                  < 0.001   
##    urban                 0.14 (0.06,0.35)  < 0.001                  
##    slum                  0.51 (0.28,0.93)  0.028                    
##                                                                     
## Log-likelihood = -249.9577
## No. of observations = 299
## AIC value = 505.9154
attach(DHF99)
idr.display(model.nb)
## 
## Negative binomial regression predicting # infested vessels 
##  
##                          IDR(95%CI)        P(Wald's test) P(LR-test)
## Village type: ref.=rural                                  < 0.001   
##    urban                 0.14 (0.05,0.4)   < 0.001                  
##    slum                  0.51 (0.22,1.18)  0.114                    
##                                                                     
## Log-likelihood = -209.1141
## No. of observations = 299
## AIC value = 426.2283

The standard errors from the negative binomial model are slightly larger than those from the Poisson model, resulting in wider 95% confidence intervals and larger P values. From the Poisson regression, both the urban community and the slum area had a significantly lower risk of infestation than rural areas (IDRs of about 0.14 and 0.51, i.e. reductions of roughly 86% and a half, respectively). However, from the negative binomial regression, only the urban community had a significantly lower risk.