Often, assuming linearity between a dependent variable and a set of predictors is not suitable. Normal linear models rely on a normality assumption, and this is a limitation when, for instance, the dependent variable is discrete and its range spans from 0 to \(+\infty\). In normal linear models, the main assumption is:
\[ E[y_i]=\eta_i=\beta_0+\sum_{j=1}^{p}\beta_jx_{ij}.\]
Extending the linear framework allows for the following transformation involving a twice differentiable link function \(g\) such that:
\[g(E[y_i])=\eta_i=\beta_0+\sum_{j=1}^{p}\beta_jx_{ij}.\]
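For example, for count data (discrete, with range from 0 to \(+\infty\)) a common choice is the log link, which maps the positive mean onto the whole real line:
\[g(E[y_i])=\log(E[y_i])=\eta_i, \qquad E[y_i]=e^{\eta_i}>0.\]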
The following example is taken from: Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
It is known that well water in Bangladesh and other South Asian countries may be contaminated with natural arsenic. Arsenic ingestion causes cumulative poisoning, and the risk of cancer and other diseases is estimated to be proportional to the dose intake.
A research team from the United States and Bangladesh examined the water of several wells in the Araihazar district of Bangladesh, measured the arsenic level, and classified each well as safe if its arsenic level was below 0.5 (in units of hundreds of micrograms per liter) or unsafe otherwise. All the people using unsafe wells were encouraged to switch to a nearby well: contamination does not depend on proximity, so nearby wells can have very different arsenic levels.
The dataset in the file Wells.csv
contains information on whether or not 3020 households in Bangladesh changed the well they were using a few years after the arsenic investigation.
Using logistic regression, we analyse the main factors of well switching among the users of unsafe wells.
# Load the data (dataset Wells.csv)
# See the structure
# See a summary of all variables in the dataset
# See the distribution of distance, arsenic and education by switch
setwd("C:/Users/Gioia/Desktop/lab 5") # adjust to your local path
data <- read.csv2("Wells.csv", header=TRUE)
str(data)
## 'data.frame': 3020 obs. of 5 variables:
## $ switch : int 1 1 0 1 1 1 1 1 1 1 ...
## $ arsenic: num 2.36 0.71 2.07 1.15 1.1 3.9 2.97 3.24 3.28 2.52 ...
## $ dist : num 16.8 47.3 21 21.5 40.9 ...
## $ assoc : int 0 0 0 0 1 1 1 0 1 1 ...
## $ educ : int 0 0 10 12 14 9 4 10 0 0 ...
summary(data)
## switch arsenic dist assoc educ
## Min. :0.0000 Min. :0.510 Min. : 0.387 Min. :0.0000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.:0.820 1st Qu.: 21.117 1st Qu.:0.0000 1st Qu.: 0.000
## Median :1.0000 Median :1.300 Median : 36.761 Median :0.0000 Median : 5.000
## Mean :0.5752 Mean :1.657 Mean : 48.332 Mean :0.4228 Mean : 4.828
## 3rd Qu.:1.0000 3rd Qu.:2.200 3rd Qu.: 64.041 3rd Qu.:1.0000 3rd Qu.: 8.000
## Max. :1.0000 Max. :9.650 Max. :339.531 Max. :1.0000 Max. :17.000
par(mfrow=c(1,2))
boxplot(dist ~ switch, data = data, horizontal = TRUE,
xlab = "Distance (in meters)", ylab = "Switch")
boxplot(arsenic ~ switch, data = data, horizontal = TRUE,
xlab = "Arsenic (in hundreds of micrograms per liter)", ylab = "Switch")
par(mfrow=c(1,1))
barplot(prop.table(table(data$switch, data$educ), margin = 1),
beside=TRUE, legend.text = c("no switch", "switch"), xlab="Education (years)")
We start the analysis by modelling the effect of the variable dist
on the variable switch. Since the outcome variable is binary, the model chosen is the logistic regression model. \[\begin{align*}
\text{Pr}(y_i=1)&=p_i\\
\text{logit}(p_i)&=\mathbf{x}^\top_i\beta,
\end{align*}\] where \(\text{logit}(x)=\log(x/(1-x))\) is a function that maps the range \((0,1)\) to the range \((-\infty,\infty)\).
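As a quick numerical check, the logit and its inverse do map \((0,1)\) to the real line and back:

```r
# logit and its inverse (the inverse is also defined later in this lab)
logit <- function(p) log(p / (1 - p))
invlogit <- function(eta) 1 / (1 + exp(-eta))

p <- seq(0.01, 0.99, by = 0.01)
all.equal(invlogit(logit(p)), p)   # TRUE: the two functions are inverses
```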
In R:
# Fit the logistic regression model with one predictor (distance)
fit1 <- glm(switch ~ dist , data = data,
family = binomial(link = "logit"))
summary(fit1)
##
## Call:
## glm(formula = switch ~ dist, family = binomial(link = "logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4406 -1.3058 0.9669 1.0308 1.6603
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.6059594 0.0603102 10.047 < 2e-16 ***
## dist -0.0062188 0.0009743 -6.383 1.74e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4118.1 on 3019 degrees of freedom
## Residual deviance: 4076.2 on 3018 degrees of freedom
## AIC: 4080.2
##
## Number of Fisher Scoring iterations: 4
Exercise: try to obtain the R output by hand.
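As a hint, glm's Fisher scoring is, for the canonical logit link, equivalent to iteratively reweighted least squares (IRLS). Here is a minimal sketch of that algorithm on simulated data (simulated so that the block runs without Wells.csv; the true coefficients 0.5 and -0.3 are arbitrary):

```r
# IRLS for logistic regression, compared against glm
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 - 0.3 * x))   # plogis is the inverse logit
X <- cbind(1, x)                           # design matrix with intercept

beta <- rep(0, ncol(X))                    # starting values
for (it in 1:25) {
  eta <- X %*% beta
  p <- plogis(eta)
  W <- as.vector(p * (1 - p))              # working weights
  z <- eta + (y - p) / W                   # working response
  beta <- solve(t(X) %*% (W * X), t(X) %*% (W * z))  # weighted LS step
}

fit <- glm(y ~ x, family = binomial)
max(abs(beta - coef(fit)))                 # essentially zero
```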
The intercept can be interpreted only assuming zero values for the other predictors. When zero is not interesting, or not even in the range of the data, the intercept must be evaluated at some other point. Here, when dist
\(=0\):
# Probability of switching well for dist = 0
# response function:
invlogit <- function(eta) 1/(1+exp(-eta))
invlogit(c(1,0)%*%coef(fit1))
## [,1]
## [1,] 0.6470185
# Alternatively
predict(fit1, newdata=data.frame(dist=0), type="response")
## 1
## 0.6470185
Thus the probability of switching well for a family who lives close to a safe well is about \(65\%\).
There are two ways of interpreting the coefficient of the predictor dist:

- It can be read as a difference of \(-0.0062\) in the logit probability of switching when the distance increases by one meter.
- Alternatively, we can evaluate the difference in the fitted logistic regression function when 1 is added to the predictor dist. This difference is on the probability scale, but requires choosing two specific values of the predictor. The mean of the predictor is a good starting point, since it corresponds to the steepest part of the inverse logit curve.

Thus, a one-unit increase in dist, that is, adding 1 meter to the distance to the nearest safe well, corresponds to a difference in the probability of switching of about \(-0.15\%\) on average. Notice that the coefficients do not have a linear effect on the probability that the outcome is 1, because of the nonlinearity of the model: the curve of the logistic function requires us to choose where to evaluate changes if we want to interpret them on the probability scale.
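A useful shortcut from the same book is the "divide by 4" rule. The derivative of the inverse logit,
\[\frac{d}{d\eta}\,\text{logit}^{-1}(\eta)=\text{logit}^{-1}(\eta)\big(1-\text{logit}^{-1}(\eta)\big)\le\frac{1}{4},\]
is maximal where the curve passes through \(p=1/2\), so \(\beta/4\) bounds the probability difference per unit change of the predictor; here \(-0.0062/4\approx-0.0016\), close to the exact value computed at the mean.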
# Difference in the probability of switching well due to an increase
# of one unit in the predictor from its mean
invlogit(c(1,mean(data$dist)+1)%*%coef(fit1))-
invlogit(c(1,mean(data$dist))%*%coef(fit1))
## [,1]
## [1,] -0.001519722
# Alternatively
predict(fit1, newdata=data.frame(dist=mean(data$dist)+1), type="response")-
predict(fit1, newdata=data.frame(dist=mean(data$dist)), type="response")
## 1
## -0.001519722
# Difference in the probability of switching well due to an increase
# of one unit in the predictor from its 99th percentile
invlogit(c(1,quantile(data$dist,0.99)+1)%*%coef(fit1))-
invlogit(c(1,quantile(data$dist,0.99))%*%coef(fit1))
## [,1]
## [1,] -0.001465513
The effect of a one-unit increase of the variable dist
on the inverse logit seems low, but this is misleading: distance is measured in meters, so this coefficient corresponds to the difference between, say, a house that is 90 meters away from the nearest safe well and a house that is 91 meters away. To improve the interpretability of the results, we rescale distance in 100-meter units.
# Consider dist/100 for interpretability purposes and fit the model
data$dist100 <- data$dist/100
fit2 <- glm(switch ~ dist100 , data = data,
family = binomial(link = "logit"))
summary(fit2)
##
## Call:
## glm(formula = switch ~ dist100, family = binomial(link = "logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4406 -1.3058 0.9669 1.0308 1.6603
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.60596 0.06031 10.047 < 2e-16 ***
## dist100 -0.62188 0.09743 -6.383 1.74e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4118.1 on 3019 degrees of freedom
## Residual deviance: 4076.2 on 3018 degrees of freedom
## AIC: 4080.2
##
## Number of Fisher Scoring iterations: 4
# Difference on probability scale due to an increase of 1 unit
# on the predictor dist100 from the mean
invlogit(c(1,mean(data$dist100)+1)%*%coef(fit2))-
invlogit(c(1,mean(data$dist100))%*%coef(fit2))
## [,1]
## [1,] -0.1542287
# It is the same as the difference due to an increase of 100 units on the
# predictor dist from the mean (i.e. considering fit1)
invlogit(c(1,mean(data$dist)+100)%*%coef(fit1))-
invlogit(c(1,mean(data$dist))%*%coef(fit1))
## [,1]
## [1,] -0.1542287
The following plot shows the fitted model with one predictor.
#########################
# To do during the lab  #
#########################
# Plot the fitted model
switch.jitter <- jitter(data$switch, factor=0.10)
par(mfrow=c(1,1))
with(data, plot(dist, switch.jitter,
xlab="Distance (in meters) to nearest safe well",
ylab = "Pr(Switching)", type = "n"))
curve (invlogit(coef(fit1)[1]+coef(fit1)[2]*x), lwd=1, add=TRUE)
points(data$dist, switch.jitter, pch=20, cex=.1)
The probability of switching well is about \(60\%\) for those who live close to a safe well, declining to about \(20\%\) when the distance from a safe well increases up to \(300\) meters from their home. This seems reasonable: the probability of switching is higher for people who live closer to a safe well (Gelman-Hill, 2007).
We now add the arsenic level. We expect that the higher the arsenic concentration, the higher the probability of switching well; in model terms, a positive sign for the arsenic coefficient.
# Adding predictors
# See the histogram of arsenic levels
par(mfrow=c(1,1))
hist(data$arsenic, freq = TRUE, xlab = "Arsenic concentration in well water",
breaks = seq(0, 0.25+max(data$arsenic), 0.25), main = "")
length(which(data$arsenic >= 0.5 & data$arsenic <= 0.75))
## [1] 632
# Fit the logistic regression model with two predictors (dist100 and arsenic)
fit3 <- glm(switch ~ dist100 + arsenic, data=data,
family = binomial("logit"))
summary(fit3)
##
## Call:
## glm(formula = switch ~ dist100 + arsenic, family = binomial("logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6351 -1.2139 0.7786 1.0702 1.7085
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.002749 0.079448 0.035 0.972
## dist100 -0.896644 0.104347 -8.593 <2e-16 ***
## arsenic 0.460775 0.041385 11.134 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4118.1 on 3019 degrees of freedom
## Residual deviance: 3930.7 on 3017 degrees of freedom
## AIC: 3936.7
##
## Number of Fisher Scoring iterations: 4
A numerical interpretation of one predictor's effect now requires choosing a value for the other predictor. As an example: given two wells with the same arsenic level (notice that it cannot be 0!), the logit probability of switching decreases by about \(0.9\) for every additional 100 meters of distance to the nearest safe well.
Equivalently: holding the distance to the nearest safe well fixed, a difference of 1 in arsenic concentration corresponds to a positive difference of \(0.46\) in the logit probability of switching.
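The same coefficients can also be read as odds ratios by exponentiating them. A sketch using the estimates printed in the summary above (copied in so that it runs without the data file):

```r
# fit3 coefficient estimates, copied from the summary output above
b <- c(dist100 = -0.896644, arsenic = 0.460775)
exp(b)
# Each additional 100 m multiplies the odds of switching by exp(-0.90), about 0.41;
# each extra unit of arsenic multiplies the odds by exp(0.46), about 1.59.
```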
The following plots show the fitted model with two predictors.
# Plot the fitted model
# Pr(Switching) vs distance (consider Arsenic = 0.5 and Arsenic = 1)
with(data, plot(dist100, switch.jitter,
xlab = "Distance (in hundreds of meters) to nearest safe well",
ylab = "Pr(Switching)", type = "n"))
curve(invlogit(coef(fit3)[1]+coef(fit3)[2]*x+coef(fit3)[3]),add=TRUE, col="blue")
# Alternatively
#curve(invlogit(coef(fit3)%*%c(1,x,1)),add=TRUE, col="blue")
curve(invlogit(coef(fit3)[1]+coef(fit3)[2]*x+coef(fit3)[3]*0.5),add=TRUE, col="red")
points(data$dist, switch.jitter, pch=20, cex=.1)
text (0.50, .27, "if As = 0.5", adj=0, cex=.8, col = "red")
text (0.75, .50, "if As = 1.0", adj=0, cex=.8, col = "blue")
#Alternatively (on the meter-based distance scale)
with(data, plot(dist, switch.jitter,
xlab = "Distance (in meters) to nearest safe well",
ylab = "Pr(Switching)", type = "n"))
curve(invlogit(coef(fit3)[1]+coef(fit3)[2]*x/100+coef(fit3)[3]),add=TRUE, col="blue")
curve(invlogit(coef(fit3)[1]+coef(fit3)[2]*x/100+coef(fit3)[3]*0.5),add=TRUE, col="red")
points(data$dist, switch.jitter, pch=20, cex=.1)
text (50, .27, "if As = 0.5", adj=0, cex=.8, col = "red")
text (75, .50, "if As = 1.0", adj=0, cex=.8, col = "blue")
# Pr(Switching) vs Arsenic (Consider distance = 0 and distance = 50)
with(data, plot(arsenic, switch.jitter,
xlab = "Arsenic concentration (in hundreds of micrograms per liter) in well water",
ylab = "Pr(Switching)", type = "n"))
curve(invlogit(coef(fit3)[1]+coef(fit3)[3]*x),add=TRUE, col="red")
text(1.5,0.8,"if dist=0", col="red")
curve(invlogit(coef(fit3)[1]+coef(fit3)[2]*0.5+coef(fit3)[3]*x),add=TRUE, col="blue")
text(4,0.6,"if dist=50", col="blue")
points(data$arsenic, switch.jitter, cex=0.01, pch=20)
In general terms, the fitted model can be summarized as follows: switching is easier if there is a nearby safe well, and a household whose existing well has a high arsenic level has more motivation to switch.
Extra: try to interpret the change in the estimated coefficient for the predictor dist100
from model fit2 (\(\beta_{\texttt{dist100}}=-0.62\)) to model fit3 (\(\beta_{\texttt{dist100}}=-0.90\)).
We now want to verify if there is a statistically significant effect of the interaction between the distance and the arsenic concentration on the probability of switching well.
# Logistic regression with interactions
fit4 <- glm(switch ~ dist100 + arsenic + dist100:arsenic,
data = data, family = binomial(link = "logit"))
# Equivalently (by using *)
fit4 <- glm(switch ~ dist100*arsenic,
data = data, family = binomial(link = "logit"))
summary(fit4)
##
## Call:
## glm(formula = switch ~ dist100 * arsenic, family = binomial(link = "logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7823 -1.2004 0.7696 1.0816 1.8476
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.14787 0.11754 -1.258 0.20838
## dist100 -0.57722 0.20918 -2.759 0.00579 **
## arsenic 0.55598 0.06932 8.021 1.05e-15 ***
## dist100:arsenic -0.17891 0.10233 -1.748 0.08040 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4118.1 on 3019 degrees of freedom
## Residual deviance: 3927.6 on 3016 degrees of freedom
## AIC: 3935.6
##
## Number of Fisher Scoring iterations: 4
As said before, interpreting the coefficients at an arsenic level of 0 does not make sense, since the minimum observed level is 0.51. We therefore fix the predictors at their means:
# Probability of switching for an average distance and
# an average concentration of arsenic
invlogit(c(1,mean(data$dist100), mean(data$arsenic),
mean(data$dist100)*mean(data$arsenic))%*%coef(fit4))
## [,1]
## [1,] 0.5868829
#Equivalently
predict(fit4, newdata=data.frame(dist100 = mean(data$dist100),
arsenic = mean(data$arsenic)),
type = "response")
## 1
## 0.5868829
# Difference in probability due to an increase of 100 meters
# in the distance (from the mean), holding arsenic at its mean
invlogit(c(1,mean(data$dist100)+1, mean(data$arsenic),
(mean(data$dist100)+1)*mean(data$arsenic))%*%coef(fit4))-
invlogit(c(1,mean(data$dist100), mean(data$arsenic),
mean(data$dist100)*mean(data$arsenic))%*%coef(fit4))
## [,1]
## [1,] -0.2146287
# Difference in probability due to an increase of 1 unit
# in the arsenic (from the mean), holding distance at its mean
invlogit(c(1,mean(data$dist100),mean(data$arsenic)+1,
mean(data$dist100)*(mean(data$arsenic)+1))%*% coef(fit4))-
invlogit(c(1,mean(data$dist100),mean(data$arsenic),
mean(data$dist100)*mean(data$arsenic))%*%coef(fit4))
## [,1]
## [1,] 0.1074813
# Difference in logit probability scale when dist increases by 100 meters,
# considering arsenic at its mean level
coef(fit4)[2] + coef(fit4)[4]*mean(data$arsenic)
## dist100
## -0.8736527
# Difference in logit probability scale when arsenic increases by 1 unit,
# considering dist at its mean level
coef(fit4)[3] + coef(fit4)[4]*mean(data$dist100)
## arsenic
## 0.4695081
Interpreting the coefficients is easier when the input variables are centered. The coefficient of each predictor is then interpretable as its effect on the logit probability when the other predictors are at their means.
# Centering the input variables
data$c.dist100 <- scale(data$dist100, scale = FALSE)
data$c.arsenic <- scale(data$arsenic, scale = FALSE)
# Refitting the model with centered inputs
fit5 <- glm(switch ~ c.dist100 + c.arsenic + c.dist100:c.arsenic,
family = binomial(link = "logit"), data = data)
summary(fit5)
##
## Call:
## glm(formula = switch ~ c.dist100 + c.arsenic + c.dist100:c.arsenic,
## family = binomial(link = "logit"), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7823 -1.2004 0.7696 1.0816 1.8476
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.35109 0.03985 8.810 <2e-16 ***
## c.dist100 -0.87365 0.10480 -8.337 <2e-16 ***
## c.arsenic 0.46951 0.04207 11.159 <2e-16 ***
## c.dist100:c.arsenic -0.17891 0.10233 -1.748 0.0804 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4118.1 on 3019 degrees of freedom
## Residual deviance: 3927.6 on 3016 degrees of freedom
## AIC: 3935.6
##
## Number of Fisher Scoring iterations: 4
The same graphs as before can be used to plot the model with the interaction term. The graph shows that the differences in switching associated with differences in arsenic level are large close to a safe well, with a diminishing effect far from any safe well. Comparing this plot with the one for the model without interaction shows that the main difference between the two curves lies in the region with few data points.
# Pr(Switching) vs Distance (Consider Arsenic=0.5 and Arsenic = 1)
with(data, plot(data$dist, switch.jitter, type = "n",
xlab = "Distance(in meters) to nearest safe well",
ylab = "Pr(Switching)"))
# use fit4: its coefficients are on the uncentered scale of these inputs
curve(invlogit(cbind(1, x/100, 1, x/100)%*%coef(fit4)), add=TRUE, col="red")
curve(invlogit(cbind(1, x/100, 0.5, 0.5*x/100)%*%coef(fit4)), add=TRUE, col="blue")
text(100,0.6,"if As = 1", col="red")
text(35,0.4,"if As = 0.5", col="blue")
points(data$dist, switch.jitter, pch=20, cex=0.01)
# Pr(Switching) vs Arsenic (Consider Distance = 0 and Distance = 50)
with(data, plot(data$arsenic, switch.jitter, type = "n",
xlab = "Arsenic concentration (in hundreds of micrograms per liter) in well water",
ylab = "Pr(Switching)"))
curve(invlogit(cbind(1, 0, x, 0)%*%coef(fit4)), add=TRUE, col="red")
curve(invlogit(cbind(1, 0.5, x, 0.5*x)%*%coef(fit4)), add=TRUE, col="blue")
text(2,0.9,"if Dist = 0", col="red")
text(3,0.6,"if Dist = 50", col="blue")
points(data$arsenic, switch.jitter, pch=20, cex=0.01)
The outcome variable in logistic regression is discrete, and so are the residuals of the model. The plot of the residuals, defined as \[residual_i = y_i - E(y_i\mid X_i)=y_i-\text{logit}^{-1}(X_i\beta),\] versus the fitted values is therefore not useful.
# Residual plot: wrong
# fit7 below is a later model not shown in this excerpt; to reproduce the
# plots with what is available here, fit5 can be used in its place
pred7 <- predict(fit7, type="response")
# or
# pred7 <- fit7$fitted.values
plot(c(0,1),c(-1,1), xlab = "Estimated Pr(Switching)", type = "n",
ylab = "Observed - Estimated", main = "Residual Plot")
abline(h = 0, col = "gray", lwd = 0.5)
points(pred7, data$switch - pred7, pch = 20, cex = 0.2)
A more readable plot can be made using binned residuals:
divide the data into categories (bins) based on their fitted values, then plot the average residual versus the average fitted value for each bin (Gelman and Hill, 2007).
The number of bins has to be chosen so that each bin contains enough points: with too many bins the plot is too noisy, while with too few bins patterns in the residuals are hidden.
Grey lines depict the \(\pm 2\) standard-error bounds, within which we expect about 95% of the binned residuals to fall. The standard error is defined as \(\sqrt{p(1-p)/n}\), where \(n\) is the number of points used to compute each average residual.
## Binned Residual Plot
# x represents the fitted values
# y represents the observed - fitted values
# nclass = number of categories (bins)
binned.resids <- function(x, y, nclass = sqrt(length(x))){
breaks.index <- floor(length(x)*(1:(nclass))/nclass)
breaks <- c(-Inf, sort(x)[breaks.index], Inf)
output <- NULL
xbreaks <- NULL
x.binned <- as.numeric(cut(x, breaks))
for(i in 1:nclass){
items <- (1:length(x))[x.binned == i]
x.range <- range(x[items])
xbar <- mean(x[items])
ybar <- mean(y[items])
n <- length(items)
sdev <- sd(y[items])
output <- rbind(output, c(xbar, ybar, n, x.range, 2*sdev/sqrt(n)))
}
colnames(output) <- c("xbar", "ybar", "n", "x.lo", "x.hi", "2se")
return(list(binned = output, xbreaks = xbreaks))
}
# Minor changes w.r.t. the binned.resids function of the arm package
# library("arm")
# arm::binned.resids
# Binned residuals vs Estimated probability of Switching
br7 <- binned.resids(pred7, data$switch - pred7, nclass = 50)$binned
plot(range(br7[,1]), range(br7[,2],br7[,6], -br7[,6]),
xlab = "Estimated Pr(Switching)", ylab = "Average residual",
type = "n", main = "Binned Residual Plot")
abline(h = 0, col = "gray", lwd = 0.5)
lines(br7[,1], br7[,6], col = "gray", lwd = 0.5)
lines(br7[,1], -br7[,6], col = "gray", lwd = 0.5)
points(br7[,1], br7[,2], pch = 19, cex = 0.5)
A pattern can be identified by looking at the plot of the binned residuals with respect to the arsenic level. There are large negative binned residuals for some households whose wells have low arsenic concentrations: these points correspond to people who are less likely to switch well than the model predicts (by almost \(20\%\)). Moreover, the distribution of the residuals shows a slight pattern: the model seems to underestimate the probability of switching for medium arsenic levels while overestimating it for high arsenic concentrations.
# Plot of the binned residuals vs input of interest
# distance
br.dist <- binned.resids(data$dist, data$switch - pred7,
nclass = 40)$binned
plot(range(br.dist[,1]), range(br.dist[,2], br.dist[,6], -br.dist[,6]),
xlab = "Distance to the nearest safe well",
ylab = "Average residual", type = "n",
main = "Binned residual plot")
abline(h = 0, col = "gray", lwd = 0.5)
lines(br.dist[,1], br.dist[,6], col = "gray", lwd = 0.5)
lines(br.dist[,1], -br.dist[,6], col = "gray", lwd = 0.5)
points(br.dist[,1], br.dist[,2], pch = 19, cex = 0.5)
# Arsenic
br.arsenic <- binned.resids(data$arsenic, data$switch - pred7,
nclass = 40)$binned
plot(range(br.arsenic[,1]), range(br.arsenic[,2], br.arsenic[,6], -br.arsenic[,6]),
xlab = "Arsenic concentration",
ylab = "Average residual", type = "n",
main = "Binned residual plot")
abline(h = 0, col = "gray", lwd = 0.5)
lines(br.arsenic[,1], br.arsenic[,6], col = "gray", lwd = 0.5)
lines(br.arsenic[,1], -br.arsenic[,6], col = "gray", lwd = 0.5)
points(br.arsenic[,1], br.arsenic[,2], pch = 19, cex = 0.5)
Extra: propose a possible solution to improve the model and fit it. Compare the new binned residual plots with the previous ones.
The error rate is defined as the proportion of cases for which the fitted probability is above 0.5 while the outcome is 0, or the fitted probability is below 0.5 while the outcome is 1, that is, when the prediction-guessing is wrong: \[er = \text{mean}\big((y=0 \,\&\, \mathrm{E}(y\mid X)>.5) \,\lor\, (y=1 \,\&\, \mathrm{E}(y\mid X)<.5)\big).\] It should be a value between 0 (perfect prediction-guessing) and 0.5 (random prediction-guessing).
Usually we expect the error rate to be lower than that of the null model, that is, the intercept-only model. The estimated probability for the null model is \(p=\sum_i y_i/n\), the proportion of 1's in the data. The error rate of the null model is \(\min(p, 1-p)\), and it corresponds to assigning 1 (or 0) to all the fitted values.
# Error rate
# Error rate of the Null Model
fit0 <- glm(switch ~ 1, family = binomial(link = "logit"), data = data)
coef(fit0)
## (Intercept)
## 0.3029584
invlogit(coef(fit0))
## (Intercept)
## 0.5751656
m0 <- mean(data$switch)
err.rate0 <- min(m0, 1 - m0)
# Alternatively
err.rate0 <- mean((rep(1, length(data$switch)) > 0.5 & data$switch ==0) |
((rep(1, length(data$switch)) < 0.5 & data$switch ==1 )))
So if our model improves on simple guessing, we expect its error rate to be lower than the error rate of the null model. Our last fitted model, with an error rate of \(0.38\), correctly predicts \(62\%\) of the switching choices, only a \(4\%\) improvement over simply guessing that all the households will switch.
# Error rate of Model 7
err.rate7 <- mean((pred7 > 0.5 & data$switch ==0) |
((pred7 < 0.5 & data$switch ==1 )))
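The computation above can be wrapped in a small reusable helper (the name error.rate is ours, not from the lab):

```r
# Hypothetical helper: misclassification rate of fitted probabilities p
# against binary outcomes y, with a configurable threshold
error.rate <- function(p, y, threshold = 0.5) {
  mean((p > threshold & y == 0) | (p < threshold & y == 1))
}

# Toy example: the first two cases are classified correctly, the last two not
error.rate(c(0.9, 0.2, 0.7, 0.4), c(1, 0, 0, 1))  # 0.5
```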
The error rate can be thought of as an analogue of the \(R^2\) for the linear model: it is a useful measure of model fit, but it cannot substitute a deeper evaluation of the model based on the significance of the coefficients and on diagnostic plots.
Extra: justify the 'high' error rate obtained with model fit.8.
Extra: compute the error rate for your model and compare it with the previous one.
Let us simulate some data to fit a logistic regression with two predictors: the first from a standard normal distribution, the second from a Bernoulli distribution with parameter 0.5.
# Separation in logistic regression
library("arm") #To use the invlogit, display functions of arm package
set.seed(2)
# Simulate some data
n <- 100
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.5)
# Fix the true parameter values
b0 <- 1
b1 <- 1.5
b2 <- 2
y <- rbinom(n, 1, arm::invlogit(b0 + b1*x1 + b2*x2))
# Fit the Logistic regression model
M1 <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
display(M1)
## glm(formula = y ~ x1 + x2, family = binomial(link = "logit"))
## coef.est coef.se
## (Intercept) 0.61 0.43
## x1 2.05 0.52
## x2 3.96 0.96
## ---
## n = 100, k = 3
## residual deviance = 52.1, null deviance = 105.4 (difference = 53.3)
Now we plot the fitted logistic curve when the dichotomous predictor is equal to 0 and 1.
# Plot the fitted logistic curve when the binary predictor is equal to 0 and 1
# x2 == 0
curve(arm::invlogit(coef(M1)[1] + coef(M1)[2]*x + coef(M1)[3]*0),
from=-5,to=5, ylab="invlogit", xlab="x1")
# x2 == 1
curve(arm::invlogit(coef(M1)[1] + coef(M1)[2]*x + coef(M1)[3]*1),
from=-5,to=5, add=TRUE, lty = 2)
To create quasi-complete separation, we assign the value 1 to all the \(y\)'s for which \(x_2 = 1\).
# Create quasi separation
# Assign the value 1 to all ys when x2 = 1
y <- ifelse(x2 == 1, 1, y)
# Fit the model
M2 <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
display(M2)
## glm(formula = y ~ x1 + x2, family = binomial(link = "logit"))
## coef.est coef.se
## (Intercept) 0.60 0.43
## x1 2.00 0.58
## x2 21.65 2078.43
## ---
## n = 100, k = 3
## residual deviance = 36.1, null deviance = 97.2 (difference = 61.2)
# Plot the fitted logistic curve when the binary predictor is equal to 0 and 1
# x2 == 0
curve(arm::invlogit(coef(M2)[1] + coef(M2)[2]*x + coef(M2)[3]*0),
from=-15,to=5, ylab="invlogit", xlab="x1")
# x2 == 1
curve(arm::invlogit(coef(M2)[1] + coef(M2)[2]*x + coef(M2)[3]*1),
from=-15,to=5, add=TRUE, lty = 2)
Here we create complete separation by setting \(y=1\) whenever \(x_1>0\). Now the value of \(y\) can be exactly predicted from \(x_1\), and the best-fit logistic regression curve is \(y = \text{logit}^{-1}(\infty \cdot x_1)\), which has an infinite slope at \(x_1=0\).
# Create complete separation
# Assign y = 1 when x1 > 0
y <- ifelse(x1>0, 1, 0)
# Fit the model with only x1 as predictor
M3 <- glm(y ~ x1, family = binomial(link = "logit"))
display(M3)
## glm(formula = y ~ x1, family = binomial(link = "logit"))
## coef.est coef.se
## (Intercept) 14.55 3127.56
## x1 590.03 79399.98
## ---
## n = 100, k = 2
## residual deviance = 0.0, null deviance = 138.0 (difference = 138.0)
# Plot the fitted logistic curve
curve(arm::invlogit(coef(M3)[1] + coef(M3)[2]*x),
from=-5,to=5)
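A possible remedy, suggested in Gelman and Hill's book and implemented in the same arm package already loaded above, is to put a weakly informative prior on the coefficients, which keeps the estimates finite under separation. A sketch:

```r
library(arm)  # provides bayesglm and display

set.seed(2)
x1 <- rnorm(100)
y <- ifelse(x1 > 0, 1, 0)  # complete separation, as above

# bayesglm uses a weakly informative Cauchy prior on the coefficients,
# so the estimates stay finite even under separation
M3b <- bayesglm(y ~ x1, family = binomial(link = "logit"))
display(M3b)
```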