Often, assuming linearity between a dependent variable and a set of predictors is not suitable. Normal linear models rely on a normality assumption, and this is a limitation when, for instance, the dependent variable is discrete and its range spans from 0 to \(+\infty\). In normal linear models, the main assumption is:
\[ E[y_i]=\eta_i=\beta_0+\sum_{j=1}^{p}\beta_jx_{ij}.\]
Extending the linear framework allows for the following transformation involving a twice differentiable link function \(g\) such that:
\[g(E[y_i])=\eta_i=\beta_0+\sum_{j=1}^{p}\beta_jx_{ij}.\]
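For example, for count data (discrete, with range from 0 to \(+\infty\)) a common choice is the log link, which maps the positive mean onto the whole real line:
\[g(E[y_i])=\log(E[y_i])=\eta_i, \qquad E[y_i]=e^{\eta_i}>0.\]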
The following example is taken from: Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
It is known that well water in Bangladesh and other South Asian countries may be contaminated with natural arsenic. Arsenic ingestion causes cumulative poisoning, and the risk of cancer and other diseases is estimated to be proportional to the dose intake.
A research team from the United States and Bangladesh examined the water of several wells in the Araihazar district of Bangladesh, measured the arsenic level, and classified each well as safe if its arsenic level was below 0.5 (in units of hundreds of micrograms per liter) or unsafe otherwise. All the people using unsafe wells were encouraged to switch to a nearby well: contamination does not depend on proximity, so nearby wells can have very different arsenic levels.
The dataset in the file Wells.csv
contains information on whether or not 3020 households in Bangladesh changed the well they were using a few years after the arsenic investigation.
Using logistic regression, we analyse the main factors of well switching among the users of unsafe wells.
# Load the data (dataset Wells.csv)
# See the structure
# See a summary of all variables in the dataset
# See the distribution of distance, arsenic and education by switch
setwd("C:/Users/Gioia/Desktop/lab 5") # adjust to your local path
data <- read.csv2("Wells.csv", header=TRUE)
str(data)
## 'data.frame': 3020 obs. of 5 variables:
## $ switch : int 1 1 0 1 1 1 1 1 1 1 ...
## $ arsenic: num 2.36 0.71 2.07 1.15 1.1 3.9 2.97 3.24 3.28 2.52 ...
## $ dist : num 16.8 47.3 21 21.5 40.9 ...
## $ assoc : int 0 0 0 0 1 1 1 0 1 1 ...
## $ educ : int 0 0 10 12 14 9 4 10 0 0 ...
summary(data)
## switch arsenic dist assoc educ
## Min. :0.0000 Min. :0.510 Min. : 0.387 Min. :0.0000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.:0.820 1st Qu.: 21.117 1st Qu.:0.0000 1st Qu.: 0.000
## Median :1.0000 Median :1.300 Median : 36.761 Median :0.0000 Median : 5.000
## Mean :0.5752 Mean :1.657 Mean : 48.332 Mean :0.4228 Mean : 4.828
## 3rd Qu.:1.0000 3rd Qu.:2.200 3rd Qu.: 64.041 3rd Qu.:1.0000 3rd Qu.: 8.000
## Max. :1.0000 Max. :9.650 Max. :339.531 Max. :1.0000 Max. :17.000
par(mfrow=c(1,2))
boxplot(dist ~ switch, data = data, horizontal = TRUE,
xlab = "Distance (in meters)", ylab = "Switch")
boxplot(arsenic ~ switch, data = data, horizontal = TRUE,
xlab = "Arsenic (in hundreds of micrograms per liter)", ylab = "Switch")
par(mfrow=c(1,1))
barplot(prop.table(table(data$switch, data$educ), margin = 1),
beside=TRUE, legend.text = c("no switch", "switch"), xlab="Education (years)")
We start the analysis by modelling the effect of the variable dist
on the variable switch. Since the outcome variable is binary, the model chosen is the logistic regression model. \[\begin{align*}
\text{Pr}(y_i=1)&=p_i\\
\text{logit}(p_i)&=\mathbf{x}^\top_i\beta,
\end{align*}\] where \(\text{logit}(x)=\log(x/(1-x))\) is a function that maps the range \((0,1)\) to the range \((-\infty,\infty)\).
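As a quick numerical check, the logit and its inverse do map \((0,1)\) to the real line and back:

```r
# logit and its inverse (the inverse is also defined later in this lab)
logit <- function(p) log(p / (1 - p))
invlogit <- function(eta) 1 / (1 + exp(-eta))

p <- seq(0.01, 0.99, by = 0.01)
all.equal(invlogit(logit(p)), p)   # TRUE: the two functions are inverses
```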
In R:
# Fit the logistic regression model with one predictor (distance)
fit1 <- glm(switch ~ dist , data = data,
family = binomial(link = "logit"))
summary(fit1)
##
## Call:
## glm(formula = switch ~ dist, family = binomial(link = "logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4406 -1.3058 0.9669 1.0308 1.6603
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.6059594 0.0603102 10.047 < 2e-16 ***
## dist -0.0062188 0.0009743 -6.383 1.74e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4118.1 on 3019 degrees of freedom
## Residual deviance: 4076.2 on 3018 degrees of freedom
## AIC: 4080.2
##
## Number of Fisher Scoring iterations: 4
Exercise: try to obtain the R output by hand.
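As a hint, glm's Fisher scoring is, for the canonical logit link, equivalent to iteratively reweighted least squares (IRLS). Here is a minimal sketch of that algorithm on simulated data (simulated so that the block runs without Wells.csv; the true coefficients 0.5 and -0.3 are arbitrary):

```r
# IRLS for logistic regression, compared against glm
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 - 0.3 * x))   # plogis is the inverse logit
X <- cbind(1, x)                           # design matrix with intercept

beta <- rep(0, ncol(X))                    # starting values
for (it in 1:25) {
  eta <- X %*% beta
  p <- plogis(eta)
  W <- as.vector(p * (1 - p))              # working weights
  z <- eta + (y - p) / W                   # working response
  beta <- solve(t(X) %*% (W * X), t(X) %*% (W * z))  # weighted LS step
}

fit <- glm(y ~ x, family = binomial)
max(abs(beta - coef(fit)))                 # essentially zero
```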
The intercept can be interpreted only assuming zero values for the other predictors. When zero is not interesting, or not even in the range of the data, the intercept must be evaluated at some other point. Here, when dist
\(=0\):
# Probability of switching well for dist = 0
# response function:
invlogit <- function(eta) 1/(1+exp(-eta))
invlogit(c(1,0)%*%coef(fit1))
## [,1]
## [1,] 0.6470185
# Alternatively
predict(fit1, newdata=data.frame(dist=0), type="response")
## 1
## 0.6470185
Thus the probability of switching well for a family who lives close to a safe well is about \(65\%\).
There are two ways of interpreting the coefficient of the predictor dist:

- It can be read as a difference of \(-0.0062\) in the logit probability of switching when the distance increases by one meter.
- Alternatively, we can evaluate the difference in the fitted logistic regression function when 1 is added to the predictor dist. This difference is on the probability scale, but requires choosing two specific values of the predictor. The mean of the predictor is a good starting point, since it corresponds to the steepest part of the inverse logit curve.

Thus, a one-unit increase in dist, that is, adding 1 meter to the distance to the nearest safe well, corresponds to a difference in the probability of switching of about \(-0.15\%\) on average. Notice that the coefficients do not have a linear effect on the probability that the outcome is 1, because of the nonlinearity of the model: the curve of the logistic function requires us to choose where to evaluate changes if we want to interpret them on the probability scale.
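A useful shortcut from the same book is the "divide by 4" rule. The derivative of the inverse logit,
\[\frac{d}{d\eta}\,\text{logit}^{-1}(\eta)=\text{logit}^{-1}(\eta)\big(1-\text{logit}^{-1}(\eta)\big)\le\frac{1}{4},\]
is maximal where the curve passes through \(p=1/2\), so \(\beta/4\) bounds the probability difference per unit change of the predictor; here \(-0.0062/4\approx-0.0016\), close to the exact value computed at the mean.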
# Difference in the probability of switching well due to an increase
# of one unit in the predictor from its mean
invlogit(c(1,mean(data$dist)+1)%*%coef(fit1))-
invlogit(c(1,mean(data$dist))%*%coef(fit1))
## [,1]
## [1,] -0.001519722
# Alternatively
predict(fit1, newdata=data.frame(dist=mean(data$dist)+1), type="response")-
predict(fit1, newdata=data.frame(dist=mean(data$dist)), type="response")
## 1
## -0.001519722
# Difference in the probability of switching well due to an increase
# of one unit in the predictor from its 99th percentile
invlogit(c(1,quantile(data$dist,0.99)+1)%*%coef(fit1))-
invlogit(c(1,quantile(data$dist,0.99))%*%coef(fit1))
## [,1]
## [1,] -0.001465513
The effect of a one-unit increase of the variable dist
on the inverse logit seems low, but this is misleading: distance is measured in meters, so this coefficient corresponds to the difference between, say, a house that is 90 meters away from the nearest safe well and a house that is 91 meters away. To improve the interpretability of the results, we rescale distance in 100-meter units.
# Consider dist/100 for interpretability purposes and fit the model
data$dist100 <- data$dist/100
fit2 <- glm(switch ~ dist100 , data = data,
family = binomial(link = "logit"))
summary(fit2)
##
## Call:
## glm(formula = switch ~ dist100, family = binomial(link = "logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4406 -1.3058 0.9669 1.0308 1.6603
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.60596 0.06031 10.047 < 2e-16 ***
## dist100 -0.62188 0.09743 -6.383 1.74e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4118.1 on 3019 degrees of freedom
## Residual deviance: 4076.2 on 3018 degrees of freedom
## AIC: 4080.2
##
## Number of Fisher Scoring iterations: 4
# Difference on probability scale due to an increase of 1 unit
# on the predictor dist100 from the mean
invlogit(c(1,mean(data$dist100)+1)%*%coef(fit2))-
invlogit(c(1,mean(data$dist100))%*%coef(fit2))
## [,1]
## [1,] -0.1542287
# It is the same as the difference due to an increase of 100 units on the
# predictor dist from the mean (i.e. considering fit1)
invlogit(c(1,mean(data$dist)+100)%*%coef(fit1))-
invlogit(c(1,mean(data$dist))%*%coef(fit1))
## [,1]
## [1,] -0.1542287
The following plot shows the fitted model with one predictor.
#########################
# To do during the lab  #
#########################
# Plot the fitted model
switch.jitter <- jitter(data$switch, factor=0.10)
par(mfrow=c(1,1))
with(data, plot(dist, switch.jitter,
xlab="Distance (in meters) to nearest safe well",
ylab = "Pr(Switching)", type = "n"))
curve (invlogit(coef(fit1)[1]+coef(fit1)[2]*x), lwd=1, add=TRUE)
points(data$dist, switch.jitter, pch=20, cex=.1)
The probability of switching well is about \(60\%\) for those who live close to a safe well, declining to about \(20\%\) when the distance from a safe well increases up to \(300\) meters from their home. This seems reasonable: the probability of switching is higher for people who live closer to a safe well (Gelman-Hill, 2007).
We now add the arsenic level. We expect that the higher the arsenic concentration, the higher the probability of switching well; in model terms, a positive sign for the arsenic coefficient.
# Adding predictors
# See the histogram of arsenic levels
par(mfrow=c(1,1))
hist(data$arsenic, freq = TRUE, xlab = "Arsenic concentration in well water",
breaks = seq(0, 0.25+max(data$arsenic), 0.25), main = "")
length(which(data$arsenic >= 0.5 & data$arsenic <= 0.75))
## [1] 632
# Fit the logistic regression model with two predictors (dist100 and arsenic)
fit3 <- glm(switch ~ dist100 + arsenic, data=data,
family = binomial("logit"))
summary(fit3)
##
## Call:
## glm(formula = switch ~ dist100 + arsenic, family = binomial("logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6351 -1.2139 0.7786 1.0702 1.7085
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.002749 0.079448 0.035 0.972
## dist100 -0.896644 0.104347 -8.593 <2e-16 ***
## arsenic 0.460775 0.041385 11.134 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4118.1 on 3019 degrees of freedom
## Residual deviance: 3930.7 on 3017 degrees of freedom
## AIC: 3936.7
##
## Number of Fisher Scoring iterations: 4
A numerical interpretation of one predictor's effect now requires choosing a value for the other predictor. As an example: given two wells with the same arsenic level (notice that it cannot be 0!), the logit probability of switching decreases by about \(0.9\) for every additional 100 meters of distance to the nearest safe well.
Equivalently: holding the distance to the nearest safe well fixed, a difference of 1 in arsenic concentration corresponds to a positive difference of \(0.46\) in the logit probability of switching.
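The same coefficients can also be read as odds ratios by exponentiating them. A sketch using the estimates printed in the summary above (copied in so that it runs without the data file):

```r
# fit3 coefficient estimates, copied from the summary output above
b <- c(dist100 = -0.896644, arsenic = 0.460775)
exp(b)
# Each additional 100 m multiplies the odds of switching by exp(-0.90), about 0.41;
# each extra unit of arsenic multiplies the odds by exp(0.46), about 1.59.
```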
The following plots show the fitted model with two predictors.
# Plot the fitted model
# Pr(Switching) vs distance (consider Arsenic = 0.5 and Arsenic = 1)
with(data, plot(dist100, switch.jitter,
xlab = "Distance (in hundreds of meters) to nearest safe well",
ylab = "Pr(Switching)", type = "n"))
curve(invlogit(coef(fit3)[1]+coef(fit3)[2]*x+coef(fit3)[3]),add=TRUE, col="blue")
# Alternatively
#curve(invlogit(coef(fit3)%*%c(1,x,1)),add=TRUE, col="blue")
curve(invlogit(coef(fit3)[1]+coef(fit3)[2]*x+coef(fit3)[3]*0.5),add=TRUE, col="red")
points(data$dist, switch.jitter, pch=20, cex=.1)
text (0.50, .27, "if As = 0.5", adj=0, cex=.8, col = "red")
text (0.75, .50, "if As = 1.0", adj=0, cex=.8, col = "blue")
#Alternatively (on the meter-based distance scale)
with(data, plot(dist, switch.jitter,
xlab = "Distance (in meters) to nearest safe well",
ylab = "Pr(Switching)", type = "n"))
curve(invlogit(coef(fit3)[1]+coef(fit3)[2]*x/100+coef(fit3)[3]),add=TRUE, col="blue")
curve(invlogit(coef(fit3)[1]+coef(fit3)[2]*x/100+coef(fit3)[3]*0.5),add=TRUE, col="red")
points(data$dist, switch.jitter, pch=20, cex=.1)
text (50, .27, "if As = 0.5", adj=0, cex=.8, col = "red")
text (75, .50, "if As = 1.0", adj=0, cex=.8, col = "blue")
# Pr(Switching) vs Arsenic (Consider distance = 0 and distance = 50)
with(data, plot(arsenic, switch.jitter,
xlab = "Arsenic concentration (in hundreds of micrograms per liter) in well water",
ylab = "Pr(Switching)", type = "n"))
curve(invlogit(coef(fit3)[1]+coef(fit3)[3]*x),add=TRUE, col="red")
text(1.5,0.8,"if dist=0", col="red")
curve(invlogit(coef(fit3)[1]+coef(fit3)[2]*0.5+coef(fit3)[3]*x),add=TRUE, col="blue")
text(4,0.6,"if dist=50", col="blue")
points(data$arsenic, switch.jitter, cex=0.01, pch=20)
In general terms, the fitted model can be summarized as follows: switching is easier if there is a nearby safe well, and a household whose existing well has a high arsenic level has more motivation to switch.
Extra: try to interpret the change in the estimated coefficient for the predictor dist100
from model fit2 (\(\beta_{\texttt{dist100}}=-0.62\)) to model fit3 (\(\beta_{\texttt{dist100}}=-0.90\)).
We now want to verify if there is a statistically significant effect of the interaction between the distance and the arsenic concentration on the probability of switching well.
# Logistic regression with interactions
fit4 <- glm(switch ~ dist100 + arsenic + dist100:arsenic,
data = data, family = binomial(link = "logit"))
# Equivalently (by using *)
fit4 <- glm(switch ~ dist100*arsenic,
data = data, family = binomial(link = "logit"))
summary(fit4)
##
## Call:
## glm(formula = switch ~ dist100 * arsenic, family = binomial(link = "logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7823 -1.2004 0.7696 1.0816 1.8476
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.14787 0.11754 -1.258 0.20838
## dist100 -0.57722 0.20918 -2.759 0.00579 **
## arsenic 0.55598 0.06932 8.021 1.05e-15 ***
## dist100:arsenic -0.17891 0.10233 -1.748 0.08040 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4118.1 on 3019 degrees of freedom
## Residual deviance: 3927.6 on 3016 degrees of freedom
## AIC: 3935.6
##
## Number of Fisher Scoring iterations: 4
As said before, interpreting the coefficients at an arsenic level of 0 does not make sense, since the minimum observed level is 0.51. We therefore fix the predictors at their means:
# Probability of switching for an average distance and
# an average concentration of arsenic
invlogit(c(1,mean(data$dist100), mean(data$arsenic),
mean(data$dist100)*mean(data$arsenic))%*%coef(fit4))
## [,1]
## [1,] 0.5868829
#Equivalently
predict(fit4, newdata=data.frame(dist100 = mean(data$dist100),
arsenic = mean(data$arsenic)),
type = "response")
## 1
## 0.5868829
# Difference in probability due to an increase of 100 meters
# in the distance (from the mean), holding arsenic at its mean
invlogit(c(1,mean(data$dist100)+1, mean(data$arsenic),
(mean(data$dist100)+1)*mean(data$arsenic))%*%coef(fit4))-
invlogit(c(1,mean(data$dist100), mean(data$arsenic),
mean(data$dist100)*mean(data$arsenic))%*%coef(fit4))
## [,1]
## [1,] -0.2146287
# Difference in probability due to an increase of 1 unit
# in the arsenic (from the mean), holding distance at its mean
invlogit(c(1,mean(data$dist100),mean(data$arsenic)+1,
mean(data$dist100)*(mean(data$arsenic)+1))%*% coef(fit4))-
invlogit(c(1,mean(data$dist100),mean(data$arsenic),
mean(data$dist100)*mean(data$arsenic))%*%coef(fit4))
## [,1]
## [1,] 0.1074813
# Difference in logit probability scale when dist increases by 100 meters,
# considering arsenic at its mean level
coef(fit4)[2] + coef(fit4)[4]*mean(data$arsenic)
## dist100
## -0.8736527
# Difference in logit probability scale when arsenic increases by 1 unit,
# considering dist at its mean level
coef(fit4)[3] + coef(fit4)[4]*mean(data$dist100)
## arsenic
## 0.4695081
Interpreting the coefficients is easier when the input variables are centered. The coefficient of each predictor is then interpretable as its effect on the logit probability when the other predictors are at their means.
# Centering the input variables
data$c.dist100 <- scale(data$dist100, scale = FALSE)
data$c.arsenic <- scale(data$arsenic, scale = FALSE)
# Refitting the model with centered inputs
fit5 <- glm(switch ~ c.dist100 + c.arsenic + c.dist100:c.arsenic,
family = binomial(link = "logit"), data = data)
summary(fit5)
##
## Call:
## glm(formula = switch ~ c.dist100 + c.arsenic + c.dist100:c.arsenic,
## family = binomial(link = "logit"), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7823 -1.2004 0.7696 1.0816 1.8476
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.35109 0.03985 8.810 <2e-16 ***
## c.dist100 -0.87365 0.10480 -8.337 <2e-16 ***
## c.arsenic 0.46951 0.04207 11.159 <2e-16 ***
## c.dist100:c.arsenic -0.17891 0.10233 -1.748 0.0804 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4118.1 on 3019 degrees of freedom
## Residual deviance: 3927.6 on 3016 degrees of freedom
## AIC: 3935.6
##
## Number of Fisher Scoring iterations: 4
The same graphs as before can be used to plot the model with the interaction term. The graph shows that the differences in switching associated with differences in arsenic level are large close to a safe well, with a diminishing effect far from any safe well. Comparing this plot with the one for the model without interaction shows that the main difference between the two curves lies in the region with few data points.
# Pr(Switching) vs Distance (Consider Arsenic=0.5 and Arsenic = 1)
with(data, plot(data$dist, switch.jitter, type = "n",
xlab = "Distance(in meters) to nearest safe well",
ylab = "Pr(Switching)"))
# use fit4: its coefficients are on the uncentered scale of these inputs
curve(invlogit(cbind(1, x/100, 1, x/100)%*%coef(fit4)), add=TRUE, col="red")
curve(invlogit(cbind(1, x/100, 0.5, 0.5*x/100)%*%coef(fit4)), add=TRUE, col="blue")
text(100,0.6,"if As = 1", col="red")
text(35,0.4,"if As = 0.5", col="blue")
points(data$dist, switch.jitter, pch=20, cex=0.01)
# Pr(Switching) vs Arsenic (Consider Distance = 0 and Distance = 50)
with(data, plot(data$arsenic, switch.jitter, type = "n",
xlab = "Arsenic concentration (in hundreds of micrograms per liter) in well water",
ylab = "Pr(Switching)"))
curve(invlogit(cbind(1, 0, x, 0)%*%coef(fit4)), add=TRUE, col="red")
curve(invlogit(cbind(1, 0.5, x, 0.5*x)%*%coef(fit4)), add=TRUE, col="blue")
text(2,0.9,"if Dist = 0", col="red")
text(3,0.6,"if Dist = 50", col="blue")
points(data$arsenic, switch.jitter, pch=20, cex=0.01)
The outcome variable in logistic regression is discrete, and so are the residuals of the model. The plot of the residuals, defined as \[residual_i = y_i - E(y_i\mid X_i)=y_i-\text{logit}^{-1}(X_i\beta),\] versus the fitted values is therefore not useful.
# Residual plot: wrong
# fit7 below is a later model not shown in this excerpt; to reproduce the
# plots with what is available here, fit5 can be used in its place
pred7 <- predict(fit7, type="response")
# or
# pred7 <- fit7$fitted.values
plot(c(0,1),c(-1,1), xlab = "Estimated Pr(Switching)", type = "n",
ylab = "Observed - Estimated", main = "Residual Plot")
abline(h = 0, col = "gray", lwd = 0.5)
points(pred7, data$switch - pred7, pch = 20, cex = 0.2)
A more readable plot can be made using binned residuals:
divide the data into categories (bins) based on their fitted values, then plot the average residual versus the average fitted value for each bin (Gelman and Hill, 2007).
The number of bins has to be chosen so that each bin contains enough points: with too many bins the plot is too noisy, while with too few bins patterns in the residuals are hidden.
Grey lines depict the \(\pm 2\) standard-error bounds, within which we expect about 95% of the binned residuals to fall. The standard error is defined as \(\sqrt{p(1-p)/n}\), where \(n\) is the number of points used to compute each average residual.
## Binned Residual Plot
# x represents the fitted values
# y represents the observed - fitted values
# nclass = number of categories (bins)
binned.resids <- function(x, y, nclass = sqrt(length(x))){
breaks.index <- floor(length(x)*(1:(nclass))/nclass)
breaks <- c(-Inf, sort(x)[breaks.index], Inf)
output <- NULL
xbreaks <- NULL
x.binned <- as.numeric(cut(x, breaks))
for(i in 1:nclass){
items <- (1:length(x))[x.binned == i]
x.range <- range(x[items])
xbar <- mean(x[items])
ybar <- mean(y[items])
n <- length(items)
sdev <- sd(y[items])
output <- rbind(output, c(xbar, ybar, n, x.range, 2*sdev/sqrt(n)))
}
colnames(output) <- c("xbar", "ybar", "n", "x.lo", "x.hi", "2se")
return(list(binned = output, xbreaks = xbreaks))
}
# Minor changes w.r.t. the binned.resids function of the arm package
# library("arm")
# arm::binned.resids
# Binned residuals vs Estimated probability of Switching
br7 <- binned.resids(pred7, data$switch - pred7, nclass = 50)$binned
plot(range(br7[,1]), range(br7[,2],br7[,6], -br7[,6]),
xlab = "Estimated Pr(Switching)", ylab = "Average residual",
type = "n", main = "Binned Residual Plot")
abline(h = 0, col = "gray", lwd = 0.5)
lines(br7[,1], br7[,6], col = "gray", lwd = 0.5)
lines(br7[,1], -br7[,6], col = "gray", lwd = 0.5)
points(br7[,1], br7[,2], pch = 19, cex = 0.5)
A pattern can be identified by looking at the plot of the binned residuals with respect to the arsenic level. There are large negative binned residuals for some households whose wells have low arsenic concentrations: these points correspond to people who are less likely to switch well than the model predicts (by almost \(20\%\)). Moreover, the distribution of the residuals shows a slight pattern: the model seems to underestimate the probability of switching for medium arsenic levels while overestimating it for high arsenic concentrations.
# Plot of the binned residuals vs input of interest
# distance
br.dist <- binned.resids(data$dist, data$switch - pred7,
nclass = 40)$binned
plot(range(br.dist[,1]), range(br.dist[,2], br.dist[,6], -br.dist[,6]),
xlab = "Distance to the nearest safe well",
ylab = "Average residual", type = "n",
main = "Binned residual plot")
abline(h = 0, col = "gray", lwd = 0.5)
lines(br.dist[,1], br.dist[,6], col = "gray", lwd = 0.5)
lines(br.dist[,1], -br.dist[,6], col = "gray", lwd = 0.5)
points(br.dist[,1], br.dist[,2], pch = 19, cex = 0.5)
# Arsenic
br.arsenic <- binned.resids(data$arsenic, data$switch - pred7,
nclass = 40)$binned
plot(range(br.arsenic[,1]), range(br.arsenic[,2], br.arsenic[,6], -br.arsenic[,6]),
xlab = "Arsenic concentration",
ylab = "Average residual", type = "n",
main = "Binned residual plot")
abline(h = 0, col = "gray", lwd = 0.5)
lines(br.arsenic[,1], br.arsenic[,6], col = "gray", lwd = 0.5)
lines(br.arsenic[,1], -br.arsenic[,6], col = "gray", lwd = 0.5)
points(br.arsenic[,1], br.arsenic[,2], pch = 19, cex = 0.5)
Extra: propose a possible solution to improve the model and fit it. Compare the new binned residual plots with the previous ones.
The error rate is defined as the proportion of cases for which the fitted probability is above 0.5 while the outcome is 0, or the fitted probability is below 0.5 while the outcome is 1, that is, when the prediction-guessing is wrong: \[er = \text{mean}\big((y=0 \,\&\, \mathrm{E}(y\mid X)>.5) \,\lor\, (y=1 \,\&\, \mathrm{E}(y\mid X)<.5)\big).\] It should be a value between 0 (perfect prediction-guessing) and 0.5 (random prediction-guessing).
Usually we expect the error rate to be lower than that of the null model, that is, the intercept-only model. The estimated probability for the null model is \(p=\sum_i y_i/n\), the proportion of 1's in the data. The error rate of the null model is \(\min(p, 1-p)\), and it corresponds to assigning 1 (or 0) to all the fitted values.
# Error rate
# Error rate of the Null Model
fit0 <- glm(switch ~ 1, family = binomial(link = "logit"), data = data)
coef(fit0)
## (Intercept)
## 0.3029584
invlogit(coef(fit0))
## (Intercept)
## 0.5751656
m0 <- mean(data$switch)
err.rate0 <- min(m0, 1 - m0)
# Alternatively
err.rate0 <- mean((rep(1, length(data$switch)) > 0.5 & data$switch ==0) |
((rep(1, length(data$switch)) < 0.5 & data$switch ==1 )))
So if our model improves on simple guessing, we expect its error rate to be lower than the error rate of the null model. Our last fitted model, with an error rate of \(0.38\), correctly predicts \(62\%\) of the switching choices, only a \(4\%\) improvement over simply guessing that all the households will switch.
# Error rate of Model 7
err.rate7 <- mean((pred7 > 0.5 & data$switch ==0) |
((pred7 < 0.5 & data$switch ==1 )))
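The computation above can be wrapped in a small reusable helper (the name error.rate is ours, not from the lab):

```r
# Hypothetical helper: misclassification rate of fitted probabilities p
# against binary outcomes y, with a configurable threshold
error.rate <- function(p, y, threshold = 0.5) {
  mean((p > threshold & y == 0) | (p < threshold & y == 1))
}

# Toy example: the first two cases are classified correctly, the last two not
error.rate(c(0.9, 0.2, 0.7, 0.4), c(1, 0, 0, 1))  # 0.5
```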
The error rate can be thought of as an analogue of the \(R^2\) for the linear model: it is a useful measure of model fit, but it cannot substitute a deeper evaluation of the model based on the significance of the coefficients and on diagnostic plots.
Extra: justify the 'high' error rate obtained with model fit.8.
Extra: compute the error rate for your model and compare it with the previous one.
Let us simulate some data to fit a logistic regression with two predictors: the first from a standard normal distribution, the second from a Bernoulli distribution with parameter 0.5.
# Separation in logistic regression
library("arm") #To use the invlogit, display functions of arm package
set.seed(2)
# Simulate some data
n <- 100
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.5)
# Fix the true parameter values
b0 <- 1
b1 <- 1.5
b2 <- 2
y <- rbinom(n, 1, arm::invlogit(b0 + b1*x1 + b2*x2))
# Fit the Logistic regression model
M1 <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
display(M1)
## glm(formula = y ~ x1 + x2, family = binomial(link = "logit"))
## coef.est coef.se
## (Intercept) 0.61 0.43
## x1 2.05 0.52
## x2 3.96 0.96
## ---
## n = 100, k = 3
## residual deviance = 52.1, null deviance = 105.4 (difference = 53.3)
Now we plot the fitted logistic curve when the dichotomous predictor is equal to 0 and 1.
# Plot the fitted logistic curve when the binary predictor is equal to 0 and 1
# x2 == 0
curve(arm::invlogit(coef(M1)[1] + coef(M1)[2]*x + coef(M1)[3]*0),
from=-5,to=5, ylab="invlogit", xlab="x1")
# x2 == 1
curve(arm::invlogit(coef(M1)[1] + coef(M1)[2]*x + coef(M1)[3]*1),
from=-5,to=5, add=TRUE, lty = 2)
To create quasi-complete separation, we assign the value 1 to all the \(y\)'s for which \(x_2 = 1\).
# Create quasi separation
# Assign the value 1 to all ys when x2 = 1
y <- ifelse(x2 == 1, 1, y)
# Fit the model
M2 <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
display(M2)
## glm(formula = y ~ x1 + x2, family = binomial(link = "logit"))
## coef.est coef.se
## (Intercept) 0.60 0.43
## x1 2.00 0.58
## x2 21.65 2078.43
## ---
## n = 100, k = 3
## residual deviance = 36.1, null deviance = 97.2 (difference = 61.2)
# Plot the fitted logistic curve when the binary predictor is equal to 0 and 1
# x2 == 0
curve(arm::invlogit(coef(M2)[1] + coef(M2)[2]*x + coef(M2)[3]*0),
from=-15,to=5, ylab="invlogit", xlab="x1")
# x2 == 1
curve(arm::invlogit(coef(M2)[1] + coef(M2)[2]*x + coef(M2)[3]*1),
from=-15,to=5, add=TRUE, lty = 2)
Here we create complete separation by setting \(y=1\) whenever \(x_1>0\). Now the value of \(y\) can be exactly predicted from \(x_1\), and the best-fit logistic regression curve is \(y = \text{logit}^{-1}(\infty \cdot x_1)\), which has an infinite slope at \(x_1=0\).
# Create complete separation
# Assign y = 1 when x1 > 0
y <- ifelse(x1>0, 1, 0)
# Fit the model with only x1 as predictor
M3 <- glm(y ~ x1, family = binomial(link = "logit"))
display(M3)
## glm(formula = y ~ x1, family = binomial(link = "logit"))
## coef.est coef.se
## (Intercept) 14.55 3127.56
## x1 590.03 79399.98
## ---
## n = 100, k = 2
## residual deviance = 0.0, null deviance = 138.0 (difference = 138.0)
# Plot the fitted logistic curve
curve(arm::invlogit(coef(M3)[1] + coef(M3)[2]*x),
from=-5,to=5)
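A possible remedy, suggested in Gelman and Hill's book and implemented in the same arm package already loaded above, is to put a weakly informative prior on the coefficients, which keeps the estimates finite under separation. A sketch:

```r
library(arm)  # provides bayesglm and display

set.seed(2)
x1 <- rnorm(100)
y <- ifelse(x1 > 0, 1, 0)  # complete separation, as above

# bayesglm uses a weakly informative Cauchy prior on the coefficients,
# so the estimates stay finite even under separation
M3b <- bayesglm(y ~ x1, family = binomial(link = "logit"))
display(M3b)
```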