As an example of an observational study, consider the UCBAdmissions dataset, which is one of the standard R datasets, and refers to the outcome of applications to 6 departments at University of California at Berkeley, by gender. The raw original data (4526 observations on 3 variables) have been already cross-tabulated in a 3-dimensional array : for the six largest departments results of admissions are cross-tabulated by sex.
UCBAdmissions
## , , Dept = A
##
## Gender
## Admit Male Female
## Admitted 512 89
## Rejected 313 19
##
## , , Dept = B
##
## Gender
## Admit Male Female
## Admitted 353 17
## Rejected 207 8
##
## , , Dept = C
##
## Gender
## Admit Male Female
## Admitted 120 202
## Rejected 205 391
##
## , , Dept = D
##
## Gender
## Admit Male Female
## Admitted 138 131
## Rejected 279 244
##
## , , Dept = E
##
## Gender
## Admit Male Female
## Admitted 53 94
## Rejected 138 299
##
## , , Dept = F
##
## Gender
## Admit Male Female
## Admitted 22 24
## Rejected 351 317
For convenience in plotting we abbreviate the outcome text:
dimnames(UCBAdmissions)[[1]] <- c("Adm", "Rej")
The best way to get an overview of multi-way structures is by using the ftable function:
ftable(UCBAdmissions)
## Dept A B C D E F
## Admit Gender
## Adm Male 512 353 120 138 53 22
## Female 89 17 202 131 94 24
## Rej Male 313 207 205 279 138 351
## Female 19 8 391 244 299 317
Thus, in Department A, 512 males were admitted while 313 were rejected, and so on.
The question of interest is whether there is any bias against admitting female applicants.
The following coerces the contingency table to a data frame:
ucb <- as.data.frame(UCBAdmissions)
dim(ucb)
## [1] 24 4
head(ucb)
## Admit Gender Dept Freq
## 1 Adm Male A 512
## 2 Rej Male A 313
## 3 Adm Female A 89
## 4 Rej Female A 19
## 5 Adm Male B 353
## 6 Rej Male B 207
The relationship between the contingency table and the data frame is that each entry in the contingency table becomes a record in the data frame, with a variable Freq holding the entry and the classification as variables.
The function effx calculates the effects of an exposure on a response, possibly stratified by a stratifying variable, and/or controlled for one or more confounding variables.The function is a wrapper for glm.
Effects are calculated as differences in means for a numerical response, odds ratios/relative risks for a binary response, and rate ratios/rate differences for a failure or count response. The k-1 effects for a categorical exposure with k levels are relative to a baseline which, by default, is the first level. The effect of a quantitative numerical exposure is calculated per unit of exposure.The exposure variable can be numeric or a factor, but if it is an ordered factor the order will be ignored.
library(Epi)
effx(response=Admit=="Rej", type="binary", exposure=Gender, weight=Freq, data=ucb)
## ---------------------------------------------------------------------------
## response : Admit == "Rej"
## type : binary
## exposure : Gender
##
## Gender is a factor with levels: Male / Female
## baseline is Male
## effects are measured as odds ratios
## ---------------------------------------------------------------------------
##
## effect of Gender on Admit == "Rej"
## number of observations 24
##
## Effect 2.5% 97.5%
## 1.84 1.62 2.09
##
## Test for no effects of exposure on 1 df: p-value= <2e-16
Thus, at a first glance, it seems that women are 80% more likely to be rejected, but if we take Department (Dept) into account by adding control=Dept, it does not seem to be so:
effx(response=Admit=="Rej", type="binary", exposure=Gender, control=Dept,weight=Freq, data=ucb)
## ---------------------------------------------------------------------------
## response : Admit == "Rej"
## type : binary
## exposure : Gender
## control vars : Dept
##
## Gender is a factor with levels: Male / Female
## baseline is Male
## effects are measured as odds ratios
## ---------------------------------------------------------------------------
##
## effect of Gender on Admit == "Rej"
## controlled for Dept
##
## number of observations 24
##
## Effect 2.5% 97.5%
## 0.905 0.772 1.060
##
## Test for no effects of exposure on 1 df: p-value= 0.216
In fact now it looks like women are 10% less likely to be rejected - though not significantly so (see the confidence interval).
What effx does here is to fit a statistical model (logistic regression) for the admission odds that allows each department its own admission odds, but assuming that the admission odds ratio between women and men within each department is the same (i.e. no interaction between Gender and Department). And, it is this common odds ratio that is reported. What we see here is a classical example of confounding : department is associated with both sex and admission probability, so ignoring department in the analysis will give a distorted picture of the effect of sex per se on admission rates.
We can recover the original data format as follows:
ucb_disagg = ucb[rep(1:nrow(ucb), ucb$Freq),
-grep("Freq", names(ucb))]
ucb_disagg$outcome = ifelse(ucb_disagg$Admit=="Rej", 1,0)
dim(ucb_disagg)
## [1] 4526 4
head(ucb_disagg)
## Admit Gender Dept outcome
## 1 Adm Male A 0
## 1.1 Adm Male A 0
## 1.2 Adm Male A 0
## 1.3 Adm Male A 0
## 1.4 Adm Male A 0
## 1.5 Adm Male A 0
With this fomat, we can easily apply the glm function:
glm.one <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg)
summary(glm.one)
##
## Call:
## glm(formula = outcome ~ Gender, family = binomial(link = "logit"),
## data = ucb_disagg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5442 -1.2722 0.8506 1.0855 1.0855
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22013 0.03879 5.675 1.38e-08 ***
## GenderFemale 0.61035 0.06389 9.553 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6044.3 on 4525 degrees of freedom
## Residual deviance: 5950.9 on 4524 degrees of freedom
## AIC: 5954.9
##
## Number of Fisher Scoring iterations: 4
As you can see the exponential of the Gender coefficient corresponds to the odds ratio previously estimated. Now, if we add in the model also the Department:
glm.two <- glm(outcome ~ Gender+Dept, binomial(link = "logit"), data=ucb_disagg)
summary(glm.two)
##
## Call:
## glm(formula = outcome ~ Gender + Dept, family = binomial(link = "logit"),
## data = ucb_disagg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3613 -0.9588 0.3741 0.9306 1.4773
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.58205 0.06899 -8.436 <2e-16 ***
## GenderFemale -0.09987 0.08085 -1.235 0.217
## DeptB 0.04340 0.10984 0.395 0.693
## DeptC 1.26260 0.10663 11.841 <2e-16 ***
## DeptD 1.29461 0.10582 12.234 <2e-16 ***
## DeptE 1.73931 0.12611 13.792 <2e-16 ***
## DeptF 3.30648 0.16998 19.452 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6044.3 on 4525 degrees of freedom
## Residual deviance: 5187.5 on 4519 degrees of freedom
## AIC: 5201.5
##
## Number of Fisher Scoring iterations: 5
Again, the exponential of the Gender coefficient is 0.90, as before. Note that being Department a factor variable with 6 levels, one is fixed as the reference value and 5 coefficients are estimated for the other levels vs the reference.
To recover confidence intervals around the odds ratio we use the standard error of the coefficient’s estimate:
lower.b = round(exp(glm.two$coefficients[2]-1.96*0.08085),3)
upper.b = round(exp(glm.two$coefficients[2]+1.96*0.08085),3)
c(round(exp(glm.two$coefficients[2]),3), lower.b, upper.b)
## GenderFemale GenderFemale GenderFemale
## 0.905 0.772 1.060
If we further look into rejection fractions for women versus men in different departments, replacing control=Dept with strata=Dept, it seems as if the only Department where there is a substantial sex difference is in A, where women are less likely to be rejected:
effx(response=Admit=="Rej", type="binary", exposure=Gender, strata=Dept,weight=Freq, data=ucb)
## ---------------------------------------------------------------------------
## response : Admit == "Rej"
## type : binary
## exposure : Gender
## stratified by : Dept
##
## Gender is a factor with levels: Male / Female
## baseline is Male
## Dept is a factor with levels: A/B/C/D/E/F
## effects are measured as odds ratios
## ---------------------------------------------------------------------------
##
## effect of Gender on Admit == "Rej"
## stratified by Dept
##
## number of observations 24
##
## Effect 2.5% 97.5%
## strata A level Female vs Male 0.349 0.209 0.584
## strata B level Female vs Male 0.803 0.340 1.890
## strata C level Female vs Male 1.130 0.855 1.500
## strata D level Female vs Male 0.921 0.686 1.240
## strata E level Female vs Male 1.220 0.825 1.810
## strata F level Female vs Male 0.828 0.455 1.510
##
## Test for effect modification on 5 df: p-value= 0.00114
This correspond using glm to :
glm.s1 <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg, subset=Dept=="A")
glm.s2 <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg, subset=Dept=="B")
glm.s3 <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg, subset=Dept=="C")
glm.s4 <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg, subset=Dept=="D")
glm.s5 <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg, subset=Dept=="E")
glm.s6 <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg, subset=Dept=="F")
round(exp(glm.s1$coefficients[2]),3)
## GenderFemale
## 0.349
round(exp(glm.s2$coefficients[2]),3)
## GenderFemale
## 0.803
round(exp(glm.s3$coefficients[2]),3)
## GenderFemale
## 1.133
round(exp(glm.s4$coefficients[2]),3)
## GenderFemale
## 0.921
round(exp(glm.s5$coefficients[2]),3)
## GenderFemale
## 1.222
round(exp(glm.s6$coefficients[2]),3)
## GenderFemale
## 0.828
The final analysis shows that there are differences between Departments in the women/men odds ratio of being rejected, although the only one where there is a significantly different rejection rate is in A. From the stratified analysis we see that the departments where women are more likely to be rejected are the departments with the highest proportion of women among applicants.
We can visually appreciate the distribution of rejection rates across Departments by gender using the mosaicplot:
mosaicplot(UCBAdmissions, sort=3:1, col=TRUE, main="", ylab="", xlab="", off=c(0,1,10))
The function basically draws rectangles with an area proportional to each of the entries in the UCBAdmissions.
Finally, we can make an ad hoc analysis of admissions, excluding Department A to get an indication of whether the other Departments differ with respect to admission odds ratio between men and women:
effx(response=Admit=="Rej", type="binary", exposure=Gender, strata=Dept,weight=Freq,
data=transform(subset(ucb, Dept!="A"), Dept=factor(Dept)))
## ---------------------------------------------------------------------------
## response : Admit == "Rej"
## type : binary
## exposure : Gender
## stratified by : Dept
##
## Gender is a factor with levels: Male / Female
## baseline is Male
## Dept is a factor with levels: B/C/D/E/F
## effects are measured as odds ratios
## ---------------------------------------------------------------------------
##
## effect of Gender on Admit == "Rej"
## stratified by Dept
##
## number of observations 20
##
## Effect 2.5% 97.5%
## strata B level Female vs Male 0.803 0.340 1.89
## strata C level Female vs Male 1.130 0.855 1.50
## strata D level Female vs Male 0.921 0.686 1.24
## strata E level Female vs Male 1.220 0.825 1.81
## strata F level Female vs Male 0.828 0.455 1.51
##
## Test for effect modification on 4 df: p-value= 0.635
Note that it is not enough to just restrict the departments not being A, because the variable Dept will still be a factor with A as one of the levels. Therefore we must redefine Dept to a factor with only the actually occurring levels in the subset. We see that there is no indication of differential rejection rates between departments B-F. However, this p-value is biased since it is based on data where we deliberately excluded a department on the grounds that it was an outlier in a specific direction.
To conclude, when we estimate this kind of association measures from observational data, a lot of attention is devoted to try to disentangle possible confounders or correctly taking into account effect modifiers.
When a confounder is in fact also an effect modifier, the strata analysis is the most correct approach, even if we should take into account that dividing the study population in strata we lose sample size (and therefore statistical power).
In the regression model approach there is the possibility to explicitly add an interaction term in the equation:
glm.int <- glm(outcome ~ Gender+Dept+Gender*Dept, binomial(link = "logit"), data=ucb_disagg)
summary(glm.int)
##
## Call:
## glm(formula = outcome ~ Gender + Dept + Gender * Dept, family = binomial(link = "logit"),
## data = ucb_disagg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3793 -0.9768 0.3821 0.9127 1.8642
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.49212 0.07175 -6.859 6.94e-12 ***
## GenderFemale -1.05208 0.26271 -4.005 6.21e-05 ***
## DeptB -0.04163 0.11319 -0.368 0.71304
## DeptC 1.02764 0.13550 7.584 3.34e-14 ***
## DeptD 1.19608 0.12641 9.462 < 2e-16 ***
## DeptE 1.44908 0.17681 8.196 2.49e-16 ***
## DeptF 3.26187 0.23119 14.109 < 2e-16 ***
## GenderFemale:DeptB 0.83205 0.51039 1.630 0.10306
## GenderFemale:DeptC 1.17700 0.29956 3.929 8.53e-05 ***
## GenderFemale:DeptD 0.97009 0.30262 3.206 0.00135 **
## GenderFemale:DeptE 1.25226 0.33032 3.791 0.00015 ***
## GenderFemale:DeptF 0.86318 0.40266 2.144 0.03206 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6044.3 on 4525 degrees of freedom
## Residual deviance: 5167.3 on 4514 degrees of freedom
## AIC: 5191.3
##
## Number of Fisher Scoring iterations: 5
In this case (using all data), we see that a significant interaction is present between Gender and Department: the reference is Dept A; it seems that in Departments C, D, E and F women are more likely to be rejected than men but remind that this is calculated with respect to what happens in Dept A. Instead in Dept B this there is no significant gender effect with respect to Dept A. The marginal negative effect for female that we observe in the model (exp(-1.05)) seems therefore only due to the “protective” effect in Dept A.
If we exclude Dept A from the dataset (as before we also did):
glm.intNoA <- glm(outcome ~ Gender+Dept+Gender*Dept, binomial(link = "logit"), data=ucb_disagg, subset=Dept!="A")
summary(glm.intNoA)
##
## Call:
## glm(formula = outcome ~ Gender + Dept + Gender * Dept, family = binomial(link = "logit"),
## data = ucb_disagg, subset = Dept != "A")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3793 -0.9607 0.3821 0.9127 1.5096
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.53375 0.08754 -6.097 1.08e-09 ***
## GenderFemale -0.22002 0.43759 -0.503 0.615
## DeptC 1.06927 0.14448 7.401 1.35e-13 ***
## DeptD 1.23771 0.13599 9.101 < 2e-16 ***
## DeptE 1.49071 0.18379 8.111 5.02e-16 ***
## DeptF 3.30349 0.23657 13.964 < 2e-16 ***
## GenderFemale:DeptC 0.34494 0.46066 0.749 0.454
## GenderFemale:DeptD 0.13804 0.46266 0.298 0.765
## GenderFemale:DeptE 0.42021 0.48123 0.873 0.383
## GenderFemale:DeptF 0.03113 0.53349 0.058 0.953
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4511.1 on 3592 degrees of freedom
## Residual deviance: 3971.6 on 3583 degrees of freedom
## AIC: 3991.6
##
## Number of Fisher Scoring iterations: 5
Now as expected we observe no significant marginal effect of gender at all. The only significant effect in the rejection rates are for the contrasts between Dept C, D, E, and F: they have all a higher risk of rejecting admissions with respect to Dept B.