An example of an observational study

As an example of an observational study, consider the UCBAdmissions dataset, which is one of the standard R datasets, and refers to the outcome of applications to 6 departments at University of California at Berkeley, by gender. The raw original data (4526 observations on 3 variables) have been already cross-tabulated in a 3-dimensional array : for the six largest departments results of admissions are cross-tabulated by sex.

UCBAdmissions

## , , Dept = A
## 
##           Gender
## Admit      Male Female
##   Admitted  512     89
##   Rejected  313     19
## 
## , , Dept = B
## 
##           Gender
## Admit      Male Female
##   Admitted  353     17
##   Rejected  207      8
## 
## , , Dept = C
## 
##           Gender
## Admit      Male Female
##   Admitted  120    202
##   Rejected  205    391
## 
## , , Dept = D
## 
##           Gender
## Admit      Male Female
##   Admitted  138    131
##   Rejected  279    244
## 
## , , Dept = E
## 
##           Gender
## Admit      Male Female
##   Admitted   53     94
##   Rejected  138    299
## 
## , , Dept = F
## 
##           Gender
## Admit      Male Female
##   Admitted   22     24
##   Rejected  351    317

For convenience in plotting we abbreviate the outcome text:

dimnames(UCBAdmissions)[[1]] <- c("Adm", "Rej")

The best way to get an overview of multi-way structures is by using the ftable function:

ftable(UCBAdmissions)

##              Dept   A   B   C   D   E   F
## Admit Gender                             
## Adm   Male        512 353 120 138  53  22
##       Female       89  17 202 131  94  24
## Rej   Male        313 207 205 279 138 351
##       Female       19   8 391 244 299 317

Thus, in Department A, 512 males were admitted while 313 were rejected, and so on.

The question of interest is whether there is any bias against admitting female applicants.

The following coerces the contingency table to a data frame:

ucb <- as.data.frame(UCBAdmissions)
dim(ucb)

## [1] 24  4

head(ucb)

##   Admit Gender Dept Freq
## 1   Adm   Male    A  512
## 2   Rej   Male    A  313
## 3   Adm Female    A   89
## 4   Rej Female    A   19
## 5   Adm   Male    B  353
## 6   Rej   Male    B  207

The relationship between the contingency table and the data frame is that each entry in the contingency table becomes a record in the data frame, with a variable Freq holding the entry and the classification as variables.

The function effx calculates the effects of an exposure on a response, possibly stratified by a stratifying variable, and/or controlled for one or more confounding variables.The function is a wrapper for glm.

Effects are calculated as differences in means for a numerical response, odds ratios/relative risks for a binary response, and rate ratios/rate differences for a failure or count response. The k-1 effects for a categorical exposure with k levels are relative to a baseline which, by default, is the first level. The effect of a quantitative numerical exposure is calculated per unit of exposure.The exposure variable can be numeric or a factor, but if it is an ordered factor the order will be ignored.

library(Epi)
effx(response=Admit=="Rej", type="binary", exposure=Gender, weight=Freq, data=ucb)

## --------------------------------------------------------------------------- 
## response      :  Admit == "Rej" 
## type          :  binary 
## exposure      :  Gender 
## 
## Gender is a factor with levels: Male / Female 
## baseline is  Male 
## effects are measured as odds ratios 
## --------------------------------------------------------------------------- 
## 
## effect of Gender on Admit == "Rej" 
## number of observations  24 
## 
## Effect   2.5%  97.5% 
##   1.84   1.62   2.09 
## 
## Test for no effects of exposure on 1 df: p-value= <2e-16

Thus, at a first glance, it seems that women are 80% more likely to be rejected, but if we take Department (Dept) into account by adding control=Dept, it does not seem to be so:

effx(response=Admit=="Rej", type="binary", exposure=Gender, control=Dept,weight=Freq, data=ucb)

## --------------------------------------------------------------------------- 
## response      :  Admit == "Rej" 
## type          :  binary 
## exposure      :  Gender 
## control vars  :  Dept 
## 
## Gender is a factor with levels: Male / Female 
## baseline is  Male 
## effects are measured as odds ratios 
## --------------------------------------------------------------------------- 
## 
## effect of Gender on Admit == "Rej" 
## controlled for Dept 
## 
## number of observations  24 
## 
## Effect   2.5%  97.5% 
##  0.905  0.772  1.060 
## 
## Test for no effects of exposure on 1 df: p-value= 0.216

In fact now it looks like women are 10% less likely to be rejected - though not significantly so (see the confidence interval).

What effx does here is to fit a statistical model (logistic regression) for the admission odds that allows each department its own admission odds, but assuming that the admission odds ratio between women and men within each department is the same (i.e. no interaction between Gender and Department). And, it is this common odds ratio that is reported. What we see here is a classical example of confounding : department is associated with both sex and admission probability, so ignoring department in the analysis will give a distorted picture of the effect of sex per se on admission rates.

We can recover the original data format as follows:

ucb_disagg = ucb[rep(1:nrow(ucb), ucb$Freq), 
                             -grep("Freq", names(ucb))]
ucb_disagg$outcome = ifelse(ucb_disagg$Admit=="Rej", 1,0)
dim(ucb_disagg)

## [1] 4526    4

head(ucb_disagg)

##     Admit Gender Dept outcome
## 1     Adm   Male    A       0
## 1.1   Adm   Male    A       0
## 1.2   Adm   Male    A       0
## 1.3   Adm   Male    A       0
## 1.4   Adm   Male    A       0
## 1.5   Adm   Male    A       0

With this fomat, we can easily apply the glm function:

glm.one <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg)
summary(glm.one)

## 
## Call:
## glm(formula = outcome ~ Gender, family = binomial(link = "logit"), 
##     data = ucb_disagg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5442  -1.2722   0.8506   1.0855   1.0855  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.22013    0.03879   5.675 1.38e-08 ***
## GenderFemale  0.61035    0.06389   9.553  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6044.3  on 4525  degrees of freedom
## Residual deviance: 5950.9  on 4524  degrees of freedom
## AIC: 5954.9
## 
## Number of Fisher Scoring iterations: 4

As you can see the exponential of the Gender coefficient corresponds to the odds ratio previously estimated. Now, if we add in the model also the Department:

glm.two <- glm(outcome ~ Gender+Dept, binomial(link = "logit"), data=ucb_disagg)
summary(glm.two)

## 
## Call:
## glm(formula = outcome ~ Gender + Dept, family = binomial(link = "logit"), 
##     data = ucb_disagg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3613  -0.9588   0.3741   0.9306   1.4773  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.58205    0.06899  -8.436   <2e-16 ***
## GenderFemale -0.09987    0.08085  -1.235    0.217    
## DeptB         0.04340    0.10984   0.395    0.693    
## DeptC         1.26260    0.10663  11.841   <2e-16 ***
## DeptD         1.29461    0.10582  12.234   <2e-16 ***
## DeptE         1.73931    0.12611  13.792   <2e-16 ***
## DeptF         3.30648    0.16998  19.452   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6044.3  on 4525  degrees of freedom
## Residual deviance: 5187.5  on 4519  degrees of freedom
## AIC: 5201.5
## 
## Number of Fisher Scoring iterations: 5

Again, the exponential of the Gender coefficient is 0.90, as before. Note that being Department a factor variable with 6 levels, one is fixed as the reference value and 5 coefficients are estimated for the other levels vs the reference.

To recover confidence intervals around the odds ratio we use the standard error of the coefficient’s estimate:

lower.b = round(exp(glm.two$coefficients[2]-1.96*0.08085),3)
upper.b = round(exp(glm.two$coefficients[2]+1.96*0.08085),3)
c(round(exp(glm.two$coefficients[2]),3), lower.b, upper.b)

## GenderFemale GenderFemale GenderFemale 
##        0.905        0.772        1.060

If we further look into rejection fractions for women versus men in different departments, replacing control=Dept with strata=Dept, it seems as if the only Department where there is a substantial sex difference is in A, where women are less likely to be rejected:

effx(response=Admit=="Rej", type="binary", exposure=Gender, strata=Dept,weight=Freq, data=ucb)

## --------------------------------------------------------------------------- 
## response      :  Admit == "Rej" 
## type          :  binary 
## exposure      :  Gender 
## stratified by :  Dept 
## 
## Gender is a factor with levels: Male / Female 
## baseline is  Male 
## Dept is a factor with levels: A/B/C/D/E/F 
## effects are measured as odds ratios 
## --------------------------------------------------------------------------- 
## 
## effect of Gender on Admit == "Rej" 
## stratified by Dept 
## 
## number of observations  24 
## 
##                               Effect  2.5% 97.5%
## strata A level Female vs Male  0.349 0.209 0.584
## strata B level Female vs Male  0.803 0.340 1.890
## strata C level Female vs Male  1.130 0.855 1.500
## strata D level Female vs Male  0.921 0.686 1.240
## strata E level Female vs Male  1.220 0.825 1.810
## strata F level Female vs Male  0.828 0.455 1.510
## 
## Test for effect modification on 5 df: p-value= 0.00114

This correspond using glm to :

glm.s1 <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg, subset=Dept=="A")
glm.s2 <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg, subset=Dept=="B")
glm.s3 <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg, subset=Dept=="C")
glm.s4 <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg, subset=Dept=="D")
glm.s5 <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg, subset=Dept=="E")
glm.s6 <- glm(outcome ~ Gender, binomial(link = "logit"), data=ucb_disagg, subset=Dept=="F")
round(exp(glm.s1$coefficients[2]),3)

## GenderFemale 
##        0.349

round(exp(glm.s2$coefficients[2]),3)

## GenderFemale 
##        0.803

round(exp(glm.s3$coefficients[2]),3)

## GenderFemale 
##        1.133

round(exp(glm.s4$coefficients[2]),3)

## GenderFemale 
##        0.921

round(exp(glm.s5$coefficients[2]),3)

## GenderFemale 
##        1.222

round(exp(glm.s6$coefficients[2]),3)

## GenderFemale 
##        0.828

The final analysis shows that there are differences between Departments in the women/men odds ratio of being rejected, although the only one where there is a significantly different rejection rate is in A. From the stratified analysis we see that the departments where women are more likely to be rejected are the departments with the highest proportion of women among applicants.

We can visually appreciate the distribution of rejection rates across Departments by gender using the mosaicplot:

mosaicplot(UCBAdmissions, sort=3:1, col=TRUE, main="", ylab="", xlab="", off=c(0,1,10))

The function basically draws rectangles with an area proportional to each of the entries in the UCBAdmissions.

Finally, we can make an ad hoc analysis of admissions, excluding Department A to get an indication of whether the other Departments differ with respect to admission odds ratio between men and women:

effx(response=Admit=="Rej", type="binary", exposure=Gender, strata=Dept,weight=Freq, 
     data=transform(subset(ucb, Dept!="A"), Dept=factor(Dept)))

## --------------------------------------------------------------------------- 
## response      :  Admit == "Rej" 
## type          :  binary 
## exposure      :  Gender 
## stratified by :  Dept 
## 
## Gender is a factor with levels: Male / Female 
## baseline is  Male 
## Dept is a factor with levels: B/C/D/E/F 
## effects are measured as odds ratios 
## --------------------------------------------------------------------------- 
## 
## effect of Gender on Admit == "Rej" 
## stratified by Dept 
## 
## number of observations  20 
## 
##                               Effect  2.5% 97.5%
## strata B level Female vs Male  0.803 0.340  1.89
## strata C level Female vs Male  1.130 0.855  1.50
## strata D level Female vs Male  0.921 0.686  1.24
## strata E level Female vs Male  1.220 0.825  1.81
## strata F level Female vs Male  0.828 0.455  1.51
## 
## Test for effect modification on 4 df: p-value= 0.635

Note that it is not enough to just restrict the departments not being A, because the variable Dept will still be a factor with A as one of the levels. Therefore we must redefine Dept to a factor with only the actually occurring levels in the subset. We see that there is no indication of differential rejection rates between departments B-F. However, this p-value is biased since it is based on data where we deliberately excluded a department on the grounds that it was an outlier in a specific direction.

To conclude, when we estimate this kind of association measures from observational data, a lot of attention is devoted to try to disentangle possible confounders or correctly taking into account effect modifiers.

When a confounder is in fact also an effect modifier, the strata analysis is the most correct approach, even if we should take into account that dividing the study population in strata we lose sample size (and therefore statistical power).

In the regression model approach there is the possibility to explicitly add an interaction term in the equation:

glm.int <- glm(outcome ~ Gender+Dept+Gender*Dept, binomial(link = "logit"), data=ucb_disagg)
summary(glm.int)

## 
## Call:
## glm(formula = outcome ~ Gender + Dept + Gender * Dept, family = binomial(link = "logit"), 
##     data = ucb_disagg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3793  -0.9768   0.3821   0.9127   1.8642  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.49212    0.07175  -6.859 6.94e-12 ***
## GenderFemale       -1.05208    0.26271  -4.005 6.21e-05 ***
## DeptB              -0.04163    0.11319  -0.368  0.71304    
## DeptC               1.02764    0.13550   7.584 3.34e-14 ***
## DeptD               1.19608    0.12641   9.462  < 2e-16 ***
## DeptE               1.44908    0.17681   8.196 2.49e-16 ***
## DeptF               3.26187    0.23119  14.109  < 2e-16 ***
## GenderFemale:DeptB  0.83205    0.51039   1.630  0.10306    
## GenderFemale:DeptC  1.17700    0.29956   3.929 8.53e-05 ***
## GenderFemale:DeptD  0.97009    0.30262   3.206  0.00135 ** 
## GenderFemale:DeptE  1.25226    0.33032   3.791  0.00015 ***
## GenderFemale:DeptF  0.86318    0.40266   2.144  0.03206 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6044.3  on 4525  degrees of freedom
## Residual deviance: 5167.3  on 4514  degrees of freedom
## AIC: 5191.3
## 
## Number of Fisher Scoring iterations: 5

In this case (using all data), we see that a significant interaction is present between Gender and Department: the reference is Dept A; it seems that in Departments C, D, E and F women are more likely to be rejected than men but remind that this is calculated with respect to what happens in Dept A. Instead in Dept B this there is no significant gender effect with respect to Dept A. The marginal negative effect for female that we observe in the model (exp(-1.05)) seems therefore only due to the “protective” effect in Dept A.

If we exclude Dept A from the dataset (as before we also did):

glm.intNoA <- glm(outcome ~ Gender+Dept+Gender*Dept, binomial(link = "logit"), data=ucb_disagg, subset=Dept!="A")
summary(glm.intNoA)

## 
## Call:
## glm(formula = outcome ~ Gender + Dept + Gender * Dept, family = binomial(link = "logit"), 
##     data = ucb_disagg, subset = Dept != "A")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3793  -0.9607   0.3821   0.9127   1.5096  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.53375    0.08754  -6.097 1.08e-09 ***
## GenderFemale       -0.22002    0.43759  -0.503    0.615    
## DeptC               1.06927    0.14448   7.401 1.35e-13 ***
## DeptD               1.23771    0.13599   9.101  < 2e-16 ***
## DeptE               1.49071    0.18379   8.111 5.02e-16 ***
## DeptF               3.30349    0.23657  13.964  < 2e-16 ***
## GenderFemale:DeptC  0.34494    0.46066   0.749    0.454    
## GenderFemale:DeptD  0.13804    0.46266   0.298    0.765    
## GenderFemale:DeptE  0.42021    0.48123   0.873    0.383    
## GenderFemale:DeptF  0.03113    0.53349   0.058    0.953    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4511.1  on 3592  degrees of freedom
## Residual deviance: 3971.6  on 3583  degrees of freedom
## AIC: 3991.6
## 
## Number of Fisher Scoring iterations: 5

Now as expected we observe no significant marginal effect of gender at all. The only significant effect in the rejection rates are for the contrasts between Dept C, D, E, and F: they have all a higher risk of rejecting admissions with respect to Dept B.

Observational data example

An example of an observational study