library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
##     filter
## The following objects are masked from 'package:base':
##
##     cbind, rbind
library(ggplot2)
# Using the built-in 'nhanes' dataset (Age, BMI, Hyp, Chol)
data <- nhanes
The nhanes dataset comes bundled with the mice package. It is a small extract from the National Health and Nutrition Examination Survey, containing 25 observations on 4 variables:

* age: Age group (1 = 20-39, 2 = 40-59, 3 = 60+)
* bmi: Body mass index (kg/m²)
* hyp: Hypertension status (1 = no, 2 = yes)
* chl: Total serum cholesterol (mg/dL)
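A quick look at the raw data (base R only) shows the NA entries we will be imputing:
# First rows of the dataset: NA marks a missing value
head(data)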
Before imputing, we need to see the pattern of what’s missing: the “geometry” of the missingness!
md.pattern(data)
## age hyp bmi chl
## 13 1 1 1 1 0
## 3 1 1 1 0 1
## 1 1 1 0 1 1
## 1 1 0 0 1 2
## 7 1 0 0 0 3
## 0 8 9 10 27
What the output shows: each row of the table is a missingness pattern. The left column gives the number of observations with that pattern (e.g., 13 rows are complete), the rightmost column gives how many variables are missing in that pattern, and the bottom row counts the missing values per variable (e.g., chl is missing 10 times).
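The bottom row of md.pattern() can be cross-checked directly with base R:
# Number of missing values per variable (compare with the bottom row above)
colSums(is.na(data))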
We will create 5 imputed datasets using Predictive Mean Matching (pmm). In each version the imputed values will be slightly different, because each missing entry is filled in stochastically: pmm borrows an observed value from a donor whose predicted value is close to the one predicted for the missing entry.
# m=5 imputations, maxit=5 iterations
imputed_data <- mice(data, m = 5, method = 'pmm', seed = 123, printFlag = FALSE)
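Before fitting anything, it can be reassuring to look at what was actually imputed. The mids object returned by mice() stores the imputed values in its imp slot, and complete() rebuilds any of the filled-in datasets:
# The 5 candidate values imputed for each missing bmi entry (one column per imputation)
imputed_data$imp$bmi
# The first completed dataset: observed values plus imputation number 1
head(complete(imputed_data, 1))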
Let’s say we want to know if BMI predicts cholesterol levels. We do not merge the imputed data into one dataset; instead, we run the regression model on each of the 5 datasets separately.
# Fit linear model to all 5 datasets
fit <- with(imputed_data, lm(chl ~ bmi + age))
# Print the estimates for each of the 5 runs
fit
## call :
## with.mids(data = imputed_data, expr = lm(chl ~ bmi + age))
##
## call1 :
## mice(data = data, m = 5, method = "pmm", printFlag = FALSE, seed = 123)
##
## nmis :
## [1] 0 9 8 10
##
## analyses :
## [[1]]
##
## Call:
## lm(formula = chl ~ bmi + age)
##
## Coefficients:
## (Intercept) bmi age
## 56.374 3.158 32.093
##
##
## [[2]]
##
## Call:
## lm(formula = chl ~ bmi + age)
##
## Coefficients:
## (Intercept) bmi age
## 41.355 4.295 20.274
##
##
## [[3]]
##
## Call:
## lm(formula = chl ~ bmi + age)
##
## Coefficients:
## (Intercept) bmi age
## -37.092 6.228 34.380
##
##
## [[4]]
##
## Call:
## lm(formula = chl ~ bmi + age)
##
## Coefficients:
## (Intercept) bmi age
## -29.343 6.423 24.817
##
##
## [[5]]
##
## Call:
## lm(formula = chl ~ bmi + age)
##
## Coefficients:
## (Intercept) bmi age
## 25.857 4.795 25.962
You will see 5 different sets of coefficients. Notice they are similar but not identical—this variation represents the uncertainty caused by the missing data.
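To put the 5 sets of coefficients side by side, you can pull them out of the individual fits (fit$analyses is the list of the 5 fitted models):
# One column of coefficients per imputed dataset
sapply(fit$analyses, coef)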
Finally, we pool the results: pool() applies Rubin’s Rules to combine the 5 sets of estimates into a single output.
final_results <- pool(fit)
summary(final_results)
## term estimate std.error statistic df p.value
## 1 (Intercept) 11.430196 72.560560 0.1575263 8.020422 0.87872369
## 2 bmi 4.979979 2.322195 2.1445134 7.835031 0.06503537
## 3 age 27.505289 10.784999 2.5503285 9.748109 0.02938517
What the output shows:

* estimate: the average coefficient across all imputations.
* std.error: this is now “corrected”; it combines the variance within the datasets and the variance between the datasets.
If you look at the final output, you’ll notice the std.error is likely larger than it would be if you just did a single imputation. This is a good thing! It means you aren’t overconfident in your results.
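You can also inspect the two variance components directly. In recent versions of mice, the pooled object stores the within-imputation variance (ubar), the between-imputation variance (b) and the total variance (t); the exact column names may vary with the package version:
# Within (ubar), between (b) and total (t) variance for each coefficient
final_results$pooled[, c("term", "estimate", "ubar", "b", "t")]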
In estimation, we focus on one pooled \(\beta\) coefficient. In prediction, we want one pooled probability \(\hat{p}\).

The Rule of Thumb: always pool the results, never the data.

* Do not average the imputed datasets into one single dataset and then predict. This ignores the uncertainty.

The Workflow:

1. Generate \(m\) imputations.
2. Fit your model on each.
3. Generate a predicted probability for the new patient in each of the \(m\) models.
4. Average those probabilities to get the final prediction.
Why not just use one “Average” Dataset? If you average the variables (e.g., \(BMI_{avg} = 26\)) before predicting, you create a “synthetic” patient that doesn’t exist. By predicting across \(m\) datasets, you account for the fact that a patient’s BMI might have been 24 in one scenario and 28 in another. This provides a Mean Predicted Probability and allows you to calculate a Prediction Interval, giving you a sense of how “sure” the model is about that specific patient.
Let’s adapt the previous code to predict the probability of hypertension (hyp).
# Convert 'hyp' to 0 and 1
# 1 (No) becomes 0
# 2 (Yes) becomes 1
data$hyp <- ifelse(data$hyp == 2, 1, 0)
# Now run the imputation on the 0/1 data
imputed_data <- mice(data, m = 5, method = 'pmm', seed = 123, printFlag = FALSE)
# Now fit the logistic regression model on the imputed datasets
fit_logistic <- with(imputed_data, glm(hyp ~ bmi + age + chl, family = binomial))
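As an aside: if the goal were estimation rather than prediction, fit_logistic could be pooled exactly as we did for the linear model:
# Pooled log-odds coefficients for the logistic model (estimation, not prediction)
summary(pool(fit_logistic))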
Suppose now that we have a new patient and we want to predict his/her hypertension status. We calculate the prediction for this patient across all 5 imputed models.
# Extract predictions for each imputed dataset
pred_list <- lapply(1:5, function(i) {
# Get the i-th completed dataset
dat_i <- complete(imputed_data, i)
# Refit the model on this dataset
mod_i <- glm(hyp ~ bmi + age + chl, data = dat_i, family = binomial)
# Predict probability for a hypothetical new patient
# (Age=3, BMI=28, Chl=220)
predict(mod_i, newdata = data.frame(age=3, bmi=28, chl=220), type="response")
})
# Convert list to vector
preds <- unlist(pred_list)
# Calculate Mean and Standard Deviation of the prediction
mean_pred <- mean(preds)
sd_pred <- sd(preds)
range_pred <- range(preds)
# Output the uncertainty
cat("Mean Predicted Probability:", round(mean_pred, 3), "\n")
## Mean Predicted Probability: 0.851
cat("Variation (SD) due to Imputation:", round(sd_pred, 3), "\n")
## Variation (SD) due to Imputation: 0.114
cat("Range of predictions (Min to Max):", round(range_pred[1], 3), "-", round(range_pred[2], 3))
## Range of predictions (Min to Max): 0.71 - 0.95
This approach is a simplified version of the original Rubin’s rule (see the Appendix).
If we simply averaged the results across the \(m\) imputed datasets and ignored how much they vary between imputations, we would underestimate the standard error.
We would be acting as if the imputed values were “true” facts, which they aren’t.
The Rubin Solution: The total variance must be split into two parts:
Within-Imputation Variance (\(W\)): The typical sampling error (uncertainty because we didn’t survey the whole population).
Between-Imputation Variance (\(B\)): The uncertainty created by the missing data (how much our results change depending on what values we “guessed”).
The Final Result: Total Variance (\(T\)) = Within + Between + a small adjustment factor.
To calculate the final estimate and its variance, we use these three steps:
Pooled Estimate (\(\bar{Q}\)):
This is simply the arithmetic mean of the estimates (e.g., the regression coefficients \(\hat{\beta}\)) from each imputed dataset: \[\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} \hat{Q}_i\]
Where \(m\) is the number of imputations and \(\hat{Q}_i\) is the estimate from dataset \(i\).
Within-Imputation Variance (\(W\)):
The average of the squared standard errors from each model.
\[W = \frac{1}{m} \sum_{i=1}^{m} SE_i^2\]
Between-Imputation Variance (\(B\)):
How much the individual estimates \(\hat{Q}_i\) vary around the pooled average \(\bar{Q}\).
\[B = \frac{1}{m-1} \sum_{i=1}^{m} (\hat{Q}_i - \bar{Q})^2\]
Total Variance (\(T\)):
\[T = W + B + \frac{B}{m}\]
Note: The \(\frac{B}{m}\) term is a “penalty” for using a finite number of imputations. As \(m\) (the number of datasets) increases toward infinity, this penalty disappears.
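As a sanity check, these three formulas can be applied by hand to the bmi coefficient of the linear model fitted earlier; the result should match the estimate and std.error reported by summary(pool(fit)). This is just an illustrative recomputation, not a replacement for pool().
# Estimates and squared standard errors of the bmi coefficient in each of the m fits
Q_bmi  <- sapply(fit$analyses, function(mod) coef(mod)["bmi"])
SE_bmi <- sapply(fit$analyses, function(mod) summary(mod)$coefficients["bmi", "Std. Error"])
m_imp  <- length(Q_bmi)
Q_bar_bmi <- mean(Q_bmi)                      # pooled estimate
W_bmi     <- mean(SE_bmi^2)                   # within-imputation variance
B_bmi     <- var(Q_bmi)                       # between-imputation variance
T_bmi     <- W_bmi + B_bmi + B_bmi / m_imp    # total variance
c(estimate = Q_bar_bmi, std.error = sqrt(T_bmi))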
Why does this matter?
In epidemiology and clinical research, we often deal with selection bias. If you ignore the “Between-Imputation Variance” (\(B\)), your Confidence Intervals will be too narrow (too “precise”), which can lead to Type I errors—claiming a drug or exposure is significant when you actually just don’t have enough data to be sure. Rubin’s Rules ensure your p-values are “honest” by inflating the standard error to reflect the missingness.
Note that Rubin’s Rules were designed for quantities that are (approximately) normally distributed.
To apply the previous approach we need an analytical formula for the standard error, and the quantity of interest should be approximately Gaussian on the scale being pooled (this is why, below, we pool on the log-odds scale rather than the probability scale).
If you are using an algorithm like Random Forest, XGBoost, or a Neural Network, you usually don’t have a simple formula for the standard error of a single prediction. In that case, you rely on the empirical distribution of your predictions across the \(m\) datasets like we did before.
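For example, a minimal sketch of that empirical approach with a random forest might look like this (it assumes the randomForest package is installed; the model and the new-patient values are purely illustrative):
library(randomForest)
# Hypothetical new patient (same profile used below)
new_patient <- data.frame(age = 3, bmi = 28, chl = 220)
# Probability of hypertension predicted by a forest fitted on each completed dataset
rf_preds <- sapply(1:5, function(i) {
  dat_i <- complete(imputed_data, i)
  dat_i$hyp <- factor(dat_i$hyp)  # classification target (0/1)
  rf_i <- randomForest(hyp ~ bmi + age + chl, data = dat_i)
  predict(rf_i, newdata = new_patient, type = "prob")[, "1"]
})
# Empirical distribution of the prediction across the m imputations
c(mean = mean(rf_preds), sd = sd(rf_preds))
range(rf_preds)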
If instead you can calculate the standard errors analytically, then you can apply Rubin’s Rules directly, for example:
# New patient data
new_pt <- data.frame(age=3, bmi=28, chl=220)
# 1. Get predictions on the 'link' scale (log-odds) with standard errors
predictions_logit <- lapply(1:5, function(i) {
dat_i <- complete(imputed_data, i)
mod_i <- glm(hyp ~ bmi + age + chl, data = dat_i, family = binomial)
# Predict returns log-odds and SE when se.fit = TRUE
predict(mod_i, newdata = new_pt, type = "link", se.fit = TRUE)
})
# 2. Extract estimates (Q) and variances (W)
Q_i <- sapply(predictions_logit, function(x) x$fit)
W_i <- sapply(predictions_logit, function(x) x$se.fit^2)
# 3. Apply Rubin's Rules manually for the log-odds
Q_bar <- mean(Q_i)
W_bar <- mean(W_i)
B <- var(Q_i)
T_var <- W_bar + B + (B/5) # Total Variance (B/m penalty with m = 5 imputations)
# 4. Calculate 95% CI on logit scale, then back-transform to Probability
lower_logit <- Q_bar - 1.96 * sqrt(T_var)
upper_logit <- Q_bar + 1.96 * sqrt(T_var)
# Inverse Logit function
inv_logit <- function(x) 1 / (1 + exp(-x))
cat("Pooled Predicted Risk:", round(inv_logit(Q_bar), 3), "\n")
## Pooled Predicted Risk: 0.88
cat("95% Prediction CI:", round(inv_logit(lower_logit), 3), "to", round(inv_logit(upper_logit), 3))
## 95% Prediction CI: 0.152 to 0.997