A pharmaceutical team is developing a new antiviral drug, Zilavax, to reduce the recovery time for a specific viral infection. Standard Care (Control) has a mean recovery time of 10 days with a known standard deviation of 3 days. The investigators want to design an RCT (Randomized Clinical Trial) to determine whether Zilavax reduces the mean recovery time to 8.5 days or less. Calculate the required sample size per group and the total sample size for this trial based on the following parameters:
- Significance level (alpha): 0.05 (two-sided)
- Power (1 − beta): 80%
- Minimum detectable difference / effect size (delta): the difference between 10 days and 8.5 days
- Allocation ratio: 1:1
Suppose then that the infection is more heterogeneous than expected, and the standard deviation is actually 4 days instead of 3. Without repeating the full calculation, explain qualitatively what happens to the required sample size and why.
The goal is to determine the required sample size for a Randomized Clinical Trial (RCT) comparing the recovery time of the new antiviral drug Zilavax against Standard Care.
Based on the study design, the parameters (defined in the code below) are: \(\alpha = 0.05\) two-sided, power \(= 0.80\), \(\sigma = 3\) days, and \(\delta = 1.5\) days.
We use the standard power formula for the comparison of two independent means:
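\[n_{\text{per group}} = \frac{2\,\sigma^{2}\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}}{\delta^{2}}\]
This is the quantity computed in the R code below.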
# Define parameters
alpha <- 0.05
power <- 0.80
sigma <- 3
delta <- 1.5
# Critical values
z_alpha <- qnorm(1 - alpha/2)
z_beta <- qnorm(power)
# Sample size per group (n)
n_per_group <- (2 * sigma^2 * (z_alpha + z_beta)^2) / (delta^2)
n_per_group
## [1] 62.79104
# Total sample size
total_n <- ceiling(n_per_group) * 2
total_n
## [1] 126
If the population is more heterogeneous (\(\sigma = 4\) instead of \(3\)), the “noise” in the data increases. Because the required sample size \(n\) is proportional to the variance (\(\sigma^2\)), an increase in variability requires a larger sample size to maintain the same statistical power to detect the \(1.5\)-day difference.
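As a quick check, reusing the objects defined in the code above, the per-group sample size scales by \((4/3)^2 \approx 1.78\):

# Same formula with the larger standard deviation
sigma_new <- 4
n_new <- (2 * sigma_new^2 * (z_alpha + z_beta)^2) / (delta^2)
ceiling(n_new)   # roughly 112 per group instead of 63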
Explain the concept of ‘symmetry of roles’ between disease and exposure in the context of the odds ratio estimation.
The “symmetry of roles” is one of the most elegant and fundamental properties of the Odds Ratio (OR). It explains why we can use a case-control study (where we sample based on disease) to make inferences about the risk associated with an exposure. In most epidemiological measures, the roles of “Exposure” (\(E\)) and “Disease” (\(D\)) are rigid. For example, Relative Risk (RR) is strictly the ratio of probabilities of disease given exposure. However, the Odds Ratio is mathematically identical regardless of which variable you treat as the outcome. Consider a standard \(2 \times 2\) table:
|  | Disease (+) | Disease (-) |
|---|---|---|
| Exposure (+) | \(a\) | \(b\) |
| Exposure (-) | \(c\) | \(d\) |
The OR can be viewed in two ways: as the odds of being diseased among the exposed vs. the unexposed, \[\text{OR} = \frac{a/b}{c/d} = \frac{ad}{bc}\] or as the odds of being exposed among the diseased vs. the non-diseased, \[\text{OR} = \frac{a/c}{b/d} = \frac{ad}{bc}\]
Because both equations simplify to the cross-product ratio (\(ad/bc\)), the result is identical.
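A quick numerical check in R, using arbitrary hypothetical counts for the four cells:

# Hypothetical 2x2 cell counts (illustration only)
a <- 30; b <- 70; c <- 10; d <- 90
(a / b) / (c / d)   # disease odds ratio: exposed vs. unexposed
(a / c) / (b / d)   # exposure odds ratio: diseased vs. non-diseased
(a * d) / (b * c)   # cross-product ratio; all three give the same value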
This symmetry is the scientific justification for the case-control study design. In a cohort study, we follow the “direction of time”: we record exposure and then see who develops the disease. In a case-control study, we start from the disease (the “outcome”) and look backward at the exposure. Even though a case-control study technically measures the exposure odds ratio (how much more likely a case was to be exposed than a control), the symmetry of roles allows us to interpret that result as the disease odds ratio (how much the exposure increases the odds of disease).
The table below shows the results of a study of the diagnostic accuracy of a new rapid HIV test in 100,000 subjects, compared with the reference-standard ELISA test. The rows of the table represent the test result and the columns the true disease status (as confirmed by ELISA).
|  | HIV+ (Reference) | HIV- (Reference) | Total |
|---|---|---|---|
| Rapid Test + | 378 (TP) | 397 (FP) | 775 |
| Rapid Test - | 2 (FN) | 98,823 (TN) | 98,825 |
| Total | 380 | 99,220 | 100,000 |
What are the sensitivity and positive predictive value (PPV) of the new rapid HIV test? Based on these values, is it justifiable to replace the ELISA procedure for future testing?
Sensitivity measures the ability of the test to correctly identify those with the disease: how many of the sick patients tested positive? \[\text{Sensitivity} = \frac{TP}{TP + FN}\] \[\text{Sensitivity} = \frac{378}{378 + 2} = \frac{378}{380} \approx 0.9947 \text{ or } \mathbf{99.47\%}\]
Positive Predictive Value (PPV) measures the probability that a subject actually has the disease given that the test result is positive: how much does a positive result change the probability of disease?
\[\text{PPV} = \frac{TP}{TP + FP}\]
\[\text{PPV} = \frac{378}{378 + 397} = \frac{378}{775} \approx 0.4877 \text{ or } \mathbf{48.77\%}\]
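These figures can be checked directly in R from the table cells:

# Cell counts from the 2x2 table above
TP <- 378; FP <- 397; FN <- 2; TN <- 98823
sensitivity <- TP / (TP + FN)   # ~0.9947
ppv         <- TP / (TP + FP)   # ~0.4877
specificity <- TN / (TN + FP)   # ~0.9960, relevant to the discussion below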
Based on these values, it is not justifiable to replace the ELISA procedure with this rapid test as a standalone diagnostic tool. A PPV of roughly \(49\%\) means that if a person tests positive on the rapid test, there is less than a \(50\%\) chance they actually have HIV. More than half of the positive results are “False Alarms” (False Positives). In the context of a serious disease like HIV, a \(48.77\%\) PPV is low for a definitive diagnosis. It would cause significant unnecessary distress for the healthy individuals who tested positive. Even though the sensitivity is excellent (\(99.47\%\)), the PPV is dragged down because the prevalence of the disease in this population is very low. When a disease is rare, even a test with high specificity will produce a large number of false positives relative to true positives. The rapid test could act as a screening tool because of its high sensitivity (it misses very few cases), but it must be followed by a more specific confirmatory test (like the ELISA or a Western Blot) to rule out the high number of false positives.
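To make the dependence on prevalence explicit, a small sketch (using the sensitivity and specificity computed above, and a few hypothetical prevalence values) applies Bayes’ theorem:

# PPV as a function of prevalence, with sensitivity and specificity held fixed
ppv_at <- function(prev, sens = sensitivity, spec = specificity) {
  (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
}
ppv_at(c(0.004, 0.05, 0.20))   # PPV rises sharply as prevalence increases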
You are an epidemiologist designing a cross-sectional study to estimate the prevalence of a newly emerging respiratory pathogen in a specific urban population. Preliminary data suggests the prevalence is approximately 15%. Your primary goal is estimation with high precision to inform public health resource allocation. The Ministry of Health requires that the amplitude of the 95% confidence interval for the prevalence estimate must not exceed 6 percentage points (i.e., a margin of error of ±3%). Calculate the minimum sample size required to achieve this level of precision. Sensitivity Analysis: If the true prevalence in the population is actually 25% instead of the anticipated 15%, what happens to the width of your confidence interval if you stick with the sample size calculated before? Provide the new margin of error.
To find the sample size \(n\), we use: \[n = \frac{Z^2 \cdot P(1 - P)}{d^2}\] where:
- \(Z\): the Z-score for the confidence level (for 95%, \(Z \approx 1.96\));
- \(P\): the anticipated prevalence (\(0.15\));
- \(d\): the margin of error (\(0.03\)).
# Parameters
conf_level <- 0.95
p_expected <- 0.15
margin_of_error <- 0.03
# Z-score for 95% confidence
z <- qnorm(1 - (1 - conf_level) / 2)
# Sample size formula
n_required <- (z^2 * p_expected * (1 - p_expected)) / (margin_of_error^2)
n_required <- ceiling(n_required) # Always round up to the nearest integer
cat("The minimum sample size required is:", n_required)
## The minimum sample size required is: 545
If the true prevalence is actually 25% (\(P = 0.25\)) but we keep the sample size calculated above, we need to find the new margin of error (\(d\)). We rearrange the formula to solve for \(d\):
\[d = Z \cdot \sqrt{\frac{P(1 - P)}{n}}\]
p_true <- 0.25
# Calculate new margin of error
new_moe <- z * sqrt((p_true * (1 - p_true)) / n_required)
# Convert to percentage for clarity
new_moe_pct <- new_moe * 100
total_width <- new_moe_pct * 2
cat("The amplitude of the interval is:", total_width)
## The amplitude of the interval is: 7.270771
Because 25% is closer to 50% (the point of maximum variance for a binomial proportion: \(0.25 \times 0.75 = 0.1875\) versus \(0.15 \times 0.85 = 0.1275\)), the fixed sample size results in a loss of precision. The new margin of error is approximately \(\pm 3.64\) percentage points, so the confidence interval (about 7.27 points wide) exceeds the Ministry of Health’s 6-point limit.
Researchers are interested in investigating whether exposure to a specific pesticide during pregnancy is associated with an increased risk of a rare birth defect. They hypothesize that the pesticide might increase the likelihood of this defect occurring in newborns. What is the most suitable study design for assessing this relationship? Explain your reasoning.
For this scenario, the most suitable study design is a Case-Control Study. In epidemiology, choosing the right design depends on the frequency of the outcome and the nature of the exposure. Here is why a case-control approach outperforms other designs in this specific context. Since the birth defect is rare, a hypothetical cohort study would have to follow thousands of pregnant women (exposed and unexposed) and wait for them to give birth, only to potentially find a handful of cases; this would be extremely expensive and time-consuming (and potentially ethically problematic). In a case-control study, we instead identify infants who already have the defect (the cases) and a group of infants who do not (the controls), which allows us to achieve adequate statistical power with a much smaller total sample size. Since the researchers are interested in exposure during pregnancy, a case-control study also lets them look backward in time: they can interview mothers or review medical records to determine pesticide exposure levels during the gestational period. While this is the best design here, it is susceptible to recall bias: mothers of children with birth defects may remember their environmental exposures more vividly or differently than mothers of healthy children.
Define the term ‘confounded association’ and provide an example of confounding.
A confounded association occurs in an epidemiological study when the observed relationship between an exposure and an outcome is distorted by a third variable, known as a confounder. In this situation, the confounder is independently associated with both the exposure and the outcome. As a result, the researcher may incorrectly attribute the outcome to the exposure when, in reality, the confounder is partially or entirely responsible for the observed effect. To be considered a confounder, a variable must meet three specific conditions:
- Association with exposure: it must be associated with the exposure being studied.
- Association with outcome: it must be a risk factor for the outcome, independent of the exposure.
- Not a mediator: it must not be on the causal pathway between the exposure and the outcome.
Example of confounding: imagine a study investigating the association between Coffee Consumption (Exposure) and Pancreatic Cancer (Outcome). The observed association: researchers find that people who drink more coffee have a higher risk of developing pancreatic cancer. The confounder: Smoking.
Link to Exposure: Smokers tend to drink more coffee than non-smokers.
Link to Outcome: Smoking is a known, strong risk factor for pancreatic cancer.
Not a Mediator: there is no reason to think that smoking lies on the causal pathway between coffee and pancreatic cancer (coffee does not biologically “cause” the smoking habit; the two are merely positively correlated).
Because smoking is tied to both coffee and cancer, coffee “picks up” the blame for the harm caused by smoking. If the researchers do not control for smoking in their analysis (e.g., by stratifying by smoking status or by including smoking in a multivariable regression model), they might falsely conclude that coffee causes cancer.
Optional: we can handle confounding at two stages:
Design stage:
Randomization: randomly assigning the exposure ensures that confounders (known and unknown) are, on average, distributed equally between groups.
Matching: Selecting a control group that matches the cases based on the confounding variable (e.g., matching by age or smoking status).
Restriction: Only including people in the study who fall into one category of the confounder (e.g., a study only of non-smokers).
At the statistical analysis stage:
For example, using a multivariable regression model:
\[\ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1(\text{Exposure}) + \beta_2(\text{Confounder})\]
By including the confounder in the model, we can isolate the conditional (adjusted) effect of the exposure on the outcome, \(\beta_1\), controlling for the confounder (assuming no other unmeasured confounding).
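As an optional illustration, a minimal simulation sketch (entirely made-up data and effect sizes) shows how the crude odds ratio for coffee is inflated by smoking and moves back toward 1 once smoking is included in the model:

# Simulated data: smoking confounds the coffee-cancer association;
# coffee itself has no effect on the outcome in this toy example
set.seed(42)
n <- 100000
smoking <- rbinom(n, 1, 0.3)                              # confounder
coffee  <- rbinom(n, 1, ifelse(smoking == 1, 0.7, 0.4))   # exposure, linked to smoking
cancer  <- rbinom(n, 1, plogis(-6 + 1.5 * smoking))       # outcome depends on smoking only

exp(coef(glm(cancer ~ coffee, family = binomial))["coffee"])            # crude OR > 1
exp(coef(glm(cancer ~ coffee + smoking, family = binomial))["coffee"])  # adjusted OR ~ 1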
A researcher in Trieste is designing a study to estimate the prevalence of a particular health behavior within the adult population of the city. They want to determine the necessary sample size for their survey to achieve a specific level of precision and confidence. Identify and briefly explain the key input parameters the researcher needs to consider and know (or estimate) before they can calculate the required sample size. (Optional): provide hypothetical examples by assigning plausible values to the parameters you identified and then calculate the approximate minimum sample size required.
To calculate the sample size for a prevalence study (typically a cross-sectional survey), the researcher needs to define four critical parameters.
Expected Prevalence (\(p\)): This is the researcher’s “best guess” or an estimate from previous literature of how common the behavior is. If the prevalence is unknown, using 0.5 (50%) is the standard conservative approach because it maximizes the required sample size.
Confidence Level (\(Z\)): This reflects how certain the researcher wants to be that the sample represents the population. Usually, a 95% confidence level is used, which corresponds to a Z-score of 1.96.
Margin of Error / Precision (\(d\)): the maximum acceptable distance between the sample estimate and the true population value. For example, if the required precision is \(0.05\) (5%) and the observed result is 20%, the researcher is confident the true value lies between 15% and 25%. A smaller \(d\) requires a much larger sample.
Population Size (\(N\)): The total number of adults in Trieste. However, if the population is large (typically \(> 10,000\)), the sample size calculation becomes relatively independent of \(N\) unless the researcher applies a “finite population correction.” We did not discuss this aspect in the course.
Hypothetical Calculation: To calculate the sample size (\(n\)), we use the Formula:
\[n = \frac{Z^2 \cdot p(1-p)}{d^2}\]
Assumed Values: Confidence Level: 95% (\(Z = 1.96\)); Expected Prevalence (\(p\)): 30% (\(0.30\)) — Let’s assume this is the estimated rate of daily espresso consumption. Precision (\(d\)): 5% (\(0.05\))
The calculation: square the Z-score: \(1.96^2 \approx 3.8416\); calculate the variance term: \(0.30 \times (1 - 0.30) = 0.21\); square the precision: \(0.05^2 = 0.0025\).
\[n = \frac{3.8416 \times 0.21}{0.0025}\]
\[n = \frac{0.8067}{0.0025}\]
\[n \approx 322.68\]
Minimum Sample Size Required: The researcher would need at least 323 participants to be 95% confident that their estimate of the behavior in Trieste is within 5 percentage points of the true prevalence.
Of note: if the researcher expects a high non-response rate (e.g., 20% of people will not answer the survey), they should increase the initial sample size to ensure they end up with 323 completed responses.
\[n_{final} = \frac{n}{0.80} \approx 404\]
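The same calculation, including the non-response inflation, in R (using the hypothetical values assumed above):

# Worked example: 95% confidence, p = 0.30, d = 0.05
z_conf <- qnorm(0.975)
p_exp  <- 0.30
d_prec <- 0.05
n_min  <- ceiling(z_conf^2 * p_exp * (1 - p_exp) / d_prec^2)
n_min                      # 323
ceiling(n_min / 0.80)      # 404 after inflating for 20% non-response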
In what situations is the incidence rate a more appropriate measure than cumulative incidence?
In epidemiology, both measures describe how frequently new cases of a disease occur, but they handle the concept of time differently. Incidence Rate is more appropriate than Cumulative Incidence in the following four specific situations:
Dynamic populations (open cohorts): in a real-world city (like Trieste), people are constantly moving in and out, being born, or dying. Cumulative Incidence assumes a “closed” group where everyone is followed for the entire period. The Incidence Rate is better because, by using person-time in the denominator, it accounts for people entering or leaving the study at different times.
Differing lengths of follow-up: if participants are followed for unequal amounts of time, the Incidence Rate is the more accurate choice. Example: if one person is followed for 10 years and another for 2 years, they do not contribute the same amount of time at risk to the study. The Incidence Rate accounts for this by calculating cases per person-year (see the short sketch after this list).
High attrition or loss to follow-up: in long-term studies, people often drop out (move away, lose interest, or die from other causes). Cumulative Incidence gets “diluted” or biased if we lose many people, because we do not know whether they developed the disease after leaving. The Incidence Rate handles this by counting only the time subjects were actually under observation.
Recurrent events: if the health outcome can happen more than once to the same person (e.g., the common cold, asthma attacks, or sports injuries), Cumulative Incidence is poorly suited because it typically measures the proportion of people who had at least one event. The Incidence Rate can count multiple events over the total person-time, providing a more accurate picture of the disease burden. (We did not discuss this aspect specifically in Block 1.)
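A minimal R sketch with hypothetical follow-up data makes the difference in denominators concrete:

# Five hypothetical subjects: event indicator and years actually observed
event <- c(0, 1, 0, 1, 0)
years <- c(10, 2, 5, 7, 1)

sum(event) / length(event)   # cumulative incidence: 2/5 = 0.40 (proportion of people)
sum(event) / sum(years)      # incidence rate: 2/25 = 0.08 cases per person-year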