For the sole purpose of illustration, here we use the “single-split” strategy, training the logistic model on 63% of the data and then calculating the performance measures in the remaining 37% of the data. A single split of the data for this purpose is not recommended in most real applications, both for efficiency reasons (the model trained on the full data outperforms the model trained on a reduced training set) and because the results of a single random split may depend heavily on the random seed, i.e. the results would be different if the data were split in a different way. Bootstrap or cross-validation is in this sense safer, even if it gives us an average performance of the final model (and it is certainly recommended when comparing alternative models). This example is characterized by an uncensored outcome: no one is lost to follow-up before the prediction time horizon, and hence the outcome status (yes/no) of all subjects is known at the prediction time horizon.
library(MedicalRiskPredictionModels)
prepareExamples()
The dataset ivf comes from an in vitro fertilization study: it is a computer-modified synthetic version of the data of an in vitro fertilization study (https://www.rbmojournal.com/article/S1472-6483(10)00564-X/fulltext). Ovarian stimulation carries a risk of either low or excessive ovarian response. The aim was to develop prognostic models for identification of standard (ovulatory and normal basal FSH) patients’ risks of low and excessive response to conventional stimulation for IVF/intracytoplasmic sperm injection. Prospectively collected data on 276 first-cycle patients treated with 150 IU recombinant FSH (rFSH)/day were analysed. The outcome variable is ovarian hyperstimulation syndrome. Covariates used in the model are woman’s age, menstrual cycle length, antral follicle count and smoking.
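The training and test sets used below (ivftrain and ivftest) are provided by the prepareExamples() call above. As a minimal sketch, and assuming the combined data are available as a data frame named ivf, a single random 63/37 split of this kind could be constructed along these lines (the seed and the proportion are illustrative assumptions, not the split actually used by the package):
set.seed(123)                                           # make the illustrative split reproducible
n <- nrow(ivf)                                          # 276 patients in total
train.id <- sample(seq_len(n), size = round(0.63 * n))  # ~63% of subjects for training
my.train <- ivf[train.id, ]                             # training set (used to fit the model)
my.test  <- ivf[-train.id, ]                            # test set (used only to evaluate performance)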
Here we estimate the model on the 174 subjects of the training set and we visualize the estimated probabilities on the test set:
fit <- glm(ohss~ant.foll+cyclelen+age+smoking,data=ivftrain,family="binomial")
summary(fit)
##
## Call:
## glm(formula = ohss ~ ant.foll + cyclelen + age + smoking, family = "binomial",
## data = ivftrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2544 -0.6998 -0.4532 0.8073 2.0464
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.36802 3.33629 -1.609 0.1076
## ant.foll 0.13551 0.02703 5.014 5.33e-07 ***
## cyclelen 0.12445 0.09735 1.278 0.2011
## age -0.04723 0.05637 -0.838 0.4021
## smokingYes -0.94109 0.44815 -2.100 0.0357 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 220.09 on 173 degrees of freedom
## Residual deviance: 168.65 on 169 degrees of freedom
## AIC: 178.65
##
## Number of Fisher Scoring iterations: 5
Note that two out of the four predictors are not statistically significant; a more parsimonious model could probably be obtained here, but we keep this version just as an example.
x <- Score(list(fit),formula=ohss~1,data=ivftest,summary="riskQuantiles") # summarize the quantiles of the predicted risks in the test set
boxplot(x)
abline(v=mean(ivftest$ohss),lty=2, col="red", lwd=2)# outcome prevalence in the test set
abline(v=mean(predictRisk(fit,newdata=ivftest)),lty=3, col="blue", lwd=2)# average of the predicted risk according to the model
We can observe that there is good calibration in the large, i.e. the two lines are quite close to each other; recall that the model would be perfectly calibrated (in the large) if the two lines were indistinguishable.
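The same check of calibration in the large can be done numerically; as a minimal sketch, using the same two quantities as the plot above:
mean(ivftest$ohss)                          # observed outcome prevalence in the test set
mean(predictRisk(fit, newdata = ivftest))   # average predicted risk according to the model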
The AUC in the test dataset is:
fit <- glm(ohss~ant.foll+cyclelen+age+smoking,data=ivftrain,family="binomial")
Score(list("My model"=fit),formula=ohss~1,data=ivftest,metrics="auc")
##
## Metric AUC:
##
## Results by model:
##
## model AUC lower upper
## 1: My model 86.5 79.5 93.5
##
## NOTE: Values are multiplied by 100 and given in %.
## NOTE: The higher AUC the better.
We can also visualize it:
fit <- glm(ohss~ant.foll+smoking+age,data=ivftrain,family=binomial)
x <- Score(list("My-model"=fit), data=ivftest, formula=ohss~1,
plots="ROC")
plotROC(x)
It is quite a good value for the AUC, even if the confidence interval is wide (small sample size!).
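To recall what this number measures, the AUC can also be computed directly as a concordance probability: the probability that a randomly chosen subject with the outcome receives a higher predicted risk than a randomly chosen subject without it. Here is a minimal sketch of the empirical estimator for the model fitted just above (Score() may differ slightly in its estimation and confidence-interval details):
p <- predictRisk(fit, newdata = ivftest)   # predicted risks in the test set
y <- ivftest$ohss                          # observed outcomes (0/1)
# proportion of case/control pairs in which the case has the higher predicted risk (ties count 1/2)
mean(outer(p[y == 1], p[y == 0], ">")) + 0.5 * mean(outer(p[y == 1], p[y == 0], "=="))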
There are many different methods to evaluate calibration; in general terms, a calibration curve is defined over the range of values that the model is predicting. For any such value, the calibration is the observed event frequency among patients who have a predicted risk equal to, or at least close to, that value. There are then various ways to define “close”: a very basic method is to categorize the predicted risk according to deciles.
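As a minimal sketch of this decile approach (assuming the decile cut points are unique in the test set):
p <- predictRisk(fit, newdata = ivftest)   # predicted risks in the test set
y <- ivftest$ohss                          # observed outcomes (0/1)
grp <- cut(p, breaks = quantile(p, probs = seq(0, 1, 0.1)), include.lowest = TRUE)  # risk deciles
cbind(mean.predicted = tapply(p, grp, mean),   # average predicted risk per decile
      observed.freq  = tapply(y, grp, mean))   # observed event frequency per decile
The plotCalibration() call below, with bars=1 and q=10, gives the packaged version of this grouped comparison: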
fit <- glm(ohss~ant.foll+smoking+age,data=ivftrain,family=binomial)
x <- Score(list("My-model"=fit), data=ivftest, formula=ohss~1,
plots="calibration")
plotCalibration(x,bars=1,q=10)
We can observe some deviations from perfect calibration, in particular at the lower end of the predicted risks. It has to be noted also that the histogram type of calibration plot is quite sensitive to the number of groups, as is the Hosmer-Lemeshow test. See, for example, how different the following plot looks:
plotCalibration(x,bars=1,q=3)
As an alternative, one can use the more flexible density type of calibration plot:
plotCalibration(x,brier.in.legend=FALSE, auc.in.legend=FALSE)
boxplot(x$Calibration$plotframe$risk,horizontal=TRUE, main="Predicted risk")
To compute the density curve, the predicted risks are ordered and then a window of values around points on the x-axis is defined. These values are used to calculate the observed frequency of the event. The density-type calibration curve is sensitive to the width of this window. Also, when interpreting a local deviation from the diagonal line, one needs to consider the number of subjects in that region of the predicted risk. The calibration curve is a nice graphical representation of the relationship between the predicted risk and the observed outcome. Unfortunately, there are three important limitations of the calibration curve: first, it does not quantify the miscalibration of the model as a single number. Second, deviations from the 45-degree line are more important in regions where many subjects have their predicted risk, and the curve alone does not convey this. Third, there are many ways to calculate the running average, and the choice of method typically impacts the visual appearance of the curve. A good idea is to compare different approaches to calibration, for example using the CalibrationCurves R package (used right after the short sketch below).
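To make the running-average idea concrete, here is a minimal sketch of a fixed-width moving-window estimate of the observed event frequency (the half-width w is an arbitrary assumption, and this is not the exact smoother used by plotCalibration):
p <- predictRisk(fit, newdata = ivftest)   # predicted risks in the test set
y <- ivftest$ohss                          # observed outcomes (0/1)
ord <- order(p); p <- p[ord]; y <- y[ord]  # order the predicted risks
w <- 0.05                                  # window half-width on the risk scale (arbitrary assumption)
obs <- sapply(p, function(v) mean(y[abs(p - v) <= w]))   # local observed frequency around each value
plot(p, obs, type = "l", xlab = "Predicted risk", ylab = "Local observed frequency",
     xlim = c(0, 1), ylim = c(0, 1))
abline(0, 1, lty = 2)                      # the diagonal of perfect calibration
Changing w visibly changes the curve, which illustrates exactly the sensitivity discussed above.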
library(CalibrationCurves)
Pred.p <- predictRisk(fit,newdata=ivftest)
yval.p <- ivftest$ohss
val.prob.ci.2(Pred.p, yval.p, CL.smooth = FALSE, logistic.cal = TRUE, lty.log = 2,
col.log = "blue", lwd.log = 1.5)
## Call:
## val.prob.ci.2(p = Pred.p, y = yval.p, CL.smooth = FALSE, logistic.cal = TRUE,
## lty.log = 2, col.log = "blue", lwd.log = 1.5)
##
## A 95% confidence interval is given for the calibration intercept, calibration slope and c-statistic.
##
## Dxy C (ROC) R2 D D:Chi-sq
## 7.124035e-01 8.562017e-01 4.254247e-01 3.481072e-01 3.650693e+01
## D:p U U:Chi-sq U:p Q
## 1.521244e-09 -3.844537e-03 1.607857e+00 4.475672e-01 3.519517e-01
## Brier Intercept Slope Emax Brier scaled
## 1.452398e-01 -4.422196e-02 1.359613e+00 9.337746e-02 3.134599e-01
## Eavg ECI
## 5.798306e-02 4.089080e-01
This plot gives us the estimates of the intercept and slope using the logistic regression calibration approach, with corresponding confidence intervals, so that we also have a numerical evaluation of the calibration. We can see a nice result for calibration in the large, with the confidence interval for the intercept containing zero, and a trend towards under-fitting with the slope being > 1, even if the confidence interval contains 1, so this under-fitting is not statistically significant. Moreover, we can compare the logistic regression approach with a more flexible one (the default here is loess). We can also obtain other numerical measures of calibration (see the help of the original function val.prob of the rms library for all the details).
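For clarity, here is a minimal sketch of the logistic calibration (recalibration) models behind the reported intercept and slope; val.prob.ci.2 may differ in implementation details:
lp <- qlogis(Pred.p)                           # logit of the predicted risks (linear predictor)
glm(yval.p ~ offset(lp), family = binomial)    # calibration intercept (slope fixed at 1)
glm(yval.p ~ lp, family = binomial)            # calibration slope = coefficient of lp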
Finally, we can also depict confidence intervals around the estimated calibration curve:
val.prob.ci.2(Pred.p, yval.p)
## Call:
## val.prob.ci.2(p = Pred.p, y = yval.p)
##
## A 95% confidence interval is given for the calibration intercept, calibration slope and c-statistic.
##
## Dxy C (ROC) R2 D D:Chi-sq
## 7.124035e-01 8.562017e-01 4.254247e-01 3.481072e-01 3.650693e+01
## D:p U U:Chi-sq U:p Q
## 1.521244e-09 -3.844537e-03 1.607857e+00 4.475672e-01 3.519517e-01
## Brier Intercept Slope Emax Brier scaled
## 1.452398e-01 -4.422196e-02 1.359613e+00 9.337746e-02 3.134599e-01
## Eavg ECI
## 5.798306e-02 4.089080e-01
We now give an example of the cross-validation approach to evaluate performance using simulated data: we simulate data with a binary outcome and 10 covariates.
set.seed(10)
learndat=sampleData(400,outcome="binary")
head(learndat)
## X6 X7 X8 X9 X10 X1 X2 X3 X4 X5
## 1: 63.74545 61.52629 -0.04798623 0.2768507 1.0500137 0 0 0 1 1
## 2: 60.90024 61.76612 0.35769140 -0.2152228 0.2860926 0 0 0 1 0
## 3: 53.11203 58.29247 0.80898961 1.7082707 0.2405648 0 0 1 1 0
## 4: 70.65227 62.57300 0.64772237 1.5384312 0.8327052 0 0 0 1 1
## 5: 44.40411 58.12349 0.59938390 -0.2362137 -0.2229832 0 0 1 1 0
## 6: 82.01472 61.03095 -0.39172740 -2.0764318 0.2883442 0 0 0 0 0
## Y
## 1: 1
## 2: 1
## 3: 0
## 4: 0
## 5: 1
## 6: 1
Then we fit two alternative models to the full dataset:
lr1a = glm(Y~X6,data=learndat,family=binomial)
lr2a = glm(Y~X7+X8+X9,data=learndat,family=binomial)
Now we perform cross-validation in different ways, starting with bootstrap cross-validation:
x1=Score(list("LR1"=lr1a,"LR2"=lr2a),formula=Y~1,data=learndat,split.method="bootcv",B=100)
x1
##
## Metric AUC:
##
## Results by model:
##
## model AUC lower upper
## 1: LR1 71.1 64.2 77.2
## 2: LR2 72.0 65.3 78.5
##
## Results of model comparisons:
##
## model reference delta.AUC lower upper
## 1: LR2 LR1 0.9 -9.2 11.9
##
## NOTE: Values are multiplied by 100 and given in %.
## NOTE: The higher AUC the better.
##
## Metric Brier:
##
## Results by model:
##
## model Brier lower upper
## 1: Null model 25.0 24.5 25.7
## 2: LR1 21.9 20.1 24.3
## 3: LR2 21.6 19.5 24.3
##
## Results of model comparisons:
##
## model reference delta.Brier lower upper
## 1: LR1 Null model -3.2 -4.8 -0.6
## 2: LR2 Null model -3.4 -5.4 -0.9
## 3: LR2 LR1 -0.2 -4.3 3.3
##
## NOTE: Values are multiplied by 100 and given in %.
## NOTE: The lower Brier the better.
##
## Bootstrap cross-validation based on 100 bootstrap samples (drawn with replacement) each of size 400.
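To see what happens behind the scenes, here is a minimal sketch of a single bootstrap cross-validation replication; Score() repeats this B=100 times and averages the results, and the simple concordance below ignores ties, so the exact estimator used by Score() may differ:
set.seed(1)
id   <- sample(seq_len(nrow(learndat)), replace = TRUE)   # bootstrap sample (with replacement)
oob  <- setdiff(seq_len(nrow(learndat)), unique(id))      # out-of-bag subjects, used as test set
fitb <- glm(Y ~ X6, data = learndat[id, ], family = binomial)         # refit LR1 on the bootstrap sample
pb   <- predict(fitb, newdata = learndat[oob, ], type = "response")   # predict on the out-of-bag subjects
yb   <- learndat$Y[oob]
mean(outer(pb[yb == 1], pb[yb == 0], ">"))                # out-of-bag AUC for this replication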
Leave-one-out and leave-pair-out bootstrap:
x2=Score(list("LR1"=lr1a,"LR2"=lr2a),formula=Y~1,data=learndat,split.method="loob", B=100,plots="calibration")
x2
##
## Metric AUC:
##
## Results by model:
##
## model AUC lower upper
## 1: Null model 50.0 50.0 50.0
## 2: LR1 71.3 66.3 76.3
## 3: LR2 72.4 67.5 77.3
##
## Results of model comparisons:
##
## model reference delta.AUC lower upper p
## 1: LR1 Null model 21.3 16.3 26.3 7e-17
## 2: LR2 Null model 22.4 17.5 27.3 3e-19
## 3: LR2 LR1 1.1 -6.7 8.9 8e-01
##
## NOTE: Values are multiplied by 100 and given in %.
## NOTE: The higher AUC the better.
##
## Metric Brier:
##
## Results by model:
##
## model Brier lower upper
## 1: Null model 25.0 24.6 25.4
## 2: LR1 21.7 20.2 23.3
## 3: LR2 21.4 19.7 23.1
##
## Results of model comparisons:
##
## model reference delta.Brier lower upper p
## 1: LR1 Null model -3.3 -4.8 -1.7 3.634041e-05
## 2: LR2 Null model -3.6 -5.3 -1.9 3.333449e-05
## 3: LR2 LR1 -0.3 -2.8 2.2 8.111893e-01
##
## NOTE: Values are multiplied by 100 and given in %.
## NOTE: The lower Brier the better.
##
## Data for Calibration plot are stored in the object as x[["Calibration"]].
##
## Efron's leave-one-out-bootstrap based on 100 bootstrap samples (drawn with replacement) each of size 400.
## The 'confidence intervals' and 'p-values' are obtained with the delta method after bootstrap.
And finally the well-known 5-fold cross-validation:
x3=Score(list("LR1"=lr1a,"LR2"=lr2a),formula=Y~1,data=learndat,split.method="cv5",B=1,plots="calibration")
x3
##
## Metric AUC:
##
## Results by model:
##
## model AUC lower upper
## 1: LR1 71.1 66.1 76.1
## 2: LR2 72.0 67.0 77.1
##
## Results of model comparisons:
##
## model reference delta.AUC lower upper p
## 1: LR2 LR1 0.9 -7.0 8.8 0.8
##
## NOTE: Values are multiplied by 100 and given in %.
## NOTE: The higher AUC the better.
##
## Metric Brier:
##
## Results by model:
##
## model Brier lower upper
## 1: Null model 24.8 24.4 25.3
## 2: LR1 21.5 20.0 23.1
## 3: LR2 21.3 19.6 23.0
##
## Results of model comparisons:
##
## model reference delta.Brier lower upper p
## 1: LR1 Null model -3.3 -4.9 -1.7 3.047993e-05
## 2: LR2 Null model -3.6 -5.2 -1.9 3.637662e-05
## 3: LR2 LR1 -0.3 -2.8 2.3 8.413500e-01
##
## NOTE: Values are multiplied by 100 and given in %.
## NOTE: The lower Brier the better.
##
## Data for Calibration plot are stored in the object as x[["Calibration"]].
##
## 5-fold cross-validation repeated 1 times.
No significant differences are observed between the two models. Of note, the Brier score is a popular metric that combines calibration and discrimination in a single summary measure for a logistic model. See for details and a discussion of this metric: https://diagnprognres.biomedcentral.com/articles/10.1186/s41512-017-0020-3
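As a minimal sketch of how the Brier score is computed (here using apparent predictions on the full data, not the cross-validated predictions reported above):
p <- predict(lr2a, type = "response")      # apparent predicted probabilities for LR2
y <- learndat$Y
brier      <- mean((y - p)^2)              # Brier score: mean squared difference between outcome and prediction
brier.null <- mean((y - mean(y))^2)        # Brier score of the null model (prevalence used as prediction)
c(brier = brier, scaled = 1 - brier/brier.null)   # scaled Brier: relative improvement over the null model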
We now want to compute the minimum sample size required for the development of a new multivariable model using the criteria proposed by Riley et al. The required sample size aims to minimize model overfitting and to ensure key parameters (such as the model intercept) are estimated precisely. As for any sample size calculation, the approach requires the user to specify anticipated values for key parameters.
The package pmsampsize can be used to calculate the minimum sample size for the development of models with continuous, binary or survival (time-to-event) outcomes. Riley et al. lay out a series of criteria the sample size should meet; these aim to minimise overfitting and to ensure precise estimation of key parameters in the prediction model.
For continuous outcomes, there are four criteria: (i) small overfitting, defined by an expected shrinkage of the predictor effects of 10% or less; (ii) a small absolute difference (0.05) between the apparent and adjusted R-squared; (iii) precise estimation of the residual standard deviation; and (iv) precise estimation of the average outcome value (the model intercept). The sample size calculation requires the user to pre-specify (e.g. based on previous evidence) the anticipated R-squared of the model, and the average outcome value and standard deviation of outcome values in the population of interest.
For binary or survival (time-to-event) outcomes, there are three major criteria: (i) small overfitting, again defined by an expected shrinkage of 10% or less; (ii) a small absolute difference (0.05) between the apparent and adjusted R-squared; and (iii) precise estimation (a 0.05 margin of error) of the overall outcome risk or rate, i.e. the model intercept.
library(pmsampsize)
Arguments of the function:
type specifies the type of analysis for which sample size is being calculated
“c” specifies sample size calculation for a prediction model with a continuous outcome
“b” specifies sample size calculation for a prediction model with a binary outcome
“s” specifies sample size calculation for a prediction model with a survival (time-to-event) outcome
rsquared specifies the expected value of the (Cox-Snell) R-squared of the new model, where R-squared is the percentage of variation in outcome values explained by the model. For example, the user may input the value of the (Cox-Snell) R-squared reported for a previous prediction model study in the same field. If taking a value from a previous prediction model development study, users should input the model’s adjusted R-squared value, not the apparent R-squared value, as the latter is optimistic (biased). However, if taking the R-squared value from an external validation of a previous model, the apparent R-squared can be used (as the validation data were not used for development, the apparent R-squared is then unbiased). Note that for binary and survival outcome models, the Cox-Snell R-squared value is required; this is the generalised version of the well-known R-squared for continuous outcomes, based on the likelihood. The papers by Riley et al. (see references) outline how to obtain the Cox-Snell R-squared value from published studies if it is not reported, using other information such as the C-statistic (see the cstatistic option below) or Nagelkerke’s R-squared (a short sketch of this conversion is given right after this list of arguments). Users should be conservative with their chosen R-squared value; for example, by taking the R-squared value from a previous model, even if they hope their new model will improve performance.
parameters specifies the number of candidate predictor parameters for potential inclusion in the new prediction model. Note that this may be larger than the number of candidate predictors, as categorical and continuous predictors often require two or more parameters to be estimated.
shrinkage specifies the level of shrinkage desired at internal validation after developing the new model. Shrinkage is a measure of overfitting, and can range from 0 to 1, with higher values denoting less overfitting. A shrinkage = 0.9 is recommended (the default in pmsampsize), which indicates that the predictor effect (beta coefficients) in the model would need to be shrunk by 10% to adjust for overfitting. See references below for further information.
prevalence (binary outcome option) specifies the overall outcome proportion (for a prognostic model) or overall prevalence (for a diagnostic model) expected within the model development dataset. This should be derived based on previous studies in the same population.
cstatistic (binary outcome option) specifies the C-statistic reported in an existing prediction model study, to be used in conjunction with the expected prevalence to approximate the Cox-Snell R-squared using the approach of Riley et al. 2020. Ideally, this should be an optimism-adjusted C-statistic. The approximate Cox-Snell R-squared value is used as described above for the rsquared option, and so is treated as a baseline for the expected performance of the new model.
seed (binary outcome option) specifies the initial value of the random-number seed used by the random-number functions when simulating data to approximate the Cox-Snell R-squared based on the reported C-statistic and expected prevalence, as described by Riley et al. 2020.
rate (survival outcome option) specifies the overall event rate in the population of interest, for example as obtained from a previous study, for the survival outcome of interest. NB: rate must be given in time units used for meanfup and timepoint options.
timepoint (survival outcome option) specifies the timepoint of interest for prediction. NB: time units must be the same as given for meanfup option (e.g. years, months).
meanfup (survival outcome option) specifies the average (mean) follow-up time anticipated for individuals in the model development dataset, for example as taken from a previous study in the population of interest. NB: time units must be the same as given for timepoint option.
intercept (continuous outcome options) specifies the average outcome value in the population of interest e.g. the average blood pressure, or average pain score. This could be based on a previous study, or on clinical knowledge.
sd (continuous outcome options) specifies the standard deviation (SD) of outcome values in the population e.g. the SD for blood pressure in patients with all other predictors set to the average. This could again be based on a previous study, or on clinical knowledge.
mmoe (continuous outcome options) multiplicative margin of error (MMOE) acceptable for calculation of the intercept. The default is a MMOE of 10%. Confidence interval for the intercept will be displayed in the output for reference. See references below for further information.
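Since published studies often report Nagelkerke’s R-squared rather than the Cox-Snell value required by the rsquared argument, here is a minimal sketch of the conversion outlined by Riley et al.; the sample size, prevalence and Nagelkerke value below are purely illustrative assumptions:
n.prev   <- 500    # sample size of the published development study (assumption)
prev     <- 0.2    # outcome proportion in that study (assumption)
R2.nagel <- 0.30   # reported Nagelkerke R-squared (assumption)
LL0    <- n.prev * (prev * log(prev) + (1 - prev) * log(1 - prev))  # log-likelihood of the null model
R2.max <- 1 - exp(2 * LL0 / n.prev)   # maximum achievable Cox-Snell R-squared
R2.cs  <- R2.nagel * R2.max           # Cox-Snell R-squared to be passed to the rsquared argument
R2.cs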
Self-identified race or ethnic group is used to determine normal reference standards in the prediction of pulmonary function. A study was conducted in 2010 to determine whether the genetically determined percentage of African ancestry is associated with lung function and whether its use could improve predictions of lung function among persons who identified themselves as African American (see: https://www.nejm.org/doi/full/10.1056/NEJMoa0907897). The authors assessed the ancestry of 777 participants self-identified as African American and evaluated the relation between pulmonary function and ancestry by means of linear regression. African ancestry was inversely related to forced expiratory volume in 1 second (FEV1) and forced vital capacity. Assuming 25 candidate parameters, an intercept of 1.9, a standard deviation of 0.6 (both from the published paper) and a lower bound of 0.20 for the R-squared:
pmsampsize(type = "c", rsquared = 0.2, parameters = 25, intercept=1.9, sd=0.6)
## NB: Assuming 0.05 acceptable difference in apparent & adjusted R-squared
## NB: Assuming MMOE <= 1.1 in estimation of intercept & residual standard deviation
## SPP - Subjects per Predictor Parameter
##
## Samp_size Shrinkage Parameter Rsq SPP
## Criteria 1 918 0.900 25 0.2 36.72
## Criteria 2 401 0.801 25 0.2 16.04
## Criteria 3 259 0.727 25 0.2 10.36
## Criteria 4* 918 0.900 25 0.2 36.72
## Final 918 0.900 25 0.2 36.72
##
## Minimum sample size required for new model development based on user inputs = 918
##
## * 95% CI for intercept = (1.87, 1.93), for sample size n = 918
SPP indicates the “Subjects per Predictor Parameter” and, as you can observe, this number varies according to the criterion used (from about 10 to 37). The confidence interval reported for the intercept is based on a 10% margin of error.
Chagas disease is a tropical parasitic disease. It is spread mostly by insects known as Triatominae, or “kissing bugs”. The symptoms change over the course of the infection. In the early stage, symptoms are typically either not present or mild, and may include fever, swollen lymph nodes, headaches, or swelling at the site of the bite. After four to eight weeks, untreated individuals enter the chronic phase of disease, which in most cases does not result in further symptoms. Up to 45% of people with chronic infection develop heart disease 10-30 years after the initial illness, which can lead to heart failure. Digestive complications, including an enlarged esophagus or an enlarged colon, may also occur in up to 21% of people, and up to 10% of people may experience nerve damage. With the globalization of Chagas disease, inexperienced health care providers may have difficulties in identifying which patients should be examined for this condition. This study, published in 2016, aimed to develop and validate a diagnostic clinical model for chronic Chagas disease: https://www.scielo.br/j/rsbmt/a/WMwS4xvKGxMBMzybxwkbvVv/?lang=en
We use pmsampsize to calculate the minimum sample size required to develop a multivariable prediction model for a binary outcome using 24 candidate predictor parameters. Based on the published evidence, the outcome prevalence is anticipated to be 0.174 (17.4%) and a lower bound (taken from the adjusted Cox-Snell R-squared of an existing prediction model) for the new model’s R-squared value is 0.288.
pmsampsize(type = "b", rsquared = 0.288, parameters = 24, prevalence = 0.174)
## NB: Assuming 0.05 acceptable difference in apparent & adjusted R-squared
## NB: Assuming 0.05 margin of error in estimation of intercept
## NB: Events per Predictor Parameter (EPP) assumes prevalence = 0.174
##
## Samp_size Shrinkage Parameter CS_Rsq Max_Rsq Nag_Rsq EPP
## Criteria 1 623 0.900 24 0.288 0.603 0.477 4.52
## Criteria 2 662 0.905 24 0.288 0.603 0.477 4.80
## Criteria 3 221 0.905 24 0.288 0.603 0.477 1.60
## Final 662 0.905 24 0.288 0.603 0.477 4.80
##
## Minimum sample size required for new model development based on user inputs = 662,
## with 116 events (assuming an outcome prevalence = 0.174) and an EPP = 4.8
##
##
EPP here indicates the “Events per Predictor Parameter” and, as you can observe, this number varies according to the criterion used (from 1.6 to 4.8).
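As a rough check of where the “Criteria 1” row comes from, the shrinkage-based criterion (i) has the closed form n = p / ((S - 1) * ln(1 - R2_CS / S)) described by Riley et al.; the sketch below reproduces only that row, since pmsampsize applies the remaining criteria as well:
p.par <- 24      # candidate predictor parameters
R2.cs <- 0.288   # anticipated Cox-Snell R-squared
S     <- 0.9     # targeted shrinkage factor
ceiling(p.par / ((S - 1) * log(1 - R2.cs / S)))   # about 623, matching Criteria 1 above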
Here we are interested in developing a diagnostic model for the presence of DVT (deep venous thrombosis). DVT is a blood clot that forms in a leg vein and may migrate to the lungs, leading to blockage of arterial flow, preventing oxygenation of the blood and potentially causing death. Multivariable diagnostic prediction models have been proposed during the past decades to safely exclude DVT without having to refer patients for further burdening (reference standard) testing. The reference http://dx.doi.org/10.1016/j.jclinepi.2014.06.018 reported a C-statistic of 0.79 for a model with 8 parameters and an outcome prevalence estimated at 22%.
pmsampsize(type = "b", cstatistic=0.79, parameters = 9, prevalence = 0.22)
## Given input C-statistic = 0.79 & prevalence = 0.22
## Cox-Snell R-sq = 0.1802
##
## NB: Assuming 0.05 acceptable difference in apparent & adjusted R-squared
## NB: Assuming 0.05 margin of error in estimation of intercept
## NB: Events per Predictor Parameter (EPP) assumes prevalence = 0.22
##
## Samp_size Shrinkage Parameter CS_Rsq Max_Rsq Nag_Rsq EPP
## Criteria 1 403 0.900 9 0.1802 0.651 0.277 9.85
## Criteria 2 246 0.847 9 0.1802 0.651 0.277 6.01
## Criteria 3 264 0.900 9 0.1802 0.651 0.277 6.45
## Final 403 0.900 9 0.1802 0.651 0.277 9.85
##
## Minimum sample size required for new model development based on user inputs = 403,
## with 89 events (assuming an outcome prevalence = 0.22) and an EPP = 9.85
##
##
Van Calster, B., Nieboer, D., Vergouwe, Y., De Cock, B., Pencina, M.J., Steyerberg, E.W. (2016). A calibration hierarchy for risk models was defined: from utopia to empirical data. Journal of Clinical Epidemiology, 74, pp. 167-176
Riley RD, Ensor J, Snell KIE, Harrell FE, Martin GP, Reitsma JB, et al. Calculating the sample size required for developing a clinical prediction model. BMJ (Clinical research ed). 2020
Riley RD, Snell KIE, Ensor J, Burke DL, Harrell FE, Jr., Moons KG, Collins GS. Minimum sample size required for developing a multivariable prediction model: Part I continuous outcomes. Statistics in Medicine. doi: 10.1002/sim.7993
Riley RD, Snell KIE, Ensor J, Burke DL, Harrell FE, Jr., Moons KG, Collins GS. Minimum sample size required for developing a multivariable prediction model: Part II binary and time-to-event outcomes. Statistics in Medicine. doi: 10.1002/sim.7992
van Smeden M, Moons KG, de Groot JA, et al. Sample size for binary logistic prediction models: Beyond events per variable criteria. Stat Methods Med Res. 2019;28(8):2455-74
Riley, RD, Van Calster, B, Collins, GS. A note on estimating the Cox-Snell R2 from a reported C statistic (AUROC) to inform sample size calculations for developing a prediction model with a binary outcome. Statistics in Medicine. 2020