Area under the receiver operating characteristic (ROC) curve (AUC): A measure of the discrimination of a logistic regression model. The ROC curve is the plot of sensitivity versus one minus specificity over all possible thresholds of predicted probability. The area under the ROC curve is numerically equivalent to the c-statistic for a binary outcome.
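As a minimal sketch of the definition above, the following traces the ROC curve over all thresholds and applies the trapezoidal rule; the labels and predicted probabilities are illustrative:

```python
def roc_auc(y_true, scores):
    """Compute AUC by tracing sensitivity (TPR) versus one minus
    specificity (FPR) over all thresholds, then applying the
    trapezoidal rule to the resulting piecewise-linear curve."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # (FPR, TPR), starting at the origin
    for t in sorted(set(scores), reverse=True):
        tpr = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t) / n_pos
        fpr = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t) / n_neg
        points.append((fpr, tpr))
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2  # trapezoid for each segment
    return auc

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```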
Bayesian nonparametric methods: An approach to model selection that allows the data to determine the complexity of the model. In an infinite-dimensional parameter space, a Bayesian nonparametric model uses only a finite subset of the available dimensions to explain a sample of observations, with the complexity of the model adapting to the sample data.
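One way to see complexity adapting to the data is the Chinese restaurant process, the clustering behavior induced by a Dirichlet process prior: each observation joins an existing cluster with probability proportional to its size or opens a new one, so only finitely many of the infinitely many potential clusters are ever used. A minimal sketch with illustrative parameters:

```python
import random

def crp(n, alpha, rng):
    """Simulate Chinese restaurant process cluster assignments for n
    observations; alpha controls the tendency to open new clusters."""
    counts = []  # size of each cluster used so far
    for _ in range(n):
        r = rng.random() * (sum(counts) + alpha)
        acc = 0.0
        for k in range(len(counts)):
            acc += counts[k]
            if r < acc:
                counts[k] += 1  # join an existing cluster
                break
        else:
            counts.append(1)  # open a new cluster
    return counts

counts = crp(100, alpha=2.0, rng=random.Random(0))
# Only a small, finite number of clusters is used for 100 observations.
print(len(counts), sum(counts))
```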
C-statistic: A measure of the discriminative ability of a logistic regression model. The concordance (or c) statistic is a unitless index denoting the probability that a randomly selected subject who experienced the outcome will have a higher predicted probability of the outcome than a randomly selected subject who did not experience it.
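The concordance probability can be computed directly over all (case, non-case) pairs, with ties counted as one half; on the same illustrative data as above it matches the trapezoidal AUC, reflecting the numerical equivalence noted in the AUC entry:

```python
def c_statistic(y_true, probs):
    """Fraction of (case, non-case) pairs in which the case received
    the higher predicted probability; ties contribute one half."""
    cases = [p for y, p in zip(y_true, probs) if y == 1]
    controls = [p for y, p in zip(y_true, probs) if y == 0]
    concordant = 0.0
    for pc in cases:
        for pn in controls:
            if pc > pn:
                concordant += 1.0
            elif pc == pn:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))

print(c_statistic([1, 1, 0, 0], [0.8, 0.6, 0.5, 0.7]))  # → 0.75
```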
Effect modification: Occurs when the magnitude of the effect of the primary treatment or exposure on an outcome differs depending on the level of a third variable (e.g., patient characteristics). In the presence of effect modification, the use of an overall effect estimate is inappropriate.
Ensemble learning: A type of machine learning approach that combines multiple learning algorithms or models to predict an outcome to obtain better model performance than any of the individual models.
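A simple averaging ensemble illustrates the idea; because squared error is convex, the mean squared error of the averaged prediction can never exceed the average of the individual models' errors. The targets and base-model predictions below are hypothetical:

```python
def mse(preds, targets):
    """Mean squared error of a set of predictions."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

targets = [3.0, -1.0, 2.0, 0.0]
model_preds = [
    [2.5, -0.5, 3.0, 0.5],   # hypothetical model A
    [3.5, -2.0, 1.0, -0.5],  # hypothetical model B
    [2.0, -1.5, 2.5, 1.0],   # hypothetical model C
]

# Averaging ensemble: mean of the base-model predictions at each point
ensemble = [sum(ps) / len(ps) for ps in zip(*model_preds)]

avg_individual_mse = sum(mse(p, targets) for p in model_preds) / len(model_preds)
ensemble_mse = mse(ensemble, targets)

# By convexity of squared error (Jensen's inequality), the ensemble's MSE
# never exceeds the average MSE of its members.
print(ensemble_mse <= avg_individual_mse)  # → True
```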
Genome-wide association study (GWAS): An observational study of a genome-wide set of genetic variants in different individuals to examine associations of variants with an outcome or trait.
Heterogeneous treatment effects (HTE): Nonrandom variability in the direction or magnitude of a treatment effect, measured using clinical outcomes. HTE is fundamentally a scale-dependent concept and therefore, for clarity, the scale should generally be specified.
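The scale dependence can be shown with simple arithmetic: a treatment effect that is perfectly homogeneous on the relative scale is heterogeneous on the absolute scale whenever baseline risks differ. The subgroup risks below are hypothetical:

```python
# The treatment halves the outcome risk in every subgroup (constant RR),
# but absolute risk reductions differ because baseline risks differ.
relative_risk = 0.5  # hypothetical, identical in both subgroups
baseline_risks = {"high_risk": 0.40, "low_risk": 0.10}

for group, p_control in baseline_risks.items():
    p_treated = p_control * relative_risk
    arr = p_control - p_treated  # absolute risk reduction
    print(f"{group}: RR = {relative_risk}, ARR = {arr:.2f}")

# On the relative scale the effect is homogeneous (RR = 0.5 in both groups);
# on the absolute scale it is heterogeneous (ARR 0.20 vs. 0.05).
```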
Net benefit: A decision analytic measure that puts benefits and harms on the same scale. This is achieved by specifying an exchange rate based on the relative value of the benefits and harms associated with interventions. The exchange rate is related to the probability threshold used to determine whether a patient is classified as positive or negative for a model outcome, or (when applied to trial analysis) as treatment-favorable versus treatment-unfavorable.
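In the standard formulation, net benefit at probability threshold pt is the true-positive rate minus the false-positive rate weighted by the exchange rate pt / (1 − pt). A sketch with hypothetical counts:

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit at a given probability threshold: true positives
    minus false positives weighted by the exchange rate
    threshold / (1 - threshold), per patient."""
    exchange_rate = threshold / (1 - threshold)
    return tp / n - (fp / n) * exchange_rate

# Hypothetical counts: 100 patients, 20 true positives and 10 false
# positives when classifying at a threshold of 0.2
print(net_benefit(tp=20, fp=10, n=100, threshold=0.2))  # → 0.175
```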
Overfitting: A key threat to the validity of a model when predictions do not generalize to new subjects outside the sample under study. Overfitting occurs when a model conforms too closely to the idiosyncrasies or “noise” of the limited data sample on which it is derived.
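A one-nearest-neighbor classifier makes the failure mode concrete: it conforms exactly to the sample it was derived on (perfect apparent accuracy) yet generalizes poorly to new subjects because it has memorized the noise. The simulated data below are illustrative:

```python
import random

def one_nn_predict(train, x):
    """Predict using the label of the single nearest training point."""
    return min(train, key=lambda pt: abs(pt[0] - x))[1]

def make_data(n, rng):
    """Noisy binary labels: y = 1 when x > 0.5, flipped 30% of the time."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.3:
            y = 1 - y  # label noise
        data.append((x, y))
    return data

rng = random.Random(0)
train, test = make_data(50, rng), make_data(200, rng)

train_acc = sum(one_nn_predict(train, x) == y for x, y in train) / len(train)
test_acc = sum(one_nn_predict(train, x) == y for x, y in test) / len(test)

# Apparent (training) accuracy is perfect: each point is its own nearest
# neighbor, so the model reproduces the sample's noise exactly.
# Accuracy on new subjects is substantially lower.
print(train_acc, test_acc)
```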
Penalized regression: A set of regression methods, developed to prevent overfitting, in which the coefficients assigned to covariates are penalized for model complexity. Penalized regression is sometimes referred to as shrinkage or regularization. Examples of penalized regression include LASSO, ridge, and elastic net regularization.
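The shrinkage effect is easiest to see in the one-predictor, no-intercept case, where the ridge estimate has a closed form and the penalty pulls the coefficient toward zero. A minimal sketch with hypothetical data:

```python
def ridge_1d(xs, ys, lam):
    """Ridge estimate for a single no-intercept predictor: minimizes
    sum((y - b*x)^2) + lam * b^2, giving b = sum(x*y) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]

ols = ridge_1d(xs, ys, lam=0.0)     # no penalty: ordinary least squares
shrunk = ridge_1d(xs, ys, lam=5.0)  # penalized coefficient, closer to zero
print(round(ols, 3), round(shrunk, 3))
```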
Predictive analytics: A field encompassing a variety of statistical methods, including prediction modeling, machine learning, and data mining, that make use of existing data to predict future events.
Reference class: A group of similar cases that is used to make predictions for an individual case of interest. The “reference class problem” refers to the fact that there are an indefinite number of different ways to define similarity.
Regression tree-based methods: Algorithms that use a recursive partitioning approach to predict categorical (classification tree) or continuous (regression tree) outcomes.
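A single step of recursive partitioning for a continuous outcome can be sketched as follows: scan candidate thresholds on one predictor and keep the split that minimizes the combined sum of squared errors of the two resulting groups. The data are hypothetical:

```python
def sse(ys):
    """Sum of squared errors around the group mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(xs, ys):
    """One recursive-partitioning step: try midpoints between adjacent
    x values and return the threshold with the lowest combined SSE."""
    best_cost, best_t = float("inf"), None
    order = sorted(set(xs))
    for lo, hi in zip(order, order[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_cost, best_t = cost, t
    return best_t

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
print(best_split(xs, ys))  # → 3.5, separating the low and high groups
```

A full tree applies this step recursively within each resulting group until a stopping rule is met, predicting the group mean in each terminal leaf.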
Subgroup analysis: An analysis that examines whether specific patient characteristics modify the effects of treatment on an outcome.