The criteria and rating system presented in Table C-1 were adopted from Terwee and colleagues (2007) and modified to fit the committee’s process for evaluating measures of interest. For many of the measures considered, the evidence needed to evaluate each criterion was not available, so not every criterion could be applied. The committee nonetheless used the existing evidence to assess each candidate measure against as many of these criteria as possible to select the best outcome for each measure.
TABLE C-1 Committee Criteria and Rating System for Outcome Measure Evaluation
| Measurement Property | Rating* | Criteria |
|---|---|---|
| Content validity (including face validity) | + | All items refer to relevant aspects of the construct to be measured AND are relevant for the target population AND are relevant for the context of use AND together comprehensively reflect the construct to be measured |
| | ? | Not all information for ‘+’ reported |
| | − | Criteria for ‘+’ not met |
| Structural validity | + | Classical Test Theory (CTT): |
| | | Unidimensionality: Exploratory factor analysis: first factor accounts for at least 20% of the variability AND ratio of the variance explained by the first to the second factor greater than 4 OR Bi-factor model: standardized loadings on a common factor > 0.30 AND correlation between individual scores under a bi-factor and unidimensional model > 0.90 |
| | | Structural validity: Comparative fit index (CFI) or Tucker-Lewis Index (TLI) or comparable measure > 0.95 AND root mean square error of approximation (RMSEA) < 0.06 OR standardized root mean residuals (SRMR) < 0.08 |
| | | Rasch/Item Response Theory (IRT): |
| | | At least limited evidence for unidimensionality or positive structural validity |
| | | AND no evidence for violation of local independence: Rasch: standardized item-person fit residuals between −2.5 and 2.5; OR IRT: residual correlations among the items after controlling for the dominant factor < 0.20 OR Q3s < 0.37 |
| | | AND no evidence for violation of monotonicity: adequate-looking graphs OR item scalability > 0.30 |
| | | AND adequate model fit: Rasch: infit and outfit mean squares ≥ 0.5 and ≤ 1.5 OR Z-standardized values > −2 and < 2; OR IRT: G² > 0.01 |
| | | Optional additional evidence: |
| | | Adequate targeting; Rasch: adequate person-item threshold distribution; IRT: adequate threshold range |
| | | No important differential item functioning for relevant subject characteristics (such as age, gender, education), McFadden’s R² < 0.02 |
| | ? | CTT: Not all information for ‘+’ reported |
| | | IRT: Model fit not reported |
| | − | Criteria for ‘+’ not met |
| No data available | | If the element has not been tested or reported, leave the cell blank (this is different from having a negative finding). |
| Internal consistency | + | At least limited evidence for unidimensionality or positive structural validity AND Cronbach’s alpha(s) ≥ 0.70 and ≤ 0.95 |
| | ? | Not all information for ‘+’ reported OR conflicting evidence for unidimensionality or structural validity OR evidence for lack of unidimensionality or negative structural validity |
| | − | Criteria for ‘+’ not met |
| Reliability | + | Intraclass correlation coefficient (ICC) or weighted Kappa ≥ 0.70 |
| | ? | ICC or weighted Kappa not reported |
| | − | Criteria for ‘+’ not met |
| Measurement error | + | Smallest detectable change (SDC) or limits of agreement (LOA) < minimal important change (MIC) |
| | ? | MIC not defined |
| | − | Criteria for ‘+’ not met |
| Hypotheses testing | + | At least 75% of the results are in accordance with the hypotheses |
| | ? | No correlations with instrument(s) measuring related construct(s) AND no differences between relevant groups reported |
| | − | Criteria for ‘+’ not met |
| Criterion validity | + | Convincing arguments that gold standard is “gold” AND correlation with gold standard ≥ 0.70 |
| | ? | Not all information for ‘+’ reported |
| | − | Criteria for ‘+’ not met |
| Responsiveness | + | At least 75% of the results are in accordance with the hypotheses |
| | ? | No correlations with changes in instrument(s) measuring related construct(s) AND no differences between changes in relevant groups reported |
| | − | Criteria for ‘+’ not met |
NOTE: * “+” = positive rating, “?” = indeterminate rating, “−” = negative rating.
SOURCE: Modified from Terwee et al., 2007. Reprinted with permission from Elsevier.
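As an illustration only, the purely numeric thresholds in Table C-1 (internal consistency, reliability, measurement error, and hypotheses testing) could be encoded as simple decision rules. The sketch below is not part of the committee's process or of Terwee et al. (2007); the function names, the use of `None` for an unreported value, and the ASCII `"-"` standing in for the negative rating are all assumptions made for this example.

```python
from typing import Optional

# Rating symbols from the table note; "-" is an ASCII stand-in (assumption).
POSITIVE, INDETERMINATE, NEGATIVE = "+", "?", "-"


def rate_internal_consistency(alpha: Optional[float],
                              unidimensional: Optional[bool]) -> str:
    """'+' needs evidence of unidimensionality AND 0.70 <= alpha <= 0.95.

    `unidimensional` is None when evidence is missing or conflicting
    (hypothetical encoding); None or False gives '?', as in the table.
    """
    if alpha is None or not unidimensional:
        return INDETERMINATE
    return POSITIVE if 0.70 <= alpha <= 0.95 else NEGATIVE


def rate_reliability(icc_or_kappa: Optional[float]) -> str:
    """'+' when ICC or weighted Kappa >= 0.70; '?' when not reported."""
    if icc_or_kappa is None:
        return INDETERMINATE
    return POSITIVE if icc_or_kappa >= 0.70 else NEGATIVE


def rate_measurement_error(sdc_or_loa: float, mic: Optional[float]) -> str:
    """'+' when SDC (or LOA) < MIC; '?' when the MIC is not defined."""
    if mic is None:
        return INDETERMINATE
    return POSITIVE if sdc_or_loa < mic else NEGATIVE


def rate_hypotheses(n_confirmed: int, n_tested: int) -> str:
    """'+' when at least 75% of tested hypotheses are confirmed.

    Zero tested hypotheses is treated as '?' (nothing reported).
    """
    if n_tested == 0:
        return INDETERMINATE
    return POSITIVE if n_confirmed / n_tested >= 0.75 else NEGATIVE


if __name__ == "__main__":
    print(rate_internal_consistency(0.88, True))   # alpha within range
    print(rate_reliability(0.65))                  # below the 0.70 cutoff
    print(rate_measurement_error(3.0, None))       # MIC not defined
    print(rate_hypotheses(3, 4))                   # exactly 75% confirmed
```

A measure with no data for a property would simply not be passed to any of these functions, which mirrors the instruction to leave the cell blank rather than record a negative rating.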
Terwee, C. B., S. D. Bot, M. R. de Boer, D. A. van der Windt, D. L. Knol, J. Dekker, L. M. Bouter, and H. C. de Vet. 2007. Quality criteria were proposed for measurement properties of health status questionnaires. Journal of Clinical Epidemiology 60(1):34–42.