This report focuses on meaningful outcomes and the measures used to evaluate the efficacy and effectiveness of interventions, rather than on the types of assessments used to diagnose hearing loss or determine candidacy for various devices. As part of its initial work in gathering evidence on potential core outcomes and their importance to measure, the committee assembled a list of some of the most commonly used outcome measures across all the outcomes and outcome domains identified in Chapter 5 (see Appendix B). The committee used evidence of the existence and overall quality of measures to help determine the core outcome set (see Chapter 5). Then, the committee initiated a more in-depth analysis of individual measures for the core outcomes of understanding speech in complex listening situations and hearing-related psychosocial health. This chapter starts with a brief history of outcome measurement for hearing health interventions. Next, the chapter details the committee’s process and criteria for evaluating measures. Finally, the committee presents its individual analyses and recommendations regarding which measures should be used for each core outcome.
One of the earliest comprehensive protocols for the evaluation of hearing outcomes targeted veterans who returned from military service during World War II with hearing loss (Carhart, 1946). The hearing aid evaluation and fitting process included several steps and components, but two key elements were the measurement of speech understanding in noise under controlled conditions and the collection of self-report information from the hearing
aid wearer. For the next several decades, the focus of outcome measurement was confined to measuring the benefits of hearing aids using standardized measures of understanding speech in quiet and in noise. Typically, this involved obtaining percent-correct scores for standardized lists of words or sentences in quiet and in noise under controlled measurement conditions. Clinicians had to make several decisions regarding the speech materials, the speech presentation level, and the type and nature of the competing background noise. Because many standardized measures of speech understanding failed to capture the extent of (unaided) communication difficulties and the relative improvement with amplification, multiple self-report measures of hearing difficulties were developed in the 1980s (Erdman, 2014).
Walden (1997) described the rationale and procedures for a “model clinical trials protocol” to assess the benefits of hearing aid interventions. The protocol recommended that speech understanding be evaluated in the clinic at each of three speech levels, each with corresponding signal-to-noise ratios (SNRs) for that particular speech level. The speech levels and SNRs were designed to span the range likely to be encountered in everyday life (Pearsons et al., 1977). In addition to these measures of speech recognition, a self-report measure focused on speech communication in a variety of contexts, the 66-item Profile of Hearing Aid Benefit (PHAB), also was recommended (Cox and Gilmore, 1990). The model for a clinical trials protocol to evaluate the benefits of hearing aids proposed by Walden focused exclusively on speech communication, an important dimension of hearing-aid outcome, but certainly not the sole dimension of importance to those with hearing difficulties (Walden, 1997). The 66-item PHAB was quickly found to be too long for routine clinical application and an abbreviated PHAB, the 24-item Abbreviated PHAB (APHAB), was developed and evaluated (Cox and Alexander, 1995). (See later in this chapter for more on the APHAB.)
In the 1980s, additional self-report measures were developed recognizing the importance of the potential negative psychosocial consequences of hearing difficulties beyond those affecting communication directly. These include the Communication Profile for the Hearing Impaired (CPHI) (see Chapter 4), various versions of the Hearing Handicap Inventory (see later in this chapter), and the Glasgow Hearing Aid Benefit Profile (see Chapter 4). Recognizing that hearing difficulties entail more than problems with speech communication, the Speech, Spatial and Qualities of Hearing Scale (SSQ) was developed (Gatehouse and Noble, 2004). (See later in this chapter for more on the SSQ.)
Throughout the proliferation of self-report hearing-aid outcome measures in the 1980s and 1990s, few studies compared the results across outcome measures to assess the independence of each. A series of studies
sought to remedy this by obtaining multiple outcomes across several presumably independent outcome domains from a large number of adults fitted with hearing aids and then performing factor analyses on the results (Cox et al., 2007; Dillon et al., 1997; Humes, 1999, 2003; Humes and Krull, 2012; Humes et al., 2001, 2017). These analyses identified considerable redundancy among the outcome measures included, which ranged from 10 to 26 in number across studies, with 3 to 7 outcome dimensions identified in the ensuing factor analyses. The most common outcomes across all studies and analyses were clinically measured speech-understanding performance (aided only) or benefit (aided/unaided) most often measured in noise, self-reported benefit and satisfaction (which always loaded together on a single factor), and daily usage (self-reported or data logging) (Humes and Krull, 2012). When measures of hearing-related social and emotional difficulties were included, they emerged as a separate outcome domain.
Among hearing aid wearers, behavioral measures are weakly correlated with self-report measures (Cox and Alexander, 1992; Cox et al., 2007; Dornhoffer et al., 2020; Humes et al., 2017; Stenbäck et al., 2023; Walden and Walden, 2004). Dornhoffer and colleagues (2020) suggested that the reason for these low correlations is that “hearing aid users’ real-world listening environments [assessed with self-report measures] are more varied than can be predicted by simple audiologic measures [of speech understanding]” (p. 7). Overall, the evidence indicates that measures of speech communication obtained in the sound booth with standardized behavioral materials under controlled conditions and those obtained by self-report capture different aspects of speech communication.
On the surface, the self-report measures of speech communication would appear to be superior to those obtained behaviorally in the sound booth owing to their more direct connection to the everyday communication situations experienced by the individual with hearing difficulties, but these measures also are subject to bias. For example, individuals seeking treatment may want to feel good about the decision they have made or the time and money they have expended and therefore report more positive findings. A behavioral measure along with a self-report measure may provide a fuller picture of the outcome of interest, although the limited research available suggests both types of measures may be subject to placebo effects (Dawes et al., 2011, 2013). Generally, from the model clinical trial protocol of Walden (1997) to more recent comparative evaluations of hearing-aid technologies (e.g., Cox et al., 2014, 2016; Johnson et al., 2016, 2017), a combination of behavioral and self-report measures of outcomes has been implemented.
Once core outcomes have been identified, decisions must be made about how and when to measure them (Clarke and Williamson, 2016;
Gatehouse, 2000). For many outcomes, one must choose among the many available measures for that outcome. In addition, some self-report measures require separate baseline and postintervention measurements, whereas others can be administered postintervention only. For self-report measures, additional issues, such as whether the survey is administered on paper or electronically and how it is scored, must be taken into consideration. For example, the mode of administration of the Hearing Handicap Inventory for the Elderly (HHIE) can affect the scores obtained as well as the test-retest reliability (Thorén et al., 2012; Weinstein et al., 1986).
The appropriate timing of outcome measurement will depend on the context, including the type of intervention. For example, various studies show that the intervals for measurement of outcomes after hearing aid intervention range from 1 week to 3 years postintervention; these assessments most commonly occur approximately 4 to 6 weeks after the hearing aid fitting (Bentler et al., 1993a,b; Cox and Alexander, 1992; Cox et al., 2007; Cox and Rivera, 1992; Dawes et al., 2014; Dawes and Munro, 2017; Humes et al., 1996, 2002, 2003; Saunders and Cienkowski, 1997; Surr et al., 1998; Wright and Gagné, 2021). As outcomes beyond hearing and communication are examined, longer periods of hearing-aid use may be required to achieve stable outcomes (Allen et al., 2022; Mulrow et al., 1992b). Additionally, other types of interventions such as pharmaceuticals and biologics may require different intervals to show effectiveness not only of the therapeutic substance at or immediately after the time of administration but also the sustained effects of the intervention after the prescribed period of treatment ends.
As noted earlier, the committee created an inventory of measures for each of the candidate outcomes identified in Chapter 5 in order to inform its final conclusions for the core outcome set. Ultimately, the committee recommended that two outcomes should be included in the core outcome set: understanding speech in complex listening situations and hearing-related psychosocial health. For these outcomes, the committee prioritized in-depth evaluation for those measures with a sufficient amount and quality of evidence regarding their development and testing. Several measures were ruled out of consideration for a range of reasons. For example, some measures were eliminated from consideration because they are primarily used for diagnostic assessment but are not appropriate as outcome measures. Others were eliminated because they are not broadly accessible. For example, some measures are subject to copyright restrictions (e.g., CPHI) or are currently not available for purchase for clinical use (e.g., the Hearing in Noise Test [HINT]). The committee focused on the in-depth evaluation of
measures that were accessible to clinicians and researchers in both research and clinical settings.
The committee performed a comprehensive literature search for each remaining measure looking for studies related to the psychometric characteristics and scientific development of the measure. Overall, the committee considered descriptive elements of these studies for each measure (including population studied, number of participants, setting, and clinical unit of interpretation). Then, the committee assessed the strengths and weaknesses of each measure by examining evidence for two broad criteria: scientific acceptability (including reliability, validity, and sensitivity to change) and feasibility. While the committee recognizes that responsiveness (or sensitivity to change) may be viewed as a part of validity (rather than a distinct property) (Hays and Hadorn, 1992), the committee kept sensitivity to change as a separate criterion based, in part, on the criteria used in the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) initiative described in Chapter 3. (See Appendix C for the committee’s worksheet for evaluation of scientific acceptability.) The results of the individual analyses for the candidate outcome measures are presented later in this chapter. The committee notes that no single measure had robust evidence supporting all the criteria considered.
Scientific acceptability is the “extent to which the measure produces reliable and valid results about the intended area of measurement. These qualities determine whether use of the measure can draw reasonable conclusions about care in a given domain” (CMS, 2024). The definitions of the components of scientific acceptability that the committee adopted from COSMIN are displayed in Table 6-1 (COSMIN, 2024).
First, the committee evaluated the measure’s validity, or the extent to which the results of a measurement reflect the construct or outcome of interest. In other words, does the test measure what it was intended to measure? Scientific acceptability was the first criterion because a measure needed to have evidence of at least face validity to remain under consideration. The committee also looked for evidence of criterion validity, internal consistency, structural validity, areas of potential measurement error, and evidence that the measure is predictive of everyday performance.
Next, the committee evaluated the measure’s reliability, which is the degree to which the result of a measurement, calculation, or specification is repeatable or consistent. The committee looked for evidence of test-retest reliability, meaning that the results were consistent when the measure was repeated, and interrater reliability, meaning that the results were consistent between different raters (people administering the measurement).
TABLE 6-1 Definitions of Scientific Acceptability
| Measurement Property | Definition According to the COSMIN Taxonomy |
|---|---|
| Content validity (including face validity) | The degree to which the content of a measurement instrument is an adequate reflection of the construct to be measured |
| Reliability | The degree to which the measurement is free from measurement error |
| Responsivenessa | The ability of a measurement instrument to detect change over time in the construct to be measured |
| Internal consistency | The degree of interrelatedness among the items of the assessment |
| Structural validity | The degree to which the scores of a measurement instrument are an adequate reflection of the dimensionality of the construct to be measured |
| Measurement error | The systematic and random error of a patient’s score that is not attributed to true changes in the construct to be measured |
| Criterion validity | The degree to which the scores of a measurement instrument are an adequate reflection of a “gold standard” |
a Responsiveness is also known as sensitivity to change—the ability of the measure to accurately document a meaningful change in an outcome
NOTE: COSMIN = COnsensus-based Standards for the selection of health Measurement INstruments.
SOURCES: COSMIN, 2024; Prinsen et al., 2016. CC BY 4.0.
Finally, the committee considered the measure’s responsiveness or sensitivity to change. In the context of health-related patient-reported outcome measures, sensitivity to change is the “ability of an instrument to detect significant change in health status over time” (Toussaint et al., 2020, p. 395). Where possible, the committee noted the percentage of participants whose scores were at ceiling and at floor. The committee deemed it to be critical that selected measures accurately reflect the change in an outcome resulting from the intervention. (See later in this chapter for a fuller discussion of interpreting changes in measure scores [i.e., statistical approaches for evaluating a measure’s responsiveness].)
The feasibility of a measure is the extent to which the measure requires data, processes, or equipment that are readily available or easily performed without undue burden (CMS, 2024). The feasibility of a measure varies greatly by setting. The burden can fall on the patient as respondent fatigue from long questionnaires, on the clinician by requiring excessive time to administer (and score) the measure during a short appointment, or on researchers by requiring an unrealistic amount of time or money to
implement. When performing its measure analyses, the committee considered three main criteria: (1) whether the data required for the measure are readily available (or easily captured) in both research and clinical settings; (2) whether there are clear instructions for administering, scoring, and interpreting the measure; and (3) the time needed for administration and scoring. The committee considered feasibility for each measure individually, but also collectively for the battery of core outcomes to be measured in a core set.
Many measures exist for the evaluation of understanding speech in complex listening situations (primarily understanding speech in noise). The committee considered both behavioral and self-report measures. Behavioral measures capture outcomes in a controlled environment, while self-report measures capture real-world experiences. However, before proceeding to a review of the specific measures, a general overview of these measures is provided.
Historically, a wide array of measures of speech understanding has been developed and evaluated. These measures vary in the type of speech material used, with most using phonemes (distinct units of sound), nonsense syllables, words, or sentences. The measures also vary in the type of competition used, most frequently various types of noise (e.g., white noise, speech-shaped noise) or competing speech (e.g., single-talker, multiple-talker). These measures also employ a variety of response formats, with closed-set speech identification and open-set speech recognition as the most common alternatives. Finally, most of these measures fall into two general categories of procedure.
One option prescribes the speech and noise levels to be used and measures the percentage of test items correctly identified or recognized. In the other main approach, the speech or noise level is fixed, and the level of the other variable is varied adaptively to achieve a criterion level of performance. For example, the speech level may be fixed at 70 decibels (dB) sound pressure level and the competing noise or speech adjusted to bracket 50 percent correct, often referred to as the speech recognition threshold (SRT). Alternatively, the noise level may be fixed at 70 dB sound pressure level and the speech level adjusted to reach SRT. The measured SRT can be reported as the final value of the speech or noise level adjusted to achieve 50 percent-correct performance but, more often, it is reported as the relative difference in these two levels: the SNR in dB for 50 percent-correct performance.
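To make the adaptive approach concrete, the sketch below illustrates a simple one-up, one-down tracking rule with the noise level fixed and the speech level varied; it is an illustration only, not any standardized clinical test, and the simulated listener, step size, and stopping rule are assumptions chosen for the example.

```python
import math
import random

NOISE_LEVEL_DB = 70.0   # fixed competing-noise level, as in the example above
STEP_DB = 2.0           # amount the speech level moves after each trial
N_REVERSALS = 8         # stop after this many direction changes

def simulated_listener(snr_db, true_srt_db=2.0, slope=0.5):
    """Stand-in for a real trial: the probability of a correct response rises
    with SNR and equals 50 percent at true_srt_db."""
    p_correct = 1.0 / (1.0 + math.exp(-slope * (snr_db - true_srt_db)))
    return random.random() < p_correct

def adaptive_srt(start_speech_db=80.0):
    """One-up, one-down staircase on the speech level with the noise fixed;
    converges on the 50 percent-correct point, reported as an SNR in dB."""
    speech_db = start_speech_db
    previous_correct = None
    reversal_snrs = []
    while len(reversal_snrs) < N_REVERSALS:
        snr_db = speech_db - NOISE_LEVEL_DB
        correct = simulated_listener(snr_db)
        if previous_correct is not None and correct != previous_correct:
            reversal_snrs.append(snr_db)               # direction change = reversal
        speech_db += -STEP_DB if correct else STEP_DB  # harder if correct, easier if not
        previous_correct = correct
    return sum(reversal_snrs) / len(reversal_snrs)

print(f"Estimated SRT: {adaptive_srt():.1f} dB SNR")
```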
Behavioral measures of understanding speech in complex listening situations were originally developed as efficient measures for use in diagnosis,
including screening for hearing loss (e.g., Smits et al., 2006). To ensure consistency of results across time and clinics, the testing materials, conditions, and procedures have been highly standardized for each individual measure. Typically, testing is completed using earphones in a sound-treated test booth. Although desirable for consistency of results, this has often constrained the usefulness of such tests as measures of everyday hearing function. High and colleagues (1964) developed a self-report measure to capture hearing difficulties experienced in everyday life. Other self-report measures followed, many focused primarily on a detailed examination of speech communication in everyday situations (Cox and Alexander, 1995; Cox and Gilmore, 1990; Cox and Rivera, 1992; Demorest and Walden, 1984; Giolas et al., 1979; Johnson et al., 2010; Lamb et al., 1983).
In studies that have evaluated both behavioral and self-report measures of speech understanding in adults, the correlations are typically weak to moderate (roughly 0.4 to 0.6), suggesting only 16 to 36 percent shared variance between these two types of measures for the same outcome; self-report measures appear to be as sensitive as, or more sensitive than, behavioral measures to the benefits provided by hearing intervention (e.g., Cox et al., 2003; Humes, 2003; Humes et al., 2017). In these studies, which used both types of outcome measures, self-reported benefit was often significant in the absence of statistically significant differences in behavioral measures obtained in noise.
More recently, a randomized controlled trial demonstrated that amplification provided clinically significant benefit on its behavioral measure of speech-in-noise performance for 53 percent of the participants. In contrast, the APHAB global benefit scores, the trial’s self-report measure, revealed statistically significant improvements in 77 percent of the participants (De Sousa et al., 2023).
In other studies, by contrast, hearing aid benefit was evident in both self-report and behavioral outcome measures. For example, Perron and colleagues (2023) recently demonstrated statistically significant improvements in performance on the Quick Speech-in-Noise (QuickSIN) test with amplification relative to unaided performance, with a large effect size (0.53), and they also reported a significant difference in self-reported listening effort ratings that paralleled the QuickSIN differences. The clinical significance of the change must be considered for both behavioral and self-report measures. With respect to behavioral measures, McShefferty and colleagues (2016) demonstrated that, regardless of the specific SNR-based test used, the SNR change must be at least 3 dB to be detectable and at least 6 dB to be considered meaningful. Given that there are few data available that address the responsiveness of SNR-based speech-understanding measures like the QuickSIN and the Words-in-Noise (WIN) test to hearing-aid interventions, additional research is needed. Self-reported hearing difficulty and behavioral measures are correlated, although the significant individual variability in both may reduce sensitivity for measurement of intervention effects at the individual level
(see, for example, Fitzgerald and colleagues [2024] who explored relationships between QuickSIN SNR and SSQ-speech scale scores).
Conclusion 6-1: Both a behavioral measure and a patient-reported outcome measure are needed to evaluate the core outcome of understanding speech in complex listening situations. While behavioral measures have face validity, are considered by some to be more objective, and are readily available, these measures lack data on their sensitivity to change and do not necessarily correlate with the individual’s perception of their improvement.
The following sections present evidence regarding behavioral (objective) measures of understanding speech in complex listening situations considered for in-depth analysis by the committee.
Given that most behavioral measures share a limitation with respect to the effects of native language, the committee was interested in options that are language agnostic. Few such tests are available, and data are limited for the emerging measures. The one commercially available test identified was the Audible Contrast Threshold (ACT) test. The ACT is a spectro-temporal modulation detection test; it uses modulated noise to determine the amount of contrast an individual needs to identify a difference between signals (Zaar et al., 2024). The committee reviewed two articles pertaining to the development and validation of the ACT (Zaar et al., 2023, 2024). The ACT strongly predicts performance on other speech-in-noise measures. This test is unique because it is language agnostic. The test requires less than 2 minutes on average to administer and currently is included on two commercially available audiometers. The test has moderate evidence of scientific acceptability, including moderate test-retest data.
Conclusion 6-2: The ACT may be a promising measure, but it is new and lacks enough evidence to be recommended at this time.
The AzBio Sentence Test has 33 lists of 20 recorded sentences in both male and female voices (Holder et al., 2018). The individual with hearing difficulties is presented with these lists both in quiet and with a twenty-talker babble at SNRs of +10, +5, 0, −5, and −10 dB. The individual is asked
to repeat each sentence to the best of their ability. The sentences are more challenging than those used in other tests, making them potentially more reflective of the patient’s everyday experiences (Spahr et al., 2012). This test is appropriate for use in clinic and lab settings.
The committee reviewed six articles pertaining to the development and validation of the AzBio Sentence Test (Advanced Bionics LLC et al., 2011; Holder et al., 2018; Patro et al., 2024; Schafer et al., 2012; Spahr et al., 2012; Vermiglio et al., 2021). The evidence for the scientific acceptability of this test is adequate. In addition, the test-retest correlations were moderate. The AzBio requires 5 to 7 minutes to administer, which somewhat limits its feasibility. This test is most commonly used as a candidacy measure and for evaluating cochlear implants, which are outside the scope of this report. The committee concluded that the AzBio Sentence Test is not an appropriate outcome measure for evaluating the effectiveness of hearing interventions for the populations and treatments included in the scope of this report.
Conclusion 6-3: The AzBio Sentence Test has primarily been used as a candidacy measure and for evaluating cochlear implants. Additionally, the measure takes 5 to 7 minutes to administer, making it less feasible than other behavioral measures. There is insufficient evidence to support the recommendation of the AzBio as an outcome measure for evaluating the effectiveness of hearing health interventions other than cochlear implants at this time.
The Digits-in-Noise (DIN) test was designed for clinical use; it consists of digit triplets recorded by a male speaker and measures the speech reception threshold (the level of noise at which the person can no longer correctly repeat the digits). The committee reviewed 28 articles pertaining to the development and validation of the DIN test (Armstrong et al., 2020; De Sousa et al., 2020a,b, 2023; Folmer et al., 2017, 2021; Hoth, 2016; Jansen et al., 2012, 2013; Koole et al., 2016; Kwak et al., 2022; Lyzenga and Smits, 2011; Melo et al., 2022; Motlagh Zadeh et al., 2021; Oremule et al., 2024; Potgieter et al., 2015, 2018a,b; Reynard et al., 2022; Roup et al., 2018; Schimmel et al., 2024; Śliwińska-Kowalska, 2020; Smits et al., 2013; Van den Borre et al., 2021; Wang and Wong, 2024; Watson et al., 2012; Wilson and Weakley, 2004; Wright and Gagné, 2021).
The DIN takes about 2 minutes to administer and is available in over a dozen languages, making it highly feasible in clinic and research settings. Although versions of the test share the same digits, not all studies use a standardized version; the digits are recorded by different speakers and presented against different noise backgrounds across studies. The test has documented evidence of scientific acceptability, and DIN performance is strongly correlated with scores on word- or sentence-in-noise tests. It is an effective screening tool for hearing loss. The greatest concern is that the test may not be sensitive to hearing interventions, including hearing aids, but this has not been studied extensively. Rather, most research on the DIN has treated it as an assessment of hearing sensitivity, substituting for measures like the audiogram. Without additional research, the committee concluded that the DIN is not an appropriate outcome measure for evaluating the effectiveness of hearing interventions.
Conclusion 6-4: The DIN has primarily been used as a diagnostic measure. There is insufficient evidence to support the recommendation of the DIN as an outcome measure for evaluating the effectiveness of hearing health interventions at this time.
As the name implies, the QuickSIN estimates SNR hearing loss quickly (Killion et al., 2004). It consists of sets of six sentences, each containing five key words, presented at a 70 dB hearing level with a four-talker babble in the background; only repetition of the key words affects the score (Billings et al., 2023; Interacoustics, 2022; Killion et al., 2004). The level of the target speech remains constant while the level of the background babble increases by 5 dB after each sentence (Billings et al., 2023). The SNRs range from easy to difficult (i.e., 25, 20, 15, 10, 5, and 0 dB) (Interacoustics, 2022). The QuickSIN is scored as an SNR deficit relative to normative population performance, with normative performance considered to be 2 dB SNR according to the QuickSIN Instructions for Use (Interacoustics, 2022). The QuickSIN is only available in English at this time (Auditdata, n.d.).
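As an illustration of how a score of this general form can be derived from the list structure described above, the sketch below applies a Spearman-Kärber-style calculation to hypothetical word counts; the 2 dB normative value is the one cited above, but the authoritative scoring rules are those in the test’s instructions for use, not this sketch.

```python
SNRS_DB = [25, 20, 15, 10, 5, 0]   # one sentence per SNR, easiest to hardest
KEY_WORDS_PER_SENTENCE = 5
STEP_DB = 5
NORMATIVE_SNR50_DB = 2.0           # normative performance cited in the text

def snr_loss(words_correct_per_sentence):
    """Estimate the 50 percent-correct SNR for one list with a
    Spearman-Karber-style formula, then express it as a deficit
    relative to normative performance."""
    assert len(words_correct_per_sentence) == len(SNRS_DB)
    total_correct = sum(words_correct_per_sentence)
    snr50 = (max(SNRS_DB) + STEP_DB / 2
             - STEP_DB * total_correct / KEY_WORDS_PER_SENTENCE)
    return snr50 - NORMATIVE_SNR50_DB

# Hypothetical responses: 5, 5, 4, 3, 1, 0 key words correct at 25 ... 0 dB SNR
print(f"SNR loss: {snr_loss([5, 5, 4, 3, 1, 0]):.1f} dB")   # 7.5 dB
```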
The committee reviewed 15 articles pertaining to the development and validation of the QuickSIN (Bentler, 2000; Billings et al., 2023; De Sousa et al., 2023; Fitzgerald et al., 2023; Killion et al., 2004; Killion and Villchur, 1993; Kraus et al., 2011; McArdle and Wilson, 2006; McArdle et al., 2005; Mendel, 2007; Ou and Wetmore, 2020; Phatak et al., 2018; Sabin et al., 2020; Walden and Walden, 2004; Wilson et al., 2007b). The QuickSIN has strong criterion validity and correlates well with other measures of speech in noise. The test items show strong homogeneity, indicating good internal consistency. There is a lack of evidence on the reliability and consistency of the measurement over time. The minimal detectable difference is ±2.7 dB signal-to-babble ratio at a 95 percent confidence level when performance on two lists is averaged.
An analysis of the equivalency of the 18 QuickSIN lists revealed that “the psychometric functions for each list showed high-performance variability across lists for listeners with hearing loss but not for listeners with normal hearing” (McArdle and Wilson, 2006, p. 157). Furthermore, the
analysis showed that 4 lists fell outside the critical difference for listeners with hearing loss, and 9 lists “provide homogenous results for listeners with and without hearing loss.” The QuickSIN takes about 1 minute per sentence set to administer, and two sentence sets are recommended, making it highly feasible (Mueller, 2016). Although the measure is familiar to clinicians, it is only available in English.
In its review of the evidence, the committee noted weak evidence for the recommended 2-dB normative threshold value. Specifically, Killion and colleagues (2004) use the 2-dB normative threshold and attribute it to previous studies (Bentler, 2000; Killion and Niquette, 2000; Killion et al., 1996). However, the cited study by Killion and colleagues (1996) was not available to the committee because it was published only as a conference proceeding, and a review of Bentler (2000) did not yield strong supporting evidence for the 2-dB SNR as a normative reference because Bentler’s sample included only 40 adults (20 with normal hearing and 20 with sloping hearing loss, with equal numbers of males and females in the two groups).
Bentler (2000) does not explicitly report an SNR threshold, but extrapolation of the data in Figure 4 of that report would yield a 50 percent threshold of approximately 2 dB for the subset of 20 participants with normal hearing. The committee additionally reviewed Killion and Niquette (2000) for supporting evidence for the 2-dB normative threshold. Killion and Niquette (2000) referenced Killion and Villchur (1993) as the source for the 2-dB reference value for normal hearing individuals. When reviewed, Killion and Villchur (1993) was found to include case data for 6 participants (3 younger adults and 3 older adults). Another relatively small cohort (18 women, 6 men) with normal hearing was evaluated by Wilson and colleagues (2007a,b) using the QuickSIN, WIN, the HINT, and the Bamford-Kowal-Bench Speech-in-Noise (BKB-SIN) test. The study found that the 50 percent point in the psychometric data for normal hearing listeners was 3.1 dB for list 1 and 4.1 dB for list 8, with an average threshold of 3.5 dB for those two lists. More recent data from Fitzgerald and colleagues (2023) demonstrate significant variability in QuickSIN dB SNR loss scores across patients with normal audiometric thresholds. Although the QuickSIN test appears well suited for use in clinical and research settings, additional research with larger samples would be helpful in evaluating whether 2 dB is representative of a larger normal-hearing cohort.
Conclusion 6-5: The QuickSIN is a good candidate to consider for inclusion in a core outcome set.
The WIN provides a dB signal-to-babble ratio threshold based on percent correct performance during administration of prerecorded monosyllabic
words presented against a multitalker babble background (Wilson et al., 2005, 2007a). The test was originally developed with 70-word lists (Wilson and Strouse, 2002; Wilson et al., 2003). Later, to improve efficiency, the test was modified to use lists each containing 35 words (Wilson, 2003; Wilson and Burks, 2005). First, the words are presented at 84 dB hearing level with a background six-person babble at 60 dB hearing level. Every five words, the signal level drops 4 dB while the background stays constant; thus, the SNR becomes poorer (more difficult) as the test progresses. Each word is scored as correct or incorrect. A stopping rule is implemented when all words are missed at a given SNR, decreasing the total test time for patients with significant functional deficits (Wilson and Burks, 2005). The WIN takes from 4 to 6 minutes to administer and has been recommended for clinical use (Mueller, 2016; Toolbox Assessments, Inc., 2024; Wilson and McArdle, 2007).
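The descending-level structure and stopping rule described above can be summarized in a brief sketch; the SNR schedule below is derived from the levels given in the text, the scoring callback is hypothetical, and the threshold itself would be estimated from the resulting counts in much the same way as in the QuickSIN sketch earlier.

```python
# 84 dB HL words over 60 dB HL babble gives a starting SNR of 24 dB; dropping
# the words 4 dB after every 5 words yields 7 levels x 5 words = 35 words.
SNRS_DB = [24, 20, 16, 12, 8, 4, 0]
WORDS_PER_SNR = 5

def administer_win_like(score_word):
    """Present 5 words per SNR from easiest to hardest and apply the stopping
    rule: end the test once every word at a level is missed.
    `score_word(snr_db, index)` is a hypothetical callback returning True/False."""
    correct_by_snr = {}
    for snr_db in SNRS_DB:
        n_correct = sum(score_word(snr_db, i) for i in range(WORDS_PER_SNR))
        correct_by_snr[snr_db] = n_correct
        if n_correct == 0:      # all words missed at this SNR
            break
    return correct_by_snr       # untested (harder) levels are treated as zero

# Example with a hypothetical listener who misses everything below 8 dB SNR:
counts = administer_win_like(lambda snr_db, i: snr_db >= 8)
print(counts)   # {24: 5, 20: 5, 16: 5, 12: 5, 8: 5, 4: 0}
```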
The committee reviewed 17 articles pertaining to the development and validation of the WIN (Billings et al., 2023; McArdle et al., 2005; McLean et al., 2021; Mehrkian et al., 2019; Wilson, 2003, 2011; Wilson et al., 2003, 2005, 2006, 2007a,b, 2012; Wilson and Burks, 2005; Wilson and Cates, 2008; Wilson and McArdle, 2007; Wilson and Strouse, 2002; Wilson and Watts, 2012). The WIN has strong criterion validity and correlates well with other measures of speech in noise. It has been benchmarked relative to the BKB-SIN, the HINT, and QuickSIN (Wilson et al., 2007b), digit triplets in noise (Wilson et al., 2006), and the Speech Recognition in Noise Test (Wilson and Cates, 2008).
The psychometric development of the WIN was very strong, and test-retest reliability for each of the word lists is well documented, in addition to documentation of the equivalency of the difficulty of the word lists. Participants with hearing loss typically have WIN thresholds that are 7 to 10 dB poorer than participants with hearing thresholds that are 20 to 25 dB hearing level or better (Wilson et al., 2007b). Based on test-retest data and 95 percent confidence intervals for true change, the minimal detectable difference is 3.5 dB signal-to-babble ratio (Wilson and McArdle, 2007).
A shortcoming of the WIN is that much of its development was based on data from a largely male veteran population. A strength of the WIN is that it is included in the NIH (National Institutes of Health) Toolbox (Toolbox Assessments, Inc., 2024). The WIN is also available on the VA CD and comes preloaded onto some audiometers (Wilson, 2006). An additional strength of the WIN is the availability of a Spanish test form (Fox et al., 2021), although additional validation of the Spanish WIN is warranted. The NIH Toolbox for Assessment of Neurological and Behavioral Function was explicitly developed in both English and Spanish. An initial assessment of the psychometric properties of the Spanish-language version of the WIN showed that test-retest reliability was
poor relative to prior literature using the English-language version, with no statistically significant correlations across test and retest data (Fox et al., 2021). However, the sample size was small, with only 9 to 10 participants contributing test-retest data.
Conclusion 6-6: The WIN is a good candidate to consider for inclusion in a core outcome set.
The committee compared the two best candidates for a behavioral measure for the core outcome of understanding speech in complex listening situations—the QuickSIN and the WIN—using its predetermined criteria (see Appendix D for a side-by-side comparison of the psychometric evidence for the WIN and QuickSIN). Discussions of these two candidate measures considered several factors, including content validity. Word-based (WIN) and sentence-based (QuickSIN) testing in a background of competing talkers both have ecological validity for the assessment of function in a standardized representation of many everyday listening situations, but a sentence-based test is probably more representative of communication in such situations (e.g., Neal et al., 2022). In addition, some sentence-based tests draw on more top-down resources and may tap broader aspects of speech perception and listening than word-based tests (Neal et al., 2022). The committee also notes that the QuickSIN, which evaluates an individual’s ability to understand sentences, may be more familiar to audiologists and somewhat shorter to administer. However, psychometric evaluation of the QuickSIN shows limited evidence on the reliability and consistency of the measurement over time, with only moderate test-retest reliability. On the other hand, the WIN, which evaluates an individual’s ability to understand single words, has had a much more rigorous psychometric development and evaluation (including strong test-retest reliability), is currently used as part of the NIH Toolbox, and is available in Spanish (although the committee notes that additional validation of the Spanish WIN is needed).
Conclusion 6-7: The WIN is the strongest candidate for use as a behavioral outcome measure for understanding speech in complex listening situations at this time.
The following sections present evidence regarding self-report (subjective) measures of understanding speech in complex listening situations considered for in-depth analysis by the committee.
As discussed in Chapter 4, the 66-item PHAB was developed as a self-report measure of the speech-communication benefits of hearing aids in real-world situations. The measure was quickly found to be too long for routine clinical application, and an abbreviated 24-item PHAB (APHAB) was developed and evaluated (Cox and Alexander, 1995). Both the PHAB and the APHAB use a 7-point response scale that asks how frequently the respondent experiences various hearing difficulties, with responses ranging from “never (1 percent)” to “always (99 percent)” and higher scores reflecting more frequently experienced difficulties. Despite the name (which reflects the origin of the measure), both the PHAB and APHAB are applicable to all types of hearing interventions.
Factor analyses of PHAB scores led to the development of seven PHAB subscales, five pertaining to communication difficulties in a variety of listening conditions (i.e., ease of communication, familiar talkers, background noise, reverberation, and reduced cues) and two pertaining to the distortion and aversiveness of environmental sounds (Cox and Alexander, 1995). The APHAB retained four of the original seven PHAB subscales, but each scale was limited to 6 items; three scales pertain to communication difficulties (ease of communication, background noise, reverberation) and one focuses on the aversiveness of environmental sounds. Several subsequent studies found the communication subscales of the APHAB to be strongly correlated, and these subscales are often averaged to form a single APHAB-global score (Chisolm et al., 2005; Dornhoffer et al., 2020; Kochkin, 1997; Sabin et al., 2020). The APHAB-global scores are reliable and sensitive to change (from the use of hearing aids) (Chisolm et al., 2005; Cox and Alexander, 1995).
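As a simple illustration of this scoring, the sketch below averages item-level frequency percentages within each communication subscale and then averages the three subscale scores into a global score; the endpoint percentages come from the description above, but the intermediate category-to-percentage mapping is an assumption for illustration, and the published scoring instructions govern actual use.

```python
# Illustrative mapping of the 7 response categories to frequency percentages;
# the 1 percent and 99 percent endpoints come from the text, and the
# intermediate values are assumptions for this sketch.
RESPONSE_TO_PERCENT = {
    "always": 99, "almost always": 87, "generally": 75, "half the time": 50,
    "occasionally": 25, "seldom": 12, "never": 1,
}

def subscale_score(item_responses):
    """Mean frequency percentage across a subscale's items
    (higher = more frequently experienced difficulty)."""
    return sum(RESPONSE_TO_PERCENT[r] for r in item_responses) / len(item_responses)

def aphab_global(ec_items, bn_items, rv_items):
    """Global score = mean of the ease-of-communication, background-noise,
    and reverberation subscale scores."""
    return (subscale_score(ec_items) + subscale_score(bn_items)
            + subscale_score(rv_items)) / 3
```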
Although, as noted, the shorter APHAB is widely used clinically, the longer PHAB has been used frequently in clinical research, including several clinical trials comparing technologies (Haskell et al., 2002; Walden, 1997) and recent randomized controlled trials evaluating fitting methods (Humes et al., 2017; Sabin et al., 2020). The test-retest correlation was estimated to be approximately 0.85 for the PHAB- and APHAB-global baseline scores (Cox and Gilmore, 1990; Cox and Rivera, 1992; Cox and Alexander, 1995).
Overall, the committee examined nine articles pertaining to the development and validation of the PHAB and APHAB (Chisolm et al., 2005; Cox and Alexander, 1995; Cox and Gilmore, 1990; Cox and Rivera, 1992; Dornhoffer et al., 2020; Kam et al., 2011; Kochkin, 1997; Löhler et al., 2017; Sabin et al., 2020). The measure has strong psychometric evidence of reliability and validity, and there is some evidence for its sensitivity to change. The APHAB is also available in at least 20 languages (Srinivasan and O’Neill, 2023). The APHAB requires around 10 minutes or less to complete.
This is feasible for the research context but is a time commitment in the clinic; however, since it is a self-report measure, it can be completed by the patient prior to an appointment to save time. As noted, the measure has good test-retest reliability for ease of communication, reverberation, and background noise (the scales comprising the global score), and is sensitive to change, making the APHAB-global score a good candidate given its focus on the core outcome, strong psychometrics, and relative feasibility.
Conclusion 6-8: The APHAB-global score is a good candidate to consider for inclusion in a core outcome set.
The SSQ49 is a 49-item questionnaire “designed to measure a range of hearing disabilities” including speech, spatial ability, and qualities of hearing (Gatehouse and Noble, 2004, p. 1). The SSQ49 is meant to reflect real-world hearing performance by assessing the ability to segregate sounds and understand simultaneous speech (Gatehouse and Noble, 2004). Unlike other measures that focus solely on the measurement of hearing and communication ability, the SSQ49 additionally assesses spatial hearing via ratings of ability to detect the direction, distance, and movement of speech. Of the 49 items in the original SSQ, 14 pertained to speech understanding in a variety of situations, 17 assessed spatial hearing and sound localization, and the remaining 18 items assessed various hearing abilities including sound segregation, music and voice identification, and sound source identification.
Although the developers did not perform a factor analysis of the item scores, they did note that many of the SSQ items were intercorrelated both within and across subscales (Gatehouse and Noble, 2004). Akeroyd and colleagues (2014) subsequently administered the 49-item SSQ to 1,220 adults and performed a factor analysis. Three factors were identified in that analysis, each largely supporting one of the three scales of the SSQ. Importantly, oblique rotation of factors was used, which allows for correlations among the factors in the solution; the authors reported interfactor correlations ranging from about 0.5 to 0.7. The results of this factor analysis, including the interfactor correlations, were replicated using a French version of the measure (Moulin et al., 2015). The presence of moderate to high interfactor correlations suggests considerable shared variance among the three scales of the SSQ. Consistent with this, Humes and colleagues (2013) reported that a single factor emerged in their factor analysis of the full SSQ.
To improve the efficiency of the SSQ, a variety of shortened versions have been developed and evaluated, including a 5-item screener (SSQ5) and 12-item version (SSQ12) (Demeester et al., 2012; Noble et al., 2013).
The SSQ12 was developed as a more feasible alternative to capture spatial hearing and real-world hearing more quickly, making it well suited for clinical practice; the measure is only 12 questions long and is well correlated with the SSQ49 (Noble et al., 2013). A limitation of the SSQ12 is that the subscales of the measure are not preserved; the SSQ12 yields a single overall score. Additionally, validation is limited because the 12 items were extracted from the full survey.
A newer version (French SSQ15, or 15iSSQ) contains 5 questions on each of the three subscales (Moulin et al., 2019). This version has been psychometrically evaluated on its own, and the evaluation verified that the scale scores, as well as the total score, are reliable and valid in adults with hearing difficulties (Moulin et al., 2019). The French SSQ15 is distinct from a German SSQ15 (Kiessling et al., 2011); the German SSQ15 does not have three distinct factors,1 whereas the French SSQ15 does (Moulin et al., 2019).
The committee reviewed 10 articles pertaining to the development and validation of the SSQ49, SSQ12, and SSQ15 (Akeroyd et al., 2014; Banh et al., 2012; Demeester et al., 2012; Fitzgerald et al., 2024; Gatehouse and Noble, 2004; Moulin et al., 2019; Noble et al., 2012, 2013; Singh and Pichora-Fuller, 2010; Wyss et al., 2020). Six additional articles related to the validation of the SSQ49 were reviewed (Motlagh Zadeh et al., 2021; Sanchez-Lopez et al., 2022; Saxena et al., 2022; Srinivasan and O’Neill, 2023; Stenbäck et al., 2023; Utoomprurkporn et al., 2021). Major strengths of the SSQ are that the psychometric structure of both the original 49-item survey and the French 15-item survey have been carefully studied with cluster and factor analysis. The tests have good test-retest reliability, and while the tests are the most reliable when administered in interview format, reliability is still excellent when administered using paper and pencil, computer, or online formats. Use of an interview format takes more time. The test is sensitive to change and has been used to measure the effects of hearing aid and cochlear implant interventions on speech in noise and other hearing subscales.
The SSQ49 and SSQ12 are available in multiple languages including Colombian Spanish (Sanchez-Lopez et al., 2022), Turkish (Kılıç et al., 2021), Dutch (Batthyany et al., 2023), Brazilian Portuguese (Aguiar et al., 2019), Iranian (Lotfi et al., 2016), Norwegian (Myhrum et al., 2024), French (Moulin and Richard, 2016; Moulin et al., 2015), Romanian (Radulescu et al., 2024), Spanish (Cañete et al., 2022), Arabic (Alkhodair et al., 2021), and Chinese (Meng et al., 2024). The rigor of the translation and validation process was not considered as part of the review process; compliance with best practices for translating and adapting hearing-related questionnaires
___________________
1 The committee was unable to review the German SSQ15 because the full study was not available in English (Kiessling et al., 2011).
will need to be considered if foreign language versions are used (Hall et al., 2018).
Conclusion 6-9: The SSQ is a good candidate to consider for inclusion in a core outcome set.
The committee compared the two best candidates for a self-report measure for the core outcome of understanding speech in complex listening situations—the APHAB and the SSQ—using its predetermined criteria. Both measures have good evidence for reliability and validity, with some evidence for sensitivity to change, and both are available in multiple languages. One of the brief versions of the SSQ would be more feasible to administer than the full 49-item SSQ, and because the French SSQ15 maintains a 5-item SSQ-Speech score, it would appear to be the best candidate for use as an outcome measure among the available SSQ measures. Compared with the APHAB-global, however, much less information is available for the French SSQ15 at this time. Both the APHAB and the SSQ can take a relatively long time to administer, but as self-report measures, they could be filled out in advance by the patient. They also would not necessarily have to be administered at every visit. Overall, the 18-item APHAB-global score focuses solely on scales related to complex listening scenarios, which is the core outcome. The SSQ, on the other hand, incorporates a mix of speech, spatial location, and sound quality items into its final score, and the shorter 5-item SSQ-Speech score requires further psychometric evaluation as an outcome measure.
Conclusion 6-10: The APHAB-global score is the strongest candidate for use as a self-report outcome measure for understanding speech in complex listening situations at this time.
Conclusion 6-11: The SSQ is a good candidate to consider for supplemental measurement (beyond the core set) when sound quality and localization are also of interest.
The following sections present evidence regarding self-report measures of hearing-related psychosocial health considered by the committee. The committee considered many measures but ultimately focused on variations of the Hearing Handicap Inventory (HHI).
The HHIE was the original 25-item self-report version of the HHI (Ventry and Weinstein, 1982, 1983). The HHIE measures emotional consequences and social/situational effects of hearing difficulties in adults 65 years of age and older. Soon after the original HHIE was published, a brief 10-item screening version, the HHIE-S, was developed (Weinstein and Ventry, 1983). Subsequently, the original measure was modified to develop the HHI for adults (HHIA), which targeted adults under the age of 65 years (Newman et al., 1990). Emotional and social subscales were recommended for both the HHIE and the HHIA based primarily on the content validity of the test items. After eliminating 9 of the original 25 items of the HHIE to optimize measurement properties, Heffernan and colleagues (2020) found the remaining 16 items of the shortened HHIE to be unidimensional.
In 2020, two detailed psychometric item analyses of the HHIE and HHIA were published, each recommending an overall reduction in the number of questions (Cassarly et al., 2020; Heffernan et al., 2020). Cassarly and colleagues (2020), using a large community-based convenience sample, referred to their new 18-item scale as the Revised Hearing Handicap Inventory (RHHI) and developed a shorter 10-item screener as well (RHHI-S). Social and emotional subscales were not supported for the abbreviated HHI measures in either analysis. More evaluations of the RHHI measures have been completed since 2020 (Dillard et al., 2024a,b; Humes, 2021).
The committee reviewed 25 articles pertaining to the development and validation of the HHIE and HHIA (Chisolm et al., 2005; Dillon et al., 1997; Heffernan et al., 2020; Humes, 2021; Humes et al., 1996, 2001, 2002, 2003, 2017; Jerger et al., 1996; Malinoff and Weinstein, 1989; McArdle et al., 2005; Mulrow et al., 1990a,b, 1992a,b; Newman et al., 1990; Newman and Weinstein, 1988, 1989; Öberg et al., 2007; Stark and Hickson, 2004; Taylor, 1993; Ventry and Weinstein, 1982; Weinstein et al., 1986; Weinstein and Ventry, 1983), 8 articles pertaining to the development and validation of the HHIE-S (Humes, 2021; Lichtenstein et al., 1988; Lin et al., 2023; Mulrow et al., 1990b; Newman et al., 1991; Sanchez et al., 2024; Tomioka et al., 2013; Ventry and Weinstein, 1983), and 3 articles pertaining to the development and validation of the RHHI or RHHI-S (Cassarly et al., 2020; Dillard et al., 2024a,b).
In terms of administration efficiency, the HHIE requires about 10 minutes to complete, compared with the HHIE-S, which takes 2–3 minutes; the RHHI, which takes 5–7 minutes; and the RHHI-S, which takes about 2–3 minutes.
Generally, the HHIE had the most evidence supporting sensitivity to change, including three randomized controlled trials, and both the HHIE and the HHIE-S had considerable evidence supporting adequate test-retest
reliability. On the other hand, the RHHI and RHHI-S are probably the most psychometrically sound based on the item analyses by Cassarly and colleagues (2020). Given that the RHHI-based measures are a verbatim subset of original HHIE items, it is assumed here that the sensitivity to change and the test-retest reliability observed for the HHIE-based measures apply to the corresponding RHHI-based measures.
Overall, despite the absence of direct information on the reliability and sensitivity of the RHHI-based measures, the RHHI is the strongest candidate for use as an outcome measure for hearing-related psychosocial health at this time. Generally, a self-report measure comprising 18 items would be expected to be somewhat more reliable than the 10-item RHHI-S. Further, given that the minimum detectable difference is closely tied to test-retest reliability, the RHHI would be expected to be more sensitive to postintervention change than the shorter 10-item RHHI-S. Finally, to the extent that assessment of the measure relies primarily on data regarding the reliability and sensitivity to change reported for the 25-item HHIE, it is perhaps more appropriate to assume these results apply to the 18-item RHHI, whose 18 items represent 72 percent of the original 25 HHIE items.
Conclusion 6-12: The RHHI is the strongest candidate for use as a self-report outcome measure for hearing-related psychosocial health at this time.
Strengthening hearing health care outcome measurement will require research in several areas and various new approaches, including better research on evaluating a measure’s responsiveness, linking currently available measures, and using item response theory.
One of the key challenges the committee faced when assessing measures was determining the responsiveness of these measures. Responsiveness (also known as sensitivity to change) can be thought of as the degree to which a measure accurately documents a meaningful change in an outcome, or, in the words of the COSMIN initiative, responsiveness is “the ability of an [health-related patient-reported outcomes] instrument to detect change over time in the construct to be measured” (Mokkink et al., 2010, p. 742). The general consensus is that the best way to assess responsiveness is with a longitudinal study in which at least some of the patients are known to change on the construct of interest (Crosby et al., 2003).
The measurement literature commonly discusses two complementary types of methods for evaluating the responsiveness of a measure, one of which is based only on the data accumulated on the measure of interest, while the other evaluates those data relative to an outside standard. The first type, distribution-based methods, expresses change scores “in terms of an underlying sampling distribution, whether in between-person standard deviation units, within-person standard deviation units, or some variation of the standard error of measurement” (Haley and Fragala-Pinkham, 2006, p. 737), which is an absolute reliability coefficient that quantifies the consistency of measured values in the same units of the original measurement. Distribution-based methods “are based on statistical significance, sample variability, and measurement precision. In contrast, anchor-based approaches require an external, independent standard to ‘anchor’ the meaning of clinical importance, one that is itself interpretable and at least moderately correlated with the test or measure of interest” (Haley and Fragala-Pinkham, 2006, p. 737).
As noted above, distribution-based methods rely on statistical evaluations of the data of interest, without bringing in outside values. To evaluate the size of the effect from a particular treatment, for instance, the typical approach would be to compare the average before-treatment and after-treatment scores of a group of patients, with the effect size being the difference of the two means (“after” minus “before”), with this number divided by a standardizing value intended to account for the choice of scaling. (For instance, changes in effect scored on a scale of 0 to 10 would appear to be twice the size of changes in effect scored on a scale of 0 to 5 even if the absolute change in effect was exactly the same.) In other words, the effect size would be expressed as a fraction whose numerator is the difference between the two means and the denominator is a standardizing factor, typically a standard deviation of one of the distributions under consideration (de Vet et al., 2011). Cohen (1988) defined the “standardized effect size” as being an effect size whose denominator is the standard deviation of the collection of scores on the outcome measure at baseline. Cohen also suggested how to interpret effect sizes: 0.20 should be considered a small effect size, 0.50 a moderate effect size, and 0.80 and greater should be considered a large effect size. A decade later Samsa and colleagues (1999) carried out a comprehensive literature review and, based on its results, suggested that an effect size of 0.20 should be considered the hallmark of a minimal clinically important difference.
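In symbols (notation introduced here for clarity rather than drawn from the sources cited), the standardized effect size for a pre- to post-treatment comparison is

```latex
\mathrm{ES} = \frac{\bar{x}_{\mathrm{post}} - \bar{x}_{\mathrm{pre}}}{\mathrm{SD}_{\mathrm{baseline}}},
```

where the numerator is the mean change from baseline to follow-up and the denominator is the standard deviation of the baseline scores, with values of roughly 0.20, 0.50, and 0.80 conventionally read as small, moderate, and large effects.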
Samsa and colleagues (1999) argued that effect sizes calculated with distribution-based methods can be used effectively in determining when treatments result in clinically important differences (CIDs). “First, the effect size
approach is efficient—implementation merely requires: (1) selecting the effect size benchmark; and (2) estimating the standard deviation . . . relevant to the population under study. This standard deviation can usually be obtained from the literature or extant data bases—if not, a small observational study (using a group which is representative of the trial’s target population) should suffice” (Samsa et al., 1999, p. 144). Second, researchers in a variety of areas have used effect size benchmarks in their work, with reasonable results. Samsa and colleagues (1999) also noted that Cohen’s suggestion for the cutoffs for small, moderate, and large effect sizes was based on the analysis of data distributions for a large variety of characteristics. However, the researchers ultimately concluded that the ideal approach would be to use effect sizes for an initial estimate of the CID but then to check that result using at least one anchor-based method (Samsa et al., 1999).
If one is to use effect size to evaluate the effectiveness of a treatment, it is important to keep in mind some of the method’s limitations. One of the most important of these limitations is, as Crosby and colleagues (2003) noted, the way that the “characteristics of the distribution, particularly at baseline, may strongly influence the effect size” (p. 400). For example, the more heterogeneous a sample is at baseline, the larger the standard deviation of that sample—and the smaller the effect size—will be. “Thus, the same amount of individual change produces different effect sizes depending upon the heterogeneity of the sample at baseline” (Crosby et al., 2003, p. 400).
Another way to measure response to treatment, closely related to the effect size, is the standardized response mean (SRM), which is also known as the efficacy index or the responsiveness–treatment coefficient. Like the effect size, the SRM is a ratio whose numerator is the mean change in a measure, but its denominator, instead of being the standard deviation of the baseline measurement, is the standard deviation of the changes in the measure in the population that was studied. Thus, the SRM takes into account the variation in the measured changes; it is smaller when that variation is larger and larger when that variation is smaller. Researchers have suggested assessing SRM values in the same way that effect size values are typically assessed, with 0.20 representing a small change, 0.50 a medium change, and 0.80 a large change. While the SRM, by taking into account the variation in the measured changes, can be a useful way to characterize responses in a population, it is less valuable for individuals, as an individual’s SRM will vary depending on how heterogeneous the population response is (Crosby et al., 2003).
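The contrast with the effect size can be sketched in the same way; the calculation below reuses the hypothetical change scores from the sketch above and divides their mean by the standard deviation of the changes rather than by the baseline standard deviation.

```python
import numpy as np

# The same hypothetical change scores as in the effect-size sketch above.
changes = np.array([8, 6, 4, 11, 5, 7, 7, 7], dtype=float)

# SRM: mean change divided by the SD of the changes (not the baseline SD).
srm = np.mean(changes) / np.std(changes, ddof=1)
print(f"Standardized response mean: {srm:.2f}")
# The more variable the individual responses, the larger the denominator
# and the smaller the SRM.
```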
The standard error of measurement (SEM) offers an alternative approach to determining clinically meaningful differences—and one that is relatively independent of the population sample being measured. The SEM, which provides a measure of how precise a given test or measure is for a given sample, is a function of both the standard deviation of that sample and the sample’s reliability coefficient (Crosby et al., 2003). Both values
vary according to the sample, but because the relationship between the two remains relatively stable from sample to sample, the SEM also remains relatively constant across samples. Thus, as Crosby and colleagues note, “the SEM is considered to be an attribute of the measure and not a characteristic of the sample per se” (Crosby et al., 2003, p. 402). Different authors have suggested different SEM threshold values as indicating clinically meaningful differences, including 1 SEM (Wolinsky et al., 1998), 1.96 SEM (McHorney and Tarlov, 1995), and 2.77 SEM (McHorney and Tarlov, 1995; Wyrwich et al., 1999). Among the limitations of using the SEM approach is that it is based on the assumption that measurement error does not vary from score to score and thus the SEM remains fixed across scores—an assumption that is not borne out in practice (Crosby et al., 2003).
The SEM is used in the calculation of the minimal detectable change (MDC), which is, as the name suggests, the smallest change that can be detected by a measure, given the measurement error of the instrument being used to make the measurements. It is referred to by a variety of other names as well, including the smallest detectable change, the smallest real difference, and the reliable change index (Beckerman et al., 2001; Streiner et al., 2015). The MDC can be calculated in a variety of ways and will depend not only on the SEM but also on the desired confidence level. For a 95 percent confidence level, for instance, MDC = 1.96 × √2 × SEM (de Vet et al., 2006). This quantity has often been referred to in audiology as the “95 percent critical difference” (Demorest and Walden, 1984; Demorest and Erdman, 1988).
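As an illustration, the sketch below combines the two quantities just discussed: under classical test theory the SEM is commonly estimated as the sample standard deviation multiplied by the square root of one minus the reliability coefficient, and the 95 percent MDC then follows from the formula above. The standard deviation and reliability values shown are hypothetical.

```python
import math

# Hypothetical values for a measure: sample SD and test-retest reliability.
sd = 8.9
reliability = 0.85

sem = sd * math.sqrt(1 - reliability)   # standard error of measurement
mdc_95 = 1.96 * math.sqrt(2) * sem      # 95% minimal detectable change (de Vet et al., 2006)

print(f"SEM = {sem:.2f}; 95 percent MDC = {mdc_95:.2f}")
# Note that 1.96 * sqrt(2) is approximately 2.77, the "2.77 SEM" threshold cited above.
# Changes smaller than mdc_95 cannot be distinguished from measurement error.
```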
Generally speaking, distribution-based methods of assessing responsiveness have a number of limitations. As noted above, for instance, the effect size varies according to the heterogeneity of the baseline sample; thus, a change of a given magnitude may be considered important if it is observed in a homogeneous study population but unimportant if it is observed in a heterogeneous study sample. Similarly, the SRM varies with heterogeneity in the population response. Furthermore, while distribution-based methods can reliably detect change over time in a measure, “these methods do not in themselves provide a good sense of the clinical relevance of the change” (Crosby et al., 2003). Thus, they may require additional comparisons to interpret their clinical meaning.
Crosby and colleagues (2003) concluded that “[t]he distribution-based measures that seem most promising for establishing clinically meaningful change are those based on the measurement precision of the instrument,” such as the SEM (p. 402).
A major advantage of anchor-based methods over distribution-based methods is that the use of an anchor in the method makes it possible to look
for changes that are clinically relevant or important to the patient rather than simply being statistically significant; in particular, the anchor is chosen to reflect the changes that patients and clinicians find to be most valuable or important to study. One important use of anchor-based methods is to determine the “minimal clinically important difference” (MCID), which was defined by Jaeschke and colleagues (1989) as the “smallest difference in score in the domain of interest which patients perceive as beneficial and which would mandate, in the absence of troublesome side effects and excessive cost, a change in the patient’s management” (p. 408; see also Zhang et al., 2023). The MCID differs from the MDC in that there is no consideration in the MDC that a change is clinically important—only that it can be detected—whereas the MCID is specifically focused on changes that are both detectable and clinically relevant.
Researchers have used various types of anchor-based approaches in establishing an MCID, and there is no consensus on which is the best to use. One anchor-based approach that is commonly used in longitudinal studies to establish which changes are clinically meaningful is the global rating of change scale, which has patients rate their degree of overall health improvement or deterioration over time (Kamper et al., 2009). The mean change method determines the MCID by calculating the mean change in score on the relevant measure among a group of patients who have rated themselves as “a little better” on a global rating of change (Dekker et al., 2024).
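A minimal sketch of the mean change method, using hypothetical change scores and global ratings, is shown below; the MCID estimate is simply the average change among those who rated themselves “a little better.”

```python
import numpy as np

# Hypothetical paired data: each patient's change score on the outcome measure
# and that patient's global rating of change.
changes = np.array([12.0, 3.5, 6.0, 1.0, 8.5, 5.0, 0.5, 7.0])
ratings = ["much better", "a little better", "a little better", "unchanged",
           "much better", "a little better", "unchanged", "a little better"]

# Mean change among patients reporting minimal but perceptible improvement.
little_better = [c for c, r in zip(changes, ratings) if r == "a little better"]
mcid_estimate = float(np.mean(little_better))
print(f"Anchor-based MCID estimate: {mcid_estimate:.1f}")
```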
Crosby and colleagues (2003) offered the following conclusion about the use of anchor-based approaches in determining the responsiveness of a measure. Their conclusion dealt specifically with measures of health-related quality of life but also applies to hearing measures:
The clear advantage of anchor-based approaches is that change in [health-related quality of life] is linked to a meaningful external anchor. Lydick and Epstein (1993) have likened this approach to establishing the construct validity of a measure. The most notable advantage of global ratings, particularly patient ratings, is that they provide the single best measure of the significance of change from the individual perspective. They also have the potential to take into account more information (e.g., other life circumstance) that may affect [health-related quality of life] than other methods for assessing clinically meaningful change. (p. 399)
From the perspective of what the clinician, rather than the individual patient, finds most meaningful, Crosby and colleagues (2003) recommended the use of clinician global ratings and longitudinal disease-related measures of outcome.
Anchor-based methods that depend on global ratings have a number of limitations. One of these is that they depend upon patient reporting, which
can be biased in various ways, including by inaccurate memory (recall bias). Since memory tends to become less accurate with time, recall bias is a particular issue when there is an extended period of time between the measures. Thus, researchers who use global ratings of change need to be careful to assess how reliable the ratings are (Crosby et al., 2003).
In light of the different types of limitations that affect distribution-based and anchor-based measures, Crosby and colleagues (2003) recommended an integrated approach that uses information from both approaches to arrive at a reliable evaluation of clinically meaningful change. When the two approaches provide answers that agree relatively well with one another, researchers can have reasonable confidence in that answer, and when the two approaches disagree, researchers need to look more closely to determine which answer should be judged more reliable.
The committee particularly notes that MCIDs do not currently exist for the RHHI, APHAB-global, QuickSIN, SSQ, and WIN. Distribution-based MDCs, in the form of 95 percent critical differences, however, have been published for the parent HHI measure of the RHHI, the HHIE (Weinstein, 1986). They also have been published separately for the scales that make up the APHAB global score, but not for the APHAB global outcome measure itself (Cox and Alexander, 1995). In addition, all MDCs published for these outcome measures have been presented as a single value to be applied across all baseline scores, a practice that is not supported by research (Crosby et al., 2003).
For hearing-aid patient-reported outcome measures, MDCs or critical differences most often have been reported as single values, and this holds for the PHAB-based measures reviewed here. For the APHAB measures, Cox and Alexander (1995) reported 95 percent critical differences of 26 percent for both unaided and aided APHAB scores and 33 percent for the APHAB benefit score, for all individual subscales. These values were subsequently confirmed by Haskell and colleagues (2002). For the APHAB global score, Chisolm and colleagues (2005) reported 95 percent critical-difference values of 17.8 percent and 15.9 percent for test–retest intervals of 2 or 10 weeks, respectively. Much additional research is needed to establish MDCs and MCIDs for the recommended outcome measures.
Conclusion 6-12: Statistical approaches such as distribution-based methods and anchor-based approaches could help develop more robust evidence on the sensitivity to change of various outcome measures relative to hearing interventions.
Linking is a statistical process that makes it possible to directly compare two measures that are similar but not identical; their differences can be of various types—e.g., the measures may have different content or different construct severity levels (Brady et al., 2022). To carry out a linking analysis, one first determines a “target measure” and an “anchor measure” to which the target measure will be linked. The two measures should be quantifying essentially the same construct. Next, using a statistical analysis one establishes a relationship between the target measure and the anchor measure that maps scores on the target measure onto equivalent scores on the anchor measure. The resulting links between scores on the two measures are referred to as crosswalks, and they make it possible to compare results from different measures of similar constructs. For example, linking might allow one clinician to use the QuickSIN and another to use the WIN and still be able to compare their scores. Similarly, linking could provide the ability to use any of the HHI variations.
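One simple form of linking is linear (mean-sigma) linking, in which target-measure scores are rescaled so that their mean and standard deviation match those of the anchor measure; the fitted mapping can then be tabulated as a crosswalk. The sketch below uses hypothetical score distributions and is intended only to show the shape of the calculation; an operational crosswalk (for example, between the QuickSIN and the WIN) would require a formal linking study and may use more sophisticated methods such as equipercentile linking.

```python
import numpy as np

# Hypothetical scores from one sample that completed both measures of the same construct.
target_scores = np.array([4.0, 7.5, 10.0, 12.5, 16.0, 18.5])  # measure to be linked
anchor_scores = np.array([2.0, 5.0, 8.0, 11.0, 14.0, 17.0])   # measure providing the scale

# Linear (mean-sigma) linking: choose slope and intercept so that linked target
# scores have the same mean and standard deviation as the anchor scores.
slope = np.std(anchor_scores, ddof=1) / np.std(target_scores, ddof=1)
intercept = np.mean(anchor_scores) - slope * np.mean(target_scores)

def crosswalk(target_score: float) -> float:
    """Map a score on the target measure onto the anchor measure's scale."""
    return slope * target_score + intercept

print(f"Anchor-scale equivalent of a target score of 10: {crosswalk(10.0):.1f}")
```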
Conclusion 6-13: Linking could be useful to determine comparable scores among various outcome measures for hearing interventions.
Item response theory (IRT) is an approach used both in the design of tests and in the analysis and scoring of the responses on tests. IRT differs from classical test theory in its focus on how an individual answers individual items on a test rather than on the overall score from the test. An individual’s response on a particular item is assumed to be a function both of that individual’s overall latent ability or performance on the construct the test was designed to measure and of the characteristics of the particular item. In particular, the theory does not assume that each item in a measure is equally difficult for an individual to answer or equally informative about the construct being tested. Instead, the theory focuses on how individual items on a test contribute to measuring a construct rather than just looking at an overall test score.
IRT allows for more precise measurement by considering the difficulty of each item and the individual’s ability level, making it possible to compare different versions of the same measure even if they contain different sets of questions. An example of the application of IRT in test development can be found in the design of computer adaptive testing (Benton, 2021). Computer adaptive testing is an administration method in which test software selects and adjusts the difficulty of the questions presented to an individual based on that person’s responses to previous questions: correct answers lead to more difficult questions, while incorrect answers lead to less difficult questions. By tailoring the difficulty of the questions to an individual’s performance, computer adaptive testing can make the assessment more efficient to administer. It should be noted that a version of IRT was applied to the development of the RHHI, one of the core outcome measures (Cassarly et al., 2020).
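For readers unfamiliar with the mechanics, the sketch below illustrates one common item response model, the two-parameter logistic (2PL) model, which is not necessarily the model used for any particular hearing measure. Each item has its own difficulty and discrimination parameters, and an adaptive test can select the item that is most informative at the examinee’s current ability estimate; the item bank and ability value shown are hypothetical.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability of a positive response given ability theta,
    item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical item bank: (discrimination a, difficulty b) for each item.
items = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.5), (1.0, 1.5)]

theta_hat = 0.3  # current ability estimate for this examinee
# Adaptive administration: present the item with maximum information at theta_hat.
next_item = max(range(len(items)), key=lambda i: item_information(theta_hat, *items[i]))
a, b = items[next_item]
print(f"Next item: {next_item} (P of a positive response = {p_correct(theta_hat, a, b):.2f})")
```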
Conclusion 6-14: Item response theory may be useful in making outcome measurement more precise, making the administration of measures more efficient, and allowing for the comparison of different versions of the same measure.
Finding 6-1: The committee used two overarching criteria to assess measures: (1) scientific acceptability (including reliability, validity, and sensitivity to change) and (2) feasibility (including burden on individuals, clinicians, and researchers).
Finding 6-2: Many measures have been used primarily for diagnostic assessment, but not for outcome measurement.
Finding 6-3: Some measures are not broadly accessible.
Finding 6-4: No measure had robust evidence for all criteria examined by the committee.
Finding 6-5: The ACT is a newer, language-agnostic measure that strongly predicts other speech-in-noise measurements. However, there is limited evidence on the measure.
Finding 6-6: The AzBio Sentence Test has limited evidence for its development and validation. The test is most commonly used as a candidacy measure and for evaluating cochlear implants.
Finding 6-7: The DIN test focuses on repeating digits (rather than words or sentences), takes about 2 minutes to administer, and is available in multiple languages. The test is primarily used as a screener for hearing loss, and most research has used it as a substitute for measures such as the audiogram.
Finding 6-8: The QuickSIN takes about a minute to administer and is familiar to clinicians. However, there is a lack of evidence on the reliability of the measure and consistency of the measurement over time, and the test is only available in English.
Finding 6-9: The WIN takes 4–6 minutes to administer. There is a significant amount of psychometric data supporting the quality of the measure: test–retest reliability is well documented for each of the word lists, and the word lists have been documented to be equivalent in difficulty. A Spanish version is available, but it has not been sufficiently validated.
Finding 6-10: The APHAB is a self-report measure that includes three subscales pertaining to communication difficulties in various situations (familiar talkers, background noise, reverberation) and one subscale on the aversiveness of environmental sounds.
Finding 6-11: The APHAB has a sufficient amount of psychometric data supporting the reliability and validity of the measure, and there is some evidence for its sensitivity to change. The measure is available in at least 20 languages and takes around 10 minutes to complete.
Finding 6-12: The communication subscales of the APHAB are often averaged to form a single APHAB-global score, which has been shown to be reliable and sensitive to change.
Finding 6-13: The SSQ examines speech, spatial abilities, and qualities of hearing. The measure has a sufficient amount of psychometric data supporting the reliability and validity of the measure, with some evidence for sensitivity to change, and is available in multiple languages.
Finding 6-14: Behavioral measures capture outcomes in a controlled environment while self-report measures capture real-world experiences.
Finding 6-15: The HHIE requires about 10 minutes to complete, compared with the HHIE-S, which takes 2–3 minutes; the RHHI, which takes 5–7 minutes; and the RHHI-S, which takes about 2–3 minutes.
Finding 6-16: The HHIE has the most evidence supporting sensitivity to change, and there is a sufficient amount of psychometric data supporting the test-retest reliability of the measure.
Finding 6-17: The RHHI includes items that are a verbatim subset of the HHIE.
Finding 6-18: Several statistical methods, including distribution-based methods and anchor-based approaches, can help improve the evidence base for measures’ sensitivity to change. More research is needed to develop appropriate MCIDs for all outcome measures used to assess hearing interventions.
Finding 6-19: Linking is a statistical process that makes it possible to create an equivalency mapping between the scores from two different measures of the same construct. With this technique, researchers can directly compare the scores from various tests that measure the same, or approximately the same, characteristic. These links are known as crosswalks.
Finding 6-20: IRT is a statistical approach used to understand how individual items on a test contribute to measuring a construct rather than just looking at the overall score. Broader application of IRT to measuring outcomes of hearing interventions is desirable.
Many outcome measures have been developed to assess the outcomes of understanding speech in complex listening situations and hearing-related psychosocial health. The committee focused on the measures with the most available information concerning their psychometric development and use. First, the committee documented descriptive characteristics of the studies used in each measure’s development, including the population studied, the number of participants, the setting, and how the measure is scored. Next the committee evaluated each measure according to a series of criteria—scientific acceptability (including reliability, validity, and sensitivity to change) and feasibility (by setting). Overall, current outcome measures are imperfect, but there is adequate evidence to support the standard use of specific measures at this time.
For the assessment of the outcome of understanding speech in complex listening situations, the committee considered both behavioral measures and self-report measures and concluded that both types of measures are necessary. While behavioral measures have face validity, are considered by some to be more objective, and are readily available, these measures often lack data on their sensitivity to change and do not necessarily correlate with an individual’s perception of their improvement.
___________________
2 The word standardized is used here in a broad sense to indicate that the same measures are being used for specific outcomes, and that there are prescribed materials and procedures for the use of these measures. The committee does not imply that the measures are part of national or international standards.
For the self-report measure, the committee narrowed the candidates down to the APHAB and the SSQ. While both measures have good evidence for psychometric strength and both are available in multiple languages, the APHAB (and the APHAB-global score in particular) has a greater focus on scenarios of complex listening situations, whereas the SSQ includes a mix of speech, spatial location, and sound quality. Therefore, the committee concluded that the APHAB-global score currently is the best candidate for assessing an individual’s experience with understanding speech in complex listening environments.
For the behavioral measure, the committee narrowed the candidates down to the QuickSIN and the WIN. The QuickSIN, which evaluates an individual’s ability to understand sentences, may be more familiar to audiologists and somewhat shorter to administer. However, psychometric evaluation of the QuickSIN shows limited evidence on the consistency of the measurement over time, with only moderate test–retest reliability. The WIN, which evaluates an individual’s ability to understand single words, has had a much more rigorous psychometric development and evaluation (including significant evidence for test–retest reliability), is currently used as part of the NIH Toolbox, and is available in Spanish (although the committee notes that additional validation of the Spanish WIN is needed). The committee emphasizes that measures for understanding speech in complex listening situations need to be implemented as recommended by the developers in order to avoid variability and allow for comparison across studies.
For hearing-related psychosocial health, the committee ultimately zeroed in on variations of the HHI. The committee concluded that the 18-item RHHI is the best current candidate. The well-studied HHIE was eliminated primarily because of its length and because psychometric analyses favor the RHHI. The shorter screening version (RHHI-S) had less robust research on psychometrics. The RHHI, a relatively new measure, has undergone rigorous item analysis but lacks explicit data regarding reliability and sensitivity to change following intervention. Given that 18 of the 25 items included in the HHIE are common to the RHHI, the committee relied on evidence regarding reliability and sensitivity to change for the HHIE when considering the RHHI.
Recommendation 6-1: When assessing outcomes in hearing health, clinicians, researchers, and individuals should use the following outcome measures for each of the outcomes in the core outcome set:
- Understanding speech in complex listening situations
  - Abbreviated Profile of Hearing Aid Benefit global score (APHAB-Global)
  - Words-in-Noise (WIN) test
- Hearing-related psychosocial health
  - Revised Hearing Handicap Inventory (RHHI)
The committee recognizes the potential burden of assessing outcomes with three different measures and emphasizes that it will be important to determine the timing and frequency of the outcome measurements that will deliver optimal information. Additionally, not every measure will need to be administered at every encounter, and self-report measures could be completed by the adult with hearing difficulties in advance of a clinical or research encounter. Finally, the committee reemphasizes that supplemental measures will likely be needed, depending on the specific context of the outcome measurement, including verification of audibility as appropriate.
The committee recognizes that the measurement of hearing health outcomes requires an improvement in psychometric rigor overall. Research on measure development and refinement is needed to improve the quality of existing measures, including the previously recommended measures. For example, statistical approaches such as IRT and linking may help with refinement of existing measures. In particular, creating links between the two behavioral measures of speech understanding in complex listening situations, the WIN and QuickSIN, and among alternate forms of the HHI measures would be of value. Additionally, research on sensitivity to change, associations among core outcomes, and variations of existing measures may also help with measure choice and refinement.
Recommendation 6-2: Sponsors of hearing health research should fund further psychometric evaluation of the measures recommended for the core outcome set. Specific areas of research include the following:
- Development of links and crosswalks
  - Words-in-Noise (WIN) test versus Quick Speech-in-Noise (QuickSIN) test
  - Among different variations of the Hearing Handicap Inventory (HHI)
- Establishment of the sensitivity to change relative to intervention (including minimal detectable change and minimal clinically important difference) for the WIN, the global score from the Abbreviated Profile of Hearing Aid Benefit (APHAB-global), the Revised HHI (RHHI), and the screening version (RHHI-S)
- Development of WIN (and QuickSIN) in other languages
- Assessment of associations among the set of core outcomes to further establish the independence and uniqueness of each measure
- Application of item response theory to further develop and refine the recommended outcome measures
Research beyond the currently recommended measures is needed to build evidence for the use of measures not recommended by this committee that might be reconsidered for an updated core outcome set.
Recommendation 6-3: Sponsors of hearing health research should fund research to develop and refine hearing health outcome measures beyond the currently recommended measures, including:
- Broader psychometric development of the Quick Speech-in-Noise (QuickSIN) test;
- Exploration of the use of the digits-in-noise test as an outcome measure; and
- Exploration of the usefulness of high-quality language agnostic tests for sound processing in complex listening situations.
Advanced Bionics LLC, Cochlear Americas, and MED-EL Corporation. 2011. Minimum speech test battery (MSTB) for adult cochlear implant users 2011. https://www.auditorypotential.com/MSTBfiles/MSTBManual2011-06-20%20.pdf (accessed November 7, 2024).
Aguiar, R. G. R., K. de Almeida, and E. C. de Miranda-Gonsalez. 2019. Test-retest reliability of the Speech, Spatial and Qualities of Hearing scale (SSQ) in Brazilian Portuguese. International Archives of Otorhinolaryngology 23(04):e380-e383.
Akeroyd, M. A. 2014. An overview of the major phenomena of the localization of sound sources by normal-hearing, hearing-impaired, and aided listeners. Trends in Hearing 18:2331216514560442.
Akeroyd, M. A., F. H. Guy, D. L. Harrison, and S. L. Suller. 2014. A factor analysis of the SSQ (Speech, Spatial, and Qualities of Hearing Scale). International Journal of Audiology 53(2):101–114.
Alkhodair, M. B., T. A. Mesallam, A. Hagr, and M. F. Yousef. 2021. Arabic version of short form of the Speech, Spatial, and Qualities of Hearing scale (SSQ12). Saudi Medical Journal 42(11):1180.
Allen, D., L. Hickson, and M. Ferguson. 2022. Defining a patient-centred core outcome domain set for the assessment of hearing rehabilitation with clients and professionals. Frontiers in Neuroscience 16:787607.
Armstrong, N. M., B. C. Oosterloo, P. H. Croll, M. A. Ikram, and A. Goedegebure. 2020. Discrimination of degrees of auditory performance from the Digits-in-Noise test based on hearing status. International Journal of Audiology 59(12):897–904.
Auditdata. n.d. What is Quick-SIN? https://www.auditdata.com/audiology-solutions/measure/hearing-assessment/quicksin/#:~:text=The%20sentences%20are%20presented%20at,available%20in%20the%20English%20Language (accessed March 4, 2025).
Banh, J., G. Singh, and M. K. Pichora-Fuller. 2012. Age affects responses on the Speech, Spatial, and Qualities of Hearing Scale (SSQ) by adults with minimal audiometric loss. Journal of the American Academy of Audiology 23(2):81–91.
Batthyany, C., A.-R. Schut, M. van der Schroeff, and J. Vroegop. 2023. Translation and validation of the Speech, Spatial, and Qualities of Hearing scale (SSQ) and the Hearing Environments and Reflection on Quality of Life (HEAR-QL) questionnaire for children and adolescents in Dutch. International Journal of Audiology 62(2):129–137.
Beckerman, H., M. E. Roebroeck, G. J. Lankhorst, J. G. Becher, P.D. Bezemer, and A. L. M. Verbeek. 2001. Smallest real difference, a link between reproducibility and responsiveness. Quality of Life Research 10:571–578.
Bentler, R. A. 2000. List equivalency and test-retest reliability of the Speech in Noise test. American Journal of Audiology 9(2):84–100.
Bentler, R. A., D. P. Niebuhr, J. P. Getta, and C. V. Anderson. 1993a. Longitudinal study of hearing aid effectiveness. I: Objective measures. Journal of Speech and Hearing Research 36(4):808–819.
Bentler, R. A., D. P. Niebuhr, J. P. Getta, and C. V. Anderson. 1993b. Longitudinal study of hearing aid effectiveness. II: Subjective measures. Journal of Speech and Hearing Research 36(4):820–831.
Benton, T. 2021. Item response theory, computer adaptive testing, and the risk of self-deception. Research Matters 32:82–100. Available at https://files.eric.ed.gov/fulltext/EJ1317443.pdf (accessed March 27, 2025).
Billings, C. J., T. M. Olsen, L. Charney, B. M. Madsen, and C. E. Holmes. 2023. Speech-in-noise testing: An introduction for audiologists. Seminars in Hearing 45(1):55–82.
Brady, K. J. S., P. Ni, L. Carlasare, T. D. Shanafelt, C. A. Sinsky, M. Linzer, M. Stillman, and M. T. Trockel. 2022. Establishing crosswalks between common measures of burnout in U.S. physicians. Journal of General Internal Medicine 37(4):777–784.
Cañete, O. M., D. Marfull, M. C. Torrente, and S. C. Purdy. 2022. The Spanish 12-item version of the Speech, Spatial and Qualities of Hearing scale (SP-SSQ12): Adaptation, reliability, and discriminant validity for people with and without hearing loss. Disability and Rehabilitation 44(8):1419–1426.
Carhart, R. 1946. Selection of hearing aids. Archives of Otolaryngology (1925) 44:1–18.
Cassarly, C., L. J. Matthews, A. N. Simpson, and J. R. Dubno. 2020. The Revised Hearing Handicap Inventory and screening tool based on psychometric reevaluation of the Hearing Handicap Inventories for the elderly and adults. Ear and Hearing 41(1):95–105.
Chisolm, T. H., H. B. Abrams, R. McArdle, R. H. Wilson, and P. J. Doyle. 2005. The WHODAS II: Psychometric properties in the measurement of functional health status in adults with acquired hearing loss. Trends in Amplification 9(3):111–126.
Clarke, M., and P. R. Williamson. 2016. Core outcome sets and systematic reviews. Systematic Reviews 5:11.
CMS (Centers for Medicare & Medicaid Services). 2024. Measure testing. https://mmshub.cms.gov/measure-lifecycle/measure-testing/evaluation-criteria/overview (accessed March 6, 2024).
Cohen, J. 1988. Statistical power analysis for the behavioral sciences, 2nd edition. New York: Academic Press.
COSMIN (COnsensus-based Standards for the selection of health Measurement INstruments). n.d. Guideline for selecting instruments for a core outcome set. https://www.cosmin.nl/tools/guideline-selecting-proms-cos (accessed June 26, 2024).
Cox, R. M., and C. Gilmore. 1990. Development of the Profile of Hearing Aid Performance (PHAP). Journal of Speech and Hearing Research 33(2):343–357.
Cox, R. M., and I. Rivera. 1992. Predictability and reliability of hearing aid benefit measured using the PHAB. Journal of the American Academy of Audiology 3(4):242–254.
Cox, R. M., and G. C. Alexander. 1992. Maturation of hearing aid benefit: Objective and subjective measurements. Ear and Hearing 13(3):131–141.
Cox, R. M., and G. C. Alexander. 1995. The Abbreviated Profile of Hearing Aid Benefit. Ear and Hearing 16(2):176–186.
Cox, R. M., G. C. Alexander, and C. M. Beyer. 2003. Norms for the international outcome inventory for hearing aids. Journal of the American Academy of Audiology 14(08): 403–413.
Cox, R. M., G. C. Alexander, and G. A. Gray. 2007. Personality, hearing problems, and amplification characteristics: Contributions to self-report hearing aid outcomes. Ear and Hearing 28(2):141–162.
Cox, R. M., J. A. Johnson, and J. Xu. 2014. Impact of advanced hearing aid technology on speech understanding for older listeners with mild to moderate, adult-onset, sensorineural hearing loss. Gerontology 60(6):557–568.
Cox, R. M., J. A. Johnson, and J. Xu. 2016. Impact of hearing aid technology on outcomes in daily life I: The patients’ perspective. Ear and Hearing 37(4):e224–e237.
Crosby, R. D., R. L. Kolotkin, and G. R. Williams. 2003. Defining clinically meaningful change in health-related quality of life. Journal of Clinical Epidemiology 56(5):395–407.
Dawes, C., D. Cesarini, J. H. Fowler, M. Johannesson, P. K. Magnusson, and S. Oskarsson. 2014. The relationship between genes, psychological traits, and political participation. American Journal of Political Science 58(4):888–903.
Dawes, P., and K. J. Munro. 2017. Auditory distraction and acclimatization to hearing aids. Ear and Hearing 38(2):174–183.
Dawes, P., S. Powell, and K. J. Munro. 2011. The placebo effect and the influence of participant expectation on hearing aid trials. Ear and Hearing 32(6):767–774.
Dawes, P., R. Hopkins, and K. J. Munro. 2013. Placebo effects in hearing-aid trials are reliable. International Journal of Audiology 52(7):472–477.
De Sousa, K. C., C. Smits, D. R. Moore, H. C. Myburgh, and W. Swanepoel. 2020a. Pure-tone audiometry without bone-conduction thresholds: Using the Digits-in-Noise test to detect conductive hearing loss. International Journal of Audiology 59(10):801–808.
De Sousa, K. C., W. Swanepoel, D. R. Moore, H. C. Myburgh, and C. Smits. 2020b. Improving sensitivity of the Digits-in-Noise test using antiphasic stimuli. Ear and Hearing 41(2):442–450.
De Sousa, K. C., V. Manchaiah, D. R. Moore, M. A. Graham, and D. W. Swanepoel. 2023. Effectiveness of an over-the-counter self-fitting hearing aid compared with an audiologist-fitted hearing aid: A randomized clinical trial. JAMA Otolaryngology - Head and Neck Surgery 149(6):522–530.
de Vet, H. C., C. B. Terwee, R. W. Ostelo, H. Beckerman, D. L. Knol, and L. M. Bouter. 2006. Minimal changes in health status questionnaires: Distinction between minimally detectable change and minimally important change. Health and Quality of Life Outcomes 4:54.
de Vet, H. C. W., C. B. Terwee, L. B. Mokkink, and D. L. Knol. 2011. Measurement in medicine: A practical guide, Practical guides to biostatistics and epidemiology. Cambridge, UK: Cambridge University Press.
Dekker, J., M. de Boer, and R. Ostelo. 2024. Minimal important change and difference in health outcome: An overview of approaches, concepts, and methods. Osteoarthritis and Cartilage 32(1):8–17.
Demeester, K., V. Topsakal, J.-J. Hendrickx, E. Fransen, L. Van Laer, G. Van Camp, P. Van de Heyning, and A. Van Wieringen. 2012. Hearing disability measured by the Speech, Spatial, and Qualities of Hearing Scale in clinically normal-hearing and hearing-impaired middle-aged persons, and disability screening by means of a reduced SSQ (the SSQ5). Ear and Hearing 33(5):615–616.
Demorest, M. E., and S. A. Erdman. 1988. Retest stability of the communication profile for the hearing impaired. Ear and Hearing 9(5): 237–242.
Demorest, M. E., and B. E. Walden. 1984. Psychometric principles in the selection, interpretation, and evaluation of communication self-assessment inventories. Journal of Speech and Hearing Disorders 49(3):226–240.
Dillard, L. K., L. J. Matthews, and J. R. Dubno. 2024a. Change on the Revised Hearing Handicap Inventory and associated factors: Results from a longitudinal cohort study. International Journal of Audiology:1–11.
Dillard, L. K., L. J. Matthews, and J. R. Dubno. 2024b. The Revised Hearing Handicap Inventory and pure-tone average predict hearing aid use equally well. American Journal of Audiology 33(1):199–208.
Dillon, H., A. James, and J. Ginis. 1997. Client Oriented Scale of Improvement (COSI) and its relationship to several other measures of benefit and satisfaction provided by hearing aids. Journal of the American Academy of Audiology 8(1):27–43.
Dornhoffer, J. R., T. A. Meyer, J. R. Dubno, and T. R. McRackan. 2020. Assessment of hearing aid benefit using patient-reported outcomes and audiologic measures. Audiology and Neurotology 25(4):215–223.
Erdman, S. A. 2014. The biopsychosocial approach in patient- and relationship-centered care: Implications for audiologic counseling. In Adult audiologic rehabilitation, edited by J. J. Montano and J. B. Spitzer, 2nd ed. San Diego, CA: Plural Publishing Inc. Pp. 159–206.
Fitzgerald, M. B., S. P. Gianakas, Z. J. Qian, S. Losorelli, and A. C. Swanson. 2023. Preliminary guidelines for replacing word-recognition in quiet with speech in noise assessment in the routine audiologic test battery. Ear and Hearing 44(6):1548–1561.
Fitzgerald, M. B., K. M. Ward, S. P. Gianakas, M. L. Smith, N. H. Blevins, and A. P. Swanson. 2024. Speech-in-noise assessment in the routine audiologic test battery: Relationship to perceived auditory disability. Ear and Hearing 45(4):816–826.
Folmer, R. L., J. Vachhani, G. P. McMillan, C. Watson, G. R. Kidd, and M. P. Feeney. 2017. Validation of a computer-administered version of the Digits-in-Noise test for hearing screening in the United States. Journal of the American Academy of Audiology 28(2): 161–169.
Folmer, R. L., G. H. Saunders, J. J. Vachhani, R. H. Margolis, G. Saly, B. Yueh, R. A. McArdle, L. L. Feth, C. M. Roup, and M. P. Feeney. 2021. Hearing health care utilization following automated hearing screening. Journal of the American Academy of Audiology 32(4):235–245.
Fox, R. S., J. J. Manly, J. Slotkin, J. D. Peipert, and R. C. Gershon. 2021. Reliability and validity of the Spanish-language version of the NIH toolbox. Assessment 28(2):457–471.
Gatehouse, S. 2000. The impact of measurement goals on the design specification for outcome measures. Ear and Hearing 21(4 Suppl):100S–105S.
Gatehouse, S., and W. Noble. 2004. The Speech, Spatial and Qualities of Hearing Scale (SSQ). International Journal of Audiology 43(2):85–99.
Giolas, T. G., E. Owens, S. H. Lamb, and E. D. Schubert. 1979. Hearing performance inventory. Journal of Speech and Hearing Disorders 44(2):169–195.
Haley, S. M., and M. A. Fragala-Pinkham. 2006. Interpreting change scores of tests and measures used in physical therapy. Physical Therapy 86(5):735–743.
Hall, D. A., S. Zaragoza Domingo, L. Z. Hamdache, V. Manchaiah, S. Thammaiah, C. Evans, L. L. Wong, International Collegium of Rehabilitative Audiology and TINnitus Research NETwork. 2018. A good practice guide for translating and adapting hearing-related questionnaires for different languages and cultures. International Journal of Audiology 57(3):161–175.
Haskell, G. B., D. Noffsinger, V. D. Larson, D. W. Williams, R. A. Dobie, and J. L. Rogers. 2002. Subjective measures of hearing aid benefit in the NIDCD/VA clinical trial. Ear and Hearing 23(4):301–307.
Hays, R., and D. Hadorn. 1992. Responsiveness to change: An aspect of validity, not a separate dimension. Quality of Life Research 1:73–75.
Heffernan, E., B. E. Weinstein, and M. A. Ferguson. 2020. Application of Rasch analysis to the evaluation of the measurement properties of the Hearing Handicap Inventory for the Elderly. Ear and Hearing 41(5):1125–1134.
High, W. S., G. Fairbanks, and A. Glorig. 1964. Scale for self-assessment of hearing handicap. Journal of Speech and Hearing Disorders 29(3):215–230.
Holder, J. T., L. M. Levin, and R. H. Gifford. 2018. Speech recognition in noise for adults with normal hearing: Age-normative performance for AzBio, BKB-SIN, and QuickSIN. Otology & Neurotology 39(10):e972–e978.
Hoth, S. 2016. [The Freiburg speech intelligibility test: A pillar of speech audiometry in German-speaking countries]. HNO 64(8):540–548.
Humes, L. E. 1999. Dimensions of hearing aid outcome. Journal of the American Academy of Audiology 10(01):26–39.
Humes, L. E. 2003. Modeling and predicting hearing aid outcome. Trends in Amplification 7(2):41–75.
Humes, L. E. 2021. An approach to self-assessed auditory wellness in older adults. Ear and Hearing 42(4):745–761.
Humes, L. E., and V. Krull. 2012. Hearing aids for adults. Evidence-Based Practice in Audiology 61–92.
Humes, L. E., D. Halling, and M. Coughlin. 1996. Reliability and stability of various hearing-aid outcome measures in a group of elderly hearing-aid wearers. Journal of Speech and Hearing Research 39(5):923–935.
Humes, L. E., C. B. Garner, D. L. Wilson, and N. N. Barlow. 2001. Hearing-aid outcome measures following one month of hearing aid use by the elderly. Journal of Speech, Language, and Hearing Research 44(3):469–486.
Humes, L. E., D. L. Wilson, N. N. Barlow, and C. Garner. 2002. Changes in hearing-aid benefit following 1 or 2 years of hearing-aid use by older adults. Journal of Speech, Language, and Hearing Research 45(4):772–782.
Humes, L. E., D. L. Wilson, and A. C. Humes. 2003. Examination of differences between successful and unsuccessful elderly hearing aid candidates matched for age, hearing loss and gender. International Journal of Audiology 42(7):432–441.
Humes, L. E., G. R. Kidd, and J. J. Lentz. 2013. Auditory and cognitive factors underlying individual differences in aided speech-understanding among older adults. Frontiers in Systems Neuroscience 7:55.
Humes, L. E., S. E. Rogers, T. M. Quigley, A. K. Main, D. L. Kinney, and C. Herring. 2017. The effects of service-delivery model and purchase price on hearing-aid outcomes in older adults: A randomized double-blind placebo-controlled clinical trial. American Journal of Audiology 26(1):53–79.
Interacoustics. 2022. Quick Speech in Noise (QuickSIN). https://www.interacoustics.com/audiometers/ac40/support/quick-speech-in-noise-quicksin (accessed September 10, 2024).
Jaeschke, R., J. Singer, and G. H. Guyatt. 1989. Measurement of health status. Ascertaining the minimal clinically important difference. Controlled Clinical Trials 10(4):407–415.
Jansen, S., H. Luts, K. C. Wagener, B. Kollmeier, M. Del Rio, R. Dauman, C. James, B. Fraysse, E. Vormès, B. Frachet, J. Wouters, and A. van Wieringen. 2012. Comparison of three types of French speech-in-noise tests: A multi-center study. International Journal of Audiology 51(3):164–173.
Jansen, S., H. Luts, P. Dejonckere, A. van Wieringen, and J. Wouters. 2013. Efficient hearing screening in noise-exposed listeners using the digit triplet test. Ear and Hearing 34(6):773–778.
Jerger, J., R. Chmiel, E. Florin, F. Pirozzolo, and N. Wilson. 1996. Comparison of conventional amplification and an assistive listening device in elderly persons. Ear and Hearing 17(6):490–504.
Johnson, J. A., R. M. Cox, and G. C. Alexander. 2010. Development of APHAB norms for WDRC hearing aids and comparisons with original norms. Ear and Hearing 31(1):47–55.
Johnson, J. A., J. Xu, and R. M. Cox. 2016. Impact of hearing aid technology on outcomes in daily life II: Speech understanding and listening effort. Ear and Hearing 37(5):529–540.
Johnson, J. A., J. Xu, and R. M. Cox. 2017. Impact of hearing aid technology on outcomes in daily life III: Localization. Ear and Hearing 38(6):746–759.
Kam, A. C., M. C. Tong, and A. van Hasselt. 2011. Cross-cultural adaptation and validation of the Chinese Abbreviated Profile of Hearing Aid Benefit. International Journal of Audiology 50(5):334–339.
Kamper, S. J., C. G. Maher, and G. Mackay. 2009. Global rating of change scales: A review of strengths and weaknesses and considerations for design. Journal of Manual & Manipulative Therapy 17(3):163–170.
Kiessling, J., M. Meis, and H. Meister. 2011. German translations of questionnaires SADL, ECHO and SSQ and their evaluation. Audiological Acoustics 50(1):6–16.
Kılıç, N., G. İ. Ş. Kamışlı, B. Gündüz, İ. Bayramoğlu, and Y. K. Kemaloğlu. 2021. Turkish validity and reliability study of the Speech, Spatial and Qualities of Hearing scale. Turkish Archives of Otorhinolaryngology 59(3):172.
Killion, M., and P. A. Niquette. 2000. What can the pure-tone audiogram tell us about a patient’s SNR loss? Hearing Journal 53(3):46–48, 50, 52–53.
Killion, M. C., and E. Villchur. 1993. Kessler was right—partly: But SIN test shows some aids improve hearing in noise. Hearing Journal 46(9):31–35.
Killion, M. C., W. O. Olsen, C. L. Clifford, D. D. VanVliet, D. E. Rose, D. E. Bensen, M. W. Marion, P. A. Tillman, D. B. Hawkins, S. M. Dalzell, and D. A. Fabry. 1996. “Preliminary data on the SIN Test,” presented at the annual convention of the American Academy of Audiology, Salt Lake City, UT.
Killion, M. C., P. A. Niquette, G. I. Gudmundsen, L. J. Revit, and S. Banerjee. 2004. Development of a Quick Speech-in-Noise test for measuring signal-to-noise ratio loss in normal-hearing and hearing-impaired listeners. Journal of the Acoustical Society of America 116(4 Pt 1):2395–2405.
Kochkin, S. 1997. Subjective measures of satisfaction and benefit: Establishing norms. Seminars in Hearing 18(1):37–48.
Koole, A., A. P. Nagtegaal, N. C. Homans, A. Hofman, R. J. Baatenburg de Jong, and A. Goedegebure. 2016. Using the Digits-in-Noise test to estimate age-related hearing loss. Ear and Hearing 37(5):508–513.
Kraus, E. M., J. A. Shohet, and P. J. Catalano. 2011. Envoy esteem totally implantable hearing system: Phase 2 trial, 1-year hearing results. Otolaryngology—Head and Neck Surgery 145(1):100–109.
Kwak, C., J. H. Seo, Y. Oh, and W. Han. 2022. Efficacy of the Digit-in-Noise test: A systematic review and meta-analysis. Journal of Audiology & Otology 26(1):10–21.
Lamb, S. H., E. Owens, and E. D. Schubert. 1983. The revised form of the hearing performance inventory. Ear and Hearing 4(3):152–157.
Larson, V. D., D. W. Williams, W. G. Henderson, L. E. Luethke, L. B. Beck, D. Noffsinger, R. H. Wilson, R. A. Dobie, G. B. Haskell, G. W. Bratt, J. E. Shanks, P. Stelmachowicz, G. A. Studebaker, A. E. Boysen, A. Donahue, R. Canalis, S. A. Fausti, and B. Z. Rappaport, for the NIDCD/VA Hearing Aid Clinical Trial Group. 2000. Efficacy of 3 commonly used hearing aid circuits: A crossover trial. JAMA 284(14):1806–1813.
Lichtenstein, M. J., F. H. Bess, and S. A. Logan. 1988. Validation of screening tools for identifying hearing-impaired elderly in primary care. JAMA 259(19):2875–2878.
Lin, F. R., J. R. Pike, M. S. Albert, M. Arnold, S. Burgard, T. Chisolm, D. Couper, J. A. Deal, A. M. Goman, N. W. Glynn, T. Gmelin, L. Gravens-Mueller, K. M. Hayden, A. R. Huang, D. Knopman, C. M. Mitchell, T. Mosley, J. S. Pankow, N. S. Reed, V. Sanchez, J. A. Schrack, B. G. Windham, and J. Coresh. 2023. Hearing intervention versus health education control to reduce cognitive decline in older adults with hearing loss in the USA (ACHIEVE): A multicentre, randomised controlled trial. Lancet 402(10404):786–797.
Löhler, J., F. Gräbner, B. Wollenberg, P. Schlattmann, and R. Schönweiler. 2017. Sensitivity and specificity of the Abbreviated Profile of Hearing Aid Benefit (APHAB). European Archives of Oto-Rhino-Laryngology 274(10):3593–3598.
Lotfi, Y., A. R. Nazeri, A. Asgari, A. Moosavi, and E. Bakhshi. 2016. Iranian version of Speech, Spatial, and Qualities of Hearing scale: A psychometric study. Acta Medica Iranica 756–764.
Lydick, E., and R. S. Epstein. 1993. Interpretation of quality of life changes. Quality of Life Research 2(3):221–226.
Lyzenga, J., and C. Smits. 2011. Effects of coarticulation, prosody, and noise freshness on the intelligibility of digit triplets in noise. Journal of the American Academy of Audiology 22(4):215–221.
Malinoff, R. L., and B. E. Weinstein. 1989. Measurement of hearing aid benefit in the elderly. Ear and Hearing 10(6):354–356.
McArdle, R. A., and R. H. Wilson. 2006. Homogeneity of the 18 QuickSIN lists. Journal of the American Academy of Audiology 17(3):157–167.
McArdle, R. A., R. H. Wilson, and C. A. Burks. 2005. Speech recognition in multitalker babble using digits, words, and sentences. Journal of the American Academy of Audiology 16(9):726–739.
McHorney, C. A., and A. Tarlov. 1995. Individual-patient monitoring in clinical practice: Are available health status surveys adequate? Quality of Life Research 4:293–307.
McLean, W. J., A. S. Hinton, J. T. J. Herby, A. N. Salt, J. J. Hartsock, S. Wilson, D. L. Lucchino, T. Lenarz, A. Warnecke, N. Prenzler, H. Schmitt, S. King, L. E. Jackson, J. Rosenbloom, G. Atiee, M. Bear, C. L. Runge, R. H. Gifford, S. D. Rauch, D. J. Lee, R. Langer, J. M. Karp, C. Loose, and C. LeBel. 2021. Improved speech intelligibility in subjects with stable sensorineural hearing loss following intratympanic dosing of FX-322 in a phase 1b study. Otology & Neurotology 42(7):e849–e857.
McShefferty, D., W. M. Whitmer, and M. A. Akeroyd. 2016. The just-meaningful difference in speech-to-noise ratio. Trends in Hearing 20:1–11.
Mehrkian, S., Z. Bayat, M. Javanbakht, H. Emamdjomeh, and E. Bakhshi. 2019. Effect of wireless remote microphone application on speech discrimination in noise in children with cochlear implants. International Journal of Pediatric Otorhinolaryngology 125:192–195.
Melo, I. M. M., A. R. X. Silva, R. Camargo, H. G. Cavalcanti, D. V. Ferrari, K. V. M. Taveira, and S. A. Balen. 2022. Accuracy of smartphone-based hearing screening tests: A systematic review. CoDAS 34(3):e20200380.
Mendel, L. L. 2007. Objective and subjective hearing aid assessment outcomes. American Journal of Audiology 16(2):118–129.
Meng, L., D. Hao, D. Li, J. Yue, Y. Wan, and L. Shi. 2024. Establishment of self-reported hearing cut-off value on the Chinese version of short form of Speech, Spatial and Qualities of Hearing scale (SSQ12). International Journal of Audiology 1–8.
Mokkink, L. B., C. B. Terwee, D. L. Patrick, J. Alonso, P. W. Stratford, D. L. Knol, L. M. Bouter, and H. C. W. de Vet. 2010. The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. Journal of Clinical Epidemiology 63(7):737–745.
Motlagh Zadeh, L., N. H. Silbert, W. Swanepoel, and D. R. Moore. 2021. Improved sensitivity of Digits-in-Noise test to high-frequency hearing loss. Ear and Hearing 42(3):565–573.
Moulin, A., and C. Richard. 2016. Sources of variability of speech, spatial, and qualities of hearing scale (SSQ) scores in normal-hearing and hearing-impaired populations. International Journal of Audiology 55(2):101–109.
Moulin, A., A. Pauzie, and C. Richard. 2015. Validation of a French translation of the Speech, Spatial, and Qualities of Hearing Scale (SSQ) and comparison with other language versions. International Journal of Audiology 54(12):889–898.
Moulin, A., J. Vergne, S. Gallego, and C. Micheyl. 2019. A new Speech, Spatial, and Qualities of Hearing Scale short-form: Factor, cluster, and comparative analyses. Ear and Hearing 40(4):938–950.
Mueller, H. G. 2016. Signia expert series: Speech-in-Noise testing for selection and fitting of hearing aids: Worth the effort? https://www.audiologyonline.com/articles/signia-expert-series-speech-in-18336 (accessed January 12, 2025).
Mulrow, C. D., C. Aguilar, J. E. Endicott, M. R. Tuley, R. Velez, W. S. Charlip, M. C. Rhodes, J. A. Hill, and L. A. DeNino. 1990a. Quality-of-life changes and hearing impairment. A randomized trial. Annals of Internal Medicine 113(3):188–194.
Mulrow, C. D., M. R. Tuley, and C. Aguilar. 1990b. Discriminating and responsiveness abilities of two hearing handicap scales. Ear and Hearing 11(3):176–180.
Mulrow, C. D., M. R. Tuley, and C. Aguilar. 1992a. Correlates of successful hearing aid use in older adults. Ear and Hearing 13(2):108–113.
Mulrow, C. D., M. R. Tuley, and C. Aguilar. 1992b. Sustained benefits of hearing aids. Journal of Speech and Hearing Research 35(6):1402–1405.
Myhrum, M., M. G. Heldahl, A. K. Rødvik, O. E. Tvete, and G. E. Jablonski. 2024. Validation of the Norwegian version of the Speech, Spatial and Qualities of Hearing scale (SSQ). Audiology and Neurotology 29(2):124–135.
Neal, K., C. M. McMahon, S. E. Hughes, and I. Boisvert. 2022. Listening-based communication ability in adults with hearing loss: A scoping review of existing measures. Frontiers in Psychology 13:786347.
Newman, C. W., and B. E. Weinstein. 1988. The Hearing Handicap Inventory for the Elderly as a measure of hearing aid benefit. Ear and Hearing 9(2):81–85.
Newman, C. W., and B. E. Weinstein. 1989. Test-retest reliability of the Hearing Handicap Inventory for the Elderly using two administration approaches. Ear and Hearing 10(3):190–191.
Newman, C. W., B. E. Weinstein, G. P. Jacobson, and G. A. Hug. 1990. The Hearing Handicap Inventory for Adults: Psychometric adequacy and audiometric correlates. Ear and Hearing 11(6):430–433.
Newman, C. W., G. P. Jacobson, G. A. Hug, B. E. Weinstein, and R. L. Malinoff. 1991. Practical method for quantifying hearing aid benefit in older adults. Journal of the American Academy of Audiology 2(2):70–75.
Noble, W., G. Naylor, N. Bhullar, and M. A. Akeroyd. 2012. Self-assessed hearing abilities in middle-and older-age adults: A stratified sampling approach. International Journal of Audiology 51(3):174–180.
Noble, W., N. S. Jensen, G. Naylor, N. Bhullar, and M. A. Akeroyd. 2013. A short form of the Speech, Spatial, and Qualities of Hearing Scale suitable for clinical use: The SSQ12. International Journal of Audiology 52(6):409–412.
Öberg, M., T. Lunner, and G. Andersson. 2007. Psychometric evaluation of hearing specific self-report measures and their associations with psychosocial and demographic variables. Audiological Medicine 5(3):188–199.
Oremule, B., J. Abbas, G. Saunders, K. Kluk, R. Isba, S. Bate, and I. Bruce. 2024. Mobile audiometry for hearing threshold assessment: A systematic review and meta-analysis. Clinical Otolaryngology 49(1):74–86.
Ou, H., and M. Wetmore. 2020. Development of a revised performance-perceptual test using Quick Speech in Noise test material and its norms. Journal of the American Academy of Audiology 31(03):176–184.
Patro, A., A. C. Moberly, M. H. Freeman, E. L. Perkins, T. A. Jan, K. O. Tawfik, M. R. O’Malley, M. L. Bennett, R. H. Gifford, D. S. Haynes, and N. I. Chowdhury. 2024. Investigating the minimal clinically important difference for AzBIO and CNC speech recognition scores. Otology & Neurotology 45(9):e639–e643.
Pearsons, K. S., R. L. Bennett, and S. Fidell. 1977. Speech levels in various noise environments (report no. EPA-600/1-77-025). https://nepis.epa.gov/Exe/ZyPURL.cgi?Dockey=P100CWGS.TXT (accessed January 12, 2025).
Perron, M., B. Lau, and C. Alain. 2023. Interindividual variability in the benefits of personal sound amplification products on speech perception in noise: A randomized cross-over clinical trial. PLoS One 18(7):e0288434.
Phatak, S. A., B. M. Sheffield, D. S. Brungart, and K. W. Grant. 2018. Development of a test battery for evaluating speech perception in complex listening environments: Effects of sensorineural hearing loss. Ear and Hearing 39(3):449–456.
Potgieter, J. M., D. W. Swanepoel, H. C. Myburgh, T. C. Hopper, and C. Smits. 2015. Development and validation of a smartphone-based Digits-in-Noise hearing test in South African English. International Journal of Audiology 55(7):405–411.
Potgieter, J. M., W. Swanepoel, H. C. Myburgh, and C. Smits. 2018a. The South African English smartphone Digits-in-Noise hearing test: Effect of age, hearing loss, and speaking competence. Ear and Hearing 39(4):656–663.
Potgieter, J. M., W. Swanepoel, and C. Smits. 2018b. Evaluating a smartphone Digits-in-Noise test as part of the audiometric test battery. South African Journal of Communication Disorders 65(1):e1–e6.
Prinsen, C. A. C., S. Vohra, M. R. Rose, M. Boers, P. Tugwell, M. Clarke, P. R. Williamson, and C. B. Terwee. 2016. How to select outcome measurement instruments for outcomes included in a “core outcome set”—A practical guideline. Trials 17(1):449.
Radulescu, L., O. Astefanei, R. Serban, S. Cozma, C. Butnaru, and C. Martu. 2024. The validation of the Speech, Spatial and Qualities of Hearing scale SSQ12 for native Romanian speakers with and without hearing impairment. Journal of Personalized Medicine 14(1):90.
Reynard, P., J. Lagacé, C. A. Joly, L. Dodelé, E. Veuillet, and H. Thai-Van. 2022. Speech-in-noise audiometry in adults: A review of the available tests for French speakers. Audiology and Neurotology 27(3):185–199.
Roup, C. M., E. Post, and J. Lewis. 2018. Mild-gain hearing aids as a treatment for adults with self-reported hearing difficulties. Journal of the American Academy of Audiology 29(6):477–494.
Sabin, A. T., D. J. Van Tasell, B. Rabinowitz, and S. Dhar. 2020. Validation of a self-fitting method for over-the-counter hearing aids. Trends in Hearing 24:2331216519900589.
Samsa, G., D. Edelman, M. L. Rothman, G. R. Williams, J. Lipscomb, and D. Matchar. 1999. Determining clinically important differences in health status measures: A general approach with illustration to the Health Utilities Index Mark II. Pharmacoeconomics 15(2):141–155.
Sanchez, V. A., M. L. Arnold, E. E. Garcia Morales, N. S. Reed, S. Faucette, S. Burgard, H. N. Calloway, J. Coresh, J. A. Deal, and A. M. Goman. 2024. Effect of hearing intervention on communicative function: A secondary analysis of the ACHIEVE randomized controlled trial. Journal of the American Geriatrics Society 72(12):3784–3799.
Sanchez-Lopez, R., T. Dau, and W. M. Whitmer. 2022. Audiometric profiles and patterns of benefit: A data-driven analysis of subjective hearing difficulties and handicaps. International Journal of Audiology 61(4):301–310.
Saunders, G. H., and K. M. Cienkowski. 1997. Acclimatization to hearing aids. Ear and Hearing 18(2):129–139.
Saxena, U., S. K. Mishra, H. Rodrigo, and M. Choudhury. 2022. Functional consequences of extended high frequency hearing impairment: Evidence from the Speech, Spatial, and Qualities of Hearing scale. Journal of the Acoustical Society of America 152(5):2946–2952.
Schafer, E. C., J. Pogue, and T. Milrany. 2012. List equivalency of the AzBIO sentence test in noise for listeners with normal-hearing sensitivity or cochlear implants. Journal of the American Academy of Audiology 23(7):501–509.
Schimmel, C., V. Manchaiah, D. Swanepoel, and A. Sharma. 2024. Digits-in-Noise test as an assessment tool for hearing loss and hearing aids. Audiology Research 14:342–358.
Singh, G., and M. K. Pichora-Fuller. 2010. Older adults' performance on the Speech, Spatial, and Qualities of Hearing Scale (SSQ): Test-retest reliability and a comparison of interview and self-administration methods. International Journal of Audiology 49(10):733–740.
Śliwińska-Kowalska, M. 2020. [Preventive hearing tests in workers exposed to noise and organic solvents]. Medycyna Pracy 71(4):493–505.
Smits, C., S. E. Kramer, and T. Houtgast. 2006. Speech reception thresholds in noise and self-reported hearing disability in a general adult population. Ear and Hearing 27(5):538–549.
Smits, C., S. Theo Goverts, and J. M. Festen. 2013. The Digits-in-Noise test: Assessing auditory speech recognition abilities in noise. Journal of the Acoustical Society of America 133(3):1693–1706.
Spahr, A. J., M. F. Dorman, L. M. Litvak, S. Van Wie, R. H. Gifford, P. C. Loizou, L. M. Loiselle, T. Oakes, and S. Cook. 2012. Development and validation of the AzBIO sentence lists. Ear and Hearing 33(1):112–117.
Srinivasan, N., and S. O’Neill. 2023. Comparison of Speech, Spatial, and Qualities of Hearing scale (SSQ) and the Abbreviated Profile of Hearing Aid Benefit (APHAB) questionnaires in a large cohort of self-reported normal-hearing adult listeners. Audiology Research 13(1):143–150.
Stark, P., and L. Hickson. 2004. Outcomes of hearing aid fitting for older people with hearing impairment and their significant others. International Journal of Audiology 43(7):390–398.
Stenbäck, V., E. Marsja, R. Ellis, and J. Rönnberg. 2023. Relationships between behavioural and self-report measures in speech recognition in noise. International Journal of Audiology 62(2):101–109.
Streiner, D., G. Norman, and J. Cairney. 2015. Health measurement scales: A practical guide to their development and use. Oxford, UK: Oxford University Press.
Surr, R. K., M. T. Cord, and B. E. Walden. 1998. Long-term versus short-term hearing aid benefit. Journal of the American Academy of Audiology 9(3).
Taylor, K. S. 1993. Self-perceived and audiometric evaluations of hearing aid benefit in the elderly. Ear and Hearing 14(6):390–394.
Thorén, E. S., G. Andersson, and T. Lunner. 2012. The use of research questionnaires with hearing impaired adults: Online vs. paper-and-pencil administration. BMC Ear, Nose and Throat Disorders 12:1–6.
Tomioka, K., H. Ikeda, K. Hanaie, M. Morikawa, J. Iwamoto, N. Okamoto, K. Saeki, and N. Kurumatani. 2013. The Hearing Handicap Inventory for Elderly-Screening (HHIE-S) versus a single question: Reliability, validity, and relations with quality of life measures in the elderly community, Japan. Quality of Life Research 22(5):1151–1159.
Toolbox Assessments, Inc. 2024. Words-in-Noise test. https://nihtoolbox.org/test/words-in-noise-test (accessed January 12, 2025).
Toussaint, A., P. Hüsing, A. Gumz, K. Wingenfeld, M. Härter, E. Schramm, and B. Löwe. 2020. Sensitivity to change and minimal clinically important difference of the 7-item Generalized Anxiety Disorder Questionnaire (GAD-7). Journal of Affective Disorders 265:395–401.
Utoomprurkporn, N., J. Stott, S. G. Costafreda, and D. E. Bamiou. 2021. Lack of association between audiogram and hearing disability measures in mild cognitive impairment and dementia: What audiogram does not tell you. Healthcare (Basel) 9(6).
Van den Borre, E., S. Denys, A. van Wieringen, and J. Wouters. 2021. The digit triplet test: A scoping review. International Journal of Audiology 60(12):946–963.
Ventry, I. M., and B. E. Weinstein. 1982. The Hearing Handicap Inventory for the Elderly: A new tool. Ear and Hearing 3(3):128–134.
Ventry, I. M., and B. E. Weinstein. 1983. Identification of elderly people with hearing problems. ASHA 25(7):37–42.
Vermiglio, A. J., L. Leclerc, M. Thornton, H. Osborne, E. Bonilla, and X. Fang. 2021. Diagnostic accuracy of the AzBIO speech recognition in noise test. Journal of Speech, Language, and Hearing Research 64(8):3303–3316.
Walden, B. E. 1997. Toward a model clinical-trials protocol for substantiating hearing aid user-benefit claims. American Journal of Audiology 6(2):13–24.
Walden, T. C., and B. E. Walden. 2004. Predicting success with hearing aids in everyday living. Journal of the American Academy of Audiology 15(5):342–352.
Wang, S., and L. L. N. Wong. 2024. Development of the Mandarin Digit-in-Noise test and examination of the effect of the number of digits used in the test. Ear and Hearing 45(3):572–582.
Watson, C. S., G. R. Kidd, J. D. Miller, C. Smits, and L. E. Humes. 2012. Telephone screening tests for functionally impaired hearing: Current use in seven countries and development of a US version. Journal of the American Academy of Audiology 23(10):757–767.
Weinstein, B. E., and I. M. Ventry. 1983. Audiometric correlates of the Hearing Handicap Inventory for the Elderly. Journal of Speech and Hearing Disorders 48(4):379–384.
Weinstein, B. E., J. B. Spitzer, and I. M. Ventry. 1986. Test-retest reliability of the Hearing Handicap Inventory for the Elderly. Ear and Hearing 7(5):295–299.
Wilson, R. H. 2003. Development of a speech-in-multitalker-babble paradigm to assess word-recognition performance. Journal of the American Academy of Audiology 14(9):453–470.
Wilson, R. H. 2006. Speech recognition and identification materials, disc 4.0. https://chs.asu.edu/sites/default/files/2022-05/booklet-speech_recid_disc_4.0_0.pdf (accessed January 14, 2025).
Wilson, R. H. 2011. Clinical experience with the Words-in-Noise test on 3430 veterans: Comparisons with pure-tone thresholds and word recognition in quiet. Journal of the American Academy of Audiology 22(7):405–423.
Wilson, R. H., and A. Strouse. 2002. Northwestern University auditory test no. 6 in multi-talker babble: A preliminary report. Journal of Rehabilitation Research and Development 39(1).
Wilson, R. H., and D. G. Weakley. 2004. The use of digit triplets to evaluate word-recognition abilities in multitalker babble. Seminars in Hearing 25(1):93–111.
Wilson, R. H., and C. A. Burks. 2005. Use of 35 words for evaluation of hearing loss in signal-to-babble ratio: A clinic protocol. Journal of Rehabilitation Research and Development 42(6):839–852.
Wilson, R. H., and R. McArdle. 2007. Intra- and inter-session test, retest reliability of the Words-in-Noise (WIN) test. Journal of the American Academy of Audiology 18(10):813–825.
Wilson, R. H., and W. B. Cates. 2008. A comparison of two word-recognition tasks in multitalker babble: Speech Recognition in Noise Test (SPRINT) and Words-in-Noise test (WIN). Journal of the American Academy of Audiology 19(7):548–556.
Wilson, R. H., and K. L. Watts. 2012. The Words-in-Noise test (WIN), list 3: A practice list. Journal of the American Academy of Audiology 23(2):92–96.
Wilson, R. H., H. B. Abrams, and A. L. Pillion. 2003. A word-recognition task in multitalker babble using a descending presentation mode from 24 dB to 0 dB signal to babble. Journal of Rehabilitation Research and Development 40(4).
Wilson, R. H., C. A. Burks, and D. G. Weakley. 2005. Word recognition in multitalker babble measured with two psychophysical methods. Journal of the American Academy of Audiology 16(8):622–630.
Wilson, R. H., C. A. Burks, and D. G. Weakley. 2006. Word recognition of digit triplets and monosyllabic words in multitalker babble by listeners with sensorineural hearing loss. Journal of the American Academy of Audiology 17(6):385–397.
Wilson, R. H., C. S. Carnell, and A. L. Cleghorn. 2007a. The Words-in-Noise (WIN) test with multitalker babble and speech-spectrum noise maskers. Journal of the American Academy of Audiology 18(6):522–529.
Wilson, R. H., R. A. McArdle, and S. L. Smith. 2007b. An evaluation of the BKB-SIN, HINT, QuickSIN, and WIN materials on listeners with normal hearing and listeners with hearing loss. Journal of Speech, Language, and Hearing Research 50(4):844–856.
Wilson, R. H., C. P. Trivette, D. A. Williams, and K. L. Watts. 2012. The effects of energetic and informational masking on the Words-in-Noise test (WIN). Journal of the American Academy of Audiology 23(7):522–533.
Wolinsky, F. D., G. Wan, and W. Tierney. 1998. Changes in the SF-36 in 12 months in a sample of disadvantaged older adults. Medical Care 36:1589–1598.
Wright, D., and J.-P. Gagné. 2021. Acclimatization to hearing aids by older adults. Ear and Hearing 42(1):193–205.
Wyrwich, K., W. M. Tierney, and F. D. Wolinsky. 1999. Further evidence supporting an SEM-based criterion for identifying meaningful intra-individual changes in health-related quality of life. Journal of Clinical Epidemiology 52:861–873.
Wyss, J., D. J. Mecklenburg, and P. L. Graham. 2020. Self-assessment of daily hearing function for implant recipients: A comparison of mean total scores for the Speech Spatial Qualities of Hearing Scale (SSQ49) with the SSQ12. Cochlear Implants International 21(3):167–178.
Zaar, J., P. Ihly, T. Nishiyama, S. Laugesen, S. Santurette, C. Tanaka, G. Jones, M. Vatti, D. Suzuki, and T. Kitama. 2023. Predicting speech-in-noise reception in hearing-impaired listeners with hearing aids using the Audible Contrast Threshold (ACT) test. [Preprint]. PsyArXiv Preprints. https://doi.org/10.31234/osf.io/m9khu.
Zaar, J., L. B. Simonsen, R. Sanchez-Lopez, and S. Laugesen. 2024. The Audible Contrast Threshold (ACT) test: A clinical spectro-temporal modulation detection test. Hearing Research 453:109103.
Zhang, Y., X. Xi, and Y. Huang. 2023. The anchor design of anchor-based method to determine the minimal clinically important difference: A systematic review. Health and Quality of Life Outcomes 21:74.