Clinician-rated instruments should demonstrate three types of reliability: 1) internal reliability, 2) retest reliability, and 3) interrater reliability. Cronbach’s alpha statistic (78) is used to evaluate internal reliability, and estimates ≥0.70 reflect adequate reliability (79, 80). The internal reliability of individual items is calculated by using corrected item-to-total correlation with Pearson’s r; items should have a correlation greater than 0.20 (79, 80). Retest reliability assesses the extent to which multiple administrations of the scale generate the same results. When scores on an instrument are expected to change in response to effective treatment, it is necessary to demonstrate that these scores remain the same in the absence of treatment. Interrater reliability assesses the extent to which multiple raters generate the same result. Although Pearson’s r is often used to compute these estimates, the preferred method is the intraclass r (81), which allows for adjustment for agreement by chance. Estimates of retest and interrater reliability should be at a minimum of 0.70 (Pearson’s r) and 0.60 (intraclass r) (82). For retest reliability of scale items, Pearson’s r >0.70 is considered acceptable (83).
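The two internal reliability statistics described above can be sketched in a few lines. This is a minimal illustration rather than the procedure used in any of the cited studies: the function names and the small ratings matrix are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (respondents x items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # sample variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # sample variance of the total score
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

def corrected_item_total(scores):
    """Pearson's r between each item and the total of the remaining items."""
    scores = np.asarray(scores, dtype=float)
    total = scores.sum(axis=1)
    result = []
    for j in range(scores.shape[1]):
        rest = total - scores[:, j]              # total score with item j removed
        result.append(np.corrcoef(scores[:, j], rest)[0, 1])
    return np.array(result)

# Hypothetical ratings: 6 patients (rows) x 4 items (columns); not real data.
ratings = np.array([
    [0, 1, 0, 1],
    [1, 1, 1, 0],
    [2, 2, 1, 2],
    [2, 3, 2, 1],
    [3, 3, 3, 2],
    [4, 4, 3, 2],
])

alpha = cronbach_alpha(ratings)
item_r = corrected_item_total(ratings)
print(round(alpha, 3), np.round(item_r, 2))
```

With these made-up ratings, alpha is about 0.93 and every corrected item-to-total correlation exceeds 0.20, so this toy scale would pass both thresholds cited above.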
Table 2 summarizes the results from studies examining internal reliability of the total Hamilton depression scale. Estimates ranged from 0.46 to 0.97, and 10 studies reported estimates ≥0.70. Table 3 summarizes the studies that examined internal reliability at the item level. The majority of Hamilton depression scale items show adequate reliability. Six items met the reliability criteria in every sample (guilt, middle insomnia, psychic anxiety, somatic anxiety, gastrointestinal, general somatic), and an additional five items met the criteria in all but one sample (depressed mood, suicide, early insomnia, late insomnia, work and interests, hypochondriasis). Loss of insight was the item with the most variable findings, suggesting a potential problem with this item.
Total Hamilton depression scale interrater reliabilities are displayed in Table 2. Pearson’s r ranged from 0.82 to 0.98, and the intraclass r ranged from 0.46 to 0.99. Some investigators provided evidence that the skill level or expertise of the interviewer and the provision of structured queries and scoring guidelines affect reliability (19, 23, 35, 54). Across studies, however, the best-estimate mean interrater reliability for studies reporting higher levels of interviewer skill and use of expert raters, structured queries, and scoring guidelines did not differ significantly from that for other studies (z=0.81, n.s.).
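One reason Pearson's r and an intraclass coefficient can diverge on the same ratings is that Pearson's r ignores systematic differences between raters, whereas absolute-agreement forms of the intraclass coefficient penalize them. The sketch below illustrates this with hypothetical total scores. It assumes the two-way random-effects, single-rater, absolute-agreement form, often labeled ICC(2,1); the studies reviewed here do not necessarily use that form, and the function name and data are invented for illustration.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a (subjects x raters) matrix. The choice of this ICC form
    is an assumption made for illustration only.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical total scores: rater B rates every patient 4 points higher than rater A.
rater_a = np.array([10, 14, 18, 22, 26])
rater_b = rater_a + 4

pearson_r = np.corrcoef(rater_a, rater_b)[0, 1]
icc = icc_2_1(np.column_stack([rater_a, rater_b]))
print(round(pearson_r, 3), round(icc, 3))
```

In this constructed case the two raters rank patients identically, so Pearson's r is 1.0, yet the constant 4-point offset lowers the intraclass coefficient to about 0.83.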
At the individual item level, interrater reliability is poor for many items. Cicchetti and Prusoff (19) assessed reliability before treatment initiation and 16 weeks later at trial end. Only early insomnia was adequately reliable before treatment, and only depressed mood was adequately reliable after treatment. Thirteen items had coefficients <0.50 before treatment, and 11 items had coefficients <0.50 after treatment. Rehm and O’Hara (61) performed a similar analysis with data from two samples. Six items showed adequate reliability in the first sample (early insomnia, middle insomnia, late insomnia, somatic anxiety, gastrointestinal, loss of libido), as did 10 in the second sample (depressed mood, guilt, suicide, early insomnia, middle insomnia, late insomnia, work/interests, psychic anxiety, somatic anxiety, gastrointestinal). Loss of insight showed the lowest interrater agreement in both samples. Craig et al. (20) found that only one item, work/interests, had adequate interrater reliability. Moberg et al. (50) reported that nine items demonstrated adequate reliability when the standard Hamilton depression scale was administered (depressed mood, guilt, suicide, early insomnia, late insomnia, agitation, psychic anxiety, hypochondriasis, loss of insight), but all items showed adequate reliability when the scale was administered with interview guidelines. Potts et al. (59) demonstrated that a single omnibus coefficient can mask specific problems. Using a structured interview version of the Hamilton depression scale, they found an overall intraclass coefficient of 0.92; however, two trained psychiatrists differed at least 20% of the time in their ratings of psychic anxiety, psychomotor agitation, and psychomotor retardation, and they differed by at least two points 15% of the time in their ratings of loss of libido. 
The ratings of trained raters disagreed with the psychiatrists’ ratings on psychomotor agitation (50% of the time), hypochondriasis (60%), loss of libido (90%), and loss of energy (100%).
Retest reliability for the Hamilton depression scale ranged from 0.81 to 0.98 (Table 2). Retest reliability at the item level (Table 3) ranged from 0.00 to 0.85. Williams (76) argued in favor of using structured interview guides to boost item and total scale reliability and developed the Structured Interview Guide for the Hamilton Depression Rating Scale. This effort increased the mean retest reliability across individual items to 0.54, although only four items met the criteria for adequate reliability (depressed mood, early insomnia, psychic anxiety, and loss of libido).
Standard psychometric practice dictates that items within an instrument should measure a single symptom and contain response options linked to increasing or decreasing amounts of that symptom. Each item is assumed to contribute equally to the total score or be backed with evidence in support of differential weighting. These criteria are not consistently met by using the current scaling procedure or the options for rating symptoms. Although improperly scaled items can cause problems in quantitative measurement, evaluation of item scaling takes place first at a qualitative level. Some Hamilton depression scale items measure single symptoms along a meaningful continuum of severity; many do not. The item assessing depressed mood includes a combination of affective, behavioral, and cognitive features, such as gloomy attitude, pessimism about the future, subjective feeling of sadness, and tendency to weep. The general somatic symptoms item, which is also symptomatically heterogeneous, includes feelings of heaviness, diffuse backache, and loss of energy. Headache is coded only as part of somatic anxiety along with such symptoms as indigestion, palpitations, and respiratory difficulties. Genital symptoms for women entail loss of libido and menstrual disturbances. The problems inherent in the heterogeneity of these rating descriptors reduce the potential meaningfulness of these items, a problem exacerbated if the different components of an item actually measure multiple constructs and thus measure different effects.
Most items on the Hamilton depression scale at least are scaled so that increasing scores represent increasing severity. It is less clear whether the anchors used for different scores on certain items actually assess the same underlying construct/syndrome. This ambiguity is most obvious for severity ratings involving psychotic features. The feelings of guilt item, for example, is graded as follows: 0=absent, 1=self-reproach, 2=ideas of guilt or rumination over past errors or sinful deeds, 3=present illness is a punishment, and 4=hears accusatory or denunciatory voices and/or experiences threatening visual hallucinations. A patient with guilt-themed hallucinations may be more severely ill than a patient who has nonpsychotic guilty feelings, but is he/she feeling more guilt? The psychotic features may instead represent a qualitatively different construct/syndrome associated with more severe illness. Similarly, the hypochondriasis item progresses through bodily self-absorption (rated 1) and preoccupation with health (rated 2) before switching to querulous attitude (rated 3) and then again to hypochondriacal delusions (rated 4). These item-scoring anchors violate basic measurement principles, because nominal scaling and ordinal scaling are combined in a single item.
Although Hamilton (1) explained the rationale for the inclusion of both 3-point and 5-point items, the argument was not made on the grounds of differential weighting. Hamilton believed that certain items would be difficult to anchor dimensionally and therefore assigned them fewer response options. The end result is that certain items contribute more to the total score than others. Contrasting psychomotor retardation and psychomotor agitation, for example, reveals that a severe manifestation of the former contributes 4 points, whereas an equally severe manifestation of the latter contributes 2 points. Similarly, someone who weeps all the time can contribute 3 or 4 points on depressed mood, whereas someone who feels tired all the time can contribute only 2 points on the general somatic symptoms item.
A psychiatric rating scale should measure a single psychopathological construct (i.e., an illness or syndrome) and be composed of items that adequately cover a range of symptoms that are consistently associated with the syndrome. Item response theory, a method used increasingly in the evaluation and construction of psychometric instruments, permits empirical evaluation of these premises. It is important to note that this method was not available when the original Hamilton depression scale was developed, although some researchers more recently used this method to evaluate this instrument. According to item response theory, a scale and its constituent items may have good reliability estimates but still fail to meet item response theory criteria. For example, if a depression scale were composed only of items measuring mild depression, the instrument would have great difficulty distinguishing between moderate and severe cases of depression, as both would be characterized by high scores on all items. This issue is particularly pressing in studies of clinical change; not only is a wide range of severity often represented in this research, but individual patients are expected to move along this continuum as they improve. Continued use of items insensitive to change underestimates the strength of actual treatment effects and makes it necessary to have larger samples to demonstrate that an effect is statistically significant. Falsely identifying patients as not having changed represents an additional source of "noise" and weakens the "signal" of a true treatment effect. A pragmatic implication of such lack of sensitivity is that new compounds shown to be promising in the laboratory may appear spuriously ineffective in clinical trials.
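The ceiling problem described above can be made concrete with a simple item response model. The sketch below uses a two-parameter logistic model for dichotomous items, which is a simplification chosen for illustration (Hamilton depression scale items are polytomous). All parameter values, severity levels, and names are hypothetical.

```python
import numpy as np

def endorse_prob(theta, a, b):
    """2PL probability of endorsing each item at latent severity theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def expected_total(theta, a, b):
    """Expected number of endorsed items at latent severity theta."""
    return endorse_prob(theta, a, b).sum()

a = np.full(5, 2.0)                                  # hypothetical discriminations
b_mild = np.array([-2.0, -1.5, -1.0, -0.5, 0.0])     # every item located at mild severity
b_wide = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])       # items spread across the severity range

moderate, severe = 1.5, 3.0                          # hypothetical latent severity levels

# Difference in expected total score between a severe and a moderate patient.
gap_mild = expected_total(severe, a, b_mild) - expected_total(moderate, a, b_mild)
gap_wide = expected_total(severe, a, b_wide) - expected_total(moderate, a, b_wide)
print(round(gap_mild, 3), round(gap_wide, 3))
```

Under these assumed parameters, the all-mild scale separates the moderate and severe patients by less than a tenth of a point in expected score, because both endorse nearly every item, whereas the scale with items spread across the severity range separates them by roughly nine tenths of a point.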
A related issue concerns the extent to which a severity score actually measures a single unidimensional syndrome. To summarize a syndrome with a single score requires a precise understanding of what that score represents. The implicit assumption is that the severity score represents a single dimension (84); if depression is heterogeneous, interpretation of a single summed score is unclear. If, for example, items assessing psychological and physical symptoms were only loosely related, a single score would not distinguish between two potentially different groups of depressed patients—one group whose symptoms were primarily psychological and another group with primarily vegetative symptoms. Any effects of an intervention targeting only one of these aspects would be harder to detect.
Gibbons et al. (85) presented a strategy for identifying a unidimensional set of items from a psychiatric rating scale and evaluating the extent to which these items adequately measure the full range of depression severity. Subsequently, a subset of Hamilton depression scale items that would measure a single dimension of depression across a wide range of severity was developed (30). This subset included depressed mood, which was sensitive at low levels; work/interests, psychic anxiety, and loss of libido, which were sensitive at mild levels; somatic anxiety, psychomotor agitation, and guilt, which were sensitive at moderate levels; and suicide, which was sensitive at severe levels. These items were proposed as a psychometrically stronger form of the full Hamilton depression scale.
Santor and Coyne (64, 65) used item response theory to examine the functioning of the full Hamilton depression scale and its individual items. In one of these studies (65) they examined individual Hamilton depression scale item performance in a combined sample of primary care patients and depressed patients from the National Institute of Mental Health Treatment of Depression Collaborative Research Program. One expects different item ratings at different levels of depression severity, with zeroes more common at mild levels of overall depression and higher item scores more common with more severe overall depression. Although most items on the Hamilton depression scale are, overall, sensitive to depression severity, 12 items had at least one problematic response option (the five items that had no such problems were depressed mood, guilt, suicide, work/interests, and psychic anxiety) (64). For example, the likelihood of receiving a rating of 1 on the insomnia items was essentially the same regardless of the overall severity of depression, and the likelihood of receiving a rating of 4 on somatic anxiety was very low even when overall depression was severe. These findings confirm that the rating scheme is not ideal for many items on the Hamilton depression scale, with the unfortunate effect of decreasing the capacity of the Hamilton depression scale to detect change (6, 7).
Additional efforts to analyze the performance of individual Hamilton depression scale items and to identify an underlying single dimension of depression severity have benefited from a technique known as Rasch analysis, a method similar to item response theory. Rasch analysis proposes an ideal underlying dimension based on mathematical and theoretical reasoning about the construct that is being measured and then assesses the extent to which actual data correspond to this ideal. This approach was first applied to the Hamilton depression scale by Bech et al. (86), who confirmed that six items previously shown to have properties associated with unidimensionality (87) could be combined to create a shorter scale that met the formal Rasch criteria. This six-item scale was thus proposed as a better measure than the full Hamilton depression scale for assessing depression severity along a single dimension; the six-item scale is composed of items for depressed mood, guilt, work/interests, psychomotor retardation, psychic anxiety, and general somatic symptoms (87). The unidimensionality of this six-item subscale has since been confirmed in two studies that used Rasch methods (13, 14). Maier and Philipp (44) used Rasch analysis to confirm unidimensionality for a subset of Hamilton depression scale items. The resulting scale was similar to that obtained by Bech et al. (86). In another study that used Rasch analysis (46), six items were found to be problematic: suicide, psychomotor agitation, somatic anxiety, general somatic symptoms, hypochondriasis, and loss of insight.
Validity of psychiatric rating scales such as the Hamilton depression scale comprises 1) content, 2) convergent, 3) discriminant, 4) factorial, and 5) predictive validity. Content validity is assessed by examining scale items to determine correspondence with known features of a syndrome. Convergent validity is adequate when a scale shows Pearson’s r values of at least 0.50 in correlations with other measures of the same syndrome. Discriminant validity is established by showing that groups differing in their diagnostic status can be separated by using the scale. Predictive validity for symptom severity measures such as the Hamilton depression scale is determined by a statistically significant (p<0.05) capacity to predict change with treatment. Factorial validity is established by using factor analysis or related techniques (e.g., principal-component analysis) to demonstrate that a meaningful structure can be found in multiple samples. An a priori criterion of 0.40 has been used to identify which items are part of which factors (88).
Because of its wide use and long clinical tradition, the Hamilton depression scale seems both to define and to measure depression. One could criticize DSM-IV for not adequately capturing Hamilton depression scale depression as much as one could criticize the Hamilton depression scale for not providing full coverage of DSM-IV depression. Nonetheless, the operational criteria provided in DSM-IV are used as the official nosology for much of psychiatry worldwide. The criteria for major depression have been revised three times in response to developments in field trial research and clinical consensus based on expert opinion, most recently in 1994. Researchers have developed a number of longer versions of the Hamilton depression scale that include additional symptoms such as the reverse vegetative features of atypical depression. However, the core items of the Hamilton depression scale have remained unchanged for more than 40 years. It is reasonable to ask whether this instrument captures depression as it is currently conceptualized. Several symptoms contained within the Hamilton depression scale are not official DSM diagnostic criteria, although they are recognized as features associated with depression (e.g., psychic anxiety). For other symptoms included in the Hamilton depression scale (e.g., loss of insight, hypochondriasis), the link with depression is more tenuous. More critically, important features of DSM-IV depression are often buried within more complex items and sometimes are not captured at all. The work/interests item includes anhedonic features along with listlessness, indecisiveness, social avoidance, and lowered productivity. It is impossible to determine the extent to which anhedonia per se influences severity. Guilt is captured in both Hamilton depression scale depression and DSM-IV depression, but the Hamilton depression scale contains no explicit assessment of feelings of worthlessness.
Decision-making difficulties are buried within the work/interests item of the Hamilton depression scale, but concentration difficulties are not included. The reverse vegetative symptoms—weight gain, hyperphagia, and hypersomnia—were provided by Hamilton (1) as additional items but are not scored on the original Hamilton depression scale.
A wide range of instruments has been used to examine the convergent validity of the Hamilton depression scale (Table 4). Most of the correlation coefficients met the preestablished criterion, and the Hamilton depression scale showed adequate convergent validity in correlations with all but two scales. One of the two exceptions was the major depression section of the Structured Clinical Interview for DSM-IV, a finding that provides evidence of noncorrespondence between the Hamilton depression scale and DSM-IV.
Two approaches have been used to evaluate the discriminant validity of the Hamilton depression scale. In the first approach, several studies used the receiver operating characteristic (ROC) curve as a statistical means of determining the cutoff scores for detecting depression and then provided corresponding rates of sensitivity, specificity, positive predictive power, and negative predictive power for the Hamilton depression scale in distinguishing depressed and nondepressed subjects. In the second approach, researchers have examined the capacity of the Hamilton depression scale to distinguish different groups of clinical patients (e.g., patients with endogenous versus those with nonendogenous depression, patients with anxiety versus those with depression) using statistical techniques to detect mean group differences. Classification rates resulting from ROC analysis have not been widely reported in the Hamilton depression scale literature. Our search identified only seven studies (Table 5), and some of these investigations sought to detect depression in samples of patients with medical conditions other than psychiatric disorders (Table 1). Sensitivity, specificity, and negative predictive power were generally consistent and large, but positive predictive power was more variable, and two studies reported very low positive predictive power.
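Positive and negative predictive power depend on the base rate of depression in the sample as well as on sensitivity and specificity. The sketch below applies Bayes' rule to hypothetical values; the function name, the sensitivity and specificity of 0.85, and the two prevalence figures are assumptions chosen for illustration, not estimates from the reviewed studies. It shows one possible contributor to variable positive predictive power, not an established explanation for the findings above.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a cutoff with the given accuracy and base rate."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)   # P(depressed | score above cutoff)
    npv = true_neg / (true_neg + false_neg)   # P(not depressed | score below cutoff)
    return ppv, npv

# Hypothetical settings: same cutoff accuracy, different base rates of depression.
ppv_high_base, npv_high_base = predictive_values(0.85, 0.85, prevalence=0.50)
ppv_low_base, npv_low_base = predictive_values(0.85, 0.85, prevalence=0.05)
print(round(ppv_high_base, 3), round(npv_high_base, 3))
print(round(ppv_low_base, 3), round(npv_low_base, 3))
```

With identical sensitivity and specificity, positive predictive power falls from 0.85 at a 50% base rate to about 0.23 at a 5% base rate, while negative predictive power rises above 0.99.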
The second type of discriminant validity study attempts to distinguish different clinical groups. In a comparison of healthy, depressed, and bipolar depressed individuals, Rehm and O’Hara (61) found that the total Hamilton depression scale score clearly differentiated these three categories, with the depressed patients scoring higher than the healthy participants and with the bipolar depressed patients scoring higher than both of the other groups. At the item level, four items (psychomotor agitation, gastrointestinal symptoms, loss of insight, and weight loss) failed to differentiate depressed from healthy subjects. Only psychic anxiety and hypochondriasis significantly differentiated the subjects with unipolar and bipolar depression. Kobak et al. (37) showed significant total scale score differences between individuals with major depression, individuals with minor depression, and healthy comparison subjects. Zheng et al. (77) reported that the Hamilton depression scale was able to discriminate psychiatric patients classified as mildly, moderately, and severely dysfunctional on the basis of Global Severity Scale scores. Thase et al. (73) found that the Hamilton depression scale could distinguish patients with endogenous depression from patients with nonendogenous depression, with patients in the former category having higher scores. Gottlieb et al. (32) reported no significant differences between the Hamilton depression scale scores of patients classified as having low-severity versus high-severity Alzheimer’s disease. Several researchers have investigated the capacity of the Hamilton depression scale to differentiate between patients with anxiety and those with depression. Prusoff and Klerman (89) suggested the Hamilton depression scale could indeed separate these constructs, and Maier et al. (45) demonstrated that the Hamilton depression scale had a higher correlation with an external measure of depression than with an external measure of anxiety, but the saturation of the Hamilton depression scale with anxiety-related concepts was nonetheless considerable.
Edwards et al. (90) performed a meta-analysis of 19 studies with a total of 1,150 patients that compared the predictive validity of the Hamilton depression scale and the Beck Depression Inventory. Treatments included pharmacotherapy, behavior therapy, cognitive restructuring, dynamic psychotherapy, and various combinations. The Hamilton depression scale was found to be more sensitive to change, compared to the Beck Depression Inventory. Lambert et al. (39) performed a meta-analysis that included 36 studies and a total of 1,850 patients and that compared the Hamilton depression scale to the Beck Depression Inventory and the Zung Self-Rating Depression Scale. They reported that the Hamilton depression scale was more sensitive to change than were the two self-report measures. Sayer et al. (66) also demonstrated that the Hamilton depression scale outperformed the Beck Depression Inventory in detecting change. Lambert et al. (40) reported that the Beck Depression Inventory is more likely to show treatment effects at 12 weeks than the Zung Self-Rating Depression Scale or the Hamilton depression scale; the Zung Self-Rating Depression Scale and the Hamilton depression scale were more likely to detect changes after 3 weeks.
One disadvantage of a multidimensional instrument such as the Hamilton depression scale in detecting change is that specific treatments may affect only a single dimension. If the total score includes somatic symptoms that actually reflect treatment side effects, estimates of treatment response will be spuriously low (44). In two studies and one meta-analysis researchers addressed this issue using the various unidimensional core depression item sets described earlier in the section on item characteristics (91, 92). The six-item subscale developed by Bech et al. (87) was found to be at least as responsive as the full Hamilton depression scale. A meta-analysis of eight fluoxetine studies with 1,658 patients showed that the different unidimensional subscales (44, 87) were more sensitive to change than was the full Hamilton depression scale score. These results were replicated in a second meta-analysis of four tricyclic antidepressant studies (25).
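A simple calculation suggests why a short, responsive subscale can be more sensitive to change than the full total score. The sketch below makes a strong simplifying assumption, namely that items are uncorrelated and share the same standard deviation, which does not hold for real rating scales. The function name and all numbers are hypothetical.

```python
import math

def standardized_change(mean_changes, sds):
    """Standardized mean change of a summed score, assuming uncorrelated items."""
    total_change = sum(mean_changes)
    total_sd = math.sqrt(sum(sd ** 2 for sd in sds))
    return total_change / total_sd

# Hypothetical scale: 6 items improve by 0.5 points with treatment,
# 11 items do not change at all; every item has a standard deviation of 1.
responsive = [0.5] * 6
unresponsive = [0.0] * 11

d_subscale = standardized_change(responsive, [1.0] * 6)
d_full = standardized_change(responsive + unresponsive, [1.0] * 17)
print(round(d_subscale, 3), round(d_full, 3))
```

Under these assumptions the six-item subscale yields a standardized change of about 1.22, compared with about 0.73 for the 17-item total, because the unresponsive items add variance without adding any change.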
A total of 15 studies with 17 samples reported a factor analysis of the Hamilton depression scale (Table 6). In most of the studies, researchers used the eigenvalue ≥1 rule to determine the number of factors, extracted those factors from the data using principal-component analysis, and then determined the optimal configuration of items on factors using varimax rotation. The number of factors identified ranged from two to eight. Insomnia items appeared consistently on the same factor in 13 data sets, suggesting a sleep disturbance factor. There was some support for the presence of a general depression factor, as depressed mood, guilt, and suicide appeared together on the same factor in six data sets, and the combination of depressed mood, suicide, and psychic anxiety appeared on the same factor in seven data sets. Support was also found for an anxiety/agitation factor, with the agitation, psychic anxiety, and somatic anxiety items appearing together in six samples. Clearly, the Hamilton depression scale is not unidimensional, as separate sets of items do seem to reliably represent general depression and insomnia factors; however, the exact structure of the Hamilton depression scale’s multidimensionality remains unclear.
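The first two steps of the procedure described above (the eigenvalue ≥1 rule and principal-component extraction) can be sketched as follows, together with a 0.40 loading cutoff for assigning items to components. The correlation matrix is constructed by hand and is purely hypothetical, the function name is invented, and varimax rotation is deliberately omitted to keep the example short, so this is not a full reproduction of the analyses in the reviewed studies.

```python
import numpy as np

def kaiser_components(corr, loading_cutoff=0.40):
    """Retain principal components with eigenvalue >= 1 and flag salient loadings."""
    eigvals, eigvecs = np.linalg.eigh(corr)        # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]              # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals >= 1.0                          # eigenvalue >= 1 rule
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])   # unrotated loadings
    salient = np.abs(loadings) >= loading_cutoff
    return eigvals, loadings, salient

# Hypothetical 6-item correlation matrix with two unrelated clusters of items:
# items 0-2 intercorrelate at 0.6, items 3-5 at 0.5.
corr = np.eye(6)
corr[:3, :3] = 0.6
corr[3:, 3:] = 0.5
np.fill_diagonal(corr, 1.0)

eigvals, loadings, salient = kaiser_components(corr)
n_components = loadings.shape[1]
print(np.round(eigvals, 2), n_components)
print(salient)
```

For this constructed matrix, two components are retained, with the first three items loading on one and the last three on the other. Real item correlations are rarely this clean, which is consistent with the varying numbers of factors reported across samples.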
The Hamilton depression scale has been the standard for the assessment of depression for more than 40 years. Researchers and policy makers charged with the task of providing standards to evaluate treatment outcomes in depression are faced with three possible solutions: retain, revise, or reject. The last of these solutions argues for the development of a new instrument or the replacement of the Hamilton depression scale with existing, psychometrically superior instruments.
Many of the psychometric properties of the Hamilton depression scale are adequate and consistently meet established criteria. The internal, interrater, and retest reliability estimates for the overall Hamilton depression scale are mostly good, as are the internal reliability estimates at the item level. Similarly, established criteria are met for convergent, discriminant, and predictive validity, although predictive validity does suffer somewhat because of multidimensionality. At the item level, interrater and retest coefficients are weak for many items, and the internal reliability coefficients indicate that some items are problematic. The lack of individual item reliability is not necessarily a fatal psychometric flaw; what is critical is that the items as a whole provide adequate reliability.
Evaluation of item response shows that many of the individual items are poorly designed and sum to generate a total score whose meaning is multidimensional and unclear. The problem of multidimensionality was highlighted in the evaluation of factorial validity, which showed a failure to replicate a single unifying structure across studies. Although the unstable factor structure of the Hamilton depression scale may be partly attributable to the diagnostic diversity of population samples, well-designed scales assessing clearly defined constructs produce factor structures that are invariant across different populations (88). Finally, the Hamilton depression scale is measuring a conception of depression that is now several decades old and that is, at best, only partly related to the operationalization of depression in DSM-IV.
These findings indicate that continued use of the Hamilton depression scale requires, at the very least, a complete overhaul of its constituent items. Accumulated empirical evidence offers some hope that substantial revision can redress a number of psychometric problems, thereby providing an improved measure. Shortened versions of the Hamilton depression scale converge on a common set of core features and in general have proven more effective in detecting change. The truncated item sets for these instruments, however, are limited in that they do not permit capture of the full depressive syndrome. Other studies based on item response theory methods have indicated that modifications of the rating scheme are readily implemented and can enhance the unidimensionality of these core symptoms in a manner that allows uniform assessment of change. Identifying a core set of symptoms with proven psychometric qualities, along with making rating scheme changes that would allow consistent assessment of the severity of depression, could provide a foundation for a reconstructed scale. One advantage of such a revision is that it would maintain continuity with the long-standing use of the original Hamilton depression scale. This sort of transition is probably more palatable and therefore more readily acceptable to regulatory commissions.
The Depression Rating Scale Standardization Team revised the Hamilton depression scale (i.e., the GRID-HAMD [93, 94]) by employing several of the methodological advances we have been advocating in this article. They used item response theory methods to inform, in part, the revision process; developed clear structured interview prompts and scoring guidelines; and to some extent standardized the scoring system. We nonetheless believe that by making an effort to retain the original 17 items, the Depression Rating Scale Standardization Team failed to address many of the flaws of the original instrument. Most of the items still measure multiple constructs, items that have consistently been shown to be ineffective have been retained, and the scoring system still includes differential weighting of items. Moreover, the GRID-HAMD content is virtually unchanged from the original. All the items that appeared on the Hamilton depression scale in 1960 are included in the GRID-HAMD. Thus, this revision has neither removed items based on outdated concepts nor added items that incorporate contemporary definitions of depression.
Rejection of the Hamilton depression scale and replacement with an alternative existing measure or the implementation of a new instrument has scientifically compelling advantages over revision. The Inventory of Depressive Symptomatology (95) and the Montgomery-Åsberg Depression Rating Scale (96), designed to address the limitations of the Hamilton depression scale, represent two potential replacement alternatives. Although these instruments measure contemporary definitions of depression (33), neither item response theory methods nor other contemporary measurement techniques were employed in their development. As indicated earlier, such techniques, especially item response theory, maximize the capacity of an instrument to detect change. On the other hand, the development and implementation of a new instrument that is based on current knowledge of depression and that takes advantage of psychometric and statistical advances might offer the best solution. The decision to replace the Hamilton depression scale with either an existing instrument or a newly developed instrument would ultimately rest on consensus that such an instrument could capture more adequately the full spectrum of the depression construct and on empirical evidence of the new instrument’s superiority in detecting treatment effects.
In conclusion, we have been struck by the marked contrast between the effort and scientific sophistication involved in designing new antidepressants and the continued reliance on antiquated concepts and methods for assessing change in the severity of the depression that these very medications are intended to affect. Effort in both areas is critical to the accessibility of new medications for patients with depression. Many scales and instruments used in psychiatry today are based on, or at least include, current DSM symptoms, and the measurement of depression should follow this trend. It is time to retire the Hamilton depression scale. The field needs to move forward and embrace a new gold standard that incorporates modern psychometric methods and contemporary definitions of depression.