Numerous randomized clinical trials have demonstrated the efficacy of somatic antidepressant therapy for major depressive disorder (1–7). These studies, as with randomized clinical trials in general, were designed to evaluate the benefits of treatment in tightly controlled settings measured under ideal circumstances among relatively homogeneous groups of subjects (8). Randomized clinical trials have been an indispensable source of information about efficacy. Protocols for randomized clinical trials include prescribed treatment decisions, a defined duration of treatment, limited choices of interventions (including placebo), and strict inclusion and exclusion criteria. For instance, protocols tend to exclude the mildly to moderately depressed (e.g., Hamilton Depression Rating Scale score <18) and, for both ethical and legal reasons, the acutely suicidal or psychotic patients, a group in most need of treatment. Patients taking other medications and those with comorbid psychiatric or other medical illnesses are also often excluded.
As a consequence, randomized clinical trials have informed clinical practice about the monotherapeutic treatment of nonsuicidal patients with minimal comorbid illnesses. Taken as a whole, these criteria very likely increase the drug-placebo differences. Yet, randomized clinical trial results do not apply to a substantial proportion of individuals who suffer from depressive disorders (9, 10). In contrast, effectiveness studies are designed to evaluate treatments among a more inclusive group of patients in settings more similar to those seen in clinical practice. Effectiveness studies are far less common than randomized clinical trials in medicine in general and in psychiatry in particular.
An observational study of affective disorders can be used to examine the association between treatments as administered in the community and a range of psychopathology among a heterogeneous group of subjects. Yet by design, such a study observes but does not manipulate the treatment received by subjects. As a consequence, the causal path between treatment and level of psychopathology is often ambiguous. For example, some subjects are asymptomatic because they receive treatment, whereas others receive treatment because their symptoms are exacerbated. Without experimental control over treatment decisions, the direction of the causality is not clear. Thus, observational evaluations of treatment effectiveness are less useful for treatment evaluation than randomized clinical trials because of the confounding variable of recent symptoms, which are related to both the intervention and the outcome.
Cochran (11) proposed the method of subclassification, an approach that can be applied to reduce bias in estimates of treatment effectiveness. The fundamental premise of this approach is that analyses that are stratified by a confounding variable remove the influence of that variable. That is, separate analyses of subjects with and without the characteristic of interest hold constant what otherwise confounds the relation between the intervention and the outcome. The simplicity of stratification is appealing. However, the mechanism that drives individuals to seek treatment probably consists of more than one variable (e.g., health insurance, treatment history, and comorbidity). Analyses that require multiple strata to account for numerous confounding variables are unwieldy and difficult to interpret.
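The logic of subclassification can be illustrated with a small numerical sketch. The counts below are hypothetical and were invented for illustration; they are not data from any study. In this example, severely ill subjects are both more likely to be treated and less likely to recover, so the crude comparison understates the benefit that is visible within each stratum.

```python
# Hypothetical counts, invented for illustration only (not study data).
# Each stratum maps a treatment group to (number recovered, group size).
strata = {
    "severe": {"treated": (40, 80), "untreated": (6, 20)},
    "mild": {"treated": (16, 20), "untreated": (48, 80)},
}


def recovery_rate(recovered, total):
    """Proportion of a group that recovered."""
    return recovered / total


def crude_difference(strata):
    """Treated-minus-untreated recovery rate, ignoring the confounder."""
    totals = {"treated": [0, 0], "untreated": [0, 0]}
    for groups in strata.values():
        for group, (recovered, total) in groups.items():
            totals[group][0] += recovered
            totals[group][1] += total
    return recovery_rate(*totals["treated"]) - recovery_rate(*totals["untreated"])


def stratum_differences(strata):
    """Treated-minus-untreated recovery rate within each stratum."""
    return {
        name: recovery_rate(*groups["treated"]) - recovery_rate(*groups["untreated"])
        for name, groups in strata.items()
    }


print(round(crude_difference(strata), 2))
print({name: round(diff, 2) for name, diff in stratum_differences(strata).items()})
```

With these invented counts, the crude difference in recovery rates is about 2 percentage points, whereas the difference within each severity stratum is about 20 percentage points.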
The propensity adjustment (12–15) is a univariate alternative to multivariable stratification in that a linear combination of variables related to the likelihood of treatment seeking constitutes a propensity score. In the context of antidepressant treatment effectiveness, the propensity model can examine clinical and demographic predictors of receiving treatment. The multifaceted treatment-seeking mechanism is then incorporated by stratifying effectiveness analyses by the propensity score. That is, separate effectiveness analyses are conducted for subjects who are least likely to seek treatment (i.e., those with low propensity scores), those somewhat more likely (i.e., those with moderate propensity scores), and those most likely to seek treatment (i.e., those with high propensity scores). Although the propensity adjustment reduces the bias in the estimates of treatment effectiveness associated with variables in the propensity model, unmeasured or hidden sources of bias remain (16, 17). In contrast, with randomization, both observed and hidden sources of bias tend to be removed from estimates of efficacy.
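A minimal sketch of the two steps just described, scoring and stratifying, is given below. The covariate names and coefficient values are hypothetical placeholders; in practice the coefficients would come from a fitted propensity model. The quintile assignment uses the default cut points of Python's statistics.quantiles, which is one of several reasonable conventions.

```python
import statistics

# Hypothetical coefficients, invented for illustration (not fitted values).
COEFFICIENTS = {"symptom_severity": 0.2, "prior_episodes": 0.1}


def propensity_score(covariates, coefficients=COEFFICIENTS):
    """Linear combination of covariates weighted by model coefficients."""
    return sum(coefficients[name] * value for name, value in covariates.items())


def assign_quintiles(scores):
    """Label each score 1 (lowest propensity) through 5 (highest)."""
    cuts = statistics.quantiles(scores, n=5)  # four cut points
    return [1 + sum(score > cut for cut in cuts) for score in scores]


# Usage: score a few hypothetical observations, then stratify them.
observations = [
    {"symptom_severity": severity, "prior_episodes": episodes}
    for severity, episodes in [(1, 0), (2, 1), (3, 1), (4, 2), (5, 3),
                               (6, 3), (3, 5), (5, 6), (6, 8), (6, 10)]
]
scores = [propensity_score(obs) for obs in observations]
print(assign_quintiles(scores))
```

Effectiveness analyses would then be run separately within each of the five labels, so that treated and untreated observations are compared only with others that had a similar likelihood of treatment.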
We applied the propensity methodology to the National Institute of Mental Health (NIMH) Collaborative Depression Study, a longitudinal observational study of affective illness that includes subjects with a range of illness severity and complexity. Our objectives were twofold. First, we examined features that distinguished those who received varying levels of somatic antidepressant treatment and incorporated those in estimates of the propensity for treatment intensity. Second, we evaluated treatment effectiveness in analyses that were stratified by the propensity for treatment intensity.
From 1978 through 1981, the NIMH Collaborative Depression Study recruited 955 subjects who sought treatment for one of the major affective disorders (major depressive disorder, mania, or schizoaffective disorder) at one of five academic medical centers in the United States (located in Boston, Chicago, Iowa City, New York, and St. Louis). All subjects were at least 17 years of age, English speaking, and Caucasian. Each subject provided written informed consent. The objectives and design of the NIMH Collaborative Depression Study have been described previously (18). The NIMH Collaborative Depression Study follow-up is ongoing, and the current analyses include up to 20 years of follow-up data. The patient group examined in these analyses was derived from the 431 subjects who met criteria for major depressive disorder at intake, had no underlying minor or intermittent depression of at least 2 years duration, and had no history of mania, hypomania, or schizoaffective disorder (19). Neither alcohol nor substance abuse was an exclusion criterion. Of these 431 subjects, the study group was limited to the 285 subjects who recovered from their intake episode and then had at least one recurrent affective episode over the course of the follow-up period. This was done because 1) the variables in the propensity model (described in the Data Analyses section) include clinical characteristics such as treatment during the prior episode and prior well interval, and 2) detailed clinical information on prior treatment was only available on episodes that commenced after intake into the NIMH Collaborative Depression Study.
The Schedule for Affective Disorders and Schizophrenia (20) and clinical records were used for diagnostic assessment according to Research Diagnostic Criteria (RDC) (21). The Longitudinal Interval Follow-Up Evaluation (22) was administered by trained, well-supervised raters for assessment of psychopathology, functional impairment, and dose and duration of somatic treatment. Patients were assessed with this semistructured interview semiannually for the first 5 years of the follow-up period and annually thereafter. The specific wording of the Longitudinal Interval Follow-Up Evaluation items, rater qualifications, and interrater reliability of the ratings have been reported previously (22). For instance, the intraclass correlation coefficient for week of recovery was 0.95. Severity of symptoms of major affective disorders (i.e., major depressive disorder, mania, schizoaffective depression, and schizoaffective mania) was recorded by using the Longitudinal Interval Follow-Up Evaluation psychiatric status ratings, which range from 1 (no symptoms) to 6 (severe symptoms). Information regarding somatic treatment collected during Longitudinal Interval Follow-Up Evaluation interviews was corroborated with available clinical records. During each interview, the rater assigned Longitudinal Interval Follow-Up Evaluation ratings for each week that had elapsed since the prior interview. To do so, the rater identified chronological anchor points (e.g., holidays) to assist the subject in recalling when significant clinical improvement or deterioration took place.
The NIMH Collaborative Depression Study developed composite ratings to quantify treatments appropriate for unipolar depression, psychotic depression, and bipolar disorder (23). The unipolar composite antidepressant rating is a summary measure of the intensity of somatic antidepressant treatment. The rationale and method for deriving the unipolar composite antidepressant rating have been described previously (23). The unipolar composite antidepressant rating algorithms continue to be revised with the introduction of new medications and further clinical experience with existing medications. A panel of experts, drawn from NIMH Collaborative Depression Study investigators, bases the approximations of dose equivalents largely on clinical experience, since there is limited randomized clinical trial literature that provides comparisons across graduated doses of the wide variety of medications included in the unipolar composite antidepressant rating. Daily doses of different classes of somatic antidepressant therapies are rated on a scale designed to reflect the overall commitment to somatic antidepressant treatment or intensity of treatment (examples are presented in Table 1). The algorithms include rules for increased treatment intensity associated with the use of medication for augmentation. Tests of plasma levels are not incorporated in the algorithms. The unipolar composite antidepressant rating does not purport to represent biologically equivalent doses. Instead, it is an ordinal scale of treatment intensity ranging from 0 to 4. A unipolar composite antidepressant rating of 0 indicates no somatic treatment, and unipolar composite antidepressant ratings of 1 to 4 represent progressively larger doses. We acknowledge that this scale is somewhat coarse. The analyses compare broad classes of treatment intensity and are not meant for inferences regarding differences in effectiveness of two medications or two doses of any one specific medication.
The analyses were conducted in two stages. First, analysis of the propensity for treatment intensity examined characteristics that distinguished among those receiving various levels of somatic antidepressant treatment. A dynamic adaptation of the propensity adjustment for ordinal doses (24) was employed in a mixed-effect ordinal logistic regression model (25); MIXOR software (26) was used for this model. Unipolar composite antidepressant rating was the ordinal dependent variable, and fixed effects included several demographic and clinical variables that were hypothesized to be associated with treatment intensity, such as gender, site, socioeconomic status, age, number of prior affective episodes, and treatment intensity during the most recent prior episode and prior well period. In addition, both symptom severity (mean psychiatric status rating in the 8 weeks before commencing treatment) and trajectory of symptom severity in the 8 weeks before the change in treatment (i.e., whether psychiatric status ratings were increasing, stable, or decreasing) were entered into the model. The significance of each variable was evaluated based on –2 log likelihood difference between models with and without the additional variable. A linear combination of these variables, called the propensity score, was derived on the basis of the results of the logistic model. A subject-specific intercept was included as a random effect to account for within-subject clustering.
Treatment effectiveness analyses were then conducted with a mixed-effect grouped-time survival model (27) of the time from the start of the course of a particular intensity of treatment until recovery from major affective episode; MIXGSUR software (28) was used for these analyses. Survival time represented the "time until recovery," defined as the number of consecutive weeks during which treatment remained at one level of intensity during an affective episode. A survival interval terminated in one of three ways: 1) resolving of an episode, 2) a change in antidepressant treatment intensity, or 3) end of follow-up. The latter two were classified as censored and were assumed to be unrelated to time until recovery. Recovery from an episode was the target "terminal" event that ended a survival interval and was defined according to RDC as 8 consecutive weeks of no more than minimal symptoms. Thus, the survival chronometer started over with each new episode and each change in level of treatment. A subject accumulated additional survival intervals, hereafter referred to as "treatment intervals," with each new episode and each change in treatment intensity while in an episode. The unit of analysis for both the propensity and effectiveness models was treatment interval. A separate propensity score was calculated for each treatment interval.
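The construction of treatment intervals can be sketched as follows. This is a simplified illustration, not the study's data-management code: it assumes one list of weekly treatment intensity ratings (0 to 4) per episode and a flag indicating whether the episode ended in recovery, and it does not apply the 8-week RDC recovery rule itself. The function name and the tuple layout are hypothetical.

```python
def treatment_intervals(weekly_levels, recovered):
    """Split one episode into treatment intervals.

    weekly_levels: treatment intensity (0-4) for each week of the episode.
    recovered: True if the episode ended in recovery after the last week,
               False if observation ended first (end of follow-up).
    Returns a list of (level, weeks, event) tuples, where event is 1 for
    recovery and 0 for a censored interval.
    """
    intervals = []
    start = 0
    for week in range(1, len(weekly_levels) + 1):
        at_end = week == len(weekly_levels)
        # Close the interval at the end of the episode or when the
        # treatment intensity changes.
        if at_end or weekly_levels[week] != weekly_levels[start]:
            event = 1 if (at_end and recovered) else 0
            intervals.append((weekly_levels[start], week - start, event))
            start = week
    return intervals


# Usage: 3 untreated weeks, 2 weeks at level 2, then 4 weeks at level 4
# ending in recovery.
print(treatment_intervals([0, 0, 0, 2, 2, 4, 4, 4, 4], recovered=True))
```

In this example the episode contributes three treatment intervals. The first two are censored by a change in treatment intensity, and only the last ends with the terminal event.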
The treatment effectiveness analyses, which included fixed effects of treatment levels and a random effect for the subject-specific intercepts, were stratified by propensity score quintile, as recommended by Rosenbaum and Rubin (12). Thus, separate effectiveness analyses were conducted for those least likely to get aggressive somatic treatment, those somewhat more likely to get aggressive treatment, and so on. These stratified results were then pooled by using the Mantel-Haenszel procedure (described by Fleiss) after evaluating the appropriateness of combining results across strata. Most important, stratum-specific results cannot be pooled if there is a significant propensity-by-treatment interaction because such an interaction would indicate that treatment effects vary across groups defined by their propensity for treatment. Mixed-effect models were used for both stages of analyses, since many subjects had multiple episodes and multiple treatment intervals within episodes. This approach allowed for within-subject variation in treatment intensity and propensity scores across treatment intervals. A two-tailed alpha level of 0.05 was used for each statistical test. According to the statistical power algorithm from Diggle et al. (30), the group size was sufficient to detect differences in response rates of about 10%–15%, with statistical power of 0.80 and a two-tailed alpha level of 0.05.
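The idea of pooling stratum-specific results can be illustrated with the classical Mantel-Haenszel estimator of a common odds ratio across 2×2 tables. This is a simplification: the analyses described here pooled estimates from mixed-effect survival models rather than raw counts, and the two tables below are hypothetical values invented for illustration.

```python
def mantel_haenszel_odds_ratio(tables):
    """Pooled odds ratio across strata of 2x2 tables.

    Each table is (a, b, c, d):
      a = treated, recovered      b = treated, not recovered
      c = untreated, recovered    d = untreated, not recovered
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


# Hypothetical strata, invented for illustration (not study data).
tables = [
    (40, 40, 6, 14),   # stratum-specific odds ratio = 2.33
    (16, 4, 48, 32),   # stratum-specific odds ratio = 2.67
]
print(round(mantel_haenszel_odds_ratio(tables), 2))
```

The pooled value lies between the two stratum-specific odds ratios, which is the sense in which the procedure summarizes strata whose treatment effects do not differ significantly.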
Demographic and clinical characteristics are presented for the 285 subjects who met criteria for major depressive disorder at intake into the NIMH Collaborative Depression Study and had at least one prospectively observed episode (Table 2). Many of these subjects would likely have been excluded from randomized clinical trials. For instance, 15.4% had a history of serious suicide attempts, and 14.0% (N=40) were over 65 years old during the final treatment interval examined in these analyses. Among these subjects, the number of affective episodes that commenced after intake into the NIMH Collaborative Depression Study ranged from 1 to 18 (mean=3.2, median=2.0, SD=2.9).
The demographic and clinical characteristics of these 285 subjects were compared with the 146 subjects who presented with major depressive disorder at intake into the NIMH Collaborative Depression Study but were excluded from the analyses because they did not have at least two prospectively observed episodes. Those who were included were younger than those excluded (mean=37.2 [SD=14.7] versus 41.3 [SD=15.1] years, respectively) (t=2.40, df=429, p<0.02), and the included group was overrepresented by women (64.2% versus 53.4%) (χ2=4.26, df=1, p<0.04). However, included and excluded subjects did not differ with regard to marital status (χ2=4.42, df=2, p=0.11), site (χ2=4.75, df=4, p=0.31), social class (Mann-Whitney p=0.53), inpatient status (χ2=0.38, df=1, p=0.54), intake Global Assessment Scale score (t=0.45, df=425, p=0.66), or intake Hamilton depression scale score (t=0.31, df=414, p=0.76).
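As an arithmetic check, the chi-square statistic for the gender comparison can be approximately reproduced from the reported percentages. The cell counts used below (183 of 285 included subjects and 78 of 146 excluded subjects being women) are inferred from the rounded percentages and are therefore an assumption, as is the use of the Yates continuity correction, which is not stated in the text.

```python
import math


def yates_chi_square(a, b, c, d):
    """Continuity-corrected chi-square for a 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    corrected = max(abs(a * d - b * c) - n / 2.0, 0.0)
    return n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def chi_square_p_df1(statistic):
    """Upper-tail p value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Counts inferred from the reported percentages (an assumption):
# included: 183 women, 102 men; excluded: 78 women, 68 men.
statistic = yates_chi_square(183, 102, 78, 68)
print(round(statistic, 2), round(chi_square_p_df1(statistic), 3))
```

Under these assumptions the statistic rounds to 4.26 with a p value just under 0.04, in agreement with the reported values.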
Since either a new episode or a change in treatment intensity while in an episode designated a new treatment interval, the number of treatment intervals (mean=11.0 [SD=11.6], median=8.0, range=1–65) almost always exceeded the number of affective episodes for each subject. The propensity and effectiveness analyses included 3,141 observations (i.e., treatment intervals) for these 285 subjects. The median follow-up time was 17 years (mean=14.3, SD=5.4) and ranged from 6 months to 20 years after intake into the NIMH Collaborative Depression Study. The data span from 1978 through 1999.
Propensity for Antidepressant Treatment Intensity
The results of the propensity for treatment intensity model indicate that those who were more severely ill and those who had received more intensive treatment earlier tended to receive more intensive somatic antidepressant therapy (Table 3). For instance, the odds ratios revealed that those with worsening symptoms in the 8 weeks before commencing treatment (i.e., an increasing trajectory for psychiatric status ratings) were 62% more likely to receive higher levels of somatic antidepressant treatment than those whose symptom severity remained stable. Similarly, those with more severe symptoms immediately before treatment commenced were 24% more likely to receive more intensive somatic treatment (i.e., a 24% increase with each additional psychiatric status rating point). Furthermore, those with more prior affective episodes or more intensive treatment in either their prior episode or their prior well interval tended to receive more aggressive treatment during their current affective episode. These results underscore the need to account for various aspects of the course and treatment of affective illness in the effectiveness analyses. Demographic factors were not even marginally significant and thus not included in the model (gender: –2 log likelihood=0.001, df=1, p=0.98; site: –2 log likelihood=4.71, df=4, p=0.32; socioeconomic status: –2 log likelihood=1.50, df=4, p=0.83; age: –2 log likelihood=2.46, df=4, p=0.65).
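The likelihood-ratio tests reported for site, socioeconomic status, and age can be checked by referring each –2 log likelihood difference to a chi-square distribution with the stated degrees of freedom. The sketch below uses only the Python standard library; the function name is hypothetical, and the recurrence it relies on holds for integer degrees of freedom.

```python
import math


def chi_square_upper_tail(statistic, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if statistic <= 0:
        return 1.0
    half = statistic / 2.0
    # Closed-form starting values for df = 2 (even) or df = 1 (odd).
    if df % 2 == 0:
        k, tail = 2, math.exp(-half)
    else:
        k, tail = 1, math.erfc(math.sqrt(half))
    # Recurrence: Q(k + 2) = Q(k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1).
    while k < df:
        tail += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return tail


# -2 log likelihood differences and df as reported in the text.
for label, statistic, df in [("site", 4.71, 4),
                             ("socioeconomic status", 1.50, 4),
                             ("age", 2.46, 4)]:
    print(label, round(chi_square_upper_tail(statistic, df), 2))
```

These three computations return 0.32, 0.83, and 0.65, matching the reported p values.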
After developing a propensity for treatment intensity model, and as a prerequisite to the treatment effectiveness evaluation, we determined whether all levels of treatment intensity were represented in each of the propensity quintiles (Table 4). As expected, those in the lowest propensity for treatment intensity quintile were overrepresented among those receiving lower levels of treatment. Similarly, those in the highest propensity for treatment intensity quintile were disproportionately represented among those receiving high levels of treatment. Nevertheless, because all four levels of treatment were well represented in each of the five quintiles of treatment intensity, the effectiveness evaluation proceeded as described.
Mixed-effect grouped-time survival analyses of time until recovery were used to examine treatment effectiveness. Separate analyses were conducted for each of the propensity quintiles, and the results were then pooled by using the Mantel-Haenszel procedure. (Before pooling the quintile-specific results, one model that included all observations examined the propensity-by-treatment interaction, which was nonsignificant [–2 log likelihood=5.817, df=12, p<0.93]. Thus, pooling of results was indicated.) The pooled results indicated that when treated with higher levels of somatic antidepressant therapy, subjects were nearly twice as likely to recover as those who received no somatic treatment (odds ratio=1.86, 95% CI=1.27–2.72; z=3.17, p=0.002) after we controlled for propensity for treatment intensity. In contrast, neither low levels of antidepressant treatment (odds ratio=0.86, 95% CI=0.55–1.23; z=–0.93, p<0.35) nor moderate levels (odds ratio=1.13, 95% CI=0.79–1.63; z=0.67, p<0.51) were associated with a significant increase in the likelihood of recovery. Furthermore, although higher levels of antidepressant treatment were significantly superior to lower levels, overlapping confidence intervals signified that there was no significant difference between high and moderate levels of antidepressant treatment.
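The relation between a reported odds ratio, its confidence interval, and the accompanying z statistic can be sketched as follows. The calculation assumes a Wald-type interval that is symmetric on the log odds scale, which is an assumption about how the intervals were constructed. Because the inputs are rounded to two decimals, the result is expected to agree with the reported values only approximately.

```python
import math
from statistics import NormalDist


def wald_from_interval(odds_ratio, lower, upper, level=0.95):
    """Recover z and a two-tailed p value from an odds ratio and its CI."""
    normal = NormalDist()
    critical = normal.inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    standard_error = (math.log(upper) - math.log(lower)) / (2.0 * critical)
    z = math.log(odds_ratio) / standard_error
    p = 2.0 * (1.0 - normal.cdf(abs(z)))
    return z, p


# High-intensity treatment versus no somatic treatment, as reported.
z, p = wald_from_interval(1.86, 1.27, 2.72)
print(round(z, 2), round(p, 3))
```

For the high-intensity comparison this yields a z value close to the reported 3.17 and a p value of similar magnitude to the reported 0.002; the small discrepancy is consistent with rounding of the published interval.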
The effectiveness of somatic antidepressant treatment was examined in a longitudinal observational study of subjects who met criteria for unipolar major depressive disorder at intake into the NIMH Collaborative Depression Study. Those who received higher levels of treatment tended to be more ill as measured by more severe symptoms and worsening symptoms. They also had more prior episodes and a history of more aggressive treatment in both their prior episode and prior well interval. Nevertheless, in analyses that controlled for these differences through stratification, those who received higher levels of antidepressant treatment were significantly more likely to recover from a major affective episode than those who received no somatic treatment. In contrast, those receiving lower levels were no more likely to recover than those who were untreated.
This study extends the generalizability of reports from randomized clinical trials in which the baseline level of illness, as well as the dose and duration of pharmacologic interventions, have been carefully controlled. In contrast to subjects in randomized clinical trials, subjects in the NIMH Collaborative Depression Study received a variety of antidepressant medications, both alone and in combination, that were rated on a scale of treatment intensity. Furthermore, unlike most randomized clinical trials, we included elderly subjects, subjects with comorbid medical illnesses, and subjects with a history of serious suicide attempts. Finally, randomized clinical trials typically evaluate the efficacy of a medication relative to placebo or another active agent. In this observational study, a substantial proportion of depressive episodes received no somatic treatment (30%, N=946 of 3,141 [Table 4]). Accordingly, we have compared the effectiveness of various intensities of somatic antidepressant treatments to no somatic treatment, allowing us to remove much of the "package of placebo effects" (32) from the efficacy estimates that are reported in placebo-controlled randomized clinical trials.
The analyses presented here proceeded in two stages. Initially, we used a propensity for treatment intensity model to examine differences among patients who received various intensities of antidepressants. Then, after we controlled for those differences through stratification, treatment effectiveness analyses were conducted. In standard covariate-adjusted analyses of treatment effectiveness, it would have been unwieldy, at best, to verify the representativeness of the treatment levels across the hundreds of combinations of levels of these five covariates. However, using the propensity approach of Rosenbaum and Rubin (12–15), we verified that each treatment level was well represented within each propensity quintile. Most important, beneficial effects of higher doses of somatic antidepressant therapy were detected in this observational study. Furthermore, because a mixed-model approach was used, multiple episodes within-subject and multiple treatment intervals within-episode were included in the analyses, and the analyses accounted for the varying duration of both episodes and treatment intervals.
There are several limitations of this observational study. First, although the propensity adjustment reduces bias associated with variables in the propensity model, other sources of bias can remain. In fact, the propensity adjustment removed or greatly reduced treatment group differences on all of the propensity components (data not shown). Second, the treatment intensity data are based on Longitudinal Interval Follow-Up Evaluation interviews. Although this was verified with clinical records whenever possible, availability and quality of records were highly variable. Moreover, we do not have blood levels to confirm the treatment data. Third, treatment intensity is defined on a composite antidepressant scale. We acknowledge that this scale has broad classes of treatment intensity, based on consensus judgment among clinical researchers. Fourth, the scale does not include other psychotropic medications, such as neuroleptics, or psychotherapy, which for that reason have been ignored in these analyses. Fifth, the analyses did not examine side effects or toxicity of antidepressants because such data were not available.
Finally, the analyses focused on recurrent affective episodes and did not include the intake depressive episode. This was done for a variety of reasons. All subjects were recruited into the study when seeking treatment. In these analyses, we sought to compare a wide range of antidepressant treatment levels, including no somatic treatment. Furthermore, recruitment into the NIMH Collaborative Depression Study took place at varying points in the course of the subjects’ episodes, not strictly as the episode commenced. Thus, the results that are reported are based on all prospectively observed major affective episodes that began after intake into the NIMH Collaborative Depression Study. This allowed the propensity for treatment intensity model to include comprehensive information on treatment in prior well intervals and prior depressive episodes. It also permitted us to examine treatment effectiveness in a context that most closely mirrors community practice not influenced by clinical research, since the first prospective episode of depression occurred on average 20 months (median) after remission of the intake episode.
In conclusion, this study provides evidence of the effectiveness of higher levels of somatic antidepressant therapy in a more inclusive group of subjects than is generally included in a randomized clinical trial. These findings indicate that clinicians should try to administer higher antidepressant doses and work with patients to overcome obstacles such as side effects, financial costs, and lack of motivation. The results from this observational study extend the generalizability of reports from randomized clinical trials of antidepressants to a wider, more representative group of individuals who suffer from major depressive disorder.
Clinical studies for the National Institute of Mental Health Collaborative Program on the Psychobiology of Depression were conducted with the participation of the following investigators: M.B. Keller, M.D. (Chairperson, Providence); W. Coryell, M.D. (Co-Chairperson, Iowa City); T.I. Mueller, M.D., D.A. Solomon, M.D. (Providence); J. Fawcett, M.D., W.A. Scheftner, M.D. (Chicago); W. Coryell, M.D., J. Haley (Iowa City); J. Endicott, Ph.D., A.C. Leon, Ph.D., J. Loth, M.S.W. (New York); J. Rice, Ph.D., T. Reich, M.D. (St. Louis). Other contributors include H.S. Akiskal, M.D.; N.C. Andreasen, M.D., Ph.D.; P.J. Clayton, M.D.; J. Croughan, M.D.; R.M.A. Hirschfeld, M.D.; L. Judd, M.D.; M.M. Katz, Ph.D.; P.W. Lavori, Ph.D.; J.D. Maser, Ph.D.; M.T. Shea, Ph.D.; R.L. Spitzer, M.D.; and M.A. Young, Ph.D. Deceased: G.L. Klerman, M.D.; E. Robins, M.D.; R.W. Shapiro, M.D.; and G. Winokur, M.D.
Received Oct. 2, 2001; revisions received Aug. 14 and Nov. 13, 2002; accepted Dec. 2, 2002. From the NIMH Collaborative Program on the Psychobiology of Depression. Address reprint requests to Dr. Leon, Department of Psychiatry–Box 140, Weill Medical College of Cornell University, 525 East 68th St., New York, NY 10021; firstname.lastname@example.org (e-mail). Supported in part by NIMH grant MH-60447 (Dr. Leon). This manuscript has been reviewed by the Publication Committee of the NIMH Collaborative Depression Study and has its endorsement.