Limitations in the current design and analysis of randomized controlled trials (RCTs) have raised concerns that trials suffer from both impaired accuracy and reduced efficiency, with the result that more patients are enrolled than are necessary [1]. An important source of this problem is randomization “bias,” which creates baseline differences between the intervention and control groups. When this occurs, any observed differences in outcome may be attributable in part to randomization bias rather than to the effects of the new drug, treatment modality, or device. This concern may be especially salient for trials of psychological or psychiatric disorders, which are well known to be highly sensitive to such issues as differential placebo responses, prior treatment history, or treatment context [2]. In this paper, we propose and illustrate an approach to classifying a patient’s risk for an adverse outcome, the composite clinical score, that specifically addresses randomization bias.

We also discuss challenges that the Food and Drug Administration (FDA) highlighted in two recent publications and how a composite clinical score, along with an expanded set of relevant features, can be used to address those concerns. The first publication, “Adjusting for Covariates in Randomized Clinical Trials for Drugs and Biological Products: Guidance for Industry,” was issued in May 2023 [1]. The second, “Addressing the Challenge of Common Chronic Diseases: A View from the FDA,” published in February 2024, acknowledged the growing frequency of common chronic diseases and the need for better ways to demonstrate the benefits of treatment in diseases responsible for increasing morbidity and mortality in the US population [3].

These issues have become increasingly prominent for several reasons. First, there is a surprising neglect of nuanced clinical features among patients included in RCTs [4]. Distinctions in comorbidity, disease severity, co-therapies, patient expectations, and geographic variability in treatment patterns receive insufficient attention [5]. At the same time that all this heterogeneity is ignored, investigators are often testing new therapies in search of ever smaller incremental benefits over already effective treatments. It is not unusual today to see trials designed with an effect size of 0.9 in the setting of relatively low outcome rates. In these circumstances, understanding and accounting for clinical heterogeneity is critical to avoid missing treatment benefits that exist or exaggerating treatment effects that do not. In the following sections, we describe the problem of randomization bias and the role of composite clinical scores in mitigating its effects, and we consider both multi-morbidity and the impact of patient and context heterogeneity on the estimation of treatment effects.

Clinical trialists often suggest that RCTs are the “gold standard” for evaluating the potential benefits of new or existing medicines and devices [6]. Clinicians know that these claims are exaggerated [7]. Substantial heterogeneity of the participants in RCTs, and the considerable heterogeneity of treatment effects, have always made the average treatment effect estimated in trials less applicable to the care of individual patients than to drug approval for the population [8, 9].

Consequently, clinicians frequently request subgroup analyses to assess treatment effects in groups of patients that more closely resemble the patients they see in practice. However, when these analyses are carried out, they too often rely on one-variable-at-a-time analyses of baseline features such as age, sex, race, and a limited number of clinical characteristics that may occur at different rates in the intervention and control groups.

Less appreciated by many is the possibility that even the average treatment effect may not accurately estimate the true treatment benefit. This discrepancy arises from the extensive heterogeneity of the study population for any particular disease, which creates the possibility of baseline differences between groups that randomization may reduce but not eliminate. Ironically, clinical trials have been preferred over other designs specifically because they rely on randomization, and yet this very reliance becomes a vulnerability when prior knowledge of risks for bad outcomes is excluded from the design even though it contributes to the results [10].

It is this problem that we term “randomization bias” and that the FDA has recently sought to address with its refreshed guidance on covariates in clinical trial design and analysis. This guidance affirms the use of stratification in the design of the trial using covariates – if they can be measured – known or suspected to be important in affecting risk for the outcome. The guidance further recommends that investigators specify in advance of the trial the risk factors (covariates) and analyses that would reduce “randomization bias.” The guidance explicitly suggests that employing predictive covariates in the design of trials would improve the accuracy of the average treatment effect while simultaneously making the trial more efficient (reducing the number of patients required).

Although balancing randomization in design with selected clinical and related covariates is desirable, recent advances in machine learning make it possible to develop composite clinical scores that are more accurate and efficient. In the next sections, we describe the composite clinical score and illustrate its advantage over current methods.

If the selection of covariates in the design and analysis of a clinical trial is so fundamental to its validity (truthfulness), efficiency, and clinical appositeness, how should these features be identified and how should they be employed in design and analysis? The traditional method uses prior knowledge of risk factors to select clinical features for stratified randomization that achieves a better baseline equivalence between compared treatment arms [11]. Those same clinical features can also be used in analysis by forming subgroups that divide the trial population into groups (e.g., by sex, age, race/ethnicity) and examine differences in relative treatment effects across the subgroups. This method is inadequate for several reasons. One frequent criticism of this approach refers to its low statistical power or failure to account for multiple comparisons [12].
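The multiple-comparisons criticism can be made concrete with a short calculation. The sketch below assumes that each subgroup test is independent, is run at the same significance level, and that no true effect exists in any subgroup; these simplifications and the function name are ours for illustration and are not taken from the cited critique.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false-positive result among
    n_tests independent tests, each run at level alpha under the null."""
    if n_tests < 0:
        raise ValueError("n_tests must be non-negative")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return 1.0 - (1.0 - alpha) ** n_tests


# Illustrative numbers of subgroup analyses (hypothetical).
for k in (1, 5, 10, 20):
    rate = familywise_error_rate(k)
    print(f"{k:>2} subgroup tests: P(at least one false positive) = {rate:.2f}")
```

Under these assumptions, ten subgroup tests at the 0.05 level carry roughly a 40% chance of at least one spurious finding, which is one reason that testing covariates one at a time scales poorly.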

A more fundamental concern is that patients we see in clinical practice rarely have a single relevant feature that influences their risk of an adverse outcome. Rather, they have multiple risk factors for outcomes all occurring at the same time, such as male sex, old age, comorbid medical conditions, low adherence to medication, social or economic adversity, physical or psychological stress, history of prior treatments and settings in which treatments are delivered, etc. [13]. In addition, the features we commonly select as covariates are often a small number of the many features that contribute to risk for adverse outcomes. What is needed is a method that better reflects the clinical circumstances in which patients are cared for and assessed for risk, and that identifies as many relevant features as possible.

Estimating the susceptibility of an individual to an adverse outcome – risk prediction – is central to clinical decision-making. Risk prediction tools have been widely used in clinical medicine, and an early index, the Apgar Score, is often cited as a paradigmatic example. The Apgar score, introduced by Virginia Apgar to assess the clinical condition of a newborn baby, was notable because it did not adhere to the typical psychometric criteria that emphasize homogeneity of the components of the scale. Just as the Apgar score contained different features (color, heart rate, respirations, reflexes, and muscle tone), so too do clinical predictive scores for poor outcomes require combining heterogeneous components (clinimetric scales) [14].

Currently, clinical risk prediction for adverse disease outcomes typically relies on basic demographic characteristics, such as age, gender, and ethnicity; basic lifestyle factors, such as body mass index, smoking status, alcohol consumption, and physical exercise habits; and measurement of clinical characteristics collected at baseline in trials, such as blood chemistries or biomarkers indicative of disease severity. Applying new developments in data availability and analytical tools makes it possible to extend the scope of features included in the predictive scores and to provide improved guidance on the where, what, and how to use clinical prediction models to improve the design and analysis of RCTs and other forms of clinical research.

In brief, composite clinical risk scores are most commonly calculated as a weighted sum of an individual patient’s clinical features, comorbid disease, and other pertinent characteristics. While the ultimate goal of a composite risk score may be balancing the risk of adverse outcomes across treatment arms in an RCT, an additional and important goal is the identification of a subset of individuals at elevated risk of adverse disease outcomes.
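The weighted-sum construction described above can be sketched in a few lines. The feature names, weights, and intercept below are hypothetical values chosen only for illustration; they are not the SCORS coefficients or those of any published model, and the logistic link used to turn the score into a probability is likewise an assumption.

```python
import math
from typing import Mapping

# Hypothetical weights for illustration only (NOT SCORS coefficients).
EXAMPLE_WEIGHTS = {
    "age_over_65": 0.9,
    "male_sex": 0.3,
    "diabetes": 0.5,
    "chronic_lung_disease": 0.7,
}
EXAMPLE_INTERCEPT = -3.0  # hypothetical baseline log-odds


def composite_score(features: Mapping[str, float],
                    weights: Mapping[str, float] = EXAMPLE_WEIGHTS) -> float:
    """Weighted sum of a patient's features; absent features count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def predicted_risk(features: Mapping[str, float],
                   weights: Mapping[str, float] = EXAMPLE_WEIGHTS,
                   intercept: float = EXAMPLE_INTERCEPT) -> float:
    """Map the composite score to a probability with a logistic link."""
    log_odds = intercept + composite_score(features, weights)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Two hypothetical patients with different numbers of risk factors.
lower_risk = {"male_sex": 1}
higher_risk = {"age_over_65": 1, "male_sex": 1, "diabetes": 1,
               "chronic_lung_disease": 1}

for label, patient in (("lower", lower_risk), ("higher", higher_risk)):
    print(label, round(composite_score(patient), 2),
          round(predicted_risk(patient), 3))
```

In practice, the weights would be estimated from data rather than set by hand, and a single score of this kind is what allows many heterogeneous features to be balanced across arms at once.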

Let us consider an example of how the composite clinical risk score is developed and employed. Although we might ideally have identified data for a commonly studied psychiatric intervention, replete as they are with sources of heterogeneity, we were limited for the following demonstration to a high-profile clinical problem: COVID-19. The outcome of infection with the coronavirus is known to vary according to several accepted risk factors, including age, sex, and certain frequently cited comorbid diseases (e.g., asthma and obesity) [15]. The clinical prediction scores that exist are not especially good at separating high-risk from low-risk patients. To address this shortcoming, a new Severe COVID Risk Score (SCORS) was developed using data from the Healthjump electronic medical records system, which comprised records of 1 million patients with confirmed diagnoses of COVID-19.

A secondary dataset including 211,000 new patient records from the Healthjump electronic medical records system was selected to validate the SCORS. This dataset served to simulate clinical trials and test the SCORS against traditional patient allocation methods [16].

A multidisciplinary team, consisting of clinicians, data scientists, and statisticians, conducted a rigorous analysis of a vast array of patient variables related to disease trajectory and patient outcomes after COVID infection. Variables analyzed included demographic details, clinical histories, healthcare utilization patterns, and prior health outcomes. In addition, the team tested thousands of variables in an unsupervised fashion to identify unexpected predictive factors.

From this analysis, a subset of features with substantial predictive value for severe COVID-19 outcomes, such as respiratory failure, ventilator use, or ICU admission, was selected and engineered for inclusion in the SCORS model. The final model’s performance, as indicated by an area under the receiver operating characteristic curve of 0.92, demonstrated high predictive accuracy (a full description of the SCORS and its development is found in [16]).
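For readers less familiar with the area under the receiver operating characteristic curve, it can be read as the probability that a randomly chosen patient who had the outcome received a higher score than a randomly chosen patient who did not. The sketch below computes it directly from that definition on a small invented data set; the scores and outcomes are hypothetical and are unrelated to the SCORS data.

```python
from typing import Sequence


def auc(scores: Sequence[float], outcomes: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Counts the share of (event, non-event) pairs in which the event
    has the higher score; ties count as half.
    """
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    non_events = [s for s, y in zip(scores, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Hypothetical predicted risks and observed outcomes (1 = severe outcome).
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
print(auc(toy_scores, toy_outcomes))
```

A value of 0.5 corresponds to a score that discriminates no better than chance, and 1.0 to perfect separation of patients with and without the outcome.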

A total of 100,000 simulated trials were conducted, each consisting of a random sample of 1,000 patient events drawn from the secondary dataset. These simulations were bifurcated into two distinct sets: one in which trial arms were balanced using standard COVID covariates (age, gender, race, vaccination status, and diabetes status) and another where SCORS was the principal criterion for patient allocation.
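A much smaller version of this kind of simulation can illustrate why allocating on a risk score tightens baseline balance. The sketch below is not the procedure used for SCORS: the patient risks are synthetic draws from a beta distribution, the score-based allocation is a simple sort-and-block scheme that we assume for illustration, and the number of trials is reduced so that it runs quickly.

```python
import random
from statistics import mean


def simple_randomization(scores, rng):
    """Shuffle patients and split them in half, ignoring the risk score."""
    shuffled = scores[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def score_stratified_randomization(scores, rng, block_size=4):
    """Sort by risk score, then randomize 1:1 within consecutive
    blocks of patients with similar scores (assumed scheme)."""
    treatment, control = [], []
    ordered = sorted(scores)
    for start in range(0, len(ordered), block_size):
        block = ordered[start:start + block_size]
        rng.shuffle(block)
        half = len(block) // 2
        treatment.extend(block[:half])
        control.extend(block[half:])
    return treatment, control


def mean_imbalance(allocate, n_trials=200, n_patients=1000, seed=0):
    """Average absolute between-arm difference in mean baseline risk."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_trials):
        # Synthetic baseline risks; not drawn from any real data set.
        scores = [rng.betavariate(2, 8) for _ in range(n_patients)]
        treated, controls = allocate(scores, rng)
        gaps.append(abs(mean(treated) - mean(controls)))
    return mean(gaps)


print("simple:    ", mean_imbalance(simple_randomization))
print("stratified:", mean_imbalance(score_stratified_randomization))
```

Because patients within a block have nearly identical scores, the between-arm difference in mean baseline risk is far smaller under the stratified scheme than under simple randomization in this toy setting.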

To ensure the validity of the comparisons drawn from these simulations, each trial arm was meticulously balanced. Consistency was maintained across key variables for trials using standard covariates, while in SCORS-balanced trials, the risk score itself achieved balance across arms in the trial. As validated by the simulation and using standard statistical methods for trial sizing, adjusting for SCORS increased the accuracy of trials and reduced the number of patients required to show a statistically significant benefit by 19% across treatment effects. It is estimated that this strategy can increase trial efficiency by more than 20% as investigators incorporate predictive factors beyond those recorded in medical records into composite risk scores.
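One way to see how a prognostic score can shrink the required sample is the textbook approximation for a continuous outcome analyzed with covariate adjustment, in which the residual variance, and hence the sample size, falls by a factor of one minus the squared correlation between covariate and outcome. The sketch below applies that approximation; it is our illustration under stated assumptions (continuous outcome, two equal arms, two-sided test) and is not the sizing method used in the SCORS simulations, whose outcomes were binary events. The effect size, standard deviation, and R-squared values are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_arm(delta: float, sigma: float, alpha: float = 0.05,
              power: float = 0.80, r_squared: float = 0.0) -> int:
    """Patients per arm for a two-arm comparison of means.

    r_squared is the share of outcome variance explained by the
    baseline covariate (0 means no adjustment).
    """
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    residual_variance = sigma ** 2 * (1 - r_squared)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * residual_variance
                     / delta ** 2)


# Hypothetical design: detect a 0.2 SD difference with 80% power.
unadjusted = n_per_arm(delta=0.2, sigma=1.0)
adjusted = n_per_arm(delta=0.2, sigma=1.0, r_squared=0.19)
print(unadjusted, adjusted, round(1 - adjusted / unadjusted, 3))
```

Under this approximation, a covariate that explains about 19% of the outcome variance would yield a reduction in sample size of about the same magnitude, which gives a sense of how predictive a score must be to deliver savings of the size reported above.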

The FDA properly called out the underappreciated impact of chronic common diseases on the health of the population. As they point out in their essay, 7 of the 9 top causes of death in the USA are related to chronic disease, and the two that are not, accidental deaths and deaths from COVID-19, are affected by comorbidity [3].

The FDA viewpoint goes on to suggest several ways that RCTs of chronic conditions could be altered to strengthen trial results. One strategy involves transforming evidence-generation methods, including designs that follow patients for longer periods, account for lower outcome rates, and broaden inclusion criteria. The FDA also encourages the development of new biomarkers and surrogate endpoints to overcome the challenge that many candidate therapies with promising results in phase 2 trials are not found effective in phase 3.

A major unaddressed consideration is that we are challenged not just by chronic diseases as comorbidity but as multi-morbidity. All the FDA suggestions treat chronic diseases in isolation when it is more likely that they occur together. In a study of 21 family practices in the Saguenay region, Quebec, the prevalence of multi-morbidity was 69% in 18–44 year olds, 93% in 45–64 year olds, and 98% in those aged over 65, and the average number of chronic conditions varied from 2.8 in the youngest to 6.4 in the oldest [17].

Yet most research and clinical practice, and indeed the discussion from the FDA on common chronic diseases, is still based on a disease paradigm that is inappropriate for patients with complex and overlapping health problems. Classic clinical trials emphasize efficacy at the expense of effectiveness. In doing so, and by ignoring the reality that most patients with these diseases have multi-morbidity, clinical trials compromise the transportability (external validity) of the trial. A composite clinical score can help to address this problem of multi-morbidity by including the full range of comorbid conditions in the development of the clinical index.

There is another crucial consideration in the application of RCTs to clinical practice that is also best managed by improved methods to predict clinical outcomes in patients receiving the new treatment and the controls. Physicians commonly assess patients’ risk for adverse outcomes. For example, after a myocardial infarction, patients with persistent ischemia, or heart failure, or electrical instability, or some combination of features, may have a substantially higher rate of death than low-risk patients without one or more of these features.

Even when the relative treatment benefit (effect size) is the same across these risk groups, the absolute benefit of treatment is greatest in the patients at highest risk. Physicians intuitively understand this heterogeneity in risk and often make clinical decisions weighing the absolute treatment benefit, not the relative risk benefit. A well-designed trial using a composite clinical score can enable subgroup analyses that accurately measure differences in absolute treatment benefit even when the relative treatment effect is homogeneous across risk groups.
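A brief worked example shows how a constant relative effect translates into very different absolute benefits. The baseline risks and the relative risk of 0.8 used below are hypothetical numbers chosen for illustration, not estimates from any trial.

```python
def absolute_risk_reduction(control_risk: float, relative_risk: float) -> float:
    """Absolute risk reduction implied by a relative risk applied
    to a given control-group (baseline) risk."""
    if not 0.0 <= control_risk <= 1.0:
        raise ValueError("control_risk must lie between 0 and 1")
    return control_risk * (1.0 - relative_risk)


def number_needed_to_treat(control_risk: float, relative_risk: float) -> float:
    """Patients who must be treated to prevent one adverse outcome."""
    arr = absolute_risk_reduction(control_risk, relative_risk)
    if arr <= 0:
        raise ValueError("treatment must reduce risk to compute an NNT")
    return 1.0 / arr


# Same hypothetical relative risk in both groups, different baseline risk.
RELATIVE_RISK = 0.8
for label, baseline in (("low-risk", 0.05), ("high-risk", 0.30)):
    arr = absolute_risk_reduction(baseline, RELATIVE_RISK)
    nnt = number_needed_to_treat(baseline, RELATIVE_RISK)
    print(f"{label}: ARR = {arr:.3f}, NNT = {nnt:.1f}")
```

With these numbers, about 100 low-risk patients would need to be treated to prevent one event, compared with about 17 high-risk patients, even though the relative effect is identical in both groups.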

All of these methodological efforts will be ineffective without greater consideration of the clinical context in which RCTs are carried out. Interventions in patients with multiple chronic diseases, for instance, need to accurately identify the small proportion of patients who have the worst outcomes and leverage that knowledge, both in the design and analysis of trials.

It is also crucial that we recognize the complexity introduced by placebo effects and how the results of trials may be affected by patient expectations, prior treatment responses, treatment setting, etc. Substantial evidence from many trials documents the variability in outcomes in placebo trial arms. Fava and colleagues have written elegantly about the need for a deeper appreciation of the role of placebos. They point out that the prevailing conceptualization consists of an undifferentiated placebo response that needs to be minimized in controlled investigations and yet maximized in clinical practice. They also note that treatment outcomes are the cumulative result of the interaction of several classes of variables with a selected treatment: “living conditions (housing, nutrition, work environment, social support), patient characteristics (age, sex, genetics, general health conditions, personality, well-being), illness features and previous therapeutic experience, self-management, and treatment setting (physician’s attitude and attention, illness behavior)” [2].

A renewed attention to placebo effects and the influence of social context is needed especially as we focus increasingly on small treatment effects studied in highly diverse settings. Many trials report considerable variation in placebo response rates in different geographies (e.g., Europe, Asia, North and South America). Some of this variation may be attributable to differences in culture, or healthcare systems, or patient expectations [18].

Indeed, we endorse the view of Fava et al. [2] who wrote, “We call for a different conception of clinical trial whose aim is not the simple demonstration of statistical superiority of a treatment compared to control in a short-term study, but also the appraisal of differential interaction effects of multiple ingredients. The data that may originate from this type of clinical trials are closer to offer what a clinician needs for assessing treatment options in the individual case.” Regrettably, of course, much of the potentially useful data for imputing risk may not be readily available in administrative datasets such as claims data or EHRs and may require direct attention to data collection during trial design.

Many of the issues discussed in this paper are especially prominent in psychotherapy research. An international group of researchers has addressed many methodological considerations in trials and urged a greater emphasis on clinical measurement and use of active control groups. Adoption of their recommendations will substantially improve the design and reliability of results in psychotherapy trials [19].

In conclusion, and consistent with FDA guidance on incorporating predictive covariates into trial design, RCTs with heterogeneous populations can benefit from composite risk scores, which can reduce inherent biases in patient risk characteristics resulting from unbalanced randomization and which promise more dependable and accurate estimates of the average treatment effect. Use of composite risk scores can also accommodate the increasing frequency of multi-morbidity and enable a more robust approach to assessing how outcome risk affects the variation in absolute benefits of treatment, which are of most utility to decision-making in clinical practice. Applying these advances in the design and analysis of RCTs should become standard for anyone designing, analyzing, or applying the results of trials.

The data, technology, and services used in the generation of these research findings were generously supplied pro bono by the COVID-19 Research Database partners, who are acknowledged at https://covid19researchdatabase.org/.

MRC is a Senior Scientific Advisor to Clinica AI, and he separately holds paid consultant positions with the Bill and Melinda Gates Foundation and Pfizer, Inc.

This research was unfunded and conducted at Clinica AI as part of a mission to enhance patient care through predictive analytics.

Ralph I. Horwitz, James B. Baker, Arnab Ghatak, and Mark R. Cullen had access to the data and a significant role in writing and editing the manuscript.

1. Guidance Document: Adjusting for Covariates in Randomized Clinical Trials for Drugs and Biological Products. FDA-2019-D-0934. 2023.
2. Fava GA, Guidi J, Rafanelli C, Rickels K. The clinical inadequacy of the placebo model and the development of an alternative conceptual framework. Psychother Psychosom. 2017;86(6):332–40.
3. Warraich HJ, Marston HD, Califf RM. Addressing the challenge of common chronic diseases: a view from the FDA. N Engl J Med. 2024;390(6):490–2.
4. Fava GA. Forty years of clinimetrics. Psychother Psychosom. 2022;91(1):1–7.
5. Charlson ME, Wells MT. Comorbidity: from a confounder in longitudinal clinical research to the main issue in population management. Psychother Psychosom. 2022;91(3):145–51.
6. Bothwell LE, Greene JA, Podolsky SH, Jones DS. Assessing the gold standard--lessons from the history of RCTs. N Engl J Med. 2016;374(22):2175–81.
7. Horwitz RI. The dark side of evidence-based medicine. Cleve Clin J Med. 1996;63(6):320–3.
8. Hill AB. Reflections on controlled trial. Ann Rheum Dis. 1966;25(2):107–13.
9. Tukey JW. Some thoughts on clinical trials, especially problems of multiplicity. Science. 1977;198(4318):679–84.
10. Deaton A, Cartwright N. Understanding and misunderstanding randomized controlled trials. Soc Sci Med. 2018;210:2–21.
11. Kernan WN, Viscoli CM, Makuch RW, Brass LM, Horwitz RI. Stratified randomization for clinical trials. J Clin Epidemiol. 1999;52(1):19–26.
12. Krzywinski M, Altman N. Importance of being uncertain. Nat Methods. 2013;10(9):809–10.
13. Fava GA, Tomba E, Sonino N. Clinimetrics: the science of clinical measurements. Int J Clin Pract. 2012;66(1):11–5.
14. Apgar V. A proposal for a new method of evaluation of the newborn infant. Curr Res Anest Analg. 1953;32(1):260–7.
15. Wolff D, Nee S, Hickey NS, Marschollek M. Risk factors for Covid-19 severity and fatality: a structured literature review. Infection. 2021;49(1):15–28.
16. Baker JB, Ghatak A, Cullen MR, Horwitz RI. Development of a novel clinical risk score for COVID-19 infections. Am J Med. 2023;136(12):1169–78.e7.
17. Fortin M, Bravo G, Hudon C, Vanasse A, Lapointe L. Prevalence of multimorbidity among adults seen in family practice. Ann Fam Med. 2005;3(3):223–8.
18. Vieta E, Pappadopulos E, Mandel FS, Lombardo I. Impact of geographical and cultural factors on clinical trials in acute mania: lessons from a ziprasidone and haloperidol placebo-controlled study. Int J Neuropsychopharmacol. 2011;14(8):1017–27.
19. Guidi J, Brakemeier EL, Bockting CLH, Cosci F, Cuijpers P, Jarrett RB, et al. Methodological recommendations for trials of psychological interventions. Psychother Psychosom. 2018;87(5):276–84.