Subgroup identification searches for groups of patients that have an especially positive response to the drug of interest, and as a result can be useful for physicians when making more personalized prescribing decisions.

What is subgroup identification and why use it?

Subgroup identification is an exploratory method often applied to clinical data. Clinical trials usually report the average effectiveness of a drug, but this can be inadequate when drug efficacy varies widely from patient to patient. Genetic makeup, age, and medical history can all influence how a patient responds to treatment. A question that then arises is, “Which group of patients has the best response?” Subgroup identification can help answer this question.

If successful, subgroup identification will help direct the drugs to patients who will benefit the most and help physicians make personalized prescribing decisions for each patient.

Despite the focus on clinical trials, subgroup identification can be adapted to other domains. For example, a pharmaceutical company may be testing new promotion strategies for its drugs. Instead of evaluating average effectiveness of a particular promotion strategy, it could be useful to identify groups of physicians who are most receptive to the promotion. This could lead to targeted promotions tailored to each physician.

These examples underscore how averages can mislead. Averages are a convenient way to summarize data, but they can also conceal important information. In clinical trials, an average can miss vital details if treatment outcomes follow a pattern like the one shown in the picture below. Even if one group of patients responds very positively to treatment (shown in red), the overall average would not reveal it. Subgroup identification aims to find such subgroups with better-than-average outcomes.
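A toy calculation makes the point concrete. The numbers below are invented for illustration, and the "carrier" flag is a hypothetical patient characteristic, not something taken from a real trial.

```python
from statistics import mean

# Hypothetical change in a health score (higher is better) for 10 patients.
# "carrier" is an invented covariate used only to define a subgroup.
patients = [
    {"carrier": True, "response": 8.0},
    {"carrier": True, "response": 10.0},
    {"carrier": False, "response": -2.0},
    {"carrier": False, "response": -3.0},
    {"carrier": False, "response": -1.0},
    {"carrier": False, "response": -2.0},
    {"carrier": False, "response": -3.0},
    {"carrier": False, "response": -2.0},
    {"carrier": False, "response": -3.0},
    {"carrier": False, "response": -2.0},
]

# Average over everyone versus averages within each group.
overall = mean(p["response"] for p in patients)
carriers_mean = mean(p["response"] for p in patients if p["carrier"])
others_mean = mean(p["response"] for p in patients if not p["carrier"])

print(f"overall: {overall:.2f}")
print(f"carriers: {carriers_mean:.2f}")
print(f"non-carriers: {others_mean:.2f}")
```

The overall average is zero, which would suggest no benefit, yet the two carriers improve by 9 points on average while everyone else gets slightly worse.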

What are some examples of subgroup identification methods?

In the past decade, several methods have been developed to identify subgroups with above-average outcomes. Although these methods can be generalized, this article will focus on applications in clinical trials. Some examples are tree-based methods like Virtual Twins, GUIDE, and PRIM. Decision trees are well suited to subgroup analysis because of their built-in ability to partition the population space. But whereas traditional decision trees maximize the uniformity of outcomes within each grouping, these methods maximize the average response to treatment in each grouping.

An added complexity that these methods can handle is the presence of treatment and control arms. To accurately measure the true efficacy of a drug, clinical studies are often designed with a treatment arm and a control arm. This, however, means that a simple search for subgroups with above-average outcomes is not enough. Identifying a valid subgroup now requires comparing outcomes in the treatment arm with outcomes in the control arm within that subgroup. More specifically, the treatment arm should have much better outcomes than the control arm. This ensures that the subgroup consists of patients who are truly receptive to the treatment.
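The comparison described above can be sketched as a small search. This is a minimal illustration under stated assumptions, not the algorithm used by Virtual Twins, GUIDE, or PRIM: the data are invented, `treatment_effect` and `best_threshold` are hypothetical helper names, and the search considers only a single cut point on one covariate.

```python
from statistics import mean

def treatment_effect(rows):
    """Mean outcome in the treatment arm minus mean outcome in the control arm."""
    treated = [r["outcome"] for r in rows if r["treated"]]
    control = [r["outcome"] for r in rows if not r["treated"]]
    if not treated or not control:
        return None  # the effect is undefined without both arms
    return mean(treated) - mean(control)

def best_threshold(rows, covariate, min_per_arm=2):
    """Scan cut points and return (cut, effect) for the subgroup
    `covariate >= cut` with the largest treatment-minus-control difference."""
    best = None
    for cut in sorted({r[covariate] for r in rows}):
        subgroup = [r for r in rows if r[covariate] >= cut]
        n_treated = sum(r["treated"] for r in subgroup)
        n_control = len(subgroup) - n_treated
        if min(n_treated, n_control) < min_per_arm:
            continue  # too few patients in one arm to compare
        effect = treatment_effect(subgroup)
        if best is None or effect > best[1]:
            best = (cut, effect)
    return best

# Invented data: (age, received treatment?, outcome).
raw = [
    (30, True, 5), (30, False, 5), (40, True, 6), (40, False, 5),
    (50, True, 5), (50, False, 6), (60, True, 12), (60, False, 5),
    (70, True, 11), (70, False, 6), (80, True, 10), (80, False, 6),
]
rows = [{"age": a, "treated": t, "outcome": y} for a, t, y in raw]

overall_effect = treatment_effect(rows)
cut, subgroup_effect = best_threshold(rows, "age")
print(f"overall effect: {overall_effect:.2f}")
print(f"best subgroup: age >= {cut}, effect {subgroup_effect:.2f}")
```

On this toy data the subgroup of patients aged 60 and over shows roughly twice the overall treatment effect. Note that outcomes in the control arm are about the same at every age, so a search on treated outcomes alone would not distinguish a drug effect from an age effect.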

GUIDE and PRIM couple the decision tree algorithm with hypothesis tests to make the search for subgroups more efficient and less biased. Hypothesis tests help with the search for patient characteristics that define the optimal subgroups. For example, a subgroup with a common genotype may have the biggest contrast between the treatment and control arms.
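To give a feel for how a hypothesis test can flag a characteristic worth splitting on, here is a small exact permutation test. This is an assumption-laden stand-in and not the test that GUIDE or PRIM actually uses; the data, the carrier labels, and the function names are all hypothetical.

```python
from itertools import combinations
from statistics import mean

def interaction(treated, control, t_idx, c_idx):
    """Treatment effect among carriers minus treatment effect among non-carriers.
    t_idx and c_idx hold the positions of carriers within each arm."""
    t_in = [y for i, y in enumerate(treated) if i in t_idx]
    t_out = [y for i, y in enumerate(treated) if i not in t_idx]
    c_in = [y for i, y in enumerate(control) if i in c_idx]
    c_out = [y for i, y in enumerate(control) if i not in c_idx]
    return (mean(t_in) - mean(c_in)) - (mean(t_out) - mean(c_out))

def permutation_p_value(treated, control, t_idx, c_idx):
    """One-sided exact p-value: the share of carrier relabelings (within each
    arm) whose interaction is at least as large as the observed one."""
    observed = interaction(treated, control, t_idx, c_idx)
    stats = [
        interaction(treated, control, set(t), set(c))
        for t in combinations(range(len(treated)), len(t_idx))
        for c in combinations(range(len(control)), len(c_idx))
    ]
    return observed, sum(s >= observed for s in stats) / len(stats)

# Invented outcomes; the first two patients in each arm carry the genotype.
treated = [10.0, 11.0, 5.0, 6.0]
control = [5.0, 6.0, 5.0, 6.0]
observed, p_value = permutation_p_value(treated, control, {0, 1}, {0, 1})
print(f"interaction: {observed:.1f}, p = {p_value:.3f}")
```

A small p-value suggests that the genotype changes how well the treatment works. With only eight patients the evidence here is weak (5 of the 36 relabelings are at least as extreme), which is a reminder that subgroup searches need enough patients per arm.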

Virtual Twins got its name from the way it predicts two hypothetical outcomes for each patient: the outcome if they received treatment and the outcome if they did not. The difference between these two values represents the efficacy of the treatment for that patient. A valid subgroup would consist of individuals for whom this difference is large. An advantage of this kind of methodology is that it can simulate the treatment/control design on observational data as well.
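The two-prediction idea can be sketched in a few lines. This is a simplified illustration, not the published implementation: it swaps in a straight-line least squares fit per arm where Foster et al. use random forests, it uses a fixed cutoff where the published method fits a tree to the differences, and the biomarker data and `THRESHOLD` value are invented.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def predict(model, x):
    intercept, slope = model
    return intercept + slope * x

# Invented trial data: (biomarker level, outcome) for each patient, by arm.
treated = [(0.0, 1.0), (2.0, 5.0), (4.0, 9.0), (6.0, 13.0)]
control = [(1.0, 1.5), (3.0, 2.5), (5.0, 3.5), (7.0, 4.5)]

# One outcome model per arm.
model_treated = fit_line(*zip(*treated))
model_control = fit_line(*zip(*control))

def twin_difference(x):
    """Predicted outcome with treatment minus predicted outcome without it."""
    return predict(model_treated, x) - predict(model_control, x)

# Every patient gets both predictions, whichever arm they were actually in.
biomarkers = sorted(x for x, _ in treated + control)
THRESHOLD = 5.0  # hypothetical cutoff for a "large" predicted benefit
responders = [x for x in biomarkers if twin_difference(x) >= THRESHOLD]
print("biomarker levels with a large predicted benefit:", responders)
```

Here the predicted benefit grows with the biomarker, so patients with a biomarker level of 4 or more form the candidate subgroup.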

These methods are by no means the only ones available. There are other tree-based methods like SIDES as well as more diverse methods like FindIt, which utilizes support vector machines. Literature reviews on these methods and more can be found here and here.

Subgroup identification should be used with caution, as there is a chance the methods will identify spurious subgroups. The authors of these methods recommend resampling techniques such as bootstrapping to mitigate this problem. Despite the risks, careful use makes possible a systematic search for novel subgroups that could then be confirmed in further experiments. These subgroups will be important in developing personalized treatment strategies not previously possible.
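As a rough illustration of the resampling idea, the sketch below bootstraps the treatment effect inside one candidate subgroup. The outcomes are invented and `bootstrap_effect_interval` is a hypothetical helper, not a function from any of the packages for these methods.

```python
import random
from statistics import mean

def bootstrap_effect_interval(treated, control, n_boot=1000, seed=0):
    """Percentile bootstrap interval (about 95%) for the treatment-minus-control
    difference in means, resampling patients with replacement within each arm."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    effects = []
    for _ in range(n_boot):
        t = rng.choices(treated, k=len(treated))
        c = rng.choices(control, k=len(control))
        effects.append(mean(t) - mean(c))
    effects.sort()
    lower = effects[int(0.025 * n_boot)]
    upper = effects[int(0.975 * n_boot) - 1]
    return lower, upper

# Invented outcomes for patients inside one candidate subgroup.
treated = [10, 11, 12, 9, 13, 10]
control = [5, 6, 5, 7, 6, 5]

lower, upper = bootstrap_effect_interval(treated, control)
print(f"bootstrap interval for the subgroup effect: {lower:.2f} to {upper:.2f}")
```

An interval that stays well above zero is reassuring, but this check treats the subgroup as fixed in advance. A stricter version would rerun the entire subgroup search on every resample, because the search itself is a source of over-optimism.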


Virtual Twins

  • Foster JC, Taylor JMG, Ruberg SJ. Subgroup identification from randomized clinical trial data. Statistics in Medicine. 2011;30(24). doi:10.1002/sim.4322


GUIDE

  • Loh W-Y, Man M, Wang S. Subgroups from regression trees with adjustment for prognostic effects and postselection inference. Statistics in Medicine. 2019;38(4):545-557. doi:10.1002/sim.7677
  • Loh W-Y, He X, Man M. A regression tree approach to identifying subgroups with differential treatment effects. Statistics in Medicine. 2015;34(11):1818-1833. doi:10.1002/sim.6454
  • Loh W-Y, Zhou P. The GUIDE approach to subgroup identification. In: Ting N, Cappelleri JC, Ho S, Chen D-G, eds. Design and Analysis of Subgroups with Biopharmaceutical Applications. Springer; 2020, in press.


PRIM

  • Chen G, Zhong H, Belousov A, Devanarayan V. A PRIM approach to predictive-signature development for patient stratification. Statistics in Medicine. 2015;34(2):317-342. doi:10.1002/sim.6343
  • Ott A, Hapfelmeier A. Nonparametric Subgroup Identification by PRIM and CART: A Simulation and Application Study. Computational and Mathematical Methods in Medicine. 2017. doi:10.1155/2017/5271091
  • Kehl V, Ulm K. Responder identification in clinical trials with censored data. Computational Statistics & Data Analysis. 2006;50(5):1338-1355. doi:10.1016/j.csda.2004.11.015