Subtype and Stage Inference (SuStaIn) is an unsupervised learning algorithm that uniquely enables the identification of subgroups of individuals with distinct pseudo-temporal disease progression patterns from cross-sectional datasets.
SuStaIn was introduced by Young et al. (2018) in “Uncovering the heterogeneity and temporal complexity of neurodegenerative diseases with Subtype and Stage Inference.” It builds on the event-based model (EBM) of disease progression. Beyond the progression trajectory, however, it also models heterogeneity within a disease.
The heterogeneity of neurodegenerative diseases is a key confound to disease understanding and treatment development, as study cohorts typically include multiple phenotypes on distinct disease trajectories. Here we introduce a machine-learning technique—Subtype and Stage Inference (SuStaIn)—able to uncover data-driven disease phenotypes with distinct temporal progression patterns, from widely available cross-sectional patient studies.
The SuStaIn framework builds on the event-based model, which assumes that each biomarker makes a binary, abrupt normal-to-abnormal transition, giving the likelihood below.
$p(X | S) = \prod_{j=1}^{J} \sum_{k=0}^{N} p(k) \left( \prod_{i=1}^{k} p(x_{s(i),j} \mid E_{s(i)}) \cdot \prod_{i=k+1}^{N} p(x_{s(i),j} \mid \neg E_{s(i)}) \right)$ … (1)
SuStaIn replaces this abrupt transition with a more realistic gradual accumulation of biomarker abnormality (a piecewise linear increase in z-score).
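To make Equation (1) concrete, here is a minimal sketch that evaluates it for a given sequence. It assumes the per-subject probabilities $p(x \mid E)$ and $p(x \mid \neg E)$ have already been computed; the function name `ebm_likelihood` and its argument layout are hypothetical and are not taken from the paper or any library.

```python
import math

def ebm_likelihood(p_event, p_no_event, sequence, stage_prior=None):
    """Evaluate Equation (1) for one event sequence S.

    p_event[j][i]    : p(x_ij | E_i), biomarker i of subject j is abnormal
    p_no_event[j][i] : p(x_ij | not E_i), biomarker i of subject j is normal
    sequence         : sequence[pos] is the biomarker whose event sits at position pos
    stage_prior      : p(k) for k = 0..N (assumed uniform when omitted)
    """
    n_events = len(sequence)
    if stage_prior is None:
        stage_prior = [1.0 / (n_events + 1)] * (n_events + 1)

    total = 1.0
    for pe, pn in zip(p_event, p_no_event):  # product over subjects j
        subject_sum = 0.0
        for k in range(n_events + 1):  # sum over stages k
            # Events at the first k positions have occurred; the rest have not.
            occurred = math.prod(pe[i] for i in sequence[:k])
            pending = math.prod(pn[i] for i in sequence[k:])
            subject_sum += stage_prior[k] * occurred * pending
        total *= subject_sum
    return total

# Toy data: one subject, two biomarkers; biomarker 0 looks abnormal, biomarker 1 normal.
p_e = [[0.9, 0.2]]
p_ne = [[0.1, 0.8]]
print(ebm_likelihood(p_e, p_ne, [0, 1]))  # biomarker 0 becomes abnormal first
print(ebm_likelihood(p_e, p_ne, [1, 0]))  # biomarker 1 becomes abnormal first
```

On this toy subject the ordering `[0, 1]` scores higher, as expected when biomarker 0 is the abnormal one. With many subjects the raw product underflows, so a practical implementation would sum log-likelihoods instead.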
<aside>
Recap: Notations for the Event-based Progression Model
As Young notes in the codebook regarding the choice of $z_{max_i}$:

> I'd suggest choosing a value around the 95th percentile of your data but you can experiment with different values. I typically choose an integer for interpretability but you don't have to.
After defining $z_{max_i}$, we need to specify $R_i$, the number of z-score events for biomarker $i$. The z-score events can then be set as $z_{ri} = \frac{r \, z_{max_i}}{R_i + 1}$ for $r = 1, \dots, R_i$, which splits $[0, z_{max_i}]$ into $R_i + 1$ equal intervals for interpolation and samples the trajectory uniformly. This is the default option unless domain expertise suggests otherwise.
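As a small worked example of this default, the sketch below places $R_i$ equally spaced z-score events between 0 and $z_{max_i}$, assuming the $r$-th event sits at $r \cdot z_{max_i} / (R_i + 1)$. The helper name `zscore_events` is hypothetical.

```python
def zscore_events(z_max, n_events):
    """Return n_events equally spaced z-score thresholds strictly inside (0, z_max)."""
    if n_events < 1:
        raise ValueError("n_events must be at least 1")
    return [r * z_max / (n_events + 1) for r in range(1, n_events + 1)]

print(zscore_events(3, 2))  # [1.0, 2.0]
print(zscore_events(2, 1))  # [1.0]
```

Choosing an integer $z_{max_i}$ that is a multiple of $R_i + 1$ keeps the thresholds at whole z-scores, which is easier to interpret.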
In the paper, the mapping is as follows:
| $R_i$ (number of z-score events) | $z_{max}$ (final z-score reached) |
|---|---|
| 1 | 2 |
| 2 | 3 |
| 3 | 5 |
<aside>
What are the underlying assumptions to set $z_{max_i}$ and $R_i$ this way?
It assumes that the maximum z-scores are the same across all subtypes or, more generally, that the degree of abnormality a single biomarker can reach is consistent across subtypes. (This is, of course, too idealized for real data.)
Additionally, by folding time into stages (i.e., time is not explicitly modeled), the model ignores the potential variation in the rate of degeneration across individuals or subtypes.
</aside>
Time axis $t$: Normalized from 0 to 1 for simplicity. At each disease stage $k$ (ranging from 0 to N), time spans from $t = \frac{k}{N+1}$ to $t = \frac{k+1}{N+1}$.
The biomarkers evolve as time $t$ progresses according to a piecewise linear function $g_i(t)$, where
$g(t)=\begin{cases} \frac{z_1}{t_{E_{z_1}}}\, t, & 0<t \leq t_{E_{z_1}} \\ z_1+\frac{z_2-z_1}{t_{E_{z_2}}-t_{E_{z_1}}}\left(t-t_{E_{z_1}}\right), & t_{E_{z_1}}<t \leq t_{E_{z_2}} \\ \vdots \\ z_{R-1}+\frac{z_R-z_{R-1}}{t_{E_{z_R}}-t_{E_{z_{R-1}}}}\left(t-t_{E_{z_{R-1}}}\right), & t_{E_{z_{R-1}}}<t \leq t_{E_{z_R}} \\ z_R+\frac{z_{\max }-z_R}{1-t_{E_{z_R}}}\left(t-t_{E_{z_R}}\right), & t_{E_{z_R}}<t \leq 1 \end{cases}$
Thus, the times $t_{E_{i z}}$ are determined by the position of the z-score event $E_{i z}$ in the sequence $S$, so if event $E_{i z}$ occurs in position $k$ in the sequence then $t_{E_{i z}}=\frac{k+1}{N+1}$.
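The piecewise linear trajectory and the event-time rule above can be sketched in a few lines. This is only an illustration under the stated definitions: positions are taken as 0-based, $N$ is the total number of events in the sequence, and the names `event_time` and `biomarker_trajectory` are hypothetical.

```python
def event_time(position, n_events):
    """Time of the event at 0-based position k in S: (k + 1) / (N + 1)."""
    return (position + 1) / (n_events + 1)

def biomarker_trajectory(t, event_times, event_zscores, z_max):
    """Evaluate g(t): linear interpolation through (0, 0), each (t_E, z), and (1, z_max)."""
    knots_t = [0.0] + list(event_times) + [1.0]
    knots_z = [0.0] + list(event_zscores) + [z_max]
    for idx in range(len(knots_t) - 1):
        t0, t1 = knots_t[idx], knots_t[idx + 1]
        z0, z1 = knots_z[idx], knots_z[idx + 1]
        if t0 <= t <= t1:
            # Straight line between the two neighbouring knots.
            return z0 + (z1 - z0) / (t1 - t0) * (t - t0)
    raise ValueError("t must lie in [0, 1]")

# One biomarker with z-score events 1 and 2 at positions 0 and 2 of a
# sequence with N = 4 events in total, and z_max = 3.
n_events = 4
times = [event_time(0, n_events), event_time(2, n_events)]  # 1/5 and 3/5
for t in (0.1, 0.4, 0.8, 1.0):
    print(t, biomarker_trajectory(t, times, [1, 2], 3))
```

The biomarker reaches z = 1 at t = 0.2 and z = 2 at t = 0.6, then continues linearly to $z_{max} = 3$ at t = 1, so the order of events in $S$ fully determines the shape of the curve.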
![Figure from Hong et al. (2023), eBioMedicine 97: 104835](attachment:23b7cbac-3b5e-430e-b851-61f5b34e2ab4:Screenshot_2025-08-08_at_1.03.43_PM.png)
Hong, Jimin, Jiaying Lu, Fengtao Liu, Min Wang, Xinyi Li, Christoph Clement, Leonor Lopes, et al. “Uncovering Distinct Progression Patterns of Tau Deposition in Progressive Supranuclear Palsy Using [18F]Florzolotau PET Imaging and Subtype/Stage Inference Algorithm.” eBioMedicine 97 (October 14, 2023): 104835. https://doi.org/10.1016/j.ebiom.2023.104835.
To formulate the likelihood for the linear z-score model, we replace the event-based model likelihood, Equation (1):
$p(X | S) = \prod_{j=1}^{J} \sum_{k=0}^{N} p(k) \left( \prod_{i=1}^{k} p(x_{s(i),j} \mid E_{s(i)}) \cdot \prod_{i=k+1}^{N} p(x_{s(i),j} \mid \neg E_{s(i)}) \right)$ … (1)