Dear Editor,
We would like to thank Dr. Yu and his colleagues [1] for their interest in our recent transcriptome analysis from the IMbrave150 clinical trial, which resulted in the development of the ISS10 predictor [2]. We are grateful for their thorough and insightful work in comparing 23 predictive gene expression signatures for immunotherapy and assessing the predictability of each using the clinical outcomes from the IMbrave150 trial [3,4]. Their comprehensive analysis has added valuable depth to the ongoing discussion on identifying and validating robust biomarkers and prediction models for immunotherapy in hepatocellular carcinoma (HCC). We are pleased to see such thoughtful engagement with related research.
However, making direct head-to-head comparisons of different predictive biomarkers is fraught with significant challenges. Factors such as differences in prediction algorithms, cutoff thresholds, biomarker platforms, biological context, and the reproducibility of findings complicate comparisons across studies [5,6]. While a comprehensive discussion on biomarker comparisons is beyond the scope of this correspondence, we highlight several key challenges that warrant attention in clinical hepatology and cancer research.
One of the main barriers to direct comparison is the variability in the algorithms used to derive predictive biomarkers or prediction models. Predictive biomarkers are often developed using different machine learning models or statistical frameworks, each with its own assumptions and limitations. Some studies may use simple linear models, while others might rely on more complex approaches like neural networks or decision trees [7]. These differences in algorithmic approaches can lead to variations in how biomarkers or prediction models perform when applied to real-world patient data. This algorithmic variability poses a substantial obstacle to standardized comparison, as the underlying mathematical frameworks can differ greatly between studies. The choice of algorithm can influence not only the biomarker’s predictive power but also its interpretability and generalizability across different patient populations. In an attempt to address this challenge, Li et al. adopted a pragmatic approach by using single-sample Gene Set Enrichment Analysis (ssGSEA) scores with gene sets from 23 different signatures. While this method allows for a broad comparison of multiple signatures simultaneously, it is a data-adaptive, nonlinear dimension-reduction approach and may introduce unintended bias. The case of the Tumor Immune Dysfunction and Exclusion (TIDE) signature may illustrate this point. As TIDE scores reflect immune evasion, higher TIDE scores were associated with poor response to immunotherapy [8]. However, in the current analysis, these scores paradoxically correlated with improved response to the atezolizumab plus bevacizumab combination therapy. This discrepancy underscores the potential pitfalls of applying a universal approach to diverse predictive models. While such standardization may be necessary for large-scale comparisons, it risks altering the original intent and performance characteristics of individual biomarkers or prediction models. Consequently, we must carefully consider these limitations when interpreting results from comparative studies and be cautious about drawing definitive conclusions without considering the original context and methodology of each predictive tool.
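To make this concern concrete, the sketch below (in Python, using entirely hypothetical gene symbols, expression values, and signature gene sets) applies a simplified rank-based per-sample score as a stand-in for ssGSEA. It is not the procedure used by Li et al.; it only illustrates how a single universal scoring scheme is applied to many signatures at once, and why the direction of a signature such as TIDE must be preserved.

```python
# Minimal sketch (not the actual ssGSEA of Barbie et al. or GSVA): a crude
# rank-based per-sample signature score. The expression matrix, gene symbols,
# and signature gene sets are hypothetical.
import numpy as np
import pandas as pd

def rank_signature_score(expr: pd.DataFrame, genes: list[str]) -> pd.Series:
    """Mean expression rank of the signature genes within each sample.

    expr: genes x samples matrix of (log-scale) expression values.
    Returns one score per sample; higher = signature genes more highly ranked.
    """
    ranks = expr.rank(axis=0)                  # rank genes within each sample
    hits = ranks.index.intersection(genes)     # signature genes present in the data
    return ranks.loc[hits].mean(axis=0) / len(expr)

# Hypothetical inputs: a small random expression matrix and two signatures
# whose directions differ (a high TIDE-like score denotes immune evasion).
rng = np.random.default_rng(0)
expr = pd.DataFrame(rng.normal(size=(500, 40)),
                    index=[f"GENE{i}" for i in range(500)],
                    columns=[f"S{i}" for i in range(40)])
signatures = {"ISS10_like": [f"GENE{i}" for i in range(10)],
              "TIDE_like":  [f"GENE{i}" for i in range(100, 130)]}

scores = pd.DataFrame({name: rank_signature_score(expr, genes)
                       for name, genes in signatures.items()})
# Direction matters: a uniform "high score = good response" rule would invert
# signatures such as TIDE, whose high scores predict poor response.
print(scores.head())
```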
Cutoff thresholds represent another key challenge when comparing predictive biomarkers. These thresholds, used to classify patients as responders or non-responders, are often chosen arbitrarily or optimized for a specific cohort. This can lead to discrepancies when biomarkers are applied to different populations. This is particularly relevant when considering biomarkers based on continuous variables, such as gene expression levels or mutation burden, where the choice of cutoff can significantly impact the biomarker’s predictive power. In the study by Li et al., the biomarkers were categorized into high and low signature subtypes based on median ssGSEA scores. However, using the median as a cutoff may not be optimal for all datasets, particularly if the distribution of biomarker expression is skewed or if there is substantial heterogeneity in the patient population. Consequently, biomarkers that perform well in one cohort may fail when applied to a different population with a different cutoff threshold, limiting the ability to make reliable cross-study comparisons.
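As a brief illustration, the following sketch (Python, with simulated scores for two hypothetical cohorts rather than IMbrave150 data) shows how median dichotomization yields cohort-specific cutoffs, particularly when the score distributions are skewed.

```python
# Minimal sketch of median dichotomization, assuming per-patient ssGSEA-style
# scores are already available; cohorts and score distributions are synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
cohort_a = pd.Series(rng.normal(loc=0.0, scale=1.0, size=200), name="cohort_a")
cohort_b = pd.Series(rng.lognormal(mean=0.0, sigma=0.8, size=200), name="cohort_b")  # skewed

for cohort in (cohort_a, cohort_b):
    cutoff = cohort.median()          # cohort-specific median cutoff
    high = cohort > cutoff            # "high signature" subtype
    print(f"{cohort.name}: median cutoff = {cutoff:.2f}, "
          f"{high.sum()} high / {(~high).sum()} low")
# The same signature produces different absolute cutoffs in each cohort, so a
# threshold optimized in one population need not transfer to another.
```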
Another major hurdle in comparing predictive biomarkers is the biological context in which they are developed. Biomarkers derived from one cancer type or treatment modality may not be applicable to other cancers or therapies due to differences in tumor biology and the tumor microenvironment. For example, a biomarker predicting response to immunotherapy in lung cancer may not perform well in HCC due to differences in immune cell infiltration or tumor mutation burden. In the context of atezolizumab plus bevacizumab therapy, Li et al. highlighted that many of the signatures derived from other cancer types, such as melanoma or lung cancer, were also predictive of response in HCC. However, this was not universally true, as certain signatures performed poorly, likely due to differences in the biological context of HCC compared with other cancers. This reinforces the need for caution when generalizing predictive biomarkers across different tumor types.
Lastly, the methodology used for comparing predictive biomarkers is as crucial as the biomarkers themselves. The proposed hazard ratio (HR) score by Li et al. represents an innovative approach to evaluating the robustness of predictive biomarkers. However, it has not yet been widely accepted as a standard criterion for comparing predictive signatures or models in the context of survival outcomes. This HR score approach has two main limitations. First, it provides only a point estimate without an accompanying standard error, which makes it difficult to quantify the uncertainty associated with the estimate. Second, the HR is derived from the Cox proportional hazards regression model, which relies on the assumption of proportional hazards. This assumption may not always hold true in real-world data. A more appropriate and widely accepted criterion for such comparisons would be the C-index (also known as the C-statistic) [9]. This measure is an adaptation of the area under the receiver operating characteristic (ROC) curve (AUC) specifically designed for time-to-event outcomes that involve censored data. The C-index offers several advantages: it provides a measure of discriminative ability that is not dependent on a specific time point, it accounts for censoring in the data, and it allows for a more comprehensive assessment of model performance across the entire follow-up period.
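For illustration, the sketch below (Python, assuming the lifelines package and fully simulated survival times, event indicators, and signature scores) contrasts a Cox-model HR, which is a point estimate obtained under the proportional hazards assumption, with Harrell's C-index. It is not the HR-score procedure of Li et al.

```python
# Minimal sketch, assuming the lifelines package and a hypothetical data frame
# with survival time, an event indicator, and a continuous signature score.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

rng = np.random.default_rng(2)
n = 120
score = rng.normal(size=n)                          # per-patient signature score
time = rng.exponential(scale=np.exp(-0.5 * score))  # higher score -> shorter survival
event = rng.binomial(1, 0.7, size=n)                # 1 = event observed, 0 = censored
df = pd.DataFrame({"time": time, "event": event, "score": score})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
hr = cph.hazard_ratios_["score"]   # point estimate; cph.print_summary() adds CIs and p-values

# Harrell's C-index: negate the risk score because concordance_index expects
# higher predicted values to correspond to longer survival; censoring is handled
# by comparing only usable patient pairs.
c_index = concordance_index(df["time"], -df["score"], df["event"])
print(f"HR per unit of score = {hr:.2f}, C-index = {c_index:.3f}")
```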
Despite these challenges, the comparison of predictive biomarkers remains a critical endeavor in precision oncology. As more biomarkers are developed and validated, there is a growing need for standardized methodologies to facilitate direct comparisons across studies. This includes the use of harmonized prediction algorithms, standardized cutoff thresholds, and rigorous validation protocols in diverse patient cohorts. Furthermore, the integration of multi-omics data, such as genomic, transcriptomic, and proteomic profiles, may provide a more comprehensive view of tumor biology and improve the predictive power of biomarkers.
In conclusion, while predictive biomarkers hold great promise for guiding cancer treatment, such as immunotherapy for HCC, the challenges associated with direct comparisons between biomarkers need to be acknowledged. Differences in prediction algorithms, cutoff thresholds, and reproducibility across studies complicate the ability to make reliable comparisons. Despite the inherent challenges in comparing biomarkers from different sources, we commend Dr. Yu and colleagues for their comprehensive and thought-provoking analysis. Their work represents a significant contribution to the field, stimulating important discussions about biomarker development and validation in HCC immunotherapy. Such rigorous comparative studies are crucial for advancing our understanding and improving patient care. We believe that their efforts, combined with ongoing research in this area, will pave the way for more standardized and robust approaches to biomarker comparison in the future. We look forward to continuing this important dialogue and collaborating with our colleagues to refine and improve predictive tools for the benefit of HCC patients.
ACKNOWLEDGMENTS
This work is supported by the NIH/NCI under award numbers R01CA237327, P50CA217674, and P30CA016672; The University of Texas MD Anderson Cancer Center Institutional Research Grant (IRG) Program; The University of Texas MD Anderson Cancer Center Institutional Bridge Funds; and the Duncan Family Institute for Cancer Prevention and Risk Assessment Seed Funding Research Program at MD Anderson Cancer Center.
Abbreviations
AUC: area under the receiver operating characteristic curve
HCC: hepatocellular carcinoma
ROC: receiver operating characteristic
ssGSEA: single-sample Gene Set Enrichment Analysis
TIDE: Tumor Immune Dysfunction and Exclusion
REFERENCES
2. Yim SY, Lee SH, Baek SW, Sohn B, Jeong YS, Kang SH, et al. Genomic biomarkers to predict response to atezolizumab plus bevacizumab immunotherapy in hepatocellular carcinoma: insights from the IMbrave150 trial. Clin Mol Hepatol 2024;30:807-823.
4. Finn RS, Qin S, Ikeda M, Galle PR, Ducreux M, Kim TY, et al. Atezolizumab plus bevacizumab in unresectable hepatocellular carcinoma. N Engl J Med 2020;382:1894-1905.
7. Dakal TC, Dhakar R, Beura A, Moar K, Maurya PK, Sharma NK, et al. Emerging methods and techniques for cancer biomarker discovery. Pathol Res Pract 2024;262:155567.
9. Harrell FE Jr, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA 1982;247:2543-2546.