Today's Overview
- Pre-CV Feature Screening Creates Widespread Leakage in Cancer Drug Response Models: screening features before cross-validation understates prediction error; leakage-free evaluation raises MSE by 16.6% on average across 265 cancer drugs.
Featured
01 Pre-CV Feature Screening Creates Widespread Leakage in Cancer Drug Response Models
Accurate prediction of drug response in cancer cell lines is central to identifying genomic biomarkers that can guide patient stratification. Yet most published models rely on supervised feature selection applied to the entire dataset before cross-validation, a protocol that leaks label information and spuriously lowers prediction error.
A re-analysis of 265 drugs across 1,462 cell lines showed that leakage-free cross-validation increases mean squared error by 16.6% on average. The leaked and corrected pipelines share almost no selected features (mean Jaccard = 0.18), and 36% of drugs have zero overlap, even though the leaked version retains five times as many genes. Both pipelines recover known drug targets at similar rates, indicating that the extra features capture noise, not biology. A survey of 32 recent methods found leakage in 72% of them; these papers have collectively been cited more than 3,000 times, and the error reduction attributable to leakage is comparable in size to the improvements reported over elastic-net baselines.
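For readers unfamiliar with the overlap metric, the Jaccard index of two feature sets is the size of their intersection divided by the size of their union. The snippet below is a minimal sketch of that calculation; the gene names are hypothetical placeholders, not features from the study.

```python
def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; two empty sets count as identical."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical gene sets, for illustration only.
leaked = {"GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"}
corrected = {"GENE_A", "GENE_F"}

# One shared gene out of six distinct genes.
print(jaccard(leaked, corrected))
# Disjoint sets give 0.0, the "zero overlap" case.
print(jaccard(leaked, {"GENE_X"}))
```

A value of 0 means the two pipelines selected entirely different genes; a value of 1 means they selected exactly the same set.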
All experiments remain in silico on public cell-line panels; no prospective wet-lab validation is reported. The audit focused only on one leakage mode—pre-CV feature screening—so other protocol flaws may remain. The authors supply code templates for leakage-free evaluation, but adoption will require re-running benchmarks across the field.
Also Worth Noting
Seven pKa prediction algorithms (three commercial, four open-source ML) were benchmarked on a curated 90,000-entry public data set from 31,000 molecules to quantify accuracy across charge states and polyprotic species. (Chem)
Today's Observation
Cancer drug response prediction is a canonical benchmark for multi-omics machine learning, yet a widespread data-leakage pitfall undercuts its utility. A survey of 32 recent methods found that 72% perform feature selection before cross-validation. In a re-analysis of 265 compounds in GDSC and CCLE, this shortcut understated prediction error: leakage-free cross-validation raised mean squared error by 16.6% on average. Leaked pipelines retain five times as many genes as leakage-corrected ones, and the two gene sets barely overlap (mean Jaccard = 0.18), indicating that the inflated scores reflect sample-specific noise rather than generalizable signal. The gains reported over plain elastic-net baselines disappear once leakage is removed, implying that many “state-of-the-art” improvements are illusory.
Practically, any project that screens thousands of molecular features must nest selection inside each CV fold or use an external validation cohort. The same issue applies to other omics-assisted tasks, such as predicting CRISPR essentiality or patient outcome, where pre-filtering is tempting. Until journals and competitions enforce stricter code inspection, practitioners should treat published performance as optimistic (MSE values as lower bounds on the true error, Pearson r values as upper bounds) and retrain models with rigorously nested CV before deploying biomarkers or moving into expensive in-vitro confirmation.
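To make "nest selection inside each CV fold" concrete, here is a minimal sketch on pure-noise data, using only the Python standard library. It is not the authors' pipeline: the correlation screen, the toy averaged-univariate-regression model, and all names (`screen`, `fit_predict`, `cv_mse`) are assumptions chosen for illustration. Because the response is random, any apparent predictive skill is an artifact of leakage.

```python
import random

def mean(v):
    return sum(v) / len(v)

def slope(x, y):
    """Univariate least-squares slope of y on x (0.0 if x is constant)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx if sxx else 0.0

def abs_corr(x, y):
    """Absolute Pearson correlation (0.0 if either vector is constant)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return abs(sxy) / (sxx * syy) ** 0.5 if sxx and syy else 0.0

def screen(X, y, rows, k):
    """Indices of the k features most correlated with y, using only `rows`."""
    ys = [y[i] for i in rows]
    n_features = len(X[0])
    scores = [abs_corr([X[i][j] for i in rows], ys) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: -scores[j])[:k]

def fit_predict(X, y, train, test, feats):
    """Toy model: average of univariate regressions fitted on `train` only."""
    ys = [y[i] for i in train]
    y_bar = mean(ys)
    fits = []
    for j in feats:
        xs = [X[i][j] for i in train]
        fits.append((j, mean(xs), slope(xs, ys)))
    return [y_bar + mean([b * (X[i][j] - m) for j, m, b in fits]) for i in test]

def cv_mse(X, y, k, n_folds, leaky):
    """Cross-validated MSE; `leaky=True` screens features on ALL samples first."""
    n = len(y)
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    global_feats = screen(X, y, list(range(n)), k) if leaky else None
    sq_err = 0.0
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        # Leakage-free variant: re-run the screen on the training rows only.
        feats = global_feats if leaky else screen(X, y, train, k)
        preds = fit_predict(X, y, train, test, feats)
        sq_err += sum((y[i] - p) ** 2 for i, p in zip(test, preds))
    return sq_err / n

rng = random.Random(0)
n_samples, n_features = 50, 1000
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [rng.gauss(0, 1) for _ in range(n_samples)]  # pure noise: no true signal

leaky_mse = cv_mse(X, y, k=10, n_folds=5, leaky=True)
nested_mse = cv_mse(X, y, k=10, n_folds=5, leaky=False)
print(f"leaky MSE:  {leaky_mse:.3f}")
print(f"nested MSE: {nested_mse:.3f}")
```

The leaky estimate comes out lower than the nested one even though there is nothing to learn, which is the pattern the audit describes. In scikit-learn, the same protection is usually obtained by placing the selector (for example `SelectKBest`) inside a `Pipeline` that is passed to `cross_val_score`, so the screen is refitted on each training fold.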
The above is personal commentary for reference only. Refer to the original papers for authoritative content.