Cautionary note on using cross-validation for molecular classification (Journal Article)

Authors: Qin, L. X.; Huang, H. C.; Begg, C. B.
Article Title: Cautionary note on using cross-validation for molecular classification
Abstract: Purpose: Reproducibility of scientific experimentation has become a major concern because of the perception that many published biomedical studies cannot be replicated. In this article, we draw attention to the connection between inflated overoptimistic findings and the use of cross-validation for error estimation in molecular classification studies. We show that, in the absence of careful design to prevent artifacts caused by systematic differences in the processing of specimens, established tools such as cross-validation can lead to a spurious estimate of the error rate in the overoptimistic direction, regardless of the use of data normalization as an effort to remove these artifacts. Methods: We demonstrated this important yet overlooked complication of cross-validation using a unique pair of data sets on the same set of tumor samples. One data set was collected with uniform handling to prevent handling effects; the other was collected without uniform handling and exhibited handling effects. The paired data sets were used to estimate the biologic effects of the samples and the handling effects of the arrays in the latter data set, which were then used to simulate data using virtual rehybridization following various array-to-sample assignment schemes. Results: Our study showed that (1) cross-validation tended to underestimate the error rate when the data possessed confounding handling effects; (2) depending on the relative amount of handling effects, normalization may further worsen the underestimation of the error rate; and (3) balanced assignment of arrays to comparison groups allowed cross-validation to provide an unbiased error estimate. Conclusion: Our study demonstrates the benefits of balanced array assignment for reproducible molecular classification and calls for caution on the routine use of data normalization and cross-validation in such analysis. © 2016 by American Society of Clinical Oncology.
Journal Title: Journal of Clinical Oncology
Volume: 34
Issue: 32
ISSN: 0732-183X
Publisher: American Society of Clinical Oncology
Date Published: 2016-11-10
Start Page: 3931
End Page: 3938
Language: English
DOI: 10.1200/jco.2016.68.1031
PROVIDER: scopus
PUBMED: 27601553
PMCID: PMC5477984
Notes: Article -- Export Date: 6 December 2016 -- Source: Scopus
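Illustration (not part of the published record): the effect described in the abstract can be sketched with a toy simulation. This is not the authors' virtual rehybridization procedure or their data. It assumes two comparison groups with no biologic difference at all, a purely additive handling effect, a nearest-centroid classifier, and 5-fold cross-validation; every constant and function name below is hypothetical. When handling is confounded with group, the cross-validated error looks far lower than the error on freshly and uniformly handled arrays; with balanced assignment, cross-validation no longer looks optimistic.

```python
import numpy as np

N_PER_GROUP = 20   # arrays per comparison group (hypothetical)
N_FEATURES = 50    # markers per array (hypothetical)
SHIFT = 1.0        # additive handling effect in noise-SD units (hypothetical)


def simulate(rng, shift, confounded):
    """Two groups with NO biologic difference, processed in two handling batches."""
    labels = np.repeat([0, 1], N_PER_GROUP)
    if confounded:
        batch = labels.copy()                 # each group handled in its own batch
    else:
        batch = np.tile([0, 1], N_PER_GROUP)  # each batch split evenly across groups
    x = rng.normal(size=(labels.size, N_FEATURES))
    x[batch == 1] += shift                    # handling effect hits every marker
    return x, labels


def predict(x_train, y_train, x_test):
    """Nearest-centroid classifier using squared Euclidean distance."""
    c0 = x_train[y_train == 0].mean(axis=0)
    c1 = x_train[y_train == 1].mean(axis=0)
    d0 = ((x_test - c0) ** 2).sum(axis=1)
    d1 = ((x_test - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)


def cv_error(rng, x, y, k=5):
    """k-fold cross-validated misclassification rate."""
    idx = rng.permutation(y.size)
    wrong = 0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        wrong += int((predict(x[train], y[train], x[fold]) != y[fold]).sum())
    return wrong / y.size


def external_error(rng, x, y):
    """Misclassification rate on fresh arrays that carry no handling effect."""
    x_new, y_new = simulate(rng, shift=0.0, confounded=True)
    return float((predict(x, y, x_new) != y_new).mean())


rng = np.random.default_rng(0)
cv_conf, ext_conf, cv_bal = [], [], []
for rep in range(20):                         # average over simulated studies
    x_conf, y_conf = simulate(rng, SHIFT, confounded=True)
    x_bal, y_bal = simulate(rng, SHIFT, confounded=False)
    cv_conf.append(cv_error(rng, x_conf, y_conf))
    ext_conf.append(external_error(rng, x_conf, y_conf))
    cv_bal.append(cv_error(rng, x_bal, y_bal))

err_cv_confounded = float(np.mean(cv_conf))
err_external_confounded = float(np.mean(ext_conf))
err_cv_balanced = float(np.mean(cv_bal))

print("CV error, confounded handling:      ", err_cv_confounded)
print("External error, confounded handling:", err_external_confounded)
print("CV error, balanced assignment:      ", err_cv_balanced)
```

Because the simulated groups share no biologic signal, any classifier should perform at about chance on new data. A low cross-validated error in the confounded design therefore reflects the handling effect alone, which is the overoptimism the abstract warns about.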
MSK Authors
  1. Colin B Begg
  2. Li-Xuan Qin
  3. Huei Chung Huang