Optimized variable selection via repeated data splitting (Journal Article)


Authors: Capanu, M.; Giurcanu, M.; Begg, C. B.; Gönen, M.
Article Title: Optimized variable selection via repeated data splitting
Abstract: Model selection in high-dimensional settings has received substantial attention in recent years; however, similar advancements in the low-dimensional setting have been lacking. In this article, we introduce a new variable selection procedure for low to moderate scale regressions (n>p). This method repeatedly splits the data into two sets, one for estimation and one for validation, to obtain an empirically optimized threshold which is then used to screen for variables to include in the final model. In an extensive simulation study, we show that the proposed variable selection technique enjoys superior performance compared with candidate methods (backward elimination via repeated data splitting, univariate screening at 0.05 level, adaptive LASSO, SCAD), being amongst those with the lowest inclusion of noisy predictors while having the highest power to detect the correct model and being unaffected by correlations among the predictors. We illustrate the methods by applying them to a cohort of patients undergoing hepatectomy at our institution. © 2020 John Wiley & Sons, Ltd.
Keywords: adult; controlled study; cohort analysis; simulation; liver resection; linear regression analysis; linear regression; variable selection; human; male; female; article; data splitting; empirical threshold; variable screening
Journal Title: Statistics in Medicine
Volume: 39
Issue: 16
ISSN: 0277-6715
Publisher: John Wiley & Sons  
Date Published: 2020-07-20
Start Page: 2167
End Page: 2184
Language: English
DOI: 10.1002/sim.8538
PUBMED: 32282097
PROVIDER: scopus
PMCID: PMC8547352
Notes: Article -- Export Date: 1 July 2020 -- Source: Scopus
MSK Authors
  1. Colin B Begg
  2. Mithat Gonen
  3. Marinela Capanu