Missing data imputation and synthetic data simulation through modeling graphical probabilistic dependencies between variables (ModGraProDep): An application to breast cancer survival

Cited by: 9
Authors
Vilardell, Mireia [4]
Buxo, Maria [5, 6]
Cleries, Ramon [7, 8]
Martinez, Jose Miguel [9, 10, 11]
Garcia, Gemma [4]
Ameijide, Alberto [12]
Font, Rebeca [7]
Civit, Sergi [4]
Marcos-Gragera, Rafael [1, 2, 5, 6]
Vilardell, Maria Loreto [6]
Carulla, Maria [12]
Espinas, Josep Alfons [7]
Galceran, Jaume [12]
Izquierdo, Angel [3, 6]
Borras, Josep Ma [7, 8]
Affiliations
[1] Univ Girona UdG, Sch Med, Girona, Spain
[2] Ctr Invest Biomed Red Epidemiol & Salud Publ CIBE, Madrid, Spain
[3] Hosp Univ Girona Doctor Josep Trueta, Inst Catala Oncol, Serv Oncol Med, Girona 17005, Spain
[4] Univ Barcelona, Secc Estadist, Dept Genet Microbiol & Estadist, Fac Biol, Barcelona 08028, Spain
[5] IDIBGI, Inst Invest Biomed Girona, C Dr Castany S-N Edifici M2, Salt 17190, Spain
[6] Grup Epidemiol Descript Genet & Prevencio Canc Gi, Inst Catala Oncol, Registre Canc Girona Unitat Epidemiol, Pla Director Oncol, Girona 17005, Spain
[7] IDIBELL, Oncol, Ave Gran Via 199-203, Lhospitalet De Llobregat 08908, Spain
[8] Univ Barcelona, Dept Ciencies Clin, Barcelona 08907, Spain
[9] MC Mutual, Dept Anal & Planificac Recursos Sanitarios, Barcelona 08037, Spain
[10] Tech Univ Catalonia, Dept Stat, Barcelona 08028, Spain
[11] Univ Alicante, Publ Hlth Res Grp, Alicante 03690, Spain
[12] Hosp Univ St Joan Reus, Registre Canc Tarragona, Serv Epidemiol & Prevencio Canc, IISPV, Reus, Spain
Keywords
Breast cancer; Survival; Graphical models; Missing data; Oversampling; Simulation; Covariate data; Spain; Stage; Discrete; SMOTE
DOI
10.1016/j.artmed.2020.101875
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Background: Two common issues may arise in population-based breast cancer (BC) survival studies: (I) missing values in a variable that predicts survival, such as "Stage" at diagnosis, and (II) small sample sizes in certain subsets of patients caused by the class imbalance problem. Both call for data modeling or simulation methods. Methods: We present a procedure, ModGraProDep, based on graphical modeling (GM) of a dataset to overcome these two issues. The performance of the models derived from ModGraProDep is compared with a set of frequently used classification and machine learning algorithms (Missing Data Problem) and with oversampling algorithms (Synthetic Data Simulation). For the Missing Data Problem we assessed two scenarios: missing completely at random (MCAR) and missing not at random (MNAR). Two validated BC datasets provided by the cancer registries of Girona and Tarragona (northeastern Spain) were used. Results: In both the MCAR and MNAR scenarios, all comparison models showed poorer prediction performance than three GM models: the saturated one (GM.SAT) and two with penalty factors on the partial likelihood (GM.K1 and GM.TEST). However, GM.SAT predictions could lead to unreliable conclusions in BC survival analysis. Simulating a "synthetic" dataset from GM.SAT could be the worst strategy, whereas using the remaining GM models could be better than oversampling. Conclusion: Our results suggest using the GM procedure presented here for one-variable imputation/prediction of missing data and for simulating "synthetic" BC survival datasets. The "synthetic" datasets derived from GMs could also be used in clinical applications of cancer survival data, such as predictive risk analysis.
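The two missingness scenarios named in the abstract can be illustrated with a minimal Python sketch. It is not the ModGraProDep procedure and uses no data from the study: the stage weights, the missingness probabilities, and the helper names (`simulate_stages`, `mask_mcar`, `mask_mnar`, `share`) are all hypothetical, chosen only to show how MCAR and MNAR differ for a categorical variable such as "Stage".

```python
import random

STAGES = ["I", "II", "III", "IV"]

def simulate_stages(n, rng):
    """Draw n hypothetical stage labels (weights are illustrative, not from the paper)."""
    return rng.choices(STAGES, weights=[40, 35, 15, 10], k=n)

def mask_mcar(values, p_missing, rng):
    """MCAR: every value is hidden with the same probability, whatever its value."""
    return [None if rng.random() < p_missing else v for v in values]

def mask_mnar(values, p_by_value, rng):
    """MNAR: the chance of being hidden depends on the unobserved value itself."""
    return [None if rng.random() < p_by_value[v] else v for v in values]

def share(values, label):
    """Proportion of `label` among the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    return observed.count(label) / len(observed)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
stages = simulate_stages(20000, rng)
mcar = mask_mcar(stages, 0.3, rng)
# Assumed pattern: advanced stages are more likely to go unrecorded.
mnar = mask_mnar(stages, {"I": 0.1, "II": 0.2, "III": 0.4, "IV": 0.7}, rng)

true_iv = share(stages, "IV")
print(f"true stage IV share:  {true_iv:.3f}")
print(f"observed under MCAR:  {share(mcar, 'IV'):.3f}")
print(f"observed under MNAR:  {share(mnar, 'IV'):.3f}")
```

Under these assumed settings the observed stage IV share stays close to the true share with MCAR but is clearly lower with MNAR, which is why the two scenarios are assessed separately when comparing imputation models.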
Pages: 11