Evaluation of ensemble data preprocessing strategy on forensic gasoline classification using untargeted GC-MS data and classification and regression tree (CART) algorithm

Cited by: 5
Authors
Ghazi, Md Gezani Bin Md [1, 2]
Lee, Loong Chuen [1, 3]
Samsudin, Aznor Sheda Binti [4]
Sino, Hukil [1]
Affiliations
[1] Univ Kebangsaan Malaysia, Fac Hlth Sci, Forens Sci Program, CODTIS, Bangi, Selangor, Malaysia
[2] Fire & Rescue Dept Malaysia, Fire Invest Div, Putrajaya, Malaysia
[3] Univ Kebangsaan Malaysia, Inst IR 4 0, Bangi, Selangor, Malaysia
[4] Fire & Rescue Dept Selangor, Fire Invest Div, Fire Invest Lab, Bangi, Malaysia
Keywords
Untargeted GC-MS; Classification and regression tree; Gasoline; Ensemble data preprocessing technique; Chemometric analysis; Tools
DOI
10.1016/j.microc.2022.107911
Chinese Library Classification number
O65 [Analytical Chemistry]
Discipline classification code
070302; 081704
Abstract
The purpose of this work is to evaluate an ensemble data preprocessing (DP) strategy, comprising selected variants of normalization, parametric time warping and baseline correction techniques applied in varying sequences, for modelling gas chromatography-mass spectrometry (GC-MS) data via the classification and regression tree (CART) algorithm. First, the relative merits of single-DP and ensemble-DP strategies were compared using the best-performing sub-retention time (RT) windows reported elsewhere. All the preprocessed sub-datasets were then assessed on the basis of predictive capability estimated via the CART algorithm. The performance of the CART models was estimated from 50 pairs of training and testing samples prepared by a stratified random resampling method. Next, the three shortlisted sub-datasets were further evaluated using an increased number of training and testing pairs, and the most discriminative RT points were identified from these three sub-datasets. Finally, the most desirable CART model was constructed using the shortlisted RT points after treatment with the best-performing DP strategy. Results showed that 3-DP strategies tended to outperform the 1-DP and 2-DP strategies. However, the sequence of application must be carefully optimized, as not all the 3-DP strategies had a positive impact. Data that were aligned before baseline correction or normalization were likely to outperform data that were first normalized or baseline corrected. In conclusion, untargeted GC-MS data of neat gasoline should preferably be aligned first, then normalized, and finally baseline corrected.
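The experimental design described in the abstract can be sketched in a few lines. The following is an illustrative Python sketch only, not the authors' implementation: the step labels, the function names, the 30% test fraction and the toy class labels are all assumptions, and the abstract's "selected variants" of each technique are collapsed into one label per technique. It enumerates the ordered 1-DP, 2-DP and 3-DP sequences that three techniques allow, and draws 50 stratified random training/testing pairs.

```python
import random
from collections import defaultdict
from itertools import permutations

# Hypothetical labels: the abstract names only the technique families.
STEPS = ("alignment", "normalization", "baseline_correction")


def enumerate_strategies(steps=STEPS):
    """Return every ordered sequence of 1, 2 or 3 distinct DP steps."""
    return [seq
            for k in range(1, len(steps) + 1)
            for seq in permutations(steps, k)]


def stratified_split(labels, test_fraction, rng):
    """Split sample indices into train/test, preserving class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        # Assumed rule: at least one test sample per class.
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


def resample_pairs(labels, n_pairs=50, test_fraction=0.3, seed=0):
    """Draw n_pairs independent stratified train/test index pairs."""
    rng = random.Random(seed)
    return [stratified_split(labels, test_fraction, rng)
            for _ in range(n_pairs)]


strategies = enumerate_strategies()
toy_labels = ["A"] * 10 + ["B"] * 10 + ["C"] * 10  # toy labels, not real data
pairs = resample_pairs(toy_labels)
```

Each strategy in `strategies` would be applied to the chromatograms before a CART model is fitted on every training set in `pairs` and scored on the matching testing set; the preprocessing and tree-fitting steps themselves are left out here because the abstract does not specify them.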
Pages: 11