Statistical Depth for Text Data: An Application to the Classification of Healthcare Data

Cited by: 2
Authors
Bolivar, Sergio [1]
Nieto-Reyes, Alicia [1]
Rogers, Heather L. [2, 3]
Affiliations
[1] Univ Cantabria, Dept Math Stat & Comp Sci, Santander 39005, Spain
[2] Biocruces Bizkaia Hlth Res Inst, Baracaldo 48903, Spain
[3] Basque Fdn Sci, IKERBASQUE, Bilbao 48013, Spain
Keywords
compositional depth; multivariate data; natural language processing; qualitative data; statistical depth; supervised classification; text mining; distance
DOI
10.3390/math11010228
Chinese Library Classification
O1 [Mathematics]
Discipline classification codes
0701; 070101
Abstract
This manuscript introduces a new statistical depth function: the compositional D-depth. It is the first data depth developed exclusively for text data, in particular for data vectorized according to a frequency-based criterion, such as the tf-idf (term frequency-inverse document frequency) statistic, which results in most vector entries taking a value of zero. The proposed data depth takes the inverse discrete Fourier transform of the vectorized text fragments and then applies a statistical depth for functional data, D. This depth is intended to address the sparsity of the numerical features that result from transforming qualitative text data into quantitative data, a common procedure in most natural language processing frameworks. This sparsity hinders the use of traditional statistical depths and machine learning techniques for classification purposes. To demonstrate the potential value of the proposal, it is applied to a real-world case study that involves mapping Consolidated Framework for Implementation Research (CFIR) constructs to qualitative healthcare data. It is shown that the DDG-classifier yields competitive results and outperforms all studied traditional machine learning techniques (logistic regression with LASSO regularization, artificial neural networks, decision trees, and support vector machines) when used in combination with the newly defined compositional D-depth.
Pages: 20
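The pipeline described in the abstract (tf-idf vectorization, inverse discrete Fourier transform, then a functional depth D) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy text fragments are invented, the helper names (tfidf_matrix, inverse_dft, integrated_depth) are hypothetical, keeping only the real part of the transform is a simplifying assumption, and a Fraiman-Muniz style integrated depth is swapped in as one possible choice of D, since the abstract does not say which functional depth is used.

```python
import cmath
import math
from collections import Counter


def tfidf_matrix(docs):
    """Vectorize tokenized documents with a plain tf-idf weighting."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([
            (counts[t] / len(doc)) * math.log(n_docs / df[t]) if counts[t] else 0.0
            for t in vocab
        ])
    return rows


def inverse_dft(vec):
    """Inverse discrete Fourier transform; returns complex samples."""
    n = len(vec)
    return [
        sum(v * cmath.exp(2j * cmath.pi * k * t / n) for k, v in enumerate(vec)) / n
        for t in range(n)
    ]


def integrated_depth(curves):
    """Fraiman-Muniz style depth: mean over t of 1 - |1/2 - F_t(x(t))|."""
    n, m = len(curves), len(curves[0])
    depths = []
    for x in curves:
        total = 0.0
        for t in range(m):
            # Empirical distribution function of the sample at point t.
            ecdf = sum(c[t] <= x[t] for c in curves) / n
            total += 1.0 - abs(0.5 - ecdf)
        depths.append(total / m)
    return depths


# Invented toy fragments, standing in for qualitative healthcare text.
docs = [
    "patients reported barriers to care".split(),
    "staff reported lack of training".split(),
    "leadership supported the new care program".split(),
    "training program improved staff confidence".split(),
]

X = tfidf_matrix(docs)  # sparse: most entries are zero
# Assumption: only the real part of each transformed vector is kept.
curves = [[z.real for z in inverse_dft(row)] for row in X]
depths = integrated_depth(curves)
```

The transform spreads each sparse tf-idf vector over all sample points, so the resulting curves are dense and a functional depth can rank them. In a depth-based classifier, such depths would be computed with respect to each class separately and then compared.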