Statistical Depth for Text Data: An Application to the Classification of Healthcare Data

Cited by: 2
Authors
Bolivar, Sergio [1]
Nieto-Reyes, Alicia [1]
Rogers, Heather L. [2, 3]
Affiliations
[1] Univ Cantabria, Dept Math Stat & Comp Sci, Santander 39005, Spain
[2] Biocruces Bizkaia Hlth Res Inst, Baracaldo 48903, Spain
[3] Basque Fdn Sci, IKERBASQUE, Bilbao 48013, Spain
Keywords
compositional depth; multivariate data; natural language processing; qualitative data; statistical depth; supervised classification; text mining; DISTANCE;
DOI
10.3390/math11010228
Chinese Library Classification
O1 [Mathematics]
Discipline classification codes
0701; 070101
Abstract
This manuscript introduces a new statistical depth function: the compositional D-depth. It is the first data depth developed exclusively for text data, in particular for data vectorized according to a frequency-based criterion, such as the tf-idf (term frequency-inverse document frequency) statistic, which results in most vector entries taking a value of zero. The proposed data depth takes the inverse discrete Fourier transform of the vectorized text fragments and then applies a statistical depth for functional data, D. This depth is intended to address the sparsity of the numerical features that results from transforming qualitative text data into quantitative data, a common procedure in most natural language processing frameworks. This sparsity hinders the use of traditional statistical depths and machine learning techniques for classification purposes. To demonstrate the potential value of the proposal, it is applied to a real-world case study that involves mapping Consolidated Framework for Implementation Research (CFIR) constructs to qualitative healthcare data. It is shown that the DDG-classifier yields competitive results and outperforms all studied traditional machine learning techniques (logistic regression with LASSO regularization, artificial neural networks, decision trees, and support vector machines) when used in combination with the newly defined compositional D-depth.
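The pipeline described in the abstract (sparse tf-idf vectors, inverse discrete Fourier transform, then a functional depth D) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the toy documents and labels are invented, the tf-idf weighting is a simple hand-rolled variant, the complex transform is turned into a real curve by stacking its real and imaginary parts, a Fraiman-Muniz-style integrated depth stands in for D, and a plain maximum-depth rule stands in for the DDG-classifier.

```python
import numpy as np

def tfidf(docs):
    """Tiny tf-idf vectorizer (illustrative; not the paper's preprocessing)."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: j for j, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.split():
            tf[i, index[w]] += 1
    df = (tf > 0).sum(axis=0)              # document frequency of each term
    idf = np.log(len(docs) / df) + 1.0     # assumed smoothing: +1 keeps idf > 0
    return tf * idf

def to_curves(X):
    """Inverse DFT of each row; real and imaginary parts stacked as one curve.

    Stacking is an assumption: the abstract does not say how the complex
    values are passed to the functional depth.
    """
    Z = np.fft.ifft(X, axis=1)
    return np.hstack([Z.real, Z.imag])

def integrated_depth(curves, sample):
    """Fraiman-Muniz-style depth of each curve with respect to `sample`.

    At every grid point the univariate depth 1 - |1/2 - F_n(x)| is computed
    from the empirical CDF F_n of the sample, then averaged over the grid.
    """
    F = (sample[None, :, :] <= curves[:, None, :]).mean(axis=1)
    return (1.0 - np.abs(0.5 - F)).mean(axis=1)

def max_depth_label(curve, curves, labels):
    """Assign the class in which the curve is deepest (not the DDG-classifier)."""
    classes = np.unique(labels)
    depths = [integrated_depth(curve[None, :], curves[labels == c])[0]
              for c in classes]
    return classes[int(np.argmax(depths))]

# Hypothetical text fragments with two hypothetical construct labels.
docs = [
    "staff lacked training on the new tool",
    "training for staff was too short",
    "staff need more training time",
    "leadership supported the new program",
    "the program had strong leadership support",
    "leadership backed the program early",
]
labels = np.array([0, 0, 0, 1, 1, 1])

X = tfidf(docs)                  # sparse: most entries are zero
curves = to_curves(X)            # dense curves after the inverse DFT
depths = integrated_depth(curves, curves)
predicted = max_depth_label(curves[0], curves[1:], labels[1:])
```

The inverse transform spreads each nonzero tf-idf entry over the whole grid, so the resulting curves are dense even though `X` is mostly zeros; this is the property the abstract relies on to sidestep sparsity.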
Pages: 20