An analysis of hierarchical text classification using word embeddings

Cited by: 130
Authors
Stein, Roger Alan [1]
Jaques, Patricia A. [1]
Valiati, Joao Francisco [2]
Affiliations
[1] Univ Vale Rio Sinos UNISINOS, Programa Posgrad Comp Aplicada PPGCA, Av Unisinos 950, Sao Leopoldo, RS, Brazil
[2] AIE, Rua Vieira Castro 262, Porto Alegre, RS, Brazil
Keywords
Hierarchical text classification; Word embeddings; Gradient tree boosting; fastText; Support vector machines;
DOI
10.1016/j.ins.2018.09.001
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Efficient distributed numerical word representation models (word embeddings) combined with modern machine learning algorithms have recently yielded considerable improvement on automatic document classification tasks. However, the effectiveness of such techniques has not yet been assessed for hierarchical text classification (HTC). This study investigates the application of those models and algorithms to this specific problem by means of experimentation and analysis. We trained classification models with prominent machine learning algorithm implementations (fastText, XGBoost, SVM, and Keras' CNN) and notable word embedding generation methods (GloVe, word2vec, and fastText) on publicly available data, and evaluated them with measures specifically appropriate for the hierarchical context. FastText achieved an LCA-F1 of 0.893 on a single-labeled version of the RCV1 dataset. The analysis indicates that using word embeddings and their variants is a very promising approach for HTC. © 2018 Elsevier Inc. All rights reserved.
Pages: 216-232
Number of pages: 17
Related papers
50 in total
  • [31] Genre Classification using Word Embeddings and Deep Learning
    Kumar, Akshi
    Rajpal, Arjun
    Rathore, Dushyant
    2018 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, COMMUNICATIONS AND INFORMATICS (ICACCI), 2018, : 2142 - 2146
  • [32] Combining Dual Word Embeddings with Open Directory Project based Text Classification
    Aliyeva, Dinara
    Kim, Kang-Min
    Choi, Byung-Ju
    Lee, SangKeun
    PROCEEDINGS OF 2018 IEEE 17TH INTERNATIONAL CONFERENCE ON COGNITIVE INFORMATICS & COGNITIVE COMPUTING (ICCI*CC 2018), 2018, : 179 - 186
  • [33] Incorporating Word Embeddings in the Hierarchical Dirichlet Process for Query-Oriented Text Summarization
    Van Lierde, Hadrien
    Chow, Tommy W. S.
    2017 IEEE 15TH INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS (INDIN), 2017, : 1037 - 1042
  • [34] A survey of word embeddings for clinical text
    Khattak F.K.
    Jeblee S.
    Pou-Prom C.
    Abdalla M.
    Meaney C.
    Rudzicz F.
    Journal of Biomedical Informatics: X, 2019, 4
  • [35] Effect of Text Color on Word Embeddings
    Ikoma, Masaya
    Iwana, Brian Kenji
    Uchida, Seiichi
    DOCUMENT ANALYSIS SYSTEMS, 2020, 12116 : 341 - 355
  • [36] Hierarchical Image Classification using Entailment Cone Embeddings
    Dhall, Ankit
    Makarova, Anastasia
    Ganea, Octavian
    Pavllo, Dario
    Greeff, Michael
    Krause, Andreas
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2020), 2020, : 3649 - 3658
  • [38] Text Classification with Document Embeddings
    Huang, Chaochao
    Qiu, Xipeng
    Huang, Xuanjing
    CHINESE COMPUTATIONAL LINGUISTICS AND NATURAL LANGUAGE PROCESSING BASED ON NATURALLY ANNOTATED BIG DATA, CCL 2014, 2014, 8801 : 131 - 140
  • [39] Improving the accuracy using pre-trained word embeddings on deep neural networks for Turkish text classification
    Aydogan, Murat
    Karci, Ali
    PHYSICA A-STATISTICAL MECHANICS AND ITS APPLICATIONS, 2020, 541
  • [40] HistorEx: Exploring Historical Text Corpora Using Word and Document Embeddings
    Mueller, Sven
    Brunzel, Michael
    Kaun, Daniela
    Biswas, Russa
    Koutraki, Maria
    Tietz, Tabea
    Sack, Harald
    SEMANTIC WEB: ESWC 2019 SATELLITE EVENTS, 2019, 11762 : 136 - 140