Arabic Text Classification Based on Word and Document Embeddings

Cited by: 9
Authors
El Mahdaouy, Abdelkader [1, 2]
Gaussier, Eric [1]
El Alaoui, Said Ouatik [2]
Affiliations
[1] Grenoble Alpes Univ, CNRS, LIG, AMA, Grenoble, France
[2] USMBA, FSDM, LIM, Dept Comp Sci, Fes, Morocco
Keywords
Arabic text classification; Arabic natural language processing; Document embeddings; Word embeddings; Skip-gram; Continuous Bag-of-Words; GloVe; Doc2Vec
DOI
10.1007/978-3-319-48308-5_4
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Word embeddings have recently been introduced as a major breakthrough in Natural Language Processing (NLP) for learning viable representations of linguistic items from contextual information and/or word co-occurrence. In this paper, we investigate Arabic document classification using word and document embeddings as the representational basis, rather than relying on text preprocessing and a bag-of-words representation. We demonstrate that document embeddings outperform text preprocessing techniques, whether they are learned with Doc2Vec or constructed by simply averaging word vectors. Moreover, the results show that classification accuracy is relatively insensitive to the parameters used to learn the word and document vectors.
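The abstract mentions building a document embedding by averaging word vectors. The following is a minimal sketch of that general idea only, not the authors' implementation: the function name `average_embedding`, the two-dimensional toy vectors, and the transliterated placeholder tokens are all assumptions made for illustration. In practice the vectors would come from a trained model such as Skip-gram, CBOW, or GloVe.

```python
from typing import Dict, List, Sequence


def average_embedding(
    tokens: Sequence[str],
    vectors: Dict[str, Sequence[float]],
    dim: int,
) -> List[float]:
    """Return the mean of the vectors of all in-vocabulary tokens.

    Out-of-vocabulary tokens are skipped; if no token is known,
    a zero vector of length `dim` is returned.
    """
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = vectors.get(tok)
        if vec is None:
            continue  # skip tokens without a learned vector
        for i, value in enumerate(vec):
            total[i] += value
        count += 1
    if count == 0:
        return total
    return [x / count for x in total]


# Hypothetical toy vectors (made-up values, transliterated placeholder tokens).
TOY_VECTORS = {
    "kura": [1.0, 0.0],     # placeholder for a sports-related word
    "mubarat": [0.8, 0.2],  # placeholder for a sports-related word
    "iqtisad": [0.0, 1.0],  # placeholder for an economy-related word
}

# The last token is deliberately out of vocabulary and is ignored.
doc_tokens = ["kura", "mubarat", "unknown_token"]
doc_vec = average_embedding(doc_tokens, TOY_VECTORS, dim=2)
print(doc_vec)
```

The resulting fixed-length vector can then be passed to any standard classifier as the document's feature representation.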
Pages: 32-41
Page count: 10