Arabic Text Classification Based on Word and Document Embeddings

Cited by: 10
Authors
El Mahdaouy, Abdelkader [1, 2]
Gaussier, Eric [1]
El Alaoui, Said Ouatik [2]
Affiliations
[1] Grenoble Alpes Univ, CNRS, LIG, AMA, Grenoble, France
[2] USMBA, FSDM, LIM, Dept Comp Sci, Fes, Morocco
Source
PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ADVANCED INTELLIGENT SYSTEMS AND INFORMATICS 2016 | 2017 / Vol. 533
Keywords
Arabic text classification; Arabic natural language processing; Document embeddings; Word embeddings; Skip-gram; Continuous Bag-of-Words; GloVe; Doc2vec
DOI
10.1007/978-3-319-48308-5_4
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recently, word embeddings have been introduced as a major breakthrough in Natural Language Processing (NLP) for learning viable representations of linguistic items based on contextual information and/or word co-occurrence. In this paper, we investigate Arabic document classification using word and document embeddings as the representational basis, rather than relying on text preprocessing and a bag-of-words representation. We demonstrate that document embeddings outperform text preprocessing techniques, whether they are learned with Doc2Vec or built by averaging word vectors, a simple method for document embedding construction. Moreover, the results show that classification accuracy is less sensitive to the learning parameters of the word and document vectors.
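The averaging approach mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `average_document_embedding`, the toy three-dimensional vectors, the transliterated tokens, and the choice to skip out-of-vocabulary tokens and return a zero vector for empty documents are all assumptions made for illustration.

```python
import numpy as np


def average_document_embedding(tokens, word_vectors, dim):
    """Average the vectors of in-vocabulary tokens (hypothetical helper).

    Tokens missing from `word_vectors` are skipped; a document with no
    known tokens maps to the zero vector. Both choices are assumptions.
    """
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# Toy word vectors standing in for trained Skip-gram, CBOW, or GloVe vectors.
toy_vectors = {
    "kitab": np.array([1.0, 0.0, 2.0]),
    "madrasa": np.array([3.0, 2.0, 0.0]),
}

# The third token is out of vocabulary and is ignored by the average.
doc_tokens = ["kitab", "madrasa", "unknown_token"]
doc_vec = average_document_embedding(doc_tokens, toy_vectors, dim=3)
empty_vec = average_document_embedding([], toy_vectors, dim=3)
```

The resulting fixed-length vector can then be passed to any standard classifier in place of a bag-of-words feature vector.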
Pages: 32-41
Number of pages: 10