Combining n-grams and deep convolutional features for language variety classification

Cited by: 5
Authors
Martinc, Matej [1]
Pollak, Senja [1, 2]
Affiliations
[1] Jozef Stefan Inst, Dept Knowledge Technol, Ljubljana, Slovenia
[2] Univ Edinburgh, Usher Inst, Edinburgh Med Sch, Usher Inst Populat Hlth Sci & Informat, Edinburgh, Midlothian, Scotland
Keywords
language variety; author profiling; text classification; convolutional neural network; bag-of-n-grams
DOI
10.1017/S1351324919000299
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper presents a novel neural architecture capable of outperforming state-of-the-art systems on the task of language variety classification. The architecture is a hybrid that combines character-based convolutional neural network (CNN) features with weighted bag-of-n-grams (BON) features and is therefore capable of leveraging both character-level and document/corpus-level information. We tested the system on the Discriminating between Similar Languages (DSL) language variety benchmark data set from the VarDial 2017 DSL shared task, which contains data from six different language groups, as well as on two smaller data sets (the Arabic Dialect Identification (ADI) Corpus and the German Dialect Identification (GDI) Corpus, from the VarDial 2016 ADI and VarDial 2018 GDI shared tasks, respectively). We managed to outperform the winning system in the DSL shared task by a margin of about 0.4 percentage points and the winning system in the ADI shared task by a margin of about 0.2 percentage points in terms of weighted F1 score without conducting any language group-specific parameter tweaking. An ablation study suggests that weighted BON features contribute more to the overall performance of the system than the CNN-based features, which partially explains the uncompetitiveness of deep learning approaches in the past VarDial DSL shared tasks. Finally, we have implemented our system in a workflow, available in the ClowdFlows platform, in order to make it easily available also to the non-programming members of the research community.
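The abstract reports its margins in terms of weighted F1 score. As a minimal sketch of how that metric is commonly computed (per-class F1 averaged with weights proportional to each class's share of the true labels), the following may help; the function name `weighted_f1` and the toy labels are illustrative assumptions, not taken from the paper or from the shared-task evaluation scripts.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (illustrative helper)."""
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        # Count true positives and false positives for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp  # true instances of this class that were missed
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class's F1 by its share of the true labels.
        score += (n / total) * f1
    return score


# Toy example with hypothetical language-variety labels.
gold = ["hr", "hr", "sr", "bs"]
pred = ["hr", "sr", "sr", "bs"]
print(round(weighted_f1(gold, pred), 4))
```

Because each class is weighted by its support, frequent varieties influence the score more than rare ones, which is worth keeping in mind when comparing margins of a few tenths of a percentage point.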
Pages: 607-632
Page count: 26
    Keselj, Vlado
    [J]. 2020 19TH INTERNATIONAL SYMPOSIUM INFOTEH-JAHORINA (INFOTEH), 2020,