Combining n-grams and deep convolutional features for language variety classification

Cited by: 5

Authors:

Martinc, Matej [1]

Pollak, Senja [1,2]

Affiliations:

[1] Jozef Stefan Inst, Dept Knowledge Technol, Ljubljana, Slovenia

[2] Univ Edinburgh, Usher Inst, Edinburgh Med Sch, Usher Inst Populat Hlth Sci & Informat, Edinburgh, Midlothian, Scotland

Source:

NATURAL LANGUAGE ENGINEERING | 2019 / Vol. 25 / Issue 05

Keywords:

language variety; author profiling; text classification; convolutional neural network; bag-of-n-grams;

DOI:

10.1017/S1351324919000299

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper presents a novel neural architecture capable of outperforming state-of-the-art systems on the task of language variety classification. The architecture is a hybrid that combines character-based convolutional neural network (CNN) features with weighted bag-of-n-grams (BON) features and is therefore capable of leveraging both character-level and document/corpus-level information. We tested the system on the Discriminating between Similar Languages (DSL) language variety benchmark data set from the VarDial 2017 DSL shared task, which contains data from six different language groups, as well as on two smaller data sets (the Arabic Dialect Identification (ADI) Corpus and the German Dialect Identification (GDI) Corpus, from the VarDial 2016 ADI and VarDial 2018 GDI shared tasks, respectively). We managed to outperform the winning system in the DSL shared task by a margin of about 0.4 percentage points and the winning system in the ADI shared task by a margin of about 0.2 percentage points in terms of weighted F1 score without conducting any language group-specific parameter tweaking. An ablation study suggests that weighted BON features contribute more to the overall performance of the system than the CNN-based features, which partially explains the uncompetitiveness of deep learning approaches in the past VarDial DSL shared tasks. Finally, we have implemented our system in a workflow, available in the ClowdFlows platform, in order to make it easily available also to the non-programming members of the research community.
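The margins reported in the abstract are expressed in weighted F1. As a reminder of what that metric computes, here is a minimal sketch assuming the standard definition (per-class F1 averaged with weights proportional to each class's number of gold examples). The function name `weighted_f1` and the toy labels are hypothetical illustrations, not taken from the paper or the shared-task evaluation scripts.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (standard definition, assumed here)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)      # gold examples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # true positives per class
    total = len(y_true)
    score = 0.0
    for label, n_gold in support.items():
        tp = correct[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / n_gold
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Each class contributes in proportion to its share of the gold labels.
        score += (n_gold / total) * f1
    return score


# Hypothetical toy labels for illustration only.
gold = ["pt-BR", "pt-BR", "pt-PT", "pt-PT", "pt-PT", "es-ES"]
pred = ["pt-BR", "pt-PT", "pt-PT", "pt-PT", "pt-BR", "es-ES"]
print(round(weighted_f1(gold, pred), 4))
```

Because the weights follow class frequency, larger classes dominate the score, so a gain of a fraction of a percentage point on a large benchmark can still reflect many reclassified documents.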


Pages: 607-632

Page count: 26
