Distributing N-Gram Graphs for Classification

Cited by: 2
Authors
Kontopoulos, Ioannis [1]
Giannakopoulos, George [1]
Varlamis, Iraklis [2]
Affiliations
[1] NCSR Demokritos, Inst Informat & Telecommun, Aghia Paraskevi, Greece
[2] Harokopio Univ Athens, Dept Informat & Telemat, Kallithea, Greece
Keywords
Distributed processing; N-gram graphs; Text classification; MapReduce
DOI
10.1007/978-3-319-67162-8_1
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
N-gram models are an established choice for language modeling in machine translation, summarization and other tasks. More recently, n-gram graphs have been shown to capture significant language characteristics that go beyond mere vocabulary and grammar, for tasks such as text classification. This work proposes an efficient distributed implementation of the n-gram graph framework on Apache Spark, named ARGOT. The performance of the implementation is evaluated on a demanding text classification task, where the n-gram graphs are used to extract features for a supervised classifier. An experimental study shows that the proposed implementation scales to large text corpora and can take advantage of a varying number of processing cores.
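To make the idea of graph-derived features concrete, the following is a minimal single-machine sketch in Python. It is not ARGOT's code and does not use Apache Spark. It assumes character n-grams as nodes, directed edges between n-grams that co-occur within a fixed window, and a containment-style similarity between a document graph and a per-class graph as the feature value. The function names, the window size, and the choice of similarity are illustrative assumptions, not details taken from the abstract.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build a character n-gram graph as a Counter of weighted directed edges.

    Assumption: an edge (a, b) is counted each time n-gram b starts within
    `window` positions after n-gram a.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, src in enumerate(grams):
        for dst in grams[i + 1:i + 1 + window]:
            edges[(src, dst)] += 1
    return edges


def containment_similarity(g1, g2):
    """Fraction of shared edges, normalized by the smaller graph's edge count."""
    if not g1 or not g2:
        return 0.0
    shared = sum(1 for edge in g1 if edge in g2)
    return shared / min(len(g1), len(g2))


def class_features(document, class_graphs, n=3, window=3):
    """One similarity value per class graph, usable as a feature vector."""
    doc_graph = ngram_graph(document, n, window)
    return [containment_similarity(doc_graph, g) for g in class_graphs]


if __name__ == "__main__":
    # Hypothetical toy "class" texts; a real setup would merge many documents.
    sports = ngram_graph("the team won the football match")
    finance = ngram_graph("the bank raised the interest rate")
    print(class_features("a football team", [sports, finance]))
```

In a distributed setting the per-document graph construction is independent across documents, which is the kind of work that maps naturally onto a data-parallel engine; how ARGOT actually partitions graphs and merges them is not stated in the abstract.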
Pages: 3-11
Page count: 9