Algorithm for Updating n-Grams Word Dictionary for Web Classification

Authors
Abidin, Taufik Fuadi [1]
Ferdhiana, Ridha [2]
Affiliations
[1] Syiah Kuala Univ, Fac Math & Nat Sci, Dept Informat, Darussalam, Banda Aceh, Indonesia
[2] Syiah Kuala Univ, Fac Math & Nat Sci, Dept Stat, Darussalam, Banda Aceh, Indonesia
Keywords
algorithm; n-grams dictionary; web classification; classification accuracy; f-measure
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
In this paper, we examine an algorithm for updating an n-grams word dictionary (thesaurus) and evaluate its effectiveness on a binary classification problem. The thesaurus is used as a reference for generating the numerical feature attributes of web pages. Typically, the n-grams word dictionary is built once from a set of training data and its content is never updated. Its content is therefore static, and its coverage is limited to the word n-grams found in the initial training set. The content of a thesaurus should instead be dynamic, especially because the n-grams word dictionary is used repeatedly as a reference when generating the numerical feature attributes of web pages. We argue that a dynamic thesaurus is better than a static one in the long term. The n-grams word dictionary should therefore be updated frequently with new data, without degrading classification accuracy. We validate the proposed algorithm on several test sets, each of which contains one hundred web pages, except for the last one. The experimental results show that the proposed algorithm works well. On average, the accuracy of the feature dataset generated using the existing (old) dictionary is 57.75%, while the accuracy of the feature dataset generated using the updated (new) dictionary is 76.75%. This is a gain of 19 percentage points, that is, a relative increase in classification accuracy of about 32.90%.
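The abstract does not spell out the update algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the dictionary is a plain table of word n-gram counts, that an update merges the counts from new pages into a copy of the old table, and that a page's numerical features are its n-gram counts against the dictionary. All function names (`word_ngrams`, `build_dictionary`, `update_dictionary`, `features`, `relative_gain`) are hypothetical. The last function reproduces the relative-increase arithmetic from the reported accuracies.

```python
from collections import Counter
from typing import Iterable, List


def word_ngrams(text: str, n: int) -> List[str]:
    """Lowercase the text, split on whitespace, and return its word n-grams."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_dictionary(pages: Iterable[str], n: int = 2) -> Counter:
    """Build an n-gram -> count table from an initial set of pages."""
    counts: Counter = Counter()
    for page in pages:
        counts.update(word_ngrams(page, n))
    return counts


def update_dictionary(dictionary: Counter, new_pages: Iterable[str], n: int = 2) -> Counter:
    """Return a new dictionary with n-grams from new pages merged in.

    Assumption: an update simply adds counts; the paper may filter or
    weight entries differently.
    """
    updated = dictionary.copy()
    updated.update(build_dictionary(new_pages, n))
    return updated


def features(page: str, dictionary: Counter, n: int = 2) -> List[int]:
    """Count how often each dictionary n-gram (in sorted order) occurs in the page."""
    page_counts = Counter(word_ngrams(page, n))
    return [page_counts[gram] for gram in sorted(dictionary)]


def relative_gain(old: float, new: float) -> float:
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


old_dict = build_dictionary(["cheap flights to bali"])
new_dict = update_dictionary(old_dict, ["cheap hotels in bali", "cheap flights today"])

page = "cheap hotels in bali"
print(features(page, old_dict))  # no n-gram of this page is in the old dictionary
print(features(page, new_dict))  # the updated dictionary covers the page's n-grams
print(round(relative_gain(57.75, 76.75), 2))
```

With a static dictionary, a page whose n-grams were never seen in training maps to an all-zero feature vector, which is the coverage problem the abstract describes. Returning a copy from `update_dictionary` keeps the old dictionary available for a side-by-side comparison like the one reported above.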
Pages: 432-436
Number of pages: 5