An Ensemble Keyword Extraction Model for News Texts with Statistical and Graphical Features

Authors
Abibullayeva, Aiman [1 ]
Kilic, Huma [2 ]
Cetin, Aydin [2 ]
Affiliations
[1] Akhmet Yassawi Univ, Fac Engn, Comp Engn Dept, Turkistan, Kazakhstan
[2] Gazi Univ, Fac Technol, Comp Engn Dept, TR-06500 Ankara, Turkiye
Keywords
Keyword extraction; ensemble classification; statistical; graph-based
DOI
10.1142/S0218194024500128
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Keyword extraction is an essential tool for many text mining applications such as automatic indexing, summarization, classification, clustering and automatic filtering. Automating it has become necessary because the volume of text to be accessed and processed daily over the Internet has increased tremendously, e.g. millions of news articles are published online every day. In this paper, a novel ensemble model for automatic extraction of keywords from news articles is proposed. The proposed model handles keyword extraction as a sequence labeling task. Two sub-modules, which represent the statistical and graphical features as scores calculated for each input token, are combined in the token classification module. The Ensemble Token Classification module was trained and tested separately with the ensemble algorithms Random Forest, XGBoost, Decision Tree and Voting Classification. For training, we collected two news datasets from Kazakh and Russian news sites published in the Cyrillic alphabet. We also collected an Arabic news dataset, ArabianNews. The performance of the model was also evaluated on the 500N-KPCrowd dataset, which is widely used in the literature and consists of English news content in the Latin alphabet. The proposed model achieved its best F1-scores of 0.71 and 0.86 on the 500N-KPCrowd and Russian datasets, respectively. We attained the best F1-score (0.97) with the KazakhNews and ArabianNews datasets.
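As a rough illustration of the two feature families mentioned in the abstract, the sketch below computes one statistical score (relative term frequency) and one graph-based score (PageRank over a sliding-window co-occurrence graph) for every input token. The abstract does not say which scores the paper actually uses, so the choice of features, the window size, the damping factor and all function names here are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def term_frequency(tokens):
    """Statistical feature (assumed): relative frequency of each token type."""
    counts = Counter(tokens)
    total = len(tokens)
    return {tok: c / total for tok, c in counts.items()}


def cooccurrence_graph(tokens, window=2):
    """Undirected graph linking tokens that occur within `window` positions."""
    graph = defaultdict(set)
    for i, tok in enumerate(tokens):
        graph[tok]  # touch the key so isolated tokens still become nodes
        for other in tokens[i + 1 : i + 1 + window]:
            if other != tok:
                graph[tok].add(other)
                graph[other].add(tok)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Graph feature (assumed): power-iteration PageRank on the graph."""
    n = len(graph)
    scores = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_scores = {}
        for node in graph:
            # Each neighbour shares its score equally among its own neighbours.
            incoming = sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            new_scores[node] = (1 - damping) / n + damping * incoming
        scores = new_scores
    return scores


def token_features(tokens, window=2):
    """Return one (token, statistical score, graph score) tuple per token."""
    tf = term_frequency(tokens)
    pr = pagerank(cooccurrence_graph(tokens, window))
    return [(tok, tf[tok], pr[tok]) for tok in tokens]


if __name__ == "__main__":
    sample = "keyword extraction helps keyword search".split()
    for tok, stat_score, graph_score in token_features(sample):
        print(f"{tok:12s} tf={stat_score:.2f} pagerank={graph_score:.3f}")
```

In a sequence labeling setup like the one the abstract describes, per-token score vectors of this kind would be passed, together with keyword/non-keyword labels, to a classifier such as Random Forest or a voting ensemble; that training step is omitted here.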
Pages: 1047-1061
Page count: 15