Named Entities as Key Features for Detecting Semantically Similar News Articles

Cited by: 0
Authors
Novo, Anne Stockem [1]
Gedikli, Fatih [1]
Affiliations
[1] Ruhr West Univ Appl Sci, Inst Comp Sci, Duisburger Str 100, D-45479 Mulheim, Germany
Keywords
Near-duplicate detection; news articles; explainability; BERT; SHAP
DOI
10.1142/S1793351X23300030
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This work focuses on detecting semantically similar news articles for search engines and recommender systems, an important step towards processing and understanding natural language. Search engines and recommender systems typically filter out near-duplicate articles, which are often just paraphrases of an earlier article and therefore irrelevant to users. Articles with a high level of overlapping content are of little interest to the reader and should be avoided. Here, we focus on named entities, such as people, organizations and places, and their role as a key feature for identifying near-duplicate articles. Since our dataset from the energy business contains a significant number of paraphrased articles, standard techniques, e.g. those based on the Jaccard coefficient, already perform quite well. A fine-tuned BERT model evaluated on named entities achieves the best results, with more than 97% accuracy and the highest true positive rates. The importance of individual words for the model's decisions is evaluated by computing their Shapley values. The resulting explanations are in good overall agreement with intuitive human interpretation.
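The Jaccard-coefficient baseline mentioned in the abstract can be illustrated with a minimal sketch. It assumes the named entities of each article have already been extracted by some NER step; the function names, the sample entity lists, and the 0.5 threshold are hypothetical and are not taken from the paper, whose exact baseline configuration is not given here.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections, compared as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty sets count as dissimilar.
        return 0.0
    return len(set_a & set_b) / len(union)


def is_near_duplicate(entities_a, entities_b, threshold=0.5):
    """Flag two articles as near-duplicates when their entity overlap is high.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return jaccard(entities_a, entities_b) >= threshold


# Made-up entity lists standing in for the output of an NER step.
article_1 = ["Acme Energy", "Berlin", "Jane Doe", "Grid Agency"]
article_2 = ["Acme Energy", "Berlin", "Jane Doe", "Power Council"]
article_3 = ["Northwind Oil", "Lisbon"]

print(jaccard(article_1, article_2))            # 3 shared of 5 distinct entities
print(is_near_duplicate(article_1, article_2))
print(is_near_duplicate(article_1, article_3))
```

Comparing entity sets rather than full token sets keeps the score insensitive to paraphrased wording, which is the intuition behind treating named entities as a key feature.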
Pages: 633-649
Page count: 17