Named Entities as Key Features for Detecting Semantically Similar News Articles

Authors
Novo, Anne Stockem [1]
Gedikli, Fatih [1]
Affiliations
[1] Ruhr West Univ Appl Sci, Inst Comp Sci, Duisburger Str 100, D-45479 Mulheim, Germany
Keywords
Near-duplicate detection; news articles; explainability; BERT; SHAP
DOI
10.1142/S1793351X23300030
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This work focuses on detecting semantically similar news articles for search engines and recommender systems, an important step towards processing and understanding natural language. Search engines and recommender systems typically filter out near-duplicate articles, which are often just a paraphrase of a previous article and therefore irrelevant to users. Articles with a high level of overlapping content are of no interest to the reader and should be avoided. Here, we focus on named entities, such as people, organizations and places, and their role as a key feature for identifying near-duplicate articles. Since our dataset from the energy business contains a significant number of paraphrased articles, standard techniques, e.g. those based on the Jaccard coefficient, already perform quite well. A fine-tuned BERT model evaluated on named entities achieves the best results, with more than 97% accuracy and the highest True Positive Rates. The importance of individual words for the model's decisions is evaluated by computing their Shapley values. The explanations were found to be in good overall agreement with intuitive human interpretation.
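The Jaccard-coefficient baseline mentioned in the abstract can be illustrated with a minimal sketch. It assumes the named entities of each article have already been extracted into sets. The entity names, the `jaccard` helper and the `THRESHOLD` value are hypothetical illustrations, not taken from the paper.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; returns 0.0 if both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical named-entity sets for two articles (made-up placeholder names).
entities_a = {"Acme Energy", "Berlin", "Jane Doe", "Nordwind AG"}
entities_b = {"Acme Energy", "Berlin", "Jane Doe", "Hamburg"}

# Assumed decision threshold; the paper's actual value is not given in the abstract.
THRESHOLD = 0.5

score = jaccard(entities_a, entities_b)  # 3 shared entities out of 5 distinct ones
is_near_duplicate = score >= THRESHOLD
print(f"Jaccard = {score:.2f}, near-duplicate: {is_near_duplicate}")
```

With these placeholder sets, three of five distinct entities are shared, so the pair scores 0.6 and would be flagged under the assumed threshold.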
Pages: 633-649
Number of pages: 17
Related papers
  • [31] Automatic Discovering Success Factor Relationship Entities in Articles using Named Entity Recognition
    Niboonkit, Supattra
    Krathu, Worarat
    Padungweang, Praisan
    2017 9TH INTERNATIONAL CONFERENCE ON KNOWLEDGE AND SMART TECHNOLOGY (KST), 2017, : 238 - 241
  • [32] Named-Entity Recognition for Disaster Related Filipino News Articles
    Dela Cruz, Bern Maris
    Montalla, Cyril
    Manansala, Allysa
    Rodriguez, Ramon
    Octaviano, Manolito, Jr.
    Fabito, Bernie S.
    PROCEEDINGS OF TENCON 2018 - 2018 IEEE REGION 10 CONFERENCE, 2018, : 1633 - 1636
  • [33] Named Entity Oriented Difference Analysis of News Articles and Its Application
    Kiritoshi, Keisuke
    Ma, Qiang
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2016, E99D (04): : 906 - 917
  • [34] A knowledge-based approach to named entity disambiguation in news articles
    Nguyen, Hien T.
    Cao, Tru H.
    AI 2007: ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2007, 4830 : 619 - +
  • [35] Multilingual news document clustering: Two algorithms based on cognate named entities
    Montalvo, Soto
    Martinez, Raquel
    Casillas, Arantza
    Fresno, Victor
    TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2006, 4188 : 165 - 172
  • [36] Analyzing entities and topics in news articles using statistical topic models
    Newman, David
    Chemudugunta, Chaitanya
    Smyth, Padhraic
    Steyvers, Mark
    INTELLIGENCE AND SECURITY INFORMATICS, PROCEEDINGS, 2006, 3975 : 93 - 104
  • [37] Semantically find similar binary codes with mixed key instruction sequence
    Li, Yuancheng
    Wang, Boyan
    Hu, Baiji
    INFORMATION AND SOFTWARE TECHNOLOGY, 2020, 125
  • [38] Towards Detecting Political Bias in Hindi News Articles
    Agrawal, Samyak
    Gupta, Kshitij
    Gautam, Devansh
    Mamidi, Radhika
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022): STUDENT RESEARCH WORKSHOP, 2022, : 239 - 244
  • [39] Dbias: detecting biases and ensuring fairness in news articles
    Raza, Shaina
    Reji, Deepak John
    Ding, Chen
    INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS, 2024, 17 (01) : 39 - 59