Named Entities as Key Features for Detecting Semantically Similar News Articles

Authors
Novo, Anne Stockem [1]
Gedikli, Fatih [1]
Affiliations
[1] Ruhr West Univ Appl Sci, Inst Comp Sci, Duisburger Str 100, D-45479 Mulheim, Germany
Keywords
Near-duplicate detection; news articles; explainability; BERT; SHAP
DOI
10.1142/S1793351X23300030
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This work focuses on detecting semantically similar news articles for search engines and recommender systems, an important step towards processing and understanding natural language. Search engines and recommender systems typically filter out near-duplicate articles, which are often just paraphrases of an earlier article and therefore irrelevant to users. Articles with a high level of overlapping content are of little interest to the reader and should be avoided. Here, we focus on named entities, such as people, organizations and places, and their role as a key feature for identifying near-duplicate articles. Since our dataset from the energy business contains a significant number of paraphrased articles, standard techniques, e.g. those based on the Jaccard coefficient, already perform quite well. A fine-tuned BERT model evaluated on named entities achieves the best results, with more than 97% accuracy and the highest True Positive Rates. The importance of individual words for the model decisions is evaluated by computing their Shapley values. The explanations were found to be in good overall agreement with intuitive human interpretation.
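As an illustration of the Jaccard-based baseline mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the named entities of each article have already been extracted into sets of strings; the function names, the lowercase normalisation, the 0.5 threshold, and the sample entities are all hypothetical choices made for this example.

```python
def normalise(entities: set[str]) -> set[str]:
    """Lowercase and strip entity strings so trivial variants match (an assumed choice)."""
    return {e.strip().lower() for e in entities if e.strip()}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    if not union:
        # Two empty sets carry no evidence of similarity; returning 0.0 is an assumed convention.
        return 0.0
    return len(a & b) / len(union)


def is_near_duplicate(entities_a: set[str], entities_b: set[str],
                      threshold: float = 0.5) -> bool:
    """Flag two articles as near-duplicates when their entity overlap reaches the threshold.

    The threshold of 0.5 is a placeholder, not a value taken from the paper.
    """
    return jaccard(normalise(entities_a), normalise(entities_b)) >= threshold


# Hypothetical entity sets for two articles (made-up names).
article_1 = {"Acme Energy", "Berlin", "Jane Doe"}
article_2 = {"acme energy", "Berlin", "Northwind Power"}

score = jaccard(normalise(article_1), normalise(article_2))
duplicate = is_near_duplicate(article_1, article_2)
```

In this example the two articles share two of four distinct entities, so the score sits exactly at the placeholder threshold. In practice the threshold would need to be tuned on labelled article pairs.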
Pages: 633-649
Number of pages: 17