The PolitiFact-Oslo Corpus: A New Dataset for Fake News Analysis and Detection

Cited by: 4
Authors
Poldvere, Nele [1]
Uddin, Zia [2]
Thomas, Aleena [2]
Affiliations
[1] Univ Oslo, Dept Literature Area Studies & European Languages, N-0315 Oslo, Norway
[2] Sintef Digital, N-0373 Oslo, Norway
Keywords
corpus development; text type; sentiment; part-of-speech; Bi-LSTM; transformers
DOI
10.3390/info14120627
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
This study presents a new dataset for fake news analysis and detection, namely, the PolitiFact-Oslo Corpus. The corpus contains samples of both fake and real news in English, collected from the fact-checking website PolitiFact.com. It grew out of a need for a more controlled and effective dataset for fake news analysis and detection model development based on recent events. Three features make it uniquely placed for this: (i) the texts have been individually labelled for veracity by experts, (ii) they are complete texts that strictly correspond to the claims in question, and (iii) they are accompanied by important metadata such as text type (e.g., social media, news and blog). In relation to this, we present a pipeline for collecting quality data from major fact-checking websites, a procedure which can be replicated in future corpus building efforts. An exploratory analysis based on sentiment and part-of-speech information reveals interesting differences between fake and real news as well as between text types, thus highlighting the importance of adding contextual information to fake news corpora. Since the main application of the PolitiFact-Oslo Corpus is in automatic fake news detection, we critically examine the applicability of the corpus and another PolitiFact dataset built based on less strict criteria for various deep learning-based efficient approaches, such as Bidirectional Long Short-Term Memory (Bi-LSTM), LSTM fine-tuned transformers such as Bidirectional Encoder Representations from Transformers (BERT) and RoBERTa, and XLNet.
Pages: 32
Related papers
50 items in total
  • [1] Fake News Detection with the New German Dataset "GermanFakeNC"
    Vogel, Inna
    Jiang, Peter
    DIGITAL LIBRARIES FOR OPEN KNOWLEDGE, TPDL 2019, 2019, 11799 : 288 - 295
  • [2] FakeRecogna: A New Brazilian Corpus for Fake News Detection
    Garcia, Gabriel L.
    Afonso, Luis C. S.
    Papa, Joao P.
    COMPUTATIONAL PROCESSING OF THE PORTUGUESE LANGUAGE, PROPOR 2022, 2022, 13208 : 57 - 67
  • [3] Detection of fake news in a new corpus for the Spanish language
    Posadas-Duran, Juan-Pablo
    Gomez-Adorno, Helena
    Sidorov, Grigori
    Moreno Escobar, Jesus Jaime
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2019, 36 (05) : 4869 - 4876
  • [4] IFND: a benchmark dataset for fake news detection
    Dilip Kumar Sharma
    Sonal Garg
    Complex & Intelligent Systems, 2023, 9 : 2843 - 2863
  • [5] IFND: a benchmark dataset for fake news detection
    Sharma, Dilip Kumar
    Garg, Sonal
    COMPLEX & INTELLIGENT SYSTEMS, 2023, 9 (03) : 2843 - 2863
  • [6] Fake News vs Satire: A Dataset and Analysis
    Golbeck, Jennifer
    Mauriello, Matthew
    Auxier, Brooke
    Bhanushali, Keval H.
    Bonk, Christopher
    Bouzaghrane, Mohamed Amine
    Buntain, Cody
    Chanduka, Riya
    Cheakalos, Paul
    Everett, Jeannine B.
    Falak, Waleed
    Gieringer, Carl
    Graney, Jack
    Hoffman, Kelly M.
    Huth, Lindsay
    Ma, Zhenye
    Jha, Mayanka
    Khan, Misbah
    Kori, Varsha
    Lewis, Elo
    Mirano, George
    Mohn, William T.
    Mussenden, Sean
    Nelson, Tammie M.
    Mcwillie, Sean
    Pant, Akshat
    Shetye, Priya
    Shrestha, Rusha
    Steinheimer, Alexandra
    Subramanian, Aditya
    Visnansky, Gina
    WEBSCI'18: PROCEEDINGS OF THE 10TH ACM CONFERENCE ON WEB SCIENCE, 2018, : 17 - 21
  • [7] "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection
    Wang, William Yang
    PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 2, 2017, : 422 - 426
  • [8] LIMESODA: Dataset for Fake News Detection in Healthcare Domain
    Payoungkhamdee, Patomporn
    Porkaew, Peerachet
    Sinthunyathum, Atthasith
    Songphum, Phattharaphon
    Kawidam, Witsarut
    Loha-Udom, Wichayut
    Boonkwan, Prachya
    Sutantayawalee, Vipas
    16TH INTERNATIONAL JOINT SYMPOSIUM ON ARTIFICIAL INTELLIGENCE AND NATURAL LANGUAGE PROCESSING (ISAI-NLP 2021), 2021,
  • [9] Contributions to the Study of Fake News in Portuguese: New Corpus and Automatic Detection Results
    Monteiro, Rafael A.
    Santos, Roney L. S.
    Pardo, Thiago A. S.
    de Almeida, Tiago A.
    Ruiz, Evandro E. S.
    Vale, Oto A.
    COMPUTATIONAL PROCESSING OF THE PORTUGUESE LANGUAGE, PROPOR 2018, 2018, 11122 : 324 - 334
  • [10] Dataset for multimodal fake news detection and verification tasks
    Bondielli, Alessandro
    Dell'Oglio, Pietro
    Lenci, Alessandro
    Marcelloni, Francesco
    Passaro, Lucia
    DATA IN BRIEF, 2024, 54