MuMiN: A Large-Scale Multilingual Multimodal Fact-Checked Misinformation Social Network Dataset

Cited by: 29
Authors
Nielsen, Dan S. [1]
McConville, Ryan [1]
Affiliations
[1] Univ Bristol, Dept Engn Math, Bristol, Avon, England
Keywords
dataset; misinformation; graph; twitter; social network; fake news; NEWS;
DOI
10.1145/3477495.3531744
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Misinformation is becoming increasingly prevalent on social media and in news articles. It has become so widespread that we require algorithmic assistance utilising machine learning to detect such content. Training these machine learning models requires datasets of sufficient scale, diversity and quality. However, datasets in the field of automatic misinformation detection are predominantly monolingual, include a limited number of modalities and are not of sufficient scale and quality. Addressing this, we develop a data collection and linking system (MuMiN-trawl) to build a public misinformation graph dataset (MuMiN), containing rich social media data (tweets, replies, users, images, articles, hashtags) spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which has been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade. The dataset is made available as a heterogeneous graph via a Python package (mumin). We provide baseline results for two node classification tasks related to the veracity of a claim involving social media, and demonstrate that these are challenging tasks, with the highest macro-average F1-score being 62.55% and 61.45% for the two tasks, respectively. The MuMiN ecosystem is available at https://mumin-dataset.github.io/, including the data, documentation, tutorials and leaderboards.
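The abstract reports its baselines as macro-average F1 scores. As a rough illustration of that metric only (not the authors' evaluation code), the sketch below computes the per-class F1 and averages it with equal weight per class. The function name `macro_f1`, the label strings and the toy predictions are assumptions made for this example and are not taken from the dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro average)."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never occurs
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical toy labels, chosen only to make the arithmetic easy to follow.
y_true = ["misinformation"] * 4 + ["factual"] * 2
y_pred = ["misinformation", "misinformation", "misinformation",
          "factual", "factual", "misinformation"]

score = macro_f1(y_true, y_pred)
print(f"macro-average F1: {score:.4f}")  # prints "macro-average F1: 0.6250"
```

Here the per-class F1 values are 0.5 ("factual") and 0.75 ("misinformation"), so the macro average is 0.625. Because each class counts equally, a classifier cannot score well by predicting only the majority class, which is one reason the metric suits imbalanced label distributions.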
Pages: 3141-3153
Page count: 13
Related papers
50 in total
  • [1] Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims
    Srba, Ivan
    Pecher, Branislav
    Tomlein, Matus
    Moro, Robert
    Stefancova, Elena
    Simko, Jakub
    Bielikova, Maria
    PROCEEDINGS OF THE 45TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '22), 2022, : 2949 - 2959
  • [2] MultiSubs: A Large-scale Multimodal and Multilingual Dataset
    Wang, Josiah
    Figueiredo, Josiel
    Specia, Lucia
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 6776 - 6785
  • [3] A Multimodal Dataset of Fact-Checked News from Chile's Constitutional Processes: Collection, Processing, and Analysis
    Molina, Ignacio
    Keith, Brian
    Matus, Mauricio
    DATA, 2025, 10 (02)
  • [4] Large-scale analysis of fact-checked stories on Twitter reveals graded effects of ambiguity and falsehood on information reappearance
    Kauk, Julian
    Kreysa, Helene
    Schweinberger, Stefan R.
    PNAS NEXUS, 2025, 4 (02)
  • [5] FbMultiLingMisinfo: Challenging Large-Scale Multilingual Benchmark for Misinformation Detection
    Barnabo, Giorgio
    Siciliano, Federico
    Castillo, Carlos
    Leonardi, Stefano
    Nakov, Preslav
    Martino, Giovanni Da San
    Silvestri, Fabrizio
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022
  • [6] MLS: A Large-Scale Multilingual Dataset for Speech Research
    Pratap, Vineel
    Xu, Qiantong
    Sriram, Anuroop
    Synnaeve, Gabriel
    Collobert, Ronan
    INTERSPEECH 2020, 2020, : 2757 - 2761
  • [7] Multimodal and Multilingual Embeddings for Large-Scale Speech Mining
    Duquenne, Paul-Ambroise
    Gong, Hongyu
    Schwenk, Holger
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [8] SiDi KWS: A Large-Scale Multilingual Dataset for Keyword Spotting
    Meneses, Michel
    Holanda, Rafael
    Peres, Luis
    Rocha, Gabriela
    INTERSPEECH 2022, 2022, : 4616 - 4620
  • [9] MINION: a Large-Scale and Diverse Dataset for Multilingual Event Detection
    Ben Veyseh, Amir Pouran
    Minh Van Nguyen
    Dernoncourt, Franck
    Thien Huu Nguyen
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 2286 - 2299
  • [10] BjTT: A Large-Scale Multimodal Dataset for Traffic Prediction
    Zhang, Chengyang
    Zhang, Yong
    Shao, Qitan
    Feng, Jiangtao
    Li, Bo
    Lv, Yisheng
    Piao, Xinglin
    Yin, Baocai
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2024, 25 (11) : 18992 - 19003