M3S: Scene Graph Driven Multi-Granularity Multi-Task Learning for Multi-Modal NER

Cited by: 14
Authors
Wang, Jie [1 ,2 ]
Yang, Yan [1 ,2 ]
Liu, Keyu [1 ,2 ]
Zhu, Zhiping [1 ,2 ]
Liu, Xiaorong [1 ,2 ]
Affiliations
[1] Southwest Jiaotong Univ, Sch Comp & Artificial Intelligence, Chengdu 611756, Peoples R China
[2] Southwest Jiaotong Univ, Mfg Ind Chains Collaborat & Informat Support Tech, Chengdu 611756, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visualization; Task analysis; Semantics; Feature extraction; Multitasking; Social networking (online); Image segmentation; Named entity recognition; multi-modal learning; scene graph;
DOI
10.1109/TASLP.2022.3221017
Chinese Library Classification
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Multi-modal Named Entity Recognition (MNER), which mainly focuses on enhancing text-only NER with visual information, has recently attracted considerable attention. Most current MNER models have made significant progress by jointly understanding visual and language modalities through layers of cross-modality attention. However, these approaches largely ignore the visual bias introduced by image contents and barely exploit multi-granularity representations or the interactions between visual objects, which are essential for recognizing ambiguous entities. In this paper, we propose a Scene graph driven Multi-modal Multi-granularity Multi-task learning (M3S) framework to better exploit visual and textual information in MNER. Specifically, to explicitly alleviate visual bias, we present a novel multi-task approach that employs Named Entity Segmentation (NES) cascaded with Named Entity Categorization (NEC). To obtain detailed visual semantics by explicitly modeling objects and the relationships between paired objects, we construct scene graphs as a structured representation of the visual contents. Furthermore, a well-designed Multi-granularity Gated Aggregation (MGA) mechanism is introduced to capture inter-modality interactions and extract critical features for named entity recognition. Extensive experiments on two real public datasets demonstrate the effectiveness of our proposed M3S.
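The gated aggregation idea in the abstract can be illustrated with a minimal sketch: each visual granularity (e.g. image-level, object-level, and relation-level vectors from a scene graph) is admitted into the fused text representation through a learned sigmoid gate. The parameterization below (a single weight vector `W` and bias `b` per gate, shared across granularities) is a simplifying assumption for illustration, not the paper's exact MGA equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_aggregation(text_vec, granularity_vecs, W, b):
    """Fuse multi-granularity visual vectors into a text representation.

    For each visual vector v, a scalar gate
        g = sigmoid(W . [text_vec ; v] + b)
    controls how much of v is added to the fused output, letting the
    model suppress irrelevant visual content (mitigating visual bias).
    Hypothetical parameterization, shared across granularities.
    """
    fused = text_vec.copy()
    for v in granularity_vecs:
        g = sigmoid(W @ np.concatenate([text_vec, v]) + b)
        fused = fused + g * v
    return fused

rng = np.random.default_rng(0)
d = 8
text_vec = rng.standard_normal(d)
# image-level, object-level, and relation-level visual vectors
visual_vecs = [rng.standard_normal(d) for _ in range(3)]
W = rng.standard_normal(2 * d)
b = 0.0
fused = gated_aggregation(text_vec, visual_vecs, W, b)
print(fused.shape)  # (8,)
```

In the full model, each granularity would have its own gate parameters and the fused representation would feed the downstream NES/NEC heads; the scalar gates here could also be replaced by element-wise gates.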
Pages: 111-120
Page count: 10