Multization: Multi-Modal Summarization Enhanced by Multi-Contextually Relevant and Irrelevant Attention Alignment

Cited by: 0
Authors
Rong, Huan [1 ]
Chen, Zhongfeng [1 ]
Lu, Zhenyu [1 ]
Xu, Fan [2 ]
Sheng, Victor S. [3 ]
Affiliations
[1] Nanjing Univ Informat Sci & Technol, Sch Artificial Intelligence, 219 Ningliu Rd, Nanjing 210044, Jiangsu, Peoples R China
[2] Jiangxi Normal Univ, Sch Comp Informat Engn, Nanchang, Jiangxi, Peoples R China
[3] Texas Tech Univ, Dept Comp Sci, Lubbock, TX 79430 USA
Funding
National Natural Science Foundation of China
Keywords
Business intelligence; multi-modal summarization; semantic enhancement and attention; multi-modal cross learning
DOI
10.1145/3651983
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
This article focuses on the task of Multi-Modal Summarization with Multi-Modal Output for JD.COM (China) e-commerce product descriptions containing both source text and source images. In context learning over multi-modal (text and image) input, a semantic gap exists between text and image, especially in their cross-modal semantics, so capturing shared cross-modal semantics early becomes crucial for multi-modal summarization. Furthermore, when generating the multi-modal summary, the relevance and irrelevance of the multi-modal contexts to the target summary should be weighed according to the different contributions of the input text and images, in order to optimize the cross-modal context learning that guides summary generation and to emphasize the significant semantics within each modality. To address these challenges, Multization is proposed to enhance multi-modal semantic information through multi-contextually relevant and irrelevant attention alignment. Specifically, a Semantic Alignment Enhancement mechanism captures the semantics shared between the two modalities (text and image), strengthening the importance of crucial multi-modal information at the encoding stage. In addition, an IR-Relevant Multi-Context Learning mechanism observes the summary generation process from both relevant and irrelevant perspectives, forming a multi-modal context that incorporates both textual and visual semantic information. Experimental results on the JD.COM (China) e-commerce dataset demonstrate that the proposed Multization method effectively captures the semantics shared between the input source text and source images and highlights the essential semantics. It also generates a multi-modal summary (including both image and text) that comprehensively accounts for the semantic information of both modalities.
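The abstract gives no implementation details, so the following PyTorch sketch is only a rough illustration of the two named mechanisms: bidirectional cross-modal attention standing in for Semantic Alignment Enhancement, and a relevance-scored context split standing in for IR-Relevant Multi-Context Learning. All module names, tensor shapes, and the subtraction-based fusion are assumptions made for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn


class SemanticAlignmentEnhancement(nn.Module):
    """Bidirectional cross-modal attention: text tokens attend to image
    regions and image regions attend to text tokens, so semantics shared
    by both modalities are emphasized during encoding (hypothetical)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text: torch.Tensor, image: torch.Tensor):
        # text: (B, T, dim) token features; image: (B, R, dim) region features
        text_aligned, _ = self.txt2img(text, image, image)  # text queries image
        image_aligned, _ = self.img2txt(image, text, text)  # image queries text
        # Residual fusion keeps each modality's own semantics while
        # injecting the shared cross-modal ones.
        return text + text_aligned, image + image_aligned


class IRRelevantContext(nn.Module):
    """Scores each fused multi-modal context vector against the current
    decoder state, then builds one context from the 'relevant' view and
    one from the 'irrelevant' view, keeping the former and suppressing
    the latter (hypothetical reading of IR-Relevant Multi-Context Learning)."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)

    def forward(self, dec_state: torch.Tensor, contexts: torch.Tensor):
        # dec_state: (B, dim); contexts: (B, N, dim) fused text+image vectors
        scores = torch.einsum("bd,bnd->bn", self.query(dec_state), contexts)
        relevant = torch.softmax(scores, dim=-1)     # weights high-score contexts
        irrelevant = torch.softmax(-scores, dim=-1)  # weights low-score contexts
        relevant_ctx = torch.einsum("bn,bnd->bd", relevant, contexts)
        irrelevant_ctx = torch.einsum("bn,bnd->bd", irrelevant, contexts)
        # Emphasize relevant semantics and subtract the irrelevant view;
        # the subtraction is one plausible fusion choice, not the paper's.
        return relevant_ctx - irrelevant_ctx


# Smoke test with random features (B=batch, T=tokens, R=regions, D=dim).
B, T, R, D = 2, 12, 49, 256
sae = SemanticAlignmentEnhancement(D)
irc = IRRelevantContext(D)
text_feats, image_feats = torch.randn(B, T, D), torch.randn(B, R, D)
text_enh, image_enh = sae(text_feats, image_feats)
context = irc(torch.randn(B, D), torch.cat([text_enh, image_enh], dim=1))
print(context.shape)  # torch.Size([2, 256])
```

Forming the irrelevant view with a negated-score softmax mirrors the abstract's idea of observing generation from both perspectives; a gating or margin-based alternative would fit the same interface.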
Pages: 29
Related Papers
50 records in total
  • [41] A Multi-Modal Entity Alignment Method with Inter-Modal Enhancement
    Yuan, Song
    Lu, Zexin
    Li, Qiyuan
    Gu, Jinguang
    BIG DATA AND COGNITIVE COMPUTING, 2023, 7 (02)
  • [42] Fine-Grained Image Classification Based on Multi-Modal Features and Enhanced Alignment
    Han, Jing
    Zhang, Tianpeng
    Lyu, Xueqiang
    Beijing Youdian Daxue Xuebao/Journal of Beijing University of Posts and Telecommunications, 2024, 47 (04): : 130 - 135
  • [43] Multi-modal Rumor Detection on Modality Alignment and Multi-perspective Structures
    Li, Boqun
    Qian, Zhong
    Li, Peifeng
    Zhu, Qiaoming
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, ICIC 2023, PT IV, 2023, 14089 : 472 - 483
  • [44] ATTENTION DRIVEN FUSION FOR MULTI-MODAL EMOTION RECOGNITION
    Priyasad, Darshana
    Fernando, Tharindu
    Denman, Simon
    Sridharan, Sridha
    Fookes, Clinton
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 3227 - 3231
  • [45] Fraud Detection with Multi-Modal Attention and Correspondence Learning
    Park, Jongchan
    Kim, Min-Hyun
    Choi, Seibum
    Kweon, In So
    Choi, Dong-Geol
    2019 INTERNATIONAL CONFERENCE ON ELECTRONICS, INFORMATION, AND COMMUNICATION (ICEIC), 2019, : 278 - 284
  • [46] A multi-modal object attention system for a mobile robot
    Haasch, A
    Hofemann, N
    Fritsch, J
    Sagerer, G
    2005 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, VOLS 1-4, 2005, : 1499 - 1504
  • [47] Contextual Inter-modal Attention for Multi-modal Sentiment Analysis
    Ghosal, Deepanway
    Akhtar, Md Shad
    Chauhan, Dushyant
    Poria, Soujanya
Ekbal, Asif
Bhattacharyya, Pushpak
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 3454 - 3466
  • [48] Mixture of Attention Variants for Modal Fusion in Multi-Modal Sentiment Analysis
    He, Chao
    Zhang, Xinghua
    Song, Dongqing
    Shen, Yingshan
    Mao, Chengjie
    Wen, Huosheng
    Zhu, Dingju
    Cai, Lihua
    BIG DATA AND COGNITIVE COMPUTING, 2024, 8 (02)
  • [49] A New Design on Multi-Modal Robotic Focus Attention
    Lin, Chia-How
    Yang, Chia-Hsing
    Wang, Cheng-Kang
    Song, Kai-Tai
    Hu, Jwu-Sheng
    2008 17TH IEEE INTERNATIONAL SYMPOSIUM ON ROBOT AND HUMAN INTERACTIVE COMMUNICATION, VOLS 1 AND 2, 2008, : 598 - 603
  • [50] Keep Meeting Summaries on Topic: Abstractive Multi-Modal Meeting Summarization
    Li, Manling
    Zhang, Lingyu
    Ji, Heng
    Radke, Richard J.
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 2190 - 2196