Multization: Multi-Modal Summarization Enhanced by Multi-Contextually Relevant and Irrelevant Attention Alignment

Cited: 0
Authors
Rong, Huan [1 ]
Chen, Zhongfeng [1 ]
Lu, Zhenyu [1 ]
Xu, Fan [2 ]
Sheng, Victor S. [3 ]
Affiliations
[1] Nanjing Univ Informat Sci & Technol, Sch Artificial Intelligence, 219 Ningliu Rd, Nanjing 210044, Jiangsu, Peoples R China
[2] Jiangxi Normal Univ, Sch Comp Informat Engn, Nanchang, Jiangxi, Peoples R China
[3] Texas Tech Univ, Dept Comp Sci, Lubbock, TX 79430 USA
Funding
National Natural Science Foundation of China;
Keywords
Business intelligence; multi-modal summarization; semantic enhancement and attention; multi-modal cross learning;
DOI
10.1145/3651983
CLC number
TP18 [Theory of Artificial Intelligence];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
This article focuses on the task of Multi-Modal Summarization with Multi-Modal Output for JD.COM (China) e-commerce product descriptions containing both source text and source images. When learning the context of multi-modal (text and image) input, there exists a semantic gap between text and image, especially in their cross-modal semantics. As a result, capturing shared cross-modal semantics early becomes crucial for multi-modal summarization. Moreover, when generating the multi-modal summary, the relevance and irrelevance of the multi-modal contexts to the target summary should be considered according to the different contributions of the input text and images, so as to optimize the learning of the cross-modal context that guides summary generation and to emphasize the significant semantics within each modality. To address these challenges, Multization is proposed to enhance multi-modal semantic information through multi-contextually relevant and irrelevant attention alignment. Specifically, a Semantic Alignment Enhancement mechanism is employed to capture the shared semantics between different modalities (text and image), so as to enhance the importance of crucial multi-modal information in the encoding stage. Additionally, an IR-Relevant Multi-Context Learning mechanism observes the summary generation process from both relevant and irrelevant perspectives, forming a multi-modal context that incorporates both text and image semantics. Experimental results on the China JD.COM e-commerce dataset demonstrate that the proposed Multization method effectively captures the shared semantics between the input source text and source images and highlights the essential semantics. It also successfully generates a multi-modal summary (including image and text) that comprehensively considers the semantic information of both text and image.
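To make the two mechanisms named in the abstract more concrete, the sketch below illustrates (1) cross-modal semantic alignment, where text tokens attend over image regions in a shared space, and (2) splitting a decoder-side context into "relevant" and "irrelevant" parts by attention mass. It is a minimal PyTorch illustration under assumed dimensions and an assumed thresholding rule, not the authors' implementation described in the full paper.

```python
# Illustrative sketch only: module names, feature dimensions, and the
# cumulative-attention threshold are assumptions, not the Multization code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAlignment(nn.Module):
    """Project both modalities into a shared space; text tokens attend to image regions."""

    def __init__(self, text_dim=768, image_dim=2048, shared_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.image_proj = nn.Linear(image_dim, shared_dim)
        self.attn = nn.MultiheadAttention(shared_dim, num_heads=8, batch_first=True)

    def forward(self, text_feats, image_feats):
        # text_feats: (B, T, text_dim); image_feats: (B, R, image_dim)
        q = self.text_proj(text_feats)
        kv = self.image_proj(image_feats)
        aligned, attn_weights = self.attn(q, kv, kv)   # text enriched with image semantics
        return aligned + q, attn_weights               # residual keeps the original text semantics


def split_relevant_context(enc_states, dec_state, tau=0.5):
    """Form 'relevant' and 'irrelevant' context vectors for one decoder query.

    enc_states: (B, S, D) multi-modal encoder states; dec_state: (B, D).
    Splitting by cumulative attention mass (threshold tau) is only one
    plausible reading of "relevant vs. irrelevant" context.
    """
    scores = torch.einsum("bsd,bd->bs", enc_states, dec_state) / enc_states.size(-1) ** 0.5
    weights = F.softmax(scores, dim=-1)                          # (B, S)
    sorted_w, order = weights.sort(dim=-1, descending=True)
    # Mark the top positions that jointly hold <= tau of the attention mass as "relevant".
    relevant_mask = (sorted_w.cumsum(-1) <= tau).gather(-1, order.argsort(-1))
    rel_ctx = torch.einsum("bs,bsd->bd", weights * relevant_mask, enc_states)
    irr_ctx = torch.einsum("bs,bsd->bd", weights * (~relevant_mask), enc_states)
    return rel_ctx, irr_ctx


if __name__ == "__main__":
    B, T, R = 2, 16, 36
    text = torch.randn(B, T, 768)       # e.g., token embeddings of a product description
    image = torch.randn(B, R, 2048)     # e.g., region features of product images
    fused_text, _ = CrossModalAlignment()(text, image)      # (B, T, 512)
    dec_state = torch.randn(B, 512)                          # one decoder hidden state
    rel, irr = split_relevant_context(fused_text, dec_state)
    print(fused_text.shape, rel.shape, irr.shape)
```

In this reading, the relevant context would steer the next summary token while the irrelevant context can be down-weighted or used as a contrastive signal; how the two are actually combined is specific to the paper and not reproduced here.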
Pages: 29