Multization: Multi-Modal Summarization Enhanced by Multi-Contextually Relevant and Irrelevant Attention Alignment

Cited by: 0
Authors
Rong, Huan [1 ]
Chen, Zhongfeng [1 ]
Lu, Zhenyu [1 ]
Xu, Fan [2 ]
Sheng, Victor S. [3 ]
Affiliations
[1] Nanjing Univ Informat Sci & Technol, Sch Artificial Intelligence, 219 Ningliu Rd, Nanjing 210044, Jiangsu, Peoples R China
[2] Jiangxi Normal Univ, Sch Comp Informat Engn, Nanchang, Jiangxi, Peoples R China
[3] Texas Tech Univ, Dept Comp Sci, Rono Hills, Lubbock, TX 79430 USA
Funding
National Natural Science Foundation of China;
Keywords
Business intelligence; multi-modal summarization; semantic enhancement and attention; multi-modal cross learning
DOI
10.1145/3651983
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This article focuses on the task of Multi-Modal Summarization with Multi-Modal Output for product descriptions on China's JD.COM e-commerce platform, where the input contains both source text and source images. In learning the context of multi-modal (text and image) input, a semantic gap exists between text and image, especially in their cross-modal semantics. Capturing shared cross-modal semantics early therefore becomes crucial for multi-modal summarization. Moreover, because input text and images contribute differently to the generated summary, both the relevance and the irrelevance of the multi-modal contexts to the target summary should be considered, so as to optimize the learning of the cross-modal context that guides summary generation and to emphasize the significant semantics within each modality. To address these challenges, Multization is proposed to enhance multi-modal semantic information through multi-contextually relevant and irrelevant attention alignment. Specifically, a Semantic Alignment Enhancement mechanism captures the semantics shared between modalities (text and image), so as to enhance the importance of crucial multi-modal information in the encoding stage. In addition, an IR-Relevant Multi-Context Learning mechanism observes the summary generation process from both relevant and irrelevant perspectives, forming a multi-modal context that incorporates both text and image semantic information. Experimental results on the China JD.COM e-commerce dataset demonstrate that the proposed Multization method effectively captures the semantics shared between the input source text and source images and highlights essential semantics. It also generates a multi-modal summary (including image and text) that comprehensively considers the semantic information of both text and image.
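The two mechanisms named in the abstract can be illustrated with a minimal sketch: text tokens attend over image regions to enhance shared cross-modal semantics, and the resulting attention weights are split into summary-relevant and summary-irrelevant image contexts. All function names, shapes, and the mean-attention thresholding rule below are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(text_feats, image_feats):
    """Sketch of semantic alignment enhancement (assumed form):
    each text token attends over image regions, and the attended
    image context is fused back into the token representation."""
    d = text_feats.shape[-1]
    scores = text_feats @ image_feats.T / np.sqrt(d)   # (n_tokens, n_regions)
    weights = softmax(scores, axis=-1)                 # rows sum to 1
    image_context = weights @ image_feats              # (n_tokens, d)
    return text_feats + image_context, weights


def split_relevant_context(weights, image_feats, tau=None):
    """Sketch of relevant/irrelevant multi-context learning (assumed
    rule): regions whose mean received attention exceeds a threshold
    form the relevant context; the rest form the irrelevant context."""
    region_importance = weights.mean(axis=0)           # (n_regions,)
    tau = region_importance.mean() if tau is None else tau
    rel_mask = region_importance >= tau
    d = image_feats.shape[-1]
    relevant = image_feats[rel_mask].mean(axis=0) if rel_mask.any() else np.zeros(d)
    irrelevant = image_feats[~rel_mask].mean(axis=0) if (~rel_mask).any() else np.zeros(d)
    return relevant, irrelevant
```

In the paper's setting, both contexts would then condition the decoder so that generation is guided toward relevant semantics while irrelevant semantics are explicitly modeled rather than ignored.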
Pages: 29