CISum: Learning Cross-modality Interaction to Enhance Multimodal Semantic Coverage for Multimodal Summarization

Cited by: 0
Authors
Zhang, Litian [1 ]
Zhang, Xiaoming [1 ]
Guo, Ziming [1 ]
Liu, Zhipeng [1 ]
Affiliations
[1] Beihang Univ, Sch Cyber Sci & Technol, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Summarization; Multimodal; Semantic coverage; Multi-task;
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multimodal summarization (MS) aims to generate a summary from multimodal input. Previous works mainly focus on textual semantic coverage metrics such as ROUGE and treat the visual content as supplementary data; as a result, the generated summary fails to cover the semantics of the different modalities. This paper proposes a multi-task cross-modality learning framework (CISum) that improves multimodal semantic coverage by learning the cross-modality interaction within a multimodal article. To obtain the visual semantics, we translate images into visual descriptions based on their correlation with the text content. The visual descriptions and text content are then fused to generate a textual summary that captures the semantics of the multimodal content, and the most relevant image is selected as the visual summary. Furthermore, we design an automatic multimodal semantic coverage metric to evaluate performance. Experimental results show that CISum outperforms baselines on multimodal semantic coverage metrics while maintaining strong ROUGE and BLEU performance.
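The visual-summary step described above (pick the image whose description best matches the generated textual summary) can be sketched in miniature. This is not the paper's implementation: CISum learns the cross-modality relevance, whereas the sketch below substitutes a simple bag-of-words cosine similarity as a hypothetical relevance score; the function names `cosine_sim` and `select_visual_summary` are illustrative inventions.

```python
from collections import Counter
from math import sqrt

def cosine_sim(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two strings
    (a toy stand-in for a learned cross-modality relevance score)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_visual_summary(text_summary: str, image_descriptions: list[str]) -> int:
    """Return the index of the image whose (generated) description
    scores highest against the textual summary."""
    scores = [cosine_sim(text_summary, d) for d in image_descriptions]
    return max(range(len(scores)), key=scores.__getitem__)
```

For example, given a summary "the dog played in the park" and candidate descriptions ["a dog running in a park", "stock market chart"], the first image would be selected.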
Pages: 370-378 (9 pages)