Multimodal Summarization with Guidance of Multimodal Reference

Citations: 0
Authors
Zhu, Junnan [1, 2]
Zhou, Yu [1, 2]
Zhang, Jiajun [1, 2]
Li, Haoran [4]
Zong, Chengqing [1, 2, 3]
Li, Changliang [5]
Affiliations
[1] Chinese Acad Sci, Natl Lab Pattern Recognit, Inst Automat, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] CAS Ctr Excellence Brain Sci & Intelligence Techn, Beijing, Peoples R China
[4] JD AI Res, Beijing, Peoples R China
[5] Kingsoft AI Lab, Beijing, Peoples R China
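Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104; 0812; 0835; 1405
Abstract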
Multimodal summarization with multimodal output (MSMO) aims to generate a multimodal summary for a multimodal news report, which has been shown to effectively improve users' satisfaction. Existing MSMO methods are trained with a text-only target, leading to a modality-bias problem: the quality of the model-selected image is ignored during training. To alleviate this problem, we propose a multimodal objective function, guided by a multimodal reference, that combines the losses from summary generation and image selection. Because multimodal reference data are lacking, we present two strategies, ROUGE-ranking and Order-ranking, to construct the multimodal reference by extending the text reference. Meanwhile, to better evaluate multimodal outputs, we propose a novel evaluation metric based on joint multimodal representation, which projects the model output and the multimodal reference into a joint semantic space during evaluation. Experimental results show that our proposed model achieves a new state of the art on both automatic and manual evaluation metrics. In addition, our proposed evaluation method effectively improves the correlation with human judgments.
Pages: 9749-9756
Number of pages: 8