Multimodal Summarization with Guidance of Multimodal Reference

Cited by: 0
Authors
Zhu, Junnan [1 ,2 ]
Zhou, Yu [1 ,2 ]
Zhang, Jiajun [1 ,2 ]
Li, Haoran [4 ]
Zong, Chengqing [1 ,2 ,3 ]
Li, Changliang [5 ]
Affiliations
[1] Chinese Acad Sci, Natl Lab Pattern Recognit, Inst Automat, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] CAS Ctr Excellence Brain Sci & Intelligence Techn, Beijing, Peoples R China
[4] JD AI Res, Beijing, Peoples R China
[5] Kingsoft AI Lab, Beijing, Peoples R China
Keywords
DOI
Not available
CLC number (Chinese Library Classification)
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multimodal summarization with multimodal output (MSMO) generates a multimodal summary for a multimodal news report, which has been shown to effectively improve user satisfaction. Existing MSMO methods are trained only with a text-modality target, leading to a modality-bias problem in which the quality of the model-selected image is ignored during training. To alleviate this problem, we propose a multimodal objective function guided by a multimodal reference, which combines the loss from summary generation with the loss from image selection. Because multimodal reference data are lacking, we present two strategies, ROUGE-ranking and Order-ranking, to construct the multimodal reference by extending the text reference. Meanwhile, to better evaluate multimodal outputs, we propose a novel evaluation metric based on a joint multimodal representation, which projects the model output and the multimodal reference into a joint semantic space during evaluation. Experimental results show that our proposed model achieves a new state of the art on both automatic and manual evaluation metrics. In addition, our proposed evaluation method effectively improves the correlation with human judgments.
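The abstract outlines two training-side ideas that can be sketched concretely: a multimodal objective that sums the summary-generation loss and an image-selection loss, and a ROUGE-ranking strategy that constructs the image part of the multimodal reference by ranking candidate images according to how well their captions match the text reference. The PyTorch sketch below is illustrative only and is not the authors' released implementation; the weight `lambda_img`, the function names `multimodal_loss` and `rouge_rank_reference`, and the choice of ROUGE-L for ranking are assumptions.

```python
# Minimal sketch (assumptions noted above) of the multimodal objective and the
# ROUGE-ranking reference construction described in the abstract.
import torch
import torch.nn.functional as F
from rouge_score import rouge_scorer


def multimodal_loss(summary_logits, summary_target,
                    image_scores, image_target, lambda_img=1.0):
    """Combine the text loss and the image-selection loss.

    summary_logits: (seq_len, vocab_size) decoder outputs for the summary
    summary_target: (seq_len,) gold summary token ids (text reference)
    image_scores:   (num_images,) model scores over candidate images
    image_target:   0-dim tensor holding the index of the reference image
    lambda_img:     assumed trade-off weight between the two losses
    """
    text_loss = F.cross_entropy(summary_logits, summary_target)
    image_loss = F.cross_entropy(image_scores.unsqueeze(0),
                                 image_target.unsqueeze(0))
    return text_loss + lambda_img * image_loss


def rouge_rank_reference(captions, text_reference):
    """ROUGE-ranking strategy (sketch): pick as the image reference the
    candidate whose caption scores highest against the text reference.
    Using ROUGE-L F1 here is an assumption, not the paper's exact setting."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [scorer.score(text_reference, c)["rougeL"].fmeasure
              for c in captions]
    return max(range(len(captions)), key=lambda i: scores[i])
```

In this sketch, the index returned by `rouge_rank_reference` would serve as `image_target` in `multimodal_loss`, which is the guidance role the constructed multimodal reference plays during training according to the abstract.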
Pages: 9749 - 9756
Page count: 8
Related papers (50 in total)
  • [21] CISum: Learning Cross-modality Interaction to Enhance Multimodal Semantic Coverage for Multimodal Summarization
    Zhang, Litian
    Zhang, Xiaoming
    Guo, Ziming
    Liu, Zhipeng
    PROCEEDINGS OF THE 2023 SIAM INTERNATIONAL CONFERENCE ON DATA MINING, SDM, 2023, : 370 - 378
  • [22] A model for multimodal reference resolution
    Pineda, L
    Garza, G
    COMPUTATIONAL LINGUISTICS, 2000, 26 (02) : 139 - 193
  • [23] Multimodal system for the planning and guidance of bronchoscopy
    Higgins, William E.
    Cheirsilp, Ronnarit
    Zang, Xiaonan
    Byrnes, Patrick
    MEDICAL IMAGING 2015: IMAGE-GUIDED PROCEDURES, ROBOTIC INTERVENTIONS, AND MODELING, 2015, 9415
  • [24] The Promise of Multimodal Image Guidance in Neurosurgery
    Tomasello, Francesco
    Conti, Alfredo
    WORLD NEUROSURGERY, 2014, 82 (1-2) : E183 - E184
  • [25] Multimodal interaction for mobile robot guidance
    Iannizzotto, Giancarlo
    Lanzafame, Pietro
    La Rosa, Francesco
    Costanzo, Carlo
    PROCEEDINGS OF THE IASTED INTERNATIONAL CONFERENCE ON HUMAN-COMPUTER INTERACTION, 2005, : 173 - 178
  • [26] UniMS: A Unified Framework for Multimodal Summarization with Knowledge Distillation
    Zhang, Zhengkun
    Meng, Xiaojun
    Wang, Yasheng
    Jiang, Xin
    Liu, Qun
    Yang, Zhenglu
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 11757 - 11764
  • [27] Multimodal latent topic analysis for image collection summarization
    Camargo, Jorge E.
    Gonzalez, Fabio A.
    INFORMATION SCIENCES, 2016, 328 : 270 - 287
  • [28] Multimodal summarization with modality features alignment and features filtering
    Tang, Binghao
    Lin, Boda
    Chang, Zheng
    Li, Si
    NEUROCOMPUTING, 2024, 603
  • [29] Multimodal Local Feature Enhancement Network for Video Summarization
    Li, Zhaoyun
    Ren, Xiwei
    Du, Fengyi
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT VI, 2024, 14430 : 158 - 169
  • [30] Multimodal Abstractive Summarization for How2 Videos
    Palaskar, Shruti
    Libovicky, Jindrich
    Gella, Spandana
    Metze, Florian
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 6587 - 6596