Align and Attend: Multimodal Summarization with Dual Contrastive Losses

Cited by: 12
Authors
He, Bo [1 ]
Wang, Jun [1 ]
Qiu, Jielin [2 ]
Bui, Trung [3 ]
Shrivastava, Abhinav [1 ]
Wang, Zhaowen [3 ]
Affiliations
[1] Univ Maryland, College Pk, MD 20742 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA USA
[3] Adobe Res, San Francisco, CA USA
DOI
10.1109/CVPR52729.2023.01428
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The goal of multimodal summarization is to extract the most important information from different modalities to form summaries. Unlike unimodal summarization, the multimodal summarization task explicitly leverages cross-modal information to help generate more reliable and high-quality summaries. However, existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples. To address these issues, we introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend to the multimodal input. In addition, we propose two novel contrastive losses to model both inter-sample and intra-sample correlations. Extensive experiments on two standard video summarization datasets (TVSum and SumMe) and two multimodal summarization datasets (Daily Mail and CNN) demonstrate the superiority of A2Summ, which achieves state-of-the-art performance on all datasets. Moreover, we collected a large-scale multimodal summarization dataset, BLiSS, which contains livestream videos and transcribed texts with annotated summaries. Our code and dataset are publicly available at https://boheumd.github.io/A2Summ/.
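As a rough illustration of the inter-sample contrastive idea mentioned in the abstract, the sketch below implements a generic symmetric InfoNCE-style loss over pooled video and text embeddings from the same batch in PyTorch. The function name, temperature value, and pooling assumption are illustrative stand-ins and do not reproduce the paper's exact dual-loss formulation.

```python
import torch
import torch.nn.functional as F

def inter_sample_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (batch, dim) pooled features, one pair per sample.
    # Matched (video, text) pairs act as positives; all other pairings in the
    # batch act as negatives, which is the usual inter-sample setup.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                 # (batch, batch) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)      # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)  # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)

# Toy usage with random tensors standing in for real encoder outputs.
video_emb = torch.randn(8, 256)
text_emb = torch.randn(8, 256)
print(inter_sample_contrastive_loss(video_emb, text_emb).item())
```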
Pages: 14867-14878
Page count: 12
Related Papers
Total: 50 items
  • [31] Multimodal text summarization with evaluation approaches
    Khilji, Abdullah Faiz Ur Rahman
    Sinha, Utkarsh
    Singh, Pintu
    Ali, Adnan
    Laskar, Sahinur Rahman
    Dadure, Pankaj
    Manna, Riyanka
    Pakray, Partha
    Favre, Benoit
    Bandyopadhyay, Sivaji
    Sādhanā, 48
  • [32] Graph Enhanced Contrastive Learning for Radiology Findings Summarization
    Hu, Jinpeng
    Li, Zhuo
    Chen, Zhihong
    Li, Zhen
    Wan, Xiang
    Chang, Tsung-Hui
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 4677 - 4688
  • [33] SimCLS: A Simple Framework for Contrastive Learning of Abstractive Summarization
    Liu, Yixin
    Liu, Pengfei
    ACL-IJCNLP 2021: THE 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 2, 2021, : 1065 - 1072
  • [34] Topic-guided abstractive multimodal summarization with multimodal output
    Rafi, Shaik
    Das, Ranjita
    NEURAL COMPUTING & APPLICATIONS, 2023,
  • [35] Joint Reinforcement and Contrastive Learning for Unsupervised Video Summarization
    Zhang, Yunzuo
    Liu, Yameng
    Zhu, Pengfei
    Kang, Weili
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 2587 - 2591
  • [36] Graph-based Multimodal Ranking Models for Multimodal Summarization
    Zhu, Junnan
    Xiang, Lu
    Zhou, Yu
    Zhang, Jiajun
    Zong, Chengqing
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2021, 20 (04)
  • [37] Heterogeneous graphormer for extractive multimodal summarization
    Jiang, Xiankai
    Chen, Jingqiang
    JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2024, : 355 - 373
  • [38] Align-then-abstract representation learning for low-resource summarization
    Moro, Gianluca
    Ragazzi, Luca
    NEUROCOMPUTING, 2023, 548
  • [39] STRUM: Extractive Aspect-Based Contrastive Summarization
    Gunel, Beliz
    Tata, Sandeep
    Najork, Marc
    COMPANION OF THE WORLD WIDE WEB CONFERENCE, WWW 2023, 2023, : 28 - 31
  • [40] Multimodal Graph Meta Contrastive Learning
    Zhao, Feng
    Wang, Donglin
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021, : 3657 - 3661