Align and Attend: Multimodal Summarization with Dual Contrastive Losses

Cited by: 12
Authors
He, Bo [1 ]
Wang, Jun [1 ]
Qiu, Jielin [2 ]
Bui, Trung [3 ]
Shrivastava, Abhinav [1 ]
Wang, Zhaowen [3 ]
Affiliations
[1] Univ Maryland, College Pk, MD 20742 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA USA
[3] Adobe Res, San Francisco, CA USA
DOI
10.1109/CVPR52729.2023.01428
CLC classification
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
The goal of multimodal summarization is to extract the most important information from different modalities to form summaries. Unlike unimodal summarization, the multimodal summarization task explicitly leverages cross-modal information to help generate more reliable and high-quality summaries. However, existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples. To address these issues, we introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend to the multimodal input. In addition, we propose two novel contrastive losses to model both inter-sample and intra-sample correlations. Extensive experiments on two standard video summarization datasets (TVSum and SumMe) and two multimodal summarization datasets (Daily Mail and CNN) demonstrate the superiority of A2Summ, which achieves state-of-the-art performance on all datasets. Moreover, we collect a large-scale multimodal summarization dataset, BLiSS, which contains livestream videos and transcribed texts with annotated summaries. Our code and dataset are publicly available at https://boheumd.github.io/A2Summ/.
Pages: 14867-14878
Page count: 12
Related papers
50 records in total
  • [21] Align MacridVAE: Multimodal Alignment for Disentangled Recommendations
    Avas, Ignacio
    Allein, Liesbeth
    Laenen, Katrien
    Moens, Marie-Francine
    ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT I, 2024, 14608 : 73 - 89
  • [22] Attend to Medical Ontologies: Content Selection for Clinical Abstractive Summarization
    Sotudeh, Sajad
    Goharian, Nazli
    Filice, Ross W.
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 1899 - 1905
  • [23] Learning to Attend and to Ignore Is a Matter of Gains and Losses
    Della Libera, Chiara
    Chelazzi, Leonardo
    PSYCHOLOGICAL SCIENCE, 2009, 20 (06) : 778 - 784
  • [24] Attend and Align: Improving Deep Representations with Feature Alignment Layer for Person Retrieval
    Xu, Qin
    Sun, Yifan
    Li, Yali
    Wang, Shengjin
    2018 24TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2018, : 2148 - 2153
  • [25] Multimodal text summarization with evaluation approaches
    Khilji, Abdullah Faiz Ur Rahman
    Sinha, Utkarsh
    Singh, Pintu
    Ali, Adnan
    Laskar, Sahinur Rahman
    Dadure, Pankaj
    Manna, Riyanka
    Pakray, Partha
    Favre, Benoit
    Bandyopadhyay, Sivaji
    SADHANA-ACADEMY PROCEEDINGS IN ENGINEERING SCIENCES, 2023, 48 (04)
  • [26] Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering
    Li, Pengfei
    Liu, Gang
    He, Jinlong
    Zhao, Zixu
    Zhong, Shenjun
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT I, 2023, 14220 : 374 - 383
  • [27] Video Summarization Based on Multimodal Features
    Zhang, Yu
    Liu, Ju
    Liu, Xiaoxi
    Gao, Xuesong
    INTERNATIONAL JOURNAL OF MULTIMEDIA DATA ENGINEERING & MANAGEMENT, 2020, 11 (04): 60 - 76
  • [28] Leveraging multimodal content for podcast summarization
    Vaiani, Lorenzo
    La Quatra, Moreno
    Cagliero, Luca
    Garza, Paolo
    37TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, 2022, : 863 - 870
  • [29] Enhancing abstractive summarization of implicit datasets with contrastive attention
    Kwon, S.
    Lee, Y.
    Neural Computing and Applications, 2024, 36 (25) : 15337 - 15351
  • [30] Sentence salience contrastive learning for abstractive text summarization
    Huang, Ying
    Li, Zhixin
    Chen, Zhenbin
    Zhang, Canlong
    Ma, Huifang
    NEUROCOMPUTING, 2024, 593