ATTENTION-BASED MULTI-HYPOTHESIS FUSION FOR SPEECH SUMMARIZATION

Cited by: 1
Authors
Kano, Takatomo [1 ]
Ogawa, Atsunori [1 ]
Delcroix, Marc [1 ]
Watanabe, Shinji [2 ]
Affiliations
[1] NTT Corp, Tokyo, Japan
[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Keywords
Speech Summarization; Automatic Speech Recognition; BERT; Attention-based Fusion
DOI
10.1109/ASRU51503.2021.9687977
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Speech summarization, which generates a text summary from speech, can be achieved by combining automatic speech recognition (ASR) and text summarization (TS). With this cascade approach, we can exploit state-of-the-art models and large training datasets for both subtasks, i.e., Transformer for ASR and Bidirectional Encoder Representations from Transformers (BERT) for TS. However, ASR errors directly affect the quality of the output summary in the cascade approach. We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary. We investigate several schemes to combine ASR hypotheses. First, we propose using the sum of sub-word embedding vectors weighted by their posterior values provided by an ASR system as an input to a BERT-based TS system. Then, we introduce a more general scheme that uses an attention-based fusion module added to a pre-trained BERT module to align and combine several ASR hypotheses. Finally, we perform speech summarization experiments on the How2 dataset and a newly assembled TED-based dataset that we will release with this paper(1). These experiments show that retraining the BERT-based TS system with these schemes can improve summarization performance and that the attention-based fusion module is particularly effective.
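The two fusion schemes summarized above can be sketched in code. The snippet below is a minimal illustration only, not the paper's implementation: it assumes PyTorch, an N-best list whose hypotheses have already been aligned to a common sub-word length (e.g., via a confusion-network-style alignment), per-token posterior scores from the ASR decoder, and hypothetical names such as posterior_weighted_embeddings and HypothesisAttentionFusion.

    import torch
    import torch.nn as nn

    def posterior_weighted_embeddings(hyp_token_ids, hyp_posteriors, token_embedding):
        # Scheme 1 (sketch): sum of sub-word embedding vectors weighted by ASR posteriors.
        # hyp_token_ids:   LongTensor  [N, T] -- sub-word ids of the N aligned ASR hypotheses
        # hyp_posteriors:  FloatTensor [N, T] -- posterior score of each sub-word (assumed given)
        # token_embedding: nn.Embedding       -- pre-trained BERT sub-word embedding table
        emb = token_embedding(hyp_token_ids)                                        # [N, T, D]
        weights = hyp_posteriors / hyp_posteriors.sum(dim=0, keepdim=True).clamp_min(1e-8)
        return (weights.unsqueeze(-1) * emb).sum(dim=0)                             # [T, D]

    class HypothesisAttentionFusion(nn.Module):
        # Scheme 2 (sketch): an attention-based fusion module in front of the pre-trained
        # BERT encoder; the 1-best hypothesis attends over the other hypotheses and the
        # attended context is added back to its embeddings (hypothetical layout).
        def __init__(self, dim, num_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, primary, others):
            # primary: [B, T, D] embeddings of the 1-best hypothesis (queries)
            # others:  [B, S, D] embeddings of the remaining hypotheses, concatenated in time
            context, _ = self.attn(primary, others, others)
            return primary + context  # residual combination, fed to the BERT-based TS model

In either case, the fused representation would stand in for the usual 1-best token embeddings at the input of the BERT-based summarizer, which is then retrained with the fusion scheme in place, as the abstract describes.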
Pages: 487-494
Number of pages: 8
Related Papers
50 records in total
  • [1] Attention-Based Audio-Visual Fusion for Video Summarization. Fang, Yinghong; Zhang, Junpeng; Lu, Cewu. NEURAL INFORMATION PROCESSING (ICONIP 2019), PT II, 2019, 11954: 328-340.
  • [2] Attention-based Clinical Note Summarization. Kanwal, Neel; Rizzo, Giuseppe. 37TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, 2022: 813-820.
  • [3] Multi-hypothesis structures and taxonomies for combat identification fusion. Schuck, TM; Hunter, JB; Wilson, DD. 2004 IEEE AEROSPACE CONFERENCE PROCEEDINGS, VOLS 1-6, 2004: 2017-2026.
  • [4] Attention-based multi-modal fusion sarcasm detection. Liu, Jing; Tian, Shengwei; Yu, Long; Long, Jun; Zhou, Tiejun; Wang, Bo. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2023, 44(2): 2097-2108.
  • [5] Multi-hypothesis database for large-scale data fusion. McDaniel, D. PROCEEDINGS OF THE FIFTH INTERNATIONAL CONFERENCE ON INFORMATION FUSION, VOL II, 2002: 1168-1175.
  • [6] Multi-sensor fusion using an adaptive multi-hypothesis tracking algorithm. Kester, LJHM. MULTISENSOR, MULTISOURCE INFORMATION FUSION: ARCHITECTURES, ALGORITHMS, AND APPLICATIONS 2003, 2003, 5099: 164-172.
  • [7] Attention-Based Models for Speech Recognition. Chorowski, Jan; Bahdanau, Dzmitry; Serdyuk, Dmitriy; Cho, Kyunghyun; Bengio, Yoshua. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 28 (NIPS 2015), 2015, 28.
  • [8] Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition. Sterpu, George; Saam, Christian; Harte, Naomi. ICMI'18: PROCEEDINGS OF THE 20TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, 2018: 111-115.
  • [9] Attention-based Fusion for Multi-source Human Image Generation. Lathuiliere, Stephane; Sangineto, Enver; Siarohin, Aliaksandr; Sebe, Nicu. 2020 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2020: 428-437.
  • [10] Multi-Source Attention-Based Fusion for Segmentation of Natural Disasters. El Rai, Marwa Chendeb; Darweesh, Muna; Far, Aicha Beya; Gawanmeh, Amjad. IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2024, 21.