MULTI-MODAL PRE-TRAINING FOR AUTOMATED SPEECH RECOGNITION

Cited by: 4
Authors
Chan, David M. [1, 2]
Ghosh, Shalini [2]
Chakrabarty, Debmalya [2]
Hoffmeister, Bjorn [2]
Affiliations
[1] Univ Calif Berkeley, Berkeley, CA 94720 USA
[2] Amazon Alexa AI, Seattle, WA 98121 USA
Keywords
Automated Speech Recognition; Multi-Modal Learning; BERT; Conformer; Video
DOI
10.1109/ICASSP43922.2022.9746449
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Traditionally, research in automated speech recognition has focused on local-first encoding of audio representations to predict the spoken phonemes in an utterance. Unfortunately, approaches relying on such hyper-local information tend to be vulnerable to both local-level corruption (such as audio-frame drops, or loud noises) and global-level noise (such as environmental noise, or background noise) that has not been seen during training. In this work, we introduce a novel approach that leverages a self-supervised learning technique based on masked language modeling to compute a global, multi-modal encoding of the environment in which the utterance occurs. We then use a new deep-fusion framework to integrate this global context into a traditional ASR method, and demonstrate that the resulting method can outperform baseline methods by up to 7% on Librispeech; gains on internal datasets range from 6% (on larger models) to 45% (on smaller models).
Pages: 246-250
Page count: 5