Cross-Domain Deep Visual Feature Generation for Mandarin Audio-Visual Speech Recognition

Cited by: 13
Authors
Su, Rongfeng [1 ,2 ]
Liu, Xunying [3 ]
Wang, Lan [1 ]
Yang, Jingzhou [4 ]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen 518055, Peoples R China
[2] Univ Chinese Acad Sci, Shenzhen Coll Adv Technol, Shenzhen 518055, Peoples R China
[3] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Peoples R China
[4] Microsoft China, Beijing 100080, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Lips; Acoustics; Speech recognition; Training; Three-dimensional displays; Adaptation models; Audio-visual speech recognition (AVSR); visual feature generation; domain adaptation; NEURAL-NETWORKS; INVERSION; SYSTEM;
DOI
10.1109/TASLP.2019.2950602
Chinese Library Classification
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
There has been a long-term interest in using visual information to improve automatic speech recognition (ASR) system performance. Conventional audio-visual speech recognition (AVSR) systems require both audio and visual information, which limits their wider application when the visual modality is not present. To this end, one possible solution is to use acoustic-to-visual (A2V) inversion techniques to generate visual features. Previous research in this direction used synthetic acoustic-articulatory parallel data for inversion model training, and the acoustic mismatch between the audio-visual (AV) parallel data and the target data was not considered. In addition, these technologies have mainly been applied to English. In this article, a real 3D Audio-Visual Mandarin Continuous Speech (3DAV-MCS) corpus was used to train deep neural network based A2V inversion models. Cross-domain adaptation of the inversion models allows suitable visual features to be generated from acoustic data of mismatched domains. The proposed cross-domain deep visual feature generation techniques were evaluated on two state-of-the-art Mandarin speech recognition tasks: DARPA GALE broadcast transcription and BOLT conversational telephone speech recognition. The AVSR systems constructed using the cross-domain generated visual features consistently outperformed the baseline convolutional neural network (CNN) ASR systems by up to 3.3% absolute (9.1% relative) character error rate (CER) reduction after both speaker adaptive training and sequence discriminative training were performed.
Pages: 185-197 (13 pages)
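The A2V inversion described in the abstract maps acoustic frames (with temporal context) to visual lip-shape features through a feed-forward network. The sketch below illustrates only the generation-time forward pass with randomly initialised weights standing in for a trained model; all layer sizes, the context width, and the function names are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 40-d acoustic frames spliced with +/-5 frames of
# context, mapped to 30-d visual (lip-shape) features. Sizes are assumptions.
ACOUSTIC_DIM, LEFT, RIGHT, VISUAL_DIM, HIDDEN = 40, 5, 5, 30, 256
IN_DIM = ACOUSTIC_DIM * (LEFT + RIGHT + 1)

# Randomly initialised weights stand in for a trained A2V inversion DNN.
W1 = rng.standard_normal((IN_DIM, HIDDEN)) * 0.01
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, VISUAL_DIM)) * 0.01
b2 = np.zeros(VISUAL_DIM)

def splice(frames, left=LEFT, right=RIGHT):
    """Stack each frame with its temporal context (edge-padded at ends)."""
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    return np.stack([padded[t:t + left + right + 1].ravel()
                     for t in range(len(frames))])

def a2v_generate(acoustic_frames):
    """Generate one pseudo-visual feature vector per acoustic frame."""
    x = splice(acoustic_frames)
    h = np.maximum(0.0, x @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2                 # linear visual-feature output

utterance = rng.standard_normal((100, ACOUSTIC_DIM))  # 100 acoustic frames
visual = a2v_generate(utterance)
print(visual.shape)  # -> (100, 30): visual stream for AVSR fusion
```

In the paper's pipeline such generated visual features would be concatenated with the acoustic features as input to the AVSR system; the cross-domain adaptation step (retraining or adapting the inversion weights on target-domain audio) is omitted here for brevity.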