AUDIO-VISUAL SPEECH INPAINTING WITH DEEP LEARNING

Cited by: 14
Authors
Morrone, Giovanni [1 ]
Michelsanti, Daniel [2 ]
Tan, Zheng-Hua [2 ]
Jensen, Jesper [2 ,3 ]
Affiliations
[1] Univ Modena & Reggio Emilia, Dept Engn Enzo Ferrari, Modena, Italy
[2] Aalborg Univ, Dept Elect Syst, Aalborg, Denmark
[3] Oticon AS, Copenhagen, Denmark
Keywords
speech inpainting; audio-visual; deep learning; face landmarks; multi-task learning
DOI
10.1109/ICASSP39728.2021.9413488
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e., the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information. Recent work focuses solely on audio-only methods and generally aims at inpainting music signals, whose structure differs markedly from that of speech. Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide for gaps of different durations. We also experiment with a multi-task learning approach in which a phone recognition task is learned jointly with speech inpainting. Results show that the performance of audio-only speech inpainting approaches degrades rapidly as gaps grow, while the proposed audio-visual approach is able to plausibly restore the missing information. In addition, we show that multi-task learning is effective, although the largest contribution to performance comes from vision.
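The two ingredients named in the abstract can be sketched in isolation: simulating a time gap of 100-1600 ms in a spectrogram, and combining an inpainting reconstruction loss with an auxiliary phone-recognition loss. This is a minimal illustrative sketch, not the paper's actual model: the 10 ms hop, the L1 reconstruction term, the frame-level cross-entropy, and the weight `alpha` are all assumptions chosen for clarity.

```python
import numpy as np

def make_gap_mask(num_frames, gap_ms, hop_ms=10, rng=None):
    """Binary frame mask: 1 = reliable context, 0 = missing (to inpaint).
    At an assumed 10 ms hop, the paper's 100-1600 ms gaps span 10-160 frames."""
    rng = rng or np.random.default_rng(0)
    gap_len = int(round(gap_ms / hop_ms))
    start = rng.integers(0, num_frames - gap_len + 1)
    mask = np.ones(num_frames, dtype=np.float32)
    mask[start:start + gap_len] = 0.0
    return mask

def multitask_loss(pred_spec, true_spec, mask, phone_logits, phone_targets,
                   alpha=0.1):
    """Multi-task objective sketch: L1 inpainting loss on the missing frames
    plus alpha times a frame-level phone cross-entropy (both terms and the
    weighting are illustrative choices, not taken from the paper)."""
    gap = mask == 0.0
    inpaint = np.abs(pred_spec[gap] - true_spec[gap]).mean()
    # numerically stable log-softmax over phone classes, then cross-entropy
    z = phone_logits - phone_logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -logp[np.arange(len(phone_targets)), phone_targets].mean()
    return inpaint + alpha * ce
```

The mask would be applied to the input spectrogram before it reaches the network, while the loss is evaluated only on the frames the mask removed, so the model is rewarded for reconstruction rather than for copying its input.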
Pages: 6653-6657
Page count: 5