Audiovisual Speech Activity Detection with Advanced Long Short-Term Memory

被引:0
|
作者
Tao, Fei [1 ]
Busso, Carlos [1 ]
机构
[1] Univ Texas Dallas, Dept Elect & Comp Engn, Multimodal Signal Proc MSP Lab, Richardson, TX 75080 USA
关键词
speech activity activation; advanced LSTM; bimodal RNN; audiovisual speech processing; deep learning;
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Speech activity detection (SAD) is a key pre-processing step for a speech-based system. The performance of conventional audio-only SAD (A-SAD) systems is impaired by acoustic noise when they are used in practical applications. An alternative approach to address this problem is to include visual information, creating audiovisual speech activity detection (AV-SAD) solutions. In our previous work, we proposed to build an AV-SAD system using bimodal recurrent neural network (BRNN). This framework was able to capture the task-related characteristics in the audio and visual inputs, and model the temporal information within and across modalities. The approach relied on long short-term memory (LSTM). Although LSTM can model longer temporal dependencies with the cells, the effective memory of the units is limited to a few frames. since the recurrent connection only considers the previous frame. For SAD systems, it is important to model longer temporal dependencies to capture the semi-periodic nature of speech conveyed in acoustic and orofacial features. This study proposes to implement a BRNN-based AV-SAD system with advanced LSTMs (A-LSTMs), which overcomes this limitation by including multiple connections to frames in the past. The results show that the proposed framework can significantly outperform the BRNN system trained with the original LSTM layers.
引用
收藏
页码:1244 / 1248
页数:5
相关论文
共 50 条
  • [1] Time Series-based Spoof Speech Detection Using Long Short-term Memory and Bidirectional Long Short-term Memory
    Mirza, Arsalan R.
    Al-Talabani, Abdulbasit K.
    [J]. ARO-THE SCIENTIFIC JOURNAL OF KOYA UNIVERSITY, 2024, 12 (02): : 119 - 129
  • [2] Speech Dereverberation Using Long Short-Term Memory
    Mimura, Masato
    Sakai, Shinsuke
    Kawahara, Tatsuya
    [J]. 16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 2435 - 2439
  • [3] Long Short-term Memory for Tibetan Speech Recognition
    Wang, Weizhe
    Chen, Ziyan
    Yang, Hongwu
    [J]. PROCEEDINGS OF 2020 IEEE 4TH INFORMATION TECHNOLOGY, NETWORKING, ELECTRONIC AND AUTOMATION CONTROL CONFERENCE (ITNEC 2020), 2020, : 1059 - 1063
  • [4] Deep Long Short-Term Memory Networks for Speech Recognition
    Chien, Jen-Tzung
    Misbullah, Alim
    [J]. 2016 10TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2016,
  • [5] A Low Complexity Long Short-Term Memory Based Voice Activity Detection
    Yang, Ruiting
    Liu, Jie
    Deng, Xiang
    Zheng, Zhuochao
    [J]. 2020 IEEE 22ND INTERNATIONAL WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING (MMSP), 2020,
  • [6] Insider Threat Detection with Long Short-Term Memory
    Lu, Jiuming
    Wong, Raymond K.
    [J]. PROCEEDINGS OF THE AUSTRALASIAN COMPUTER SCIENCE WEEK MULTICONFERENCE (ACSW 2019), 2019,
  • [7] Endpoint Detection using Grid Long Short-Term Memory Networks for Streaming Speech Recognition
    Chang, Shuo-Yiin
    Li, Bo
    Sainath, Tara N.
    Simko, Gabor
    Parada, Carolina
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 3812 - 3816
  • [8] Audiovisual integration facilitates monkeys' short-term memory
    Bigelow, James
    Poremba, Amy
    [J]. ANIMAL COGNITION, 2016, 19 (04) : 799 - 811
  • [9] Audiovisual integration facilitates monkeys’ short-term memory
    James Bigelow
    Amy Poremba
    [J]. Animal Cognition, 2016, 19 : 799 - 811
  • [10] Long short-term memory
    Hochreiter, S
    Schmidhuber, J
    [J]. NEURAL COMPUTATION, 1997, 9 (08) : 1735 - 1780