Audiovisual Speech Activity Detection with Advanced Long Short-Term Memory

Cited by: 0

Authors:

Tao, Fei [1]

Busso, Carlos [1]

Affiliations:

[1] Univ Texas Dallas, Dept Elect & Comp Engn, Multimodal Signal Proc MSP Lab, Richardson, TX 75080 USA

Source:

19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES | 2018

Keywords:

speech activity activation; advanced LSTM; bimodal RNN; audiovisual speech processing; deep learning;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Speech activity detection (SAD) is a key pre-processing step for a speech-based system. The performance of conventional audio-only SAD (A-SAD) systems is impaired by acoustic noise when they are used in practical applications. An alternative approach to address this problem is to include visual information, creating audiovisual speech activity detection (AV-SAD) solutions. In our previous work, we proposed to build an AV-SAD system using a bimodal recurrent neural network (BRNN). This framework was able to capture the task-related characteristics in the audio and visual inputs, and model the temporal information within and across modalities. The approach relied on long short-term memory (LSTM). Although LSTM can model longer temporal dependencies with the cells, the effective memory of the units is limited to a few frames, since the recurrent connection only considers the previous frame. For SAD systems, it is important to model longer temporal dependencies to capture the semi-periodic nature of speech conveyed in acoustic and orofacial features. This study proposes to implement a BRNN-based AV-SAD system with advanced LSTMs (A-LSTMs), which overcome this limitation by including multiple connections to frames in the past. The results show that the proposed framework can significantly outperform the BRNN system trained with the original LSTM layers.
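
The abstract describes the A-LSTM idea only at a high level: the recurrent step draws on several past frames instead of only the previous one. This record does not give the equations, so the following is a minimal illustrative sketch under stated assumptions, not the authors' formulation. It assumes that the cell state passed to each step is a softmax-weighted sum of cell states at a few fixed past offsets. The names `combine_past_cells`, `run_sketch`, `offsets`, and `scores` are hypothetical, the weights are random rather than trained, and the scores that would normally be learned are fixed to zero (uniform weighting).

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def combine_past_cells(cell_history, offsets, scores):
    """Weighted sum of past cell states (assumed form, not taken from the paper).

    cell_history: list of cell-state vectors, most recent last.
    offsets: how many frames back each connection looks (1 = previous frame).
    Offsets that reach before the start are clamped to the oldest state.
    """
    weights = softmax(scores)
    size = len(cell_history[-1])
    combined = [0.0] * size
    for w, off in zip(weights, offsets):
        past = cell_history[-min(off, len(cell_history))]
        for k in range(size):
            combined[k] += w * past[k]
    return combined


def run_sketch(inputs, hidden, offsets=(1, 3, 5), seed=0):
    """Run an LSTM-like recurrence whose cell input mixes several past frames."""
    rng = random.Random(seed)
    dim = len(inputs[0])
    # One weight row per gate unit, ordered: input, forget, output, candidate.
    W = [[rng.gauss(0.0, 0.1) for _ in range(dim + hidden)]
         for _ in range(4 * hidden)]
    scores = [0.0] * len(offsets)  # would be learned; uniform weighting here
    h = [0.0] * hidden
    cells = [[0.0] * hidden]       # initial cell state
    outputs = []
    for x in inputs:
        # A standard LSTM would use cells[-1] here.
        c_prev = combine_past_cells(cells, offsets, scores)
        xh = list(x) + h
        z = [sum(w * v for w, v in zip(row, xh)) for row in W]
        i, f, o, g = (z[n * hidden:(n + 1) * hidden] for n in range(4))
        c = [sigmoid(f[k]) * c_prev[k] + sigmoid(i[k]) * math.tanh(g[k])
             for k in range(hidden)]
        h = [sigmoid(o[k]) * math.tanh(c[k]) for k in range(hidden)]
        cells.append(c)
        outputs.append(h)
    return outputs
```

Setting `offsets=(1,)` reduces the sketch to an ordinary LSTM recurrence, which makes the difference easy to isolate: only the choice of `c_prev` changes.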


Pages: 1244-1248

Page count: 5
