Audio-Visual Fusion using Multiscale Temporal Convolutional Attention for Time-Domain Speech Separation

Cited by: 0

Authors:

Liu, Debang [1]

Zhang, Tianqi [1]

Christensen, Mads Graesboll [2]

Wei, Ying [1]

An, Zeliang [1]

Affiliations:

[1] Chongqing Univ Posts & Telecommun, Sch Commun & Informat Engn, Chongqing 400065, Peoples R China

[2] Aalborg Univ, Audio Anal Lab, CREATE, DK-9000 Aalborg, Denmark

Source:

INTERSPEECH 2023 | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

audio-visual fusion; time-domain; speech separation; temporal convolutional attention; training cost;

DOI:

10.21437/Interspeech.2023-801

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

Audio-only speech separation methods cannot fully exploit the audio-visual correlation information of the speaker, which limits separation performance. Additionally, audio-visual separation methods usually fuse audio-visual features through the traditional approach of feature splicing and linear mapping, which leaves the fusion process itself in need of further consideration. Therefore, in this paper, drawing on the changes in the speaker's mouth landmarks, we propose a time-domain audio-visual temporal convolution attention speech separation method (AVTA). In AVTA, we design a multiscale temporal convolutional attention (MTCA) module to better focus on the contextual dependencies of time sequences. We then use a sequence learning and fusion network composed of MTCA modules to build a separation model for the speech separation task. On different datasets, AVTA achieves competitive performance, and compared with baseline methods, it strikes a better balance among training cost, computational complexity, and separation performance.

Pages: 3694-3698

Page count: 5

50 items in total

[31] A Robust Audio-visual Speech Recognition Using Audio-visual Voice Activity Detection
Tamura, Satoshi
Ishikawa, Masato
Hashiba, Takashi
Takeuchi, Shin'ichi
Hayamizu, Satoru
11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2702 - +
[32] Speech Pattern Discovery using Audio-Visual Fusion and Canonical Correlation Analysis
Xie, Lei
Xu, Yinqing
Zheng, Lilei
Huang, Qiang
Li, Bingfeng
13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 2371 - 2374
[33] Audio-Visual Speech Separation with Visual Features Enhanced by Adversarial Training
Zhang, Peng
Xu, Jiaming
Shi, Jing
Hao, Yunzhe
Qin, Lei
Xu, Bo
2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,
[34] AUDIO-VISUAL SPEECH SEPARATION USING CROSS-MODAL CORRESPONDENCE LOSS
Makishima, Naoki
Ihori, Mana
Takashima, Akihiko
Tanaka, Tomohiro
Orihashi, Shota
Masumura, Ryo
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6673 - 6677
[35] AUDIO-VISUAL SPEECH RECOGNITION WITH A HYBRID CTC/ATTENTION ARCHITECTURE
Petridis, Stavros
Stafylakis, Themos
Ma, Pingchuan
Tzimiropoulos, Georgios
Pantic, Maja
2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 513 - 520
[36] Multi-Kernel Attention Encoder For Time-Domain Speech Separation
Liu, Zengrun
Shi, Diya
Wei, Ying
2024 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, ISCAS 2024, 2024,
[37] Audio-Visual Fusion Based on Interactive Attention for Person Verification
Jing, Xuebin
He, Liang
Song, Zhida
Wang, Shaolei
SENSORS, 2023, 23 (24)
[38] Robust Audio-Visual Speech Recognition Based on Hybrid Fusion
Liu, Hong
Li, Wenhao
Yang, Bing
2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 7580 - 7586
[39] Audio-Visual Fusion for Sound Source Localization and Improved Attention
Lee, Byoung-gi
Choi, JongSuk
Yoon, SangSuk
Choi, Mun-Taek
Kim, Munsang
Kim, Daijin
TRANSACTIONS OF THE KOREAN SOCIETY OF MECHANICAL ENGINEERS A, 2011, 35 (07) : 737 - 743
[40] Attention-Based Audio-Visual Fusion for Video Summarization
Fang, Yinghong
Zhang, Junpeng
Lu, Cewu
NEURAL INFORMATION PROCESSING (ICONIP 2019), PT II, 2019, 11954 : 328 - 340