BeAts: Bengali Speech Acts Recognition using Multimodal Attention Fusion

Cited by: 0
Authors
Deb, Ahana [1 ]
Nag, Sayan [2 ]
Mahapatra, Ayan [1 ]
Chattopadhyay, Soumitri [1 ]
Marik, Aritra [1 ]
Gayen, Pijush Kanti [1 ]
Sanyal, Shankha [1 ]
Banerjee, Archi [3 ]
Karmakar, Samir [1 ]
Affiliations
[1] Jadavpur Univ, Kolkata, India
[2] Univ Toronto, Toronto, ON, Canada
[3] IIT Kharagpur, Kharagpur, W Bengal, India
Source
INTERSPEECH 2023
Keywords
speech act; multimodal fusion; transformer; low-resource language; emotion; expression; features
DOI
10.21437/Interspeech.2023-1146
CLC Number
O42 [Acoustics]
Discipline Code
070206; 082403
Abstract
Spoken languages often utilise intonation, rhythm, intensity, and structure to communicate intention, which can be interpreted differently depending on the rhythm with which an utterance is spoken. These speech acts provide the foundation of communication and are unique in expression to the language. Recent advancements in attention-based models, which have demonstrated the ability to learn powerful representations from multilingual datasets, have performed well on speech tasks and are well suited to modelling specific tasks in low-resource languages. Here, we develop a novel multimodal approach combining two models, wav2vec2.0 for audio and MarianMT for text translation, using multimodal attention fusion to predict speech acts in our prepared Bengali speech corpus. We also show that our model BeAts (Bengali speech acts recognition using Multimodal Attention Fusion) significantly outperforms both a unimodal baseline using only speech data and a simpler bimodal fusion using both speech and text data. Project page: https://soumitri2001.github.io/BeAts
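The following minimal PyTorch sketch (illustrative only, not the authors' released code) shows one way the bimodal fusion described in the abstract could be wired: wav2vec2.0 frame embeddings act as queries over MarianMT encoder states through cross-modal multi-head attention, and a pooled fused representation feeds a speech-act classifier. The embedding dimensions match the public base checkpoints, but the head count, mean pooling, and the four-way label set are assumptions, and random tensors stand in for the two encoders' outputs.

import torch
import torch.nn as nn

class AttentionFusionClassifier(nn.Module):
    """Cross-modal attention fusion: audio frames query text states."""
    def __init__(self, audio_dim=768, text_dim=512, num_heads=8, num_acts=4):
        super().__init__()
        # Audio queries attend over text keys/values of a different width.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=audio_dim, num_heads=num_heads,
            kdim=text_dim, vdim=text_dim, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)
        self.classifier = nn.Linear(audio_dim, num_acts)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (B, T_audio, 768), e.g. wav2vec2.0 last_hidden_state
        # text_feats:  (B, T_text, 512),  e.g. MarianMT encoder states
        fused, _ = self.cross_attn(audio_feats, text_feats, text_feats)
        fused = self.norm(fused + audio_feats)  # residual + layer norm
        pooled = fused.mean(dim=1)              # mean-pool over time
        return self.classifier(pooled)          # speech-act logits

if __name__ == "__main__":
    # Random tensors stand in for the two frozen encoders' outputs.
    audio = torch.randn(2, 120, 768)  # stand-in for Wav2Vec2Model output
    text = torch.randn(2, 24, 512)    # stand-in for MarianMT encoder output
    model = AttentionFusionClassifier()
    print(model(audio, text).shape)   # torch.Size([2, 4])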
Pages: 3392 - 3396
Page count: 5
Related Papers
50 records in total
  • [31] Audio-Video Fusion with Double Attention for Multimodal Emotion Recognition
    Mocanu, Bogdan
    Tapu, Ruxandra
    2022 IEEE 14TH IMAGE, VIDEO, AND MULTIDIMENSIONAL SIGNAL PROCESSING WORKSHOP (IVMSP), 2022,
  • [32] Multi-level Attention Fusion for Multimodal Driving Maneuver Recognition
    Liu, Jing
    Liu, Yang
    Tian, Chengwen
    Zhao, Mengyang
    Zeng, Xinhua
    Song, Liang
    2022 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS 22), 2022, : 2609 - 2613
  • [33] Student Attention Detection Using Multimodal Data Fusion
    Mallibhat, Kaushik
    2024 IEEE INTERNATIONAL CONFERENCE ON ADVANCED LEARNING TECHNOLOGIES, ICALT 2024, 2024, : 295 - 297
  • [34] Multimodal Approach of Speech Emotion Recognition Using Multi-Level Multi-Head Fusion Attention-Based Recurrent Neural Network
    Ngoc-Huynh Ho
    Yang, Hyung-Jeong
    Kim, Soo-Hyung
    Lee, Gueesang
    IEEE ACCESS, 2020, 8 : 61672 - 61686
  • [35] Speech Emotion Recognition Using Dual-Stream Representation and Cross-Attention Fusion
    Yu, Shaode
    Meng, Jiajian
    Fan, Wenqing
    Chen, Ye
    Zhu, Bing
    Yu, Hang
    Xie, Yaoqin
    Sun, Qiurui
    ELECTRONICS, 2024, 13 (11)
  • [36] Improving Recognition of Speech System Using Multimodal Approach
    Radha, N.
    Shahina, A.
    Khan, A. Nayeemulla
    INTERNATIONAL CONFERENCE ON INNOVATIVE COMPUTING AND COMMUNICATIONS, VOL 2, 2019, 56 : 397 - 410
  • [37] Performance improvement in speech recognition using multimodal features
    Kim, Myung Won
    Song, Won Moon
    Kim, Young Jin
    Kim, Eun Ju
    ICNC 2007: THIRD INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION, VOL 2, PROCEEDINGS, 2007, : 686 - +
  • [38] Convolutional Attention Networks for Multimodal Emotion Recognition from Speech and Text Data
    Lee, Chan Woo
    Song, Kyu Ye
    Jeong, Jihoon
    Choi, Woo Yong
    FIRST GRAND CHALLENGE AND WORKSHOP ON HUMAN MULTIMODAL LANGUAGE (CHALLENGE-HML), 2018, : 28 - 34
  • [39] MULTIMODAL SPEECH EMOTION RECOGNITION USING AUDIO AND TEXT
    Yoon, Seunghyun
    Byun, Seokhyun
    Jung, Kyomin
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 112 - 118
  • [40] MULTIMODAL CROSS- AND SELF-ATTENTION NETWORK FOR SPEECH EMOTION RECOGNITION
    Sun, Licai
    Liu, Bin
    Tao, Jianhua
    Lian, Zheng
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 4275 - 4279