BeAts: Bengali Speech Acts Recognition using Multimodal Attention Fusion

Cited by: 0
Authors
Deb, Ahana [1 ]
Nag, Sayan [2 ]
Mahapatra, Ayan [1 ]
Chattopadhyay, Soumitri [1 ]
Marik, Aritra [1 ]
Gayen, Pijush Kanti [1 ]
Sanyal, Shankha [1 ]
Banerjee, Archi [3 ]
Karmakar, Samir [1 ]
Affiliations
[1] Jadavpur Univ, Kolkata, India
[2] Univ Toronto, Toronto, ON, Canada
[3] IIT Kharagpur, Kharagpur, W Bengal, India
Source
INTERSPEECH 2023
Keywords
speech act; multimodal fusion; transformer; low-resource language; emotion; expression; features
DOI
10.21437/Interspeech.2023-1146
CLC Number
O42 [Acoustics]
Discipline Code
070206; 082403
Abstract
Spoken languages often utilise intonation, rhythm, intensity, and structure to communicate intention, which can be interpreted differently depending on the rhythm with which an utterance is spoken. These speech acts provide the foundation of communication and are unique in expression to the language. Recent advancements in attention-based models, which have demonstrated the ability to learn powerful representations from multilingual datasets, have performed well on speech tasks and are well suited to modelling specific tasks in low-resource languages. Here, we develop a novel multimodal approach combining two models, wav2vec2.0 for audio and MarianMT for text translation, using multimodal attention fusion to predict speech acts in our prepared Bengali speech corpus. We also show that our model BeAts (Bengali speech acts recognition using Multimodal Attention Fusion) significantly outperforms both a unimodal baseline using only speech data and a simpler bimodal fusion using both speech and text data. Project page: https://soumitri2001.github.io/BeAts
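The following minimal PyTorch sketch (illustrative only, not the authors' released code) shows one way the bimodal fusion described in the abstract could be wired: wav2vec2.0 frame embeddings act as queries over MarianMT encoder states through cross-modal multi-head attention, and a pooled fused representation feeds a speech-act classifier. The embedding dimensions match the public base checkpoints, but the head count, mean pooling, and the four-way label set are assumptions, and random tensors stand in for the two encoders' outputs.

import torch
import torch.nn as nn

class AttentionFusionClassifier(nn.Module):
    """Cross-modal attention fusion: audio frames query text states."""
    def __init__(self, audio_dim=768, text_dim=512, num_heads=8, num_acts=4):
        super().__init__()
        # Audio queries attend over text keys/values of a different width.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=audio_dim, num_heads=num_heads,
            kdim=text_dim, vdim=text_dim, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)
        self.classifier = nn.Linear(audio_dim, num_acts)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (B, T_audio, 768), e.g. wav2vec2.0 last_hidden_state
        # text_feats:  (B, T_text, 512),  e.g. MarianMT encoder states
        fused, _ = self.cross_attn(audio_feats, text_feats, text_feats)
        fused = self.norm(fused + audio_feats)  # residual + layer norm
        pooled = fused.mean(dim=1)              # mean-pool over time
        return self.classifier(pooled)          # speech-act logits

if __name__ == "__main__":
    # Random tensors stand in for the two frozen encoders' outputs.
    audio = torch.randn(2, 120, 768)  # stand-in for Wav2Vec2Model output
    text = torch.randn(2, 24, 512)    # stand-in for MarianMT encoder output
    model = AttentionFusionClassifier()
    print(model(audio, text).shape)   # torch.Size([2, 4])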
Pages: 3392 - 3396
Page count: 5
Related Papers
50 records in total
  • [31] Audio-Video Fusion with Double Attention for Multimodal Emotion Recognition
    Mocanu, Bogdan
    Tapu, Ruxandra
    2022 IEEE 14TH IMAGE, VIDEO, AND MULTIDIMENSIONAL SIGNAL PROCESSING WORKSHOP (IVMSP), 2022,
  • [32] Multi-level Attention Fusion for Multimodal Driving Maneuver Recognition
    Liu, Jing
    Liu, Yang
    Tian, Chengwen
    Zhao, Mengyang
    Zeng, Xinhua
    Song, Liang
    2022 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS 22), 2022, : 2609 - 2613
  • [33] Student Attention Detection Using Multimodal Data Fusion
    Mallibhat, Kaushik
    2024 IEEE INTERNATIONAL CONFERENCE ON ADVANCED LEARNING TECHNOLOGIES, ICALT 2024, 2024, : 295 - 297
  • [34] Multimodal Approach of Speech Emotion Recognition Using Multi-Level Multi-Head Fusion Attention-Based Recurrent Neural Network
    Ngoc-Huynh Ho
    Yang, Hyung-Jeong
    Kim, Soo-Hyung
    Lee, Gueesang
    IEEE ACCESS, 2020, 8 : 61672 - 61686
  • [35] Speech Emotion Recognition Using Dual-Stream Representation and Cross-Attention Fusion
    Yu, Shaode
    Meng, Jiajian
    Fan, Wenqing
    Chen, Ye
    Zhu, Bing
    Yu, Hang
    Xie, Yaoqin
    Sun, Qiurui
    ELECTRONICS, 2024, 13 (11)
  • [36] Improving Recognition of Speech System Using Multimodal Approach
    Radha, N.
    Shahina, A.
    Khan, A. Nayeemulla
    INTERNATIONAL CONFERENCE ON INNOVATIVE COMPUTING AND COMMUNICATIONS, VOL 2, 2019, 56 : 397 - 410
  • [37] Performance improvement in speech recognition using multimodal features
    Kim, Myung Won
    Song, Won Moon
    Kim, Young Jin
    Kim, Eun Ju
    ICNC 2007: THIRD INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION, VOL 2, PROCEEDINGS, 2007, : 686 - +
  • [38] Convolutional Attention Networks for Multimodal Emotion Recognition from Speech and Text Data
    Lee, Chan Woo
    Song, Kyu Ye
    Jeong, Jihoon
    Choi, Woo Yong
    FIRST GRAND CHALLENGE AND WORKSHOP ON HUMAN MULTIMODAL LANGUAGE (CHALLENGE-HML), 2018, : 28 - 34
  • [39] MULTIMODAL SPEECH EMOTION RECOGNITION USING AUDIO AND TEXT
    Yoon, Seunghyun
    Byun, Seokhyun
    Jung, Kyomin
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 112 - 118
  • [40] MULTIMODAL CROSS- AND SELF-ATTENTION NETWORK FOR SPEECH EMOTION RECOGNITION
    Sun, Licai
    Liu, Bin
    Tao, Jianhua
    Lian, Zheng
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 4275 - 4279