CLASSIFYING LAUGHTER AND SPEECH USING AUDIO-VISUAL FEATURE PREDICTION

Cited by: 9
Authors
Petridis, Stavros [1 ]
Asghar, Ali [1 ]
Pantic, Maja [1 ]
Affiliations
[1] Univ London Imperial Coll Sci Technol & Med, Dept Comp, London, England
Keywords
laughter-vs-speech discrimination; audiovisual speech/laughter feature relationship; prediction-based classification
DOI
10.1109/ICASSP.2010.5494992
CLC number
O42 [Acoustics]
Subject classification
070206; 082403
Abstract
In this study, a system that discriminates laughter from speech by modelling the relationship between audio and visual features is presented. The underlying assumption is that this relationship differs between speech and laughter. Neural networks are trained to learn the audio-to-visual and visual-to-audio feature mappings for both classes. Classification of a new frame is performed via prediction: each network produces a prediction of the expected audio/visual features, and the network with the best prediction, i.e., the model that best describes the audiovisual feature relationship, assigns its label to the input frame. When trained on a simple dataset and tested on a challenging dataset, the proposed approach outperforms audiovisual feature-level fusion, yielding absolute increases of 10.9% in the laughter F1 measure and 6.4% in the classification rate. This indicates that prediction-based classification can produce a good model even when the available training dataset is not challenging enough.
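The classification-by-prediction scheme described in the abstract can be sketched as follows. This is a hedged illustration only: a least-squares linear audio-to-visual map stands in for the paper's neural networks, and all feature data, names, and dimensions are synthetic assumptions, not the authors' setup.

```python
# Prediction-based classification sketch: fit one audio->visual model
# per class (laughter, speech); a new frame gets the label of the
# model that predicts its visual features with the smallest error.
# The paper uses neural networks; a linear least-squares map is used
# here for simplicity, on synthetic data.
import numpy as np

rng = np.random.default_rng(0)

def fit_audio_to_visual(A, V):
    """Least-squares linear map W such that A @ W approximates V."""
    W, *_ = np.linalg.lstsq(A, V, rcond=None)
    return W

def classify(audio, visual, models):
    """Return the label whose model best predicts the visual features."""
    errors = {label: np.linalg.norm(audio @ W - visual)
              for label, W in models.items()}
    return min(errors, key=errors.get)

# Synthetic training data: each class has its own audio->visual relationship.
n, d_a, d_v = 200, 6, 4
W_laugh = rng.normal(size=(d_a, d_v))
W_speech = rng.normal(size=(d_a, d_v))
A_laugh = rng.normal(size=(n, d_a))
A_speech = rng.normal(size=(n, d_a))
V_laugh = A_laugh @ W_laugh + 0.05 * rng.normal(size=(n, d_v))
V_speech = A_speech @ W_speech + 0.05 * rng.normal(size=(n, d_v))

models = {
    "laughter": fit_audio_to_visual(A_laugh, V_laugh),
    "speech": fit_audio_to_visual(A_speech, V_speech),
}

# A test frame generated by the laughter relationship should be
# labelled "laughter", since that model's prediction error is smallest.
a = rng.normal(size=(1, d_a))
v = a @ W_laugh
print(classify(a, v, models))
```

The key design point carried over from the paper is that no direct classifier is trained on joint features; the class-conditional audiovisual mapping itself does the discriminating.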
Pages: 5254-5257
Number of pages: 4