CLASSIFYING LAUGHTER AND SPEECH USING AUDIO-VISUAL FEATURE PREDICTION

Cited by: 9
|
Authors
Petridis, Stavros [1]
Asghar, Ali [1]
Pantic, Maja [1]
Affiliations
[1] Univ London Imperial Coll Sci Technol & Med, Dept Comp, London, England
Keywords
laughter-vs-speech discrimination; audiovisual speech/laughter feature relationship; prediction-based classification;
DOI
10.1109/ICASSP.2010.5494992
CLC Classification Number
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
In this study, a system that discriminates laughter from speech by modelling the relationship between audio and visual features is presented. The underlying assumption is that this relationship differs between speech and laughter. Neural networks are trained to learn the audio-to-visual and visual-to-audio feature mappings for both classes. Classification of a new frame is performed via prediction: each network produces a prediction of the expected audio/visual features, and the network with the best prediction, i.e., the model that best describes the audiovisual feature relationship, provides its label to the input frame. When trained on a simple dataset and tested on a hard dataset, the proposed approach outperforms audiovisual feature-level fusion, yielding absolute increases of 10.9% in the laughter F1 rate and 6.4% in the classification rate. This indicates that prediction-based classification can produce a good model even when the available training dataset is not challenging enough.
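The prediction-based scheme the abstract describes can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the neural-network predictors are replaced by least-squares linear maps, and the audio/visual features (and the matrices `M_speech`, `M_laugh` that relate them) are synthetic stand-ins. A new frame is labelled by whichever class model predicts its visual features from its audio features with the smallest error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each class imposes a different linear
# audio-to-visual relationship (stand-in for real AV features).
M_speech = rng.normal(size=(4, 3))
M_laugh = rng.normal(size=(4, 3))

A_sp = rng.normal(size=(300, 4))
V_sp = A_sp @ M_speech + 0.01 * rng.normal(size=(300, 3))
A_la = rng.normal(size=(300, 4))
V_la = A_la @ M_laugh + 0.01 * rng.normal(size=(300, 3))

def fit_map(A, V):
    # Least-squares audio-to-visual map; the paper trains a neural
    # network per class (and per direction) instead.
    W, *_ = np.linalg.lstsq(A, V, rcond=None)
    return W

models = {"speech": fit_map(A_sp, V_sp), "laughter": fit_map(A_la, V_la)}

def classify(audio, visual):
    # The class whose model best predicts the observed visual
    # features labels the input frame.
    return min(models, key=lambda c: np.sum((audio @ models[c] - visual) ** 2))

a_new = np.ones(4)
print(classify(a_new, a_new @ M_speech))  # prints "speech"
```

The paper additionally trains visual-to-audio predictors per class; extending the sketch would just add a second map per class and sum the two prediction errors before taking the minimum.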
Pages: 5254-5257
Number of Pages: 4
Related Papers
50 items in total
  • [41] Audio-visual speech recognition by speechreading
    Zhang, XZ
    Mersereau, RM
    Clements, MA
    [J]. DSP 2002: 14TH INTERNATIONAL CONFERENCE ON DIGITAL SIGNAL PROCESSING PROCEEDINGS, VOLS 1 AND 2, 2002, : 1069 - 1072
  • [42] Audio-visual speech enhancement with AVCDCN (audio-visual codebook dependent cepstral normalization)
    Deligne, S
    Potamianos, G
    Neti, C
    [J]. SAM2002: IEEE SENSOR ARRAY AND MULTICHANNEL SIGNAL PROCESSING WORKSHOP PROCEEDINGS, 2002, : 68 - 71
  • [43] Lite Audio-Visual Speech Enhancement
    Chuang, Shang-Yi
    Tsao, Yu
    Lo, Chen-Chou
    Wang, Hsin-Min
    [J]. INTERSPEECH 2020, 2020, : 1131 - 1135
  • [44] Audio-visual speech processing and attention
    Sams, M
    [J]. PSYCHOPHYSIOLOGY, 2003, 40 : S5 - S6
  • [45] Audio-visual enhancement of speech in noise
    Girin, L
    Schwartz, JL
    Feng, G
    [J]. JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2001, 109 (06): : 3007 - 3020
  • [46] Audio-Visual Speech Recognition in Noisy Audio Environments
    Palecek, Karel
    Chaloupka, Josef
    [J]. 2013 36TH INTERNATIONAL CONFERENCE ON TELECOMMUNICATIONS AND SIGNAL PROCESSING (TSP), 2013, : 484 - 487
  • [47] COMPARISON BETWEEN DIFFERENT FEATURE EXTRACTION TECHNIQUES FOR AUDIO-VISUAL SPEECH RECOGNITION
    Chitu, Alin G.
    Rothkrantz, Leon J. M.
    Wiggers, Pascal
    Wojdel, Jacek C.
    [J]. JOURNAL ON MULTIMODAL USER INTERFACES, 2007, 1 (01) : 7 - 20
  • [48] A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model
    Li, Guizhu
    Fu, Min
    Sun, Mengnan
    Liu, Xuefeng
    Zheng, Bing
    [J]. SENSORS, 2023, 23 (21)
  • [49] Coarse speech recognition by audio-visual integration based on missing feature theory
    Koiwa, Tomoaki
    Nakadai, Kazuhiro
    Imura, Jun-ichi
    [J]. 2007 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, VOLS 1-9, 2007, : 1757 - 1762