Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition

Cited by: 4
Authors
Wang, Jianrong [1 ]
Tang, Ziyue [2 ]
Li, Xuewei [1 ]
Yu, Mei [1 ]
Fang, Qiang [3 ]
Liu, Li [4 ]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
[2] Tianjin Univ, Tianjin Int Engn Inst, Tianjin, Peoples R China
[3] Chinese Acad Social Sci, Inst Linguist, Beijing, Peoples R China
[4] Chinese Univ Hong Kong, Shenzhen Res Inst Big Data, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cued Speech; Cross-modal knowledge distillation; Teacher-student structure; Cued Speech recognition;
DOI
10.21437/Interspeech.2021-432
CLC Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes
100104 ; 100213 ;
Abstract
Cued Speech (CS) is a visual communication system for deaf or hearing-impaired people. It combines lip movements with hand cues to obtain a complete phonetic repertoire. Current deep-learning-based methods for automatic CS recognition suffer from a common problem: data scarcity. To date, there are only two public single-speaker datasets, for French (238 sentences) and British English (97 sentences). In this work, we propose a cross-modal knowledge distillation method with a teacher-student structure, which transfers audio speech information to CS to overcome the limited-data problem. First, we pretrain a teacher model for CS recognition with a large amount of open-source audio speech data, and simultaneously pretrain the feature extractors for lips and hands using CS data. Then, we distill the knowledge from the teacher model to the student model with frame-level and sequence-level distillation strategies. Importantly, at the frame level, we exploit multi-task learning to weigh the losses automatically and obtain the balance coefficient. In addition, we establish the first five-speaker British English CS dataset. The proposed method is evaluated on the French and British English CS datasets, outperforming the state of the art (SOTA) by a large margin.
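The frame-level distillation and automatic loss weighting described in the abstract could be sketched roughly as below. This is a minimal illustration, not the authors' published code: the function names are invented, and the weighting form (uncertainty-based multi-task weighting in the style of Kendall et al.) is an assumption about how the balance coefficient might be learned.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def frame_level_kd_loss(student_logits, teacher_logits, T=2.0):
    """Frame-level distillation: KL(teacher || student) on per-frame
    soft targets, averaged over batch and frames, scaled by T^2."""
    p = softmax(teacher_logits, T)                    # teacher soft targets
    log_p = np.log(p + 1e-12)
    log_q = np.log(softmax(student_logits, T) + 1e-12)
    return float(np.mean(np.sum(p * (log_p - log_q), axis=-1)) * T**2)

def uncertainty_weighted_total(losses, log_vars):
    """Hypothetical automatic balancing of task losses:
    L = sum_i exp(-s_i) * L_i + s_i, where s_i are learnable."""
    losses = np.asarray(losses, dtype=float)
    s = np.asarray(log_vars, dtype=float)
    return float(np.sum(np.exp(-s) * losses + s))
```

In this sketch, the recognition loss (e.g. CTC on the student's CS features) and the distillation loss would be combined by `uncertainty_weighted_total`, with the `log_vars` updated by gradient descent alongside the model parameters, so no balance coefficient has to be hand-tuned.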
Pages: 2986-2990
Page count: 5
Related Papers
50 records in total
  • [1] CROSS-MODAL KNOWLEDGE DISTILLATION FOR ACTION RECOGNITION
    Thoker, Fida Mohammad
    Gall, Juergen
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2019, : 6 - 10
  • [2] Cross-modal knowledge distillation for continuous sign language recognition
    Gao, Liqing
    Shi, Peng
    Hu, Lianyu
    Feng, Jichao
    Zhu, Lei
    Wan, Liang
    Feng, Wei
    [J]. NEURAL NETWORKS, 2024, 179
  • [3] Progressive Cross-modal Knowledge Distillation for Human Action Recognition
    Ni, Jianyuan
    Ngu, Anne H. H.
    Yan, Yan
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 5903 - 5912
  • [4] DistilVPR: Cross-Modal Knowledge Distillation for Visual Place Recognition
    Wang, Sijie
    She, Rui
    Kang, Qiyu
    Jian, Xingchao
    Zhao, Kai
    Song, Yang
    Tay, Wee Peng
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 9, 2024, : 10377 - 10385
  • [5] CROSS-MODAL KNOWLEDGE DISTILLATION FOR VISION-TO-SENSOR ACTION RECOGNITION
    Ni, Jianyuan
    Sarbajna, Raunak
    Liu, Yang
    Ngu, Anne H. H.
    Yan, Yan
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4448 - 4452
  • [6] FedCMD: A Federated Cross-modal Knowledge Distillation for Drivers' Emotion Recognition
    Bano, Saira
    Tonellotto, Nicola
    Cassara, Pietro
    Gotta, Alberto
    [J]. ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, 2024, 15 (03)
  • [7] Speech Emotion Recognition via Multi-Level Cross-Modal Distillation
    Li, Ruichen
    Zhao, Jinming
    Jin, Qin
    [J]. INTERSPEECH 2021, 2021, : 4488 - 4492
  • [8] Electroglottograph-Based Speech Emotion Recognition via Cross-Modal Distillation
    Chen, Lijiang
    Ren, Jie
    Mao, Xia
    Zhao, Qi
    [J]. APPLIED SCIENCES-BASEL, 2022, 12 (09):
  • [9] Visual-to-EEG cross-modal knowledge distillation for continuous emotion recognition
    Zhang, Su
    Tang, Chuangao
    Guan, Cuntai
    [J]. PATTERN RECOGNITION, 2022, 130