Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition

Cited by: 4
Authors
Wang, Jianrong [1 ]
Tang, Ziyue [2 ]
Li, Xuewei [1 ]
Yu, Mei [1 ]
Fang, Qiang [3 ]
Liu, Li [4 ]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
[2] Tianjin Univ, Tianjin Int Engn Inst, Tianjin, Peoples R China
[3] Chinese Acad Social Sci, Inst Linguist, Beijing, Peoples R China
[4] Chinese Univ Hong Kong, Shenzhen Res Inst Big Data, Shenzhen, Peoples R China
Source
INTERSPEECH 2021
Funding
National Natural Science Foundation of China;
Keywords
Cued Speech; Cross-modal knowledge distillation; Teacher-student structure; Cued Speech recognition;
DOI
10.21437/Interspeech.2021-432
CLC classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject classification codes
100104; 100213;
Abstract
Cued Speech (CS) is a visual communication system for deaf or hearing-impaired people. It combines lip movements with hand cues to convey a complete phonetic repertoire. Current deep-learning-based methods for automatic CS recognition share a common problem: data scarcity. To date, there are only two public single-speaker datasets, for French (238 sentences) and British English (97 sentences). In this work, we propose a cross-modal knowledge distillation method with a teacher-student structure, which transfers audio speech information to CS to overcome the limited-data problem. First, we pretrain a teacher model for CS recognition with a large amount of open-source audio speech data, and simultaneously pretrain the feature extractors for lips and hands using CS data. Then, we distill the knowledge from the teacher model to the student model with frame-level and sequence-level distillation strategies. Importantly, at the frame level, we exploit multi-task learning to weigh the losses automatically and obtain the balancing coefficients. In addition, we establish the first five-speaker British English CS dataset. The proposed method is evaluated on the French and British English CS datasets and outperforms the state-of-the-art (SOTA) by a large margin.
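As a concrete illustration of the frame-level and sequence-level distillation with automatic loss weighting described in the abstract, the following is a minimal PyTorch sketch, not the authors' implementation: the class name DistillationLoss, the choice of CTC for the sequence-level term, the temperature value, and the uncertainty-based weighting (one learnable log-variance per loss term, in the style of Kendall et al., 2018) are all assumptions inferred from the abstract.

    # Minimal sketch of teacher-student distillation with a frame-level KL
    # term, a sequence-level CTC term, and learned loss-balancing weights.
    # Assumed design, not the authors' released code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DistillationLoss(nn.Module):
        def __init__(self, temperature: float = 2.0):
            super().__init__()
            self.T = temperature
            # One learnable log-variance per loss term; trained jointly with
            # the student so the balancing coefficients are set automatically.
            self.log_var_frame = nn.Parameter(torch.zeros(()))
            self.log_var_seq = nn.Parameter(torch.zeros(()))
            self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

        def forward(self, student_logits, teacher_logits,
                    targets, input_lengths, target_lengths):
            # student_logits, teacher_logits: (batch, frames, classes).
            # Frame-level: match the frozen teacher's temperature-softened
            # per-frame posteriors (teacher gradients are detached).
            frame_loss = F.kl_div(
                F.log_softmax(student_logits / self.T, dim=-1),
                F.softmax(teacher_logits.detach() / self.T, dim=-1),
                reduction="batchmean",
            ) * (self.T ** 2)

            # Sequence-level: supervise the student with the label sequence.
            # CTC is one plausible criterion; the paper may use another.
            log_probs = F.log_softmax(student_logits, dim=-1).transpose(0, 1)
            seq_loss = self.ctc(log_probs, targets,
                                input_lengths, target_lengths)

            # Homoscedastic-uncertainty weighting:
            #   total = sum_i exp(-s_i) * L_i + s_i, with each s_i learned.
            return (torch.exp(-self.log_var_frame) * frame_loss
                    + self.log_var_frame
                    + torch.exp(-self.log_var_seq) * seq_loss
                    + self.log_var_seq)

Because the log-variances are module parameters, registering this loss as part of the training graph lets the optimizer tune the balance between the two terms instead of requiring a hand-set coefficient.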
Pages: 2986 - 2990
Number of pages: 5