Dynamic facial expression recognition with pseudo-label guided multi-modal pre-training

Cited: 0
Authors
Yin, Bing [1 ,2 ]
Yin, Shi [2 ,3 ]
Liu, Cong [2 ]
Zhang, Yanyong [3 ]
Xi, Changfeng [2 ]
Yin, Baocai [2 ]
Ling, Zhenhua [1 ]
Affiliations
[1] Univ Sci & Technol China, Sch Informat Sci & Technol, 443 Huangshan Rd, Hefei, Anhui, Peoples R China
[2] iFLYTEK Res, Hefei, Anhui, Peoples R China
[3] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei, Anhui, Peoples R China
Keywords
computer vision; emotion recognition; pattern recognition; EMOTION;
DOI
10.1049/cvi2.12217
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Due to the high cost of manual annotation, labelled data are often insufficient to train a dynamic facial expression recognition (DFER) model with good performance. To address this, the authors propose a multi-modal pre-training method with a pseudo-label guidance mechanism that makes full use of unlabelled video data to learn informative representations of facial expressions. First, the authors build a pre-training dataset of videos with aligned vision and audio modalities. Second, the vision and audio feature encoders are trained on the pre-training data with an instance discrimination strategy and a cross-modal alignment strategy. Third, the vision feature encoder is extended into a dynamic expression recogniser and fine-tuned on the labelled training data. Fourth, the fine-tuned recogniser is used to predict pseudo-labels for the pre-training data, and a new pre-training phase is then started under the guidance of these pseudo-labels to alleviate the long-tail distribution problem and the instance-class conflict. Fifth, since the representations learnt under pseudo-label guidance are more informative, a further fine-tuning phase is added to boost generalisation on the DFER task. Experimental results on the Dynamic Facial Expression in the Wild (DFEW) dataset demonstrate the superiority of the proposed method.
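To make the five-step pipeline above concrete, the following is a minimal sketch (PyTorch, not the authors' released code) of the losses involved: an instance discrimination term, a cross-modal vision-audio alignment term, and a pseudo-label guidance term modelled here as a supervised contrastive loss. Encoder outputs are replaced by random features, and the exact loss forms, temperature, and number of expression classes are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def info_nce(a, b, temperature=0.1):
        # Cross-modal alignment: each vision clip should match its own audio track.
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        logits = a @ b.t() / temperature            # (N, N) similarity matrix
        targets = torch.arange(a.size(0))           # positives on the diagonal
        return F.cross_entropy(logits, targets)

    def instance_discrimination(z1, z2, temperature=0.1):
        # Instance discrimination: two augmented views of the same clip should agree.
        return info_nce(z1, z2, temperature)

    def pseudo_label_guidance(z, pseudo_labels, temperature=0.1):
        # Assumed guidance form: clips sharing a pseudo-label attract each other
        # (supervised contrastive), easing the instance-class conflict.
        z = F.normalize(z, dim=-1)
        logits = z @ z.t() / temperature
        logits.fill_diagonal_(float('-inf'))        # exclude self-pairs
        same = pseudo_labels.unsqueeze(0) == pseudo_labels.unsqueeze(1)
        same.fill_diagonal_(False)
        log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
        pos_log_prob = torch.where(same, log_prob, torch.zeros_like(log_prob))
        pos_per_row = same.sum(dim=1).clamp(min=1)
        return -(pos_log_prob.sum(dim=1) / pos_per_row).mean()

    # Toy usage: random features stand in for encoder outputs.
    N, D = 8, 128
    vision, audio = torch.randn(N, D), torch.randn(N, D)
    view1 = vision + 0.01 * torch.randn(N, D)       # two "augmented" views
    view2 = vision + 0.01 * torch.randn(N, D)
    pseudo = torch.randint(0, 7, (N,))              # 7 expression classes (assumed)

    phase1_loss = instance_discrimination(view1, view2) + info_nce(vision, audio)
    phase2_loss = phase1_loss + pseudo_label_guidance(vision, pseudo)
    print(phase1_loss.item(), phase2_loss.item())

In the first pre-training phase only the first two terms would be used; after fine-tuning and pseudo-labelling, the second phase would add the guidance term so that clips assigned the same expression class are pulled together rather than treated as distinct instances.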
Pages: 33-45
Number of pages: 13
Related Papers
50 records in total
  • [1] MULTI-MODAL PRE-TRAINING FOR AUTOMATED SPEECH RECOGNITION
    Chan, David M.
    Ghosh, Shalini
    Chakrabarty, Debmalya
    Hoffmeister, Bjorn
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 246 - 250
  • [2] TableVLM: Multi-modal Pre-training for Table Structure Recognition
    Chen, Leiyuan
    Huang, Chengsong
    Zheng, Xiaoqing
    Lin, Jinshu
    Huang, Xuanjing
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 2437 - 2449
  • [3] Multi-Modal Contrastive Pre-training for Recommendation
    Liu, Zhuang
    Ma, Yunpu
    Schubert, Matthias
    Ouyang, Yuanxin
    Xiong, Zhang
    [J]. PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022, : 99 - 108
  • [4] Pseudo-Label Calibration Semi-supervised Multi-Modal Entity Alignment
    Wang, Luyao
    Qi, Pengnian
    Bao, Xigang
    Zhou, Chunlai
    Qin, Biao
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 8, 2024, : 9116 - 9124
  • [5] MGeo: Multi-Modal Geographic Language Model Pre-Training
    Ding, Ruixue
    Chen, Boli
    Xie, Pengjun
    Huang, Fei
    Li, Xin
    Zhang, Qiang
    Xu, Yao
    [J]. PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023, : 185 - 194
  • [6] DCSG: data complement pseudo-label refinement and self-guided pre-training for unsupervised person re-identification
    Han, Qing
    Chen, Jiongjin
    Min, Weidong
    Li, Jiahao
    Zhan, Lixin
    Li, Longfei
    [J]. VISUAL COMPUTER, 2024, 40 (10): : 7235 - 7248
  • [7] Real-time Emotion Pre-Recognition in Conversations with Contrastive Multi-modal Dialogue Pre-training
    Ju, Xincheng
    Zhang, Dong
    Zhu, Suyang
    Li, Junhui
    Li, Shoushan
    Zhou, Guodong
    [J]. PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023, 2023, : 1045 - 1055
  • [8] Multi-modal Masked Pre-training for Monocular Panoramic Depth Completion
    Yan, Zhiqiang
    Li, Xiang
    Wang, Kun
    Zhang, Zhenyu
    Li, Jun
    Yang, Jian
    [J]. COMPUTER VISION - ECCV 2022, PT I, 2022, 13661 : 378 - 395
  • [9] Versatile Multi-Modal Pre-Training for Human-Centric Perception
    Hong, Fangzhou
    Pan, Liang
    Cai, Zhongang
    Liu, Ziwei
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 16135 - 16145
  • [10] Active Learning with Contrastive Pre-training for Facial Expression Recognition
    Roy, Shuvendu
    Etemad, Ali
    [J]. 2023 11TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION, ACII, 2023,