HiCMAE: Hierarchical Contrastive Masked Autoencoder for self-supervised Audio-Visual Emotion Recognition

Cited by: 6
Authors
Sun, Licai [1 ,2 ]
Lian, Zheng [1 ]
Liu, Bin [1 ,2 ]
Tao, Jianhua [3 ,4 ]
Affiliations
[1] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
[3] Tsinghua Univ, Dept Automat, Beijing, Peoples R China
[4] Tsinghua Univ, Beijing Natl Res Ctr Informat Sci & Technol, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Audio-Visual Emotion Recognition; Self-supervised learning; Masked autoencoder; Contrastive learning; FACIAL EXPRESSION RECOGNITION; FEATURES; AUDIO;
DOI
10.1016/j.inffus.2024.102382
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in recent years for its critical role in creating emotion-aware intelligent machines. Previous efforts in this area are dominated by the supervised learning paradigm. Despite significant progress, supervised learning is meeting its bottleneck due to the longstanding data scarcity issue in AVER. Motivated by recent advances in self-supervised learning, we propose the Hierarchical Contrastive Masked Autoencoder (HiCMAE), a novel self-supervised framework that leverages large-scale self-supervised pre-training on vast unlabeled audio-visual data to promote the advancement of AVER. Following prior art in self-supervised audio-visual representation learning, HiCMAE adopts two primary forms of self-supervision for pre-training, namely masked data modeling and contrastive learning. Unlike previous methods, which focus exclusively on top-layer representations while neglecting explicit guidance of intermediate layers, HiCMAE develops a three-pronged strategy to foster hierarchical audio-visual feature learning and improve the overall quality of learned representations. First, it incorporates hierarchical skip connections between the encoder and decoder to encourage intermediate layers to learn more meaningful representations and bolster masked audio-visual reconstruction. Second, hierarchical cross-modal contrastive learning is applied to intermediate representations to progressively narrow the audio-visual modality gap and facilitate subsequent cross-modal fusion. Finally, during downstream fine-tuning, HiCMAE employs hierarchical feature fusion to comprehensively integrate multi-level features from different layers. To verify the effectiveness of HiCMAE, we conduct extensive experiments on 9 datasets covering both categorical and dimensional AVER tasks.
Experimental results show that our method significantly outperforms state-of-the-art supervised and self-supervised audio-visual methods, which indicates that HiCMAE is a powerful audio-visual emotion representation learner. Codes and models are publicly available at https://github.com/sunlicai/HiCMAE.
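The hierarchical cross-modal contrastive learning described in the abstract can be sketched in a minimal NumPy form. This is an illustration only, not the paper's implementation: it assumes a symmetric InfoNCE objective computed per intermediate layer pair and summed across layers; the function names, temperature value, and uniform layer weighting are hypothetical.

```python
import numpy as np

def info_nce(audio_feats, visual_feats, temperature=0.07):
    """Symmetric InfoNCE loss between paired audio/visual embeddings.
    audio_feats, visual_feats: (batch, dim) arrays; row i of each is a pair."""
    a = audio_feats / np.linalg.norm(audio_feats, axis=1, keepdims=True)
    v = visual_feats / np.linalg.norm(visual_feats, axis=1, keepdims=True)
    logits = a @ v.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(a))                 # matching pairs lie on the diagonal

    def xent(lg):
        # cross-entropy with the diagonal as the target class, numerically stable
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average audio->visual and visual->audio directions
    return 0.5 * (xent(logits) + xent(logits.T))

def hierarchical_contrastive_loss(audio_layers, visual_layers, weights=None):
    """Apply the contrastive objective at every intermediate layer pair,
    not only the top layer, and sum the (optionally weighted) losses."""
    if weights is None:
        weights = [1.0] * len(audio_layers)
    return sum(w * info_nce(a, v)
               for w, a, v in zip(weights, audio_layers, visual_layers))
```

In this sketch, perfectly aligned audio and visual features at every layer yield a near-zero loss, while mismatched features are penalized at each layer, which is one way to realize the progressive narrowing of the modality gap that the abstract describes.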
Pages: 17
Related Papers (50 total)
  • [1] HiCMAE: Hierarchical Contrastive Masked Autoencoder for self-supervised Audio-Visual Emotion Recognition
    Sun, Licai
    Lian, Zheng
    Liu, Bin
    Tao, Jianhua
    Information Fusion, 2024, 108
  • [2] SELF-SUPERVISED CONTRASTIVE LEARNING FOR AUDIO-VISUAL ACTION RECOGNITION
    Liu, Yang
    Tan, Ying
    Lan, Haoyuan
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 1000 - 1004
  • [3] Robust Self-Supervised Audio-Visual Speech Recognition
    Shi, Bowen
    Hsu, Wei-Ning
    Mohamed, Abdelrahman
    INTERSPEECH 2022, 2022, : 2118 - 2122
  • [4] Universal Sound Separation with Self-Supervised Audio Masked Autoencoder
    Zhao, Junqi
    Liu, Xubo
    Zhao, Jinzheng
    Yuan, Yi
    Kong, Qiuqiang
    Plumbley, Mark D.
    Wang, Wenwu
    32ND EUROPEAN SIGNAL PROCESSING CONFERENCE, EUSIPCO 2024, 2024, : 1 - 5
  • [5] A Survey on Masked Autoencoder for Visual Self-supervised Learning
    Zhang, Chaoning
    Zhang, Chenshuang
    Song, Junha
    Yi, John Seon Keun
    Kweon, In So
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 6805 - 6813
  • [6] Self-Supervised Audio-Visual Soundscape Stylization
    Li, Tingle
    Wang, Renhao
    Huang, Po-Yao
    Owens, Andrew
    Anumanchipalli, Gopala
    COMPUTER VISION - ECCV 2024, PT LXXX, 2025, 15138 : 20 - 40
  • [7] Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition
    Pan, Xichen
    Chen, Peiyu
    Gong, Yichen
    Zhou, Helong
    Wang, Xinbing
    Lin, Zhouhan
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 4491 - 4503
  • [8] Audio-Visual Self-Supervised Terrain Type Recognition for Ground Mobile Platforms
    Kurobe, Akiyoshi
    Nakajima, Yoshikatsu
    Kitani, Kris
    Saito, Hideo
    IEEE ACCESS, 2021, 9 : 29970 - 29979
  • [9] Masked Graph Autoencoder for Self-Supervised Transportation Mode Recognition
    Zeng, Ziyi
    Wang, Guanwen
    Zhang, Yifan
    Guan, Qingfeng
    Yu, Wenhao
    TRANSACTIONS IN GIS, 2025, 29 (01)
  • [10] SELF-SUPERVISED AUDIO-VISUAL CO-SEGMENTATION
    Rouditchenko, Andrew
    Zhao, Hang
    Gan, Chuang
    McDermott, Josh
    Torralba, Antonio
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 2357 - 2361