Versatile Multi-Modal Pre-Training for Human-Centric Perception

Cited by: 4
Authors
Hong, Fangzhou [1]
Pan, Liang [1]
Cai, Zhongang [1,2,3]
Liu, Ziwei [1]
Affiliations
[1] Nanyang Technol Univ, S Lab, Singapore, Singapore
[2] SenseTime Res, Hong Kong, Peoples R China
[3] Shanghai AI Lab, Shanghai, Peoples R China
DOI: 10.1109/CVPR52688.2022.01568
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Human-centric perception plays a vital role in vision and graphics, but its data annotations are prohibitively expensive. It is therefore desirable to have a versatile pre-trained model that serves as a foundation for data-efficient transfer to downstream tasks. To this end, we propose HCMoCo, a Human-Centric Multi-Modal Contrastive Learning framework that leverages the multi-modal nature of human data (e.g., RGB, depth, 2D keypoints) for effective representation learning. This objective poses two main challenges: dense pre-training on multi-modal data and efficient usage of sparse human priors. To tackle them, we design two novel targets, Dense Intra-sample Contrastive Learning and Sparse Structure-aware Contrastive Learning, which hierarchically learn a modal-invariant latent space featuring continuous, ordinal feature distributions and structure-aware semantic consistency. HCMoCo pre-trains different modalities by combining heterogeneous datasets, allowing efficient usage of existing task-specific human data. Extensive experiments on four downstream tasks of different modalities demonstrate the effectiveness of HCMoCo, especially under data-efficient settings (7.16% and 12% improvements on DensePose estimation and human parsing, respectively). Moreover, we demonstrate the versatility of HCMoCo by exploring cross-modality supervision and missing-modality inference, validating its strong cross-modal association and reasoning ability. Code is available at https://github.com/hongfz16/HCMoCo.
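To make the cross-modal contrastive objective concrete, below is a minimal, illustrative sketch (not the authors' released code) of a generic InfoNCE loss of the kind such multi-modal contrastive frameworks build on, treating RGB and depth features of the same sample as the positive pair. All function and variable names here are assumptions for illustration; HCMoCo's actual targets additionally operate densely and incorporate sparse human priors.

```python
# Illustrative sketch: a generic cross-modal InfoNCE loss between
# two modality embeddings (e.g., RGB and depth). Names, shapes, and
# the temperature value are assumptions, not the authors' settings.
import torch
import torch.nn.functional as F


def cross_modal_info_nce(rgb_feat: torch.Tensor,
                         depth_feat: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE between two (B, D) modality embeddings.

    Row i of each tensor is assumed to come from the same sample, so
    (rgb_feat[i], depth_feat[i]) form the positive pair and all other
    rows in the batch serve as negatives.
    """
    rgb = F.normalize(rgb_feat, dim=1)
    depth = F.normalize(depth_feat, dim=1)
    logits = rgb @ depth.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(rgb.size(0), device=rgb.device)
    # Symmetrize over both retrieval directions: RGB->depth and depth->RGB.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Random features standing in for per-modality encoder outputs.
    rgb = torch.randn(8, 128)
    depth = torch.randn(8, 128)
    print(cross_modal_info_nce(rgb, depth).item())
```

Pulling both modalities toward a shared, modal-invariant latent space in this way is what enables the cross-modality supervision and missing-modality inference explored in the paper; see the linked repository for the full implementation.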
Pages: 16135-16145 (11 pages)
Related Papers (50 total)
  • [1] Multi-Modal Contrastive Pre-training for Recommendation
    Liu, Zhuang
    Ma, Yunpu
    Schubert, Matthias
    Ouyang, Yuanxin
    Xiong, Zhang
    [J]. PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022, : 99 - 108
  • [2] MULTI-MODAL PRE-TRAINING FOR AUTOMATED SPEECH RECOGNITION
    Chan, David M.
    Ghosh, Shalini
    Chakrabarty, Debmalya
    Hoffmeister, Bjorn
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 246 - 250
  • [3] MGeo: Multi-Modal Geographic Language Model Pre-Training
    Ding, Ruixue
    Chen, Boli
    Xie, Pengjun
    Huang, Fei
    Li, Xin
    Zhang, Qiang
    Xu, Yao
    [J]. PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023, : 185 - 194
  • [4] TableVLM: Multi-modal Pre-training for Table Structure Recognition
    Chen, Leiyuan
    Huang, Chengsong
    Zheng, Xiaoqing
    Lin, Jinshu
    Huang, Xuanjing
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 2437 - 2449
  • [5] Multi-modal Masked Pre-training for Monocular Panoramic Depth Completion
    Yan, Zhiqiang
    Li, Xiang
    Wang, Kun
    Zhang, Zhenyu
    Li, Jun
    Yang, Jian
    [J]. COMPUTER VISION - ECCV 2022, PT I, 2022, 13661 : 378 - 395
  • [6] Graph-Text Multi-Modal Pre-training for Medical Representation Learning
    Park, Sungjin
    Bae, Seongsu
    Kim, Jiho
    Kim, Tackeun
    Choi, Edward
    [J]. CONFERENCE ON HEALTH, INFERENCE, AND LEARNING, VOL 174, 2022, 174 : 261 - 281
  • [7] MMPT'21: International Joint Workshop on Multi-Modal Pre-Training for Multimedia Understanding
    Liu, Bei
    Fu, Jianlong
    Chen, Shizhe
    Jin, Qin
    Hauptmann, Alexander
    Rui, Yong
    [J]. PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 694 - 695
  • [8] PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts
    Li, Yunshui
    Hui, Binyuan
    Yin, Zhichao
    Yang, Min
    Huang, Fei
    Li, Yongbin
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 13402 - 13416
  • [9] The Effectiveness of Self-supervised Pre-training for Multi-modal Endometriosis Classification
    Butler, David
    Wang, Hu
    Zhang, Yuan
    To, Minh-Son
    Condous, George
    Leonardi, Mathew
    Knox, Steven
    Avery, Jodie
    Hull, M. Louise
    Carneiro, Gustavo
    [J]. 2023 45TH ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE & BIOLOGY SOCIETY, EMBC, 2023,
  • [10] Multi-modal Masked Autoencoders for Medical Vision-and-Language Pre-training
    Chen, Zhihong
    Du, Yuhao
    Hu, Jinpeng
    Liu, Yang
    Li, Guanbin
    Wan, Xiang
    Chang, Tsung-Hui
    [J]. MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2022, PT V, 2022, 13435 : 679 - 689