Mixed Autoencoder for Self-supervised Visual Representation Learning

Cited by: 10
Authors
Chen, Kai [1 ]
Liu, Zhili [1 ,2 ]
Hong, Lanqing [2 ]
Xu, Hang [2 ]
Li, Zhenguo [2 ]
Yeung, Dit-Yan [1 ]
Affiliations
[1] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[2] Huawei Noah's Ark Lab, Montreal, PQ, Canada
Keywords
DOI
10.1109/CVPR52729.2023.02178
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Masked Autoencoder (MAE) has demonstrated superior performance on various vision tasks by randomly masking image patches and reconstructing them. However, effective data augmentation strategies for MAE remain an open question, unlike in contrastive learning, where augmentation plays a central role. This paper studies the prevailing mixing augmentation for MAE. We first demonstrate that naive mixing in fact degrades model performance due to the increase in mutual information (MI). To address this, we propose homologous recognition, an auxiliary pretext task that not only alleviates the MI increase by explicitly requiring each patch to recognize its homologous patches, but also performs object-aware self-supervised pre-training for better downstream dense perception. With extensive experiments, we demonstrate that our proposed Mixed Autoencoder (MixedAE) achieves state-of-the-art transfer results among masked image modeling (MIM) augmentations on different downstream tasks, with significant efficiency gains. Specifically, MixedAE outperforms MAE by +0.3% accuracy, +1.7 mIoU and +0.9 AP on ImageNet-1K, ADE20K and COCO respectively with a standard ViT-Base. Moreover, MixedAE surpasses iBOT, a strong MIM method combined with instance discrimination, while accelerating training by 2x. To the best of our knowledge, this is the first work to consider mixing for MIM from the perspective of pretext task design. Code will be made available.
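The abstract's core mechanism, mixing patches from two images and asking each patch to recognize its homologous patches, can be illustrated with a minimal sketch. The snippet below is an illustration under stated assumptions, not the authors' released code: the function name `mix_patches`, the default 50% mix ratio, and the pairwise boolean homology target are hypothetical choices made for exposition.

```python
import torch

def mix_patches(patches_a, patches_b, mix_ratio=0.5):
    """Patch-level mixing of two images, a minimal sketch.

    patches_a, patches_b: (N, D) tensors of N patch tokens from two
    different images. A random subset of positions is taken from
    image B, the rest from image A.

    Returns the mixed patch sequence and a boolean (N, N) homology
    matrix whose (i, j) entry is True iff patches i and j originate
    from the same source image.
    """
    num_patches = patches_a.size(0)
    num_from_b = int(num_patches * mix_ratio)

    # Randomly choose which positions come from image B.
    from_b = torch.zeros(num_patches, dtype=torch.bool)
    from_b[torch.randperm(num_patches)[:num_from_b]] = True

    # Assemble the mixed sequence position by position.
    mixed = torch.where(from_b.unsqueeze(-1), patches_b, patches_a)

    # Homologous-recognition target: each patch should identify
    # which other patches came from the same image as itself.
    homologous = from_b.unsqueeze(0) == from_b.unsqueeze(1)
    return mixed, homologous
```

In pre-training, an auxiliary objective could supervise patch features against `homologous` (for example, a binary prediction per patch pair), discouraging patches from aggregating information across source images; the exact loss and its combination with the MAE reconstruction objective follow the paper's formulation, which this sketch does not reproduce.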
Pages: 22742-22751
Page count: 10
Related Papers (50 total)
  • [31] Self-Distilled Self-supervised Representation Learning
    Jang, Jiho
    Kim, Seonhoon
    Yoo, Kiyoon
    Kong, Chaerin
    Kim, Jangho
    Kwak, Nojun
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 2828 - 2838
  • [32] Towards Latent Masked Image Modeling for Self-supervised Visual Representation Learning
    Wei, Yibing
    Gupta, Abhinav
    Morgado, Pedro
    COMPUTER VISION - ECCV 2024, PT XXXIX, 2025, 15097 : 1 - 17
  • [33] solo-learn: A Library of Self-supervised Methods for Visual Representation Learning
    Turrisi da Costa, Victor G.
    Fini, Enrico
    Nabi, Moin
    Sebe, Nicu
    Ricci, Elisa
JOURNAL OF MACHINE LEARNING RESEARCH, 2022, 23 : 1 - 6
  • [34] Semantics-Consistent Feature Search for Self-Supervised Visual Representation Learning
    Song, Kaiyou
    Zhang, Shan
    Luo, Zimeng
    Wang, Tong
    Xie, Jin
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 16053 - 16062
  • [35] Feature selection and cascade dimensionality reduction for self-supervised visual representation learning
    Qu, Peixin
    Jin, Songlin
    Tian, Yongqin
    Zhou, Ling
    Zheng, Ying
    Zhang, Weidong
    Xu, Yibo
    Pan, Xipeng
    Zhao, Wenyi
    COMPUTERS & ELECTRICAL ENGINEERING, 2023, 106
  • [36] Self-Supervised Representation Learning using Visual Field Expansion on Digital Pathology
    Boyd, Joseph
    Liashuha, Mykola
    Deutsch, Eric
    Paragios, Nikos
    Christodoulidis, Stergios
    Vakalopoulou, Maria
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 639 - 647
  • [37] Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning
    Ge, Chongjian
    Liang, Youwei
    Song, Yibing
    Jiao, Jianbo
    Wang, Jue
    Luo, Ping
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [38] Self-Supervised Audio-Visual Representation Learning for in-the-wild Videos
    Feng, Zishun
    Tu, Ming
    Xia, Rui
    Wang, Yuxuan
    Krishnamurthy, Ashok
    2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020, : 5671 - 5672
  • [40] Self-Supervised Speech Representation Learning: A Review
    Mohamed, Abdelrahman
    Lee, Hung-yi
    Borgholt, Lasse
    Havtorn, Jakob D.
    Edin, Joakim
    Igel, Christian
    Kirchhoff, Katrin
    Li, Shang-Wen
    Livescu, Karen
    Maaloe, Lars
    Sainath, Tara N.
    Watanabe, Shinji
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06) : 1179 - 1210