Multi-Modal Representation Learning with Text-Driven Soft Masks

Cited by: 3
Authors
Park, Jaeyoo [1 ]
Han, Bohyung [1 ,2 ]
Affiliations
[1] Seoul Natl Univ, Comp Vis Lab, ECE, Seoul, South Korea
[2] Seoul Natl Univ, IPAI, Seoul, South Korea
DOI
10.1109/CVPR52729.2023.00274
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text matching (ITM) task by soft-masking the regions in an image that are most relevant to a certain word in the corresponding caption, instead of completely removing them. Since our framework relies only on image-caption pairs with no fine-grained annotations, we identify the regions relevant to each word by computing the word-conditional visual attention using the multi-modal encoder. Second, we encourage the model to focus on hard but diverse examples by proposing a focal loss for the image-text contrastive (ITC) objective, which alleviates overfitting and bias issues. Last, we perform multi-modal data augmentation for self-supervised learning by mining diverse examples, masking words in texts and rendering distortions on images. We show that the combination of these three innovations is effective for learning a pretrained model, leading to outstanding performance on multiple vision-language downstream tasks.
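This record contains no reference implementation, so below is a minimal PyTorch-style sketch of two of the three components described in the abstract: text-driven soft masking and the focal ITC loss. It is an illustration under stated assumptions, not the authors' code; the function names, tensor shapes, and the exact mask and weighting formulas (soft_mask_image_features, focal_itc_loss, tau, gamma) are hypothetical.

    import torch
    import torch.nn.functional as F

    def soft_mask_image_features(patch_feats, cross_attn, word_idx, tau=0.5):
        # Text-driven soft mask: attenuate, rather than zero out, the image
        # patches most attended by one caption word. Assumed shapes:
        #   patch_feats: (B, P, D) patch embeddings from the image encoder
        #   cross_attn:  (B, T, P) word-to-patch attention from the multi-modal encoder
        #   word_idx:    index of the caption word driving the mask
        #   tau:         assumed temperature controlling mask sharpness
        attn = cross_attn[:, word_idx, :]                      # (B, P) patch relevance
        attn = attn / attn.amax(dim=-1, keepdim=True)          # scale to [0, 1]
        mask = 1.0 - torch.sigmoid((attn - attn.mean(dim=-1, keepdim=True)) / tau)
        return patch_feats * mask.unsqueeze(-1)                # down-weight relevant patches

    def focal_itc_loss(img_emb, txt_emb, temperature=0.07, gamma=2.0):
        # Focal reweighting of the InfoNCE image-text contrastive loss:
        # the cross-entropy term of each matched pair is scaled by
        # (1 - p)^gamma, so hard (low-probability) positives dominate.
        # img_emb, txt_emb: (B, D) L2-normalized global embeddings.
        logits = img_emb @ txt_emb.t() / temperature           # (B, B) similarities
        targets = torch.arange(logits.size(0), device=logits.device)

        def one_direction(lg):
            log_p_pos = F.log_softmax(lg, dim=-1).gather(1, targets[:, None]).squeeze(1)
            p_pos = log_p_pos.exp()                            # prob of the matched pair
            return ((1.0 - p_pos) ** gamma * -log_p_pos).mean()

        return 0.5 * (one_direction(logits) + one_direction(logits.t()))

In this sketch, setting gamma=0 recovers the standard symmetric ITC loss, and a larger tau flattens the soft mask toward uniform weighting, i.e., a milder perturbation than hard masking.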
Pages: 2798-2807
Page count: 10