Multi-Modal Representation Learning with Text-Driven Soft Masks

Cited by: 3
Authors
Park, Jaeyoo [1]
Han, Bohyung [1,2]
Affiliations
[1] Seoul Natl Univ, Comp Vis Lab, ECE, Seoul, South Korea
[2] Seoul Natl Univ, IPAI, Seoul, South Korea
DOI
10.1109/CVPR52729.2023.00274
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text matching (ITM) task by soft-masking the regions in an image that are most relevant to a certain word in the corresponding caption, instead of completely removing them. Since our framework relies only on image-caption pairs with no fine-grained annotations, we identify the regions relevant to each word by computing the word-conditional visual attention using a multi-modal encoder. Second, we encourage the model to focus more on hard but diverse examples by proposing a focal loss for the image-text contrastive learning (ITC) objective, which alleviates the inherent overfitting and bias issues. Last, we perform multi-modal data augmentation for self-supervised learning by mining various examples through masking texts and rendering distortions on images. We show that the combination of these three innovations is effective for learning a pretrained model, leading to outstanding performance on multiple vision-language downstream tasks.
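For illustration, the sketch below shows how two of the components described in the abstract might look in practice: a soft mask that attenuates, rather than removes, the image patches most attended by a word, and an ITC loss with a focal modulation that up-weights hard pairs. This is a minimal PyTorch sketch under assumed tensor shapes and hyperparameters; the function names (soft_mask_features, focal_itc_loss) and parameters (strength, gamma, tau) are illustrative and not taken from the paper.

```python
# Hedged sketch of text-driven soft masking and a focal ITC loss.
# All names, shapes, and default values are assumptions for illustration only.
import torch
import torch.nn.functional as F


def soft_mask_features(patch_feats, word_to_patch_attn, strength=0.5):
    """Attenuate (rather than remove) the patches most relevant to a word.

    patch_feats:        (B, N, D) visual features for N image patches
    word_to_patch_attn: (B, N) word-conditional attention over patches,
                        e.g. read off a cross-attention layer of a
                        multi-modal encoder and normalized over patches
    strength:           how strongly the most attended patches are suppressed
    """
    # Scale attention to [0, 1] per image, then build a soft mask in [1 - strength, 1].
    attn = word_to_patch_attn / word_to_patch_attn.amax(dim=1, keepdim=True).clamp_min(1e-6)
    soft_mask = 1.0 - strength * attn
    return patch_feats * soft_mask.unsqueeze(-1)  # soft-masked features for ITM


def focal_itc_loss(image_emb, text_emb, gamma=2.0, tau=0.07):
    """Image-text contrastive loss with a focal term that down-weights easy pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / tau                    # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)

    def one_direction(lg):
        # p_correct: softmax probability of the matched pair; easy pairs (p near 1)
        # contribute little because of the (1 - p)^gamma modulation.
        log_p = F.log_softmax(lg, dim=1)
        p_correct = log_p.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
        ce = F.cross_entropy(lg, targets, reduction="none")
        return ((1.0 - p_correct) ** gamma * ce).mean()

    # Symmetric image-to-text and text-to-image objectives.
    return 0.5 * (one_direction(logits) + one_direction(logits.t()))


if __name__ == "__main__":
    # Hypothetical shapes: 4 image-text pairs, 196 patches, 256-dim embeddings.
    feats = torch.randn(4, 196, 256)
    attn = torch.rand(4, 196).softmax(dim=1)
    masked = soft_mask_features(feats, attn)
    img_emb, txt_emb = torch.randn(4, 256), torch.randn(4, 256)
    print(masked.shape, focal_itc_loss(img_emb, txt_emb).item())
```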
Pages: 2798-2807
Page count: 10
Related Papers
50 items in total
  • [41] TeSTNeRF: Text-Driven 3D Style Transfer via Cross-Modal Learning
    Chen, Jiafu
    Ji, Boyan
    Zhang, Zhanjie
    Chu, Tianyi
    Zuo, Zhiwen
    Zhao, Lei
    Xing, Wei
    Lu, Dongming
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 5788 - 5796
  • [42] CLMTR: a generic framework for contrastive multi-modal trajectory representation learning
    Liang, Anqi
    Yao, Bin
    Xie, Jiong
    Zheng, Wenli
    Shen, Yanyan
    Ge, Qiqi
    GEOINFORMATICA, 2024, : 233 - 253
  • [43] Unsupervised Multi-modal Learning
    Iqbal, Mohammed Shameer
    ADVANCES IN ARTIFICIAL INTELLIGENCE (AI 2015), 2015, 9091 : 343 - 346
  • [44] Online Multi-modal Task-Driven Dictionary Learning and Robust Joint Sparse Representation for Visual Tracking
    Taalimi, Ali
    Qi, Hairong
    Khorsandi, Rahman
    2015 12TH IEEE INTERNATIONAL CONFERENCE ON ADVANCED VIDEO AND SIGNAL BASED SURVEILLANCE (AVSS), 2015,
  • [45] Learning Multi-modal Similarity
    McFee, Brian
    Lanckriet, Gert
    JOURNAL OF MACHINE LEARNING RESEARCH, 2011, 12 : 491 - 523
  • [46] Multi-modal Feedback for Affordance-driven Interactive Reinforcement Learning
    Cruz, Francisco
Parisi, German I.
    Wermter, Stefan
    2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2018,
  • [47] DrFuse: Learning Disentangled Representation for Clinical Multi-Modal Fusion with Missing Modality and Modal Inconsistency
    Yao, Wenfang
    Yin, Kejing
    Cheung, William K.
    Liu, Jia
    Qin, Jing
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 15, 2024, : 16416 - 16424
  • [48] Multi-Region Text-Driven Manipulation of Diffusion Imagery
    Li, Yiming
    Zhou, Peng
    Sun, Jun
    Xu, Yi
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 4, 2024, : 3261 - 3269
  • [49] Multi-Modal Learning with Joint Image-Text Embeddings and Decoder Networks
    Chemmanam, Ajai John
    Jose, Bijoy A.
    Moopan, Asif
    2024 IEEE 7TH INTERNATIONAL CONFERENCE ON INDUSTRIAL CYBER-PHYSICAL SYSTEMS, ICPS 2024, 2024,
  • [50] Adversarial Attentive Multi-Modal Embedding Learning for Image-Text Matching
    Wei, Kaimin
    Zhou, Zhibo
IEEE ACCESS, 2020, 8(08): 96237 - 96248