Multi-Modal Representation Learning with Text-Driven Soft Masks

Cited by: 3
Authors
Park, Jaeyoo [1]
Han, Bohyung [1,2]
Affiliations
[1] Seoul Natl Univ, Comp Vis Lab, ECE, Seoul, South Korea
[2] Seoul Natl Univ, IPAI, Seoul, South Korea
DOI
10.1109/CVPR52729.2023.00274
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text matching (ITM) task by soft-masking the regions in an image that are most relevant to a certain word in the corresponding caption, instead of completely removing them. Since our framework relies only on image-caption pairs with no fine-grained annotations, we identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder. Second, we encourage the model to focus on hard but diverse examples by proposing a focal loss for the image-text contrastive learning (ITC) objective, which alleviates the inherent overfitting and bias issues. Last, we perform multi-modal data augmentation for self-supervised learning by mining diverse examples, masking words in texts and applying distortions to images. We show that the combination of these three innovations is effective for learning a pretrained model, leading to outstanding performance on multiple vision-language downstream tasks.
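The two mechanisms named in the abstract lend themselves to a compact illustration. Below is a minimal PyTorch sketch written for this record, not taken from the authors' code: the names soft_mask_regions and focal_itc_loss are hypothetical, and the normalization and weighting details are assumptions rather than the paper's exact formulation.

import torch
import torch.nn.functional as F

def soft_mask_regions(region_feats: torch.Tensor,
                      attn: torch.Tensor,
                      strength: float = 0.5) -> torch.Tensor:
    """Attenuate, rather than remove, the regions most attended by a caption word.

    region_feats: (B, R, D) region/patch embeddings from the image encoder.
    attn:         (B, R) word-conditional cross-attention weights over regions
                  (assumed to come from the multi-modal encoder).
    strength:     how strongly the most-attended regions are suppressed.
    """
    attn = attn / attn.amax(dim=1, keepdim=True).clamp_min(1e-6)  # rescale to [0, 1]
    keep = 1.0 - strength * attn                                  # soft mask in [1 - strength, 1]
    return region_feats * keep.unsqueeze(-1)

def focal_itc_loss(img_emb: torch.Tensor,
                   txt_emb: torch.Tensor,
                   temperature: float = 0.07,
                   gamma: float = 2.0) -> torch.Tensor:
    """InfoNCE-style image-text contrastive loss with a focal term (1 - p)^gamma.

    Easy, well-matched pairs are down-weighted so hard examples dominate the
    gradient; gamma = 0 recovers the standard ITC loss.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)

    def one_side(lg: torch.Tensor) -> torch.Tensor:
        # Probability assigned to the matching pair on the diagonal.
        p = lg.softmax(dim=1).gather(1, targets[:, None]).squeeze(1)
        return (-(1.0 - p).pow(gamma) * p.clamp_min(1e-6).log()).mean()

    # Symmetrize over image-to-text and text-to-image directions.
    return 0.5 * (one_side(logits) + one_side(logits.t()))

if __name__ == "__main__":
    B, R, D = 4, 16, 256
    masked = soft_mask_regions(torch.randn(B, R, D), torch.rand(B, R))
    loss = focal_itc_loss(torch.randn(B, D), torch.randn(B, D))
    print(masked.shape, loss.item())

Note the multiplicative soft mask keeps every region visible to the ITM head at reduced strength, which is what distinguishes this operation from hard token dropping.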
Pages: 2798 - 2807
Page count: 10
Related Papers
50 records in total
  • [1] Utilizing Text-Video Relationships: A Text-Driven Multi-modal Fusion Framework for Moment Retrieval and Highlight Detection
    Zhou, Siyu
Zhang, Fuwei
    Wang, Ruomei
    Su, Zhuo
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT X, 2025, 15040 : 254 - 268
  • [2] Multi-modal Network Representation Learning
    Zhang, Chuxu
    Jiang, Meng
    Zhang, Xiangliang
    Ye, Yanfang
Chawla, Nitesh V.
    KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2020, : 3557 - 3558
  • [3] Multi-modal Learning with Text Merging for TextVQA
    Xu, Changsheng
    Xu, Zhenlong
    He, Yifan
    Zhou, Shuigeng
    Guan, Jihong
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 1985 - 1989
  • [4] Mineral: Multi-modal Network Representation Learning
    Kefato, Zekarias T.
    Sheikh, Nasrullah
    Montresor, Alberto
    MACHINE LEARNING, OPTIMIZATION, AND BIG DATA, MOD 2017, 2018, 10710 : 286 - 298
  • [5] Scalable multi-modal representation learning networks
    Fang, Zihan
    Zou, Ying
    Lan, Shiyang
    Du, Shide
    Tan, Yanchao
    Wang, Shiping
    ARTIFICIAL INTELLIGENCE REVIEW, 58 (7)
  • [6] Graph-Text Multi-Modal Pre-training for Medical Representation Learning
    Park, Sungjin
    Bae, Seongsu
    Kim, Jiho
    Kim, Tackeun
    Choi, Edward
    CONFERENCE ON HEALTH, INFERENCE, AND LEARNING, VOL 174, 2022, 174 : 261 - 281
  • [7] Graph and text multi-modal representation learning with momentum distillation on Electronic Health Records
    Cao, Yu
    Wang, Xu
    Wang, Qian
    Yuan, Zhong
    Shi, Yongguo
    Peng, Dezhong
    KNOWLEDGE-BASED SYSTEMS, 2024, 302
  • [8] Attention driven multi-modal similarity learning
    Gao, Xinjian
    Mu, Tingting
    Goulermas, John Y.
    Wang, Meng
    INFORMATION SCIENCES, 2018, 432 : 530 - 542
  • [9] Fast Multi-Modal Unified Sparse Representation Learning
    Verma, Mridula
    Shukla, Kaushal Kumar
    PROCEEDINGS OF THE 2017 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR'17), 2017, : 448 - 452
  • [10] Multi-modal Representation Learning for Successive POI Recommendation
    Li, Lishan
    Liu, Ying
    Wu, Jianping
    He, Lin
    Ren, Gang
    ASIAN CONFERENCE ON MACHINE LEARNING, VOL 101, 2019, 101 : 441 - 456