Multi-Modal Representation Learning with Text-Driven Soft Masks

Cited by: 3
Authors
Park, Jaeyoo [1 ]
Han, Bohyung [1 ,2 ]
Affiliations
[1] Seoul Natl Univ, Comp Vis Lab, ECE, Seoul, South Korea
[2] Seoul Natl Univ, IPAI, Seoul, South Korea
DOI
10.1109/CVPR52729.2023.00274
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text matching (ITM) task by soft-masking the regions in an image that are most relevant to a certain word in the corresponding caption, instead of completely removing them. Since our framework relies only on image-caption pairs with no fine-grained annotations, we identify the regions relevant to each word by computing the word-conditional visual attention using the multi-modal encoder. Second, we encourage the model to focus on hard but diverse examples by proposing a focal loss for the image-text contrastive (ITC) objective, which alleviates overfitting and bias issues. Last, we perform multi-modal data augmentation for self-supervised learning by mining diverse examples, masking words in texts and rendering distortions on images. We show that the combination of these three innovations is effective for learning a pretrained model, leading to outstanding performance on multiple vision-language downstream tasks.
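This record contains no reference implementation, so below is a minimal PyTorch-style sketch of two of the three components described in the abstract: text-driven soft masking and the focal ITC loss. It is an illustration under stated assumptions, not the authors' code; the function names, tensor shapes, and the exact mask and weighting formulas (soft_mask_image_features, focal_itc_loss, tau, gamma) are hypothetical.

    import torch
    import torch.nn.functional as F

    def soft_mask_image_features(patch_feats, cross_attn, word_idx, tau=0.5):
        # Text-driven soft mask: attenuate, rather than zero out, the image
        # patches most attended by one caption word. Assumed shapes:
        #   patch_feats: (B, P, D) patch embeddings from the image encoder
        #   cross_attn:  (B, T, P) word-to-patch attention from the multi-modal encoder
        #   word_idx:    index of the caption word driving the mask
        #   tau:         assumed temperature controlling mask sharpness
        attn = cross_attn[:, word_idx, :]                      # (B, P) patch relevance
        attn = attn / attn.amax(dim=-1, keepdim=True)          # scale to [0, 1]
        mask = 1.0 - torch.sigmoid((attn - attn.mean(dim=-1, keepdim=True)) / tau)
        return patch_feats * mask.unsqueeze(-1)                # down-weight relevant patches

    def focal_itc_loss(img_emb, txt_emb, temperature=0.07, gamma=2.0):
        # Focal reweighting of the InfoNCE image-text contrastive loss:
        # the cross-entropy term of each matched pair is scaled by
        # (1 - p)^gamma, so hard (low-probability) positives dominate.
        # img_emb, txt_emb: (B, D) L2-normalized global embeddings.
        logits = img_emb @ txt_emb.t() / temperature           # (B, B) similarities
        targets = torch.arange(logits.size(0), device=logits.device)

        def one_direction(lg):
            log_p_pos = F.log_softmax(lg, dim=-1).gather(1, targets[:, None]).squeeze(1)
            p_pos = log_p_pos.exp()                            # prob of the matched pair
            return ((1.0 - p_pos) ** gamma * -log_p_pos).mean()

        return 0.5 * (one_direction(logits) + one_direction(logits.t()))

In this sketch, setting gamma=0 recovers the standard symmetric ITC loss, and a larger tau flattens the soft mask toward uniform weighting, i.e., a milder perturbation than hard masking.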
Pages: 2798-2807
Page count: 10