Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

Cited by: 81
Authors
Huang, Zhicheng [1 ,2 ]
Zeng, Zhaoyang [3 ]
Huang, Yupan [3 ]
Liu, Bei [4 ]
Fu, Dongmei [1 ,2 ]
Fu, Jianlong [4 ]
Institutions
[1] Univ Sci & Technol Beijing, Sch Automat & Elect Engn, Beijing, Peoples R China
[2] Beijing Engn Res Ctr Ind Spectrum Imaging, Beijing, Peoples R China
[3] Sun Yat Sen Univ, Guangzhou, Peoples R China
[4] Microsoft Res Asia, Beijing, Peoples R China
Keywords
DOI
10.1109/CVPR46437.2021.01278
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We study the joint learning of a Convolutional Neural Network (CNN) and a Transformer for vision-language pre-training (VLPT), which aims to learn cross-modal alignments from millions of image-text pairs. State-of-the-art approaches extract salient image regions and align those regions with words step-by-step. Since region-based visual features usually represent only parts of an image, it is challenging for existing vision-language models to fully understand the semantics of the paired natural language. In this paper, we propose SOHO ("Seeing Out of tHe bOx"), which takes a whole image as input and learns vision-language representations in an end-to-end manner. SOHO does not require bounding box annotations, which enables inference 10 times faster than region-based approaches. In particular, SOHO learns to extract comprehensive yet compact image features through a visual dictionary (VD) that facilitates cross-modal understanding. The VD is designed to represent consistent visual abstractions of similar semantics; it is updated on-the-fly and utilized in our proposed pre-training task, Masked Visual Modeling (MVM). We conduct experiments on four well-established vision-language tasks, following standard VLPT settings. SOHO achieves absolute gains of 2.0% R@1 on the MSCOCO text-retrieval 5k test split, 1.5% accuracy on the NLVR2 test-P split, and 6.7% accuracy on the SNLI-VE test split.
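The abstract describes a visual dictionary that quantizes grid features to shared codebook entries and updates those entries on-the-fly. A minimal sketch of that idea, using nearest-neighbor lookup plus a moving-average codebook update (all names, shapes, and the momentum value here are illustrative, not the paper's implementation):

```python
import numpy as np

def vd_lookup_and_update(features, codebook, counts, momentum=0.99):
    """Quantize each feature vector to its nearest codebook entry,
    then refresh the matched entries with a moving average, so the
    dictionary tracks the feature distribution during training.
    This is an illustrative sketch, not the SOHO implementation."""
    # Pairwise squared distances between features (N, D) and codebook (K, D).
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)        # index of the nearest entry per feature
    quantized = codebook[idx]      # copy of the pre-update entries
    for i, k in enumerate(idx):    # moving-average update of matched entries
        counts[k] += 1
        codebook[k] = momentum * codebook[k] + (1 - momentum) * features[i]
    return quantized, idx
```

Replacing features by shared codebook entries is what lets features with similar semantics map to a consistent abstraction, and the returned indices give discrete targets that a masked-prediction objective such as MVM can classify over.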
Pages: 12971-12980
Page count: 10
Related Papers (50 total)
  • [1] E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
    Xu, Haiyang
    Yan, Ming
    Li, Chenliang
    Bi, Bin
    Huang, Songfang
    Xiao, Wenming
    Huang, Fei
    [J]. 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 1 (ACL-IJCNLP 2021), 2021, : 503 - 513
  • [2] Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
    Jian, Yiren
    Gao, Chongyang
    Vosoughi, Soroush
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [3] Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning
    Ji, Yatai
    Tu, Rongcheng
    Jiang, Jie
    Kong, Weijie
    Cai, Chengfei
    Zhao, Wenzhe
    Wang, Hongfa
    Yang, Yujiu
    Liu, Wei
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6789 - 6798
  • [4] Vision-Language Pre-Training with Triple Contrastive Learning
    Yang, Jinyu
    Duan, Jiali
    Tran, Son
    Xu, Yi
    Chanda, Sampath
    Chen, Liqun
    Zeng, Belinda
    Chilimbi, Trishul
    Huang, Junzhou
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15650 - 15659
  • [5] Survey on Vision-language Pre-training
    Yin, Jiong
    Zhang, Zhe-Dong
    Gao, Yu-Han
    Yang, Zhi-Wen
    Li, Liang
    Xiao, Mang
    Sun, Yao-Qi
    Yan, Cheng-Gang
    [J]. Ruan Jian Xue Bao/Journal of Software, 2023, 34 (05): : 2000 - 2023
  • [6] SPEECH-LANGUAGE PRE-TRAINING FOR END-TO-END SPOKEN LANGUAGE UNDERSTANDING
    Qian, Yao
Bian, Ximo
    Shi, Yu
    Kanda, Naoyuki
    Shen, Leo
    Xiao, Zhen
    Zeng, Michael
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7458 - 7462
  • [7] Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision
    Wang, Tzu-Jui Julius
    Laaksonen, Jorma
    Langer, Tomas
    Arponen, Heikki
    Bishop, Tom E.
    [J]. 2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 1073 - 1083
  • [8] Speech Model Pre-training for End-to-End Spoken Language Understanding
    Lugosch, Loren
    Ravanelli, Mirco
    Ignoto, Patrick
    Tomar, Vikrant Singh
    Bengio, Yoshua
    [J]. INTERSPEECH 2019, 2019, : 814 - 818
  • [9] UNIMO-2: End-to-End Unified Vision-Language Grounded Learning
    Li, Wei
    Gao, Can
    Niu, Guocheng
    Xiao, Xinyan
    Liu, Hao
    Liu, Jiachen
    Wu, Hua
    Wang, Haifeng
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022, : 3187 - 3201
  • [10] VLP: A Survey on Vision-language Pre-training
    Chen, Fei-Long
    Zhang, Du-Zhen
    Han, Ming-Lun
    Chen, Xiu-Yi
    Shi, Jing
    Xu, Shuang
    Xu, Bo
    [J]. MACHINE INTELLIGENCE RESEARCH, 2023, 20 (01) : 38 - 56