Multi-Lingual Acquisition on Multimodal Pre-training for Cross-modal Retrieval

Cited by: 0
Authors
Zhang, Liang [1]
Hu, Anwen [1]
Jin, Qin [1,2]
Affiliations
[1] Renmin Univ China, Sch Informat, Beijing, Peoples R China
[2] Renmin Univ China, Key Lab Data Engn & Knowledge Engn, MOE, Beijing, Peoples R China
Funding
National Key R&D Program of China; National Natural Science Foundation of China
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Vision and diverse languages are important information sources in our living world. A model that understands both multiple modalities and multiple languages can be applied to a wider range of real-life scenarios. To build such a multimodal and multilingual model, existing works try to combine vision-language data from multiple languages during pre-training. However, due to the large number of languages involved, these works often require huge computing resources and cannot be flexibly extended to new languages. In this work, we propose a Multi-Lingual Acquisition (MLA) framework that can easily empower a monolingual Vision-Language Pre-training (VLP) model with multilingual capability. Specifically, we design a lightweight language acquisition encoder based on state-of-the-art monolingual VLP models. We further propose a two-stage training strategy to optimize the language acquisition encoder, namely the Native Language Transfer stage and the Language Exposure stage. With much less multilingual training data and far fewer computing resources, our model achieves state-of-the-art performance on multilingual image-text and video-text retrieval benchmarks.
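The core idea in the abstract — keeping a pre-trained monolingual VLP encoder frozen while a small trainable "language acquisition" module learns to reproduce its embedding space (the Native Language Transfer stage) — can be caricatured in a toy numerical sketch. Everything below is an illustrative assumption, not the paper's actual architecture: both "encoders" are reduced to linear maps, and the alignment objective to a mean-squared embedding-matching loss.

```python
import numpy as np

# Toy sketch (assumptions, not the paper's code): the frozen monolingual VLP
# text encoder is stood in for by a fixed linear map, and the lightweight
# language acquisition encoder by a trainable linear layer that learns to
# reproduce the frozen encoder's embeddings for native-language inputs.
rng = np.random.default_rng(0)
d_in, d_emb, n = 16, 8, 64

W_frozen = rng.normal(size=(d_in, d_emb))          # frozen VLP text encoder (never updated)
W_adapter = 0.01 * rng.normal(size=(d_in, d_emb))  # language acquisition encoder (trained)

X = rng.normal(size=(n, d_in))                     # stand-in features for native-language text
target = X @ W_frozen                              # embeddings the adapter must imitate

def loss(W):
    """Mean-squared embedding-alignment loss against the frozen encoder."""
    return float(np.mean((X @ W - target) ** 2))

loss_before = loss(W_adapter)
lr = 0.05
for _ in range(200):
    # Gradient step on the alignment objective; only the adapter is updated.
    grad = 2.0 * X.T @ (X @ W_adapter - target) / n
    W_adapter -= lr * grad
loss_after = loss(W_adapter)
print(f"alignment loss: {loss_before:.3f} -> {loss_after:.6f}")
```

The second stage described in the abstract (Language Exposure) would then expose the trained adapter to multilingual text paired with the same visual targets; it is omitted here since the toy setup has no multilingual data.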
Pages: 14
Related Papers
50 entries in total
  • [1] Cross-lingual Cross-modal Pretraining for Multimodal Retrieval
    Fei, Hongliang
    Yu, Tan
    Li, Ping
    [J]. 2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 3644 - 3650
  • [2] Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training
    Song, Yuqing
    Chen, Shizhe
    Jin, Qin
    Luo, Wei
    Xie, Jun
    Huang, Fei
    [J]. PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 2843 - 2852
  • [3] Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training
    Zeng, Yan
    Zhou, Wangchunshu
    Luo, Ao
    Cheng, Ziming
    Zhang, Xinsong
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 5731 - 5746
  • [4] UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
    Zhou, Mingyang
    Zhou, Luowei
    Wang, Shuohang
    Cheng, Yu
    Li, Linjie
    Yu, Zhou
    Liu, Jingjing
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 4153 - 4163
  • [5] UniXcoder: Unified Cross-Modal Pre-training for Code Representation
    Guo, Daya
    Lu, Shuai
    Duan, Nan
    Wang, Yanlin
    Zhou, Ming
    Yin, Jian
    [J]. PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 7212 - 7225
  • [6] Cross-lingual Visual Pre-training for Multimodal Machine Translation
    Caglayan, Ozan
    Kuyu, Menekse
    Amac, Mustafa Sercan
    Madhyastha, Pranava
    Erdem, Erkut
    Erdem, Aykut
    Specia, Lucia
    [J]. 16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 1317 - 1324
  • [7] Unifying Cross-Lingual and Cross-Modal Modeling Towards Weakly Supervised Multilingual Vision-Language Pre-training
    Li, Zejun
    Fan, Zhihao
    Chen, JingJing
    Zhang, Qi
    Huang, Xuanjing
    Wei, Zhongyu
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 5939 - 5958
  • [8] PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting
    Guo, Zixin
    Wang, Tzu-Jui Julius
    Pehlivan, Selen
    Radman, Abduljalil
    Laaksonen, Jorma
    [J]. PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023, : 2261 - 2265
  • [9] COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
    Lu, Haoyu
    Fei, Nanyi
    Huo, Yuqi
    Gao, Yizhao
    Lu, Zhiwu
    Wen, Ji-Rong
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15671 - 15680
  • [10] Multimodal adversarial network for cross-modal retrieval
    Hu, Peng
    Peng, Dezhong
    Wang, Xu
    Xiang, Yong
    [J]. KNOWLEDGE-BASED SYSTEMS, 2019, 180 : 38 - 50