On Scaling up a Multilingual Vision and Language Model

Cited by: 1
Authors
Chen, Xi [1]
Djolonga, Josip [1]
Padlewski, Piotr [1]
Mustafa, Basil [1]
Changpinyo, Soravit [1]
Wu, Jialin [1]
Ruiz, Carlos Riquelme [1]
Goodman, Sebastian [1]
Wang, Xiao [1]
Tay, Yi [1]
Shakeri, Siamak [1]
Dehghani, Mostafa [1]
Salz, Daniel [1]
Lucic, Mario [1]
Tschannen, Michael [1]
Nagrani, Arsha [1]
Hu, Hexiang [1]
Joshi, Mandar [1]
Pang, Bo [1]
Montgomery, Ceslee [1]
Pietrzyk, Paulina [1]
Ritter, Marvin [1]
Piergiovanni, A. J. [1]
Minderer, Matthias [1]
Pavetic, Filip [1]
Waters, Austin [1]
Li, Gang [1]
Alabdulmohsin, Ibrahim [1]
Beyer, Lucas [1]
Amelot, Julien [1]
Lee, Kenton [1]
Steiner, Andreas Peter [1]
Li, Yang [1]
Keysers, Daniel [1]
Arnab, Anurag [1]
Xu, Yuanzhong [1]
Rong, Keran [1]
Kolesnikov, Alexander [1]
Seyedhosseini, Mojtaba [1]
Angelova, Anelia [1]
Zhai, Xiaohua [1]
Houlsby, Neil [1]
Soricut, Radu [1]
Affiliations
[1] Google, Mountain View, CA 94043 USA
DOI
10.1109/CVPR52733.2024.01368
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We explore the boundaries of scaling up a multilingual vision and language model, in terms of both the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state of the art on most of the vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, which are tasks not explicitly included in the training mix.
Pages: 14432-14444
Number of pages: 13