On Scaling up a Multilingual Vision and Language Model

Cited by: 1
Authors
Chen, Xi [1]
Djolonga, Josip [1]
Padlewski, Piotr [1]
Mustafa, Basil [1]
Changpinyo, Soravit [1]
Wu, Jialin [1]
Ruiz, Carlos Riquelme [1]
Goodman, Sebastian [1]
Wang, Xiao [1]
Tay, Yi [1]
Shakeri, Siamak [1]
Dehghani, Mostafa [1]
Salz, Daniel [1]
Lucic, Mario [1]
Tschannen, Michael [1]
Nagrani, Arsha [1]
Hu, Hexiang [1]
Joshi, Mandar [1]
Pang, Bo [1]
Montgomery, Ceslee [1]
Pietrzyk, Paulina [1]
Ritter, Marvin [1]
Piergiovanni, A. J. [1]
Minderer, Matthias [1]
Pavetic, Filip [1]
Waters, Austin [1]
Li, Gang [1]
Alabdulmohsin, Ibrahim [1]
Beyer, Lucas [1]
Amelot, Julien [1]
Lee, Kenton [1]
Steiner, Andreas Peter [1]
Li, Yang [1]
Keysers, Daniel [1]
Arnab, Anurag [1]
Xu, Yuanzhong [1]
Rong, Keran [1]
Kolesnikov, Alexander [1]
Seyedhosseini, Mojtaba [1]
Angelova, Anelia [1]
Zhai, Xiaohua [1]
Houlsby, Neil [1]
Soricut, Radu [1]
Affiliations
[1] Google, Mountain View, CA 94043 USA
DOI
10.1109/CVPR52733.2024.01368
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We explore the boundaries of scaling up a multilingual vision and language model, both in terms of the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state of the art on most vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
Pages: 14432-14444
Number of pages: 13