On Scaling up a Multilingual Vision and Language Model

Authors
Chen, Xi [1]
Djolonga, Josip [1]
Padlewski, Piotr [1]
Mustafa, Basil [1]
Changpinyo, Soravit [1]
Wu, Jialin [1]
Ruiz, Carlos Riquelme [1]
Goodman, Sebastian [1]
Wang, Xiao [1]
Tay, Yi [1]
Shakeri, Siamak [1]
Dehghani, Mostafa [1]
Salz, Daniel [1]
Lucic, Mario [1]
Tschannen, Michael [1]
Nagrani, Arsha [1]
Hu, Hexiang [1]
Joshi, Mandar [1]
Pang, Bo [1]
Montgomery, Ceslee [1]
Pietrzyk, Paulina [1]
Ritter, Marvin [1]
Piergiovanni, A. J. [1]
Minderer, Matthias [1]
Pavetic, Filip [1]
Waters, Austin [1]
Li, Gang [1]
Alabdulmohsin, Ibrahim [1]
Beyer, Lucas [1]
Amelot, Julien [1]
Lee, Kenton [1]
Steiner, Andreas Peter [1]
Li, Yang [1]
Keysers, Daniel [1]
Arnab, Anurag [1]
Xu, Yuanzhong [1]
Rong, Keran [1]
Kolesnikov, Alexander [1]
Seyedhosseini, Mojtaba [1]
Angelova, Anelia [1]
Zhai, Xiaohua [1]
Houlsby, Neil [1]
Soricut, Radu [1]
Affiliation
[1] Google, Mountain View, CA 94043 USA
DOI
10.1109/CVPR52733.2024.01368
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We explore the boundaries of scaling up a multilingual vision and language model, both in terms of the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state of the art on most of the vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
Pages: 14432-14444
Number of pages: 13