General Object Foundation Model for Images and Videos at Scale

Cited by: 4
Authors
Wu, Junfeng [1]
Jiang, Yi [2]
Liu, Qihao [3]
Yuan, Zehuan [2]
Bai, Xiang [1]
Bai, Song [2]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[2] ByteDance Inc, Beijing, Peoples R China
[3] Johns Hopkins Univ, Baltimore, MD 21218 USA
Source
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024 | 2024
DOI
10.1109/CVPR52733.2024.00363
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present GLEE, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in open-world scenarios for various object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations, excelling in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, a text encoder, and a visual prompter to handle multi-modal inputs, enabling the model to solve various object-centric downstream tasks simultaneously while maintaining state-of-the-art performance. Trained on over five million images from diverse benchmarks, GLEE exhibits remarkable versatility and improved generalization performance, efficiently tackling downstream tasks without the need for task-specific adaptation. By integrating large volumes of automatically labeled data, we further enhance its zero-shot generalization capabilities. Additionally, GLEE can be integrated into Large Language Models, serving as a foundation model that provides universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The models and code are released at https://github.com/FoundationVision/GLEE.
Pages: 3783-3795
Page count: 13