General Object Foundation Model for Images and Videos at Scale

Cited by: 4

Authors:

Wu, Junfeng [1]

Jiang, Yi [2]

Liu, Qihao [3]

Yuan, Zehuan [2]

Bai, Xiang [1]

Bai, Song [2]

Affiliations:

[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China

[2] ByteDance Inc, Beijing, Peoples R China

[3] Johns Hopkins Univ, Baltimore, MD 21218 USA

Source:

2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024 | 2024

Keywords:

DOI:

10.1109/CVPR52733.2024.00363

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in open-world scenarios for various object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations, excelling in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, a text encoder, and a visual prompter to handle multi-modal inputs, enabling the model to solve various object-centric downstream tasks simultaneously while maintaining state-of-the-art performance. Trained extensively on over five million images from diverse benchmarks, GLEE exhibits remarkable versatility and improved generalization performance, efficiently tackling downstream tasks without the need for task-specific adaptation. By integrating large volumes of automatically labeled data, we further enhance its zero-shot generalization capabilities. Additionally, GLEE can be integrated into Large Language Models, serving as a foundational model to provide universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The models and code are released at https://github.com/FoundationVision/GLEE.

Pages: 3783-3795

Page count: 13

