Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding

Citations: 0
Authors
Zhang, Taolin [1 ]
He, Sunan [2 ]
Dai, Tao [3 ]
Wang, Zhi [1 ]
Chen, Bin [4 ]
Xia, Shu-Tao [1 ,5 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Peoples R China
[3] Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen, Peoples R China
[4] Harbin Inst Technol, Shenzhen, Peoples R China
[5] Peng Cheng Lab, Res Ctr Artificial Intelligence, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
N/A
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In recent years, vision-language pre-training frameworks have made significant progress in natural language processing and computer vision, achieving remarkable performance improvements on various downstream tasks. However, when extended to point cloud data, existing works mainly focus on building task-specific models and fail to extract universal 3D vision-language embeddings that generalize well. We carefully investigate three common tasks in semantic 3D scene understanding and derive key insights into the development of a pre-training model. Motivated by these observations, we propose a vision-language pre-training framework, 3DVLP (3D vision-language pre-training with object contrastive learning), which transfers flexibly to 3D vision-language downstream tasks. 3DVLP takes visual grounding as the proxy task and introduces an Object-level IoU-guided Detection (OID) loss to obtain high-quality proposals in the scene. Moreover, we design an Object-level Cross-Contrastive alignment (OCC) task and an Object-level Self-Contrastive learning (OSC) task to align objects with their descriptions and to distinguish different objects in the scene, respectively. Extensive experiments verify the excellent performance of 3DVLP on three 3D vision-language tasks, reflecting its superiority in semantic 3D scene understanding. Code is available at https://github.com/iridescentttt/3DVLP.
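The object-level cross-contrastive alignment described in the abstract is, at its core, a symmetric contrastive (InfoNCE-style) objective between object-proposal embeddings and the embeddings of their matching descriptions. Below is a minimal NumPy sketch of such a loss; the function name, shapes, and temperature value are illustrative assumptions, not the paper's actual OCC implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_contrastive_loss(obj_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE between N object proposals and N descriptions.

    obj_emb, txt_emb: (N, D) arrays where row i of each modality is a
    positive pair; all other rows serve as in-batch negatives.
    (Generic sketch of cross-modal contrastive alignment, not 3DVLP's exact loss.)
    """
    obj = l2_normalize(np.asarray(obj_emb, dtype=np.float64))
    txt = l2_normalize(np.asarray(txt_emb, dtype=np.float64))
    logits = obj @ txt.T / temperature      # (N, N) similarity matrix
    idx = np.arange(len(obj))               # object i matches description i

    def ce(lg):
        # numerically stable log-softmax cross-entropy on the diagonal targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average object-to-text and text-to-object directions
    return 0.5 * (ce(logits) + ce(logits.T))
```

When each object embedding is already aligned with its description embedding, the diagonal of the similarity matrix dominates and the loss approaches zero; mismatched pairings drive the loss up, which is what pushes matched object-text pairs together and unmatched ones apart.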
Pages: 7296 - 7304 (9 pages)