Panoptic Vision-Language Feature Fields

Cited by: 2
Authors
Chen, Haoran [1]
Blomqvist, Kenneth [1]
Milano, Francesco [1]
Siegwart, Roland [1]
Affiliation
[1] Swiss Fed Inst Technol, Autonomous Syst Lab, CH-8092 Zurich, Switzerland
Funding
European Union Horizon 2020
Keywords
Semantics; Three-dimensional displays; Semantic segmentation; Self-supervised learning; Instance segmentation; Image reconstruction; Computational modeling; Semantic scene understanding; deep learning for visual perception; 3D open vocabulary panoptic segmentation; neural implicit representation;
DOI
10.1109/LRA.2024.3354624
Chinese Library Classification (CLC) number
TP24 [Robotics]
Discipline classification codes
080202; 1405
Abstract
Recently, methods have been proposed for 3D open-vocabulary semantic segmentation. Such methods can segment a scene into arbitrary classes based on text descriptions provided at runtime. In this letter, we propose what is, to the best of our knowledge, the first algorithm for open-vocabulary panoptic segmentation in 3D scenes. Our algorithm, Panoptic Vision-Language Feature Fields (PVLFF), learns a semantic feature field of the scene by distilling vision-language features from a pretrained 2D model, and jointly fits an instance feature field through contrastive learning using 2D instance segments on the input frames. Despite not being trained on the target classes, our method achieves panoptic segmentation performance similar to that of state-of-the-art closed-set 3D systems on the HyperSim, ScanNet and Replica datasets, and additionally outperforms current 3D open-vocabulary systems in terms of semantic segmentation. We ablate the components of our method to demonstrate the effectiveness of our model architecture.
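The two feature fields described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that per-point features have already been rendered from the fields for N sample points, and that text prompts have been embedded by the same vision-language model whose features were distilled. It also substitutes a generic margin-based pull/push loss for the paper's contrastive objective. The helper names `open_vocab_labels` and `contrastive_instance_loss` and the `margin` parameter are hypothetical.

```python
import numpy as np


def normalize(x, axis=-1, eps=1e-8):
    """L2-normalize vectors along `axis` (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def open_vocab_labels(point_features, text_embeddings):
    """Hypothetical open-vocabulary query on the semantic field.

    point_features: (N, D) features rendered from the semantic field.
    text_embeddings: (C, D) embeddings of C runtime text prompts.
    Returns, for each point, the index of the prompt with the highest
    cosine similarity.
    """
    sims = normalize(point_features) @ normalize(text_embeddings).T  # (N, C)
    return sims.argmax(axis=1)


def contrastive_instance_loss(inst_features, segment_ids, margin=1.0):
    """Generic pull/push contrastive loss (an assumption, not PVLFF's exact loss).

    inst_features: (N, D) features rendered from the instance field.
    segment_ids: (N,) id of the 2D instance segment each pixel falls in.
    Pairs from the same segment are pulled together (squared distance);
    pairs from different segments are pushed at least `margin` apart.
    """
    n = inst_features.shape[0]
    diff = inst_features[:, None, :] - inst_features[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)  # (N, N) pairwise distances
    same = segment_ids[:, None] == segment_ids[None, :]
    off_diag = ~np.eye(n, dtype=bool)  # exclude each point paired with itself
    pos = dist[same & off_diag]
    neg = dist[~same]
    pull = (pos ** 2).mean() if pos.size else 0.0
    push = (np.maximum(0.0, margin - neg) ** 2).mean() if neg.size else 0.0
    return float(pull + push)


if __name__ == "__main__":
    feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
    prompts = np.array([[1.0, 0.0], [0.0, 1.0]])  # e.g. "chair", "table"
    print(open_vocab_labels(feats, prompts))
    inst = np.array([[0.0, 0.0], [0.0, 0.0], [5.0, 5.0]])
    print(contrastive_instance_loss(inst, np.array([0, 0, 1])))
```

Because both features and prompt embeddings are normalized, the argmax selects the prompt with the highest cosine similarity, so the class set can change at runtime by embedding a new list of prompts without retraining the field.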
Pages: 2144-2151
Page count: 8