Multi-modal Prompts with Feature Decoupling for Open-Vocabulary Object Detection

Cited by: 0

Authors:

Wang, Duorui [1]

Zhao, Xiaowei [1]

Affiliation:

[1] Beihang Univ, State Key Lab Complex & Crit Software Environm, Beijing 100191, Peoples R China

Source:

GENERALIZING FROM LIMITED RESOURCES IN THE OPEN WORLD, GLOW-IJCAI 2024 | 2024 / Vol. 2160

Keywords:

feature decoupling; multi-modal prompts; open-vocabulary object detection; region expansion

DOI:

10.1007/978-981-97-6125-8_14

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Open-vocabulary object detection aims to recognize novel categories from text descriptions while training on data from only a limited set of categories. A prompt serves as a template for constructing the textual description of a category. As open-vocabulary object detection has developed, multi-modal prompts with better performance have emerged. However, existing multi-modal prompts fail to align the context and object components across modalities during construction. To address this issue, we propose an open-vocabulary object detection framework based on multi-modal prompts with feature decoupling. The framework consists of two modules: the construction of Multi-modal Prompts with Feature Decoupling (MPFD) and visual Region Expansion (RE). During prompt construction, MPFD decouples the object and context components from the visual embeddings and then fuses each with the corresponding part of the text embeddings. RE incorporates additional context information into the visual embeddings to enhance the discriminative ability of the prompts. Extensive experiments demonstrate that multi-modal prompts with feature decoupling effectively improve the performance of open-vocabulary object detection models.
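The abstract does not say how the decoupling or the fusion is computed, so the following is only a toy illustration of the idea of splitting a visual embedding into object and context parts and fusing each with the matching text part. Every choice here is an assumption made for illustration and is not taken from the paper: the function names `decouple`, `fuse`, and `build_prompt` are hypothetical, the object part is taken to be the projection of the visual embedding onto the object text direction, the context part is the residual, and fusion is a plain weighted average with a hypothetical weight `alpha`.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def decouple(visual, object_text):
    """ASSUMPTION: object part = projection of the visual embedding onto
    the object text direction; context part = the remaining residual."""
    direction = normalize(object_text)
    scale = dot(visual, direction)
    object_part = [scale * d for d in direction]
    context_part = [v - o for v, o in zip(visual, object_part)]
    return object_part, context_part


def fuse(text_part, visual_part, alpha=0.5):
    """ASSUMPTION: fusion is a weighted average of the two modalities."""
    return [alpha * t + (1 - alpha) * v for t, v in zip(text_part, visual_part)]


def build_prompt(context_text, object_text, visual, alpha=0.5):
    """Fuse context with context and object with object, then concatenate
    the two fused parts into one prompt vector."""
    object_vis, context_vis = decouple(visual, object_text)
    return fuse(context_text, context_vis, alpha) + fuse(object_text, object_vis, alpha)


if __name__ == "__main__":
    # Tiny 2-D embeddings chosen by hand, not real model outputs.
    prompt = build_prompt(context_text=[0.0, 2.0], object_text=[1.0, 0.0], visual=[3.0, 4.0])
    print(prompt)
```

The point of the sketch is only the pairing: the context text is combined with the context part of the visual embedding and the object text with the object part, rather than mixing the whole visual embedding into the whole prompt.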


Pages: 180-194

Page count: 15
