Towards zero-shot human-object interaction detection via vision-language integration

Citations: 0

Authors:

Xue, Weiying [1]

Liu, Qi [1]

Wang, Yuxiao [1]

Wei, Zhenao [1]

Xing, Xiaofen [1]

Xu, Xiangmin [1]

Affiliations:

[1] South China Univ Technol, Sch Future Technol, Guangzhou 511400, Guangdong, Peoples R China

Source:

NEURAL NETWORKS | 2025 / Vol. 187

Funding:

National Natural Science Foundation of China;

Keywords:

Human-object interaction; Multimodal integration; Zero-shot; Weakly supervision;

DOI:

10.1016/j.neunet.2025.107348

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Human-object interaction (HOI) detection aims to locate human-object pairs and identify their interaction categories in images. Most existing methods focus on supervised learning, which relies on extensive manual HOI annotations. This heavy reliance on closed-set supervised learning limits their ability to generalize to unseen object categories. Inspired by the remarkable zero-shot capabilities of vision-language models (VLMs), we propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that effectively integrates the knowledge of a VLM to improve zero-shot HOI detection. Specifically, we propose a ho-pair encoder to supplement contextual and interaction-specific semantic representation decoder into our model. Additionally, we propose two fusion strategies to facilitate the transfer of prior knowledge from the VLM. One is visual-level fusion, which produces more global contextual interaction features; the other is language-level fusion, which further enhances the capability of the VLM for HOI detection. Extensive experiments conducted on the mainstream HICO-DET and V-COCO datasets demonstrate that our model outperforms previous methods in various zero-shot and fully supervised settings. The source code is available at https://github.com/xwyscut/K2HOI.

Pages: 9
