Towards zero-shot human-object interaction detection via vision-language integration

Cited by: 0
Authors
Xue, Weiying [1]
Liu, Qi [1]
Wang, Yuxiao [1]
Wei, Zhenao [1]
Xing, Xiaofen [1]
Xu, Xiangmin [1]
Affiliations
[1] South China Univ Technol, Sch Future Technol, Guangzhou 511400, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Human-object interaction; Multimodal integration; Zero-shot; Weak supervision
DOI
10.1016/j.neunet.2025.107348
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Human-object interaction (HOI) detection aims to locate human-object pairs and identify their interaction categories in images. Most existing methods focus on supervised learning, which relies on extensive manual HOI annotations. This heavy reliance on closed-set supervised learning limits their ability to generalize to unseen object categories. Inspired by the remarkable zero-shot capabilities of vision-language models (VLMs), we propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that effectively integrates the knowledge of a VLM to improve zero-shot HOI detection. Specifically, we propose a ho-pair (human-object pair) encoder that supplies contextual and interaction-specific semantic representations to the decoder of our model. Additionally, we propose two fusion strategies to facilitate the transfer of prior knowledge from the VLM. One is visual-level fusion, which produces interaction features with more global context; the other is language-level fusion, which further enhances the capability of the VLM for HOI detection. Extensive experiments on the mainstream HICO-DET and V-COCO datasets demonstrate that our model outperforms previous methods in various zero-shot and fully supervised settings. The source code is available at https://github.com/xwyscut/K2HOI.
Pages: 9