Towards zero-shot human-object interaction detection via vision-language integration

Cited by: 0
Authors
Xue, Weiying [1]
Liu, Qi [1]
Wang, Yuxiao [1]
Wei, Zhenao [1]
Xing, Xiaofen [1]
Xu, Xiangmin [1]
Affiliations
[1] South China Univ Technol, Sch Future Technol, Guangzhou 511400, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Human-object interaction; Multimodal integration; Zero-shot; Weakly supervision;
DOI
10.1016/j.neunet.2025.107348
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Human-object interaction (HOI) detection aims to locate human-object pairs and identify their interaction categories in images. Most existing methods focus on supervised learning, which relies on extensive manual HOI annotations. This heavy reliance on closed-set supervision limits their ability to generalize to unseen object categories. Inspired by the remarkable zero-shot capabilities of vision-language models (VLMs), we propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that integrates VLM knowledge to improve zero-shot HOI detection. Specifically, we propose a ho-pair encoder that supplements our model's decoder with contextual and interaction-specific semantic representations. Additionally, we propose two fusion strategies to facilitate the transfer of prior knowledge from the VLM. One is visual-level fusion, which produces interaction features with more global context; the other is language-level fusion, which further enhances the capability of the VLM for HOI detection. Extensive experiments on the mainstream HICO-DET and V-COCO datasets demonstrate that our model outperforms previous methods in various zero-shot and fully supervised settings. The source code is available at https://github.com/xwyscut/K2HOI.
Pages: 9