HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models

Cited by: 7
Authors
Ning, Shan [1]
Qiu, Longtian [1]
Liu, Yongfei [2]
He, Xuming [1,3]
Affiliations
[1] ShanghaiTech Univ, Shanghai, Peoples R China
[2] ByteDance Inc, Beijing, Peoples R China
[3] Shanghai Engn Res Ctr Intelligent Vis & Imaging, Shanghai, Peoples R China
DOI
10.1109/CVPR52729.2023.02251
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions. Recently, Contrastive Language-Image Pre-training (CLIP) has shown great potential in providing interaction priors for HOI detectors via knowledge distillation. However, such approaches often rely on large-scale training data and suffer from inferior performance under few/zero-shot scenarios. In this paper, we propose a novel HOI detection framework that efficiently extracts prior knowledge from CLIP and achieves better generalization. In detail, we first introduce a novel interaction decoder to extract informative regions in the visual feature map of CLIP via a cross-attention mechanism, which is then fused with the detection backbone by a knowledge integration block for more accurate human-object pair detection. In addition, prior knowledge in the CLIP text encoder is leveraged to generate a classifier by embedding HOI descriptions. To distinguish fine-grained interactions, we build a verb classifier from training data via visual semantic arithmetic and a lightweight verb representation adapter. Furthermore, we propose a training-free enhancement to exploit global HOI predictions from CLIP. Extensive experiments demonstrate that our method outperforms the state of the art by a large margin on various settings, e.g., +4.04 mAP on HICO-Det. The source code is available at https://github.com/Artanic30/HOICLIP.
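The abstract mentions generating a classifier by embedding HOI descriptions with a text encoder. A minimal sketch of that general idea follows. It is not the authors' implementation: the three-dimensional vectors are toy stand-ins for real text and visual embeddings, and the `classify` function, the description strings, and the temperature of 0.07 are assumptions made for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def classify(feature, class_embeddings, temperature=0.07):
    """Score one visual feature against text embeddings of HOI descriptions.

    `temperature` is an assumed value, not taken from the abstract.
    Returns a dict mapping each description to a softmax probability.
    """
    f = l2_normalize(feature)
    logits = {}
    for label, emb in class_embeddings.items():
        e = l2_normalize(emb)
        logits[label] = sum(a * b for a, b in zip(f, e)) / temperature
    # Numerically stable softmax over the cosine-similarity logits.
    peak = max(logits.values())
    exps = {k: math.exp(v - peak) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}


# Hypothetical embeddings: in practice these would come from a text encoder
# applied to descriptions of each HOI class.
text_embeddings = {
    "a person riding a bicycle": [0.9, 0.1, 0.0],
    "a person holding a cup": [0.1, 0.9, 0.1],
    "a person feeding a horse": [0.0, 0.2, 0.9],
}

# Hypothetical visual feature for one detected human-object pair.
interaction_feature = [0.8, 0.2, 0.1]

probs = classify(interaction_feature, text_embeddings)
best = max(probs, key=probs.get)
print(best)
```

Because the class weights are derived from text rather than learned per class, a new HOI category can in principle be added by embedding a new description, which is one reason this kind of classifier is attractive in few/zero-shot settings.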
Pages: 23507-23517
Page count: 11