End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation

Cited by: 0
Authors
Wu, Mingrui [1 ,2 ]
Gu, Jiaxin [3 ]
Shen, Yunhang [2 ]
Lin, Mingbao [2 ]
Chen, Chao [2 ]
Sun, Xiaoshuai [1 ,4 ,5 ]
Affiliations
[1] Xiamen Univ, Sch Informat, MAC Lab, Xiamen, Peoples R China
[2] Tencent, Youtu Lab, Shenzhen, Peoples R China
[3] VIS Baidu Inc, Beijing, Peoples R China
[4] Xiamen Univ, Inst Artificial Intelligence, Xiamen, Peoples R China
[5] Xiamen Univ, Fujian Engn Res Ctr Trusted Artificial Intelligen, Xiamen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Most existing Human-Object Interaction (HOI) detection methods rely heavily on full annotations with predefined HOI categories, which are limited in diversity and costly to scale. We aim to advance zero-shot HOI detection to detect both seen and unseen HOIs simultaneously. The fundamental challenges are to discover potential human-object pairs and to identify novel HOI categories. To overcome these challenges, we propose a novel End-to-end zero-shot HOI Detection (EoID) framework via vision-language knowledge distillation. We first design an Interactive Score module combined with a Two-stage Bipartite Matching algorithm to distinguish interactive human-object pairs in an action-agnostic manner. We then transfer the distribution of action probability from a pretrained vision-language teacher, together with the seen ground truth, to the HOI model to attain zero-shot HOI classification. Extensive experiments on the HICO-Det dataset demonstrate that our model discovers potential interactive pairs and enables the recognition of unseen HOIs. Finally, our EoID outperforms the previous state of the art under various zero-shot settings. Moreover, our method generalizes to large-scale object detection data, allowing the action set to be scaled up further. The source code is available at: https://github.com/mrwu-mac/EoID.
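The distillation step described in the abstract (transferring the teacher's action-probability distribution to the HOI model) can be pictured with a minimal, hypothetical PyTorch sketch. The tensor shapes, prompt-embedding setup, and the simple KL-based loss below are illustrative assumptions, not the authors' implementation; in the actual framework this term would be combined with the supervised loss on seen ground-truth actions.

```python
# Minimal sketch (not the authors' code): distilling an action distribution from a
# frozen vision-language teacher (CLIP-like) into an HOI student's action head.
import torch
import torch.nn.functional as F


def distillation_loss(student_action_logits,  # (N, A) action logits from the HOI student
                      teacher_region_feats,   # (N, D) teacher visual features for N pairs
                      action_text_embeds,     # (A, D) teacher text embeddings of action prompts
                      temperature: float = 2.0):
    """KL divergence between the teacher's soft action distribution and the student's."""
    # Teacher distribution: cosine similarity between region and action-prompt embeddings.
    teacher_logits = F.normalize(teacher_region_feats, dim=-1) @ \
                     F.normalize(action_text_embeds, dim=-1).t()           # (N, A)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_action_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2


if __name__ == "__main__":
    # Toy usage with random tensors; 117 is the HICO-Det action count, 512 an assumed dim.
    N, A, D = 4, 117, 512
    loss = distillation_loss(torch.randn(N, A), torch.randn(N, D), torch.randn(A, D))
    print(float(loss))
```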
Pages: 2839-2846
Page count: 8