End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation

Cited by: 0
Authors
Wu, Mingrui [1, 2]
Gu, Jiaxin [3]
Shen, Yunhang [2]
Lin, Mingbao [2]
Chen, Chao [2]
Sun, Xiaoshuai [1, 4, 5]
Affiliations
[1] Xiamen Univ, Sch Informat, MAC Lab, Xiamen, Peoples R China
[2] Tencent, Youtu Lab, Shenzhen, Peoples R China
[3] VIS Baidu Inc, Beijing, Peoples R China
[4] Xiamen Univ, Inst Artificial Intelligence, Xiamen, Peoples R China
[5] Xiamen Univ, Fujian Engn Res Ctr Trusted Artificial Intelligen, Xiamen, Peoples R China
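Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract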
Most existing Human-Object Interaction (HOI) detection methods rely heavily on full annotations with predefined HOI categories, which are limited in diversity and costly to scale. We aim to advance zero-shot HOI detection so that both seen and unseen HOIs are detected simultaneously. The fundamental challenges are discovering potential human-object pairs and identifying novel HOI categories. To address them, we propose a novel End-to-end zero-shot HOI Detection (EoID) framework via vision-language knowledge distillation. We first design an Interactive Score module combined with a Two-stage Bipartite Matching algorithm to distinguish interactive human-object pairs in an action-agnostic manner. We then transfer the distribution of action probability from the pretrained vision-language teacher, as well as the seen ground truth, to the HOI model to attain zero-shot HOI classification. Extensive experiments on the HICO-Det dataset demonstrate that our model discovers potential interactive pairs and enables the recognition of unseen HOIs. EoID outperforms previous state-of-the-art methods under various zero-shot settings. Moreover, our method generalizes to large-scale object detection data to further scale up the action sets. The source code is available at: https://github.com/mrwu-mac/EoID.
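The distillation step in the abstract can be illustrated with a minimal sketch. This is not the EoID implementation: the function names `soft_target` and `distillation_loss`, the element-wise maximum used to merge teacher probabilities with seen labels, and the binary cross-entropy objective are all assumptions chosen for illustration. The abstract states only that the teacher's action-probability distribution and the seen ground truth are both transferred to the HOI model.

```python
import math


def soft_target(teacher_probs, seen_labels):
    """Merge teacher action probabilities with seen ground-truth labels.

    Assumed rule (not taken from the paper): an annotated seen action
    (label 1) overrides the teacher; otherwise the teacher's probability
    is used as the soft target.
    """
    return [max(t, float(y)) for t, y in zip(teacher_probs, seen_labels)]


def distillation_loss(student_probs, targets, eps=1e-7):
    """Mean binary cross-entropy between student probabilities and soft targets."""
    total = 0.0
    for p, q in zip(student_probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    return total / len(targets)


# Hypothetical scores for three actions on one human-object pair.
teacher = [0.10, 0.70, 0.20]  # teacher probabilities (made-up values)
seen = [1, 0, 0]              # only the first action is annotated as seen

targets = soft_target(teacher, seen)
good = distillation_loss([0.9, 0.7, 0.2], targets)  # student close to targets
bad = distillation_loss([0.1, 0.1, 0.9], targets)   # student far from targets
```

Under these assumptions, a student whose action probabilities track the merged targets receives a lower loss than one that contradicts them, which is the behavior a distillation objective is meant to encourage for actions that have no annotation.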
Pages: 2839-2846
Page count: 8