End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation

Cited by: 0

Authors:

Wu, Mingrui ^{[1,2]}

Gu, Jiaxin ^{[3]}

Shen, Yunhang ^{[2]}

Lin, Mingbao ^{[2]}

Chen, Chao ^{[2]}

Sun, Xiaoshuai ^{[1,4,5]}

Affiliations:

[1] Xiamen Univ, Sch Informat, MAC Lab, Xiamen, Peoples R China

[2] Tencent, Youtu Lab, Shenzhen, Peoples R China

[3] VIS Baidu Inc, Beijing, Peoples R China

[4] Xiamen Univ, Inst Artificial Intelligence, Xiamen, Peoples R China

[5] Xiamen Univ, Fujian Engn Res Ctr Trusted Artificial Intelligen, Xiamen, Peoples R China

Source:

THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3 | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial Intelligence Theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Most existing Human-Object Interaction (HOI) Detection methods rely heavily on full annotations with predefined HOI categories, which are limited in diversity and costly to scale further. We aim at advancing zero-shot HOI detection to detect both seen and unseen HOIs simultaneously. The fundamental challenges are to discover potential human-object pairs and identify novel HOI categories. To overcome the above challenges, we propose a novel End-to-end zero-shot HOI Detection (EoID) framework via vision-language knowledge distillation. We first design an Interactive Score module combined with a Two-stage Bipartite Matching algorithm to achieve interaction distinguishment for human-object pairs in an action-agnostic manner. Then we transfer the distribution of action probability from the pretrained vision-language teacher as well as the seen ground truth to the HOI model to attain zero-shot HOI classification. Extensive experiments on the HICO-Det dataset demonstrate that our model discovers potential interactive pairs and enables the recognition of unseen HOIs. Finally, our EoID outperforms the previous SOTAs under various zero-shot settings. Moreover, our method is generalizable to large-scale object detection data to further scale up the action sets. The source code is available at: https://github.com/mrwu-mac/EoID.


Pages: 2839-2846

Page count: 8

