End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation

Authors
Wu, Mingrui [1, 2]
Gu, Jiaxin [3]
Shen, Yunhang [2]
Lin, Mingbao [2]
Chen, Chao [2]
Sun, Xiaoshuai [1, 4, 5]
Affiliations
[1] Xiamen Univ, Sch Informat, MAC Lab, Xiamen, Peoples R China
[2] Tencent, Youtu Lab, Shenzhen, Peoples R China
[3] VIS Baidu Inc, Beijing, Peoples R China
[4] Xiamen Univ, Inst Artificial Intelligence, Xiamen, Peoples R China
[5] Xiamen Univ, Fujian Engn Res Ctr Trusted Artificial Intelligen, Xiamen, Peoples R China
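Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405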
Abstract
Most existing Human-Object Interaction (HOI) detection methods rely heavily on full annotations with predefined HOI categories, which are limited in diversity and costly to scale. We aim to advance zero-shot HOI detection so that both seen and unseen HOIs can be detected simultaneously. The fundamental challenges are discovering potential human-object pairs and identifying novel HOI categories. To overcome these challenges, we propose a novel End-to-end zero-shot HOI Detection (EoID) framework via vision-language knowledge distillation. We first design an Interactive Score module combined with a Two-stage Bipartite Matching algorithm to distinguish interactive human-object pairs from non-interactive ones in an action-agnostic manner. We then transfer the action probability distribution from the pretrained vision-language teacher, together with the seen ground truth, to the HOI model to attain zero-shot HOI classification. Extensive experiments on the HICO-Det dataset demonstrate that our model discovers potential interactive pairs and enables the recognition of unseen HOIs. EoID also outperforms previous state-of-the-art methods under various zero-shot settings. Moreover, our method generalizes to large-scale object detection data, which allows the action sets to be scaled up further. The source code is available at: https://github.com/mrwu-mac/EoID.
Pages: 2839-2846
Page count: 8