Agglomerative Transformer for Human-Object Interaction Detection

Cited by: 0
Authors
Tu, Danyang [1 ]
Sun, Wei [1 ]
Zhai, Guangtao [1 ]
Shen, Wei [2 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Inst Image Commun & Network Engn, Shanghai, Peoples R China
[2] Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab Artificial Intelligence, Shanghai, Peoples R China
Funding
Shanghai Natural Science Foundation; National Key R&D Program of China;
Keywords
DOI
10.1109/ICCV51070.2023.01976
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
We propose an agglomerative Transformer (AGER) that enables Transformer-based human-object interaction (HOI) detectors to flexibly exploit extra instance-level cues in a single-stage and end-to-end manner for the first time. AGER acquires instance tokens by dynamically clustering patch tokens and aligning the cluster centers to instances with textual guidance, thus enjoying two benefits: 1) Integrality: each instance token is encouraged to contain all discriminative feature regions of an instance, which yields a significant improvement in the extraction of different instance-level cues and subsequently leads to new state-of-the-art HOI detection performance of 36.75 mAP on HICO-Det. 2) Efficiency: the dynamic clustering mechanism allows AGER to generate instance tokens jointly with the feature learning of the Transformer encoder, eliminating the need for an additional object detector or instance decoder used in prior methods and thus enabling the extraction of desirable extra cues for HOI detection in a single-stage, end-to-end pipeline. Concretely, AGER reduces GFLOPs by 8.5% and improves FPS by 36%, even compared to a vanilla DETR-like pipeline without extra cue extraction. The code will be available at https://github.com/six6607/AGER.git.
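The core idea the abstract describes, condensing many patch tokens into a few instance tokens by iteratively assigning patches to cluster centers, can be illustrated with a soft k-means-style sketch. This is a minimal illustration under assumptions, not the paper's implementation: the function name `cluster_patch_tokens`, the temperature `tau`, the fixed iteration count, and the random initialization are all hypothetical, and the paper's textual-guidance alignment step is omitted.

```python
import numpy as np

def cluster_patch_tokens(patch_tokens, num_instances, iters=5, tau=0.1, seed=0):
    """Soft-cluster patch tokens into instance-level tokens.

    Schematic stand-in for dynamic token clustering: each cluster center
    aggregates the patch features softly assigned to it, so a single
    center can cover all discriminative regions of one instance.
    """
    rng = np.random.default_rng(seed)
    n, d = patch_tokens.shape
    # Initialize centers from randomly chosen patches (hypothetical choice).
    centers = patch_tokens[rng.choice(n, num_instances, replace=False)]
    for _ in range(iters):
        # Similarity between every patch and every center: shape (n, k).
        sim = patch_tokens @ centers.T
        # Numerically stable softmax over centers (temperature tau).
        w = np.exp((sim - sim.max(axis=1, keepdims=True)) / tau)
        w /= w.sum(axis=1, keepdims=True)
        # Update each center as the weighted mean of its assigned patches.
        centers = (w.T @ patch_tokens) / w.sum(axis=0)[:, None]
    return centers, w

# Usage: 196 patch tokens (a 14x14 grid) of dim 32 -> 4 instance tokens.
tokens = np.random.default_rng(1).normal(size=(196, 32))
centers, assign = cluster_patch_tokens(tokens, num_instances=4)
```

Because the centers are produced by the same iterative aggregation that the encoder features flow through, no separate object detector is needed to obtain instance-level representations, which is the efficiency argument the abstract makes.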
Pages: 21557-21567 (11 pages)