MonoPSTR: Monocular 3-D Object Detection With Dynamic Position and Scale-Aware Transformer

Times Cited: 0
Authors
Yang, Fan [1 ]
He, Xuan [2 ]
Chen, Wenrui [1 ,3 ]
Zhou, Pengjie [2 ]
Li, Zhiyong [2 ,3 ]
Affiliations
[1] Hunan Univ, Sch Robot, Changsha 410012, Peoples R China
[2] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Peoples R China
[3] Hunan Univ, Natl Engn Res Ctr Robot Visual Percept & Control T, Changsha 410082, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Three-dimensional displays; Transformers; Object detection; Decoding; Training; Accuracy; Feature extraction; Autonomous driving; monocular 3-D object detection; robotics; scene understanding; transformer
DOI
10.1109/TIM.2024.3415231
CLC Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline Codes
0808; 0809
Abstract
Transformer-based approaches have demonstrated outstanding performance in monocular 3-D object detection, which involves predicting 3-D attributes from a single 2-D image. These methods typically rely on visual and depth representations to identify crucial queries related to objects. However, the features and locations of the queries are expected to be learned adaptively without any prior knowledge, which often leads to imprecise localization in complex scenes and a lengthy training process. To overcome this limitation, we present MonoPSTR, which employs a dynamic position- and scale-aware transformer for monocular 3-D detection. Our approach introduces a dynamically and explicitly position-coded query (DEP-query) and a scale-assisted deformable attention (SDA) module to endow the raw queries with valuable spatial and content cues. Specifically, the DEP-query employs explicit position priors from 3-D projection coordinates to improve the accuracy of query localization, enabling the attention layers in the decoder to avoid noisy background information. The SDA module optimizes the receptive-field learning of queries via the size priors of the corresponding 2-D boxes, so the queries can acquire high-quality visual features. Both the position and size priors require no additional data and are updated in each layer of the decoder to provide long-term assistance. Extensive experiments show that our model outperforms all existing methods in inference speed, reaching 62.5 frames/s. Moreover, compared with its backbone MonoDETR, MonoPSTR converges roughly twice as fast during training and surpasses its precision by over 1.15% on the KITTI dataset, demonstrating its practical value. The code is available at: https://github.com/yangfan293/MonoPSTR/tree/master/MonoPSTR.
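The two mechanisms named in the abstract lend themselves to a short illustration. Below is a minimal PyTorch sketch of the underlying ideas, not the authors' implementation: a content query refreshed in each decoder layer with a sinusoidal encoding of the currently predicted 3-D projection center (the DEP-query idea), and deformable-attention sampling offsets scaled by a 2-D box-size prior (the SDA idea). All names and shapes here (`DEPQuery`, `scale_sampling_locations`, `ref_xy`, `box_wh`) are hypothetical.

```python
import math
import torch
import torch.nn as nn


def sine_embed(coord, num_feats=128, temperature=10000.0):
    """Sinusoidal embedding of a normalized scalar coordinate in [0, 1]."""
    dim_t = torch.arange(num_feats, dtype=torch.float32, device=coord.device)
    dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / num_feats)
    pos = coord[..., None] * 2 * math.pi / dim_t                  # (..., num_feats)
    pos = torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1)
    return pos.flatten(-2)                                        # (..., num_feats)


class DEPQuery(nn.Module):
    """Sketch of a dynamically, explicitly position-coded query: the content
    query is refreshed in every decoder layer with an encoding of that
    layer's current estimate of the projected 3-D object center."""

    def __init__(self, d_model=256):
        super().__init__()
        self.d_model = d_model
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )

    def forward(self, content_query, ref_xy):
        # content_query: (B, N, d_model); ref_xy: (B, N, 2) normalized centers
        pe = torch.cat(
            [sine_embed(ref_xy[..., 0], self.d_model // 2),
             sine_embed(ref_xy[..., 1], self.d_model // 2)], dim=-1
        )
        return content_query + self.proj(pe)  # position prior injected per layer


def scale_sampling_locations(ref_xy, offsets, box_wh):
    """Core idea of scale-assisted deformable attention: learned sampling
    offsets are modulated by each query's 2-D box-size prior, so the
    receptive field tracks the object's extent in the image."""
    # ref_xy: (B, N, 2); offsets: (B, N, heads, points, 2); box_wh: (B, N, 2)
    scale = box_wh[:, :, None, None, :] * 0.5        # half-extent per query
    return ref_xy[:, :, None, None, :] + offsets * scale
```

Since the abstract states that both priors are updated in every decoder layer, a sketch like this would be invoked once per layer with that layer's refreshed center and box predictions.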
Pages: 1-1
Page Count: 13
Related Papers (50 records)
  • [31] Learning region-guided scale-aware feature selection for object detection
    Liu, Liu
    Wang, Rujing
    Xie, Chengjun
    Li, Rui
    Wang, Fangyuan
    Zhou, Man
    Teng, Yue
    NEURAL COMPUTING & APPLICATIONS, 2021, 33 (11): 6389-6403
  • [33] Anchor-Free Object Detection with Scale-Aware Networks for Autonomous Driving
    Piao, Zhengquan
    Wang, Junbo
    Tang, Linbo
    Zhao, Baojun
    Zhou, Shichao
    ELECTRONICS, 2022, 11 (20)
  • [34] Scale-Aware Regional Collective Feature Enhancement Network for Scene Object Detection
    Li, Yiyao
    Liu, Jin
    Gao, Zhenyu
    NEURAL PROCESSING LETTERS, 2023, 55: 6289-6310
  • [35] SAUI: Scale-Aware Unseen Imagineer for Zero-Shot Object Detection
    Wang, Jiahao
    Yan, Caixia
    Zhang, Weizhan
    Liu, Huan
    Sun, Hao
    Zheng, Qinghua
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38, NO 6, 2024: 5445-5453
  • [36] Point-Guided Contrastive Learning for Monocular 3-D Object Detection
    Feng, Dapeng
    Han, Songfang
    Xu, Hang
    Liang, Xiaodan
    Tan, Xiaojun
    IEEE TRANSACTIONS ON CYBERNETICS, 2023, 53 (02): 954-966
  • [37] Monocular 3D Object Detection for Autonomous Driving Based on Contextual Transformer
    She, Xiangyang
    Yan, Weijia
    Dong, Lihong
    COMPUTER ENGINEERING AND APPLICATIONS, 2024, 60 (19): 178-189
  • [38] MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection
    Zhang, Renrui
    Qiu, Han
    Wang, Tai
    Guo, Ziyu
    Cui, Ziteng
    Qiao, Yu
    Li, Hongsheng
    Gao, Peng
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023: 9121-9132
  • [39] MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer
    Zhou, Yunsong
    Zhu, Hongzi
    Liu, Quan
    Chang, Shan
    Guo, Minyi
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023: 17493-17503
  • [40] Ground-Aware Monocular 3D Object Detection for Autonomous Driving
    Liu, Yuxuan
    Yuan, Yixuan
    Liu, Ming
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2021, 6 (02): 919-926