Looking Beyond Two Frames: End-to-End Multi-Object Tracking Using Spatial and Temporal Transformers

Cited by: 21
Authors
Zhu, Tianyu [1]
Hiller, Markus [2]
Ehsanpour, Mahsa [3]
Ma, Rongkai [1]
Drummond, Tom [2]
Reid, Ian
Rezatofighi, Hamid [4]
Affiliations
[1] Monash Univ, Dept Elect & Comp Syst Engn, Clayton, Vic 3800, Australia
[2] Univ Melbourne, Sch Comp & Informat Syst, Parkville, Vic 3010, Australia
[3] Univ Adelaide, Australian Inst Machine Learning, Adelaide, SA 5005, Australia
[4] Monash Univ, Dept Data Sci & AI, Clayton, Vic 3800, Australia
Keywords
Tracking; Transformers; Task analysis; Visualization; Object recognition; History; Feature extraction; Multi-object tracking; transformer; spatio-temporal model; pedestrian tracking; end-to-end learning; MULTITARGET
DOI
10.1109/TPAMI.2022.3213073
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Tracking a time-varying indefinite number of objects in a video sequence over time remains a challenge despite recent advances in the field. Most existing approaches are not able to properly handle multi-object tracking challenges such as occlusion, in part because they ignore long-term temporal information. To address these shortcomings, we present MO3TR: a truly end-to-end Transformer-based online multi-object tracking (MOT) framework that learns to handle occlusions, track initiation and termination without the need for an explicit data association module or any heuristics. MO3TR encodes object interactions into long-term temporal embeddings using a combination of spatial and temporal Transformers, and recursively uses the information jointly with the input data to estimate the states of all tracked objects over time. The spatial attention mechanism enables our framework to learn implicit representations between all the objects and the objects to the measurements, while the temporal attention mechanism focuses on specific parts of past information, allowing our approach to resolve occlusions over multiple frames. Our experiments demonstrate the potential of this new approach, achieving results on par with or better than the current state-of-the-art on multiple MOT metrics for several popular multi-object tracking benchmarks.
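The temporal attention idea described in the abstract can be illustrated with a minimal sketch of generic scaled dot-product attention applied to one track's history. This is not the MO3TR implementation: the 2-D embeddings, the names `attend`, `history`, and `current_query`, and the absence of learned projection matrices are all assumptions made for illustration. The sketch only shows how a query can place less weight on an uninformative past frame (standing in for an occluded observation) than on informative ones.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns the attention weights and the weighted sum of the values.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical 2-D embeddings of one track over four past frames.
# The third frame stands in for an occluded, uninformative observation.
history = [[1.0, 0.0], [0.9, 0.1], [0.0, 0.0], [1.0, 0.1]]
current_query = [1.0, 0.0]

# Keys and values are both the raw history here; a real Transformer
# would first apply learned linear projections to each.
weights, track_embedding = attend(current_query, history, history)
```

In this toy setting the all-zero frame receives the smallest weight, so the aggregated `track_embedding` is dominated by the frames that resemble the query. A spatial attention step could reuse the same `attend` function with other objects' embeddings or measurements as keys and values, again only as an illustrative assumption.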
Pages: 12783-12797
Number of pages: 15
Related papers
50 in total
  • [31] An end-to-end tracking framework via multi-view and temporal feature aggregation
    Yang, Yihan
    Xu, Ming
    Ralph, Jason F.
    Ling, Yuchen
    Pan, Xiaonan
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 249
  • [32] GSMR-CNN: An End-to-End Trainable Architecture for Grasping Target Objects from Multi-Object Scenes
    Holomjova, Valerija
    Starkey, Andrew J.
    Meissner, Pascal
    2023 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, ICRA, 2023, : 3808 - 3814
  • [33] Multi-Object Tracking With Spatial-Temporal Topology-Based Detector
    You, Sisi
    Yao, Hantao
    Xu, Changsheng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (05) : 3023 - 3035
  • [34] Temporal-Spatial Feature Interaction Network for Multi-Drone Multi-Object Tracking
    Wu, Han
    Sun, Hao
    Ji, Kefeng
    Kuang, Gangyao
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35 (02) : 1165 - 1179
  • [35] Enhancing Online UAV Multi-Object Tracking with Temporal Context and Spatial Topological Relationships
    Xiao, Changcheng
    Cao, Qiong
    Zhong, Yujie
    Lan, Long
    Zhang, Xiang
    Cai, Huayue
    Luo, Zhigang
    DRONES, 2023, 7 (06)
  • [36] ST-Tracking: Spatial-temporal Graph Convolution Neural Network for Multi-object Tracking
    Xue, Yaqing
    Luo, Guiyang
    Yuan, Quan
    Li, Jinglin
    Yang, Fangchun
    2021 IEEE INTELLIGENT TRANSPORTATION SYSTEMS CONFERENCE (ITSC), 2021, : 2958 - 2964
  • [37] Spatial-Temporal Routing for Supporting End-to-End Hard Deadlines in Multi-hop Networks
    Liu, Xin
    Ying, Lei
    2016 ANNUAL CONFERENCE ON INFORMATION SCIENCE AND SYSTEMS (CISS), 2016,
  • [38] Spatial-temporal routing for supporting end-to-end hard deadlines in multi-hop networks
    Liu, Xin
    Wang, Weichang
    Ying, Lei
    PERFORMANCE EVALUATION, 2019, 135
  • [39] Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism
    Chu, Qi
    Ouyang, Wanli
    Li, Hongsheng
    Wang, Xiaogang
    Liu, Bin
    Yu, Nenghai
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 4846 - 4855
  • [40] ReST: A Reconfigurable Spatial-Temporal Graph Model for Multi-Camera Multi-Object Tracking
    Cheng, Cheng-Che
    Qiu, Min-Xuan
    Chiang, Chen-Kuo
    Lai, Shang-Hong
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 10017 - 10026