Looking Beyond Two Frames: End-to-End Multi-Object Tracking Using Spatial and Temporal Transformers

Cited by: 21

Authors:

Zhu, Tianyu ^{[1]}

Hiller, Markus ^{[2]}

Ehsanpour, Mahsa ^{[3]}

Ma, Rongkai ^{[1]}

Drummond, Tom ^{[2]}

Reid, Ian

Rezatofighi, Hamid ^{[4]}

Affiliations:

[1] Monash Univ, Dept Elect & Comp Syst Engn, Clayton, Vic 3800, Australia

[2] Univ Melbourne, Sch Comp & Informat Syst, Parkville, Vic 3010, Australia

[3] Univ Adelaide, Australian Inst Machine Learning, Adelaide, SA 5005, Australia

[4] Monash Univ, Dept Data Sci & AI, Clayton, Vic 3800, Australia

Source:

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE | 2023, Vol. 45, No. 11

Keywords:

Tracking; Transformers; Task analysis; Visualization; Object recognition; History; Feature extraction; Multi-object tracking; transformer; spatio-temporal model; pedestrian tracking; end-to-end learning; MULTITARGET

DOI:

10.1109/TPAMI.2022.3213073

Chinese Library Classification (CLC):

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Tracking a time-varying indefinite number of objects in a video sequence over time remains a challenge despite recent advances in the field. Most existing approaches are not able to properly handle multi-object tracking challenges such as occlusion, in part because they ignore long-term temporal information. To address these shortcomings, we present MO3TR: a truly end-to-end Transformer-based online multi-object tracking (MOT) framework that learns to handle occlusions, track initiation and termination without the need for an explicit data association module or any heuristics. MO3TR encodes object interactions into long-term temporal embeddings using a combination of spatial and temporal Transformers, and recursively uses the information jointly with the input data to estimate the states of all tracked objects over time. The spatial attention mechanism enables our framework to learn implicit representations between all the objects and the objects to the measurements, while the temporal attention mechanism focuses on specific parts of past information, allowing our approach to resolve occlusions over multiple frames. Our experiments demonstrate the potential of this new approach, achieving results on par with or better than the current state-of-the-art on multiple MOT metrics for several popular multi-object tracking benchmarks.
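To illustrate the temporal attention idea mentioned in the abstract, the following is a minimal sketch of scaled dot-product attention applied to the stored past embeddings of a single track. It is not the authors' MO3TR implementation: the function names `softmax` and `temporal_attention`, the two-dimensional toy embeddings, and the use of plain Python lists are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(query, history):
    """Hypothetical helper: attend from the current track embedding
    (query) over that track's past embeddings (history).

    Returns the attention-weighted context vector and the weights.
    """
    d = len(query)
    # Scaled dot product between the query and each past embedding.
    scores = [
        sum(q * k for q, k in zip(query, past)) / math.sqrt(d)
        for past in history
    ]
    weights = softmax(scores)
    # Weighted sum of past embeddings, one output value per dimension.
    context = [
        sum(w * past[i] for w, past in zip(weights, history))
        for i in range(d)
    ]
    return context, weights


# Toy data (assumed): three past frames of one track, 2-D embeddings.
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
query = [1.0, 0.0]
context, weights = temporal_attention(query, history)
```

In this toy case the first and third past frames match the query, so they receive equal and larger weights than the second frame. This mirrors, in a very reduced form, how attention over several past frames can down-weight a frame in which the object looked different, for example during an occlusion.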


Pages: 12783-12797

Page count: 15
