Egocentric Human Trajectory Forecasting With a Wearable Camera and Multi-Modal Fusion

Cited by: 6
Authors
Qiu, Jianing [1 ]
Chen, Lipeng [2 ]
Gu, Xiao [1 ]
Lo, Frank P-W [1 ]
Tsai, Ya-Yen [1 ]
Sun, Jiankai [2 ,3 ]
Liu, Jiaqi [2 ,4 ]
Lo, Benny [1 ]
Affiliations
[1] Imperial Coll London, Hamlyn Ctr Robot Surg, London SW7 2AZ, England
[2] Tencent Robot X, Shenzhen 518057, Peoples R China
[3] Stanford Univ, Dept Aeronaut & Astronaut, Stanford, CA 94305 USA
[4] Shanghai Jiao Tong Univ, Inst Med Robot, Shanghai 200240, Peoples R China
Keywords
Human trajectory forecasting; egocentric vision; multi-modal learning;
DOI
10.1109/LRA.2022.3188101
CLC Classification
TP24 [Robotics]
Subject Classification
080202; 1405
Abstract
In this letter, we address the problem of forecasting the trajectory of an egocentric camera wearer (ego-person) in crowded spaces. The trajectory forecasting ability learned from data of different camera wearers walking around in the real world can be transferred to assist visually impaired people in navigation, as well as to instill human navigation behaviours in mobile robots, enabling better human-robot interactions. To this end, a novel egocentric human trajectory forecasting dataset was constructed, containing real trajectories of people navigating in crowded spaces while wearing a camera, as well as extracted rich contextual data. We extract and utilize three different modalities to forecast the trajectory of the camera wearer, i.e., his/her past trajectory, the past trajectories of nearby people, and the environment, such as the scene semantics or the depth of the scene. A Transformer-based encoder-decoder neural network model, integrated with a novel cascaded cross-attention mechanism that fuses multiple modalities, has been designed to predict the future trajectory of the camera wearer. Extensive experiments have been conducted, with results showing that our model outperforms the state-of-the-art methods in egocentric human trajectory forecasting.
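A minimal sketch of how the cascaded cross-attention fusion described in the abstract could be organised is given below (PyTorch). The layer dimensions, the modality fusion order, and all names (e.g. CascadedCrossAttentionFusion, attn_people, attn_scene) are assumptions for illustration, not the authors' released implementation.

import torch
import torch.nn as nn


class CascadedCrossAttentionFusion(nn.Module):
    """Sketch: fuse the ego past-trajectory features with two context
    modalities (nearby-people trajectories and scene features) by chaining
    two cross-attention stages; the output of the first stage becomes the
    query of the second."""

    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.attn_people = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_scene = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, ego_feat, people_feat, scene_feat):
        # ego_feat:    (B, T_ego, d_model)  encoded past trajectory of the wearer
        # people_feat: (B, T_ppl, d_model)  encoded trajectories of nearby people
        # scene_feat:  (B, T_scn, d_model)  encoded scene semantics / depth tokens
        # Stage 1: ego queries attend to nearby people.
        x, _ = self.attn_people(ego_feat, people_feat, people_feat)
        x = self.norm1(ego_feat + x)
        # Stage 2: the people-aware representation then attends to the scene.
        y, _ = self.attn_scene(x, scene_feat, scene_feat)
        return self.norm2(x + y)


if __name__ == "__main__":
    fusion = CascadedCrossAttentionFusion()
    ego = torch.randn(2, 8, 128)      # 8 past ego positions
    people = torch.randn(2, 20, 128)  # 20 nearby-person tokens
    scene = torch.randn(2, 64, 128)   # 64 scene tokens
    print(fusion(ego, people, scene).shape)  # torch.Size([2, 8, 128])

In this reading, the fused representation would feed a Transformer decoder that autoregressively predicts the wearer's future positions; the exact decoder design and loss are described in the paper itself.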
Pages: 8799-8806
Page count: 8