Multi-modal visual tracking based on textual generation

Authors
Wang, Jiahao [1, 2]
Liu, Fang [1, 2]
Jiao, Licheng [1, 2]
Wang, Hao [1, 2]
Li, Shuo [1, 2]
Li, Lingling [1, 2]
Chen, Puhua [1, 2]
Liu, Xu [1, 2]
Affiliations
[1] Xidian Univ, Int Res Ctr Intelligent Percept & Computat, Sch Artificial Intelligence, Key Lab Intelligent Percept & Image Understanding, Xian 710071, Shaanxi Provinc, Peoples R China
[2] Xidian Univ, Sch Artificial Intelligence, Joint Int Res Lab Intelligent Percept & Computat, Xian 710071, Shaanxi Provinc, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation
Keywords
Multi-modal tracking; Image descriptions; Visual and language modalities; Prompt learning; FUSION;
DOI
10.1016/j.inffus.2024.102531
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Multi-modal tracking has garnered significant attention due to its wide range of potential applications. Existing multi-modal tracking approaches typically merge data from different visual modalities on top of RGB tracking. However, focusing solely on the visual modality is insufficient due to the scarcity of tracking data. Inspired by the recent success of large models, this paper introduces a Multi-modal Visual Tracking Based on Textual Generation (MVTTG) approach to address the limitations of visual tracking, which lacks language information and overlooks semantic relationships between the target and the search area. To achieve this, we leverage large models to generate image descriptions, using these descriptions to provide complementary information about the target's appearance and movement. Furthermore, to enhance the consistency between visual and language modalities, we employ prompt learning and design a Visual-Language Interaction Prompt Manager (V-L PM) to facilitate collaborative learning between visual and language domains. Experiments conducted with MVTTG on multiple benchmark datasets confirm the effectiveness and potential of incorporating image descriptions in multi-modal visual tracking.
Pages: 13