Multi-modal visual tracking based on textual generation

Cited: 0
Authors
Wang, Jiahao [1 ,2 ]
Liu, Fang [1 ,2 ]
Jiao, Licheng [1 ,2 ]
Wang, Hao [1 ,2 ]
Li, Shuo [1 ,2 ]
Li, Lingling [1 ,2 ]
Chen, Puhua [1 ,2 ]
Liu, Xu [1 ,2 ]
Affiliations
[1] Xidian Univ, Int Res Ctr Intelligent Percept & Computat, Sch Artificial Intelligence, Key Lab Intelligent Percept & Image Understanding, Xian 710071, Shaanxi, Peoples R China
[2] Xidian Univ, Sch Artificial Intelligence, Joint Int Res Lab Intelligent Percept & Computat, Xian 710071, Shaanxi, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation;
Keywords
Multi-modal tracking; Image descriptions; Visual and language modalities; Prompt learning; FUSION;
DOI
10.1016/j.inffus.2024.102531
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Multi-modal tracking has garnered significant attention due to its wide range of potential applications. Existing multi-modal tracking approaches typically merge data from different visual modalities on top of RGB tracking. However, focusing solely on the visual modality is insufficient due to the scarcity of tracking data. Inspired by the recent success of large models, this paper introduces a Multi-modal Visual Tracking Based on Textual Generation (MVTTG) approach to address the limitations of visual tracking, which lacks language information and overlooks semantic relationships between the target and the search area. To achieve this, we leverage large models to generate image descriptions, using these descriptions to provide complementary information about the target's appearance and movement. Furthermore, to enhance the consistency between visual and language modalities, we employ prompt learning and design a Visual-Language Interaction Prompt Manager (V-L PM) to facilitate collaborative learning between visual and language domains. Experiments conducted with MVTTG on multiple benchmark datasets confirm the effectiveness and potential of incorporating image descriptions in multi-modal visual tracking.
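The abstract's core idea, generating a textual description of the target and injecting it into the visual token stream via learned prompts, can be illustrated with a minimal sketch. Everything below is hypothetical: the paper's actual architecture is not specified here, so the encoder stand-in, the class name `VLPromptManager`, and all dimensions are illustrative assumptions, using NumPy in place of a deep-learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_description_embedding(description: str, dim: int = 64) -> np.ndarray:
    """Stand-in for a large-model text encoder (hypothetical): hash each
    word to a pseudo-random vector and average them into one embedding."""
    vecs = []
    for word in description.lower().split():
        word_rng = np.random.default_rng(abs(hash(word)) % (2**32))
        vecs.append(word_rng.standard_normal(dim))
    return np.mean(vecs, axis=0)

class VLPromptManager:
    """Hypothetical sketch of a visual-language prompt manager: project the
    description embedding into a few prompt tokens and prepend them to the
    visual token sequence, so later attention layers can mix the modalities."""
    def __init__(self, dim: int = 64, num_prompts: int = 4):
        self.num_prompts = num_prompts
        # one (randomly initialised) linear projection per prompt token;
        # in a real tracker these would be trained via prompt learning
        self.proj = rng.standard_normal((num_prompts, dim, dim)) * 0.02

    def __call__(self, visual_tokens: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
        prompts = np.stack([W @ text_emb for W in self.proj])   # (P, dim)
        return np.concatenate([prompts, visual_tokens], axis=0)  # (P + N, dim)

# usage: 16 visual tokens of dim 64, fused with a generated description
visual_tokens = rng.standard_normal((16, 64))
text_emb = generate_description_embedding("a red car moving left across the road")
fused = VLPromptManager()(visual_tokens, text_emb)
print(fused.shape)  # (20, 64)
```

The design choice being sketched is that the language signal enters as a handful of extra tokens rather than a separate branch, which keeps the visual backbone unchanged, one plausible reading of "prompt learning" for cross-modal consistency.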
Pages: 13
Related Papers
50 records total
  • [1] Visual Prompt Multi-Modal Tracking
    Zhu, Jiawen
    Lai, Simiao
    Chen, Xin
    Wang, Dong
    Lu, Huchuan
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 9516 - 9526
  • [2] Mining Visual and Textual Data for Constructing a Multi-Modal Thesaurus
    Frigui, Hichem
Caudill, Joshua
    PROCEEDINGS OF THE SEVENTH SIAM INTERNATIONAL CONFERENCE ON DATA MINING, 2007, : 479 - 484
  • [3] Multi-modal recommendation algorithm fusing visual and textual features
    Hu, Xuefeng
    Yu, Wenting
    Wu, Yun
    Chen, Yukang
    PLOS ONE, 2023, 18 (06):
  • [4] Multi-modal visual tracking: Review and experimental comparison
    Zhang, Pengyu
    Wang, Dong
    Lu, Huchuan
    COMPUTATIONAL VISUAL MEDIA, 2024, 10 (02) : 193 - 214
  • [5] Generation of Visual Representations for Multi-Modal Mathematical Knowledge
    Wu, Lianlong
    Choi, Seewon
    Raggi, Daniel
    Stockdill, Aaron
    Garcia, Grecia Garcia
    Colarusso, Fiorenzo
    Cheng, Peter C. H.
    Jamnik, Mateja
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 21, 2024, : 23850 - 23852
  • [6] Learning consumer preferences through textual and visual data: a multi-modal approach
    Liu, Xinyu
    Liu, Yezheng
    Qian, Yang
    Jiang, Yuanchun
    Ling, Haifeng
    ELECTRONIC COMMERCE RESEARCH, 2023,
  • [7] Visual audio and textual triplet fusion network for multi-modal sentiment analysis
    Lv, Cai-Chao
    Zhang, Xuan
    Zhang, Hong-Bo
    SIGNAL IMAGE AND VIDEO PROCESSING, 2024, : 9505 - 9513
  • [8] Multi-modal Retrieval via Deep Textual-Visual Correlation Learning
    Song, Jun
    Wang, Yueyang
    Wu, Fei
    Lu, Weiming
    Tang, Siliang
    Zhuang, Yueting
    INTELLIGENCE SCIENCE AND BIG DATA ENGINEERING: IMAGE AND VIDEO DATA ENGINEERING, ISCIDE 2015, PT I, 2015, 9242 : 176 - 185
  • [9] Prompting for Multi-Modal Tracking
    Yang, Jinyu
    Li, Zhe
    Zheng, Feng
    Leonardis, Ales
    Song, Jingkuan
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 3492 - 3500