Multi-modal visual tracking based on textual generation

Cited by: 0
Authors
Wang, Jiahao [1 ,2 ]
Liu, Fang [1 ,2 ]
Jiao, Licheng [1 ,2 ]
Wang, Hao [1 ,2 ]
Li, Shuo [1 ,2 ]
Li, Lingling [1 ,2 ]
Chen, Puhua [1 ,2 ]
Liu, Xu [1 ,2 ]
Affiliations
[1] Xidian Univ, Int Res Ctr Intelligent Percept & Computat, Sch Artificial Intelligence, Key Lab Intelligent Percept & Image Understanding, Xian 710071, Shaanxi, Peoples R China
[2] Xidian Univ, Sch Artificial Intelligence, Joint Int Res Lab Intelligent Percept & Computat, Xian 710071, Shaanxi, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation;
Keywords
Multi-modal tracking; Image descriptions; Visual and language modalities; Prompt learning; FUSION;
DOI
10.1016/j.inffus.2024.102531
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multi-modal tracking has garnered significant attention due to its wide range of potential applications. Existing multi-modal tracking approaches typically merge data from different visual modalities on top of RGB tracking. However, focusing solely on the visual modality is insufficient due to the scarcity of tracking data. Inspired by the recent success of large models, this paper introduces a Multi-modal Visual Tracking Based on Textual Generation (MVTTG) approach to address the limitations of visual tracking, which lacks language information and overlooks semantic relationships between the target and the search area. To achieve this, we leverage large models to generate image descriptions, using these descriptions to provide complementary information about the target's appearance and movement. Furthermore, to enhance the consistency between visual and language modalities, we employ prompt learning and design a Visual-Language Interaction Prompt Manager (V-L PM) to facilitate collaborative learning between visual and language domains. Experiments conducted with MVTTG on multiple benchmark datasets confirm the effectiveness and potential of incorporating image descriptions in multi-modal visual tracking.
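The abstract's core idea — embedding generated image descriptions and letting them interact with visual features as prompts — can be illustrated with a toy sketch. This is not the authors' implementation; the function names and the simple dot-product attention below are hypothetical simplifications of what a visual-language prompt manager might do.

```python
import math

def dot(a, b):
    # Inner product of two equal-length vectors
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    # Numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_prompts(visual_feat, prompt_feats):
    """Toy prompt fusion: attend from a visual feature vector over
    text-prompt embeddings (e.g. embeddings of generated descriptions)
    and add the attended summary back as a residual."""
    scores = softmax([dot(visual_feat, p) for p in prompt_feats])
    d = len(visual_feat)
    attended = [sum(w * p[i] for w, p in zip(scores, prompt_feats))
                for i in range(d)]
    return [v + a for v, a in zip(visual_feat, attended)]

# A visual feature aligned with the first of two text prompts
# is pulled further toward that prompt's direction.
fused = fuse_prompts([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In the paper's setting the prompts would come from a large model's image descriptions and the interaction would be learned; here the weighting is fixed dot-product attention purely to show the data flow.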
Pages: 13
Related Papers
50 records in total
  • [21] VISUAL AS MULTI-MODAL ARGUMENTATION IN LAW
    Novak, Marko
    BRATISLAVA LAW REVIEW, 2021, 5 (01): 91 - 110
  • [22] Multi-modal measurement of the visual cortex
    Amano, Kaoru
    Takemura, Hiromasa
    I-PERCEPTION, 2014, 5 (04): 408 - 408
  • [23] Visual Sorting Method Based on Multi-Modal Information Fusion
    Han, Song
    Liu, Xiaoping
    Wang, Gang
    APPLIED SCIENCES-BASEL, 2022, 12 (06):
  • [24] Multi-Modal Sparse Tracking by Jointing Timing and Modal Consistency
    Li, Jiajun
    Fang, Bin
    Zhou, Mingliang
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2022, 36 (06)
  • [25] Knowledge Synergy Learning for Multi-Modal Tracking
    He, Yuhang
    Ma, Zhiheng
    Wei, Xing
    Gong, Yihong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07): 5519 - 5532
  • [26] Agile Multi-modal Tracking with Dependent Measurements
    Zhang, Jun Jason
    Ding, Quan
    Kay, Steven
    Papandreou-Suppappola, Antonia
    Rangaswamy, Muralidhar
    2010 CONFERENCE RECORD OF THE FORTY FOURTH ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS AND COMPUTERS (ASILOMAR), 2010: 1653 - 1657
  • [27] Multi-modal tracking using texture changes
    Kemp, Christopher
    Drummond, Tom
    IMAGE AND VISION COMPUTING, 2008, 26 (03): 442 - 450
  • [28] Multi-modal tracking of faces for video communications
    Crowley, JL
    Berard, F
    1997 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, PROCEEDINGS, 1997: 640 - 645
  • [29] Deep Object Tracking with Multi-modal Data
    Zhang, Xuezhi
    Yuan, Yuan
    Lu, Xiaoqiang
    2016 INTERNATIONAL CONFERENCE ON COMPUTER, INFORMATION AND TELECOMMUNICATION SYSTEMS (CITS), 2016: 161 - 165
  • [30] TV commercial classification by using multi-modal textual information
    Zheng, Yantao
    Duan, Lingyu
    Tian, Qi
    Jin, Jesse S.
    2006 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO - ICME 2006, VOLS 1-5, PROCEEDINGS, 2006: 497 - 500