Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

Cited by: 0
Authors:
Yan, Shilin [1,2]
Zhang, Renrui [2,3]
Guo, Ziyu [3]
Chen, Wenchao [1]
Zhang, Wei [1]
Li, Hongyang [2]
Qiao, Yu [2]
Dong, Hao [4,5]
He, Zhongjiang [6]
Gao, Peng [2]
Affiliations:
[1] Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China
[2] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China
[3] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[4] Peking Univ, Sch CS, Beijing, Peoples R China
[5] PKU agibot Lab, Beijing, Peoples R China
[6] China Telecom Corp Ltd, Data&AI Technol Co, Beijing, Peoples R China
Funding:
National Natural Science Foundation of China
Keywords:
DOI:
N/A
CLC number:
TP18 [Theory of Artificial Intelligence]
Discipline codes:
081104; 0812; 0835; 1405
Abstract:
Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has attracted increasing attention in both industry and academia. It is challenging to explore the semantic alignment within modalities and the visual correspondence across frames. However, existing methods adopt separate network architectures for different modalities and neglect the inter-frame temporal interaction with references. In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation. For the first time in a unified framework, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio references. Specifically, we introduce two strategies to fully explore the temporal relations between videos and multi-modal signals. First, for low-level temporal aggregation before the transformer, we enable the multi-modal references to capture multi-scale visual cues from consecutive video frames. This effectively endows the text or audio signals with temporal knowledge and boosts the semantic alignment between modalities. Second, for high-level temporal interaction after the transformer, we conduct inter-frame feature communication among different object embeddings, contributing to better object-wise correspondence for tracking along the video. On the Ref-YouTube-VOS and AVSBench datasets with text and audio references respectively, MUTR achieves +4.2% and +8.7% J&F improvements over state-of-the-art methods, demonstrating the significance of our unified framework for multi-modal VOS. Code is released at https://github.com/OpenGVLab/MUTR.
Pages: 6449-6457
Page count: 9
Related papers (50 total):
  • [1] Unified Multi-Modality Video Object Segmentation Using Reinforcement Learning
    Sun, Mingjie
    Xiao, Jimin
    Lim, Eng Gee
    Zhao, Cairong
    Zhao, Yao
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (08) : 6722 - 6734
  • [2] Multi-Modality Video Scene Segmentation Algorithm with Shot Force Competition
    Xiang, Yun-zhu
    APPLIED SCIENCE, MATERIALS SCIENCE AND INFORMATION TECHNOLOGIES IN INDUSTRY, 2014, 513-517 : 514 - 517
  • [3] Hyper-Connected Transformer Network for Multi-Modality PET-CT Segmentation
    Bi, Lei
    Fulham, Michael
    Song, Shaoli
    Feng, David Dagan
    Kim, Jinman
    2023 45TH ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE & BIOLOGY SOCIETY, EMBC, 2023,
  • [4] Robust Multi-Modality Multi-Object Tracking
    Zhang, Wenwei
    Zhou, Hui
    Sun, Shuyang
    Wang, Zhe
    Shi, Jianping
    Loy, Chen Change
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 2365 - 2374
  • [5] TLCFuse: Temporal Multi-Modality Fusion Towards Occlusion-Aware Semantic Segmentation
    Salazar-Gomez, Gustavo
    Liu, Wenqian
    Diaz-Zapata, Manuel
    Sierra-Gonzalez, David
    Laugier, Christian
    2024 35TH IEEE INTELLIGENT VEHICLES SYMPOSIUM, IEEE IV 2024, 2024, : 2110 - 2116
  • [6] M2FTrans: Modality-Masked Fusion Transformer for Incomplete Multi-Modality Brain Tumor Segmentation
    Shi, Junjie
    Yu, Li
    Cheng, Qimin
    Yang, Xin
    Cheng, Kwang-Ting
    Yan, Zengqiang
    IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2024, 28 (01) : 379 - 390
  • [7] Unified Spatio-Temporal Dynamic Routing for Efficient Video Object Segmentation
    Dang, Jisheng
    Zheng, Huicheng
    Xu, Xiaohao
    Guo, Yulan
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2024, 25 (05) : 4512 - 4526
  • [8] Object Tracking Based on Multi-modality Dictionary Learning
    Wang, Jing
    Zhu, Hong
    Xue, Shan
    Shi, Jing
    IMAGE AND GRAPHICS (ICIG 2017), PT II, 2017, 10667 : 129 - 138
  • [9] MixNet: Multi-modality Mix Network for Brain Segmentation
    Chen, Long
    Merhof, Dorit
    BRAINLESION: GLIOMA, MULTIPLE SCLEROSIS, STROKE AND TRAUMATIC BRAIN INJURIES, BRAINLES 2018, PT I, 2019, 11383 : 367 - 377
  • [10] Learning based Multi-modality Image and Video Compression
    Lu, Guo
    Zhong, Tianxiong
    Geng, Jing
    Hu, Qiang
    Xu, Dong
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 6073 - 6082