Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

Cited by: 0
Authors
Yan, Shilin [1,2]
Zhang, Renrui [2,3]
Guo, Ziyu [3]
Chen, Wenchao [1]
Zhang, Wei [1]
Li, Hongyang [2]
Qiao, Yu [2]
Dong, Hao [4,5]
He, Zhongjiang [6]
Gao, Peng [2]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China
[2] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China
[3] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[4] Peking Univ, Sch CS, Beijing, Peoples R China
[5] PKU-Agibot Lab, Beijing, Peoples R China
[6] China Telecom Corp Ltd, Data&AI Technol Co, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has attracted increasing attention in both industry and academia. The task is challenging in two respects: exploring the semantic alignment between modalities and the visual correspondence across frames. However, existing methods adopt separate network architectures for different modalities and neglect the inter-frame temporal interaction with references. In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation. For the first time in a unified framework, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio references. Specifically, we introduce two strategies to fully exploit the temporal relations between videos and multi-modal signals. First, for low-level temporal aggregation before the transformer, we enable the multi-modal references to capture multi-scale visual cues from consecutive video frames, which endows the text or audio signals with temporal knowledge and boosts the semantic alignment between modalities. Second, for high-level temporal interaction after the transformer, we conduct inter-frame feature communication among different object embeddings, contributing to better object-wise correspondence for tracking along the video. On the Ref-YouTube-VOS and AVSBench datasets with text and audio references, respectively, MUTR achieves +4.2% and +8.7% J&F improvements over state-of-the-art methods, demonstrating the significance of unified multi-modal VOS. Code is released at https://github.com/OpenGVLab/MUTR.
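To make the two temporal strategies in the abstract concrete, below is a minimal sketch in PyTorch. It is an illustration under assumptions, not the authors' implementation (which lives in the repository linked above): the module names, the 256-dimensional features, the single attention layer per stage, and the tensor layouts are all hypothetical choices made for exposition.

# A minimal, illustrative sketch of MUTR's two temporal strategies.
# All names and dimensions here are assumptions for exposition;
# see https://github.com/OpenGVLab/MUTR for the actual code.
import torch
import torch.nn as nn

class LowLevelTemporalAggregation(nn.Module):
    # Before the transformer: text/audio reference tokens cross-attend
    # to visual features pooled from consecutive frames, injecting
    # temporal knowledge into the multi-modal reference.
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ref_tokens, frame_feats):
        # ref_tokens:  (B, N_ref, C)  -- encoded text or audio reference
        # frame_feats: (B, T*HW, C)   -- multi-scale visual cues flattened
        #                                across T consecutive frames
        attended, _ = self.cross_attn(ref_tokens, frame_feats, frame_feats)
        return self.norm(ref_tokens + attended)

class HighLevelTemporalInteraction(nn.Module):
    # After the transformer: object embeddings from different frames
    # exchange information along the temporal axis, improving
    # object-wise correspondence for tracking.
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_embeds):
        # obj_embeds: (B, T, N_obj, C) -- per-frame object embeddings
        B, T, N, C = obj_embeds.shape
        # Attend over the T frames independently for each object slot.
        x = obj_embeds.permute(0, 2, 1, 3).reshape(B * N, T, C)
        attended, _ = self.self_attn(x, x, x)
        x = self.norm(x + attended)
        return x.reshape(B, N, T, C).permute(0, 2, 1, 3)

if __name__ == "__main__":
    B, T, N_ref, N_obj, HW, C = 2, 5, 10, 5, 64, 256
    refs = torch.randn(B, N_ref, C)
    frames = torch.randn(B, T * HW, C)
    objs = torch.randn(B, T, N_obj, C)
    refs = LowLevelTemporalAggregation(C)(refs, frames)   # (B, N_ref, C)
    objs = HighLevelTemporalInteraction(C)(objs)          # (B, T, N_obj, C)

Attending over frames independently per object slot is one plausible reading of "inter-frame feature communication"; the paper's actual mechanism may differ in how queries are shared and how multi-scale features enter each stage.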
Pages: 6449-6457
Number of pages: 9
Related Papers
50 items in total
  • [31] Maninis, Kevis-Kokitsi; Caelles, Sergi; Chen, Yuhua; Pont-Tuset, Jordi; Leal-Taixé, Laura; Cremers, Daniel; Van Gool, Luc. Video Object Segmentation without Temporal Information. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2019, 41(06): 1515-1530.
  • [32] Gao, Lina; Fu, Ping; Xu, Mingzhu; Wang, Tiantian; Liu, Bing. UMINet: a unified multi-modality interaction network for RGB-D and RGB-T salient object detection. VISUAL COMPUTER, 2024, 40(03): 1565-1582.
  • [33] Wang, Meng; Hua, Xian-Sheng; Song, Yan; Tang, Jinhui; Dai, Li-Rong. Multi-concept multi-modality active learning for interactive video annotation. ICSC 2007: INTERNATIONAL CONFERENCE ON SEMANTIC COMPUTING, PROCEEDINGS, 2007: 321+.
  • [34] Zhang, Yikun; Yao, Rui; Jiang, Qingnan; Zhang, Changbin; Wang, Shi. Video Object Segmentation with Weakly Temporal Information. KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS, 2019, 13(03): 1434-1449.
  • [35] Wang, Meng; Hua, Xian-Sheng; Mei, Tao; Tang, Jinhui; Qi, Guo-Jun; Song, Yan; Dai, Li-Rong. Interactive Video Annotation by Multi-concept Multi-modality Active Learning. INTERNATIONAL JOURNAL OF SEMANTIC COMPUTING, 2007, 1(04): 459-477.
  • [36] Wang, S.; Yuan, L.; Mahon, R.; Weiss, E. Multi-Modality Convolutional Neural Network for Automatic Lung Tumor Segmentation. MEDICAL PHYSICS, 2020, 47(06): E302-E303.
  • [37] Ly, Buntheng; Cochet, Hubert; Sermesant, Maxime. Style Data Augmentation for Robust Segmentation of Multi-modality Cardiac MRI. STATISTICAL ATLASES AND COMPUTATIONAL MODELS OF THE HEART: MULTI-SEQUENCE CMR SEGMENTATION, CRT-EPIGGY AND LV FULL QUANTIFICATION CHALLENGES, 2020, 12009: 197-208.
  • [38] Kang, Susu; Kang, Yixiong; Tan, Shan. Exploring and Exploiting Multi-Modality Uncertainty for Tumor Segmentation on PET/CT. IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2024, 28(09): 5435-5446.
  • [39] Chen, Huai; Qi, Yuxiao; Yin, Yong; Li, Tengxiang; Liu, Xiaoqing; Li, Xiuli; Gong, Guanzhong; Wang, Lisheng. MMFNet: A multi-modality MRI fusion network for segmentation of nasopharyngeal carcinoma. NEUROCOMPUTING, 2020, 394: 27-40.
  • [40] Huang, Ling; Denoeux, Thierry; Vera, Pierre; Ruan, Su. Evidence Fusion with Contextual Discounting for Multi-modality Medical Image Segmentation. MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2022, PT V, 2022, 13435: 401-411.