Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

Cited by: 0

Authors:

Yan, Shilin [1,2]

Zhang, Renrui [2,3]

Guo, Ziyu [3]

Chen, Wenchao [1]

Zhang, Wei [1]

Li, Hongyang [2]

Qiao, Yu [2]

Dong, Hao [4,5]

He, Zhongjiang [6]

Gao, Peng [2]

Affiliations:

[1] Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China

[2] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China

[3] Chinese Univ Hong Kong, Hong Kong, Peoples R China

[4] Peking Univ, Sch CS, Beijing, Peoples R China

[5] PKU agibot Lab, Beijing, Peoples R China

[6] China Telecom Corp Ltd, Data&AI Technol Co, Beijing, Peoples R China

Source:

THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6 | 2024

Funding:

National Natural Science Foundation of China

Keywords:

DOI:

Not available

Chinese Library Classification:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has attracted increasing attention in both industry and academia. The task is challenging because it requires exploring both the semantic alignment within modalities and the visual correspondence across frames. However, existing methods adopt separate network architectures for different modalities and neglect the inter-frame temporal interaction with references. In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation. As the first unified framework for this task, MUTR adopts a DETR-style transformer and can segment video objects designated by either a text or an audio reference. Specifically, we introduce two strategies to fully explore the temporal relations between videos and multi-modal signals. First, for low-level temporal aggregation before the transformer, we enable the multi-modal references to capture multi-scale visual cues from consecutive video frames. This endows the text or audio signals with temporal knowledge and boosts the semantic alignment between modalities. Second, for high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, which contributes to better object-wise correspondence for tracking along the video. On the Ref-YouTube-VOS and AVSBench datasets, with text and audio references respectively, MUTR achieves +4.2% and +8.7% J&F improvements over state-of-the-art methods, demonstrating its significance for unified multi-modal VOS. Code is released at https://github.com/OpenGVLab/MUTR.

Pages: 6449-6457

Page count: 9
