Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

Cited by: 0

Authors:

Yan, Shilin [1,2]

Zhang, Renrui [2,3]

Guo, Ziyu [3]

Chen, Wenchao [1]

Zhang, Wei [1]

Li, Hongyang [2]

Qiao, Yu [2]

Dong, Hao [4,5]

He, Zhongjiang [6]

Gao, Peng [2]

Affiliations:

[1] Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China

[2] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China

[3] Chinese Univ Hong Kong, Hong Kong, Peoples R China

[4] Peking Univ, Sch CS, Beijing, Peoples R China

[5] PKU agibot Lab, Beijing, Peoples R China

[6] China Telecom Corp Ltd, Data&AI Technol Co, Beijing, Peoples R China

Source:

THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6 | 2024

Funding:

National Natural Science Foundation of China;

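Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract: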

Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has attracted increasing attention in both industry and academia. It is challenging to explore both the semantic alignment within modalities and the visual correspondence across frames. However, existing methods adopt separate network architectures for different modalities and neglect the inter-frame temporal interaction with references. In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation. With a unified framework for the first time, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference. Specifically, we introduce two strategies to fully explore the temporal relations between videos and multi-modal signals. Firstly, for low-level temporal aggregation before the transformer, we enable the multi-modal references to capture multi-scale visual cues from consecutive video frames. This effectively endows the text or audio signals with temporal knowledge and boosts the semantic alignment between modalities. Secondly, for high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video. On the Ref-YouTube-VOS and AVSBench datasets, with text and audio references respectively, MUTR achieves +4.2% and +8.7% J&F improvements over state-of-the-art methods, demonstrating its significance for unified multi-modal VOS. Code is released at https://github.com/OpenGVLab/MUTR.
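
The low-level temporal aggregation described in the abstract, where a text or audio reference gathers visual cues from consecutive frames, can be loosely illustrated with a single cross-attention step. The sketch below is not the MUTR implementation: it assumes one reference vector, a single feature scale, one attention head, and no learned projections, and the names `softmax` and `aggregate_reference` are hypothetical helpers introduced only for this illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_reference(reference, frames):
    """Let one reference vector attend to visual tokens from all frames.

    reference: list of d floats (a text or audio embedding; assumed shape).
    frames:    list of frames, each a list of visual tokens of d floats.
    Returns (updated_reference, attention_weights).
    """
    d = len(reference)
    # Flatten tokens from every frame so attention spans the whole clip.
    tokens = [tok for frame in frames for tok in frame]
    # Scaled dot-product similarity between the reference and each token.
    scores = [
        sum(r * t for r, t in zip(reference, tok)) / math.sqrt(d)
        for tok in tokens
    ]
    weights = softmax(scores)
    # Attention-weighted sum of visual tokens.
    context = [
        sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)
    ]
    # Residual update: the reference now carries clip-level visual cues.
    updated = [r + c for r, c in zip(reference, context)]
    return updated, weights


reference = [1.0, 0.0]
frames = [
    [[1.0, 0.0], [0.0, 1.0]],  # frame 0: two visual tokens
    [[1.0, 0.0]],              # frame 1: one visual token
]
updated, weights = aggregate_reference(reference, frames)
```

In this toy clip, the tokens aligned with the reference receive equal and larger weights than the orthogonal token, regardless of which frame they come from. The released code at the URL above is the authoritative source for how the actual model performs this step.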

Pages: 6449-6457

Page count: 9
