Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

Cited by: 0
Authors
Yan, Shilin [1 ,2 ]
Zhang, Renrui [2 ,3 ]
Guo, Ziyu [3 ]
Chen, Wenchao [1 ]
Zhang, Wei [1 ]
Li, Hongyang [2 ]
Qiao, Yu [2 ]
Dong, Hao [4 ,5 ]
He, Zhongjiang [6 ]
Gao, Peng [2 ]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China
[2] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China
[3] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[4] Peking Univ, Sch CS, Beijing, Peoples R China
[5] PKU agibot Lab, Beijing, Peoples R China
[6] China Telecom Corp Ltd, Data&AI Technol Co, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has attracted increasing attention in both industry and academia. It is challenging to explore the semantic alignment within modalities and the visual correspondence across frames. However, existing methods adopt separate network architectures for different modalities and neglect the inter-frame temporal interaction with references. In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation. For the first time in a unified framework, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio references. Specifically, we introduce two strategies to fully explore the temporal relations between videos and multi-modal signals. First, for low-level temporal aggregation before the transformer, we enable the multi-modal references to capture multi-scale visual cues from consecutive video frames. This effectively endows the text or audio signals with temporal knowledge and boosts the semantic alignment between modalities. Second, for high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video. On the Ref-YouTube-VOS and AVSBench datasets with respective text and audio references, MUTR achieves +4.2% and +8.7% J&F improvements over state-of-the-art methods, demonstrating the significance of our unified multi-modal VOS framework. Code is released at https://github.com/OpenGVLab/MUTR.
Pages: 6449 - 6457
Page count: 9
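
The abstract above describes two temporal modules: multi-scale temporal aggregation of the text/audio references before the transformer, and inter-frame communication between the per-frame object embeddings after it. Below is a minimal PyTorch sketch of how such modules could be wired up; the module names, tensor shapes, and hyper-parameters (e.g., d_model=256, n_heads=8) are illustrative assumptions rather than the authors' released implementation, for which the GitHub repository linked in the abstract is the authoritative reference.

```python
# Minimal sketch of the two temporal strategies summarized in the abstract.
# All names, shapes, and hyper-parameters here are assumptions for illustration,
# not the official MUTR code.
import torch
import torch.nn as nn


class MultiScaleTemporalAggregation(nn.Module):
    """Low-level aggregation before the transformer: reference (text/audio)
    tokens attend to multi-scale visual features from consecutive frames."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, ref_tokens, frame_feats):
        # ref_tokens: (B, L, C) text or audio tokens
        # frame_feats: list of (B, T*H_s*W_s, C) visual tokens, one entry per scale
        visual = torch.cat(frame_feats, dim=1)           # pool all scales and frames
        out, _ = self.attn(ref_tokens, visual, visual)   # cross-attention: refs -> video
        return self.norm(ref_tokens + out)               # temporally enriched references


class InterFrameInteraction(nn.Module):
    """High-level interaction after the transformer: object embeddings of the
    same query communicate across frames for object-wise correspondence."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, obj_embed):
        # obj_embed: (B, T, Q, C) per-frame object embeddings from the decoder
        B, T, Q, C = obj_embed.shape
        x = obj_embed.permute(0, 2, 1, 3).reshape(B * Q, T, C)  # attend along time
        out, _ = self.attn(x, x, x)
        x = self.norm(x + out)
        return x.reshape(B, Q, T, C).permute(0, 2, 1, 3)


if __name__ == "__main__":
    B, T, Q, L, C = 2, 4, 5, 10, 256
    refs = torch.randn(B, L, C)
    feats = [torch.randn(B, T * 16 * 16, C), torch.randn(B, T * 8 * 8, C)]
    refs = MultiScaleTemporalAggregation(C)(refs, feats)
    objs = InterFrameInteraction(C)(torch.randn(B, T, Q, C))
    print(refs.shape, objs.shape)  # (2, 10, 256) and (2, 4, 5, 256)
```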