Prompt-guided bidirectional deep fusion network for referring image segmentation

Cited by: 0

Authors:

[1] Wu, Junxian

[2] Zhang, Yujia

[3] Kampffmeyer, Michael

[4] Zhao, Xiaoguang

Source:

Zhang, Yujia (zhangyujia2014@ia.ac.cn) | 2025 / Volume 616

Keywords:

Image segmentation

DOI:

10.1016/j.neucom.2024.128899

Abstract:

Referring image segmentation involves accurately segmenting objects based on natural language descriptions. This poses challenges due to the intricate and varied nature of language expressions, as well as the requirement to identify relevant image regions among multiple objects. Current models predominantly employ language-aware early fusion techniques, which may lead to misinterpretations of language expressions due to the lack of explicit visual guidance of the language encoder. Additionally, early fusion methods are unable to adequately leverage high-level contexts. To address these limitations, this paper introduces the Prompt-guided Bidirectional Deep Fusion Network (PBDF-Net) to enhance the fusion of language and vision modalities. In contrast to traditional unidirectional early fusion approaches, our approach employs a prompt-guided bidirectional encoder fusion (PBEF) module to promote mutual cross-modal fusion across multiple stages of the vision and language encoders. Furthermore, PBDF-Net incorporates a prompt-guided cross-modal interaction (PCI) module during the late fusion stage, facilitating a more profound integration of contextual information from both modalities, resulting in more accurate target segmentation. Comprehensive experiments conducted on the RefCOCO, RefCOCO+, G-Ref and ReferIt datasets substantiate the efficacy of our proposed method, demonstrating significant advancements in performance compared to existing approaches. © 2024 Elsevier B.V.
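The idea of mutual (bidirectional) cross-modal fusion mentioned in the abstract can be illustrated with a toy sketch. This is not the PBEF or PCI module from the paper: the abstract does not specify their internals, so the function names `cross_attend` and `bidirectional_fuse`, the plain scaled dot-product attention, the absence of learned projections and prompts, and the residual addition are all assumptions made purely for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query vector attends over all context vectors.

    Uses scaled dot-product attention with no learned weights
    (an assumption; a real model would project queries, keys and values).
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * c[i] for w, c in zip(weights, context)) for i in range(dim)]
        )
    return attended


def bidirectional_fuse(vision, language):
    """Hypothetical fusion step: update both modalities from each other.

    Both directions read the original (not yet updated) features,
    and the attended context is added back as a residual.
    """
    vision_ctx = cross_attend(vision, language)    # vision queries language
    language_ctx = cross_attend(language, vision)  # language queries vision
    fused_vision = [[v + a for v, a in zip(vr, ar)] for vr, ar in zip(vision, vision_ctx)]
    fused_language = [[t + a for t, a in zip(tr, ar)] for tr, ar in zip(language, language_ctx)]
    return fused_vision, fused_language


# Toy features: 3 visual tokens and 2 word tokens, each of dimension 2.
vision = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
language = [[1.0, 0.0], [0.0, 1.0]]
fused_v, fused_l = bidirectional_fuse(vision, language)
```

In contrast, a unidirectional early-fusion scheme would compute only `vision_ctx` and leave the language features unchanged; the sketch makes that difference concrete without claiming to reproduce the published architecture.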
