Local-global coordination with transformers for referring image segmentation

Times cited: 8
Authors
Liu, Fang [1]
Kong, Yuqiu [2]
Zhang, Lihe [3]
Feng, Guang [3]
Yin, Baocai [1]
Affiliations
[1] Dalian Univ Technol, Sch Comp Sci & Technol, Dalian 116024, Peoples R China
[2] Dalian Univ Technol, Sch Innovat & Entrepreneurship, Dalian 116024, Peoples R China
[3] Dalian Univ Technol, Sch Informat & Commun Engn, Dalian, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Referring image segmentation; Cross modality transformer; Cross-level information integration;
DOI
10.1016/j.neucom.2022.12.018
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Referring image segmentation (RIS) has flourished thanks to the outstanding performance of deep neural networks. However, most existing methods explore either the local details or the global context of the scene without sufficiently modelling the coordination between them, which leads to sub-optimal results. In this paper, we propose a transformer-based method that enforces in-depth coordination between short- and long-range dependencies in both explicit and implicit fusion processes. Specifically, we design a Cross Modality Transformer (CMT) module with two successive blocks that explicitly integrate linguistic and visual features: it first locates the related visual region in a global view and then concentrates on local patterns. In addition, a Hybrid Transformer Architecture (HTA) is used as the feature extractor in the encoding stage to capture global relationships while retaining local cues; it further aggregates the multi-modal features implicitly. In the decoding stage, a Cross-level Information Integration module (CI2) gathers information from adjacent levels through dual top-down paths: a guided filtration path and a residual reservation path. Experimental results show that the proposed method outperforms state-of-the-art methods on four RIS benchmarks. © 2022 Elsevier B.V. All rights reserved.
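The abstract only outlines the CMT module, so the following is a hypothetical sketch, not the authors' implementation, of its "global view first, then local patterns" ordering. It assumes plain single-head scaled dot-product attention with no learned projections, visual tokens that first attend to word features, full-image self-attention for the global block, and self-attention restricted to non-overlapping windows for the local block. The names `cmt_sketch`, `window_mask`, and the `win` parameter are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Single-head scaled dot-product attention; mask[i, j] True = i may attend to j.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def window_mask(h, w, win):
    # Hypothetical helper: True where two pixels of an h x w grid share a win x win window.
    rows, cols = np.divmod(np.arange(h * w), w)
    windows_per_row = (w + win - 1) // win
    wid = (rows // win) * windows_per_row + cols // win
    return wid[:, None] == wid[None, :]

def cmt_sketch(vis, words, h, w, win=2):
    """Assumed two-block fusion. vis: (h*w, c) visual tokens; words: (t, c) word features."""
    # Global block: inject language into every pixel, then let every pixel see the whole image.
    fused = vis + attention(vis, words, words)
    g = fused + attention(fused, fused, fused)
    # Local block: the same attention, restricted to each pixel's local window.
    return g + attention(g, g, g, mask=window_mask(h, w, win))

rng = np.random.default_rng(0)
out = cmt_sketch(rng.standard_normal((16, 8)), rng.standard_normal((5, 8)), h=4, w=4)
print(out.shape)  # (16, 8)
```

Residual connections are kept so that each block refines rather than replaces its input; the actual CMT design (projections, heads, normalization, window scheme) is not specified in this record.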
Pages: 39-52
Number of pages: 14
Related papers
50 in total
  • [1] Omnidirectional image quality assessment with local-global vision transformers
    Tofighi, Nafiseh Jabbari
    Elfkir, Mohamed Hedi
    Imamoglu, Nevrez
    Ozcinar, Cagri
    Erdem, Aykut
    Erdem, Erkut
    [J]. IMAGE AND VISION COMPUTING, 2024, 148
  • [2] Global and Local Interactive Perception Network for Referring Image Segmentation
    Liu, Jing
    Tan, Hongchen
    Hu, Yongli
    Sun, Yanfeng
    Wang, Huasheng
    Yin, Baocai
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, : 1 - 14
  • [3] Global Selection and Local Attention Network for Referring Image Segmentation
    Ding, Haixin
    Zhang, Shengchuan
    Cao, Liujuan
    [J]. PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT VII, 2024, 14431 : 284 - 295
  • [4] LGI Net: Enhancing local-global information interaction for medical image segmentation
    Liu, Linjie
    Li, Yan
    Wu, Yanlin
    Ren, Lili
    Wang, Guanglei
    [J]. COMPUTERS IN BIOLOGY AND MEDICINE, 2023, 167
  • [5] ReSTR: Convolution-free Referring Image Segmentation Using Transformers
    Kim, Namyup
    Kim, Dongwon
    Lan, Cuiling
    Zeng, Wenjun
    Kwak, Suha
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 18124 - 18133
  • [6] A variational level set model based on local-global function approximation for image segmentation
    Dang, Hongyu
    Tang, Liming
    Ren, Yanjun
    Xu, Yaya
    [J]. DIGITAL SIGNAL PROCESSING, 2024, 146
  • [7] Zero-shot Referring Image Segmentation with Global-Local Context Features
    Yu, Seonghoon
    Seo, Paul Hongsuck
    Son, Jeany
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 19456 - 19465
  • [8] LGANET: LOCAL-GLOBAL AUGMENTATION NETWORK FOR SKIN LESION SEGMENTATION
    Guo, Qingqing
    Fang, Xianyong
    Wang, Linbo
    Zhang, Enming
    Liu, Zhengyi
    [J]. 2023 IEEE 20TH INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING, ISBI, 2023,
  • [9] Local-Global Transformer Neural Network for temporal action segmentation
    Tian, Xiaoyan
    Jin, Ye
    Tang, Xianglong
    [J]. MULTIMEDIA SYSTEMS, 2023, 29 (02) : 615 - 626
  • [10] Local-global visual interaction attention for image captioning
    Wang, Changzhi
    Gu, Xiaodong
    [J]. DIGITAL SIGNAL PROCESSING, 2022, 130