Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation

Cited by: 9
Authors
Liu, Chang [1 ]
Ding, Henghui [2 ]
Zhang, Yulun [2 ]
Jiang, Xudong [1 ]
Affiliations
[1] Nanyang Technol Univ, Sch Elect & Elect Engn EEE, Singapore 639798, Singapore
[2] Swiss Fed Inst Technol, Comp Vis Lab CVL, CH-8092 Zurich, Switzerland
Keywords
Transformers; Decoding; Image segmentation; Task analysis; Feature extraction; Image reconstruction; Iterative methods; Referring image segmentation; multi-modal mutual attention; iterative multi-modal interaction; language feature reconstruction;
DOI
10.1109/TIP.2023.3277791
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
We address the problem of referring image segmentation, which aims to generate a mask for the object specified by a natural language expression. Many recent works utilize Transformers to extract features for the target object by aggregating the attended visual regions. However, the generic attention mechanism in Transformers only uses the language input for attention weight calculation and does not explicitly fuse language features into its output. Thus, its output feature is dominated by vision information, which limits the model's comprehensive understanding of the multi-modal information and introduces uncertainty for the subsequent mask decoder to extract the output mask. To address this issue, we propose Multi-Modal Mutual Attention (M³Att) and a Multi-Modal Mutual Decoder (M³Dec) that better fuse information from the two input modalities. Based on M³Dec, we further propose Iterative Multi-modal Interaction (IMI) to allow continuous and in-depth interactions between language and vision features. Furthermore, we introduce Language Feature Reconstruction (LFR) to prevent the language information from being lost or distorted in the extracted feature. Extensive experiments show that our proposed approach significantly improves the baseline and consistently outperforms state-of-the-art referring image segmentation methods on the RefCOCO series of datasets.
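The contrast the abstract draws can be illustrated with a minimal NumPy sketch. This is an illustrative reading of the idea, not the authors' exact M³Att formulation: learned projection matrices, multi-head splitting, and the decoder are all omitted, and the function names below are made up for the example. In generic cross-attention, language only shapes the weights, so every output row is a convex combination of vision features alone; a "mutual" variant additionally draws values from the language side, so both modalities contribute to the output by construction.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def generic_attention(lang, vis):
    """Generic cross-attention: language queries, vision keys/values.
    Language only steers the weights; each output row is a weighted
    sum of VISION features alone (the issue the abstract points out)."""
    d = lang.shape[-1]
    w = softmax(lang @ vis.T / np.sqrt(d), axis=-1)  # (Nl, Nv), rows sum to 1
    return w @ vis                                   # (Nl, d): vision-only values

def mutual_attention(lang, vis):
    """Illustrative mutual variant: the output explicitly fuses attended
    vision features WITH attended language features, so neither modality
    dominates the result by construction."""
    d = lang.shape[-1]
    w_lv = softmax(lang @ vis.T / np.sqrt(d), axis=-1)   # word -> region weights
    w_ll = softmax(lang @ lang.T / np.sqrt(d), axis=-1)  # word -> word weights
    return w_lv @ vis + w_ll @ lang                      # values from BOTH modalities

# toy shapes: 5 words, 16 image regions, feature dim 8
rng = np.random.default_rng(0)
lang = rng.standard_normal((5, 8))
vis = rng.standard_normal((16, 8))
print(generic_attention(lang, vis).shape)  # (5, 8)
print(mutual_attention(lang, vis).shape)   # (5, 8)
```

Both functions return one feature per language token, but only the mutual variant mixes language values into the output; in the paper this fused feature is what the mask decoder consumes.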
Pages: 3054-3065
Page count: 12
Related Papers
50 records
  • [1] Comprehensive Multi-Modal Interactions for Referring Image Segmentation
    Jain, Kanishk
    Gandhi, Vineet
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022, : 3427 - 3435
  • [2] Mutual Query Network for Multi-Modal Product Image Segmentation
    Guo, Yun
    Feng, Wei
    Zhang, Zheng
    Ren, Xiancong
    Li, Yaoyu
    Lv, Jingjing
    Zhu, Xin
    Lin, Zhangang
    Shao, Jingping
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 2273 - 2278
  • [3] Shape gradient for multi-modal image segmentation using mutual information
    Herbulot, A
    Jehan-Besson, S
    Barlaud, M
    Aubert, G
    [J]. ICIP: 2004 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, VOLS 1- 5, 2004, : 2729 - 2732
  • [4] Modality-Aware Mutual Learning for Multi-modal Medical Image Segmentation
    Zhang, Yao
    Yang, Jiawei
    Tian, Jiang
    Shi, Zhongchao
    Zhong, Cheng
    Zhang, Yang
    He, Zhiqiang
    [J]. MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2021, PT I, 2021, 12901 : 589 - 599
  • [5] Referring Image Segmentation with Multi-Modal Feature Interaction and Alignment Based on Convolutional Nonlinear Spiking Neural Membrane Systems
    Sun, Siyan
    Wang, Peng
    Peng, Hong
    Liu, Zhicai
    [J]. INTERNATIONAL JOURNAL OF NEURAL SYSTEMS, 2024,
  • [6] Multi-modal semantic image segmentation
    Pemasiri, Akila
    Kien Nguyen
    Sridharan, Sridha
    Fookes, Clinton
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2021, 202
  • [7] MixFuse: An iterative mix-attention transformer for multi-modal image fusion
    Li, Jinfu
    Song, Hong
    Liu, Lei
    Li, Yanan
    Xia, Jianghan
    Huang, Yuqi
    Fan, Jingfan
    Lin, Yucong
    Yang, Jian
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2025, 261
  • [8] Cross-modal attention for multi-modal image registration
    Song, Xinrui
    Chao, Hanqing
    Xu, Xuanang
    Guo, Hengtao
    Xu, Sheng
    Turkbey, Baris
    Wood, Bradford J.
    Sanford, Thomas
    Wang, Ge
    Yan, Pingkun
    [J]. MEDICAL IMAGE ANALYSIS, 2022, 82
  • [9] Dual-Attention Deep Fusion Network for Multi-modal Medical Image Segmentation
    Zheng, Shenhai
    Ye, Xin
    Tan, Jiaxin
    Yang, Yifei
    Li, Laquan
    [J]. FOURTEENTH INTERNATIONAL CONFERENCE ON GRAPHICS AND IMAGE PROCESSING, ICGIP 2022, 2022, 12705
  • [10] Cross-Modal Self-Attention Network for Referring Image Segmentation
    Ye, Linwei
    Rochan, Mrigank
    Liu, Zhi
    Wang, Yang
    [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 10494 - 10503