Language-guided target segmentation method based on multi-granularity feature fusion

Cited by: 0
Authors
Tan Q. [1 ]
Wang R. [1 ]
Wu A. [1 ]
Affiliations
[1] School of Information and Cyber Security, People’s Public Security University of China, Beijing
Funding
National Natural Science Foundation of China
Keywords
cross-modal; feature fusion; referring segmentation; target segmentation; text understanding
DOI
10.13700/j.bh.1001-5965.2022.0384
Abstract
The objective of language-guided target segmentation is to match the targets described in a text with the entities they refer to, thereby achieving an understanding of the relationships between text and entities as well as localization of the referred targets. This task has significant application value in scenarios such as information extraction, text classification, and machine translation. This paper proposes a language-guided multi-granularity feature fusion target segmentation method based on the Refvos model, which can accurately locate and segment specific targets. A Swin Transformer and a BERT network are used to extract multi-granularity visual features and text features, respectively, yielding features with strong representational power for both the whole and its parts. Under language guidance, the text features are fused with visual features of different granularities to strengthen the expression of the referred target. Finally, to obtain more precise segmentation results, the multi-granularity fused features are refined with a convolutional long short-term memory (ConvLSTM) network, which promotes information flow across features of different granularities. The model was trained and tested on the UNC and UNC+ datasets. Experimental results show that, compared with Refvos, the proposed method improves IoU on the UNC val and testB splits by 0.92% and 4.1%, respectively, and on the UNC+ val, testA, and testB splits by 1.83%, 0.63%, and 1.75%, respectively. On the G-Ref and ReferIt datasets, the proposed method reaches IoU of 40.16% and 64.37%, a frontier-level result. These results demonstrate that the proposed method is both effective and advanced. © 2024 Beijing University of Aeronautics and Astronautics (BUAA). All rights reserved.
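The fusion step described above can be illustrated with a minimal, framework-free sketch: a text feature vector gates visual feature maps at two granularities, and the coarse result is upsampled and summed with the fine one. All shapes, names, and the sigmoid-gating choice are illustrative assumptions, not the paper's actual Swin Transformer/BERT/ConvLSTM implementation.

```python
# Hypothetical sketch of language-guided multi-granularity fusion.
# Shapes and the gating mechanism are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def language_guided_fuse(visual, text, w_t):
    """Modulate a (H, W, C) visual feature map with a (D,) text feature:
    project the text vector to C channels, squash it to a sigmoid gate,
    and scale the visual channels element-wise (broadcast over H and W)."""
    gate = 1.0 / (1.0 + np.exp(-(text @ w_t)))  # shape (C,)
    return visual * gate

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (H, W, C) feature map."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

C, D = 16, 32                      # visual channels, text embedding size
text = rng.standard_normal(D)      # stand-in for a BERT sentence feature
w_t = rng.standard_normal((D, C))  # learned projection (random here)

# Multi-granularity visual features (stand-ins for two Swin stages).
coarse = rng.standard_normal((8, 8, C))    # deep stage: global context
fine   = rng.standard_normal((16, 16, C))  # shallow stage: local detail

# Guide each granularity with the language feature, then fuse across scales.
fused_coarse = language_guided_fuse(coarse, text, w_t)
fused_fine   = language_guided_fuse(fine, text, w_t)
fused = upsample2x(fused_coarse) + fused_fine

print(fused.shape)  # (16, 16, 16)
```

In the paper, this cross-scale exchange is performed by a ConvLSTM rather than the simple upsample-and-add used here; the sketch only shows why the text feature must be broadcast into every granularity before scales are combined.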
Pages: 542-550
Page count: 8
Related papers
25 entries in total
  • [1] HU R H, ROHRBACH M, DARRELL T., Segmentation from natural language expressions, Proceedings of the European Conference on Computer Vision, pp. 108-124, (2016)
  • [2] LIU C X, LIN Z, SHEN X H, et al., Recurrent multimodal interaction for referring image segmentation, Proceedings of the 2017 IEEE International Conference on Computer Vision, pp. 1280-1289, (2017)
  • [3] MARGFFOY-TUAY E, PEREZ J C, BOTERO E, et al., Dynamic multimodal instance segmentation guided by natural language queries, Proceedings of the European Conference on Computer Vision, pp. 656-672, (2018)
  • [4] LEI T, ZHANG Y., Training RNNs as fast as CNNs
  • [5] LI R Y, LI K C, KUO Y C, et al., Referring image segmentation via recurrent refinement networks, Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5745-5753, (2018)
  • [6] YE L W, ROCHAN M, LIU Z, et al., Cross-modal self-attention network for referring image segmentation, Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10494-10503, (2019)
  • [7] CHEN D J, JIA S H, LO Y C, et al., See-through-text grouping for referring image segmentation, Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, pp. 7453-7462, (2019)
  • [8] HUANG S F, HUI T R, LIU S, et al., Referring image segmentation via cross-modal progressive comprehension, Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10485-10494, (2020)
  • [9] HUI T R, LIU S, HUANG S F, et al., Linguistic structure guided context modeling for referring image segmentation, Proceedings of the European Conference on Computer Vision, pp. 59-75, (2020)
  • [10] BELLVER M, VENTURA C, SILBERER C, et al., A closer look at referring expressions for video object segmentation, Multimedia Tools and Applications, 82, 3, pp. 4419-4438, (2023)