Word2Pix: Word to Pixel Cross-Attention Transformer in Visual Grounding

Cited by: 10
Authors
Zhao, Heng [1]
Zhou, Joey Tianyi [1]
Ong, Yew-Soon [1,2]
Affiliations
[1] A*STAR Centre for Frontier AI Research (CFAR), Singapore 138632, Singapore
[2] Nanyang Technological University, School of Computer Science and Engineering, Singapore 639798, Singapore
Keywords
Cross-attention; deep learning; multimodal; referring expression comprehension; visual grounding;
DOI
10.1109/TNNLS.2022.3183827
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Current one-stage methods for visual grounding encode the language query as one holistic sentence embedding before fusion with visual features for target localization. Such a formulation provides insufficient ability to model the query at the word level, and is therefore prone to neglecting words that may not be the most important ones for the sentence but are critical for the referred object. In this article, we propose Word2Pix: a one-stage visual grounding network based on the encoder-decoder transformer architecture that learns textual-to-visual feature correspondence via word-to-pixel attention. Each word from the query sentence is given an equal opportunity when attending to visual pixels through multiple stacked transformer decoder layers. In this way, the decoder learns to model the language query and fuse language with the visual features for target prediction simultaneously. We conduct experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets, and the proposed Word2Pix outperforms existing one-stage methods by a notable margin. The results also show that Word2Pix surpasses two-stage visual grounding models while keeping the merits of the one-stage paradigm, namely, end-to-end training and fast inference speed. Code is available at https://github.com/azurerain7/Word2Pix.
Pages: 1523 - 1533
Page count: 11
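
The word-to-pixel cross-attention described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation (see the linked repository for that); it only assumes that word embeddings act as decoder queries and that flattened visual feature-map pixels act as keys and values in a standard multi-head cross-attention layer. Module names and dimensions below are illustrative.

# Minimal sketch of word-to-pixel cross-attention, assuming word embeddings
# are decoder queries and flattened visual pixels are keys/values.
# Illustrative only; not the authors' code.
import torch
import torch.nn as nn


class WordToPixelDecoderLayer(nn.Module):
    """One decoder layer: every word token attends to every visual pixel token."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, words: torch.Tensor, pixels: torch.Tensor) -> torch.Tensor:
        # words:  (batch, num_words, d_model) -- one query per word in the sentence
        # pixels: (batch, H*W, d_model)       -- flattened visual feature map
        attended, _ = self.cross_attn(query=words, key=pixels, value=pixels)
        words = self.norm1(words + attended)       # residual + norm after attention
        words = self.norm2(words + self.ffn(words))  # residual + norm after feed-forward
        return words


if __name__ == "__main__":
    batch, num_words, hw, d_model = 2, 12, 20 * 20, 256
    layer = WordToPixelDecoderLayer(d_model)
    word_feats = torch.randn(batch, num_words, d_model)  # embedded query sentence
    pixel_feats = torch.randn(batch, hw, d_model)        # flattened backbone features
    fused = layer(word_feats, pixel_feats)               # -> (2, 12, 256)
    print(fused.shape)

Stacking several such layers, as the abstract describes, would let each query word repeatedly refine its view of the pixel grid before a prediction head regresses the target box; the stacking and prediction head are omitted here for brevity.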