Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders

Cited by: 73
Authors:
Messina, Nicola [1 ]
Amato, Giuseppe [1 ]
Esuli, Andrea [1 ]
Falchi, Fabrizio [1 ]
Gennaro, Claudio [1 ]
Marchand-Maillet, Stephane [2 ]
Affiliations:
[1] ISTI CNR, Pisa, Italy
[2] Univ Geneva, VIPER Grp, Geneva, Switzerland
Funding:
European Union Horizon 2020 programme;
Keywords:
Deep learning; cross-modal retrieval; multi-modal matching; computer vision; natural language processing; LANGUAGE; GENOME;
DOI:
10.1145/3451390
Chinese Library Classification:
TP [Automation Technology, Computer Technology];
Subject Classification Code:
0812;
Abstract:
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. Specifically, we present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences (i.e., image regions and words, respectively) to preserve the informative richness of both modalities. TERAN obtains state-of-the-art results on the image retrieval task on both the MS-COCO and Flickr30k datasets. Moreover, on MS-COCO, it also outperforms current approaches on the sentence retrieval task.

Focusing on scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separated. Cross-attention links preclude separately extracting the visual and textual features needed for the online search and offline indexing steps of large-scale retrieval systems. In this respect, TERAN merges the information from the two domains only during the final alignment phase, immediately before the loss computation. We argue that the fine-grained alignments produced by TERAN pave the way toward effective and efficient methods for large-scale cross-modal information retrieval. We compare the effectiveness of our approach against relevant state-of-the-art methods. On the MS-COCO 1K test set, we obtain improvements of 5.7% and 3.5% in Recall@1 on the image and sentence retrieval tasks, respectively.
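To illustrate the late-fusion design described in the abstract, here is a minimal PyTorch sketch of a region-word alignment step in which the visual and textual pipelines interact only at the very end. The function name, feature dimensions, and the max-over-regions / mean-over-words pooling are illustrative assumptions, not necessarily the paper's exact formulation.

import torch
import torch.nn.functional as F

def late_alignment_score(region_feats: torch.Tensor,
                         word_feats: torch.Tensor) -> torch.Tensor:
    # region_feats: (n_regions, d) -- output of the visual pipeline
    # word_feats:   (n_words, d)   -- output of the textual pipeline
    # The two pipelines never attend to each other; they meet only here.
    r = F.normalize(region_feats, dim=-1)
    w = F.normalize(word_feats, dim=-1)
    align = r @ w.t()  # (n_regions, n_words) cosine similarities
    # Pool the alignment matrix to a single image-sentence score:
    # for each word, take its best-matching region, then average over
    # words (one plausible pooling choice for illustration).
    return align.max(dim=0).values.mean()

# Toy usage: 36 region vectors vs. 12 word vectors, both 1024-d.
score = late_alignment_score(torch.randn(36, 1024), torch.randn(12, 1024))

Because the region and word embeddings are computed independently, visual features can be extracted and indexed offline, and only the query sentence needs to be encoded at search time; this is the scalability argument the abstract makes against cross-attention architectures.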
Pages: 23
Related Papers
(50 items in total)
  • [21] Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval
    Wang, Hao
    Lin, Guosheng
    Hoi, Steven
    Miao, Chunyan
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022: 5517-5526
  • [22] Fine-Grained Matching with Multi-Perspective Similarity Modeling for Cross-Modal Retrieval
    Xie, Xiumin
    Hou, Chuanwen
    Li, Zhixin
    UNCERTAINTY IN ARTIFICIAL INTELLIGENCE, VOL 180, 2022, 180: 2148-2158
  • [23] Fine-Grained Cross-Modal Retrieval for Cultural Items with Focal Attention and Hierarchical Encodings
    Sheng, Shurong
    Laenen, Katrien
    Van Gool, Luc
    Moens, Marie-Francine
    COMPUTERS, 2021, 10 (09)
  • [24] A Cross-modal Attention Model for Fine-Grained Incident Retrieval from Dashcam Videos
    Pham, Dinh-Duy
    Dao, Minh-Son
    Nguyen, Thanh-Binh
    MULTIMEDIA MODELING, MMM 2023, PT I, 2023, 13833: 409-420
  • [25] Fine-Grained Image Generation Network With Radar Range Profiles Using Cross-Modal Visual Supervision
    Bao, Jiacheng
    Li, Da
    Li, Shiyong
    Zhao, Guoqiang
    Sun, Houjun
    Zhang, Yi
    IEEE TRANSACTIONS ON MICROWAVE THEORY AND TECHNIQUES, 2024, 72 (02): 1339-1352
  • [26] Deep Self-Supervised Hashing With Fine-Grained Similarity Mining for Cross-Modal Retrieval
    Han, Lijun
    Wang, Renlin
    Chen, Chunlei
    Zhang, Huihui
    Zhang, Yujie
    Zhang, Wenfeng
    IEEE ACCESS, 2024, 12: 31756-31770
  • [27] Fine-grained bidirectional attentional generation and knowledge-assisted networks for cross-modal retrieval
    Zhu, Jianwei
    Li, Zhixin
    Wei, Jiahui
    Zeng, Yufei
    Ma, Huifang
    IMAGE AND VISION COMPUTING, 2022, 124
  • [28] Cross-modal recipe retrieval based on unified text encoder with fine-grained contrastive learning
    Zhang, Bolin
    Kyutoku, Haruya
    Doman, Keisuke
    Komamizu, Takahiro
    Ide, Ichiro
    Qian, Jiangbo
    KNOWLEDGE-BASED SYSTEMS, 2024, 305
  • [30] A Fine-Grained Semantic Alignment Method Specific to Aggregate Multi-Scale Information for Cross-Modal Remote Sensing Image Retrieval
    Zheng, Fuzhong
    Wang, Xu
    Wang, Luyao
    Zhang, Xiong
    Zhu, Hongze
    Wang, Long
    Zhang, Haisu
    SENSORS, 2023, 23 (20)