Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders

Cited by: 73
Authors
Messina, Nicola [1 ]
Amato, Giuseppe [1 ]
Esuli, Andrea [1 ]
Falchi, Fabrizio [1 ]
Gennaro, Claudio [1 ]
Marchand-Maillet, Stephane [2 ]
Affiliations
[1] ISTI CNR, Pisa, Italy
[2] Univ Geneva, VIPER Grp, Geneva, Switzerland
Funding
EU Horizon 2020
Keywords
Deep learning; cross-modal retrieval; multi-modal matching; computer vision; natural language processing; LANGUAGE; GENOME;
DOI
10.1145/3451390
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. Specifically, we present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences (i.e., image regions and words, respectively) to preserve the informative richness of both modalities. TERAN obtains state-of-the-art results on the image retrieval task on both MS-COCO and Flickr30k datasets. Moreover, on MS-COCO, it also outperforms current approaches on the sentence retrieval task. Focusing on scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separated. Cross-attention links invalidate any chance to separately extract visual and textual features needed for the online search and the offline indexing steps in large-scale retrieval systems. In this respect, TERAN merges the information from the two domains only during the final alignment phase, immediately before the loss computation. We argue that the fine-grained alignments produced by TERAN pave the way toward the research for effective and efficient methods for large-scale cross-modal information retrieval. We compare the effectiveness of our approach against relevant state-of-the-art methods. On the MS-COCO 1K test set, we obtain an improvement of 5.7% and 3.5% respectively on the image and the sentence retrieval tasks on the Recall@1 metric.
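
The abstract describes scoring an image-sentence pair from word-region alignments computed after two separate pipelines, and reports results as Recall@1. The following is a minimal sketch of that idea, not the authors' implementation. It assumes region and word features have already been extracted independently. The pooling rule (maximum over regions, then sum over words) is one common choice and is an assumption here; the abstract does not specify it. The names `alignment_score` and `recall_at_k` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so that dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def alignment_score(regions, words):
    """Global image-sentence score from a word-region similarity matrix.

    regions: (R, D) array, one feature vector per image region.
    words:   (W, D) array, one feature vector per word.
    Pooling is an assumption: each word is matched to its best region,
    and the per-word maxima are summed.
    """
    sim = l2_normalize(words) @ l2_normalize(regions).T  # shape (W, R)
    return float(sim.max(axis=1).sum())


def recall_at_k(scores, k=1):
    """Fraction of queries whose ground-truth item is ranked in the top k.

    scores[i, j] is the score of query i against candidate j; the ground
    truth for query i is assumed to be candidate i.
    """
    topk = np.argsort(-scores, axis=1)[:, :k]
    targets = np.arange(scores.shape[0])[:, None]
    return float((topk == targets).any(axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: 4 images with 5 regions each, 4 sentences with 3 words each.
    images = [rng.normal(size=(5, 8)) for _ in range(4)]
    sentences = [rng.normal(size=(3, 8)) for _ in range(4)]
    # Rows are sentence queries, columns are candidate images.
    scores = np.array([[alignment_score(img, sen) for img in images]
                       for sen in sentences])
    print("Recall@1 on random toy features:", recall_at_k(scores, k=1))
```

Because the two modalities interact only inside `alignment_score`, the region features could in principle be computed and stored offline, with only the word features computed at query time. Note that summing over words favors longer sentences; other pooling rules are possible.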
Pages: 23