Understanding, Categorizing and Predicting Semantic Image-Text Relations

Cited by: 15
Authors
Otto, Christian [1 ]
Springstein, Matthias [1 ]
Anand, Avishek [2 ]
Ewerth, Ralph [3 ]
Affiliations
[1] Leibniz Informat Ctr Sci & Technol TIB, Hannover, Germany
[2] Leibniz Univ Hannover, L3S Res Ctr, Hannover, Germany
[3] Leibniz Univ Hannover, L3S Res Ctr, Leibniz Informat Ctr Sci & Technol TIB, Hannover, Germany
Keywords
Image-text class; multimodality; data augmentation; semantic gap
DOI
10.1145/3323873.3325049
Chinese Library Classification
TP31 [Computer Software]
Discipline Classification Code
081202; 0835
Abstract
Two modalities are often used to convey information in a complementary and beneficial manner, e.g., in online news, videos, educational resources, or scientific publications. The automatic understanding of semantic correlations between text and associated images as well as their interplay has a great potential for enhanced multimodal web search and recommender systems. However, automatic understanding of multimodal information is still an unsolved research problem. Recent approaches such as image captioning focus on precisely describing visual content and translating it to text, but typically address neither semantic interpretations nor the specific role or purpose of an image-text constellation. In this paper, we go beyond previous work and investigate, inspired by research in visual communication, useful semantic image-text relations for multimodal information retrieval. We derive a categorization of eight semantic image-text classes (e.g., "illustration" or "anchorage") and show how they can systematically be characterized by a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. Furthermore, we present a deep learning system to predict these classes by utilizing multimodal embeddings. To obtain a sufficiently large amount of training data, we have automatically collected and augmented data from a variety of datasets and web resources, which enables future research on this topic. Experimental results on a demanding test set demonstrate the feasibility of the approach.
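The abstract's characterization scheme — each image-text class described by a triple of metric values — can be sketched as a simple lookup. This is a minimal illustration only: the metric levels, the mapping values, and the names `ImageTextRelation` and `classify` are assumptions for this sketch, not the paper's taxonomy; only the two class names "illustration" and "anchorage" come from the abstract itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageTextRelation:
    """A point in the three-metric space described in the abstract."""
    cross_modal_mutual_info: str  # e.g. "low" or "high"
    semantic_correlation: str     # e.g. "negative", "none", or "positive"
    status: str                   # which modality is central: "equal", "image", or "text"


# Hypothetical assignment of metric triples to two of the eight classes
# named in the abstract; the full taxonomy and the true metric
# assignments are defined in the paper itself.
EXAMPLE_CLASSES = {
    ImageTextRelation("high", "positive", "text"): "illustration",
    ImageTextRelation("high", "positive", "image"): "anchorage",
}


def classify(relation: ImageTextRelation) -> str:
    """Return the class label for a metric triple, or 'unknown'."""
    return EXAMPLE_CLASSES.get(relation, "unknown")
```

The `frozen=True` dataclass makes instances hashable, so a metric triple can serve directly as a dictionary key; the paper's actual system instead predicts classes from multimodal embeddings with a deep network.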
Pages: 168-176 (9 pages)
Related Papers
50 entries in total
  • [41] Liu, Yang; Liu, Hong; Wang, Huaqiu; Liu, Mengyuan. Regularizing Visual Semantic Embedding With Contrastive Learning for Image-Text Matching. IEEE SIGNAL PROCESSING LETTERS, 2022, 29: 1332-1336
  • [42] Münkner, J. Text-image communication, image-text communication. ZEITSCHRIFT FUR GERMANISTIK, 2004, 14 (02): 454-455
  • [43] Martinec, Radan. Nascent and mature uses of a semiotic system: the case of image-text relations. VISUAL COMMUNICATION, 2013, 12 (02): 147-172
  • [44] Zhu, Yan; Yang, Nana. Visual images and image-text relations in ELT textbooks for young learners. LANGUAGE TEACHING FOR YOUNG LEARNERS, 2023, 5 (02): 196-216
  • [45] Jin, Lu; Li, Zechao; Tang, Jinhui. Deep Semantic Multimodal Hashing Network for Scalable Image-Text and Video-Text Retrievals. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (04): 1838-1851
  • [46] Xu, Guoqiang; Yan, Shenggang. UIT: Unifying Pre-training Objectives for Image-Text Understanding. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT V, 2023, 14258: 572-585
  • [47] Wang, Wenzhuang; Di, Xiaoguang; Liu, Maozhen; Gao, Feng. Multi-level Symmetric Semantic Alignment Network for image-text matching. NEUROCOMPUTING, 2024, 599
  • [48] Qin, Xue-Yang; Li, Li-Shuang; Tang, Jing-Yao; Hao, Fei; Ge, Mei-Ling; Pang, Guang-Yao. Multi-Task Visual Semantic Embedding Network for Image-Text Retrieval. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 2024, 39 (04): 811-826
  • [49] Zhang, R.; Nie, J.; Song, N.; Zheng, C.; Wei, Z. Remote sensing image-text retrieval based on layout semantic joint representation. Beijing Hangkong Hangtian Daxue Xuebao/Journal of Beijing University of Aeronautics and Astronautics, 2024, 50 (02): 671-683
  • [50] Park, Pilseo; Jang, Soojin; Cho, Yunsung; Kim, Youngbin. SAM: cross-modal semantic alignments module for image-text retrieval. Multimedia Tools and Applications, 2024, 83: 12363-12377