Multimodal learning with only image data: A deep unsupervised model for street view image retrieval by fusing visual and scene text features of images

Authors
Wu, Shangyou [1]
Yu, Wenhao [1,2]
Zhang, Yifan [1]
Huang, Mengqiu [1]
Affiliations
[1] China Univ Geosci, Sch Geog & Informat Engn, Wuhan, Peoples R China
[2] China Univ Geosci, Natl Engn Res Ctr Geog Informat Syst, Wuhan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
RECOGNITION
DOI
10.1111/tgis.13146
Chinese Library Classification
P9 [Physical Geography]; K9 [Geography]
Discipline classification codes
0705; 070501
Abstract
As one of the classic tasks in information retrieval, image retrieval identifies the images that share similar features with a query image, so that users can conveniently find the information they need in a large image collection. Street view image retrieval, in particular, has extensive applications in many fields, such as improving navigation and mapping services, formulating urban development plans, and analyzing the historical evolution of buildings. However, the intricate foreground and background details in street view images, coupled with a lack of attribute annotations, make it one of the most challenging retrieval tasks in practice. Current image retrieval research relies mainly on either visual models, which depend entirely on the visual features of images, or multimodal learning models, which require additional data sources (e.g., annotated text). Yet creating annotated datasets is expensive, and street view images, which themselves contain a large amount of scene text, are often unannotated. Therefore, this paper proposes a deep unsupervised learning algorithm that combines visual and text features extracted from image data alone to improve the accuracy of street view image retrieval. Specifically, we employ text detection algorithms to identify scene text, use a Pyramidal Histogram of Characters (PHOC) encoding predictor model to extract text information from images, deploy deep convolutional neural networks for visual feature extraction, and incorporate a contrastive learning module for image retrieval. Tests on three street view image datasets show that our model holds certain advantages over state-of-the-art multimodal models pre-trained on extensive datasets, with fewer parameters and fewer floating-point operations. Code and data are available at .
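The abstract names three ingredients: a PHOC text encoding, fused visual and text features, and a contrastive objective. The sketch below illustrates each one in plain Python and is not the authors' implementation. The alphabet, the pyramid levels (2 to 5), the concatenation-based fusion, the InfoNCE form of the loss, and every function name are assumptions made for illustration. Note also that the paper predicts PHOC vectors from image regions with a learned model, whereas this sketch computes the encoding directly from a known string to show what the representation contains.

```python
import math
import string

ALPHABET = string.ascii_lowercase + string.digits  # assumed 36-symbol alphabet
LEVELS = (2, 3, 4, 5)                              # assumed pyramid levels


def phoc(word, alphabet=ALPHABET, levels=LEVELS):
    """Binary Pyramidal Histogram of Characters for a string.

    At each level the word is split into `level` equal regions; a character
    is assigned to a region when at least half of its span lies inside it.
    """
    chars = [c for c in word.lower() if c in alphabet]
    n = len(chars)
    vec = []
    for level in levels:
        for region in range(level):
            hist = [0] * len(alphabet)
            # Spans are scaled by n * level so all comparisons are exact integers.
            r_lo, r_hi = region * n, (region + 1) * n
            for k, ch in enumerate(chars):
                c_lo, c_hi = k * level, (k + 1) * level
                overlap = min(c_hi, r_hi) - max(c_lo, r_lo)
                if 2 * overlap >= level:  # overlap covers >= 50% of the character
                    hist[alphabet.index(ch)] = 1
            vec.extend(hist)
    return vec


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse(visual, text):
    """Assumed fusion: concatenate the L2-normalized visual and text features."""
    return l2_normalize(l2_normalize(visual) + l2_normalize(text))


def cosine(a, b):
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor (a common contrastive objective, assumed here)."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[0]


def rank(query, gallery):
    """Return gallery indices sorted by descending cosine similarity to the query."""
    return sorted(range(len(gallery)), key=lambda i: -cosine(query, gallery[i]))


# Toy retrieval: the visual vectors are hypothetical placeholders for CNN features.
query = fuse([0.9, 0.1, 0.0], phoc("cafe"))
gallery = [
    fuse([0.1, 0.9, 0.0], phoc("bank")),
    fuse([0.8, 0.2, 0.1], phoc("cafe")),
]
best_match = rank(query, gallery)[0]
```

With these toy inputs the second gallery item shares both the scene text and a similar visual vector with the query, so it ranks first. In a trained system the fused embeddings would come from networks optimized with the contrastive loss rather than from hand-written vectors.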
Pages: 486-508
Page count: 23