Multimodal learning with only image data: A deep unsupervised model for street view image retrieval by fusing visual and scene text features of images

Cited by: 0

Authors:

Wu, Shangyou [1]

Yu, Wenhao [1,2]

Zhang, Yifan [1]

Huang, Mengqiu [1]

Affiliations:

[1] China Univ Geosci, Sch Geog & Informat Engn, Wuhan, Peoples R China

[2] China Univ Geosci, Natl Engn Res Ctr Geog Informat Syst, Wuhan, Peoples R China

Source:

TRANSACTIONS IN GIS | 2024, Vol. 28, Issue 3

Funding:

National Natural Science Foundation of China;

Keywords:

RECOGNITION;

DOI:

10.1111/tgis.13146

Chinese Library Classification:

P9 [Physical geography]; K9 [Geography];

Discipline classification codes:

0705 ; 070501 ;

Abstract:

As one of the classic tasks in information retrieval, image retrieval identifies the images that share similar features with a query image, so that users can conveniently find the required information in a large collection of images. Street view image retrieval, in particular, has extensive applications in many fields, such as improving navigation and mapping services, formulating urban development planning schemes, and analyzing the historical evolution of buildings. However, the intricate foreground and background details in street view images, coupled with a lack of attribute annotations, make it one of the most challenging problems in practical applications. Current image retrieval research mainly uses either visual models that depend entirely on the visual features of the image, or multimodal learning models that require additional data sources (e.g., annotated text). Yet creating annotated datasets is expensive, and street view images, which themselves contain a large amount of scene text, are often unannotated. Therefore, this paper proposes a deep unsupervised learning algorithm that combines visual and text features from image data to improve the accuracy of street view image retrieval. Specifically, we employ text detection algorithms to identify scene text, utilize the Pyramidal Histogram of Characters encoding predictor model to extract text information from images, deploy deep convolutional neural networks for visual feature extraction, and incorporate a contrastive learning module for image retrieval. Tests on three street view image datasets show that our model holds certain advantages over state-of-the-art multimodal models pre-trained on extensive datasets, while having fewer parameters and requiring fewer floating point operations. Code and data are available at .
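As an illustration of the text descriptor named in the abstract, the following is a minimal sketch of a Pyramidal Histogram of Characters (PHOC) vector, built here from an already known word. It is not the authors' implementation: the paper uses a predictor model that estimates this encoding directly from the image, whereas this sketch only shows what the target vector looks like. The alphabet (lowercase letters and digits), the pyramid levels (1, 2, 3), the 50% overlap rule, and the names `phoc` and `cosine` are assumptions chosen for the example.

```python
import math
import string

# Assumed alphabet: 26 lowercase letters + 10 digits (36 symbols).
ALPHABET = string.ascii_lowercase + string.digits


def phoc(word, levels=(1, 2, 3)):
    """Build a simplified binary PHOC vector for `word`.

    At each pyramid level the word is split into `level` equal regions.
    A character is assigned to a region when at least half of the
    character's own span falls inside that region.
    """
    chars = [c for c in word.lower() if c in ALPHABET]
    n = len(chars)
    vec = []
    for level in levels:
        for region in range(level):
            bits = [0] * len(ALPHABET)
            # Work in integer units of 1 / (n * level) to avoid float error.
            r_start, r_end = region * n, (region + 1) * n
            for i, ch in enumerate(chars):
                c_start, c_end = i * level, (i + 1) * level
                overlap = min(c_end, r_end) - max(c_start, r_start)
                # Character span has length `level` in these units.
                if 2 * overlap >= level:
                    bits[ALPHABET.index(ch)] = 1
            vec.extend(bits)
    return vec


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    query = phoc("cafe")
    for candidate in ("cafes", "bank"):
        print(candidate, round(cosine(query, phoc(candidate)), 3))
```

Because the vector records which characters occur in which part of the word, two shop signs with similar spellings receive similar descriptors even when they are not identical strings. In a retrieval setting, a descriptor of this kind could be combined with a visual feature vector before similarity search; how the paper fuses the two is not specified in the abstract.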


Pages: 486-508

Page count: 23
