Hierarchical visual-semantic interaction for scene text recognition

Cited by: 2

Authors:

Diao, Liang [1]

Tang, Xin [2]

Wang, Jun [3,4]

Xie, Guotong [3,4]

Hu, Junlin [5]

Affiliations:

[1] Ping An Property & Casualty Insurance Co, Shenzhen, Peoples R China

[2] Huawei Technol Ltd, Shenzhen, Peoples R China

[3] Ping An Healthcare Technol, Beijing, Peoples R China

[4] Ping An Technol, Shenzhen, Peoples R China

[5] Beihang Univ, Sch Software, Beijing, Peoples R China

Source:

INFORMATION FUSION | 2024 / Vol. 102

Keywords:

Scene text recognition; Visual semantic interaction; Scene text representation; Feature fusion;

DOI:

10.1016/j.inffus.2023.102080

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Proper interaction between visual and semantic features is crucial for obtaining a powerful feature representation for scene text recognition (STR). Existing interaction methods usually treat visual and semantic features as distinct tokens and use transformers to learn contextual information and prior language knowledge, and they achieve promising performance on the STR task. However, several issues remain to be addressed, such as the imbalanced number of, and misalignment between, visual and semantic features, and the need to stack transformers to obtain progressive improvements in accuracy. To this end, this paper proposes a novel interaction scheme, hierarchical visual-semantic interaction (HVSI), which contains three novel modules: a hierarchical visual-semantic interaction module, a fusion module, and a visual-semantic alignment module. The hierarchical visual-semantic interaction module employs multiple visual-semantic interaction blocks at various scales to enhance the representation power of visual and semantic features. To better exploit multi-scale visual and semantic features, the fusion module fuses multiple semantic features based on attention mechanisms. Furthermore, HVSI introduces a simple plug-in block, the visual-semantic alignment module, which alleviates misalignment of semantic features by mapping them into a unified semantic space and thereby helps improve the performance of HVSI. Extensive experiments on multiple benchmarks, including English and Chinese text recognition datasets, show that our method obtains state-of-the-art or competitive performance.
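
The abstract says only that the fusion module fuses multiple semantic features "based on attention mechanisms"; it does not give the formulation. As a rough illustration of what attention-weighted fusion across scales can look like, the sketch below scores each scale's feature vector against a query vector, normalizes the scores with a softmax over scales, and takes the weighted sum at every character position. The function name `fuse_semantic_features`, the `query` vector, and the scaled dot-product scoring are all assumptions made for this example, not the paper's design.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_semantic_features(features, query):
    """Fuse K semantic feature sequences into one (hypothetical sketch).

    features: list of K sequences, each a list of T vectors of length D
              (one sequence per scale).
    query:    vector of length D; stands in for a learned parameter and
              is an assumption of this example, not taken from the paper.
    Returns a list of T fused vectors of length D.
    """
    num_steps = len(features[0])
    dim = len(query)
    scale = math.sqrt(dim)
    fused = []
    for t in range(num_steps):
        # Gather the feature vector at position t from every scale.
        candidates = [sequence[t] for sequence in features]
        # Scaled dot-product score of each scale against the query.
        scores = [
            sum(q * x for q, x in zip(query, vector)) / scale
            for vector in candidates
        ]
        # Attention weights over scales, then the weighted sum.
        weights = softmax(scores)
        fused.append([
            sum(w * vector[d] for w, vector in zip(weights, candidates))
            for d in range(dim)
        ])
    return fused
```

With an all-zero `query`, every scale gets the same weight and the fusion reduces to a plain average, which makes a convenient sanity check.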


Pages: 9
