Hierarchical visual-semantic interaction for scene text recognition

Cited by: 2
Authors
Diao, Liang [1]
Tang, Xin [2]
Wang, Jun [3, 4]
Xie, Guotong [3, 4]
Hu, Junlin [5]
Affiliations
[1] Ping An Property & Casualty Insurance Co, Shenzhen, Peoples R China
[2] Huawei Technol Ltd, Shenzhen, Peoples R China
[3] Ping An Healthcare Technol, Beijing, Peoples R China
[4] Ping An Technol, Shenzhen, Peoples R China
[5] Beihang Univ, Sch Software, Beijing, Peoples R China
Keywords
Scene text recognition; Visual semantic interaction; Scene text representation; Feature fusion
DOI
10.1016/j.inffus.2023.102080
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Proper interaction between visual and semantic features is crucial for obtaining a powerful feature representation for scene text recognition (STR). Existing interaction methods usually treat visual and semantic features as distinct tokens and use transformers to learn contextual information and prior language knowledge, and they achieve promising performance on the STR task. However, several issues remain to be addressed, such as the imbalanced numbers of, and misalignment between, visual and semantic features, and the necessity of stacking transformers to obtain progressive improvements in accuracy. To this end, this paper proposes a novel interaction scheme, hierarchical visual-semantic interaction (HVSI), which comprises three novel modules: a hierarchical visual-semantic interaction module, a fusion module, and a visual-semantic alignment module. The hierarchical visual-semantic interaction module employs multiple visual-semantic interaction blocks at various scales to enhance the representation power of visual and semantic features. To better exploit multi-scale visual and semantic features, the fusion module fuses multiple semantic features using attention mechanisms. Furthermore, HVSI includes a simple plug-in block, the visual-semantic alignment module, which alleviates misalignment of semantic features by mapping them into a unified semantic space and thereby improves the performance of HVSI. Extensive experiments on multiple benchmarks, including English and Chinese text recognition datasets, show that our method achieves state-of-the-art or competitive performance.
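The abstract does not specify how the fusion module computes its attention, so the following is only a minimal sketch of one generic way to fuse per-scale semantic features: scaled dot-product attention over the scales, followed by a weighted sum. The function names, the `query` vector, and the toy feature values are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_semantic_features(features, query):
    """Fuse one semantic feature vector per scale into a single vector.

    features: list of S vectors (one per scale), each of length D.
    query:    vector of length D used to score each scale (an assumption;
              the paper's actual scoring function is not given in the abstract).
    Returns (fused_vector, attention_weights).
    """
    dim = len(query)
    # Scaled dot-product score between the query and each scale's feature.
    scores = [
        sum(f_d * q_d for f_d, q_d in zip(feature, query)) / math.sqrt(dim)
        for feature in features
    ]
    weights = softmax(scores)
    # Attention-weighted sum of the per-scale features.
    fused = [
        sum(w * feature[d] for w, feature in zip(weights, features))
        for d in range(dim)
    ]
    return fused, weights


# Toy example: three scales, two-dimensional features (values are illustrative).
scale_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 1.0]
fused, weights = fuse_semantic_features(scale_features, query)
```

In this toy setup the third scale aligns best with the query and therefore receives the largest weight; a learned model would replace the fixed `query` with trained parameters.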
Pages: 9