Visual and semantic ensemble for scene text recognition with gated dual mutual attention

Authors
Zhiguang Liu
Liangwei Wang
Jian Qiao
Affiliations
[1] Huawei Noah’s Ark Lab
[2] Huawei Technologies
Keywords
Text recognition; Multimodal fusion; Convolutional neural network
Abstract
Scene text recognition is a challenging task in computer vision because text appearance varies widely, for example through image distortion and rotation. Linguistic priors, however, help people infer text from images even when some characters are missing or blurry. This paper investigates the fusion of visual cues and linguistic dependencies to boost recognition performance. We introduce a relational attention module to leverage visual patterns and word representations. We embed linguistic dependencies from a language model into the optimization framework so that the predicted sequence captures the contextual dependencies within a word. We propose a dual mutual attention transformer that promotes cross-modality interactions, so that the inter- and intra-modality correlations between visual and linguistic features can be fully explored. A gate function lets the model learn the contribution of each modality, which further improves performance. Extensive experiments demonstrate that our method improves recognition of low-quality images and achieves state-of-the-art performance on datasets of text from regular and irregular scenes.
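The abstract does not specify the form of the gate function. As a minimal sketch only, the following assumes a common elementwise sigmoid gate that blends a visual feature vector with a semantic one of the same size. The names `gated_fusion`, `w_visual`, `w_semantic`, and `bias` are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def gated_fusion(visual, semantic, w_visual, w_semantic, bias):
    """Blend two same-sized feature vectors with an elementwise gate.

    Assumed form (not confirmed by the source):
        gate  = sigmoid(W_v @ visual + W_s @ semantic + bias)
        fused = gate * visual + (1 - gate) * semantic
    """
    pre_activation = [
        a + b + c
        for a, b, c in zip(
            matvec(w_visual, visual), matvec(w_semantic, semantic), bias
        )
    ]
    gate = [sigmoid(p) for p in pre_activation]
    fused = [g * v + (1.0 - g) * s for g, v, s in zip(gate, visual, semantic)]
    return fused, gate


# Illustrative values only; in a trained model the weights would be learned.
visual_feature = [1.0, 3.0]
semantic_feature = [3.0, 1.0]
zero_weights = [[0.0, 0.0], [0.0, 0.0]]
fused, gate = gated_fusion(
    visual_feature, semantic_feature, zero_weights, zero_weights, [0.0, 0.0]
)
```

With all weights and biases at zero the gate sits at 0.5 and the two modalities are averaged; a strongly positive pre-activation pushes the output toward the visual feature, and a strongly negative one toward the semantic feature.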
Pages: 669-680
Page count: 11
Related papers
  • [1] Visual and semantic ensemble for scene text recognition with gated dual mutual attention
    Liu, Zhiguang
    Wang, Liangwei
    Qiao, Jian
    [J]. INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2022, 11 (04) : 669 - 680
  • [2] SCENE TEXT RECOGNITION VIA GATED CASCADE ATTENTION
    Wang, Siwei
    Wang, Yongtao
    Qin, Xiaoran
    Zhao, Qijie
    Tang, Zhi
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1018 - 1023
  • [3] Scene Text Recognition by Attention Network with Gated Embedding
    Wang, Cong
    Liu, Cheng-Lin
    [J]. 2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020
  • [4] Visual attention models for scene text recognition
    Ghosh, Suman K.
    Valveny, Ernest
    Bagdanov, Andrew D.
    [J]. 2017 14TH IAPR INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR), VOL 1, 2017, : 943 - 948
  • [5] Reading Scene Text by Fusing Visual Attention with Semantic Representations
    Liu, Zhiguang
    Wang, Liangwei
    Qiao, Jian
    [J]. PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 210 - 218
  • [6] Hierarchical visual-semantic interaction for scene text recognition
    Diao, Liang
    Tang, Xin
    Wang, Jun
    Xie, Guotong
    Hu, Junlin
    [J]. INFORMATION FUSION, 2024, 102
  • [7] Attention and Language Ensemble for Scene Text Recognition with Convolutional Sequence Modeling
    Fang, Shancheng
    Xie, Hongtao
    Zha, Zheng-Jun
    Sun, Nannan
    Tan, Jianlong
    Zhang, Yongdong
    [J]. PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18), 2018, : 248 - 256
  • [8] Flexible scene text recognition based on dual attention mechanism
    Tian, Zhiqiang
    Wang, Chunhui
    Xiao, Youzi
    Lin, Yuping
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2021, 33 (22)
  • [9] Scene text recognition via dual character counting-aware visual and semantic modeling network
    Ke Xiao
    Anna Zhu
    Brian Kenji Iwana
    Cheng-Lin Liu
    [J]. Science China Information Sciences, 2024, 67
  • [10] Scene text recognition via dual character counting-aware visual and semantic modeling network
    Ke XIAO
    Anna ZHU
    Brian Kenji IWANA
    Cheng-Lin LIU
    [J]. Science China(Information Sciences), 2024, 67 (03) : 313 - 314