Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network

Cited by: 0
Authors
Wang, Haowei [1]
Ji, Jiayi [1]
Zhou, Yiyi [1, 2]
Wu, Yongjian [4]
Sun, Xiaoshuai [1, 2, 3]
Affiliations
[1] Xiamen Univ, Sch Informat, Dept Artificial Intelligence, Media Analyt & Comp Lab, Xiamen 361005, Peoples R China
[2] Xiamen Univ, Inst Artificial Intelligence, Xiamen, Peoples R China
[3] Xiamen Univ, Fujian Engn Res Ctr Trusted Artificial Intelligen, Fujian, Peoples R China
[4] Tencent Youtu Lab, Shanghai, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Panoptic Narrative Grounding (PNG) is an emerging cross-modal grounding task that locates the image regions corresponding to a text description. Existing PNG approaches are mainly based on a two-stage paradigm, which is computationally expensive. In this paper, we propose a one-stage network for real-time PNG, termed the End-to-End Panoptic Narrative Grounding network (EPNG), which directly generates masks for referents. Specifically, we propose two designs, Locality-Perceptive Attention (LPA) and a bidirectional Semantic Alignment Loss (SAL), to handle the many-to-many relationship between textual expressions and visual objects. LPA embeds local spatial priors into attention modeling, i.e., a pixel may belong to multiple masks at different scales, thereby improving segmentation. To help the model capture complex semantic relationships, SAL uses a bidirectional contrastive objective to regularize semantic consistency between modalities. Extensive experiments on the PNG benchmark dataset demonstrate the effectiveness and efficiency of our method. Compared to the single-stage baseline, our method improves accuracy by up to 9.4%. More importantly, EPNG is 10 times faster than the two-stage model. The generalization ability of EPNG is also validated by zero-shot experiments on other grounding tasks. The source code and trained models for all our experiments are publicly available at https://github.com/Mr-Neko/EPNG.git.
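The abstract describes SAL only as a bidirectional contrastive objective over a many-to-many phrase-to-object relationship. As a rough illustration of that general idea, and not the paper's actual loss, the sketch below computes a symmetric multi-positive InfoNCE-style loss in plain Python. The function names, the `pairs` format, the cosine similarity, and the `temperature` value are all assumptions made for this example.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _direction_loss(sim, positives_per_row, temperature):
    """Average multi-positive contrastive loss over the rows of `sim`.

    For each row, the loss is -log(softmax mass assigned to its positives).
    Rows without any positive are skipped.
    """
    losses = []
    for row, pos in zip(sim, positives_per_row):
        if not pos:
            continue
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        pos_mass = sum(exps[j] for j in pos)
        losses.append(-math.log(pos_mass / sum(exps)))
    return sum(losses) / len(losses) if losses else 0.0


def bidirectional_contrastive_loss(text_emb, region_emb, pairs, temperature=0.1):
    """Hypothetical symmetric loss: text-to-region plus region-to-text.

    text_emb:   list of phrase embedding vectors
    region_emb: list of region (mask) embedding vectors
    pairs:      set of (phrase_index, region_index) matches; many-to-many allowed
    """
    t = [_normalize(v) for v in text_emb]
    r = [_normalize(v) for v in region_emb]
    # Cosine similarity matrix: rows are phrases, columns are regions.
    sim = [[sum(a * b for a, b in zip(ti, rj)) for rj in r] for ti in t]
    sim_t = [list(col) for col in zip(*sim)]
    text_pos = [[j for (i2, j) in pairs if i2 == i] for i in range(len(t))]
    region_pos = [[i for (i, j2) in pairs if j2 == j] for j in range(len(r))]
    return 0.5 * (
        _direction_loss(sim, text_pos, temperature)
        + _direction_loss(sim_t, region_pos, temperature)
    )
```

With this formulation, embeddings whose matched phrase and region point in the same direction give a loss close to zero, while mismatched pairs are penalized in both directions. Any real implementation would operate on batched tensors and may weight the two directions differently.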
Pages: 2528-2536
Number of pages: 9