Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network

Authors
Wang, Haowei [1]
Ji, Jiayi [1]
Zhou, Yiyi [1, 2]
Wu, Yongjian [4]
Sun, Xiaoshuai [1, 2, 3]
Affiliations
[1] Xiamen Univ, Sch Informat, Dept Artificial Intelligence, Media Analyt & Comp Lab, Xiamen 361005, Peoples R China
[2] Xiamen Univ, Inst Artificial Intelligence, Xiamen, Peoples R China
[3] Xiamen Univ, Fujian Engn Res Ctr Trusted Artificial Intelligen, Fujian, Peoples R China
[4] Tencent Youtu Lab, Shanghai, Peoples R China
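Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405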
Abstract
Panoptic Narrative Grounding (PNG) is an emerging cross-modal grounding task that locates the image regions corresponding to a text description. Existing PNG approaches are mainly based on a two-stage paradigm, which is computationally expensive. In this paper, we propose a one-stage network for real-time PNG, termed End-to-End Panoptic Narrative Grounding network (EPNG), which directly generates masks for referents. Specifically, we propose two designs, Locality-Perceptive Attention (LPA) and a bidirectional Semantic Alignment Loss (SAL), to handle the many-to-many relationship between textual expressions and visual objects. LPA embeds local spatial priors into attention modeling, i.e., a pixel may belong to multiple masks at different scales, thereby improving segmentation. To help capture the complex semantic relationships, SAL introduces a bidirectional contrastive objective that regularizes semantic consistency between the modalities. Extensive experiments on the PNG benchmark dataset demonstrate the effectiveness and efficiency of our method. Compared to the single-stage baseline, our method improves accuracy by up to 9.4%. More importantly, EPNG is 10 times faster than the two-stage model. The generalization ability of EPNG is also validated by zero-shot experiments on other grounding tasks. The source code and trained models for all our experiments are publicly available at https://github.com/Mr-Neko/EPNG.git.
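The abstract describes SAL only as a bidirectional contrastive objective over a many-to-many text-to-region relationship and does not give its formula. As a rough illustration of that general idea, the sketch below computes a symmetric, multi-positive contrastive loss between phrase embeddings and mask embeddings. It is a generic formulation under stated assumptions, not the loss defined in the paper: the function names, the cosine similarity, the temperature `tau`, and the binary `match` matrix are all hypothetical choices made for this example.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    # L2-normalize so that dot products become cosine similarities.
    norm = math.sqrt(_dot(v, v)) or 1.0
    return [a / norm for a in v]


def _directional_loss(sim, match, tau):
    """Mean over rows of -log(softmax mass placed on the matched columns)."""
    losses = []
    for row_sim, row_match in zip(sim, match):
        if not any(row_match):
            continue  # a row without any positive contributes nothing
        logits = [s / tau for s in row_sim]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - top) for x in logits]
        positive_mass = sum(e for e, y in zip(exps, row_match) if y)
        losses.append(-math.log(positive_mass / sum(exps)))
    return sum(losses) / len(losses) if losses else 0.0


def bidirectional_contrastive_loss(text_emb, vis_emb, match, tau=0.1):
    """Hypothetical symmetric loss; match[i][j] is 1 if phrase i refers to mask j.

    A phrase may match several masks and a mask may match several phrases,
    which is how this sketch models the many-to-many relationship.
    """
    text = [_normalize(t) for t in text_emb]
    vis = [_normalize(v) for v in vis_emb]
    sim = [[_dot(t, v) for v in vis] for t in text]
    # Transpose both matrices to score the visual-to-text direction.
    sim_t = [list(col) for col in zip(*sim)]
    match_t = [list(col) for col in zip(*match)]
    text_to_vis = _directional_loss(sim, match, tau)
    vis_to_text = _directional_loss(sim_t, match_t, tau)
    return 0.5 * (text_to_vis + vis_to_text)


# Toy usage: two phrases, two masks, one-to-one ground truth.
phrases = [[1.0, 0.0], [0.0, 1.0]]
masks = [[1.0, 0.0], [0.0, 1.0]]
loss = bidirectional_contrastive_loss(phrases, masks, [[1, 0], [0, 1]])
```

Averaging the two directions penalizes both a phrase that scores highly against the wrong mask and a mask that scores highly against the wrong phrase; a one-directional loss would only catch the first case.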
Pages: 2528-2536
Number of pages: 9