Textual Context-Aware Dense Captioning With Diverse Words

Cited by: 21

Authors:

Shao, Zhuang [1]

Han, Jungong [2]

Debattista, Kurt [1]

Pang, Yanwei [3,4]

Affiliations:

[1] Univ Warwick, Warwick Mfg Grp, Coventry CV4 7AL, England

[2] Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, England

[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China

[4] Shanghai Artificial Intelligence Lab, Shanghai 200032, Peoples R China

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2023 / Vol. 25

Keywords:

Dense Captioning; Enhanced Transformer Dense Captioner; Textual Context Module; Dynamic Vocabulary Frequency Histogram; NETWORKS;

DOI:

10.1109/TMM.2023.3241517

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812

Abstract:

Dense captioning generates more detailed spoken descriptions for complex visual scenes. Despite several promising leads, existing methods still have two broad limitations: 1) the vast majority of prior work considers only visual contextual clues during captioning and ignores potentially important textual context; 2) current imbalanced learning mechanisms limit the diversity of vocabulary learned from the dictionary, giving rise to low language-learning efficiency. To address these limitations, we propose an end-to-end enhanced dense captioning architecture, the Enhanced Transformer Dense Captioner (ETDC), which obtains textual context from surrounding regions and dynamically diversifies the vocabulary bank during captioning. Concretely, we first propose the Textual Context Module (TCM), which is integrated into each self-attention layer of the Transformer decoder, to capture the surrounding textual context. Moreover, we take full advantage of the class information of object context and propose a Dynamic Vocabulary Frequency Histogram (DVFH) re-sampling strategy during training to balance words with different frequencies. The proposed method is tested on the standard dense captioning datasets and surpasses the state-of-the-art methods in terms of mean Average Precision (mAP).
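
The abstract does not specify how DVFH re-sampling is computed, so the following is only a generic illustration of the underlying idea: frequency-balanced re-sampling driven by a running word-frequency histogram. It is not the paper's algorithm. The function names (`update_histogram`, `sampling_weights`, `resample`), the `smoothing` parameter, and the mean-inverse-frequency weighting are all assumptions made for this sketch.

```python
from collections import Counter
import random


def update_histogram(hist, tokens):
    """Add one caption's tokens to the running word-frequency histogram."""
    hist.update(tokens)


def sampling_weights(hist, captions, smoothing=1.0):
    """Weight each caption by the mean inverse frequency of its words.

    Hypothetical weighting rule (an assumption, not taken from the paper):
    captions that contain rarer words receive a larger sampling probability.
    """
    raw = []
    for tokens in captions:
        if not tokens:
            raw.append(0.0)  # an empty caption is never sampled
            continue
        inv = [1.0 / (hist[t] + smoothing) for t in tokens]
        raw.append(sum(inv) / len(inv))
    total = sum(raw)
    if total == 0.0:
        raise ValueError("no non-empty captions to sample from")
    return [w / total for w in raw]


def resample(captions, weights, k, seed=0):
    """Draw k captions with replacement according to the weights."""
    rng = random.Random(seed)
    return rng.choices(captions, weights=weights, k=k)


if __name__ == "__main__":
    captions = [["a", "dog"], ["a", "cat"], ["a", "zebra"], ["a", "dog"]]
    hist = Counter()
    for caption in captions:
        update_histogram(hist, caption)
    weights = sampling_weights(hist, captions)
    print(weights)
    print(resample(captions, weights, k=4))
```

In this toy setup the histogram would be refreshed as training proceeds, which is one plausible reading of "dynamic"; how the published method updates its histogram and uses object class information is not described in this record.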


Pages: 8753 - 8766

Page count: 14
