Bilingual video captioning model for enhanced video retrieval

Cited by: 1
Authors
Alrebdi, Norah [1 ]
Al-Shargabi, Amal A. [1 ]
Institutions
[1] Qassim Univ, Coll Comp, Dept Informat Technol, Buraydah 51452, Saudi Arabia
Keywords
Artificial intelligence; Computer vision; Natural language processing; Video retrieval; English video captioning; Arabic video captioning; LANGUAGE; NETWORK; VISION; TEXT
DOI
10.1186/s40537-024-00878-w
Chinese Library Classification (CLC) number
TP301 [Theory and methods]
Discipline code
081202
Abstract
Many video platforms rely on uploader-provided descriptions for video retrieval; however, this reliance can cause inaccuracies. Although deep learning-based video captioning can address this problem, it has some limitations: (1) traditional keyframe extraction techniques do not consider video length or content, resulting in low accuracy, high storage requirements, and long processing times; and (2) Arabic-language support in video captioning is limited. This study proposes a new video captioning approach that uses an efficient keyframe extraction method and supports both Arabic and English. The proposed keyframe extraction technique combines time- and content-based approaches to produce better-quality captions while requiring less storage space and processing time. The English and Arabic models use a sequence-to-sequence framework with long short-term memory (LSTM) layers in both the encoder and decoder. Caption quality was evaluated for both models using four metrics: bilingual evaluation understudy (BLEU), metric for evaluation of translation with explicit ordering (METEOR), recall-oriented understudy for gisting evaluation (ROUGE-L), and consensus-based image description evaluation (CIDEr). The models were also evaluated with cosine similarity to determine their suitability for video retrieval. The results show that the English model performed better in both caption quality and video retrieval. On BLEU, METEOR, ROUGE-L, and CIDEr, the English model scored 47.18, 30.46, 62.07, and 59.98, respectively, whereas the Arabic model scored 21.65, 36.30, 44.897, and 45.52. In the video retrieval evaluation, the English and Arabic models successfully retrieved 67% and 40% of the videos, respectively, at a 20% similarity threshold. These models have potential applications in storytelling, sports commentary, and video surveillance.
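The retrieval evaluation described above matches a text query against generated captions by cosine similarity, with a 20% threshold. As an illustrative sketch only (the paper does not specify its text representation; bag-of-words term counts and the function names below are assumptions), the idea can be shown as:

```python
from collections import Counter
import math


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts.

    Assumption: simple whitespace tokenization; the paper's actual
    representation may differ (e.g., learned embeddings).
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[tok] * b[tok] for tok in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, captions: dict[str, str], threshold: float = 0.2) -> list[str]:
    """Return ids of videos whose generated caption meets the threshold.

    The 0.2 default mirrors the 20% similarity level used in the evaluation.
    """
    return [vid for vid, cap in captions.items()
            if cosine_similarity(query, cap) >= threshold]
```

For example, a query such as "man playing football" would retrieve a video captioned "a man is playing football" but not one captioned "a cat sleeps on a sofa", since only the former shares terms with the query.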
Pages: 24
Related papers (50 total)
  • [41] Learning Video-Text Aligned Representations for Video Captioning
    Shi, Yaya
    Xu, Haiyang
    Yuan, Chunfeng
    Li, Bing
    Hu, Weiming
    Zha, Zheng-Jun
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (02)
  • [42] Improving distinctiveness in video captioning with text-video similarity
    Velda, Vania
    Immanuel, Steve Andreas
    Hendria, Willy Fitra
    Jeong, Cheol
    IMAGE AND VISION COMPUTING, 2023, 136
  • [43] Multi-Task Video Captioning with Video and Entailment Generation
    Pasunuru, Ramakanth
    Bansal, Mohit
    PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 1, 2017, : 1273 - 1283
  • [44] Quality Enhancement Based Video Captioning in Video Communication Systems
    Le, The Van
    Lee, Jin Young
    IEEE ACCESS, 2024, 12 : 40989 - 40999
  • [45] Video Captioning using Hierarchical Multi-Attention Model
    Xiao, Huanhou
    Shi, Jinglun
    ICAIP 2018: 2018 THE 2ND INTERNATIONAL CONFERENCE ON ADVANCES IN IMAGE PROCESSING, 2018, : 96 - 101
  • [46] End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering
    Yu, Youngjae
    Ko, Hyungjin
    Choi, Jongwook
    Kim, Gunhee
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 3261 - 3269
  • [47] Approach for video retrieval by video clip
    Peng, Yu-Xin
    Ngo, Chong-Wah
    Dong, Qing-Jie
    Guo, Zong-Ming
    Xiao, Jian-Guo
Ruan Jian Xue Bao/Journal of Software, 2003, 14 (08): 1409 - 1417
  • [48] Video retrieval based on video clip
    Hu, Zhen-Xing
    Xia, Li-Min
Zhongnan Daxue Xuebao (Ziran Kexue Ban)/Journal of Central South University (Science and Technology), 2010, 41 (03): 1009 - 1014
  • [49] Video Interactive Captioning with Human Prompts
    Wu, Aming
    Han, Yahong
    Yang, Yi
    PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 961 - 967
  • [50] Deep multimodal embedding for video captioning
    Jin Young Lee
    Multimedia Tools and Applications, 2019, 78 : 31793 - 31805