Bilingual video captioning model for enhanced video retrieval

Cited by: 1
Authors
Alrebdi, Norah [1 ]
Al-Shargabi, Amal A. [1 ]
Institutions
[1] Qassim Univ, Coll Comp, Dept Informat Technol, Buraydah 51452, Saudi Arabia
Keywords
Artificial intelligence; Computer vision; Natural language processing; Video retrieval; English video captioning; Arabic video captioning; LANGUAGE; NETWORK; VISION; TEXT
DOI
10.1186/s40537-024-00878-w
Chinese Library Classification (CLC) number
TP301 [Theory and methods]
Discipline code
081202
Abstract
Many video platforms rely on uploader-provided descriptions for video retrieval; however, this reliance can cause inaccuracies. Although deep learning-based video captioning can address this problem, it has some limitations: (1) traditional keyframe extraction techniques do not consider video length or content, resulting in low accuracy, high storage requirements, and long processing times; and (2) Arabic-language support in video captioning is limited. This study proposes a new video captioning approach that uses an efficient keyframe extraction method and supports both Arabic and English. The proposed keyframe extraction technique combines time- and content-based approaches to produce better-quality captions while requiring less storage space and processing time. The English and Arabic models use a sequence-to-sequence framework with long short-term memory (LSTM) layers in both the encoder and decoder. Caption quality was evaluated for both models using four metrics: bilingual evaluation understudy (BLEU), metric for evaluation of translation with explicit ordering (METEOR), recall-oriented understudy for gisting evaluation (ROUGE-L), and consensus-based image description evaluation (CIDEr). The models were also evaluated with cosine similarity to determine their suitability for video retrieval. The results show that the English model performed better in both caption quality and video retrieval. On BLEU, METEOR, ROUGE-L, and CIDEr, the English model scored 47.18, 30.46, 62.07, and 59.98, respectively, whereas the Arabic model scored 21.65, 36.30, 44.897, and 45.52. In the video retrieval evaluation, the English and Arabic models successfully retrieved 67% and 40% of the videos, respectively, at a 20% similarity threshold. These models have potential applications in storytelling, sports commentary, and video surveillance.
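The retrieval evaluation described above matches a text query against generated captions by cosine similarity, with a 20% threshold. As an illustrative sketch only (the paper does not specify its text representation; bag-of-words term counts and the function names below are assumptions), the idea can be shown as:

```python
from collections import Counter
import math


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts.

    Assumption: simple whitespace tokenization; the paper's actual
    representation may differ (e.g., learned embeddings).
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[tok] * b[tok] for tok in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, captions: dict[str, str], threshold: float = 0.2) -> list[str]:
    """Return ids of videos whose generated caption meets the threshold.

    The 0.2 default mirrors the 20% similarity level used in the evaluation.
    """
    return [vid for vid, cap in captions.items()
            if cosine_similarity(query, cap) >= threshold]
```

For example, a query such as "man playing football" would retrieve a video captioned "a man is playing football" but not one captioned "a cat sleeps on a sofa", since only the former shares terms with the query.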
Pages: 24
Related papers (50 total)
  • [41] Learning Video-Text Aligned Representations for Video Captioning
    Shi, Yaya
    Xu, Haiyang
    Yuan, Chunfeng
    Li, Bing
    Hu, Weiming
    Zha, Zheng-Jun
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (02)
  • [42] Improving distinctiveness in video captioning with text-video similarity
    Velda, Vania
    Immanuel, Steve Andreas
    Hendria, Willy Fitra
    Jeong, Cheol
    IMAGE AND VISION COMPUTING, 2023, 136
  • [43] Multi-Task Video Captioning with Video and Entailment Generation
    Pasunuru, Ramakanth
    Bansal, Mohit
    PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 1, 2017, : 1273 - 1283
  • [44] Quality Enhancement Based Video Captioning in Video Communication Systems
    Le, The Van
    Lee, Jin Young
    IEEE ACCESS, 2024, 12 : 40989 - 40999
  • [45] Video Captioning using Hierarchical Multi-Attention Model
    Xiao, Huanhou
    Shi, Jinglun
    ICAIP 2018: 2018 THE 2ND INTERNATIONAL CONFERENCE ON ADVANCES IN IMAGE PROCESSING, 2018, : 96 - 101
  • [46] End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering
    Yu, Youngjae
    Ko, Hyungjin
    Choi, Jongwook
    Kim, Gunhee
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 3261 - 3269
  • [47] Approach for video retrieval by video clip
    Peng, Yu-Xin
    Ngo, Chong-Wah
    Dong, Qing-Jie
    Guo, Zong-Ming
    Xiao, Jian-Guo
Ruan Jian Xue Bao/Journal of Software, 2003, 14 (08): 1409 - 1417
  • [48] Video retrieval based on video clip
    Hu, Zhen-Xing
    Xia, Li-Min
Zhongnan Daxue Xuebao (Ziran Kexue Ban)/Journal of Central South University (Science and Technology), 2010, 41 (03): 1009 - 1014
  • [49] Video Interactive Captioning with Human Prompts
    Wu, Aming
    Han, Yahong
    Yang, Yi
    PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 961 - 967
  • [50] Deep multimodal embedding for video captioning
    Jin Young Lee
    Multimedia Tools and Applications, 2019, 78 : 31793 - 31805