Video Caption Based Searching Using End-to-End Dense Captioning and Sentence Embeddings

Cited by: 3

Authors:

Aggarwal, Akshay [1]

Chauhan, Aniruddha [1]

Kumar, Deepika [1]

Mittal, Mamta [2]

Roy, Sudipta [3]

Kim, Tai-hoon [4]

Affiliations:

[1] Bharati Vidyapeeths Coll Engn, Dept Comp Sci & Engn, New Delhi 110063, India

[2] GB Pant Govt Engn Coll, Dept Comp Sci & Engn, New Delhi 110020, India

[3] Washington Univ, PRTTL, St Louis, MO 63110 USA

[4] Beijing Jiaotong Univ, Sch Econ & Management, Beijing 100044, Peoples R China

Source:

SYMMETRY-BASEL | 2020 / Vol. 12 / Issue 06

Keywords:

video captioning; embeddings; deep learning; sentence embeddings;

DOI:

10.3390/sym12060992

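Chinese Library Classification (CLC) codes:

O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract: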

Traditionally, searching for videos on popular streaming sites like YouTube relies on the keywords, titles, and descriptions already attached to each video. The video content itself is not used to answer the user's query because it is difficult to encode the events in a video and compare them to the search query. One way to tackle this problem is to encode the events in a video and then compare them to the query in the same space. Video captioning is one method of encoding the meaning of a video: the captioned events can be compared to the user's query to obtain the optimal search space for the videos. Over the past few years, there have been many developments in modeling video-caption generators and sentence embeddings. In this paper, we exploit an end-to-end video captioning model and various sentence embedding techniques that collectively help in building the proposed video-searching method. The YouCook2 dataset was used for the experimentation. Seven sentence embedding techniques were used, of which the Universal Sentence Encoder outperformed the other six, with a median percentile score of 99.51. Thus, this method of searching, when integrated with traditional methods, can help improve the quality of search results.
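
The caption-based search idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for a real sentence encoder such as the Universal Sentence Encoder, the captions and video IDs are invented for illustration, and `percentile_of` is only one plausible reading of a "percentile score" (the share of other videos ranked below the correct one).

```python
import math
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence encoder: a bag-of-words count vector.

    Assumption: a real system would call a trained sentence embedding model here.
    """
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_videos(query, captions_by_video):
    """Rank video IDs by the best-matching caption of each video (best first)."""
    q = embed(query)
    scores = {
        video_id: max(cosine(q, embed(caption)) for caption in captions)
        for video_id, captions in captions_by_video.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def percentile_of(target, ranking):
    """Percentage of the other videos that are ranked below the target."""
    n = len(ranking)
    if n < 2:
        return 100.0
    return 100.0 * (n - 1 - ranking.index(target)) / (n - 1)


# Hypothetical captions, as a dense captioning model might produce per event.
captions = {
    "video_a": ["crack the eggs into a bowl", "whisk the eggs with milk"],
    "video_b": ["chop the onions", "fry the onions in oil"],
    "video_c": ["boil the pasta in salted water", "drain the pasta"],
}

ranking = rank_videos("fry onions in oil", captions)
print(ranking)
print(percentile_of("video_b", ranking))
```

Scoring each video by its best-matching caption is a design choice made for this sketch; averaging over captions or matching against the concatenated captions would be equally reasonable alternatives.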

Pages: 16

50 items in total

[1] End-to-End Dense Video Captioning with Masked Transformer
Zhou, Luowei
Zhou, Yingbo
Corso, Jason J.
Socher, Richard
Xiong, Caiming
[J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 8739 - 8748
[2] End-to-End Dense Video Captioning with Parallel Decoding
Wang, Teng
Zhang, Ruimao
Lu, Zhichao
Zheng, Feng
Cheng, Ran
Luo, Ping
[J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 6827 - 6837
[3] End-to-End Video Captioning
Olivastri, Silvio
Singh, Gurkirt
Cuzzolin, Fabio
[J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 1474 - 1482
[4] End-to-end Generative Pretraining for Multimodal Video Captioning
Seo, Paul Hongsuck
Nagrani, Arsha
Arnab, Anurag
Schmid, Cordelia
[J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 17938 - 17947
[5] End-to-End Video Captioning with Multitask Reinforcement Learning
Li, Lijun
Gong, Boqing
[J]. 2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2019, : 339 - 348
[6] SWINBERT: End-to-End Transformers with Sparse Attention for Video Captioning
Lin, Kevin
Li, Linjie
Lin, Chung-Ching
Ahmed, Faisal
Gan, Zhe
Liu, Zicheng
Lu, Yumao
Wang, Lijuan
[J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 17928 - 17937
[7] Dense Video Captioning Using Graph-Based Sentence Summarization
Zhang, Zhiwang
Xu, Dong
Ouyang, Wanli
Zhou, Luping
[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23 : 1799 - 1810
[8] End-to-End Transformer Based Model for Image Captioning
Wang, Yiyu
Xu, Jungang
Sun, Yingfei
[J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 2585 - 2594
[9] A Novel End-to-End Image Caption Based on Multimodal Attention
Li X.-M.
Yue G.
Chen G.-W.
[J]. Dianzi Keji Daxue Xuebao/Journal of the University of Electronic Science and Technology of China, 2020, 49 (06): : 867 - 874
[10] End-to-End Video Captioning Based on Multiview Semantic Alignment for Human-Machine Fusion
Wu, Shuai
Gao, Yubing
Yang, Weidong
Li, Hongkai
Zhu, Guangyu
[J]. IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, 2024, 22 : 1 - 9
