Video Caption Based Searching Using End-to-End Dense Captioning and Sentence Embeddings

Cited: 3
Authors
Aggarwal, Akshay [1 ]
Chauhan, Aniruddha [1 ]
Kumar, Deepika [1 ]
Mittal, Mamta [2 ]
Roy, Sudipta [3 ]
Kim, Tai-hoon [4 ]
Affiliations
[1] Bharati Vidyapeeths Coll Engn, Dept Comp Sci & Engn, New Delhi 110063, India
[2] GB Pant Govt Engn Coll, Dept Comp Sci & Engn, New Delhi 110020, India
[3] Washington Univ, PRTTL, St Louis, MO 63110 USA
[4] Beijing Jiaotong Univ, Sch Econ & Management, Beijing 100044, Peoples R China
Source
SYMMETRY-BASEL | 2020, Vol. 12, Issue 06
Keywords
video captioning; embeddings; deep learning; sentence embeddings;
DOI
10.3390/sym12060992
Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Subject Classification Codes
07 ; 0710 ; 09 ;
Abstract
Traditionally, searching for videos on popular streaming sites such as YouTube relies on the keywords, titles, and descriptions already tagged to each video. The video content itself is not used to answer the user's query, because encoding the events in a video and comparing them to a search query is difficult. One solution is to encode the events in a video and then compare them to the query in the same space. Video captioning offers a way to encode a video's meaning: the captioned events can be compared to the user's query, yielding an optimal search space for the videos. There have been many advances over the past few years in video-caption generation and sentence embeddings. In this paper, we combine an end-to-end video captioning model with various sentence embedding techniques to build the proposed video-searching method. The YouCook2 dataset was used for the experiments. Of the seven sentence embedding techniques evaluated, the Universal Sentence Encoder outperformed the other six, with a median percentile score of 99.51. This method of searching, when integrated with traditional methods, can help improve the quality of search results.
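The retrieval step the abstract describes can be sketched as follows: captions and the user's query are mapped into a shared vector space and ranked by similarity. This is a minimal, self-contained illustration only; the paper uses learned sentence embeddings (e.g., the Universal Sentence Encoder), whereas the `embed` function below is a hypothetical bag-of-words stand-in so the example runs without any model download.

```python
from collections import Counter
import math

def embed(sentence):
    # Hypothetical stand-in for a sentence embedding: a word-count vector.
    # The paper's method would use a learned encoder here instead.
    return Counter(sentence.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, captions):
    # Rank video captions by similarity to the query in the shared space.
    q = embed(query)
    return sorted(captions, key=lambda c: cosine(q, embed(c)), reverse=True)

captions = [
    "add the chopped onions to the pan",
    "a man rides a bicycle down the street",
    "whisk the eggs in a bowl",
]
print(search("fry chopped onions in one pan", captions)[0])
```

With a real sentence encoder in place of `embed`, semantically related but lexically different phrasings (e.g., "sauté the diced onion") would also rank highly, which is the motivation for comparing embedding techniques in the paper.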
Pages: 16