Video description with subject, verb and object supervision

Cited by: 0
Authors
Yue W. [1 ]
Jinlai L. [1 ]
Xiaojie W. [1 ]
Affiliations
[1] School of Computer Science, Beijing University of Posts and Telecommunications, Beijing
Funding
National Natural Science Foundation of China
Keywords
CNN; DNN; LSTM; VD;
DOI
10.19682/j.cnki.1005-8885.2019.1006
Abstract
Video description aims to generate descriptive natural language for videos. Inspired by the deep neural networks (DNNs) used in machine translation, video description (VD) models apply a convolutional neural network (CNN) to extract video features and a long short-term memory (LSTM) network to generate descriptions. However, some models generate incorrect words and syntax, possibly because previous models rely only on the LSTM to generate sentences and therefore learn insufficient linguistic information. To solve this problem, an end-to-end DNN model incorporating subject, verb and object (SVO) supervision is proposed. Experimental results on a publicly available dataset, i.e. Youtube2Text, show that the model achieves a 58.4% consensus-based image description evaluation (CIDEr) score, outperforming the mean pool and video description with first feed (VD-FF) models and demonstrating the effectiveness of SVO supervision. © 2019, Beijing University of Posts and Telecommunications. All rights reserved.
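The SVO supervision described in the abstract can be sketched as an auxiliary loss term combined with the usual per-word caption loss of the LSTM decoder. The function names, loss form, and weighting below are illustrative assumptions for a minimal sketch, not the paper's exact formulation:

```python
import math

# Hedged sketch (not the authors' code): training is assumed to combine
# the per-word caption cross-entropy from the LSTM decoder with auxiliary
# cross-entropy losses supervising the predicted subject, verb and object.

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target_index])

def total_loss(word_probs, word_targets, svo_probs, svo_targets, svo_weight=0.5):
    """Caption loss plus a weighted SVO supervision term (assumed form)."""
    caption_loss = sum(cross_entropy(p, t) for p, t in zip(word_probs, word_targets))
    svo_loss = sum(cross_entropy(p, t) for p, t in zip(svo_probs, svo_targets))
    return caption_loss + svo_weight * svo_loss
```

In this sketch the SVO term acts as extra linguistic supervision on the decoder, which is the mechanism the abstract credits for reducing incorrect words and syntax; the 0.5 weight is an assumed hyperparameter.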
Pages: 52-58
Page count: 6