iMakeup: Makeup Instructional Video Dataset for Fine-Grained Dense Video Captioning

Cited by: 0
Authors
Lin X. [1 ]
Jin Q. [1 ]
Chen S. [1 ]
Affiliations
[1] Multimedia Computing Laboratory, School of Information, Renmin University of China, Beijing
Keywords
Large-scale dataset; Makeup; Video caption; Video segmentation;
DOI: 10.3724/SP.J.1089.2019.17343
Abstract
Automatically describing images or videos with natural-language sentences (image/video captioning) has received increasing attention. Most prior work generates a single caption sentence for an image or a short video, but videos in daily life typically contain numerous actions and objects, and a single sentence cannot describe such complex content. Learning from long videos has therefore become a compelling problem, yet large-scale datasets for this task remain scarce. Instructional videos are a unique type of video with distinct characteristics that make them attractive for learning, and makeup instructional videos are very popular on commercial video websites. We therefore present iMakeup, a large-scale makeup instructional video dataset containing 2,000 videos evenly distributed over 50 topics. The total duration is about 256 hours, comprising about 12,823 video clips segmented according to makeup procedures. We describe the collection and annotation process of the dataset, and analyze its scale, text statistics, and diversity in comparison with other video datasets for similar problems. We then report results of baseline video captioning models on this dataset. iMakeup contains information from both the visual and auditory modalities, with broad coverage and content diversity. Beyond video captioning, it can support a wide range of problems, such as video segmentation, object detection, and intelligent fashion recommendation. © 2019, Beijing China Science Journal Publishing Co. Ltd. All rights reserved.
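As a quick sanity check, the averages implied by the figures quoted in the abstract (2,000 videos, ~256 hours, ~12,823 clips) can be derived directly; these per-video and per-clip averages are illustrative arithmetic, not numbers reported by the paper itself:

```python
# Back-of-envelope averages from the dataset totals quoted in the abstract.
TOTAL_VIDEOS = 2000   # videos, evenly spread over 50 topics
TOTAL_HOURS = 256     # approximate total duration
TOTAL_CLIPS = 12823   # clips segmented by makeup procedure

avg_video_min = TOTAL_HOURS * 60 / TOTAL_VIDEOS    # average video length (minutes)
avg_clips_per_video = TOTAL_CLIPS / TOTAL_VIDEOS   # average clips per video
avg_clip_sec = TOTAL_HOURS * 3600 / TOTAL_CLIPS    # average clip duration (seconds)

print(f"avg video length: {avg_video_min:.1f} min")   # ~7.7 min
print(f"avg clips/video:  {avg_clips_per_video:.1f}") # ~6.4
print(f"avg clip length:  {avg_clip_sec:.1f} s")      # ~71.9 s
```

So each video averages roughly 7.7 minutes and 6–7 procedure clips, i.e. clips of a bit over a minute each, which matches the fine-grained, procedure-level segmentation the paper describes.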
Pages: 1350-1357 (7 pages)