The Role of the Input in Natural Language Video Description

Cited by: 2

Authors:

Cascianelli, Silvia ^{[1]}

Costante, Gabriele ^{[1]}

Devo, Alessandro ^{[1]}

Ciarfuglia, Thomas A. ^{[1]}

Valigi, Paolo ^{[1]}

Fravolini, Mario L. ^{[1]}

Affiliations:

[1] Univ Perugia, Dept Engn, I-06123 Perugia, Italy

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2020 / Vol. 22 / Issue 01

Keywords:

Video description; multimodal data; input preprocessing; IMAGE; ATTENTION; TEXT;

DOI:

10.1109/TMM.2019.2924598

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Natural language video description (NLVD) has recently received strong interest in the computer vision, natural language processing (NLP), multimedia, and autonomous robotics communities. The state-of-the-art (SotA) approaches obtain remarkable results when tested on the benchmark datasets. However, those approaches generalize poorly to new datasets. In addition, none of the existing works focuses on the processing of the input to NLVD systems, which is both visual and textual. In this paper, an extensive study is presented on the role of the visual input, evaluated with respect to the overall NLP performance. This is achieved by performing data augmentation of the visual component, applying common transformations to model camera distortions, noise, lighting, and camera positioning that are typical in real-world operative scenarios. A t-SNE-based analysis is proposed to evaluate the effects of the considered transformations on the overall visual data distribution. For this study, the English subset of the Microsoft Research Video Description (MSVD) dataset, which is commonly used for NLVD, is considered. It was observed that this dataset contains a relevant amount of syntactic and semantic errors. These errors have been amended manually, and the new version of the dataset (called MSVD-v2) is used in the experimentation. The MSVD-v2 dataset is released to help gain insight into the NLVD problem.

Pages: 271-283

Number of pages: 13
