MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information

Cited by: 2

Authors:

Wang, Jianrong [1]

Huo, Yuchen [2]

Liu, Li [3]

Xu, Tianyi [1]

Li, Qi [4]

Li, Sen [1]

Affiliations:

[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China

[2] Tianjin Univ, Tianjin Int Engn Inst, Tianjin, Peoples R China

[3] Hong Kong Univ Sci & Technol Guangzhou, Guangzhou, Peoples R China

[4] Tianjin Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China

Source:

INTERSPEECH 2023 | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

Audio-Visual Speech Recognition; Mandarin Audio-Visual Corpus; Azure Kinect; Depth Information; SPEECH; RECOGNITION; TECHNOLOGY;

DOI:

10.21437/Interspeech.2023-823

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Audio-visual speech recognition (AVSR) has gained increasing attention from researchers as an important part of human-computer interaction. However, existing Mandarin audio-visual datasets are limited and lack depth information. To address this issue, this work establishes MAVD, a new large-scale Mandarin multimodal corpus comprising 12,484 utterances spoken by 64 native Chinese speakers. To ensure the dataset covers diverse real-world scenarios, a pipeline for cleaning and filtering the raw text material has been developed to create well-balanced reading material. In particular, Microsoft's latest data acquisition device, Azure Kinect, is used to capture depth information in addition to the traditional audio signals and RGB images during data acquisition. We also provide a baseline experiment, which can be used to evaluate the effectiveness of the dataset. The dataset and code will be released at https://github.com/SpringHuo/MAVD.

Pages: 2113-2117

Number of pages: 5
