EgoCom: A Multi-Person Multi-Modal Egocentric Communications Dataset

Cited by: 5
Authors
Northcutt, Curtis G. [1 ]
Zha, Shengxin [2 ]
Lovegrove, Steven [3 ]
Newcombe, Richard [3 ]
Affiliations
[1] MIT, Department of Electrical Engineering and Computer Science, Cambridge, MA 02139, USA
[2] Facebook AI, Menlo Park, CA 94025, USA
[3] Facebook Reality Labs (Oculus Research), Redmond, WA 98052, USA
Keywords
Task analysis; Artificial intelligence; Visualization; Synchronization; Natural languages; Computer vision; Education; Egocentric; multi-modal data; EgoCom; communication; turn-taking; human-centric; embodied intelligence; videos
DOI
10.1109/TPAMI.2020.3025105
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Multi-modal datasets in artificial intelligence (AI) often capture a third-person perspective, but our embodied human intelligence evolved with sensory input from the egocentric, first-person perspective. Towards embodied AI, we introduce the Egocentric Communications (EgoCom) dataset to advance the state-of-the-art in conversational AI, natural language, audio speech analysis, computer vision, and machine learning. EgoCom is a first-of-its-kind natural-conversation dataset containing multi-modal human communication data captured simultaneously from the participants' egocentric perspectives. EgoCom includes 38.5 hours of synchronized embodied stereo audio and egocentric video from 34 diverse speakers, with 240,000 ground-truth, time-stamped word-level transcriptions and speaker labels. We study baseline performance on two novel applications that benefit from embodied data: (1) predicting turn-taking in conversations and (2) multi-speaker transcription. For (1), we investigate Bayesian baselines that predict turn-taking within 5 percent of human performance. For (2), we use simultaneous egocentric capture to combine Google speech-to-text outputs, improving global transcription accuracy by 79 percent relative to a single perspective. Both applications exploit EgoCom's synchronous multi-perspective data to improve performance on embodied AI tasks.
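The second application above merges per-participant Google speech-to-text outputs into a single global transcript. A minimal sketch of one plausible combination scheme is shown below, assuming hypothetical `Word` records and a simple confidence-voting rule over time-aligned hypotheses; the names and the rule are illustrative assumptions, not the paper's exact method.

```python
# Minimal sketch (assumptions, not EgoCom's published method): merge word-level
# ASR hypotheses from several synchronized egocentric audio streams by keeping,
# within each time window, the hypothesis the recognizer was most confident in.
from dataclasses import dataclass

@dataclass
class Word:
    text: str      # recognized word
    start: float   # start time in seconds on the shared capture clock
    conf: float    # ASR confidence in [0, 1]
    source: str    # wearer of the device whose audio produced this hypothesis

def merge_transcripts(perspectives: list[list[Word]], window: float = 0.25) -> list[Word]:
    """Confidence-voting merge: competing hypotheses whose start times fall
    within `window` seconds of each other are treated as the same utterance,
    and the highest-confidence one wins."""
    all_words = sorted((w for p in perspectives for w in p), key=lambda w: w.start)
    merged: list[Word] = []
    for w in all_words:
        if merged and abs(w.start - merged[-1].start) < window:
            if w.conf > merged[-1].conf:   # same utterance, better hypothesis
                merged[-1] = w
        else:
            merged.append(w)               # new utterance
    return merged

if __name__ == "__main__":
    alice = [Word("hello", 0.00, 0.9, "alice"), Word("there", 0.40, 0.8, "alice")]
    bob   = [Word("hello", 0.05, 0.6, "bob"),   Word("hi",    1.20, 0.7, "bob")]
    for w in merge_transcripts([alice, bob]):
        print(f"{w.start:5.2f}s  {w.source}: {w.text} ({w.conf:.2f})")
```

The intuition behind any such scheme is the one the abstract relies on: each wearable microphone hears its own wearer best, so across synchronized perspectives the highest-confidence stream is a useful proxy for the active speaker.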
Pages: 6783-6793 (11 pages)
Related Papers
50 records in total; selection shown below
  • [1] Towards Continual Egocentric Activity Recognition: A Multi-Modal Egocentric Activity Dataset for Continual Learning
    Xu, Linfeng
    Wu, Qingbo
    Pan, Lili
    Meng, Fanman
    Li, Hongliang
    He, Chiyuan
    Wang, Hanxin
    Cheng, Shaoxu
    Dai, Yu
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 2430 - 2443
  • [2] SDT: A SYNTHETIC MULTI-MODAL DATASET FOR PERSON DETECTION AND POSE CLASSIFICATION
    Pramerdorfer, C.
    Strohmayer, J.
    Kampel, M.
    2020 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2020, : 1611 - 1615
  • [3] The IMPTC Dataset: An Infrastructural Multi-Person Trajectory and Context Dataset
    Hetzel, Manuel
    Reichert, Hannes
    Reitberger, Guenther
    Fuchs, Erich
    Doll, Konrad
    Sick, Bernhard
    2023 IEEE INTELLIGENT VEHICLES SYMPOSIUM, IV, 2023,
  • [4] Investigation on the Fusion of Multi-modal and Multi-person Features in RNNs for Detecting the Functional Roles of Group Discussion Participants
    Huang, Hung-Hsuan
    Nishida, Toyoaki
    SOCIAL COMPUTING AND SOCIAL MEDIA. DESIGN, ETHICS, USER BEHAVIOR, AND SOCIAL NETWORK ANALYSIS, SCSM 2020, PT I, 2020, 12194 : 489 - 503
  • [5] A multi-subject, multi-modal human neuroimaging dataset
    Wakeman, Daniel G.
    Henson, Richard N.
    SCIENTIFIC DATA, 2015, 2
  • [6] Multi-modal egocentric activity recognition using multi-kernel learning
    Arabaci, Mehmet Ali
    Ozkan, Fatih
    Surer, Elif
    Jancovic, Peter
    Temizel, Alptekin
    MULTIMEDIA TOOLS AND APPLICATIONS, 2021, 80 (11) : 16299 - 16328
  • [7] Multi-modal Sarcasm Generation: Dataset and Solution
    Zhao, Wenye
    Huang, Qingbao
    Xu, Dongsheng
    Zhao, Peizhi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 5601 - 5613
  • [8] Multi-modal person identification in a smart environment
    Ekenel, Hazim Kemal
    Fischer, Mika
    Jin, Qin
    Stiefelhagen, Rainer
    2007 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOLS 1-8, 2007, : 2984 - +