EgoCom: A Multi-Person Multi-Modal Egocentric Communications Dataset

Cited by: 5
Authors
Northcutt, Curtis G. [1 ]
Zha, Shengxin [2 ]
Lovegrove, Steven [3 ]
Newcombe, Richard [3 ]
Affiliations
[1] MIT, Dept Elect & Comp Sci, Cambridge, MA 02139 USA
[2] Facebook AI, Menlo Pk, CA 94025 USA
[3] Oculus Res, Facebook Reality Labs, Redmond, WA 98052 USA
Keywords
Task analysis; Artificial intelligence; Visualization; Synchronization; Natural languages; Computer vision; Education; Egocentric; multi-modal data; EgoCom; communication; turn-taking; human-centric; embodied intelligence; videos
DOI
10.1109/TPAMI.2020.3025105
CLC number
TP18 [Theory of artificial intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multi-modal datasets in artificial intelligence (AI) often capture a third-person perspective, but our embodied human intelligence evolved with sensory input from the egocentric, first-person perspective. Towards embodied AI, we introduce the Egocentric Communications (EgoCom) dataset to advance the state-of-the-art in conversational AI, natural language, audio speech analysis, computer vision, and machine learning. EgoCom is a first-of-its-kind natural conversations dataset containing multi-modal human communication data captured simultaneously from the participants' egocentric perspectives. EgoCom includes 38.5 hours of synchronized embodied stereo audio and egocentric video, with 240,000 ground-truth, time-stamped word-level transcriptions and speaker labels from 34 diverse speakers. We study baseline performance on two novel applications that benefit from embodied data: (1) predicting turn-taking in conversations and (2) multi-speaker transcription. For (1), we investigate Bayesian baselines to predict turn-taking within 5 percent of human performance. For (2), we use simultaneous egocentric capture to combine Google speech-to-text outputs, improving global transcription by 79 percent relative to a single perspective. Both applications exploit EgoCom's synchronous multi-perspective data to augment the performance of embodied AI tasks.
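Both baseline applications in the abstract hinge on EgoCom's time-synchronized, multi-perspective capture. As an illustrative sketch only (the paper's exact combination method is not reproduced here; the `Word` fields and the fixed-window heuristic are assumptions made for this example), the Python snippet below merges per-wearer speech-to-text word hypotheses by keeping, within each short time window, the hypothesis with the highest recognizer confidence:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Word:
    text: str          # recognized word
    start: float       # start time in seconds (shared clock across wearers)
    end: float         # end time in seconds
    confidence: float  # recognizer confidence in [0, 1]


def merge_perspectives(perspectives: List[List[Word]], window: float = 0.5) -> List[Word]:
    """Merge per-wearer ASR word hypotheses into one global transcript.

    Heuristic for illustration: bucket words from all time-synchronized
    perspectives into fixed windows and keep, per window, the hypothesis
    with the highest recognizer confidence.
    """
    best_per_window: Dict[int, Word] = {}
    for words in perspectives:
        for w in words:
            key = int(w.start // window)
            best = best_per_window.get(key)
            if best is None or w.confidence > best.confidence:
                best_per_window[key] = w
    return [best_per_window[k] for k in sorted(best_per_window)]


# Toy example: two wearers hear the same utterance with different clarity.
wearer_a = [Word("ego", 0.0, 0.3, 0.95), Word("com", 0.5, 0.8, 0.40)]
wearer_b = [Word("echo", 0.0, 0.3, 0.55), Word("com", 0.5, 0.8, 0.90)]
print([w.text for w in merge_perspectives([wearer_a, wearer_b])])  # ['ego', 'com']
```

Similarly, a minimal first-order baseline for turn-taking prediction, in the spirit of the Bayesian baselines mentioned above but not the paper's actual model, can be estimated from speaker-label frequencies alone:

```python
from collections import Counter, defaultdict
from typing import Dict, List


def fit_turn_counts(speakers: List[str]) -> Dict[str, Counter]:
    """Count how often each speaker follows each other speaker."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for cur, nxt in zip(speakers, speakers[1:]):
        counts[cur][nxt] += 1
    return counts


def predict_next_speaker(counts: Dict[str, Counter], current: str) -> str:
    """Most frequent follower of `current`, i.e. argmax over P(next | current)."""
    return counts[current].most_common(1)[0][0]


turns = ["A", "B", "A", "C", "A", "B", "A", "B"]
model = fit_turn_counts(turns)
print(predict_next_speaker(model, "A"))  # 'B' (B follows A most often here)
```

Both sketches assume only a shared clock and per-wearer labels of the kind the dataset provides; no dataset-specific APIs are used.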
Pages: 6783-6793
Page count: 11