Lecture Presentations Multimodal Dataset: Towards Understanding Multimodality in Educational Videos

Cited by: 2
Authors
Lee, Dong Won [1 ,2 ]
Ahuja, Chaitanya [2 ]
Liang, Paul Pu [2 ]
Natu, Sanika [2 ]
Morency, Louis-Philippe [2 ]
Affiliations
[1] MIT, Cambridge, MA 02139 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Funding
US National Science Foundation; US National Institutes of Health
Keywords
COGNITIVE LOAD THEORY;
DOI
10.1109/ICCV51070.2023.01838
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Many educational videos use slide presentations, a sequence of visual pages containing text and figures accompanied by spoken language, which are constructed and presented carefully in order to optimally transfer knowledge to students. Previous studies in multimedia and psychology attribute the effectiveness of lecture presentations to their multimodal nature. As a step toward developing vision-language models to aid in student learning as intelligent teacher assistants, we introduce the Lecture Presentations Multimodal (LPM) Dataset as a large-scale benchmark testing the capabilities of vision-and-language models in multimodal understanding of educational videos. Our dataset contains aligned slides and spoken language for 180+ hours of video and 9000+ slides, with 10 lecturers from various subjects (e.g., computer science, dentistry, biology). We introduce three research tasks, (1) figure-to-text retrieval, (2) text-to-figure retrieval, and (3) generation of slide explanations, which are grounded in multimedia learning and psychology principles to test a vision-language model's understanding of multimodal content. We provide manual annotations to help implement these tasks and establish baselines on them. Comparing baselines and human student performances, we find that state-of-the-art vision-language models (zero-shot and fine-tuned) struggle with (1) weak crossmodal alignment between slides and spoken text, (2) learning novel visual mediums, (3) technical language, and (4) long-range sequences. We introduce PolyViLT, a novel multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches for retrieval. We conclude by shedding light on the challenges and opportunities in multimodal understanding of educational presentation videos.
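The abstract states that PolyViLT is trained with a multi-instance learning loss for retrieval. The record does not give the loss itself, but a common form for this setting is a MIL-NCE-style objective, where each slide figure is paired with a *bag* of positive spoken-text spans rather than a single one. The sketch below is a minimal illustration under that assumption; the function name, the similarity-matrix setup, and the exact loss form are illustrative and may differ from the paper's actual formulation.

```python
import numpy as np

def mil_nce_loss(sim, pos_mask):
    """Multi-instance contrastive loss (MIL-NCE-style sketch, not the
    paper's verified loss).

    sim:      (N, M) similarity scores between N slide figures and
              M spoken-text spans.
    pos_mask: (N, M) boolean; True where span j belongs to the bag of
              positives for figure i (a figure may have several).
    """
    # Shift by the row max for numerical stability before exponentiating.
    exp_sim = np.exp(sim - sim.max(axis=1, keepdims=True))
    # Probability mass on the whole positive bag vs. all candidates.
    pos = (exp_sim * pos_mask).sum(axis=1)
    total = exp_sim.sum(axis=1)
    return float(np.mean(-np.log(pos / total)))
```

Summing over the bag of positives inside the softmax (instead of picking one positive) is what makes the objective multi-instance: any of the aligned spoken spans can explain the figure, so raising the score of any positive lowers the loss.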
Pages: 20030-20041 (12 pages)
Related Papers (showing 10 of 50)
  • [1] MM-AU: Towards Multimodal Understanding of Advertisement Videos
    Bose, Digbalay
    Hebbar, Rajat
    Feng, Tiantian
    Somandepalli, Krishna
    Xu, Anfeng
    Narayanan, Shrikanth
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 86 - 95
  • [2] MEmoR: A Dataset for Multimodal Emotion Reasoning in Videos
    Shen, Guangyao
    Wang, Xin
    Duan, Xuguang
    Li, Hongzhi
    Zhu, Wenwu
    [J]. MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 493 - 502
  • [3] Semantic indexing for recorded educational lecture videos
    Repp, S
    Meinel, C
    [J]. FOURTH ANNUAL IEEE INTERNATIONAL CONFERENCE ON PERVASIVE COMPUTING AND COMMUNICATIONS WORKSHOPS, PROCEEDINGS, 2006, : 240 - +
  • [4] Saliency of Omnidirectional Videos with Different Audio Presentations: Analyses and Dataset
    Singla, Ashutosh
    Robotham, Thomas
    Bhattacharya, Abhinav
    Menz, William
    Habets, Emanuel A. P.
    Raake, Alexander
    [J]. 2023 15TH INTERNATIONAL CONFERENCE ON QUALITY OF MULTIMEDIA EXPERIENCE, QOMEX, 2023, : 264 - 269
  • [5] MultiMET: A Multimodal Dataset for Metaphor Understanding
    Zhang, Dongyu
    Zhang, Minghao
    Zhang, Heting
    Yang, Liang
    Lin, Hongfei
    [J]. 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (ACL-IJCNLP 2021), VOL 1, 2021, : 3214 - 3225
  • [6] Understanding the Users and Videos by Mining a Novel Danmu Dataset
    Lv, Guangyi
    Zhang, Kun
    Wu, Le
    Chen, Enhong
    Xu, Tong
    Liu, Qi
    He, Weidong
    [J]. IEEE TRANSACTIONS ON BIG DATA, 2022, 8 (02) : 535 - 551
  • [7] Perspective on Developing Educational Lecture Videos for Power Electronics Courses
    Kim, Katherine A.
    Jeong, Hoejeong
    Liu, Yu-Chen
    [J]. 2017 IEEE 18TH WORKSHOP ON CONTROL AND MODELING FOR POWER ELECTRONICS (COMPEL), 2017,
  • [8] Towards Understanding of Deepfake Videos in the Wild
    Cho, Beomsang
    Le, Binh M.
    Kim, Jiwon
    Woo, Simon
    Tariq, Shahroz
    Abuadbba, Alsharif
    Moore, Kristen
    [J]. PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023, 2023, : 4530 - 4537
  • [9] A multimodal approach for extracting content descriptive metadata from lecture videos
    Balasubramanian, Vidhya
    Doraisamy, Sooryanarayan Gobu
    Kanakarajan, Navaneeth Kumar
    [J]. JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2016, 46 (01) : 121 - 145
  • [10] Towards a multimodal human activity dataset for healthcare
    Hu, Menghao
    Luo, Mingxuan
    Huang, Menghua
    Meng, Wenhua
    Xiong, Baochen
    Yang, Xiaoshan
    Sang, Jitao
    [J]. MULTIMEDIA SYSTEMS, 2023, 29 (01) : 1 - 13