Video-based learning has become a popular, scalable, and effective approach for students to acquire new skills. Many of the challenges in video-based learning can be addressed with machine learning models. However, available datasets often lack the rich data sources needed to accurately predict students' learning experiences and outcomes. To address this limitation, we introduce the MSP-GEO corpus, a new multimodal database that contains detailed demographic and educational data, recordings of the students and their screens, and metadata about the lecture collected during the learning experience. The MSP-GEO corpus was collected using a quasi-experimental pre-test/post-test design. It consists of more than 39,600 seconds (11 hours) of continuous facial footage from 76 participants watching one of three experimental videos on the topic of fossil formation, resulting in over one million facial images. The collected data include 21 gaze synchronization points, webcam and monitor recordings, and metadata for pauses, plays, and timeline navigation. Additionally, we annotated the recordings for engagement, boredom, and confusion using human evaluators. The MSP-GEO corpus has the potential to improve the accuracy of predictions of video-based learning outcomes and experiences, facilitate research on the psychological processes underlying video-based learning, inform the design of instructional videos, and advance the development of learning analytics methods.