In this work, we propose a framework that integrates information from the behavioral and cognitive spaces to perform attention profiling of a learner engaging with digital content. Attention profiling helps examine and understand students' concentration, attention, and cognitive engagement patterns. It enables educators to discern which types of digital content effectively engage students, identify potential distractors, customize learning resources, and enhance students' overall learning experience. Integrated into a Learning Management System (LMS) environment, attention profiling also helps students by providing feedback on the content or resources that require more focus.

Several studies examine student engagement through behavioral cues, including clickstream data, time spent watching videos, number of Git commits, and participation in discussion forums; however, limited research measures student attention using both behavioral cues and cognitive measurements. We address the problem of attention profiling of a learner using data from both the behavioral and cognitive spaces. Integrating data from the two spaces requires a fusion technique that improves the performance of attention profiling. We propose to use EEG and eye-gaze information from the cognitive and behavioral spaces, respectively.

We used the 'Stroop test,' the 'Sustained Attention to Response Task' (SART), and the 'Continuous Performance Task' (CPT) to elicit selective-attention and sustained-attention states among learners, and the data collected during these tests served as ground truth. The students then watched three different types of videos while we collected cognitive-space data using Emotiv+, a 14-channel head-mounted EEG device, and behavioral-space data in the form of eye-gaze information from a web-camera-based solution. The advantages of the Emotiv+ device are the comprehensive coverage of sensors across both brain hemispheres and a real-time data stream that includes raw EEG and FFT/band power. To capture the on-screen and off-screen behavior of the learners, we used the L2CS-Net gaze estimation architecture built on ResNet-50. We aim to develop a coordinated multimodal data representation framework by employing co-learning methods.
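The band-power features mentioned above can be recovered from the raw EEG stream with standard spectral analysis. The sketch below is a minimal illustration of that step, not the Emotiv SDK's own output; the 128 Hz sampling rate, window length, and band edges are assumptions chosen for the example.

```python
import numpy as np
from scipy.signal import welch

# Conventional EEG frequency bands (Hz); the edges are a common convention,
# not taken from the paper.
BANDS = {
    "theta": (4.0, 8.0),
    "alpha": (8.0, 12.0),
    "beta_low": (12.0, 16.0),
    "beta_high": (16.0, 25.0),
    "gamma": (25.0, 45.0),
}

def band_powers(eeg_window, fs=128.0):
    """Compute per-channel band power for one EEG window.

    eeg_window : ndarray of shape (n_channels, n_samples), raw EEG.
    fs         : sampling rate in Hz (128 Hz assumed here).
    Returns an ndarray of shape (n_channels, n_bands).
    """
    # Welch PSD per channel: average FFTs over overlapping segments.
    freqs, psd = welch(eeg_window, fs=fs,
                       nperseg=min(256, eeg_window.shape[-1]), axis=-1)
    powers = []
    for lo, hi in BANDS.values():
        mask = (freqs >= lo) & (freqs < hi)
        # Integrate the PSD over the band to get absolute band power.
        powers.append(np.trapz(psd[..., mask], freqs[mask], axis=-1))
    return np.stack(powers, axis=-1)

# Example: 14 channels, a 2-second window at 128 Hz of synthetic data.
window = np.random.randn(14, 256)
print(band_powers(window).shape)  # (14, 5)
```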
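To show how gaze angles can be turned into the on-screen/off-screen behavioral signal, the following sketch assumes a gaze estimator such as L2CS-Net that regresses pitch and yaw per video frame; the angular thresholds and helper functions are illustrative placeholders rather than part of L2CS-Net or the proposed system.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ScreenLimits:
    """Approximate angular extent of the screen from the camera's viewpoint.

    These thresholds are illustrative; in practice they would be derived from
    the monitor size and seating distance, or calibrated per session.
    """
    yaw_deg: float = 25.0    # horizontal half-angle
    pitch_deg: float = 18.0  # vertical half-angle

def is_on_screen(pitch_rad, yaw_rad, limits=ScreenLimits()):
    """Label one gaze estimate as on-screen (True) or off-screen (False).

    pitch_rad, yaw_rad : gaze angles as produced by a gaze estimator
    (e.g., L2CS-Net); the estimator call itself is omitted here.
    """
    return (abs(np.degrees(yaw_rad)) <= limits.yaw_deg and
            abs(np.degrees(pitch_rad)) <= limits.pitch_deg)

def off_screen_ratio(gaze_angles):
    """Fraction of frames labelled off-screen in a window of (pitch, yaw) pairs."""
    labels = [is_on_screen(p, y) for p, y in gaze_angles]
    return 1.0 - (sum(labels) / max(len(labels), 1))

# Example: three frames, the last one looking far to the side.
frames = [(0.05, 0.10), (-0.02, 0.08), (0.10, 0.90)]
print(off_screen_ratio(frames))  # ~0.33
```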
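As an illustration of what a coordinated multimodal representation learned through co-learning could look like, the sketch below pairs two small encoders with a contrastive alignment objective that pulls time-aligned EEG and gaze embeddings together. The encoder sizes, feature dimensions, and loss choice are assumptions made for the example, not the paper's final design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Small MLP projecting one modality's features into a shared space.

    Layer sizes are placeholders, not the paper's configuration.
    """
    def __init__(self, in_dim, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-length embeddings

def coordination_loss(z_eeg, z_gaze, temperature=0.1):
    """Symmetric contrastive (InfoNCE-style) loss: time-aligned EEG/gaze
    pairs are pulled together, mismatched pairs pushed apart, yielding a
    coordinated rather than joint representation."""
    logits = z_eeg @ z_gaze.t() / temperature
    targets = torch.arange(z_eeg.size(0), device=z_eeg.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example: a batch of 8 aligned windows; 70 EEG band-power features
# (14 channels x 5 bands) and 4 gaze features per window (illustrative sizes).
eeg_enc, gaze_enc = ModalityEncoder(70), ModalityEncoder(4)
eeg_feat, gaze_feat = torch.randn(8, 70), torch.randn(8, 4)
loss = coordination_loss(eeg_enc(eeg_feat), gaze_enc(gaze_feat))
loss.backward()  # gradients flow into both encoders
```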