In video understanding and analysis, relying solely on the appearance features of individuals in video frames is insufficient for accurate group activity recognition; fully exploiting the other feature information available in a video is crucial to understanding group activities. Consequently, a three-stream feature-learning architecture is proposed. Beyond human appearance features and scene-level context information, the model emphasizes the perception of individual motion, uncovering valuable information carried by motion features. Integrating appearance, motion, and scene-level context information yields a richer, more comprehensive representation of each individual, and the combined features are then used in relation analysis to better predict the group activity. The effectiveness of the proposed method is validated on two benchmark datasets, Volleyball and Collective Activity, demonstrating its efficacy for the task. © 2024 SPIE and IS&T
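A minimal sketch of the three-stream fusion idea follows, assuming a PyTorch implementation. The per-stream encoders, feature dimensions, class count, and the pooling-plus-MLP stand-in for relation analysis are all illustrative assumptions; the abstract does not specify the actual architectural details.

```python
# Sketch only: module names, dimensions, and the simple relation head
# are assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn


class ThreeStreamFusion(nn.Module):
    """Fuses per-person appearance, motion, and scene-context features,
    then scores the group activity (hypothetical classification head)."""

    def __init__(self, feat_dim=512, num_activities=8):
        super().__init__()
        # One encoder per stream; in practice these would be real
        # backbones (e.g., a CNN for appearance, flow-based features
        # for motion) rather than single linear layers.
        self.appearance = nn.Linear(feat_dim, feat_dim)
        self.motion = nn.Linear(feat_dim, feat_dim)
        self.scene = nn.Linear(feat_dim, feat_dim)
        # Relation analysis over the fused per-person features is
        # approximated here by an MLP plus mean pooling (an assumption).
        self.classifier = nn.Sequential(
            nn.Linear(3 * feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, num_activities),
        )

    def forward(self, app, mot, ctx):
        # app, mot: (batch, persons, feat_dim); ctx: (batch, feat_dim).
        n = app.size(1)
        # Broadcast the scene-level context to every person.
        ctx = self.scene(ctx).unsqueeze(1).expand(-1, n, -1)
        fused = torch.cat(
            [self.appearance(app), self.motion(mot), ctx], dim=-1
        )
        person_logits = self.classifier(fused)  # per-person scores
        return person_logits.mean(dim=1)        # pool to group level


# Usage: 2 clips, 12 players, 512-d features per stream (all assumed).
model = ThreeStreamFusion()
app = torch.randn(2, 12, 512)
mot = torch.randn(2, 12, 512)
ctx = torch.randn(2, 512)
print(model(app, mot, ctx).shape)  # torch.Size([2, 8])
```

Concatenation followed by a shared head is only one of several plausible fusion choices; attention- or graph-based relation modules are common alternatives for the relation-analysis step the abstract mentions.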