Human posture recognition from real-life images is a challenging task. To devise effective spatial-temporal representations, we propose a hybrid model that integrates object detection, segmentation, and classification methods to recognize three human postures, i.e., jumping, sitting, and standing, in challenging real-world images. Specifically, a well-known deep learning model, Mask R-CNN, is first employed to detect and segment each human subject in an image. The regional features extracted from each segmented region are then passed to a revised Inception-ResNet-v2 model for posture recognition; in particular, the revised Inception-ResNet-v2 model proves highly efficient at deep feature extraction. Evaluated on human action data sets, the proposed hybrid model outperforms two other deep learning methods for posture classification. This two-stage process provides a foundation for future research, such as human recognition in commercial environments.