Housekeeping is a fundamental component of site safety management. Good housekeeping means maintaining a neat and orderly working environment, keeping walkways free from obstruction, and storing materials and equipment properly. These practices improve construction safety and productivity, and they reflect a good safety culture. However, poor housekeeping is a common problem on construction sites. To ensure good housekeeping, supervisors typically perform manual on-site inspections. These inspections are costly and inefficient, and inspectors can be inconsistent in how they define good and poor housekeeping. Computer vision approaches can help resolve some of these problems automatically. Specifically, image classification methods can identify good and poor housekeeping in images from video streams supplied by inspection robots or drones. Furthermore, the computer vision system can alert the relevant supervisors when it detects poor housekeeping, and the supervisors can then tidy up the identified locations. Hence, this study aims to develop a supervised deep learning model to classify good and poor housekeeping images. The study used a state-of-the-art vision transformer-based backbone model and tuned it for housekeeping images. The study shows that, unlike the general scene images used in conventional image classification, housekeeping images are highly varied. Despite this challenge, the computer vision models developed in this study achieved satisfactory accuracy in identifying good and poor housekeeping images.
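
The alerting step described above can be illustrated with a minimal sketch. It is not the study's implementation: the `FrameScore` record, the `locations_to_alert` function, the location labels, the scores, and the 0.5 threshold are all hypothetical, and the per-frame probabilities are assumed to come from some binary good/poor housekeeping classifier.

```python
from dataclasses import dataclass


@dataclass
class FrameScore:
    location: str     # hypothetical label for where the frame was captured
    poor_prob: float  # assumed classifier output: probability of poor housekeeping


def locations_to_alert(scores, threshold=0.5):
    """Return locations, in first-seen order, that have at least one frame
    whose poor-housekeeping probability meets the threshold."""
    flagged = []
    for s in scores:
        if s.poor_prob >= threshold and s.location not in flagged:
            flagged.append(s.location)
    return flagged


# Illustrative values only; these are not results from the study.
scores = [
    FrameScore("Level 2 corridor", 0.91),
    FrameScore("Stairwell A", 0.12),
    FrameScore("Level 2 corridor", 0.77),
    FrameScore("Loading bay", 0.64),
]
print(locations_to_alert(scores))  # ['Level 2 corridor', 'Loading bay']
```

In practice the threshold would be chosen on validation data, since a lower value sends supervisors more alerts and a higher value risks missing untidy areas.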