Robust and accurate camera pose estimation is fundamental in computer vision. Learning-based regression approaches can accurately recover six-degree-of-freedom camera parameters from the visual cues of an input image. However, most are trained on street-view and landmark datasets and generalize poorly to overhead viewpoints, such as those encountered when calibrating surveillance cameras and unmanned aerial vehicles. Moreover, reference images captured in the real world are scarce and expensive to collect, and their diversity is not guaranteed. In this article, we address the problem of training visual localization models with virtual images as an alternative. This work makes the following principal contributions: First, we present a new, challenging localization dataset containing six reconstructed large-scale three-dimensional scenes, 10,594 calibrated photographs with condition changes, and 300k virtual images with pixelwise labels for depth, relative surface normals, and semantic segmentation. Second, we introduce a flexible multi-feature fusion network trained on virtual image datasets for robust image retrieval. Third, we propose an end-to-end confidence map prediction network for feature filtering and pose estimation. We demonstrate that large-scale rendered virtual images benefit visual localization: they mitigate the limited diversity of real images and provide richly labeled multi-feature data for deep learning. Experimental results show that our method outperforms state-of-the-art approaches. To foster further research on visual localization with synthetic images, we release our benchmark at https://github.com/YuanXiong/contributions.