Cone beam scanners have evolved rapidly in recent years. The increasing sampling resolution of the projection images and the desire to reconstruct high-resolution output volumes increase both memory consumption and processing time considerably. To keep the processing time down, new memory management strategies are required, as well as new algorithmic implementations of the reconstruction pipeline. In this paper, we present a fast and high-quality cone beam reconstruction pipeline using the Graphics Processing Unit (GPU). This pipeline includes the backprojection process as well as pre-filtering and post-filtering stages. In particular, we focus on a subset of five stages, but more stages can be integrated easily. In the pre-filtering stage, we first reduce the noise in the acquired projection images with a non-linear curvature-based smoothing algorithm. Then, we apply a high-pass filter as required by the inverse Radon transform. Next, the backprojection pass reconstructs a raw 3D volume. In post-processing, we first apply ring artifact removal to the volume. Then, we remove cupping artifacts with our novel uniformity correction algorithm, which we present in detail. To execute the pipeline as quickly as possible, we take advantage of GPUs, which have proven to be very fast parallel processors for numerical problems. Unfortunately, both the projection images and the reconstruction volume are too large to fit into 512 MB of GPU memory. Therefore, we present an efficient memory management strategy that minimizes the bus transfer between main memory and GPU memory. Our results show a fourfold performance gain over a highly optimized CPU implementation using SSE2/3 instructions. At the same time, the image quality is comparable to the CPU results, with an average per-pixel difference of 10^-5.
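As an illustration of the high-pass pre-filtering stage mentioned above, the following is a minimal CPU sketch of row-wise ramp filtering with the standard Ram-Lak kernel, assuming unit detector spacing. It is not the GPU implementation described in this paper; the function names `ram_lak_kernel` and `filter_row` are hypothetical, and the cosine pre-weighting commonly used in cone beam reconstruction is omitted.

```python
import math


def ram_lak_kernel(half_width):
    """Spatial-domain Ram-Lak (ramp) kernel, assuming unit detector spacing.

    Taps: 1/4 at n = 0, 0 at even n, -1 / (pi^2 n^2) at odd n.
    """
    kernel = []
    for n in range(-half_width, half_width + 1):
        if n == 0:
            kernel.append(0.25)
        elif n % 2 == 0:
            kernel.append(0.0)
        else:
            kernel.append(-1.0 / (math.pi ** 2 * n ** 2))
    return kernel


def filter_row(row, kernel):
    """Convolve one detector row with the kernel.

    Samples outside the row are treated as zero (an assumption of this sketch).
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(row)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i - (k - half)  # input sample paired with this kernel tap
            if 0 <= j < len(row):
                acc += weight * row[j]
        out.append(acc)
    return out
```

Each detector row is filtered independently, which is why this stage maps naturally onto a parallel processor. The direct convolution here is chosen for clarity; an FFT-based filter would usually be preferred for long rows.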