We present an in-place algorithm for the parallel-partition problem with linear work and polylogarithmic span. The algorithm uses only exclusive read/write shared variables and can be implemented using parallel-for-loops without any additional concurrency considerations (i.e., the algorithm is EREW). A key feature of the algorithm is that it exhibits provably optimal cache behavior up to small-order factors. We also present a second in-place EREW algorithm with work O(n) and span O(log n . log log n), which is within an O(log log n) factor of the optimal span. By using this low-span algorithm as a subroutine within the cache-friendly algorithm, we obtain a single EREW algorithm that combines their theoretical guarantees: the algorithm achieves span O(log n . log log n) and exhibits optimal cache behavior. As an immediate consequence, we also get an inplace EREW Quicksort algorithm with work O(n log n) and span O(log(2) n . log log n). Whereas the standard EREW algorithm for parallel-partition is memory-bandwidth bound on large numbers of cores, our cache-friendly algorithm is able to achieve near-ideal scaling in practice by avoiding the memory-bandwidth bottleneck. Our algorithm's performance is comparable to that of the Blocked Strided Algorithm of Francis, Pannan, Frias, and Petit, which is the previous state-of-the-art for parallel EREW sorting algorithms, but which lacks theoretical guarantees on its span and cache behavior.