The quest for raw computing power has shifted from increasing processor clock speeds to increasing the number of processing cores. Mainstream CPUs are currently available in dual-socket quad-core and hex-core configurations, while graphics cards provide hundreds of processing cores. Although various scientific applications, including underwater acoustic models, have been implemented on graphics hardware, widespread use of this technology has been hampered by the often extraordinary effort needed to program the hardware, especially when the application architecture did not match the canonical graphics pipeline used for gaming. In the last few years, the major graphics board manufacturers have stepped away from designing hardware specialized for particular new graphic special effects and have made a concerted effort to provide general-purpose computing capabilities of the sort that can be exploited for scientific computing. For example, NVIDIA's "Compute Unified Device Architecture" (CUDA) environment currently provides many building blocks for scientific computing, such as (subsets of) BLAS, LAPACK, and FFTs. We will present our experiences implementing the split-step Fourier parabolic equation (PE) model in the CUDA environment, showing how we achieved a tenfold speedup relative to a multi-core CPU implementation with a modest investment in programming effort.

In the repertoire of wave propagation modeling approaches, a parabolic equation model is typically used for range-dependent problems in which a ray tracing approach would not provide enough fidelity (e.g., because a high-frequency approximation is not warranted for the waveguide being modeled). PE models are narrowband, so a broadband application requires running multiple frequencies to cover the band of interest, followed by synthesis via an inverse FFT to form the predicted time-domain waveform; the independent per-frequency runs offer obvious opportunities for parallelization.
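To make the role of the FFT concrete, here is a minimal sketch of one range step of the standard narrow-angle split-step Fourier PE marching scheme. This is not the authors' code; the function name and parameters are illustrative, and NumPy stands in for the GPU FFT library:

```python
import numpy as np

def ssf_pe_step(psi, n_profile, dz, dr, k0):
    """One range step of the narrow-angle split-step Fourier PE.

    psi       : complex field sampled on the depth grid at range r
    n_profile : index of refraction versus depth at this range
    dz, dr    : depth and range step sizes; k0: reference wavenumber
    """
    # vertical wavenumbers corresponding to the depth grid
    kz = 2.0 * np.pi * np.fft.fftfreq(psi.size, d=dz)
    # diffraction step, applied as a phase multiply in the transform domain
    psi = np.fft.ifft(np.exp(-1j * kz**2 * dr / (2.0 * k0)) * np.fft.fft(psi))
    # refraction step: environmental phase screen in the depth domain
    return psi * np.exp(1j * k0 * (n_profile - 1.0) * dr)
```

Each marching step is two FFTs plus element-wise phase multiplies, which is why a fast GPU FFT dominates the achievable speedup.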
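The broadband synthesis step can likewise be sketched in a few lines. Each single-frequency PE run is independent (the parallelism noted above); an inverse FFT then assembles the predicted time series. The helper below is hypothetical and glosses over amplitude and windowing conventions:

```python
import numpy as np

def broadband_synthesis(field_per_freq, freqs, fs, nt):
    """Assemble a time-domain waveform from single-frequency PE results.

    field_per_freq : complex receiver field from each narrowband PE run
    freqs          : the modeled frequencies (Hz)
    fs, nt         : sampling rate and length of the output time series
    """
    spectrum = np.zeros(nt, dtype=complex)
    bins = np.round(np.asarray(freqs) * nt / fs).astype(int)
    spectrum[bins] = field_per_freq          # place each run in its FFT bin
    # inverse FFT synthesizes the predicted waveform (real part taken)
    return np.fft.ifft(spectrum).real * nt
```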
This application was initially selected because its key software component, the FFT, is available in a mature GPU-based implementation. In addition, a multi-core CPU implementation of the FFT is also available, enabling a very direct comparison of CPU versus GPU performance using nearly identical code bases. We will describe the key steps needed to adapt this model to the GPU architecture. For example, an important aspect of accelerating applications on GPU architectures is effectively exploiting the different memory types that reside on the device. Since the bandwidth between cores within the GPU is 5-10 times greater than the bandwidth between the CPU and the GPU, it is important to minimize the amount of data transferred in and out of the GPU. Fortunately, GPUs also provide a type of memory called texture memory, which conveniently offers hardware-accelerated interpolation; thus, a sparse representation of the range-dependent waveguide parameters (sound speed profile, bathymetry, geoacoustic parameters of the seabed) can be loaded into texture memory, where it can be interpolated to the resolution required by the PE calculations. We will present benchmark comparisons between our GPU-based PE implementation and two other PE approaches on several canonical range-dependent modeling problems, comparing accuracy and degree of acceleration.
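On the CPU, the texture lookup described above amounts to interpolating a sparse environmental profile onto the fine grid the PE marching requires. A minimal NumPy stand-in is shown below (illustrative only; on the GPU this lookup is performed by the texture units in hardware, with no explicit interpolation code):

```python
import numpy as np

def interp_profile(sparse_depths, sparse_values, grid_depths):
    """Linearly interpolate a sparsely sampled environmental parameter
    (e.g., a sound speed profile) onto the fine PE depth grid.
    CPU analogue of a hardware texture fetch."""
    return np.interp(grid_depths, sparse_depths, sparse_values)
```

Keeping only the sparse samples resident on the GPU, and interpolating on demand, is one way to minimize the CPU-to-GPU transfers discussed above.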