This study evaluates three field-scale erosion models - GLEAMS, EPIC and WEPP - against high-quality measured erosion data for a hillslope site in the UK South Downs, collected during the period 1982-88. Calibrated and uncalibrated runs were carried out; however, the values used for calibration were constrained so that they remained within an 'acceptable' range and were consistent between models. Despite the relatively undemanding nature of this model evaluation, calibration is seen to be essential for all models used. Model results exhibit wide inter-model scatter. This appears to be due in part to the constraints placed upon the values of calibrated variables: these appear to have prevented the models from reproducing measured values with any exactitude. For this dataset at least, there may well be limitations in each model's process descriptions (e.g. regarding crusting for GLEAMS, and regarding the hydraulic implications of soil stoniness for WEPP) which necessitate 'excessive' compensatory calibration. In addition, it appears that identical input parameters can take on subtly different process 'meanings' in different models, and thus may require rather different values for each model.