Large-scale simulation modelling of agricultural and other biological systems is a young science that has grown with increasing computer power over the last twenty or thirty years. Recently, we have become accustomed to using models as an authoritative source of information for research and policy making in important areas such as climate change and carbon dynamics. For complex systems, models may offer the only practical way to examine the challenges and concerns we have about the future at a time of pressure on resources and uncertainty. The complexity of biological systems makes it unrealistic for either models or experimental observation alone to address the full scope of underlying processes. Working with models and experimental data in concert has great potential to increase both our understanding of these complex systems and the predictive power of the models. To make this possible, it is the responsibility of modellers to be transparent about their methodology, to present information in a form accessible to review, and to make their models reproducible.

An essential part of model development is testing and evaluation, usually done by comparing the model with experimental data. Agricultural systems are complex and involve the integration of many processes such as: plant photosynthesis, growth and development; soil water and nutrient dynamics; and, for pasture systems, animal intake, metabolism and growth. Often, model testing is restricted to whole-system behaviour. If the model and data agree, then some level of 'validation' is said to have been achieved; if they disagree, the usual conclusion is that the model is at fault and either the structure requires revision or the parameters need adjusting.
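The whole-system comparison described above is commonly summarised with simple goodness-of-fit statistics. As a minimal sketch (hypothetical pasture-yield data and function names; not any particular model's evaluation protocol):

```python
import math

def rmse(observed, simulated):
    """Root-mean-square error between paired observations and model output."""
    n = len(observed)
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(observed, simulated)) / n)

def bias(observed, simulated):
    """Mean difference (simulated - observed); positive indicates over-prediction."""
    return sum(s - o for o, s in zip(observed, simulated)) / len(observed)

# Illustrative pasture yields (t DM/ha): field measurements vs model output.
observed = [2.1, 3.4, 4.0, 3.2]
simulated = [2.0, 3.6, 4.3, 3.0]

print(f"RMSE = {rmse(observed, simulated):.3f}")  # 0.212
print(f"Bias = {bias(observed, simulated):.3f}")  # 0.050
```

Such summary statistics compress whole-system behaviour into a few numbers, which is precisely why agreement at this level says little about the correctness of the underlying process descriptions.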
One difficulty with this process, particularly for complex biophysical models with a large number of parameters, is that similar results can be obtained from different combinations of parameter values, raising the question of which parameter combinations are appropriate. Indeed, with validation defined this way, models will inevitably be proven invalid. Validation as generally applied is more appropriately termed 'verification', and it is well established that no scientific hypothesis, or model, can ever be verified. It is suggested that we focus instead on the notion of testing and evaluating models, recognising that they are always, at best, an approximation of the real world, just as the factors influencing observational data can never be completely known. This opens the way to working closely with models and data together to gain greater understanding of the underlying system.

It is virtually impossible to provide a complete mathematical description of large models for assessment in normal journal articles, and so peer review is generally restricted to whole-system behaviour and is unlikely to identify errors and limitations in the underlying model structure. Peer review remains important, however, and it is suggested that complete and clear descriptions of models be made available, and that models be, in principle, completely reproducible from the available documentation. Some examples taken from models of canopy photosynthesis and whole-pasture simulation will be presented to illustrate these points.
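The difficulty that different parameter combinations can produce similar results (often called equifinality) can be illustrated with a toy sketch. Here a hypothetical two-pool decay model (illustrative name and rate constants, not drawn from the models discussed) responds only to the summed rate constant, so the data alone cannot distinguish between parameter sets:

```python
import math

def toy_decay(k_fast, k_slow, t):
    """Toy two-pool decay model in which only the summed rate constant
    affects the output, so distinct (k_fast, k_slow) pairs are
    observationally indistinguishable."""
    return math.exp(-(k_fast + k_slow) * t)

times = [0.0, 1.0, 2.0, 4.0]

# Two distinct parameter combinations with the same total rate (0.40).
run_a = [toy_decay(0.30, 0.10, t) for t in times]
run_b = [toy_decay(0.05, 0.35, t) for t in times]

# The simulated trajectories coincide, so fitting to these data cannot
# identify the individual parameters.
assert all(abs(a - b) < 1e-12 for a, b in zip(run_a, run_b))
```

Real biophysical models are rarely this degenerate, but with tens or hundreds of parameters and only whole-system observations, broad regions of parameter space can fit the data comparably well.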