Sound quality is not an inherent property of sounds. Rather, it emerges from a complex process of judgment in which the "character" of a sound is compared to conceptual references that represent the listeners' expectations. These references are task- and listener-specific. This paper addresses how the process of quality formation may be modeled with a comprehensive model of auditory processing that contains both bottom-up (signal-driven) and top-down (hypothesis-driven) processes. Generating hypotheses requires inherent knowledge, namely, a specific aural-world model. To obtain ground-truth data for this world model, sound quality can be analyzed in terms of plausibility. Plausibility may be assessed indirectly by observing listeners' involvement and immersion in auditory scenes, thereby implicitly considering the meaning associated with them. In this way, the ability of the sounds under test to convey meaning to listeners can be considered alongside mere form-related fidelity; in other words, the function aspect is taken into account in addition to the form aspect. A general architecture for such a comprehensive modeling system is proposed and discussed, using the evaluation of sound reproduction systems as an example.