Neural network-based methods are praised for their predictive performance but often criticized for their lack of interpretability. When dealing with multi-omics or multi-modal data, such methods must be able to learn the independent and joint effects of heterogeneous views while yielding results that are interpretable both within and across views. Multi-view generative models have been proposed in the literature to learn joint information in a low-dimensional latent space. Among these models, multi-view variational autoencoders are particularly promising. In this work, we demonstrate how they provide a convenient statistical framework for learning the joint distribution of the input data and offer opportunities for interpreting the results. We design a method that discovers the relationships between one view and the others. The generative capabilities of the model enable the exploration of a whole disorder spectrum through the generation of realistic values. When a subject's clinical score is modified, the model retrieves a representation of the subject's brain at this clinical status, a so-called digital avatar. By computing associations between cortical region measures and behavioral scores, we show that such digital avatars convey interpretable information in a multi-modal cohort of children experiencing mental health issues.
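
The digital-avatar idea can be illustrated with a minimal sketch. It is not the implementation used in this work. Fixed linear maps stand in for the trained clinical-view encoder and brain-view decoder, and every name and value (`ENC_W`, `ENC_SIGMA`, `DEC_W`, `encode_clinical`, `decode_brain`, `region_score_slopes`) is a hypothetical assumption introduced for illustration. The sketch sweeps a clinical score, samples latent codes, decodes them into toy cortical measures, and estimates a per-region association as an ordinary least-squares slope.

```python
import random

# Hypothetical stand-ins for trained networks: fixed linear maps, not learned weights.
ENC_W = [0.8, -0.5]        # clinical score -> mean of a 2-D latent code (assumed)
ENC_SIGMA = 0.1            # fixed posterior standard deviation (assumed)
DEC_W = [                  # latent code -> 3 toy cortical region measures (assumed)
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]


def encode_clinical(score):
    """Return the latent mean associated with a clinical score."""
    return [w * score for w in ENC_W]


def sample_latent(mu, rng):
    """Draw one latent code from a Gaussian centred on mu."""
    return [m + ENC_SIGMA * rng.gauss(0.0, 1.0) for m in mu]


def decode_brain(z):
    """Map a latent code to cortical region measures."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in DEC_W]


def digital_avatars(score, n_samples, rng):
    """Generate brain representations for a subject placed at a given score."""
    mu = encode_clinical(score)
    return [decode_brain(sample_latent(mu, rng)) for _ in range(n_samples)]


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs (xs must not all be equal)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def region_score_slopes(scores, n_samples=200, seed=0):
    """Sweep the clinical score and return one association slope per region."""
    rng = random.Random(seed)
    xs, avatars = [], []
    for score in scores:
        for avatar in digital_avatars(score, n_samples, rng):
            xs.append(score)
            avatars.append(avatar)
    return [ols_slope(xs, [a[r] for a in avatars]) for r in range(len(DEC_W))]


if __name__ == "__main__":
    for region, slope in enumerate(region_score_slopes([-2, -1, 0, 1, 2])):
        print(f"region {region}: slope {slope:.3f}")
```

In this toy setting the slopes recover the composition of the assumed encoder and decoder maps. With trained nonlinear networks, such associations would instead have to be estimated from the generated avatars.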