This paper describes the development of a corpus of multimodal emotional behaviors. To date, many databases of multimodal affective behavior have been developed, and they can be divided into spontaneous and acted behavior databases. Acted behavior databases make it easy to collect utterances with a balanced distribution of emotions; however, it has been pointed out that acted speech differs from spontaneous speech. In this work, we aim to collect acted multimodal emotional utterances that sound as natural as possible. To this end, we first collected scenes from tweets, taking emotional balance into account. We then performed a preliminary corpus collection, which demonstrated that a variety of emotional utterances could be collected. Next, we collected the corpus using a crowdsourcing platform. Finally, we evaluated the naturalness of the collected speech by comparing it with that of a read-speech database (JTES) and a spontaneous speech database (SMOC). The collected corpus was rated as more natural than JTES, which indicates that the recording program was effective in collecting a corpus of natural-sounding emotional behavior.