Environmental Sound Classification (ESC) has advanced significantly with the advent of deep learning techniques. This study conducts a comprehensive evaluation of contrastive and metric learning approaches to ESC, introducing the ESC51 dataset, an extension of the ESC50 benchmark that incorporates noise samples from quadrotor Unmanned Aerial Vehicles (UAVs). To enhance classification performance and the discriminative power of embedding spaces, we propose a novel metric learning-based approach, SoundMLR, which employs a hybrid loss function emphasizing metric learning principles. Experimental results demonstrate that SoundMLR consistently outperforms contrastive learning methods in both classification accuracy and inference latency, particularly when applied to the lightweight pretrained MobileNetV2 model across the ESC50, ESC51, and UrbanSound8K (US8K) datasets. Analyses of confusion matrices and t-SNE visualizations further highlight SoundMLR's ability to generate compact, distinct feature clusters, enabling more robust discrimination between sound classes. Additionally, we introduce two novel modules, Spectral Pooling Attention (SPA) and the Feature Pooling Layer (FPL), designed to optimize the MobileNetV2 backbone. Notably, the MobileNetV2 + FPL model, equipped with SoundMLR, achieves 92.16% classification accuracy on the ESC51 dataset while reducing computational complexity by 24.5%. Similarly, the MobileNetV2 + SPA model achieves a peak accuracy of 91.75% on the ESC50 dataset, showcasing the complementary strengths of these modules. These findings offer valuable insights for the future development of efficient, scalable, and robust ESC systems. The source code for this study is publicly available at https://github.com/flchenwhu/ESC-SoundMLR.
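The abstract describes SoundMLR's hybrid loss as combining a classification objective with metric learning principles, without specifying the exact terms. As a minimal sketch of what such a hybrid loss can look like, the snippet below combines softmax cross-entropy with a triplet margin loss on the embedding space; the triplet formulation, the margin value, and the weighting factor `alpha` are illustrative assumptions, not the paper's actual loss.

```python
import numpy as np

def cross_entropy(logits, labels):
    # Softmax cross-entropy over a batch (classification term).
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Hinge on Euclidean distances: pull same-class embeddings together,
    # push different-class embeddings at least `margin` further apart.
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

def hybrid_loss(logits, labels, anchor, positive, negative, alpha=0.5):
    # Weighted sum of the classification term and the metric-learning term.
    # `alpha` balancing the two objectives is a hypothetical hyperparameter.
    return cross_entropy(logits, labels) + alpha * triplet_loss(anchor, positive, negative)
```

In practice the embeddings would come from the backbone (e.g., MobileNetV2) and the logits from its classification head; the metric term shapes the embedding space into the compact, well-separated clusters that the t-SNE analyses visualize.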