Diabetic retinopathy (DR) is damage to the retinal microvasculature caused by prolonged diabetes mellitus. Diagnosis and treatment of DR entail screening of retinal fundus images of diabetic patients. Manual inspection of pathological changes in retinal images is a skill-based task that requires considerable effort and time. Therefore, computer-aided detection and diagnosis of DR have been extensively explored for the past few decades. In recent years, with the development of various benchmark deep convolutional neural networks (CNNs), deep learning has been efficiently and effectively adapted to different DR classification tasks. The success of CNNs largely depends on how well they extract discriminative features from fundus images. However, to the best of our knowledge, no study to date has evaluated the feature extraction capabilities of all the benchmark CNNs for DR classification, or identified the best training hyper-parameters for each of them in fundus-image-based DR classification tasks. In this work, we seek to identify the best benchmark CNN to serve as the backbone feature extractor for DR classification using fundus retinal images. We also aim to find the optimal hyper-parameters for training each benchmark CNN family, particularly when applied to DR grading tasks on retinal image datasets with severe class imbalance and limited samples of the higher-severity classes. To this end, we conduct a detailed, comprehensive comparative study of the performance of almost all the benchmark CNNs and their variants proposed between 2014 and 2019 on DR grading tasks over common standard retinal datasets. We also conduct a comprehensive search for the optimal training hyper-parameters of each benchmark CNN family for fundus-image-based DR classification tasks.
The benchmark CNNs are transfer-learned and trained end-to-end in an incremental fashion on a class-balanced dataset curated from the training set of the EyePACS dataset. The models are evaluated on the APTOS, MESSIDOR-1, and MESSIDOR-2 datasets to test their cross-dataset generalization. Experimental results show that the features extracted by EfficientNetB1 outperform those of all the other CNN models in DR classification on all three test datasets. MobileNet-V3-Large also shows promising performance on the MESSIDOR-1 dataset. The success of EfficientNetB1 and MobileNet-V3-Large indicates that comparatively shallower and lightweight CNNs tend to extract more discriminative and expressive features from fundus images for DR stage detection. In the future, researchers can explore different pre- and post-processing techniques and incorporate novel architectural components into these networks to further improve classification accuracy and robustness.