An Efficient Technique for Recognizing Tomato Leaf Disease Based on the Most Effective Deep CNN Hyperparameters

Leaf disease in tomatoes is one of the most common and treacherous diseases. It directly affects tomato production, resulting in enormous economic loss each year. As a result, studying the detection of tomato leaf diseases is essential. To that aim, this work introduces a novel mechanism for selecting the most effective hyperparameters for improving the detection accuracy of deep CNNs. Several cutting-edge CNN architectures were examined in this study to diagnose tomato leaf diseases. The experiment was divided into three stages to arrive at a robust technique. First, a few pre-trained deep convolutional neural networks were employed to diagnose tomato leaf diseases. The best-performing model was then experimented with by varying the learning rate, optimizer, and classifier to discover the optimal parameters and minimize overfitting during training. In this case, 99.31% accuracy was reached by DenseNet 121 using the AdaBound optimizer, a 0.01 learning rate, and the Softmax classifier. The combinations that achieved detection accuracy above 99% across various learning rates, optimizers, and classifiers were finally tested using K-fold cross-validation to obtain a more dependable detection accuracy. The results indicate that the proposed parameters and technique are efficacious in recognizing tomato leaf disease and can be fruitfully applied to identifying other leaf diseases.


Introduction
Biologically named Solanum lycopersicum, the tomato is a commonly harvested crop around the world and is rich in principal antioxidants such as vitamins C and A along with beta-carotene. There is an increasing trend in the production and consumption of tomatoes throughout the globe, resulting in 38.54 million tons of production for the year 2020. Tomatoes can be grown in any well-drained wet soil.
The information that can be recovered from a single color component is constrained, since plant leaf pictures are complicated by their backgrounds. As a result, such feature extraction approaches produce less reliable information. Therefore, the high identification accuracy of CNNs has attracted many researchers.
Pandian et al. [13] applied an innovative 14 layered deep CNN (14-DCNN) on a massive open dataset of leaves. Their research indicates that 14-DCNN is well suited to automated plant disease identification. A customized CNN model significantly outperformed a pre-trained model, as shown by their study.
Developing an effective CNN model that achieves high detection accuracy is a difficult task. Zhang et al. [14] suggested a three-channel CNN model that combines RGB color components to recognize disease in vegetable leaves. Sibiya et al. [15] utilized a CNN to classify maize plant diseases. They demonstrated the model's impact using histogram approaches and obtained an overall model accuracy of 92.85%. For identifying diseases in tomato leaves, Zhang et al. [16] investigated a few CNN architectures such as ResNet, AlexNet, and GoogleNet. The maximum accuracy of ResNet was 92.28%, outperforming the other networks. In the study presented by Amara et al. [17], the LeNet CNN model was utilized to identify banana leaf diseases. There, the authors tested the model on grayscale and color images using classification accuracy (CA) and F1 scores.
Ferentinos [18] used the AlexNet, GoogleNet, and VGG CNN architectures to compare classification accuracy on leaf disease. VGG surpassed all other networks, obtaining 99.53% disease classification performance. Yamamoto et al. [19] classified tomato diseases using CNNs on high-resolution, low-resolution, and super-resolution images to compare super-resolution accuracy against the other approaches. The paper's results showed that the super-resolution approach surpassed the traditional methods by a large margin in terms of detection accuracy. Durmus et al. [20] used the pre-trained networks AlexNet and SqueezeNet V1.1 to classify tomato plant disease; AlexNet outperformed SqueezeNet with a disease classification accuracy of 95.65%.
According to the review, deep neural networks have been effectively employed for learning in plant disease detection applications. The primary issue in developing deep neural networks is the architecture of the network, where it is critical to accurately set edge weights and map nodes from the input to the output. To train deep neural networks, it is necessary to fine-tune their network parameters using a procedure that maps the input layer to the output layer and improves over time. In our work, some pre-trained deep models were used as a starting point and fine-tuned using three hyperparameters: learning rate, optimizer, and classifier. The ability to use deep models with limited sample numbers is the main benefit of such transfer learning in image classification [21]. Lastly, the hyperparameter values that contributed most to improving detection accuracy were validated using a five-fold cross-validation approach.
A computer program can learn from data using deep learning. The learning process is the means through which the ability to conduct classification with high precision is attained. The aim is to use pre-trained models (trained on the ImageNet dataset) to identify and classify ten classes of tomato plant leaves. The classification task instructs the computer program to determine to which of k categories a given input belongs. The learning algorithm is tasked with creating a function f : ℝⁿ → {1, …, k}. The model assigns an input described by a vector x to a category identified by a numerical value y when y = f(x). In this study, ten classes were used, nine of which were for leaf diseases and one for healthy leaves.

Dataset

The dataset of diseased tomato plant leaves was collected from the well-known Plant Village dataset. The dataset contains 56,048 images of plant leaves of 14 different species such as Apple, Blueberry, Cherry, Corn, and Grape. Among these, we chose tomato leaves for this experiment. The tomato leaf dataset is composed of images of 9 disease classes and 1 healthy class, as shown in Table 1 along with brief information. The 9 diseased classes are Early blight, Bacterial spot, Leaf mold, Late blight, Septoria leaf spot, Target spot, Two-spotted spider mite, Tomato yellow leaf curl virus, and Tomato mosaic virus. There are 18,160 images in total, 1,591 of which are images of healthy tomato leaves. In the first and second stages of the experiment, the dataset was split into training and testing sets in an 8:2 ratio by randomizing pictures from the dataset based on the class label ratio. In the third stage, we performed five-fold cross-validation: the whole dataset was divided equally into five folds, where one fold was used for validation and the other four for training. In all cases, the images were downsized to the target size of 64 × 64 pixels.
The dataset was normalized before being divided into training and validation sets.
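The split and normalization described above can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions (the `stratified_split` and `normalize` helpers are hypothetical, not the paper's code):

```python
import numpy as np

def stratified_split(labels, train_frac=0.8, seed=42):
    """Split sample indices 8:2 while preserving each class's label ratio,
    mirroring the paper's randomized, class-label-based split."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        cut = int(len(idx) * train_frac)
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return np.array(train_idx), np.array(test_idx)

def normalize(images):
    """Scale 8-bit pixel values into [0, 1] before training."""
    return images.astype(np.float32) / 255.0
```

Because the split is performed per class, the 8:2 ratio holds within each of the ten classes, not just overall.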

DenseNet
DenseNet was first introduced in [8]. It connects each layer to every subsequent layer in a feed-forward manner. An L-layer network therefore has L(L+1)/2 direct connections, whereas a conventional convolutional network has just one link between each layer and the following layer. The DenseNet design provides several advantages, including improved feature propagation, alleviation of the vanishing-gradient problem, and, most importantly, a lower parameter count.
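As a quick sanity check on the connectivity count, the following sketch (a hypothetical helper, not from the paper's code) counts the direct connections in an L-layer dense block:

```python
def dense_connections(num_layers):
    """Each layer in a dense block receives the feature maps of all preceding
    layers, so an L-layer block has 1 + 2 + ... + L = L * (L + 1) / 2 direct
    connections, versus only L links in a conventional chain of layers."""
    return sum(range(1, num_layers + 1))
```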

ResNet
ResNet was first introduced in the paper [11]. This architecture was proposed primarily to solve the problem that numerous stacked non-linear layers struggle to learn identity mappings, and thereby to address the degradation problem. ResNet models come in three common depths: 50, 101, and 152 layers. Among those, ResNet50 offers a good balance of efficiency and effectiveness.
Thus, in this experiment we chose ResNet50. It is a network-within-a-network design built from a large number of stacked residual units, which serve as the foundation of the ResNet design. These residual units are composed of convolution and pooling layers. The architecture is broadly similar to VGG [10] but about 8 times deeper. In this experiment, we loaded the pre-trained network and added a Softmax layer at the end to perform image classification.
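The key idea of a residual unit can be shown in a few lines of NumPy. This is an illustrative sketch of the shortcut connection only, not the paper's model code; `transform` stands in for the unit's stacked convolution layers:

```python
import numpy as np

def residual_unit(x, transform):
    """A residual unit outputs y = F(x) + x: the stacked layers only have to
    learn the residual F, and the identity shortcut lets gradients flow past
    them, countering the degradation problem."""
    return transform(x) + x
```

When F collapses to zero, the unit reduces to an identity mapping, which is exactly the behavior plain stacked non-linear layers struggle to learn.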

VGG
VGG [10], developed by the University of Oxford's Visual Geometry Group, placed second in the classification task at ILSVRC-2014. The most striking feature of this architecture is that it consistently uses convolution layers with 3×3 filters. We employed two of the best performing VGG architectures, VGG 16 and VGG 19, in this experiment. VGG-16 contains 13 convolution layers followed by 3 fully connected layers, while VGG-19 has 16 convolution layers followed by 3 fully connected layers. In this case, we loaded pre-trained VGG-16 and VGG-19 weights and created an output layer with ten dimensions, corresponding to the ten tomato leaf classes.
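The design rationale behind VGG's uniform 3×3 filters is that stacking small kernels enlarges the effective receptive field while keeping parameter counts low. A small sketch (our own illustrative helper) computes this:

```python
def receptive_field(num_layers, kernel=3):
    """Effective receptive field of `num_layers` stacked stride-1 k x k
    convolutions: each additional layer widens the field by (k - 1) pixels."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1
    return rf
```

Two stacked 3×3 layers see a 5×5 region and three see 7×7, matching a single 5×5 or 7×7 filter but with fewer parameters and more non-linearities in between.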

EfficientNet
EfficientNet [12] was introduced to achieve more effective performance by uniformly scaling the width, depth, and resolution parameters using a remarkably effective compound coefficient while keeping the model compact. Unlike most CNN models, which employ ReLU as the activation function, it uses an activation function called Swish. EfficientNet comprises eight models ranging from B0 to B7; as the model number increases, accuracy rises considerably while the number of estimated parameters grows only moderately. In this experiment, we used the largest one, EfficientNet B7. The inverted bottleneck MBConv is the primary building block of EfficientNet. Under similar FLOPS constraints, EfficientNet performs much better than most other neural network models, giving significantly better accuracy. Here, we used the native model architecture to extract features for the output FC layer.
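The compound scaling rule can be written down directly. The sketch below is illustrative (not the paper's code); the coefficient values are the ones reported for the EfficientNet base model:

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """EfficientNet's compound scaling: depth, width and resolution grow as
    alpha**phi, beta**phi and gamma**phi, under the constraint
    alpha * beta**2 * gamma**2 ~= 2, so FLOPS roughly double for each unit
    increase of the compound coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi
```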

Hyperparameters Tuning
Hyperparameters are a set of parameters that influence the model's learning. They include the number of epochs, number of layers, activation function, optimizer, learning rate, etc. The hyperparameter configuration utilized in the second stage of the investigation is detailed below. After multiple trials, the authors advanced the most effective learning rate, optimizer, and activation function to the third stage of the experiment, the K-fold cross-validation procedure.

Learning rate
The learning rate is the hyperparameter that scales how much the model is changed with respect to the estimated error each time the model's weights are updated. Determining a fixed value for the learning rate is strenuous: selecting a very small value can lead to a lengthy training process that may even get stuck, while choosing a value that is too large can lead to an unstable training process or to learning a sub-optimal set of weights too quickly. It may be one of the most important hyperparameters when configuring a neural network. To avoid choosing a learning rate manually for each training session, there are various adaptive gradient descent algorithms, including Adadelta, Adam, and RMSProp. But as this experiment shows, choosing a suitable learning rate is important even for those adaptive methods, especially when working with fewer epochs [22, 23].
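The three regimes described above can be demonstrated on a toy problem. The sketch below (our own illustration, unrelated to the paper's models) runs plain gradient descent on f(w) = w², whose gradient is 2w:

```python
def gradient_descent(lr, steps=50, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent to show how the
    learning rate governs training: a well-chosen rate converges, a tiny
    rate barely moves, and an overly large rate makes the updates diverge."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w  # gradient of w**2 is 2w
    return w
```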

Optimizer
The optimizer plays an important role in iteratively updating the parameters of all the layers while training a deep CNN model [24, 25]. Optimization is crucial in training a neural network, as it is responsible for reducing the loss and producing the most accurate results. Gradient descent is a prominent optimization approach for neural networks, used frequently in linear regression and classification algorithms; the gradient descent algorithm is also responsible for backpropagation in neural networks. Although it is easy to implement and compute, it has a few drawbacks: it may get trapped in local minima, and computing the gradient over the whole dataset requires a large amount of memory. In this article, we worked with stochastic gradient descent, Adam, AdaBound, RMSProp, AdaDelta, AdaGrad, Nadam, and Ftrl to see their effect on our dataset.
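To make the adaptive-optimizer idea concrete, here is a minimal single-parameter Adam update in NumPy. It is a textbook sketch for illustration, not the Keras implementation used in the experiments:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m) and
    of its square (v), bias-corrected, yield a per-parameter step size."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)          # bias correction for the mean
    v_hat = v / (1 - b2 ** t)          # bias correction for the variance
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

AdaBound, the optimizer that performed best in this study, extends this scheme by clipping the effective step size between bounds that converge toward the SGD step, combining Adam's fast early progress with SGD's stable final accuracy.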

Activation Functions
The main job of an activation function, used here as the output classifier, is to sort data into labeled classes or categories. It strongly affects the outcome of deep learning models, including their performance and accuracy; activation functions have a significant influence on the capacity of neural networks and their speed of convergence [26-28]. Moreover, many activation functions squash the output into a bounded range such as -1 to 1 or 0 to 1. Since weights and biases perform only linear transformations, a neural network without an activation function is simply a linear regression model. Activation functions come in a variety of forms, including binary step, linear, ReLU, Sigmoid, and many more. In the second stage of the experiment, we experimented with Softmax, ReLU, SeLU, ELU, Exponential, Nadam, Softsign, Tanh, and Sigmoid.
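Softmax, the classifier that ultimately performed best in this study, can be sketched in a few lines of NumPy (an illustrative implementation, not the Keras one used in the experiments):

```python
import numpy as np

def softmax(z):
    """Map raw class scores to a probability distribution over the classes;
    subtracting the max keeps the exponentials numerically stable."""
    e = np.exp(z - np.max(z))
    return e / e.sum()
```

The outputs are non-negative and sum to 1, so the predicted class is simply the index with the highest probability.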

K-fold Cross-Validation
K-fold cross-validation is a statistical method for measuring the skill of machine learning models. In the third stage of the experiment, the best hyperparameter values acquired in the second stage were assessed using 5-fold cross-validation. This process aims to analyze the performance and interplay of these hyperparameters in enhancing classification accuracy.
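The fold construction can be sketched as follows. This is our own minimal index-based illustration (in practice a library routine such as scikit-learn's `KFold` would be used):

```python
def kfold_splits(n, k=5):
    """Partition n sample indices into k folds; each fold serves once as the
    validation set while the remaining k - 1 folds form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, folds[i]))
    return splits
```

Every sample is validated exactly once across the k runs, so the averaged score depends far less on one lucky (or unlucky) split.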

Algorithms Evaluation
The CNN models considered in this study were executed on a machine equipped with a Ryzen 3600X processor, an AMD Radeon RX 550 GPU, and 16 GB of RAM. All code was written in Python 3.9.5 with the Keras 2.4.3 framework and executed in Jupyter Notebook. For every experiment, we used categorical cross-entropy loss and accuracy metrics for evaluation. A similar layout was used for every model, and each experiment was run for 50 epochs. A dense layer with the Softmax activation function was employed for classification at the output. Adam was the optimizer, with a learning rate of 0.01. The accuracy and loss on the training and validation datasets are shown in Table 2; the recall, precision, and F1 scores shown are weighted averages. The average time (in seconds) taken per epoch is also shown in Table 2.
In the case of DenseNet 121, after 50 epochs we achieved an accuracy of 99.55% on the training set and 99.12% on the validation set. The weighted averages of recall, precision, and F1 score were 0.9912, 0.9911, and 0.9911, respectively. The average time per epoch was 410 seconds. DenseNet 169 performed almost similarly to DenseNet 121; even though this architecture has more layers, it performed worse, and the average execution time per epoch increased to 495 seconds. After 50 epochs it achieved an accuracy of 94.71% with a loss of 19.89%. ResNet 50 came closest to DenseNet 121: its accuracy on the training and validation sets was almost the same, near 98%; after 50 epochs it achieved 98.92% on the training set and 98.76% on the validation set. VGG 16 and VGG 19 performed similarly in terms of validation set accuracy, which was close to 97% for both, and their average times per epoch were also nearly identical. The weighted precision, sensitivity, and F1 score for each type of disease are shown in Table 3.
EfficientNet B7 performed worst in terms of average time per epoch. Whereas the other algorithms took less than 600 seconds per epoch, EfficientNet took more than double that, around 1400 seconds. Moreover, its accuracy on the validation set was the second lowest of the bunch. Similar trends can be seen in its training set accuracy, loss, precision, recall, and F1 score.
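The weighted-average metrics reported above follow directly from the per-class confusion matrix. A minimal sketch of the computation (our own helper, equivalent in spirit to scikit-learn's `average='weighted'` option):

```python
import numpy as np

def weighted_metrics(cm):
    """Weighted-average precision, recall and F1 from a confusion matrix
    (rows = true class, columns = predicted class): per-class scores are
    averaged using each class's support (row sum) as the weight."""
    tp = np.diag(cm).astype(float)
    support = cm.sum(axis=1)
    pred = cm.sum(axis=0)
    prec = np.divide(tp, pred, out=np.zeros_like(tp), where=pred > 0)
    rec = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    f1 = np.divide(2 * prec * rec, prec + rec,
                   out=np.zeros_like(tp), where=(prec + rec) > 0)
    weights = support / support.sum()
    return (weights * prec).sum(), (weights * rec).sum(), (weights * f1).sum()
```

Weighting by support matters here because the tomato dataset is imbalanced (e.g., 1,591 healthy images against larger disease classes).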
The graph below (Figure 1) depicts the accuracy and loss of the models on classifying the tomato leaf diseases.

Performance Metrics Evaluation over Hyperparameters
From the above analysis, we can see that DenseNet 121 surpassed the other pre-trained models for tomato leaf disease diagnosis. For further analysis, we tweaked different parameters to find out whether the learning rate, optimizer, or activation function had an impact on the overall effectiveness of the DenseNet architecture, as depicted in Table 4, and if so, which values of these hyperparameters are optimal for the DenseNet 121 model. For that, we first varied the learning rate: starting from 0.002, we gradually increased it, and as the results worsened we stopped and then gradually decreased it down to 0.0009. We then selected the learning rate at which the pre-trained model performed best. Next, we tried other popular optimizers and analyzed the results; finally, we selected the best-performing optimizer and tried different classifiers. The results of all of these are given in Table 4.
As Table 4 shows, for the learning rate there is a range (or a fixed point) within which the algorithm performs well; above it the accuracy decreases, and below it the accuracy also drops. In our experiment, we observed the worst results when the learning rate was increased to 1: accuracy dropped below 29%, and the F1 score was just 13%. The algorithm performed best at a learning rate of 0.01, where accuracy was just above 99%, the loss was 0.046, and the F1 score was also above the 99% mark. In the case of optimizers, seven out of nine algorithms scored more than 95%. Among them, AdaBound's accuracy was the highest at 99.31%, just above Adam's 99.12%; its loss was also lower than Adam's, and its precision, recall, and F1 score were 0.99506. RMSProp also performed well, with an accuracy of 99.04%.
Among all the optimizers tested, Ftrl had the worst performance, with accuracy just above 28% and an F1 score of just 0.1259. After selecting 0.01 as the learning rate and AdaBound as the optimizer, we tested different activation functions. In total, four activation functions scored more than 90%: Softmax, Softplus, Nadam, and Sigmoid. Among them, Softmax scored the highest, while Tanh scored the least in terms of accuracy with just 9.85%. So, overall, we found optimal results when the learning rate is 0.01, the optimizer is AdaBound, and the activation function is Softmax.
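The procedure above (fix the best learning rate first, then the optimizer, then the activation function) amounts to a greedy one-hyperparameter-at-a-time search. A minimal sketch under our own assumptions, where `evaluate` is a hypothetical callback standing in for training and scoring the model:

```python
def coordinate_search(evaluate, grids, start):
    """Greedy coordinate search over hyperparameters: for each hyperparameter
    in turn, try all candidate values with the others fixed, keep the best,
    and move on. `evaluate` maps a config dict to a score to maximize."""
    config = dict(start)
    for name, values in grids.items():
        best_val, best_acc = config[name], float("-inf")
        for v in values:
            acc = evaluate({**config, name: v})
            if acc > best_acc:
                best_val, best_acc = v, acc
        config[name] = best_val
    return config
```

This is far cheaper than an exhaustive grid search (sum of the grid sizes rather than their product), at the cost of possibly missing interactions between hyperparameters, which is why the selected combination is re-checked with cross-validation in the third stage.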
Table 4 shows 3 combinations of hyperparameters that produced the maximum accuracy, above 99% in this case: (i) the AdaBound optimizer and Softmax classifier with a learning rate of 0.01, (ii) the Adam optimizer and Softmax classifier with a learning rate of 0.01, and (iii) the AdaBound optimizer and Softplus classifier with a learning rate of 0.01. To check the reliability of these results, we further performed five-fold cross-validation on the dataset using the hyperparameters that exhibited the highest performance metric scores according to Table 4. The result is shown in Table 5 below. Here we can see that, for a learning rate of 0.01 with the AdaBound optimizer and Softmax classifier, the accuracy was more than 97% in all 5 folds and more than 99% in three out of five folds, reaching as high as 99.39% in the second fold. When we repeated the experiment for learning rates of 0.001 and 0.1, the results were much worse; for the learning rate of 0.1 in particular, the accuracy was below 60%, and in three out of five folds it was even below 30%. The other two combinations, the Adam optimizer with the Softmax classifier at a learning rate of 0.01 and the AdaBound optimizer with the Softplus classifier at a learning rate of 0.01, did not score more than 99% accuracy; some folds of these combinations even scored close to 92%. Therefore, we found that the model performs best with a learning rate of 0.01, the AdaBound optimizer, and the Softmax classifier.

K-fold Cross-Validation
Figure 3 below depicts the model accuracy and model loss per epoch while the model was learning from the dataset with a learning rate of 0.01, the AdaBound optimizer, and the Softmax classifier. As the second diagram shows, the model loss did not change much after the 7th or 8th epoch and stayed almost the same as the training set loss. The accuracy fluctuated considerably before stabilizing around the 35th or 36th epoch, after which it was almost the same as the training set accuracy.
So, we can confirm that DenseNet indeed performs best with a learning rate of 0.01, the Softmax classifier, and the AdaBound optimizer.

Conclusions
This paper analyzed networks based on the pre-trained deep convolutional networks DenseNet, ResNet, EfficientNet, and VGG. In the first step, we compared those networks using the Adam optimizer, a 0.01 learning rate, and the Softmax activation function; the highest result was achieved with DenseNet. A performance evaluation was then done with the different optimizers, learning rates, and classifiers that affected the results of DenseNet. We found that a range of learning rates between 0.001 and 0.1 gives good results, while values above and below it are not effective. Among activation functions, the best results were obtained with Softmax and Softplus. When different optimizers were evaluated, Adam, AdaBound, and RMSProp performed well. The best overall result was observed with a 0.01 learning rate, the Softmax activation function, and the AdaBound optimizer. Our study reveals a significant insight: there is a relationship among learning rate, optimizer, and classifier in improving detection accuracy. In the third part of the experiment, a K-fold cross-validation check further justified those parameters. Using the most effective deep CNN hyperparameters realized here, this work might be extended to a variety of leaf disease detection applications. Although this study obtained high detection accuracy, performance evaluation with multiple hyperparameters consumes a substantial amount of time and computing power. In the future, CNN pruning strategies may be explored to address this issue.