A Leading but Simple Classification Method for Remote Sensing Images

Abstract: Recently, researchers have proposed many deep convolutional neural network (CNN) approaches to the difficult semantic classification (SC) task of remote sensing images (RSI), but these approaches have obvious flaws. In this paper, the author proposes a simple method that aims to provide a leading yet efficient solution by using a lightweight EfficientNet-B0. First, the paper summarizes the drawbacks of previous methods through a mathematical analysis and draws a qualitative conclusion on their theoretical performance based on theoretical derivation and experiments. Following that, the paper designs a novel method named LS-EfficientNet, consisting only of a single CNN and a concise training algorithm called SC-CNN. Unlike previous complex and hardware-intensive methods, the proposed method focuses on long-neglected problems, including overfitting, the data distribution shift caused by data augmentation (DA), improper use of training tricks, and other incorrect operations on a pre-trained CNN. Compared to previous studies, the proposed method is easy to reproduce because all the models, training tricks, and hyperparameter settings are open-sourced. Extensive experiments on two benchmark datasets show that the proposed method surpasses all the previous state-of-the-art methods, with an accuracy lead of 0.5% to 1.2% and a parameter decrease of 78% compared to the best prior method in 2022. In addition, ablation results show that the proposed combination of training tricks, namely online label smoothing (OLS) and CutMix, boosts a CNN's accuracy for RSI-SC by 1.0%. The results indicate that a single lightweight CNN can handle the routine task of classifying RSI well.


Introduction
Remote sensing (RS) is an important technique for Earth observation, and the acquired imagery is big data containing spatial, spectral, and graphic information. Machine learning (ML) plays a central role in interpreting RSI to meet the domain-specific requirements for real-time and automated processing. Recently, with the rise of deep learning (DL), deep CNNs have dominated the recognition tasks of RSI. Among all the applications, SC is the foundation of the others. Because SC is a research hotspot, researchers have proposed different algorithms to improve performance, though most of them are suboptimal.
At the beginning, CNNs were only used as fixed feature extractors without effective training on RSI datasets [1,2], so their performance was poor. Then, with improved training on RSI, fusion strategies combining deep and human-engineered features [3,4], as well as methods that redesign the loss function [5,6], were proposed. These two roadmaps show a clear advance over the fixed extractor because of the different training. Following that, researchers tended to employ increasingly complicated and hardware-intensive pipelines in search of better performance [7], though the final benefit is very limited. As the vision transformer (VT) arose [8], more and more methods turned to this new architecture without considering its larger parameter size. Given the rich application scenarios of natural images, the larger size of the VT is acceptable there. However, RS applications mainly serve the public welfare, and, more importantly, many tasks run on embedded systems, which have strict hardware limitations. Hence, compared to the VT, a CNN with fewer parameters is preferable if its performance can surpass the VT.
AETiC 2023, Vol. 7, No. 3, www.aetic.theiaer.org
Generally, CNNs are more discriminative if their representation consists of more invariant features. Currently, CNN-based techniques for RSI-SC commonly employ models pre-trained on natural images from ImageNet-1K. The inherent difference of RSI requires a fine-tuning process for feature extraction. However, to the author's knowledge, the previous methods went in the wrong direction in four respects. First, some methods ignore or inappropriately implement the fine-tuning process. Natural images share common invariant features with RSI, although an inherent difference also exists. Yet many previous methods employed improper training strategies, e.g., an oversized learning rate (LR), which makes the pre-trained model discard valuable general features and overfit on the small RSI dataset. Second, some previous methods modified a pre-trained model's structure without retraining it on ImageNet-1K, which also leads to performance degradation or overfitting. Third, many training tricks developed on ImageNet-1K can greatly boost a CNN's performance, but most of the previous methods proposed for RSI-SC simply copied a trick without rechecking its usability. Fourth, data augmentation (DA) does improve a model's performance with finite training samples, but datasets processed by DA have a clear shift in data distribution compared to the original ones. In other words, a model will achieve suboptimal performance if the shift is not well handled [9]. To the author's best knowledge, this problem has been ignored in previous studies. More importantly, the arbitrary training algorithms that have long existed in the literature create a dilemma: too much randomness is entangled with the findings of previous studies, so it is hard to tell what is really meaningful without an appropriate training procedure.
To solve this problem, at the beginning of this study, the author analyzes the previous methods in depth and shows from mathematical theory that many of the complicated and hardware-intensive methods are much less likely to achieve better performance because of improper algorithm choices. In the subsequent sections, the study proposes a novel CNN-based method for RSI-SC. Different from previous ones, the proposed method, called the leading and simple EfficientNet (LS-EfficientNet), only has a single EfficientNet-B0 with a much smaller size of 5.3 million (M) parameters [10]. The method still employs transfer learning but handles the above problems well through a concise training algorithm. Extensive experiments on two benchmark datasets show that LS-EfficientNet clearly outperforms all the previous methods. The author summarizes the study's contributions as below:
First, the author proposes a leading but efficient method for RSI-SC. It outperforms all the CNN-based methods published before 2023 with the fewest parameters and without complicated architecture modifications. It is easy to reproduce because the model is off-the-shelf.
Second, the author rechecked the availability of two training tricks developed on ImageNet-1K and redesigned their usage to boost the CNN's performance for RSI-SC. The rebuilt tricks can lift the model's accuracy by approximately 1.0%, and the hyperparameter settings are open source.
Third, the study reveals some fundamental mistakes in the CNN-based methods for RSI-SC. Taking all these findings together, we can see that all the previous CNN-based methods may have had suboptimal performance compared to their potential capabilities.

Related Works
With add-in attention modules, a CNN model can obtain better performance on ImageNet-1K. Based on this effective but hardware-cheap technique, researchers have proposed different methods by adding attention modules to pre-trained CNNs for RSI-SC. E.g., Tong et al. [11], Guo et al. [12], and Guo et al. [13] designed spatial or channel attention modules to boost a single pre-trained CNN's performance. Li et al. [14] employed attention maps to guide a pre-trained CNN to learn so-called discriminative representation. Alhichri et al. [15] also proposed another method by using a pre-trained EfficientNet-B3 model with built-in attention modules. With the help of the attention mechanism, all these methods using a single CNN show obvious improvements over the ones using feature fusion or modified loss functions.
Following that, more and more complicated and hardware-intensive methods emerged, though the advance is not obvious. E.g., Tang et al. [16] employed two CNNs with attention modules in a parallel pipeline and trained the models using a concatenated loss function. Sun et al. [17] designed a cascaded method consisting of a CNN, a gated bidirectional network, and a classifying module. Zhang et al. [18] combined a CNN with a CapsNet in series. Li et al. [19] proposed a multi-process method that first extracts deep features with a pre-trained CNN, then refines the extracted features with attention modules, and finally feeds the refined features to deep gated recurrent units. Chen et al. [20] fused five so-called context modeling blocks with a DenseNet-121. Putting the limited improvements aside, all these methods consist of multiple models or complex modifications to the model's architecture. They obviously require a larger hardware budget or are hard to reproduce due to a lack of open-source code.
As another effective and well-known technique, an ensemble of CNNs is also tested for RSI-SC. E.g., Minetto et al. [21] proposed a CNN ensemble consisting of twelve independent CNNs. Zhao et al. [22] proposed a compact ensemble consisting of a CNN backbone with multiple attention module branches. These two methods both have different classifiers in the ensemble, but the individual members of the models or modules are improperly trained on RSI or not retrained on ImageNet-1K. In theory, an ensemble is more effective only when each individual classifier in the ensemble is accurate and diverse. Therefore, the two ensembles' performances have shown a temporary lead over the single CNN-based method but are still suboptimal.
Speaking from another viewpoint, the training algorithm is also crucial for CNN's performance. E.g., He et al. [23] conclude that a set of training tricks, including label smoothing [24], Mixup [25], and so on, are very meaningful to the CNN's performance. These tricks were employed in [11,13,20] and demonstrated to be effective for RSI-SC. However, in the previous methods, the trick's usage was simply copied from ImageNet-1K without any modification. In fact, label smoothing is commonly used in CNN's training as regularization, but it sets an equal value for all subclasses, with the difference in similarity ignored. Despite the smaller category similarity in natural images, an alternative online label smoothing (OLS) technique proposed by Zhang et al. [26] clearly improved CNN's performance on ImageNet-1K by dynamically updating the subclass soft label. Similarly, as regularization in CNN's training, Mixup overlays different image patches on a training sample unnaturally without moving the original overlapped part of the original image. Yun et al. [27] proposed an alternative algorithm called CutMix that shows improved training efficiency. Hence, taking the inherent difference between RSI and the advance shown by updated training tricks into account, it is reasonable and meaningful to make a thorough analysis of the usability of tricks before applying them to RSI-SC.
In addition, a CNN's training process commonly uses DA to boost a model's performance, but DA also shifts the data distribution away from that of the original dataset. Touvron et al. [9] demonstrate a solution for ImageNet-1K based on fine-tuning or empirical correction. This compromise is worth considering for large-scale datasets but is costly for RSI. Tan et al. [28] also propose a progressive learning method that employs more intensive regularization as the training goes deeper. Similarly, this option is practicable only if computing resources are sufficient. Previous studies for RSI-SC commonly employed DA in training; however, no work has mentioned handling the shifting problem. E.g., Zhang et al. [29] proposed an optimized training strategy using multi-size images with triplet loss for RSI-SC, but with the shifting problem uncorrected, the method's accuracy showed little improvement. Therefore, based on the above problems, the author proposes a novel training algorithm named shifting corrected CNN (SC-CNN), and its uniqueness is presented as follows: First, the training procedure is a continuous pipeline but can be viewed as two steps based on the different DAs and regularizations in training. Second, both steps have similar routine geometric transformations but different regularizations. Third, based on the inherent difference of RSI, the usage of training tricks in SC-CNN is rechecked first and then set with hyperparameters different from the ones developed on ImageNet-1K. Last, SC-CNN does not include the ideas proposed in [9,28], and it is totally different from the other previous methods.

Mathematical Basis
Let $x$ be an RS image and $y$ be its category label; then an RSI set of $N$ samples can be described in the form:
$$D = \{(x_i, y_i)\}_{i=1}^{N} \tag{1}$$
Using this notation, the relationship between $x$ and $y$ can be defined as follows:
$$y = f(x) \tag{2}$$
Since different CNN architectures can be used in the classification, we can treat each CNN model as a hypothesis of the true function $f$. In training, we employ the back-propagation algorithm to minimize the loss of a CNN prediction. In such cases, we can treat the training algorithm as a search program for the optimal solution to a certain hypothesis.
In ML, we commonly use finite iteration steps to fit the training dataset. As the dataset gets much larger in DL, we employ a mini-batch of samples and take the batch's mean value to optimize the training model. Therefore, unlike the shallow model used in ML, we always achieve many local optimal solutions for CNNs due to a lack of exhaustion.
Currently, deep CNN architectures commonly consist of a cascade of convolution layers. Each layer has different convolution kernels, and the values assigned to these kernels are the main parameters of a CNN. Let us suppose that, in the simplest case, each kernel parameter can only be set to 0 or 1. If a layer has $n$ kernels, then the total number of kernel states $S$ of the layer can be described as follows:
$$S = 2^{n} \tag{3}$$
Let us suppose that a CNN has $l$ layers and that, in the simplest case, all the layers have the same structure. Then the total number of states $S_{\mathrm{CNN}}$ in this CNN can be described as follows:
$$S_{\mathrm{CNN}} = S^{\,l} = 2^{nl} \tag{4}$$
In fact, at each iteration step, the training algorithm can only move all the parameters of a CNN to one particular combination of states. Hence, the search space of the training algorithm is linearly correlated with $S_{\mathrm{CNN}}$. As a CNN's size grows, e.g., as the number of layers gets larger, Eq. (4) shows that the search space in which the training algorithm looks for local optimal solutions grows exponentially. To tackle this problem, we generally have two choices. First, we can choose a pre-trained model, which gives a good starting point for the search. Second, we can choose a larger LR, which gives a fast but saltatory search that may miss many optimal solutions.
Recently, attention modules have been widely used in CNNs. The well-known channel attention module, squeeze-and-excitation (SE) [30], works as follows.
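Before turning to the SE module, the growth of the search space with depth can be illustrated numerically. The sketch below is only an illustration under the simplifying assumptions of the text (binary parameters, identical layers) plus one further assumption that is not in the text: each kernel contributes a single binary parameter. The function names are hypothetical.

```python
def kernel_states(num_kernels: int) -> int:
    """Joint states of one layer when every kernel parameter is either 0 or 1."""
    return 2 ** num_kernels


def cnn_states(num_kernels: int, num_layers: int) -> int:
    """Joint states of a CNN built from identical layers (product over the layers)."""
    return kernel_states(num_kernels) ** num_layers


if __name__ == "__main__":
    # Doubling the depth squares the number of states.
    for layers in (1, 2, 4, 8):
        print(layers, cnn_states(num_kernels=3, num_layers=layers))
```

Even in this toy setting, each added layer multiplies the number of states by a constant factor, which is the exponential growth that a pre-trained starting point is meant to sidestep.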
Let $X \in \mathbb{R}^{C \times H \times W}$ be the original feature map of a certain layer, with $C$ channels, a height of $H$, and a width of $W$. Let $\tilde{X} \in \mathbb{R}^{C \times H \times W}$ be the feature map transformed by the SE module. Let $F_{sq}$ be the squeeze operation and $F_{ex}$ be the excitation operation. Then the SE working pipeline can be described as follows:
$$\tilde{X} = F_{ex}(F_{sq}(X)) \otimes X \tag{5}$$
In detail, $F_{sq}$ squeezes $X$, and its output $X' \in \mathbb{R}^{C \times 1 \times 1}$ has a height and width of $1 \times 1$. Afterwards, $F_{ex}$ applies an excitation to $X'$, and its output $X'' \in \mathbb{R}^{C \times H \times W}$ has the same shape as $X$. In training, $X''$ becomes a weight map with larger and smaller values. According to Eq. (5), $X$ is then weighted by the multiplication. In the back-propagation process, a feature value close to zero carries less important information. In other words, the SE module makes the input feature map only partially meaningful for a CNN. Under this condition, according to Eq. (4), the search space of a CNN with built-in attention modules is pre-marked through pre-training. Therefore, with finite training steps, a CNN with attention modules can be more discriminative only if the dataset has significant features.
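The squeeze, excitation, and channel-wise multiplication described above can be sketched in plain Python. This is a minimal illustration, not the EfficientNet-B0 implementation: it assumes the usual SE design of global average pooling followed by a two-layer bottleneck (ReLU, then sigmoid), and the weights `w1`, `b1`, `w2`, `b2` and all function names are hypothetical.

```python
import math


def squeeze(x):
    """Global average pooling: a C x H x W feature map becomes C channel means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def dense(v, weights, bias):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]


def excite(z, w1, b1, w2, b2):
    """Bottleneck MLP (ReLU, then sigmoid) giving one gate in (0, 1) per channel."""
    hidden = [max(0.0, h) for h in dense(z, w1, b1)]
    return [1.0 / (1.0 + math.exp(-g)) for g in dense(hidden, w2, b2)]


def se_block(x, w1, b1, w2, b2):
    """Rescale every channel of x by its gate (channel-wise multiplication)."""
    gates = excite(squeeze(x), w1, b1, w2, b2)
    return [[[g * v for v in row] for row in ch] for ch, g in zip(x, gates)]


if __name__ == "__main__":
    x = [[[1, 2], [3, 4]], [[0, 0], [0, 0]]]  # 2 channels, 2 x 2 each
    print(se_block(x, [[1.0, 1.0]], [0.0], [[1.0], [-1.0]], [0.0, 0.0]))
```

Channels whose gate is close to zero are suppressed in the output, which is the sense in which the module marks part of the feature map as less important.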
Based on this explanation, however, if the attention modules have no pre-training and are initialized at random, the search space will be the same as that of the CNN without attention. Hence, looking at previous studies in [11-15], we can see that the methods without pre-training on ImageNet-1K will probably be suboptimal. Taking previous studies [16-20] into account, according to Eq. (4), these methods commonly face a much larger search space when multiple models are combined in series. More importantly, these methods commonly train the combined models simultaneously with a single loss function, resulting in a poor probability of reaching optimal solutions. Hence, based on mathematical theory, these hardware-intensive strategies are unnecessary if the single model works well.
A CNN's classification ability relies on the various patterns in datasets. Once a model is chosen, its capacity for patterns is fixed. Let the probability of a sample belonging to a subclass be $p \in [0,1]$; then we can describe the model's prediction $P$ as follows:
$$P = \sum_{i=1}^{m} w_i \, p_i \tag{6}$$
where $m$ denotes all the learned patterns contained in a model and $w_i \in [0,1]$ is the weight of a pattern that determines its contribution to a certain subclass.
A CNN recognizes different patterns through a combination of parameters. That is, the parameters of the convolution kernels control the extraction of features. Using the same pattern weight $w$ as in Eq. (6), we can describe the feature extraction as follows:
$$x_l = F_l(x_{l-1};\, w) \tag{7}$$
where $x_{l-1}$ denotes the previous layer's output, $F_l$ denotes the convolutional operation of the current layer, and $l$ corresponds to the number of layers.
Using the same symbols as in Eqs. (6) and (7), the changing process of a model's prediction during training is described by Eq. (8). As the training algorithm updates the kernel parameters, a CNN can extract more local invariant features from an RSI dataset, and these gradually replace the ones learned through pre-training on ImageNet-1K. As the fitting process goes further, a CNN's prediction becomes more accurate only if the local features are more general and discriminative. However, if the training and testing datasets have very different distributions, overfitting occurs.
Currently, CNN-based methods rely entirely on weights pre-trained on ImageNet-1K to conduct RSI-SC. The reason lies in two facts. First, the features of large-scale datasets are more general. Second, the RS domain lacks large-scale datasets. Compared to the one million samples in ImageNet-1K, the two benchmark RSI datasets, the Aerial Image dataset (AID) and the Northwestern Polytechnic University Remote Sensing Image Scene Classification 45 dataset (NWPU), only have 10,000 and 31,500 samples, respectively [31]. Therefore, a good training algorithm should avoid overfitting on the smaller RSI datasets. In addition, some researchers believe that the features from ImageNet-1K have a great domain gap with RSI. Nonetheless, we can easily evaluate the pre-trained model's ability for RSI-SC with a fast test. The experiment freezes all the convolution layers of an EfficientNet-B0 and leaves only the classifier trainable. Then it trains the model on the RSI datasets at a fixed training ratio (TR) to obtain the overall accuracy (OA). As shown in Figure 1, the evaluation results on AID and NWPU, covering four different TRs, show that the model pre-trained on ImageNet-1K has an acceptable accuracy of approximately 75% for RSI-SC. Therefore, based on all the above explanations in mathematical theory, the author designed a concise and simple method consisting of a lightweight EfficientNet-B0 model with built-in attention modules and two modified training tricks used as regularization in training to avoid overfitting.
The proposed method's framework is illustrated in Figure 2. The algorithm's whole pipeline, as shown in Figure 2, is a continuous procedure of 300 training epochs in total, but it can be viewed as two successive steps according to the different DA and regularizations used in training.
In detail, the training process starts with the red arrows corresponding to Step 1, called coarse training; then, at epoch 61, the training continues with the green arrows corresponding to Step 2, named fine training. In Step 1, the pre-trained EfficientNet-B0 is coarsely trained on RSI datasets for 60 epochs; in Step 2, the model inherits the weights that gave the best OA in Step 1 and keeps training on RSI datasets for another 240 epochs. The main differences between Step 1 and Step 2 are the number of training epochs, the DA, and the regularizations, which are presented in the subsequent sections.

DA Strategies
The proposed method employs four kinds of routine transformations in a cascaded combination, and all the transformations are implemented via the PyTorch libraries. In Step 1, the DA consists of color jitter, horizontal flip, vertical flip, and rotation, applied in turn. In Step 2, it consists only of horizontal flip, vertical flip, and rotation.

Regularization
The proposed method employs the modified OLS and the modified CutMix as regularization. The former is used throughout the whole pipeline, whereas the latter is used only in Step 2.
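The two-step arrangement of DA and regularization can be summarized as a small lookup. This is a sketch for illustration only; the function name, the string labels, and the 1-based epoch index are assumptions, while the epoch counts and the contents of each step follow the description above.

```python
STEP1_EPOCHS = 60    # coarse training
TOTAL_EPOCHS = 300   # coarse training plus fine training


def step_config(epoch: int) -> dict:
    """Return the DA and regularization used at a 1-based epoch index."""
    if not 1 <= epoch <= TOTAL_EPOCHS:
        raise ValueError("epoch must lie in [1, 300]")
    geometric = ["horizontal_flip", "vertical_flip", "rotation"]
    if epoch <= STEP1_EPOCHS:
        # Step 1: color jitter plus geometric transformations, OLS only.
        return {"step": 1, "augment": ["color_jitter"] + geometric, "regularize": ["OLS"]}
    # Step 2: geometric transformations only, OLS together with CutMix.
    return {"step": 2, "augment": geometric, "regularize": ["OLS", "CutMix"]}


if __name__ == "__main__":
    print(step_config(60))
    print(step_config(61))
```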

OLS Settings
The OLS quantifies the difference in similarity among subclasses by dynamically updating the soft label in training. The algorithm, as shown in [26], initializes the learnable soft labels at zero. Hence, the training loss still needs a traditional hard label to improve the speed of convergence. Let $L$ denote the final loss in training; then it can be described as follows:
$$L = \alpha L_{hard} + (1 - \alpha) L_{soft} \tag{9}$$
where $L_{hard}$ and $L_{soft}$ are the losses computed with the hard and soft labels, respectively, and $\alpha$ is the hyperparameter that balances them. Zhang et al. [26] proposed an empirical value of 0.5 for $\alpha$. Here, considering the larger similarity among RSI categories and based on extensive experiments, the author sets $\alpha$ to an empirical value of 0.9.
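The balance between the hard and soft losses can be sketched as follows. This is a minimal illustration that assumes both losses are cross-entropies and that the balancing hyperparameter weights the hard-label term; the function names are hypothetical, and the online update of the soft labels in [26] is not reproduced here.

```python
import math


def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a target distribution."""
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)


def ols_loss(probs, hard_label, soft_label, alpha=0.9):
    """Weighted sum of the hard-label and soft-label losses (alpha on the hard term)."""
    hard = cross_entropy(probs, hard_label)
    soft = cross_entropy(probs, soft_label)
    return alpha * hard + (1 - alpha) * soft


if __name__ == "__main__":
    probs = [0.7, 0.2, 0.1]        # model prediction for one sample
    hard = [1.0, 0.0, 0.0]         # one-hot label
    soft = [0.8, 0.15, 0.05]       # illustrative soft label, not a learned one
    print(ols_loss(probs, hard, soft, alpha=0.9))
    print(ols_loss(probs, hard, soft, alpha=0.5))
```

With the larger value of 0.9, the loss stays close to the ordinary hard-label cross-entropy, and the soft label acts as a mild correction.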

CutMix Settings
The CutMix algorithm, as shown in [27], first randomly cuts a patch from an A-class image and then replaces an equal area of a B-class image with the cut patch. Let $\tilde{y}$ denote the cut-and-mixed image's label; then it can be described as follows:
$$\tilde{y} = \beta \, y_A + (1 - \beta) \, y_B \tag{10}$$
where $y_A$ and $y_B$ are the labels of the two images and $\beta$ is a hyperparameter equal to the ratio of the cropped area to the original one. The operation is applied with an occurrence probability $\rho$, which is another hyperparameter. Yun et al. [27] proposed the beta distribution function to obtain the value of $\beta$ and an empirical value of 0.9 for $\rho$. In this paper, however, the author uses the same method to obtain $\beta$ but sets the value of $\rho$ at 0.1. The reason can be simply explained as follows: the cut-and-mixed samples, as shown in Figure 3, show a larger difference between RSI and ImageNet-1K. Looking at the top of Figure 3, the algorithm randomly cuts a bee-class patch and mixes it with an image of ants, and then, according to Eq. (10), the label of the cut-and-mixed image is 0.2 bees and 0.8 ants, which may be fine. At the bottom of Figure 3, however, the confidence level is obviously lower if the label of the cut-and-mixed image is 0.2 of beach. Extensive experiments show that the model will be suboptimal if we use a larger occurrence probability for the cut-and-mix operation.
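The cut-and-mix operation and its label can be sketched on toy single-channel images stored as nested lists. This is an illustration under stated assumptions rather than the author's implementation: the area ratio is drawn from Beta(1, 1), the box is clipped at the image borders, the label uses the area actually pasted, and all names are hypothetical.

```python
import random


def sample_box(height, width, ratio, rng):
    """Pick a box covering roughly `ratio` of the image area, clipped to the borders."""
    cut_h, cut_w = int(height * ratio ** 0.5), int(width * ratio ** 0.5)
    cy, cx = rng.randrange(height), rng.randrange(width)
    top, left = max(cy - cut_h // 2, 0), max(cx - cut_w // 2, 0)
    bottom, right = min(cy + cut_h // 2, height), min(cx + cut_w // 2, width)
    return top, left, bottom, right


def cutmix(img_a, label_a, img_b, label_b, prob=0.1, beta_param=1.0, rng=random):
    """With probability `prob`, paste a patch of img_a onto img_b and mix the labels."""
    if rng.random() >= prob:
        return [row[:] for row in img_b], list(label_b)  # sample left unchanged
    height, width = len(img_b), len(img_b[0])
    ratio = rng.betavariate(beta_param, beta_param)
    top, left, bottom, right = sample_box(height, width, ratio, rng)
    mixed = [row[:] for row in img_b]
    for r in range(top, bottom):
        mixed[r][left:right] = img_a[r][left:right]
    area = (bottom - top) * (right - left) / (height * width)  # pasted fraction
    label = [area * a + (1 - area) * b for a, b in zip(label_a, label_b)]
    return mixed, label


if __name__ == "__main__":
    ones = [[1] * 8 for _ in range(8)]
    zeros = [[0] * 8 for _ in range(8)]
    image, label = cutmix(ones, [1.0, 0.0], zeros, [0.0, 1.0], prob=1.0)
    print(label)
```

Setting `prob` to 0.1 means that about nine samples in ten keep their original image and label, which limits how often low-confidence mixed labels are presented to the model.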

Model Architecture
The proposed method employs EfficientNet-B0 as the single model, which is the smallest of the EfficientNets with significantly fewer parameters (5.3 M). The architecture of EfficientNet-B0, as shown in [10], has built-in SE blocks. In this paper, the model's default settings, including architecture, dropout, and stochastic depth, are unchanged. Given the pre-trained EfficientNet-B0, only its last classifier is reset according to the number of subclasses in the RSI datasets.
Algorithm 1. SC-CNN (surviving steps): in each training step, the parameters are updated through back-propagation; if Acc is the best, Acc is saved in Results; after the final epoch, Results is returned.
The SC-CNN algorithm, as shown in Algorithm 1, is a typical transfer learning strategy written in Python. Given the same resolution of 256 × 256 for training and testing, the number of training epochs in Step 1 is 60, while that in Step 2 is 240. Note that the epoch settings are empirical values according to the evaluation result in Figure 1.
The method employs cross-entropy as the objective function. The optimizer is AdamW [32], with a weight decay of 1E-06. In both Steps 1 and 2, the initial learning rate is 1E-04 with cosine decay, and the maximum number of iterations for the cosine decay is 60 and 240, respectively. The training mini-batch size is fixed at 30 for all datasets.
This study employs two RSI datasets as benchmarks, AID and NWPU, and samples from each category are shown in Figures 4 and 5. More details about these two datasets can be found in [31]. For a fair comparison, the TRs are the same as in previous studies: 20% and 50% for AID, and 10% and 20% for NWPU. All the training and testing subsets are chosen at random.
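The learning-rate setting of the two steps can be sketched as a cosine schedule that restarts when Step 2 begins. The sketch assumes that the rate decays toward zero within each step and is updated once per epoch; neither detail is stated in the text, and the names are hypothetical.

```python
import math

BASE_LR = 1e-4            # initial learning rate of both steps
STEP_EPOCHS = (60, 240)   # lengths of Step 1 and Step 2


def learning_rate(epoch: int) -> float:
    """LR at a 0-based epoch index; the cosine decay restarts at Step 2."""
    start = 0
    for length in STEP_EPOCHS:
        if epoch < start + length:
            t = epoch - start  # position inside the current step
            return 0.5 * BASE_LR * (1 + math.cos(math.pi * t / length))
        start += length
    raise ValueError("epoch lies beyond the 300-epoch schedule")


if __name__ == "__main__":
    for epoch in (0, 30, 59, 60, 180, 299):
        print(epoch, learning_rate(epoch))
```

The restart at the first epoch of Step 2 returns the rate to 1E-04, which matches the statement that both steps share the same initial learning rate.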

Evaluation Criteria
This study employs the OA and the confusion matrix [31] as criteria for performance evaluation. Let $N_c$ be the total number of accurately classified samples and $N_t$ be the total number of tested samples; then the OA can be described as follows:
$$OA = \frac{N_c}{N_t} \tag{11}$$
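Both criteria can be computed from the true and predicted labels of the test set. The following is a minimal sketch with hypothetical function names and toy labels; it is not tied to the author's evaluation code.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[i][j] counts samples of true class i that were predicted as class j."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def overall_accuracy(matrix):
    """Correctly classified samples (the diagonal) divided by all tested samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 0]
    cm = confusion_matrix(y_true, y_pred, num_classes=3)
    print(cm, overall_accuracy(cm))
```

Because the denominator is the number of tested samples, a smaller test set makes each misclassified or mislabeled sample weigh more heavily in the OA.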

Hardware and Software Environments
The experiments were performed on four personal computers, each equipped with a single RTX 2060 GPU. PyTorch 1.11.0 is installed on Windows 10. All the experimental results were averaged over five runs.
The model's fitting curves on AID and NWPU are shown in Figures 6 and 7, in which the former shows the training loss and the latter shows the testing accuracy.

Fitting Curves
The model's loss curves, as shown in Figure 6, both show a fast decline in Step 1 but go oscillatory in Step 2 (marked with green rectangles). In Step 1, the label function is Eq. (9), but in Step 2, the function is Eq. (10). Therefore, the loss function changes as the number of training epochs surpasses 60. At the first several epochs, we can see that the loss curves show a fast increase from small values but then decrease clearly in the subsequent epochs (marked with green rectangles). The OLS algorithm, as mentioned before, initializes its soft labels for each category with zero and then dynamically updates the labels as training goes deeper. Hence, with these rapidly declining losses, we can see that the soft label generated by OLS is adaptive to different datasets.
Looking at the model's accuracy curves in Figure 7, we can see that the model converges rapidly from the first training epochs (marked with black rectangles), but its accuracy decreases slightly at epoch 60 (marked with green rectangles); afterwards, both accuracy curves climb higher in the following epochs. The cut-and-mixed samples, as explained in Figure 3, have two subclass labels, but the model's prediction may not give larger probabilities for these two subclasses due to the similarity among RSI categories. Therefore, we can still find notable gains in accuracy, though the loss curves have obviously rebounded. These results indicate that, with CutMix as regularization in Step 2, the model is forced to learn more discriminative features when the cut-and-mixed patch disturbs the image's original label. In addition, the model pre-trained on ImageNet-1K fits very quickly even during the first several epochs; meanwhile, TRs with more samples give the model a faster convergence rate. Hence, these results also reveal that a deep CNN model pre-trained on ImageNet-1K easily overfits on the RSI datasets, though many researchers pay more attention to the domain gap between natural images and RSI.
Nonetheless, these results also support the hypothesis in Eqs. (6) to (8). First, as training goes deeper, the CNN learns more local features, but the model's performance may be suboptimal due to the data distribution gap between the training and testing sets. Second, overfitting emerges easily on these RSI datasets, even though we used a smaller LR of 1E-04 and only trained for a short period of 60 epochs.
To verify the method's effectiveness, the author compares it with 21 CNN-based methods from the previous literature. The presented data include each method's OAs, base model architecture, and parameter size. For a fair comparison, the TR is the same unless specifically stated. Note that most of the previous methods modified the model's architecture or employed multiple models. Hence, the real parameter sizes of these methods should be considerably larger. The comparison results for AID are shown in Table 1, while those for NWPU are shown in Table 2. Note that "None" means that no relevant results are presented in the literature.

OA Results
As shown in Table 1, compared to all the previous state-of-the-art CNN methods, the author's method on AID achieves a clear lead in OA; meanwhile, it is more lightweight, with the fewest parameters. Compared to the popular strategies in detail, the author's lead over the best feature fusion method is 0.81% [4], though the compared method used a smaller testing ratio of 20%; the lead over the best attention add-in method [18] is 0.2% to 0.6% with a clear decrease of 91.3% in parameters; the lead over the best multiple-model method [21] is 0.5% to 0.9%, and although the compared method's parameter size is not presented, it is undoubtedly larger. Besides, the author's lead over the CNN ensemble [22] is 0.5% at a TR of 20% and -0.3% at a TR of 50%. Technically speaking, based on Eq. (11), the number of tested samples is smaller at a TR of 50%. Therefore, taking samples wrongly labeled by humans into account, the author argues that the performance evaluation is more persuasive with a larger number of testing samples. Nonetheless, these results show that, compared to the author's lightweight method, the previous methods do not achieve outstanding improvements on AID, even though more handcrafted features, more parameters, and multiple models are used.
As shown in Table 2, the author's method on NWPU still outperforms the others, with a clear lead in OA and the fewest parameters. Compared to all strategies in detail, the author's lead over the best feature fusion method is 2.6% [3], though the compared method used a smaller testing ratio of 20%; the lead over the best attention add-in method [12] is 1.3% to 1.6% with a clear decrease of 36% in parameters; the lead over the best multiple-model method [21] is 0.9% to 1.0%, and although the compared method's parameter size is not mentioned, it is undoubtedly larger.
In addition, the author's lead over the best CNN ensemble [22] is 0.5% to 1.2%, and the improvement on a TR of 10% with more testing examples is more obvious. Based on Eq. (11) and putting the similar comparison results together, we can see that the author's method is more advanced when the testing sets become larger.
Therefore, as a short conclusion, based on all the above OA results on the two benchmark RSI datasets, the author's method presents a consistent advance over the previous ones; these results also support the hypothesis and explanation from mathematical theory presented in Section 3.1. That is, it is unnecessary to increase a method's computational complexity for transfer learning tasks like RSI-SC.
The confusion matrix for AID at a 20% TR is shown in Figure 8, while that for NWPU at a 20% TR is shown in Figure 9. As mentioned before, the author's method has proven to be more advanced with more testing samples. Hence, the matrix of AID at a TR of 20% is specifically shown here.
In short, as shown in Figure 8, the model achieved an OA of 97.05% on AID with a TR of 20%, but the confusion results differ among the 30 categories. In detail, the most confusing categories are marked with red rectangles, including center, industry area, park, resort, school, and square, with OA less than 95%; the secondary confusing ones, marked with green rectangles, have OA slightly less than 97%, including church and commercial area; the OAs of the other categories are all above 97%. Compared to previous studies [7,12,17,19,21,22], we can see that the confusion is consistent, though the author's OA is higher. Looking at the prior leading methods [21,22], we can find that their most confusing subclasses are similar to those in this work; in particular, for the confusing categories, the OAs in [21] are lower than those in this work, while those in [22] are higher. In other words, compared to [22], the author's method is more discriminative for all subclasses in AID except the most confusing ones, including park, resort, and square. As mentioned before, a classifier ensemble has the advantage of diversity, but its final performance still depends on whether its individual classifiers are accurate enough. Therefore, the CNN ensemble method in [22] is still suboptimal due to its weaker individual classifiers, though it performs better in the three categories.
In summary, as shown in Figure 9, the model shows an OA of 96.04% on NWPU with a TR of 20%, and again the confusion results differ among the 45 categories. In Figure 9, the most confusing categories, marked with red rectangles, include church, dense residential, industry area, island, palace, railway, and railway station, with OA less than 94%; the secondary confusing ones are marked with green rectangles, with OA less than 96%, including commercial area, desert, freeway, lake, meadow, medium residential, mountain, rectangular farmland, river, runway, sparse residential, terrace, and wetland; the OAs of the other categories are all above 96%. Looking at the comparable results in [7,12,17,19,22], we can see that the confusion is still consistent, though the author's OA is higher. Giving the same attention to the prior leading method in [22], we can find that its most confusing subclasses are still different from those in this work. That is, compared to the ensemble in [22], the author's method is more discriminative for all subclasses on NWPU except the most confusing ones. However, even when compared to the method in [7], which has a lower OA of 92.55% for the whole dataset, the confusion results in [22] still show clear OA gaps of approximately 3% to 7% in some categories, including forest, roundabout, tennis court, and so on. Therefore, putting the results on the two datasets together, we can see that the poor individual classifiers in the CNN ensemble [22] have made the method suboptimal, even though its individual classifiers are diverse.
In conclusion, taking all the confusion results shown in Figures 8 and 9 into account, we can see that the author's method clearly surpasses the previous methods, with the obvious advantage of cheaper hardware overheads, and that its advantage grows as the number of testing samples increases; meanwhile, the most confusing categories are human settlements. Hence, based on the OA results and the confusion matrices, we can see that a CNN pre-trained on ImageNet-1K can achieve outstanding performance, even though RSI have clear domain gaps with natural images.
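The per-category figures discussed above are read off the diagonal of a confusion matrix. As a minimal sketch of that bookkeeping, assuming rows index the true class and columns the predicted class (the function names and the toy labels below are illustrative and are not taken from the paper's code):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Count matrix with true classes on rows and predicted classes on columns."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def overall_accuracy(cm):
    """Fraction of all samples that lie on the diagonal."""
    return np.trace(cm) / cm.sum()

def per_class_accuracy(cm):
    """Diagonal entry divided by the row total (0 for classes with no samples)."""
    row_totals = cm.sum(axis=1)
    return np.divide(np.diag(cm), row_totals,
                     out=np.zeros(len(cm)), where=row_totals > 0)

def confused_classes(cm, threshold):
    """Indices of classes whose accuracy falls below the given threshold."""
    return [int(i) for i in np.flatnonzero(per_class_accuracy(cm) < threshold)]

# Toy data with three classes (hypothetical, for illustration only).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
oa = overall_accuracy(cm)            # 8 of 10 samples are correct
flagged = confused_classes(cm, 0.8)  # classes 0 and 2 fall below 80%
```

With thresholds such as 95% and 97% on AID, the same `confused_classes` call would reproduce the grouping into red and green rectangles described for Figure 8, provided the matrix is normalized by row as assumed here.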

Class Activation Mapping
To get a better understanding of CutMix, the author employs activation maps generated by the GradCAM algorithm [33] to analyze how the CNN's attention changes for the cut-and-mixed samples. The maps are shown in Figure 10, in which A denotes the original scene-mixed images, B represents the activation maps for the beach subclass, and C represents the activation maps for the resort subclass. Note that the brighter areas indicate more discriminative information.
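For readers unfamiliar with GradCAM, the core computation behind such maps can be sketched as follows. This is a minimal NumPy illustration, assuming the feature maps of the chosen convolutional layer and the gradients of the target class score with respect to them have already been extracted from the network; the function name and the toy arrays are hypothetical, and the final rescaling to [0, 1] is only a display convenience.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute a class activation map from one layer's feature maps.

    activations: array of shape (K, H, W), the K feature maps of the layer.
    gradients:   array of shape (K, H, W), d(class score) / d(activations).
    """
    # Channel weights: global average pooling of the gradients.
    weights = gradients.mean(axis=(1, 2))              # shape (K,)
    # Weighted sum of the feature maps over the channel axis.
    cam = np.tensordot(weights, activations, axes=1)   # shape (H, W)
    # Keep only features with a positive influence on the class score.
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy example: channel 0 supports the class, channel 1 opposes it.
acts = np.array([[[1.0, 0.0], [0.0, 0.0]],
                 [[0.0, 0.0], [0.0, 1.0]]])
grads = np.stack([np.ones((2, 2)), -np.ones((2, 2))])
cam = grad_cam(acts, grads)
```

In the toy example only the top-left cell stays bright, because the second channel receives a negative weight and is removed by the ReLU step. In practice the map is upsampled to the input resolution and overlaid on the image.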
As shown in Figure 10a, the activated area of the EfficientNet-B0 model trained without CutMix is larger and more scattered, which indicates that the model's prediction corresponds to more principal features. On the contrary, as shown in Figure 10b, the activated area of the same model trained with CutMix is smaller and more targeted; more specifically, as shown in the C part of Figure 10b, the activated area for resort is mainly focused on a swimming pool, which is the most typical ground object of the resort category. Therefore, based on these activation mapping results, we can see that CutMix has guided the CNN to learn more discriminative features in RSI.

To further verify the method's effectiveness, this paper employed t-Distributed Stochastic Neighbor Embedding (t-SNE) [34] to intuitively show the similarity of classified samples. The visualization results are shown in Figure 11, in which the category numbers for AID and NWPU are the same as those presented in Figures 4 and 5.

Stochastic Neighbor Embedding
The obviously overlapped category pairs of AID (marked with red rectangles), as shown on the left of Figure 11, include the first pair of playground and stadium, the second pair of center and square, and the third pair of parking and resort. Looking at the right of Figure 11, the clearly overlapped categories of NWPU (also marked with red rectangles) form three pairs: desert and mountain, lake and wetland, and church and palace. Putting the results for AID and NWPU together, we can see that most of the categories are clearly separated from each other, and the overlaps are related to the confusion information shown in Figures 8 and 9. Checking the previous methods with comparable results [15,22], we can see that the t-SNE visualization results in this paper are more dispersed for both AID and NWPU, indicating a better classification result.
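A visualization of this kind can be produced along the following lines. The sketch assumes scikit-learn is available and that one feature vector per test image (for example, from the layer before the classifier) has already been extracted; the synthetic two-cluster features, the perplexity value, and the helper name are placeholders rather than the paper's settings.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_2d(features, perplexity=30.0, seed=0):
    """Project (N, D) feature vectors to 2-D with t-SNE for plotting."""
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="pca", random_state=seed)
    return tsne.fit_transform(features)

# Synthetic stand-in for CNN features: two well-separated classes.
rng = np.random.default_rng(0)
features = np.concatenate([rng.normal(0.0, 1.0, (30, 16)),
                           rng.normal(8.0, 1.0, (30, 16))])
points = embed_2d(features, perplexity=10.0)
```

Each row of `points` can then be drawn as a scatter point colored by its category number; overlapping clusters in that plot correspond to the marked category pairs discussed above.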

Ablation Study
To validate the importance of regularization in training, in the ablation experiments, the whole pipeline of the SC-CNN algorithm is used as the baseline with the OLS and CutMix inactive, and the OA results, as shown in Table 3, include a 20% TR for AID and a 10% TR for NWPU.
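For reference, the CutMix operation that is switched off in the baseline can be sketched as below. This is a minimal NumPy version of the generally published CutMix procedure, not the paper's training code: the function names, the `alpha` value, and the batch layout (N, C, H, W) are assumptions. Because the labels are mixed as soft vectors, the same routine accepts either one-hot or smoothed labels.

```python
import numpy as np

def rand_box(height, width, lam, rng):
    """Sample a box whose area is roughly (1 - lam) of the image."""
    cut_ratio = np.sqrt(1.0 - lam)
    cut_h, cut_w = int(height * cut_ratio), int(width * cut_ratio)
    cy, cx = rng.integers(height), rng.integers(width)
    top, bottom = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, height)
    left, right = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, width)
    return int(top), int(bottom), int(left), int(right)

def cutmix(images, soft_labels, alpha=1.0, rng=None):
    """Paste a patch from a shuffled copy of the batch and mix the labels.

    images:      array of shape (N, C, H, W).
    soft_labels: array of shape (N, num_classes), rows summing to 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, _, h, w = images.shape
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(n)
    top, bottom, left, right = rand_box(h, w, lam, rng)
    mixed = images.copy()
    mixed[:, :, top:bottom, left:right] = images[perm][:, :, top:bottom, left:right]
    # Recompute lam from the clipped box so the labels match the pixels.
    lam = 1.0 - (bottom - top) * (right - left) / (h * w)
    labels = lam * soft_labels + (1.0 - lam) * soft_labels[perm]
    return mixed, labels

rng = np.random.default_rng(0)
images = rng.random((4, 3, 32, 32))
labels = np.eye(5)[[0, 1, 2, 3]]
mixed, mixed_labels = cutmix(images, labels, alpha=1.0, rng=rng)
```

The label weights follow the pasted area, so a sample that is mostly beach with a small resort patch keeps most of its weight on beach, which matches the scene-mixed images examined with the activation maps.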
The baseline strategy, as shown in row 1 of Table 3, already helps the model prevail over all the previous methods in Table 1, with a 0.1% to 6.9% OA increase on AID and NWPU. These results reveal that performance has been consistently suboptimal in previous studies. The author argues that an inappropriate training strategy may be the primary reason. Looking at the second and third rows of Table 3, we can see that OLS and CutMix each boost the EfficientNet-B0 performance by a 0.6% OA increase. Most importantly, as shown in the last row of Table 3, the combination of OLS and CutMix boosts the model's accuracy by approximately 1.0%. Therefore, these results show that regularization is important for RSI-SC, though it has rarely been mentioned in previous studies.

To verify the effectiveness of the combination of DAs and regularizations, this work also conducted similar ablation experiments, and the results are shown in Table 4. In detail, as described in Section 3.3, "DA1" denotes the DA used in Step 1, consisting of color jitter, horizontal flip, vertical flip, and rotation, and "DA2" denotes the DA used in Step 2, equal to DA1 but without the color jitter. In addition, the suffixes "-1" and "-2" mean that the DAs or regularizations are used in Step 1 or Step 2, respectively, and the baseline is the same as defined in Table 3.
Given the results in rows 1, 2, and 3 of Table 4, we can see that the model's performance degrades on both AID and NWPU, revealing that training sets transformed by a stronger DA have a larger data distribution shift, which leads to a suboptimal solution. In particular, the performance degradation is more evident on AID when a stronger DA is active in Step 2 with CutMix inactive, revealing that the impact of intensive DAs on a CNN's performance is greater when the training set is smaller. Comparing the last two rows in Table 4, however, we can see that the model's performance still degrades slightly when CutMix is active in Step 1, meaning that a combination of stronger DAs and regularizations will also result in a suboptimal solution, though more training samples may alleviate the effect. In any case, the consistent ablation results in Tables 3 and 4 show that the proposed combination of DAs and regularizations is effective.

To verify the impact of a larger LR on the CNN's accuracy, this study performs a simple but convincing test. It consists of five different LRs, namely 0.0001, 0.0005, 0.001, 0.005, and 0.01, with the same baseline training strategy described in Table 3. The testing results are shown in Figure 12, where the baseline corresponds to an LR of 0.0001.

Discussions
As shown in Figure 12, it is clear that the model's accuracy declines sharply as the LR grows. We can see that the accuracy drops fast as the LR exceeds 0.001 both for AID and NWPU. The result reveals that the CNN is easier to overfit on a small RSI training set with a larger LR. However, to the author's best knowledge, previous studies in Tables 1 and 2 have not noticed this problem, and some of them have made mistakes.
To verify the impact of adding modules to a pre-trained CNN without re-training on ImageNet-1K again, this study also conducted another simple but persuasive experiment. The test employs a pretrained EfficientNet-B0 model with all its built-in SE-block parameters re-initialized at random, and then uses the same algorithm presented in Algorithm 1 to train the model both on AID and NWPU. The experiment results, as shown in Table 5, can be directly compared to the related ones in Tables 1 and 2.
Looking at Table 5, we can see that the same pre-trained CNN suffers significant performance degradation, with OA decreases of 0.15% to 0.24% on AID and 0.43% to 0.44% on NWPU, if the pre-trained weights of its SE attention blocks are re-initialized at random; meanwhile, the degradation is more obvious on the larger dataset. These results support the viewpoints given in Eqs. (5) to (8), i.e., that pre-training matters a lot if the model's architecture is modified. Taking the CNN ensemble [22] into account, by adding multiple branches to a pre-trained CNN with self-designed blocks plus built-in attention modules, its authors proposed an ingenious idea to condense the method's complexity and lift the diversity of the individual classifiers; however, as shown in Table 5, this architecture modification requires pre-training on ImageNet-1K again, which was in fact omitted. Therefore, compared to the author's method, we can see that the method in [22] struggles on the larger NWPU, just as the result in Table 5 does.
In general, this work proposed a simple but leading method for classifying RSI by using a lightweight EfficientNet-B0 model. Given the fewer parameters than previous studies, the LS-EfficientNet can perform better in those hardware-restricted fields for classifying RSI, e.g., embedded systems, onboard devices, field tasks, and so on. Given the simple pipeline consisting of an accessible pre-trained CNN model and open source training algorithms, the LS-EfficientNet is also easier to reproduce for those routine tasks for classifying RSI. Based on the following points, however, the LS-EfficientNet still has disadvantages that need improvement. First, putting hardware and time costs aside, the LS-EfficientNet may not be the most cutting-edge method to date. Second, given the experience in this work, a CNN ensemble may have a much better performance than the LS-EfficientNet while still maintaining simplicity and efficiency. Third, the LS-EfficientNet using a pre-trained CNN on ImageNet-1K may achieve suboptimal performance on RSI sets due to the feature's domain gap and other neglected problems. Anyhow, the author will try to propose more efficient methods for RSI-SC in the future.

Conclusions
In this paper, the author proposes a CNN-based method that aims to provide a leading but efficient solution for RSI-SC by using a lightweight EfficientNet-B0. For this purpose, the paper first investigates several popular strategies in mathematical theory and gives a detailed qualitative conclusion on these methods' theoretical performance. Based on these findings, the work proposes a novel method using a simple pipeline consisting of a single CNN and its concise training algorithm. Far different from previous studies, the proposed method mainly focuses on tackling problems that were commonly neglected in previous studies, including overfitting, data distribution shift by DA, improper use of training tricks, and other incorrect operations on a pre-trained CNN. Compared to the complex and hardware-extensive methods in previous studies, the proposed method is easy to reproduce because all the models, training tricks, and hyperparameter settings are open-sourced. Extensive experiments on two benchmark datasets, AID and NWPU, show that the proposed method can easily surpass all the previous state-of-the-art ones, with an outstanding accuracy lead of 0.5% to 1.2% compared to the best prior one in 2022. It should be emphasized that the proposed method has the fewest parameters, only 22% of those of the best competitor in 2022. In addition, ablation test results also show that the proposed combination of training tricks, including OLS and CutMix, can clearly boost a CNN's performance for RSI-SC, with an increase in accuracy of 1.0%.
Taking all the findings in the paper together, the author argues that it is unwise to increase the method's complexity and hardware costs for transfer-learning tasks like RSI-SC; meanwhile, the consistently suboptimal results of previous studies shown in this paper, together with the methods' very close performance, also make it hard to tell which findings are truly meaningful.