A Novel Hybrid Signal Decomposition Technique for Transfer Learning Based Industrial Fault Diagnosis

Abstract: In the fourth industrial revolution, data-driven intelligent fault diagnosis plays a crucial role in industry. Although deep learning is a popular approach for fault diagnosis, it requires massive amounts of labelled samples for training, which are hard to come by in the real world. We introduce a comprehensive intelligent fault detection model, evaluated on the Case Western Reserve University dataset, in two steps. First, a new hybrid signal decomposition methodology is developed that combines Empirical Mode Decomposition and Variational Mode Decomposition to leverage signal information from both processes for effective feature extraction. Second, transfer learning with DenseNet121 is employed to alleviate the data constraints of deep learning models. The proposed technique surpasses previous results and achieves state-of-the-art performance as measured by the F1 score.


Introduction
In manufacturing applications, the primary role of fault detection and diagnosis is to produce a reliable indicator of a process's faulty state. To optimize the efficiency of their activities, several significant industries have stressed the importance of fault prediction and detection [1]. Early detection can help avoid serious incidents and limit potential industrial damage. The most common fault type appears to be the bearing fault, which accounts for the majority of all machinery faults [2].
The vibration data, retrieved from the rolling bearing motor in the Case Western Reserve University (CWRU) database, needs to go through signal decomposition. Detection methods also rely heavily on signal processing for feature extraction to generate an accurate output. Several signal processing methods followed by machine learning and deep learning-based approaches have been proposed and implemented to date for detecting and diagnosing faults, especially in rolling element bearings. Our system decomposes the raw one-dimensional vibration signals and then transforms them into two-dimensional images.
Empirical Mode Decomposition (EMD) is a signal processing technique that extracts signals from distorted and non-stationary data [3]. However, a more potent model known as Variational Mode Decomposition (VMD) has excellent noise immunity, better decomposing efficiency, and consistency for interconnected layers). It operates considerably better due to some unique characteristics, such as using Rectified Linear Units (ReLU) instead of the tanh function for rapid computation, enabling multiple GPUs to train a larger model in less time, and reducing error with overlapping pooling 1. In related research, Nazir adopted a reference model-based strategy, with the generalized likelihood ratio (GLR) applied for residual evaluation, to approach the robust turbine failure diagnosis challenge [10].
Another paper used an SVM-CNN model for image classification to investigate faults in rolling bearings after turning the one-dimensional vibration signal into two-dimensional time-frequency images and pre-training the ResNet18 network for feature extraction [11]. They achieved an overall accuracy of 98.75% over ten trials. In addition to transfer learning, some notable works on fault diagnosis have used auto-encoders and modified auto-encoders. The authors of [12] developed a hybrid feature pool composed of three different feature types (time-domain statistical, envelope power spectrum, and wavelet energy) to derive more precise information from raw vibration signals. This helps account for the non-stationary behaviour of the signals, which is characterized by different crack sizes for a given fault type. They then used deep neural networks (DNNs) based on sparse stacked autoencoders to identify bearing faults. The accuracy obtained from this two-layered fault diagnosis model was 99.5% [12].
Jia et al. [13] presented a five-layer auto-encoder-based DNN that was pre-trained layer by layer in an unsupervised manner. After fine-tuning with a back-propagation algorithm, the DNN classified the Fast Fourier transforms of the signals; over ten trials, the accuracy on the transformed fault data ranged from 99.68% to 99.95%. Shao [14] introduced a robust deep belief network (DBN) with three layers of dual-tree complex wavelet packets, designed to improve convergence speed and prediction precision with multiple stacked adaptive restricted Boltzmann machines. The adaptive DBN is trained directly on the normalized function parameters, resulting in 94.37% accuracy without manual feature selection.
Cheng proposed a Wasserstein distance-based deep transfer learning method for fault detection, which achieved the best transfer accuracy, with a 95.75% average score across three transfer scenarios (two unsupervised and one supervised) and sixteen transfer fault diagnosis experiments with insufficient labelled data in both the unsupervised and supervised settings [15]. Another paper demonstrated a fault identification algorithm based on global optimization of generative adversarial networks (GANs) to handle imbalanced data and resolve misdiagnosis [16]. The authors combined the DNN's flexible feature extraction with the GAN's data generation capabilities to overcome the issue. They used global optimization to create a new GAN generator and discriminator that produce more discriminative fault samples. For a 10:1 imbalance ratio, the accuracy of this system is 94.58%, 96.85%, and 93.28% for inner-race, roller, and outer-race faults, respectively.

Dataset Description
We have chosen the Case Western Reserve University (CWRU) dataset, which makes bearing test data for normal and faulty bearings readily available. This dataset is well suited as input to various machine learning or deep learning models [17]. The dataset is gathered for normal bearings and for single-point drive end (DE) and fan end (FE) defects 2. There are 161 records in the dataset, which are divided into four categories. Among those four categories (Normal Baseline Data, 12k Drive End Bearing Fault Data, 48k Drive End Bearing Fault Data, Fan-End Bearing Fault Data), we have selected the 48k Drive End Bearing Fault Data to work with. The variables of our dataset are named 'Fault Diameter', 'Motor Load (HP)', 'Approx. Motor Speed (rpm)', 'Inner Race', 'Ball', 'Outer Race' [17]. The vibration acceleration signals for normal bearings and bearings with three types of faults are included in this study's dataset. Faults with diameters of 0.007, 0.014, and 0.021 inches were introduced at the inner raceway, ball, and outer raceway using SKF bearings. There are four signals in total in the dataset for normal bearings, one for each shaft load.
1 "Alexnet: The Architecture That Challenged Cnns", Medium, Last modified 2021, https://towardsdatascience.com/alexnet-the-architecture-that-challenged-cnns-e406d5297951.
www.aetic.theiaer.org

Proposed Hybrid Signal Decomposition Technique
All the signals of our dataset are in MATLAB format. The signals were collected and concatenated to form one signal per environment specification using MATLAB R2021a. The signals are visualized in Figure 1:

Conventional Signal Decomposition techniques
There are three popular approaches to analysis in signal processing: time domain, frequency domain, and time-frequency domain [18]. For fault signal examination, a variety of signal processing schemes have been developed and implemented, including the fast Fourier transform, wavelet transform, EMD, empirical wavelet transform, wavelet packet transform, VMD, and so on.

Empirical Mode Decomposition (EMD)
Empirical Mode Decomposition (EMD) is a signal decomposition method that takes a signal as input and returns its fundamental elements, called Intrinsic Mode Functions (IMFs). The primary difference between the Fast Fourier transform (FFT) and EMD is that FFT assumes the signal is periodic, whereas EMD is wholly data-driven and makes no such presumption 3.
An IMF can contain more than one frequency, and the signal that remains after the IMFs have been subtracted is called the residuum 4. The technique's general premise is to use partial signal transformations, with the relevant IMFs referring to the signal's most critical structures (low-frequency components) [19]. Figure 2 shows a sample extraction from our EMD experiment, with the first three IMFs and the residual, which mostly denotes noise:

Variational Mode Decomposition (VMD)
EMD is commonly used for signal processing, but it is notorious for flaws such as sensitivity to noise and sampling [20]. In contrast, another signal decomposition technique known as Variational Mode Decomposition (VMD) avoids the errors that EMD-based decomposition methods incur during recursive estimation and at the termination of recursion [21]. VMD uses a non-recursive method to decompose a non-stationary signal into several band-limited intrinsic mode functions (BLIMFs) [21]. The decomposition results of VMD on various artificial and real data are impressive. Moreover, the VMD-based model also showed noise robustness and precise component separation in our own experiment. Figure 3 shows the first three IMFs and the residual signal extracted using VMD on a signal sample:

Proposed Hybrid Signal Technique
After retrieving the data from MATLAB, the signals are passed to EMD and VMD separately. EMD produces ten intrinsic modes using recursive functions, whereas VMD generates five signals. EMD decomposes signals from high frequency to low frequency, while the benefit of VMD is its robustness to noise and sampling. Thus, the column denoted as '0' has the highest frequency and lowest noise. The noise gradually increases, reaching its highest level in the 9th column for EMD and the 4th column for VMD.
After decomposing the signals individually through EMD and VMD, a hybrid model is built by merging EMD and VMD outputs into one column. Three hybrid signals, denoted Sh, are created according to Equation 1, in which IMFi is derived from EMD and ui is derived from VMD, as formulated in [19][20]. For the first hybrid signal, IMF 1 is taken from both the EMD and VMD outputs, and the values are merged to create a new signal, as shown in Figure 4. This concatenation is done using Python 3.6.9 after converting the .mat files into Python dictionaries. We then selected the IMFs separately and used the NumPy library to merge them and create the hybrid signals as per Equation 1. Figures 5-6 below illustrate the process of selecting the corresponding individual signals to form one hybrid signal:
We chose the hybrid signal decomposition method instead of using EMD or VMD alone mainly because it offers the benefits of both while omitting most of their drawbacks. Since it decomposes the signal into IMFs defined by the signal itself, EMD is an adaptive process [23]. Even though EMD is an excellent choice for signal decomposition, it tends to be sensitive to noise. EMD also struggles during decomposition because of mode mixing and the end effect. Improved methods (e.g., ensemble EMD, complementary EEMD, partial EEMD) largely solve the end effect and mode mixing challenges for some signals, but not for all [24].
On the other hand, VMD reduces the noise handicap of EMD and helps resolve the mode mixing issue in the decomposition outcomes. Even so, pre-determining the mode number is a significant challenge for VMD, as its efficiency is wholly reliant on it [24]. Over- and under-decomposition can occur if the mode number is incorrect, whereas EMD has no such constraint.
EMD decomposes the signal from high to low frequency, while VMD decomposes in the opposite direction. VMD-decomposed signals have more high-frequency components than EMD signals. Similarly, EMD modes have a higher proportion of low-frequency components than VMD modes. However, the amplitude of the signal processed by EMD is reduced significantly [25]. Since VMD is more sensitive to the high-frequency signal, the amplitude of the VMD-processed signal is almost identical to that of the input signal. As a result, the efficacy of the two methods varies with frequency. The output of VMD (measured by signal-to-noise ratio) is notably better for high-frequency signals, while EMD is better for low-frequency signals [25]. Having weighed the advantages and disadvantages of EMD and VMD separately, we developed the proposed hybrid model to combine both, leverage the best of each, and tackle the corresponding challenges effectively. The results of the hybrid signal decomposition method support this claim, outperforming both methods' individual performances.
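The merging step described in this section can be sketched with NumPy as follows. This is only an illustration under stated assumptions: Equation 1 is not reproduced in the text, so the sketch assumes that hybrid signal k is the end-to-end concatenation of the first k EMD modes and the first k VMD modes. The function name `make_hybrid` and the synthetic input arrays are hypothetical, not the authors' code.

```python
import numpy as np


def make_hybrid(emd_imfs, vmd_modes, k):
    """Concatenate the first k EMD IMFs with the first k VMD modes.

    emd_imfs:  array of shape (n_emd_modes, n_samples)
    vmd_modes: array of shape (n_vmd_modes, n_samples)
    ASSUMPTION: "merging" means end-to-end concatenation of the modes.
    """
    if k < 1 or k > min(len(emd_imfs), len(vmd_modes)):
        raise ValueError("k must be between 1 and the smaller mode count")
    parts = [emd_imfs[i] for i in range(k)] + [vmd_modes[i] for i in range(k)]
    return np.concatenate(parts)


# Synthetic stand-ins for real decompositions: 10 EMD modes, 5 VMD modes.
emd_imfs = np.arange(10 * 8, dtype=float).reshape(10, 8)
vmd_modes = -np.arange(5 * 8, dtype=float).reshape(5, 8)

hybrid1 = make_hybrid(emd_imfs, vmd_modes, 1)  # first mode of each method
hybrid3 = make_hybrid(emd_imfs, vmd_modes, 3)  # first three modes of each
```

With real data, `emd_imfs` and `vmd_modes` would come from whichever EMD and VMD implementations are used; only the selection and concatenation logic is shown here.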

Proposed Fault Diagnosis Method
This section first presents the workflow of our study, which is divided into the following steps. Figure 7 illustrates the workflow:

Data Pre-Processing
After applying our proposed signal processing technique, we converted the one-dimensional data into 2D (32*32) arrays [26]. This is done using the NumPy library, which allows us to reshape arrays into the desired shape while keeping the signals intact. Then, to convert each 2D array into an image, one extra dimension is added, generating grayscale images of shape (32, 32, 1). Figure 8 illustrates the conversion steps:
This conversion allows us to utilize advanced neural networks that perform efficiently in classification tasks. The folders containing the images are labelled as follows: '0' for Normal, '1' for Inner, '2' for Ball, and '3' for Outer. At this stage, 20% of the data from each class is separated and kept in a folder without any label. This separation ensures that unseen data is available later to test the performance of the deep learning models. The training set consisted of 3777 images, of which 388 were normal samples, 1186 had inner raceway defects, 1041 had ball defects, and 1162 had outer race defects. The available training samples are labelled by one-hot encoding. One-hot encoding is a binary representation of a categorical variable in which each row has one element with a value of 1 and all others have a value of 0. The labelled data is then passed to the models for training, with 20% of it held out for validation. Figure 9 shows sample images from all four classes, each representing a subsignal of its corresponding full-length signal:
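The reshaping and labelling steps above can be sketched as follows. This is a minimal illustration, assuming that the signal is cut into non-overlapping segments of 32*32 = 1024 samples and that any leftover samples are discarded; the text does not state how segments are chosen. The helper names `signal_to_images` and `one_hot` are hypothetical.

```python
import numpy as np


def signal_to_images(signal, side=32):
    """Cut a 1D signal into (side x side x 1) grayscale arrays.

    ASSUMPTION: non-overlapping segments; trailing samples that do not
    fill a whole image are dropped.
    """
    signal = np.asarray(signal)
    n_images = len(signal) // (side * side)
    usable = signal[: n_images * side * side]
    return usable.reshape(n_images, side, side, 1)


def one_hot(labels, num_classes=4):
    """Encode integer labels (0=Normal, 1=Inner, 2=Ball, 3=Outer)."""
    return np.eye(num_classes)[np.asarray(labels)]


demo_signal = np.arange(5000, dtype=float)  # stand-in for a hybrid signal
images = signal_to_images(demo_signal)      # 4 images, 904 samples dropped
labels = one_hot([0, 3])
```

Because `reshape` fills rows in order, each image row holds 32 consecutive samples, so the temporal order of the signal is preserved inside every image.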

Data Normalization
The data is then normalized because it contains attributes of different scales, which can affect the training process. After rescaling the pixel values by a constant factor (255.0), we used the ImageNet statistics to normalize our data. This is known as feature scaling, and the dataset is standardized using the formula z = (x - μ) / σ, where μ and σ are the mean and standard deviation. The full set of samples is split 80:20 for training and testing. The 80% training portion is then divided again 80:20 for training and validation.
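The two normalization steps can be sketched as follows. The mean and standard deviation values are the commonly published ImageNet per-channel statistics; the sketch assumes that each grayscale image is replicated across three channels so that these statistics apply, which the text does not state explicitly. The function name `normalize_images` is hypothetical.

```python
import numpy as np

# Commonly used ImageNet per-channel statistics (RGB order).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])


def normalize_images(gray_images):
    """Rescale to [0, 1] and standardize with z = (x - mean) / std.

    gray_images: array of shape (N, H, W, 1) with values in [0, 255].
    ASSUMPTION: the single channel is repeated three times.
    """
    x = np.asarray(gray_images, dtype=np.float64) / 255.0
    x = np.repeat(x, 3, axis=-1)  # (N, H, W, 1) -> (N, H, W, 3)
    return (x - IMAGENET_MEAN) / IMAGENET_STD


demo = np.zeros((2, 32, 32, 1))
demo[1] = 255.0  # second image is entirely white
normalized = normalize_images(demo)
```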

Deep Learning Classifiers
ResNet [27] proposed a skip connection-based solution to network output degradation, also known as the vanishing gradient problem. This issue commonly occurs in large networks and causes the network to lose feature information in its deeper layers. Skip connections allow the network to preserve data and pass the feature maps into deeper layers. Inspired by the same simple methodology, DenseNet [28] uses dense connections to pass feature maps with the aim of solving the vanishing gradient problem. However, it uses fewer parameters than ResNet, as seen in Table 2. Instead of summing feature maps, DenseNet concatenates them, creating L(L+1)/2 connections for an L-layer network. These connections allow DenseNet to improve performance and preserve information even in deep layers. Table 1 shows the parameters and blocks set up to build a DenseNet with 121 layers:
VGG16, ResNet18, ResNet34, ResNet50, and MobileNetV2 were selected in this paper to compare the effects of hybrid signals on existing models. As we also intend to reduce computational cost, we included architectures of varying depth to find the most efficient model. Increasing layers and neurons without considering the dataset causes the models to converge poorly rather than harness their full potential. The specifications of the models discussed in this paper are given in Table 2:
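As a small worked example of the connection count and of how concatenation grows the channel dimension, consider the sketch below. The block sizes (6, 12, 24, 16), growth rate 32, 64 initial channels, and compression factor 0.5 are the standard DenseNet-121 configuration from the DenseNet paper [28]; it is an assumption that the configuration in Table 1 matches them. The function names are hypothetical.

```python
def dense_connections(num_layers):
    """Number of direct connections, L(L+1)/2, for L densely connected layers."""
    return num_layers * (num_layers + 1) // 2


def densenet_block_channels(block_sizes=(6, 12, 24, 16), growth_rate=32,
                            init_channels=64, compression=0.5):
    """Channel count at the output of each dense block.

    Every layer in a block appends `growth_rate` feature maps (concatenation),
    and each transition layer between blocks scales the count by `compression`.
    """
    channels = init_channels
    outputs = []
    for index, layers in enumerate(block_sizes):
        channels += layers * growth_rate
        outputs.append(channels)
        if index < len(block_sizes) - 1:  # no transition after the last block
            channels = int(channels * compression)
    return outputs


connections_121 = dense_connections(121)
block_channels = densenet_block_channels()
```

Because features are concatenated rather than summed, each layer only needs to add a small number of new feature maps, which is why the parameter count stays low despite the depth.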

Transfer Learning
By transferring information gained in one or more source tasks to improve learning in a related target task, transfer learning aims to enhance conventional machine learning. Information transfer approaches are a significant advancement in making machine learning as functional as human learning. Transfer methods can be regarded as extensions of the machine learning algorithms used to learn the tasks, since they rely on them. Work on inductive learning includes extending well-known classification and inference algorithms such as neural networks, Bayesian networks, and Markov Logic Networks [31]. Transfer learning is essential for deep learning techniques to succeed in a wide range of small-data situations. Although deep learning is widely used in science, many real-world scenarios do not have millions of labelled data points to train a model. Tuning the millions of parameters in a neural network requires a large amount of (expensive) labelled data, particularly in supervised learning. Transfer learning is one method for reducing the size of the data sets needed for neural networks to be viable [31]. Many high-performing models have been developed for image classification and demonstrated at the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 4. Given the source of the images used in the competition, this challenge is commonly referred to as ImageNet. It has led to many advances in the design and training of convolutional neural networks. Different deep learning libraries, like PyTorch 6, can download the model weights and use them in the same model architecture.
4 "A Gentle Introduction to the ImageNet Challenge (ILSVRC)", Machine Learning Mastery, https://machinelearningmastery.com/introduction-to-the-imagenet-large-scale-visual-recognition-challenge-ilsvrc/
The top three image recognition models can be downloaded and used to perform image recognition and other computer vision tasks. VGG (e.g., VGG16 or VGG19), GoogLeNet (e.g., InceptionV3), and Residual Network (e.g., ResNet50) are some of them.
We used ImageNet weights 5 to initialize all models, as our primary purpose was to evaluate the performance of the proposed decomposition technique. Instead of extensive hyperparameter tuning, the fully connected layer of the pre-trained models was removed, and a custom head was created by adding two fully connected layers followed by adaptive pooling layers [32]. Figure 10 shows the transfer learning process of our work, where the head is replaced as described earlier:
We also added two dropout layers that randomly drop information units to prevent the network from memorizing a specific pattern and thereby reduce overfitting. The AdamW [33] optimizer with a weight decay of 0.1 was used. Finally, the output layer was followed by a softmax activation function to classify the inputs. All layers were set to trainable, and each model was trained for five epochs with a batch size of 64. This enables our transfer learning model to be more adaptive and tuned to our data.
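To make the role of the weight decay value concrete, the following is a minimal single-parameter sketch of one AdamW update with decoupled weight decay, in pure Python. Only the weight decay of 0.1 comes from the text; the learning rate, beta values, and epsilon are common default values assumed for illustration, and the function name `adamw_step` is hypothetical.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update for a single scalar parameter.

    The decay is applied directly to the parameter (decoupled), not added
    to the gradient as in L2 regularization.
    ASSUMPTION: lr, beta1, beta2 and eps are illustrative defaults.
    """
    theta = theta * (1.0 - lr * weight_decay)   # decoupled weight decay
    m = beta1 * m + (1.0 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)              # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# With a zero gradient, only the decay term changes the parameter.
decayed_only, _, _ = adamw_step(theta=1.0, grad=0.0, m=0.0, v=0.0, t=1)
# With a unit gradient, the first step also moves by roughly lr.
after_step, _, _ = adamw_step(theta=1.0, grad=1.0, m=0.0, v=0.0, t=1)
```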

Results and Discussion
In the experimental evaluation, the test dataset was composed of 943 images (20%) that we fed to all six models, and the results are discussed in this section. We have used the F1 score, the harmonic mean of Precision and Recall. As our dataset has fewer samples for the 'Normal' class, the F1 score of each class is taken into account to represent each model's results accurately. It is shown mathematically in Equation 2, F1 = 2PR / (P + R), where P and R denote Precision and Recall.
The precision and recall scores are calculated for each class, and the micro average is taken. In the precision and recall formulas for our multiclass classification problem, TP, FP, and FN are the true positives, false positives, and false negatives for each class, respectively, and c denotes the number of classes.
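The metric computation can be sketched in pure Python as follows. The sketch uses the standard definitions: per class, P = TP / (TP + FP) and R = TP / (TP + FN); for the micro average, the TP, FP, and FN counts are summed over all c classes before dividing. It is an assumption that this matches the authors' exact formulas, and the function names are hypothetical.

```python
def f1_from(precision, recall):
    """F1 = 2PR / (P + R), defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_scores(y_true, y_pred, num_classes):
    """Return (per-class F1 list, micro-averaged F1)."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    per_class = []
    for c in range(num_classes):
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class.append(f1_from(p, r))

    total_tp, total_fp, total_fn = sum(tp), sum(fp), sum(fn)
    micro_p = total_tp / (total_tp + total_fp) if total_tp + total_fp else 0.0
    micro_r = total_tp / (total_tp + total_fn) if total_tp + total_fn else 0.0
    return per_class, f1_from(micro_p, micro_r)


# Labels: 0=Normal, 1=Inner, 2=Ball, 3=Outer; one Inner sample is misread as Ball.
per_class_f1, micro_f1 = classification_scores(
    [0, 1, 2, 3, 1, 2], [0, 1, 2, 3, 2, 2], num_classes=4)
```

Reporting the per-class values alongside the average, as the text does, keeps a weak result on the smaller 'Normal' class from being hidden by the larger classes.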

Proposed hybrid technique's results
First, we present the performance of EMD- and VMD-processed signals. In [20], the authors claimed that VMD is more robust than EMD with respect to noise and sampling. We investigated the performance of both methods in terms of fault classification and found that, for all models, the first IMF extracted using VMD fails to represent the signals accurately enough for classification and drops more information than the first IMF of EMD. VMD (3 IMFs) exhibits a further drop in performance, whereas EMD (3 IMFs) is a more precise representation of the signal. The F1 score performance for each model is presented in Figures 11-13:
To counter the disparity introduced by both methods, we now present a hybrid signal that removes the inconsistencies of the two methods. We trained all six models with Hybrid 1, Hybrid 2, and Hybrid 3 signals. ResNet18~50, VGG16, and DenseNet121 achieve outstanding results with 100% accuracy, whereas they exhibit 98~99%, 98%, and 99% F1 scores for EMD 1 and 86~85%, 89%, and 85% for VMD 1, respectively. As we include another intrinsic mode in all three techniques, we observe an increase in the performance of EMD 2 and VMD 2, resulting in 100% accuracy for the ResNet18 and ResNet34 models. However, ResNet50 fails to improve its discriminative abilities like its counterparts and hits a threshold of 99%. After including three modes, we found that applying VMD affects the performance of the models negatively, resulting in as much as a 14% performance drop. With VMD 3, VGG16 achieves the highest F1 score at 90%, whereas all the models trained with Hybrid 3 signals can accurately discriminate the faults with a 100% F1 score. Tables 3 and 4 show that the hybrid technique outperforms the standalone techniques. We propose this workflow using our hybrid signal technique to bring consistency and make the most of deep learning architectures. Without correctly processed data, there is no additional advantage in increasing layers.
As seen in the VMD 3 results in Table 5, ResNet50 can only detect 85% of samples, whereas ResNet18 detects 86%. EMD 3 seems to perform as well as the hybrid version in terms of the F1 score. Figure 14 displays the error rates during training for DenseNet121 on EMD 3 and Hybrid 3 signals:
Figure 14 clearly shows that the error rates are much lower with our proposed hybrid signal technique, which means the chances of signals being classified accurately increase; for real-time applications, this creates a valuable impact. During the experiments, although the ResNet variants validated with 100% accuracy, DenseNet121 exhibits 99.98% and 99.87% validation accuracy for the Hybrid and EMD versions, respectively. MobileNetV2 shows 99.76% for Hybrid and 99.59% for EMD. The standalone techniques perform worse than the hybrid versions because both pre-assume either the bandwidth or the number of ideal mode functions, thus limiting the information that can be learned. As fault signals vary from machine to machine, these limitations mean the extracted modes may not be useful features for the deep learning model's learning process. Thus, the hybrid technique bridges the gap and improves classification performance in a varied field.

Discussion and analysis
As seen in Tables II-IV, the convolutional networks achieve a 100% F1 score for all hybrid signals, except MobileNetV2. In [8], the authors propose a transfer learning-based ResNet50 for fault classification with 98.95% mean accuracy and a depth of 51 convolutional layers. Even with pre-trained models, ResNet can have high computational complexity as the number of layers increases. Our proposed approach demonstrates higher accuracy even with fewer layers and parameters. We suggest using the DenseNet model, which reduces complexity by reusing feature maps and thus keeps learned information intact throughout the forward pass. We used the DenseNet-BC architecture with a compression factor θ (where θ varies from 0 to 1) that reduces the transition layer outputs relative to the preceding dense block, making the whole model compact. Modified DFCNN [34] shows 99.96% mean accuracy on five sub-dataset variants of the CWRU dataset and outperforms previous methods such as deep belief networks. However, the model's performance drops significantly if the workload is changed. Even if a model generalizes well under predefined conditions, loads and conditions can vary in real-life scenarios. This issue is also recognized in [35], where the authors introduce a cross-domain approach to evaluate their proposed method under three different load conditions (1hp, 2hp, and 3hp). They achieve 95.47 (± 1.32) and 98.29 (± 1.23) F1 scores in two different scenarios where samples are trained on 1~2hp loads and tested on a 3hp load, and so on. This approach is taken because deep learning models often fail to generalize in varied conditions. However, our approach utilizes the full power of deep learning models and has high generalization abilities even under varied conditions, as described in section 3.
Table 6 shows previous works done on the CWRU bearing dataset and our results:
Our DenseNet model, with 121 layers, has 7381 dense connections and can thus retain a more extensive set of feature maps, making the model viable for a wider variety of information. As shown in Table I, this model is the second smallest in size, with only 7M parameters, compared to the high computational cost of the VGG16 or ResNet models. Hence, it is also computationally cost-efficient and feasible to deploy in industry.

Conclusion
This study demonstrates our novel hybrid signal decomposition approach for industrial fault diagnosis and the proposed transfer learning approach based on DenseNet121. The following two remarks highlight the primary contributions of our research. First, the time-domain fault signals are processed by a hybrid signal decomposition procedure that employs both EMD and VMD, and are converted into grayscale image format, which is the input data type of DenseNet121. Second, by coupling transfer learning with it, a new transfer learning framework (DenseNet121) with 121 layers of depth is constructed. The proposed scheme has deeper network layers and more effective feature extraction layers. As a result, the suggested model improves the final fault diagnosis prediction performance when tested on the CWRU dataset. Compared to other deep learning models, it has produced significant results with a 100% F1 score. Furthermore, the transfer learning framework is examined on ResNet18, ResNet34, ResNet50, VGG16, and MobileNetV2, with the findings revealing that transfer learning with DenseNet121 is the best among them, indicating that it has outstanding fault diagnostic capability. Future work includes optimizing the transfer learning process without hampering performance or increasing training time. As a limitation, we employed only the CWRU dataset as the validating dataset for the consistency of the model. Because of noise or variations in motor speed, the accuracy can suffer if a dataset with different characteristics is used. Our study was conducted on a well-balanced dataset, and a structured dataset is more straightforward to use than an inconsistent one, whereas real-world implementations are characterized by unbalanced datasets. We intend to test our proposed model on comprehensive real-world data to assess and enhance performance in the future.
The hybrid signal technique should also be tested on other signals derived from various rotating machines and sources. We intend to take our research further to detect the conditions of faulty signals on different industrial equipment.