Detection of Lung Nodules on CT Images based on the Convolutional Neural Network with Attention Mechanism

Khai Dinh Lai, Thuy Thanh Nguyen and Thai Hoang Le, “Detection of lung nodules on CT images based on the Convolutional Neural Network with Attention Mechanism”, Annals of Emerging Technologies in Computing (AETiC), Print ISSN: 2516-0281, Online ISSN: 2516-029X, pp. 77-89, Vol. 5, No. 2, 1st April 2021, Published by International Association of Educators and Researchers (IAER), DOI: 10.33166/AETiC.2021.02.007, Available: http://aetic.theiaer.org/archive/v5/v5n2/p7.html. Research Article


Introduction
Lung cancer is the leading cause of cancer death for both men and women. According to the annual statistical report of the American Institute for Cancer Research, there were 2 million new cases of lung cancer in 2018¹. The disease develops silently, without symptoms or warning signs, which makes diagnosis very difficult. By the time patients reach the final stages of the disease, it is usually fatal. Regular cancer screening for early detection and treatment is therefore needed to reduce mortality.
There are many noninvasive imaging modalities for detecting and diagnosing lung nodules, such as positron emission tomography (PET), computed tomography (CT), low-dose computed tomography (LDCT), and contrast-enhanced computed tomography (CE-CT). However, they are used in different contexts: PET scans are used to differentiate between benign and malignant nodules, while CT and LDCT are used as the basis for early detection of nodules.
Figure 1. 2D CT scanning slice for an early stage nodule for lung cancer [1]
Computer-aided diagnosis (CAD) can help doctors interpret medical images, enabling the more sensitive and accurate cancer diagnosis that is crucial for patients [2]. The aim is to develop a highly accurate CAD system that improves processing speed and reduces cancer screening costs for patients. Such a system takes diagnostic images as input, rapidly recognizes whether a patient has tumors, and distinguishes benign tumors from malignant ones. One of the first requirements for early-stage cancer diagnosis is the ability to detect tiny nodules (< 10 mm in diameter). A further complication is that a CT scan is crowded with surrounding tissue, bone and air, which act as noise and must be pre-processed before the CAD system can search successfully. A typical CAD system for lung cancer is made up of the modules shown in Figure 2.
Figure 2. Four phases of a typical CAD system for diagnosing lung tumors
LUNA16² challenged researchers to focus on pulmonary nodule detection and false-positive reduction in order to establish a full nodule detection system based on data provided by LIDC/IDRI. Pulmonary nodule detection is a candidate detection phase designed to provide a simpler representation of lung anatomy: structures such as the lung wall and large airways are omitted, leaving only the regions with a greater potential to be nodule candidates.
The aim of the false-positive reduction step is to exclude false-positive candidates from the pool obtained in the candidate detection phase. The removal mechanism should retain a relatively high sensitivity. Therefore, the false-positive reduction phase needs a strong discriminating capacity to separate nodule from non-nodule objects.
LUNA16 provides a dataset of potential tumor candidates. The problem is how to reduce this set so that only the candidates most likely to contain tumors remain. To solve this, we need a classifier that takes a candidate as input and decides whether or not it contains a tumor. Once this classification has been done, the list is shortened to the candidates most likely to be nodules. The classification of a candidate is shown in Figure 3.
Nodule candidates are classified into nodules and non-nodules using two broad categories of methods: (i) conventional feature-based classifiers and (ii) convolutional neural networks (CNNs). Conventional feature-based classification combines feature extraction with nodule candidate classification techniques. In recent years, deep learning has progressed rapidly, especially in image processing problems such as facial expression recognition and medical image analysis [3]. Many works have applied deep learning to medical image processing, and CT images account for a large share of them. We therefore summarize related research that used CNNs for pulmonary nodule detection in section 2. Recently, based on the candidate coordinate dataset given by LUNA16, the author³ cropped the candidates into 50x50 image frames and built a model to classify them as tumor or non-tumor. This model has a precision of 89.3%, a recall of 71.2% and a specificity of 98.2%. However, it does not exploit the specific characteristics of medical images.
To address these drawbacks, in [4] we proposed the general model shown in Figure 4, called Nodule Candidate Detection. The experiments relied on the 50x50 candidate images collected by the author³. Based on the structure of the model in figure 4, three models were generated, as in figures 4, 5 and 6 in section 3. In more detail, each of these models consists of several sub-convnets combined using the attention mechanism, which takes into account the specific characteristics of the data. We also considered choosing between softmax loss and triplet loss for the classification stage. The models are compared with each other on the same dataset in terms of precision, recall and specificity, and the best one is selected. The final model is then compared with the one in the study³. The results show the feasibility of the proposed models. Our work proceeds as follows: (1) we built a model with Attention convolutional layers that uses triplet loss for both training and validation, which we call the ATT model; (2) we built a model like ATT, except that the triplet loss is replaced with softmax loss in the classification stage, so that softmax loss is used for both training and validation, which we call the ASS model; (3) we trained a model named AST, which uses the ASS model as a pre-trained model and is validated with triplet loss on the validating dataset; (4) we trained the best of these models on both the training and validating subsets.
The rest of the paper is organized as follows. Section 2 reviews the literature. Section 3 describes in detail how the architectures of the proposed models are built. In section 4, we explain how the data are prepared, how the models are evaluated, and the results obtained. We also discuss and pick the best model for our problem.

Lung nodule database
Image data are taken from the Lung Image Database Consortium and Image Database Resource Initiative (LIDC/IDRI)⁴, the largest public reference database of lung nodules. The database originally contained 1018 CT scans; after scans with a slice thickness of more than 3 mm or with inconsistent slice spacing were discarded, 888 CT scans remained. These scans are provided as MetaImage (.mhd) files. Each LIDC-IDRI scan was annotated and labelled as nodule ≥ 3 mm, nodule < 3 mm, or non-nodule (any other pulmonary abnormality) by experienced thoracic radiologists in a two-step reading process: a blinded reading phase and an unblinded reading phase. We use the reformatted version, LUNA16. This dataset includes CT scans with notes describing the coordinates of the nodule region and ground truth labels [5].
LUng Nodule Analysis 2016 (the LUNA16 challenge) is a completely open challenge that concentrates on a large-scale evaluation of automatic nodule detection algorithms on the LIDC/IDRI dataset. LUNA16 offers the same data and annotations to every participant, so that all participants train their models and evaluate their algorithms on equal terms. The challenge encourages the research community to take part in one or both of the following tracks: complete nodule detection (NDET) and false positive reduction (FPRED), corresponding to phases 2 and 3 of the CAD system.
In the complete nodule detection track, participants are asked to build a complete nodule detection system. The input is the set of images from the given database, and the output is the set of candidates that the system identifies as nodules.
In the false positive reduction track, research teams are expected to label given positions in the scans as nodules or non-nodules. This is a two-class (nodule/non-nodule) classification problem, and the track encourages research groups with experience in classification problems to take part even without a background in medical image analysis.

False positive reduction systems
The false positive reduction track attracted many research groups, and the results were reported on the LUNA16 website. However, LUNA16 stopped accepting new submissions in 2018. In this section, we briefly cover some of the notable results of the false positive reduction track over the last few years.

CUMedVis
Dou et al. [6] used multi-level contextual 3D ConvNets to build a framework called CUMedVis in 2016. The framework consists of three separate 3D ConvNet architectures (Archi-1, Archi-2, Archi-3), proposed to address the challenges arising from variation in nodule size, shape and geometric characteristics. Each network takes an input with a different receptive field in order to incorporate a different level of contextual information surrounding the pulmonary nodule. In the last stage, the outputs of the three architectures are fused by a weighted linear combination to give the final classification of the candidate.

DIAG_CONVNET
Setio et al. [7] proposed a multi-view convolutional network-based lung nodule detection system whose candidates are obtained by combining three candidate detectors specifically designed for solid, subsolid, and large nodules. This system is listed as DIAG_CONVNET on the LUNA16 website. For each candidate, the authors extract nine 65 × 65 patches of 50 × 50 mm from views corresponding to different planes of symmetry of a cube. These views are processed by streams of 2D ConvNets composed of three convolutional layers with max-pooling layers. The convolutional layers have, in turn, 24 kernels of 5 × 5, 32 kernels of 3 × 3, and 48 kernels of 3 × 3. The final detection step is performed using multiple streams of 2D convolutional networks and a dedicated fusion method. Their method reaches high detection sensitivities of 85.4% and 90.1% at 1 and 4 false positives per scan, respectively. However, the approach in [7] requires a costly dataset in which subsolid nodules, solid nodules and large subsolid nodules are labeled separately. In addition, training each detector is quite time-consuming, and the cost of the fusion step and of operating the system is high.

3D DCNN
Ding et al. [8] proposed a lung nodule detection system that applies a region-based CNN for nodule detection on image slices and a 3D CNN to reduce false positives per scan. First, a deconvolutional structure is added to the Faster Region-based Convolutional Neural Network (Faster R-CNN) for candidate detection on axial slices. A three-dimensional DCNN is then used for the subsequent false-positive reduction. The system was evaluated on the Lung Nodule Analysis Challenge (LUNA16) dataset and achieved a high sensitivity (94.4%) at only four false positives per scan. The authors took advantage of VGG-16 and used a region proposal network to attain high efficiency, but the expense of developing the region proposal network is not low.

Attention mechanism
In 2017, Jie Hu et al. [9] proposed an attention mechanism to improve the accuracy of deep networks for image classification. The motivation is that, within a feature map with $C$ channels, not all channels are equally crucial to the final decision of the network; some are more important and should be focused on. An attention (self-attention) mechanism is therefore needed to emphasize those essential channels. The authors proposed a block (similar to a ResNet block [10]) called the Squeeze-and-Excitation block. This block can be built upon a transformation $F_{tr}$ that maps the input feature $X \in \mathbb{R}^{H' \times W' \times C'}$ to the output feature $U \in \mathbb{R}^{H \times W \times C}$. The output feature is then squeezed by a Global Average Pooling operation to form a vector $z = F_{sq}(U) \in \mathbb{R}^{1 \times 1 \times C}$. Subsequently, the vector $z$ is nonlinearly transformed by a fully connected layer, which maps $z$ to $s = \sigma(F_{fc}(z, W)) \in \mathbb{R}^{1 \times 1 \times C}$, where $W \in \mathbb{R}^{C \times C}$ is the weight of the fully connected layer and $\sigma$ is the sigmoid function. The final sigmoid function constrains the value range of the vector $s$ to $[0,1]$. Here, $s$ can be seen as the attention gate. More specifically, each element $s_c$ corresponds to a channel $u_c$ of $U$: if the value of $s_c$ is close to 1, the channel $u_c$ is more important. After obtaining the attention gate $s$, the attended output feature is computed as $\hat{U} = s \times U$. Multiplying the output feature $U$ by the vector $s$ suppresses the less crucial channels while retaining the important ones.
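The squeeze, gate and rescale steps described above can be illustrated with a minimal pure-Python sketch. It assumes a single fully connected layer without a bias term, as in the description above; the function name `se_attention`, the toy feature maps and the weight matrix are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def se_attention(U, W):
    """Channel attention on feature maps U.

    U: list of C channels, each an H x W list of lists.
    W: C x C weight matrix of the fully connected layer (hypothetical values).
    """
    C = len(U)
    # Squeeze: global average pooling gives one scalar per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in U]
    # Gate: fully connected layer followed by a sigmoid, so each s[c] is in (0, 1).
    s = [sigmoid(sum(W[c][k] * z[k] for k in range(C))) for c in range(C)]
    # Rescale: every channel is multiplied by its own gate value.
    U_hat = [[[s[c] * v for v in row] for row in U[c]] for c in range(C)]
    return z, s, U_hat

# Toy input with C = 2 channels of size 2 x 2.
U = [[[1.0, 3.0], [5.0, 7.0]],   # channel 0, mean 4.0
     [[0.0, 0.0], [0.0, 2.0]]]   # channel 1, mean 0.5
W = [[1.0, 0.0], [0.0, -4.0]]    # hypothetical weights
z, s, U_hat = se_attention(U, W)
```

With these illustrative weights the gate for channel 0 is close to 1 and the gate for channel 1 is close to 0, so channel 1 is strongly suppressed in `U_hat`.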

Loss function: Triplet loss and Softmax loss

Triplet loss
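The main purpose of triplet loss [11] is to distinguish identities by minimizing the distance between an anchor and a positive, both of which have the same identity, and maximizing the distance between the anchor and a negative of a different identity. Thus, the desired criterion is
$$\|f(x_i^a) - f(x_i^p)\|_2^2 + \alpha < \|f(x_i^a) - f(x_i^n)\|_2^2, \quad \forall \, (f(x_i^a), f(x_i^p), f(x_i^n)) \in T,$$
where $x_i^a$ is an anchor image, $x_i^p$ is a positive image, $x_i^n$ is a negative image and $f(\cdot)$ is the embedding produced by the network. $\alpha$ is a margin that is enforced between positive and negative pairs, and $T$ is the set of all possible triplets in the training set. The loss is computed with formula (1), where $N$ is the number of candidates in the training set:
$$L = \sum_{i=1}^{N} \left[ \|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha \right]_+ \quad (1)$$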
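As a minimal sketch of how such a loss can be evaluated, the snippet below sums, over a list of triplets, the hinge of the squared anchor-positive distance minus the squared anchor-negative distance plus a margin. The embeddings are toy 2D vectors and the margin value of 0.2 is an assumption made for illustration; the paper does not state the margin it used.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchors, positives, negatives, margin=0.2):
    """Sum over triplets of max(d(a, p) - d(a, n) + margin, 0).

    margin=0.2 is a hypothetical default, not a value from the paper.
    """
    total = 0.0
    for a, p, n in zip(anchors, positives, negatives):
        total += max(sq_dist(a, p) - sq_dist(a, n) + margin, 0.0)
    return total

# Two toy triplets of 2D embeddings (illustrative values only).
anchors   = [[0.0, 0.0], [0.0, 0.0]]
positives = [[0.0, 1.0], [1.0, 0.0]]
negatives = [[3.0, 0.0], [0.0, 1.0]]
loss = triplet_loss(anchors, positives, negatives)
```

The first triplet already satisfies the margin criterion and contributes nothing; the second has equal positive and negative distances, so it contributes exactly the margin.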

Softmax loss
Softmax takes a vector of $K$ real numbers as its input and normalizes it into a probability distribution of $K$ probabilities proportional to the exponentials of the input values. That is, prior to applying softmax, the vector components can be arbitrary real numbers and might not sum to 1; after applying softmax, each component lies in the interval (0,1) and the components sum to 1. The softmax loss combines softmax with a cross-entropy loss and has the form
$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_{j} e^{f_j}}\right),$$
where $f_j$ means the j-th element of the vector of class scores $f$ and $y_i$ is the true class of sample $i$.
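The following sketch shows the softmax normalization and the resulting cross-entropy loss for a single sample with two classes. The function names and score values are illustrative assumptions; subtracting the maximum score before exponentiating is a common numerical-stability step and does not change the result.

```python
import math

def softmax(scores):
    """Normalize a list of real-valued class scores into probabilities."""
    m = max(scores)                          # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_loss(scores, label):
    """Cross-entropy of the softmax probabilities against the true class index."""
    return -math.log(softmax(scores)[label])

# Hypothetical scores for (non-nodule, nodule); they need not lie in (0, 1).
scores = [2.0, -1.0]
probs = softmax(scores)
loss_if_non_nodule = softmax_loss(scores, 0)   # true class has the larger score
loss_if_nodule = softmax_loss(scores, 1)       # true class has the smaller score
```

The loss is small when the true class receives the larger score and large otherwise, which is what drives the weight updates during training.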

Attention sub-Convnet
As described in section 2, the attention approach was used to improve the precision of image classification in deep networks by selecting the useful channels in the feature maps. We developed a small convolutional network called the Attention sub-Convnet, whose architecture is given in table 2. Specifically, a sub-convnet is made up of the following components. First, the convolutional layer accepts HxW-sized images with C' initial channels and convolves them with a KxK kernel to produce feature maps of size HxW with C channels. Second, the pooling layer applies average pooling over each HxW feature map, producing one 1x1 feature map per channel; these C scalars together form a vector z of length C. Third, the fully connected layer applied to z generates C output neurons. Next, the sigmoid function normalizes the components of the resulting vector to values in the range [0; 1]. Finally, the sub-convnet multiplies the HxWxC feature maps by the normalized vector, channel by channel.

ATT model
In this scenario, we used three Attention sub-Convnets, and we used triplet loss in the classification stage for both training and validating. Each Attention sub-Convnet follows the standard structure referred to in section 2.1 with specific values: H and W are both equal to 50 pixels, the height and width of the image. The initial channel number C' is 32. We use a 5x5 kernel with a 2-stroke to perform a convolution on each image. After the first Attention sub-Convnet, the 50x50x32 feature maps pass through a max pooling layer with a 2x2 kernel and become 25x25x32 feature maps. The second Attention sub-Convnet is applied with C equal to 64 and a 5x5 kernel, giving feature maps of size 25x25x64. The feature maps then pass through the third Attention sub-Convnet with the same number of channels, this time with a 3x3 kernel. A max pooling layer then reduces the feature maps to 13x13x64 before they are flattened into a vector of 10816 dimensions.
To eventually obtain the 512 neurons, we use two fully connected layers (see figure 6).
Figure 6. The architectures of ATT: (a) training phase of ATT; (b) validating phase of ATT
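The feature-map sizes quoted for ATT can be checked with a short shape trace. This is a sketch under two assumptions that the text implies but does not state: the convolutions preserve the spatial size, and the 2x2 max pooling uses a stride of 2 and keeps the last partial window (which is how 25 becomes 13 rather than 12). The helper `pool_out` is hypothetical.

```python
import math

def pool_out(size, kernel=2, stride=2):
    """Output length of a pooling layer that keeps the last partial window."""
    return math.ceil((size - kernel) / stride) + 1

side = 50                        # input height and width in pixels
after_first_pool = pool_out(side)               # 50 -> 25
after_second_pool = pool_out(after_first_pool)  # 25 -> 13
channels = 64                    # channels after the third Attention sub-Convnet
flattened = after_second_pool * after_second_pool * channels
```

Under these assumptions the flattened length is 13 x 13 x 64 = 10816, matching the dimension stated for the vector fed into the fully connected layers.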

ASS model
As in ATT, we use Attention sub-Convnets for the convolution layers. However, softmax loss is used instead of triplet loss in the classification phase in all three processes: training, validating and testing (see figure 7). In this architecture, the output consists of only 2 neurons representing the probability distribution over the two classes: nodule (positive) and non-nodule (negative).

AST model
The ASS model was used as a pre-trained model. We used the triplet loss to adjust the parameters of the model. Finally, the triplet loss was used once again for testing (see figure 8).
There are 551065 annotations in total. Of these, 1351 are labeled as positive (a nodule), while the rest are labeled as negative (not a nodule). The author³ cropped the images around the coordinates provided in the annotations, creating new grayscale images of 50 x 50 pixels to train, validate and test the CNN model. However, the data are imbalanced because the number of images containing a tumor is very small relative to the whole dataset. To address this, the author³ rotated images by 90 and 180 degrees to create more images containing a tumor (see figure 7). The final dataset used in this study includes 8106 images of 50x50 pixels divided into 3 directories. The training folder includes 5187 images, with 845 positive labels and 4342 negative labels. The validating folder includes 1297 images, with 224 positive labels and 1073 negative labels. The testing folder contains 1622 images, with 282 positive labels and 1340 negative labels. The ratio of the two classes is therefore roughly 20:80.
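The folder counts above can be cross-checked with a few lines of arithmetic. The sketch below only uses the numbers quoted in the text; the dictionary layout and variable names are illustrative.

```python
# Image counts per folder, as quoted in the text.
splits = {
    "training":   {"positive": 845, "negative": 4342},
    "validating": {"positive": 224, "negative": 1073},
    "testing":    {"positive": 282, "negative": 1340},
}

sizes = {name: c["positive"] + c["negative"] for name, c in splits.items()}
total = sum(sizes.values())
positives = sum(c["positive"] for c in splits.values())

# Share of the whole dataset held by each folder, rounded to two decimals.
split_share = {name: round(n / total, 2) for name, n in sizes.items()}
# Share of positive (nodule) images over the whole dataset.
positive_share = positives / total
```

The folder sizes add up to the stated 8106 images and correspond to shares of about 64%, 16% and 20%. The positive share computed from these counts is about 16.7%, so the quoted 20:80 class ratio is best read as an approximation.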

Environment for experiments
All models were implemented in the same environment: they used the same training, validating and testing datasets and were run on the same computer system. Weights were initialized from a Gaussian distribution with a mean of zero and a standard deviation of 0.02, and biases were initialized to zero. We chose Adam [12], an algorithm for gradient-based optimization of stochastic objective functions based on adaptive estimates of lower-order moments. This method requires little memory, is simple to implement, is computationally efficient, and is invariant to diagonal rescaling of the gradients. We used Adam with a learning rate of 1e-4 and a weight decay of 1e-6. For each case, we performed about 50,000 steps, corresponding to 50 epochs with a batch size of 8 images. We implemented early stopping for our convolutional neural networks, which not only avoids spending time on training after the performance has converged but also helps avoid overfitting.
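The paper does not give the early stopping rule it used, so the following is only a generic sketch of one common variant: training stops once the validation loss has failed to improve for a fixed number of consecutive epochs. The function name, the `patience` and `min_delta` parameters and the loss values are all hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops at the first epoch that is `patience` epochs past the
    best one; if that never happens, it stops at the last epoch.
    """
    best = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch = loss, epoch      # new best validation loss
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch            # no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch

# Illustrative validation losses per epoch (not results from the paper).
losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.7]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
```

In this toy run the best loss occurs at epoch 2 and training stops at epoch 5; in practice the weights saved at the best epoch would be the ones kept.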

Evaluation
We used the holdout method to evaluate the models. To get objective results, we split the data into three independent directories for the training, validating and testing sets, with ratios of 64%, 16% and 20%, respectively. We applied the following steps to each model in turn. First, the training data are used to fit the parameters and create the model. Second, the validating set is used to determine the accuracy of the model; if the validation accuracy is poor, the parameters are adjusted further through back-propagation. Finally, once the final model has been obtained, its accuracy is tested on the testing data.
Our task is a binary classification problem, which means there are only two possible results: positive and negative. The result is positive if the image contains a tumor and negative if it does not. The results of the model testing process fall into four groups in the confusion matrix: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). True positive: an image containing a tumor is classified as positive; false positive: an image without a tumor is classified as positive; true negative: an image without a tumor is classified as negative; false negative: an image containing a tumor is classified as negative. The measurements we used for our comparative analyses were precision, recall, specificity, AUC and validation loss [13].
Specificity measures the proportion of negatives that are correctly identified, that is, the percentage of images without a nodule that are classified as negative.
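The three count-based measures follow directly from the four confusion-matrix groups. The sketch below uses the standard definitions; the counts in the example are made up for illustration and are not results from the paper.

```python
def precision(tp, fp):
    """Share of images classified as positive that really contain a nodule."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of nodule images that are classified as positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of non-nodule images that are classified as negative."""
    return tn / (tn + fp)

# Hypothetical confusion-matrix counts: 100 nodule and 400 non-nodule images.
tp, fn, fp, tn = 80, 20, 20, 380
p = precision(tp, fp)      # 80 / 100
r = recall(tp, fn)         # 80 / 100
s = specificity(tn, fp)    # 380 / 400
```

Because negatives far outnumber positives in this kind of dataset, specificity can stay high even when precision and recall are noticeably lower, which is why all three are reported together.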

Experimental results
After implementing the models of section 3, we obtained the results shown in the diagrams and corresponding tables (see figure 10 and tables 3 and 4). The tables show that ATT yielded worse results than ASS and AST on all comparison measures; we explain why in section 4.5. The experiments showed that AST, the proposed network that combines multiple Attention sub-convnets with two fully connected layers, is trained with softmax loss and is validated with triplet loss, gave the best result. Therefore, we decided to re-train the AST model, this time on both the training and validating datasets. The area under the ROC curve of the final model is as high as 0.9923 (see figure 11). The number of false positives decreased dramatically, while the TP and TN rates improved. We compared our results with those of the study³, since both were obtained on the same dataset; the comparison is shown in Table 5.

Discussion
The aim of the training phase is to adjust the parameters so as to optimize the objective function on the training data while remaining highly effective when tested. For classification, ATT uses the triplet loss, which minimizes the distance between a sample and the samples with the same label and maximizes the distance between it and the samples with other labels. If samples with the same label are similar in form, training a model with a triplet loss is simple and highly efficient. Unfortunately, the nodule dataset is very complex and the samples are not alike. This variation appears in both nodule and non-nodule images. That is, even when two samples share the positive label, their forms can be quite different. In other words, the distance between same-label samples is not necessarily short. Therefore, training with a triplet loss may lead to overfitting, because the triplet loss tries to adjust the parameters to match the training dataset as closely as possible. Since the nodule images are so complicated, as mentioned above, the system can then easily make mistakes during testing. Additionally, the data are so complex that the target value is hard to reach, which is why ATT has the highest loss among the models. Nevertheless, the triplet loss was successful in the testing phase, when the two classes had already been explicitly separated. Softmax is flexible in weight adjustment as long as the inputs are assigned to classes, without depending on the characteristics of the data. The ability of softmax loss to prevent overfitting is evaluated in [14]. In specific applications, a suitable loss should be chosen in order to obtain the highest performance.

Conclusions
In this paper, we propose a new approach to one of the two problems of lung nodule detection, namely the false-positive reduction challenge: classifying whether or not a candidate is a nodule. We proposed a general model composed of several Attention sub-Convnets. From the general model, we generated three models, ATT, ASS and AST. Experimental and theoretical assessments showed how the choice of softmax loss or triplet loss affects the effectiveness of each model. We therefore propose AST, the most effective model, for the nodule detection problem on LUNA16. Our final result has a precision of 95%, a recall of 86.4% and a specificity of 98.9%. After classification, the system dramatically reduced the number of non-tumor candidates.