Hand Gesture-based Sign Alphabet Recognition and Sentence Interpretation using a Convolutional Neural Network

Md. Abdur Rahim, Jungpil Shin and Keun Soo Yun, "Hand Gesture-based Sign Alphabet Recognition and Sentence Interpretation using a Convolutional Neural Network", Annals of Emerging Technologies in Computing (AETiC), Print ISSN: 2516-0281, Online ISSN: 2516-029X, pp. 20-27, Vol. 4, No. 4, 1st October 2020, Published by International Association of Educators and Researchers (IAER), DOI: 10.33166/AETiC.2020.04.003, Available: http://aetic.theiaer.org/archive/v4/v4n4/p3.html.


Introduction
Sign language (SL) involves movements of different parts of the body, such as the face and hands, which deaf and hearing-impaired people use to interact with hearing people. Hand gestures are particularly important in the recognition of SL, which has its own structure and grammar, varies with the fluency of the signer, and involves both static and dynamic movements. In this work, we use only static images for all American Sign Language (ASL) gestures, and present an ASL alphabet recognition system that recognizes different hand gestures, assesses various approaches to the recognition of the signs, and finally interprets them as a meaningful sentence. We develop a sign dataset based solely on the shape and orientation of the hand; movements of the face and head are not considered.
Many researchers have contributed to the study of human-computer interaction with the aim of improving communication between deaf people and the general public. McKee et al. [1] developed a health literacy system to establish adequate health literacy connections between deaf ASL users and English speakers. In [2], the authors described the importance of SL recognition with respect to hand and face signs; however, the recognition of sign gestures under varying levels of illumination and against complex backgrounds remains a major concern. An ASL character recognition system was proposed in [3], although the authors considered only five ASL characters. Hand gesture-based recognition systems for isolated sign words were proposed in [4], in which feature fusion techniques were used to detect the sign words. In [5], the authors discussed SL recognition based on hand size and hand movement information obtained via wearable devices. Glove-based gesture recognition requires the user to wear a data glove in order to generate gesture-related information, which can be uncomfortable and unhygienic.
In [6], a classification system based on a hidden Markov model and a depth sensor was proposed to learn sign gestures; however, fingertip detection under different illumination conditions was not clearly explained. The fingerspelling alphabet of an ASL recognition system was presented in [7], where an SVM was used to classify the signs; however, this approach cannot handle overlapping fingers. In [8], the authors proposed pre-processing in the HSV color space and applied hand gesture segmentation based on skin pixels, but the proposed system is unable to reduce noise in the input image. Depth sensor-based ASL recognition was proposed in [9]; this system covered 26 characters and 10 numbers. However, the Leap Motion sensor produces a large number of features, and it is therefore difficult to identify the most effective ones.
In the present paper, a non-wearable device is used to collect input from 26 ASL alphabet gestures and to pre-process it using the proposed method. Feature fusion with a CNN is used to extract the features. Our system takes various features of the gesture image as input and executes a convolution process. The features are then analyzed for gesture classification.
The rest of this paper is structured as follows: Section 2 briefly discusses the details of the image dataset, the pre-processing of input images, feature extraction, and the classification processes of the proposed system. Section 3 explains the experimental results and their implications. Section 4 summarizes this work and gives an outline of future work.

Proposed System
In this study, the proposed system is used to detect hand gestures and to interpret a meaningful word or sentence from a continuous performance of ASL gestures. The system consists of four main steps: data collection, segmentation, feature extraction, and finally classification. Fig. 1 illustrates the proposed ASL alphabet recognition and sentence interpretation system.

Description of the Image Dataset
In order to create an image dataset, data were taken from a live camera, and hand gesture images were obtained from the region of interest (ROI) and used as input. In this approach, no accessories such as gloves or wearable devices are needed. The dataset contained 26 ASL gestures representing the alphabet, and the images had a resolution of 50 x 50 pixels. A total of 23,400 images were collected for the 26 ASL gestures, with 900 images captured for each gesture. Each image contained hands at different locations, and the images were obtained under different lighting conditions and against differing backgrounds. Fig. 2 shows examples of the images in the ASL dataset.

Image Pre-processing
In this system, a static hand gesture image captured via a webcam was used as input. A specific area of the frame is designated as the region of interest (ROI) from which hand gestures are identified, so that unnecessary areas of the video frame can be ignored. A Gaussian blurring technique was used to smooth the image and reduce noise in the input. We then applied the Otsu method to pre-process the input image; this provides an intuitive way of automatically selecting a threshold based on the statistics of the image's histogram, and acts as a form of image segmentation. The quality of thresholding depends on the selected threshold intensity, and the optimal value is determined from the histogram of the image. The proposed method minimizes the intra-class intensity variance [10] and determines the optimal threshold using Equation (1).
σw²(t) = ω₁(t)σ₁²(t) + ω₂(t)σ₂²(t)                (1)
where ω₁(t) and ω₂(t) are the class probabilities, t is the threshold, and σw²(t) is the weighted sum of the class variances. The class probabilities were calculated from the histogram, and the value of t that gives the smallest σw²(t) is selected as the threshold. To calculate this value, we created a normalized histogram that defines P(i), where P(i) gives the proportion of pixels with intensity i. In addition, we calculated the weights, means, and variances of each class; Table 1 describes this calculation process.
Table 1. Calculation process of weights, means, and variances.

Function | Equation | Description
Weights | ω₁(t) = Σ_{i=0}^{t} P(i),  ω₂(t) = Σ_{i=t+1}^{L-1} P(i) | The left class covers intensity values 0 to t; the right class covers t+1 to L-1.
Means | μ₁(t) = Σ_{i=0}^{t} i·P(i)/ω₁(t),  μ₂(t) = Σ_{i=t+1}^{L-1} i·P(i)/ω₂(t) | Probability and weight are used to compute the class means.
Variances | σ₁²(t) = Σ_{i=0}^{t} (i−μ₁(t))²·P(i)/ω₁(t),  σ₂²(t) = Σ_{i=t+1}^{L-1} (i−μ₂(t))²·P(i)/ω₂(t) | Weights and means are used to compute the class variances.
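The threshold selection described above can be sketched in a few lines. The following is a minimal NumPy implementation of Otsu's method (an illustrative sketch, not the authors' code): for each candidate threshold t it computes the class weights, means, and variances from the normalized histogram P(i) and keeps the t that minimizes the within-class variance σw²(t).

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold t minimizing the within-class variance
    sigma_w^2(t) = w1(t)*var1(t) + w2(t)*var2(t).
    Here class 1 holds intensities below t and class 2 the rest."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()                   # normalized histogram P(i)
    i = np.arange(256)
    best_t, best_var = 0, np.inf
    for t in range(1, 256):
        w1, w2 = p[:t].sum(), p[t:].sum()   # class weights
        if w1 == 0 or w2 == 0:              # skip empty classes
            continue
        mu1 = (i[:t] * p[:t]).sum() / w1    # class means
        mu2 = (i[t:] * p[t:]).sum() / w2
        var1 = (((i[:t] - mu1) ** 2) * p[:t]).sum() / w1
        var2 = (((i[t:] - mu2) ** 2) * p[t:]).sum() / w2
        wcv = w1 * var1 + w2 * var2         # sigma_w^2(t)
        if wcv < best_var:
            best_t, best_var = t, wcv
    return best_t
```

Applying Gaussian blurring before thresholding, as in the proposed system, smooths the histogram and makes the selected threshold more stable.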

Feature Extraction and Classification
Deep networks represent a remarkable technological advance and can be used to classify images and to identify and distinguish different aspects of images [11]. A convolutional neural network consists of one or more convolutional layers and is primarily used for image processing, classification, and related data processing. In this context, feature extraction obtains significant information by analyzing the input image via the proposed image processing technique. The extracted feature vectors are then fed to the classification process. These features must contain contextual information from the input, as this is used to distinguish specific gestures from others. We use a convolutional neural network (CNN) to extract the properties of the hand gestures, as shown in Fig. 3; features are extracted step by step by its convolutional layers. Although raw image data can be fed directly to deep convolutional networks for training and testing, we used filtered and segmented images as the input to this architecture. We used a kernel function to create a feature map, padding to prevent the feature maps from shrinking, and max-pooling to reduce the size of the features while retaining important information. In the proposed architecture, fusion of the features occurs at the fully connected layer, and the output is produced by a Softmax classifier.
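The convolution, padding, and max-pooling operations mentioned above can be illustrated with a small NumPy sketch (illustrative only; the actual system uses a trained CNN, not these hand-rolled loops):

```python
import numpy as np

def conv2d_same(img, kernel):
    """'Same' convolution: zero-padding keeps the feature map the
    same size as the input, i.e. it prevents the features shrinking."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros(img.shape, dtype=float)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            out[y, x] = np.sum(padded[y:y + kh, x:x + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Max-pooling reduces the feature map size while keeping the
    strongest activation in each window."""
    h = fmap.shape[0] // size * size
    w = fmap.shape[1] // size * size
    return (fmap[:h, :w]
            .reshape(h // size, size, w // size, size)
            .max(axis=(1, 3)))
```

For a 50 x 50 input and a 3 x 3 kernel, conv2d_same returns a 50 x 50 feature map, and max_pool then reduces it to 25 x 25.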

Experimental Results and Analysis
In this section, we evaluate our ASL recognition system based on our dataset. Two types of tests were performed to evaluate the recognition and interpretation of sentences using the ASL alphabet. The training and test datasets contained hand gestures performed by the same individual, while different individuals performed real-time ASL alphabet recognition and sentence interpretation based on the trained model.

Hand Gesture Segmentation
We segmented the hand gestures from the input image. To avoid the presence of unnecessary background and noise, the input image was filtered and the hand gesture images were segmented. These two types of images were used as input to the proposed model. Fig. 4 shows examples of segmented images from the image dataset.
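Since both the filtered image and the segmented image are fed to the model, the two can be stacked channel-wise into a single input array. A minimal sketch (the 50 x 50 values here are synthetic stand-ins, not images from the actual dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
filtered = rng.random((50, 50))              # stand-in for the Gaussian-filtered image
mask = (filtered > 0.5).astype(np.float32)   # stand-in for the segmented binary mask

# Stack the two images channel-wise so the network sees both inputs.
x = np.stack([filtered, mask], axis=-1)
print(x.shape)  # (50, 50, 2)
```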

ASL Recognition and Sentence Interpretation
In order to recognize the ASL alphabet and interpret meaningful sentences, we evaluated the different hand gestures. The CNN was trained on the entire dataset, and the architecture was evaluated with two types of input data: (i) grayscale filtered images; and (ii) grayscale segmented images. Our dataset contained 26 ASL gestures (the letters of the alphabet) with a total of 23,400 images. We used 70% of the images for training and 30% for testing; since both filtered and segmented images were used, a total of 46,800 images were processed, with 32,760 used for training and 14,040 for testing. The average recognition accuracy for the ASL alphabet is illustrated in Fig. 5, and the confusion matrix for the recognition of ASL gestures is presented in Fig. 6. Some misrecognition occurred due to the presence of similarly shaped gestures, as illustrated in Fig. 7. The highest accuracy was seen for the letter 'Y', and the lowest for 'M'. Our system achieved an average recognition accuracy of 96.59%. Table 2 compares the accuracy of our method with state-of-the-art alternatives: the reported accuracies of [6], [12], and [13] are 86.1%, 96.15%, and 94.2%, respectively, all lower than ours. In 3D images of the hand, the thumb is often interleaved with the other fingers, and the amount of training data can affect classification accuracy [6,12]. In this paper, we proposed segmentation techniques and introduced feature fusion, which enhance the reliability of recognition.
We also tested this system in real time with different individuals, recognizing the ASL alphabet and interpreting continuous sign gestures as meaningful sentences. Six respondents were asked to perform sign gestures at the ROI location.
The users continued to perform gestures to create sentences such as 'MY UNCLE DIVORCE' and 'SHE HAPPY BABY'. The system evaluated each gesture after pre-processing, feature extraction, and classification, and the performed gestures were classified based on the trained model. Fig. 8 shows the average recognition accuracy for each user, and Fig. 9 presents an example of a sign gesture being performed continuously in real time.
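The per-user and average accuracies reported above can be computed from a confusion matrix in the usual way. A minimal sketch (the mapping of letters 'A'-'Z' to class indices 0-25 is an assumption for illustration, not the authors' code):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=26):
    """Rows index the true letter, columns the predicted letter."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def average_accuracy(cm):
    """Fraction of gestures on the diagonal (correctly recognized)."""
    return np.trace(cm) / cm.sum()
```

Off-diagonal entries directly expose confusions between similarly shaped gestures, such as those noted for 'M'.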

Conclusion
In this paper, we introduce a stable hand gesture system for ASL recognition that is able to detect hand positions, translate gestures into text, and interpret meaningful sentences from continuously performed gestures. A dataset was created using a low-cost webcam, and the images were pre-processed. A Gaussian blurring technique was used to filter the images in the dataset, and the Otsu method was applied to determine the threshold value for segmentation of the hand. The filtered and segmented images were then fed into the feature extraction process. A two-channel CNN architecture was proposed to extract the features from the input images, and fusion of the features was implemented in the fully connected layer. A Softmax classifier was applied to classify the gestures. The experimental results show that the recognition accuracy for the ASL alphabet was 96.59%, better than other state-of-the-art systems. This system can also create meaningful sentences from the continuous performance of gestures. In future work, we intend to collect more data on dynamic hand gestures and to enrich the system with more signs/words.