Identification of the Exclusivity of Individual’s Typing Style Using Soft Biometric Elements

Mohd Noorulfakhri Yaacob, Syed Zulkarnain Syed Idrus, Wan Azani Wan Mustafa, Mohd Aminudin Jamlos and Mohd Helmy Abd Wahab, “Identification of the Exclusivity of Individual’s Typing Style Using Soft Biometric Elements”, Annals of Emerging Technologies in Computing (AETiC), Print ISSN: 2516-0281, Online ISSN: 2516-029X, pp. 10-26, Vol. 4, No. 5, 1st April 2021, Published by International Association of Educators and Researchers (IAER), DOI: 10.33166/AETiC.2021.05.002, Available: http://aetic.theiaer.org/archive/v5/v5n5/p2.html. Review Article


Introduction
Keystroke dynamics (KD) is a method of identifying users based on how they type on a keyboard [1,2]. This method does not require a large hardware investment; it only involves changes to the system or application. A system that uses this method must record the time interval between two letters that the user types. KD can be categorized as a behavioral biometric. Generally, recognition using KD or other behavioral biometrics is less popular than other biometric methods such as fingerprint, iris and DNA. Various recognition techniques have been used in KD studies to achieve higher accuracy. Previous studies have also combined KD recognition with a person's soft biometric features. An early study of this combination was done by Idrus, Cherrier [3] in 2012, which examined whether users typed with one or both hands.
Soft biometrics is a technique that can be used for user recognition. Each individual has his or her own unique identity that can be distinguished from each other [4]. The soft biometric information of each individual is not sufficient to distinguish a person accurately, but it does enhance the ability to identify when combined with other biometrics [5].
In the 19th century, the soft biometric recognition method was first used by Galton [6]. His research used three main features of soft biometrics: anthropometric measurements (arm length), scars and moles, and body shape. In 2001, soft biometric recognition techniques using the criteria of gender, race, eye colour and height were developed by Heckathorn, Broadhead [7].

Soft Biometrics Application for Keystroke Dynamics
Soft biometric criteria have been incorporated into KD since 2011, and various criteria have been used in these studies. The study of the integration of soft biometrics and KD was started in 2011 by Epp, Lippold [21], who combined emotional elements as soft biometric features with KD. Using the emotion elements (Confidence, Hesitance, Nervous, Relax, Sad, Fatigue, Anger, Joy and Happiness), the study achieved an accuracy of between 77.4% and 87.8%. In the same year, Fairhurst and Da Costa-Abreu [22] conducted a study on classifying gender (male or female) from typing. The 10-fold cross-validation method was used to analyze the data, and the accuracy obtained was 95%.
Later, in 2012, a study similar to that of Fairhurst was carried out by Giot and Rosenberger [23], which identified gender based on typing. The accuracy obtained in this study ranged from 87.32% to 91.63%. Subsequently, in 2013, emotional stress was used as a soft biometric feature in combination with KD [24]. The results of this study indicated whether or not the user was in a state of depression.
Similarly, a study conducted by Nahin, Alam [25] has shown that a user's emotions can be identified based on typing style. Seven categories of emotion were studied, namely anger, disgust, guilt, fear, joy, sadness and shame. The accuracy obtained when classifying users according to these emotions was above 80%.
Bakhtiyari, Taghavi [26] explored the use of emotional elements as a factor in how a user uses a keyboard, touch screen, and mouse. This research made comparisons with other methods regularly used to identify emotion, such as electroencephalography (EEG), facial expression, voice and body language. The highest accuracy obtained was 93.20%.
The combination of soft biometrics and keystroke dynamics led Idrus, Cherrier [27], [28] and Idrus [29] to conduct studies to identify users based on gender, age, left or right hand and handedness. The best EER obtained was 5.41%, using the majority voting technique. Subsequently, in 2015, Idrus continued the study using the penalty combination and reward combination techniques to reduce the EER obtained in 2014 [30]. The EER obtained for the reward combination was better than the penalty combination's 23.11%.
In 2016, Antal and Nemes [31] studied gender identification from typing, focusing on typing on a touch screen. The results showed that the detection accuracy was 64.76% on the keystroke dataset and 57.16% on the touch screen. Also in 2016, Idrus, Cherrier [32] conducted a study to classify typing into several soft biometric categories, namely gender, age range and handedness. The results showed an accuracy of 63% to 96%.
More recent KD studies related to soft biometrics were conducted by Kołakowska [33], continuing the study by Nahin, Alam [25] on identifying users' emotions during typing. Five emotions were studied: happiness, boredom, fear, anger and sadness. The study concluded that a user's emotions influence how the person types at that time, although the person's emotional control or personal strength also affects the typing. Katerina and Nicolaos [34] carried out the next KD-related study, examining KD together with mouse and hand movements while using a computer. Table 1 illustrates the results and soft biometric elements used in previous studies.
Based on the previous studies, it is apparent that the way a keyboard is used can be distinguished using soft biometric elements. Various soft biometric methods can be adapted in industry by constantly checking the recorded typing of a system user for changes, to determine whether the user is authentic or not [35].
The research done in this paper incorporates four soft biometric elements in the use of KD. The soft biometric elements used in this study were cultures in Malaysia, gender, the region of birth in Malaysia and educational level.

Identification Approach
Research on authentication systems using keystroke dynamics requires specific hardware and software to record information about each user's typing pattern. The computers used to record typing style were equipped with special software. Applications other than Windows were terminated so that the computer stayed in a normal condition during recording. Each user was instructed to type several words using the software on the provided computer. The interval between the letters typed by the user was recorded in the application's database and later analysed. Each category of soft biometrics surveyed involved two phases, namely a training phase and a testing phase. Support Vector Machine (SVM) was selected as the technique to analyse and classify the raw data. The user authentication accuracy rate was measured and calculated with SVM for each soft biometric category involved. The software used to execute SVM was MATLAB. Figure 1 shows an overview of the methodology used to perform this classification.

Individual Profiles Based on The Way of Typing
Classification of typing style was carried out based on four categories of soft biometrics: culture (Malays, Chinese and Indians), gender, educational level (CGPA, Cumulative Grade Point Average) and region of birth. This classification aims to separate how users in each category use the keyboard. There are two approaches used in KD studies: free text and fixed text. Free-text analysis is based on the time lapse between two consecutive letters, better known as digraphs, whereas in fixed-text analysis, all the times recorded for a word typed by the user are compared against the previously recorded times of that user. For example, the user is directed to type a text or password multiple times during the enrolment process. The typing time interval for each letter in the corresponding sentence is recorded. The system then makes the comparison when the user types the same sentence a second time and afterward. This study focuses on the fixed-text approach.
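The fixed-text comparison described above can be illustrated with a minimal Python sketch. It is not the recording software used in this study: the function names (`letter_intervals`, `enrol`, `distance`), the timestamps and the mean-absolute-difference score are all assumptions chosen for illustration.

```python
from statistics import mean

def letter_intervals(events):
    """Intervals (ms) between consecutive key presses.

    `events` is a list of (character, press_time_ms) tuples in typing order.
    """
    return [b[1] - a[1] for a, b in zip(events, events[1:])]

def enrol(samples):
    """Average the interval vectors of several typings of the same fixed text."""
    vectors = [letter_intervals(s) for s in samples]
    return [mean(column) for column in zip(*vectors)]

def distance(template, attempt):
    """Mean absolute difference (ms) between a template and a new attempt."""
    return mean(abs(t - x) for t, x in zip(template, letter_intervals(attempt)))

# Hypothetical timestamps (ms) for the fixed text "abc", typed twice at enrolment.
enrolment = [
    [("a", 0), ("b", 120), ("c", 300)],
    [("a", 0), ("b", 140), ("c", 320)],
]
template = enrol(enrolment)          # [130, 180]

# A later typing of the same text is compared against the stored template.
attempt = [("a", 0), ("b", 150), ("c", 310)]
score = distance(template, attempt)  # lower means closer to the enrolled profile
```

In a real system the score would be compared with a threshold or passed to a classifier; the choice of distance measure here is only one simple possibility.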

Data Analysis
The analysis of keystroke data in this study is based on four soft biometric criteria, namely culture, region of birth (ROB), gender and educational level. The cultures studied are the three main ethnic groups in Malaysia, namely Malay, Chinese and Indian [36], whereas the ROB category is divided into 4 parts: north, east, center and south of Peninsular Malaysia. For measurements based on educational level, the CGPA result was divided into two parts: 3.0 and above, and below 3.0. In addition, this study also incorporated other minorities in Malaysia (Bajau, Murut, Siam, Suluk, Iban, Kadazan, Bisaya, Kedayan, Iranum, Tidong etc.) and classified them into one group labelled "Others". Hence, for the soft biometric culture category, the total number of categories is four, namely the three main races in Malaysia and one other category. Support Vector Machine (SVM) was used during the classification process.
The kernel used in this study is the Radial Basis Function (RBF) [37]. The RBF kernel was selected due to its suitability for analyzing the non-linear data recorded in keystroke dynamics [38]. This kernel can also map the data into a high-dimensional space in which the classes can be separated. The data analyzed was separated into two parts, namely training data and test data. For example, if 1% of the total data analyzed is used as training data, then 99% is used as test data. This is the training process within the SVM. The process was repeated 100 times for training ratios from 1% up to 90%, and the average was recorded. The data for each pair of categories analyzed was labeled 1 and -1. For example, in the culture category, when comparing the typing of Malays and Chinese, the Malay data was labeled -1 and the Chinese data was labeled 1. This labeling was applied to each pair of classes analyzed. The items analyzed are shown in Table 2 below.
This section clarifies the statistical breakdown of the data collected for the four soft biometric criteria studied. The total number of volunteers involved was 250 people. Everyone was required to type the 5 given sentences correctly 10 times each. Therefore, the total amount of keystroke data obtained is 12500 (250 × 5 × 10). The statistical breakdown of the data is described in Figure 2 to Figure 5 below.
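The repeated train/test protocol described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the MATLAB SVM used in this study: the classifier is a simple stand-in that compares mean RBF similarity to each class, the data are synthetic, and the names `rbf_kernel`, `kernel_mean_predict` and `evaluate` as well as the value of `gamma` are hypothetical.

```python
import math
import random

def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def kernel_mean_predict(train, sample, gamma=0.5):
    """Stand-in classifier (NOT an SVM): choose the label whose training
    samples have the higher mean RBF similarity to `sample`."""
    scores = {}
    for label in (-1, 1):
        members = [x for x, y in train if y == label]
        scores[label] = (
            sum(rbf_kernel(x, sample, gamma) for x in members) / len(members)
            if members else 0.0
        )
    return 1 if scores[1] > scores[-1] else -1

def evaluate(data, train_ratio, iterations=100, seed=0):
    """Average test accuracy over repeated random train/test splits."""
    rng = random.Random(seed)
    n_train = min(len(data) - 1, max(1, round(len(data) * train_ratio)))
    accuracies = []
    for _ in range(iterations):
        shuffled = data[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        correct = sum(kernel_mean_predict(train, x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

# Synthetic two-feature timing data (units of 100 ms), labeled -1 and 1.
gen = random.Random(1)
data = [([gen.gauss(1.0, 0.1), gen.gauss(1.5, 0.1)], -1) for _ in range(40)]
data += [([gen.gauss(1.6, 0.1), gen.gauss(2.2, 0.1)], 1) for _ in range(40)]

# Accuracy averaged over 100 splits at a few training (learning) ratios.
curve = {ratio: evaluate(data, ratio) for ratio in (0.1, 0.5, 0.9)}
```

Replacing `kernel_mean_predict` with a trained RBF-kernel SVM would bring the sketch closer to the method described in the text; the splitting, labeling and averaging steps would stay the same.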

Experimental Results
This section describes the results of keystroke data analysis obtained.

Result Based on Culture
This study divides culture into the three main ethnic groups in Malaysia, namely Malays, Chinese and Indians. The other minority groups in Malaysia are placed in one category. Based on the entire data obtained (Figures 2 to 5), the statistical breakdown of the data is: Malays, 148; Chinese, 53; Indians, 38; Others, 11. According to these statistics, the amount of keystroke data from Indians was the lowest among the three main races. Malays made up the largest number of volunteers, with 148 participants contributing keystroke data. For analysis purposes, the number of participants was made equal for each pair of cultures compared. For example, only 53 randomly selected Malay records and all 53 Chinese records were used to compare these two classes. For comparisons with the Indians category, only 38 randomly selected records from the Malays or Chinese category were chosen. This is to balance the amount of data to be analyzed using SVM because, according to a study by Idrus, Cherrier [27], the amount of data in two different classes should be equivalent to enable the best analysis results by SVM.
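The balancing rule can be illustrated with a short Python sketch that uses the volunteer counts reported above. It assumes that every pair of cultures is cut down to the size of the smaller class; the helper names `balanced_pairs` and `undersample` and the volunteer identifiers are hypothetical.

```python
import random
from itertools import combinations

# Volunteer counts per culture as reported in the text.
counts = {"Malays": 148, "Chinese": 53, "Indians": 38, "Others": 11}

def balanced_pairs(counts):
    """Map each pair of classes to the per-class size after balancing."""
    return {
        (a, b): min(counts[a], counts[b])
        for a, b in combinations(counts, 2)
    }

def undersample(volunteer_ids, size, seed=0):
    """Randomly keep `size` volunteers from the larger class."""
    return random.Random(seed).sample(volunteer_ids, size)

pairs = balanced_pairs(counts)   # six class pairs from four cultures

# Hypothetical identifiers for the 148 Malay volunteers.
malay_ids = list(range(counts["Malays"]))
selected = undersample(malay_ids, pairs[("Malays", "Chinese")])  # 53 kept
```

Four cultures give six pairwise comparisons, which matches the six class pairs reported in the culture results.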
As explained in the previous section, all volunteers were required to type the 5 given sentences correctly 10 times. The first 3 of the 10 typing attempts were not analyzed, to give the volunteers the opportunity to familiarize themselves with the words and the sequence of letters in each sentence. The five sentences provided to users are listed in Table 3. The total number of records analyzed for the class of each sentence is given in Table 4.
Figure 6 shows the recognition accuracy, averaged over 100 iterations, against the learning ratio for Chinese versus Indians using the 5 sentences. A total of 38 samples from Chinese volunteers and 38 samples from Indian volunteers were analyzed. The results were quite good: at the 50% learning ratio, the accuracy reached 75% for "instagram facebook twitter" and "the sound of music", while for the other 3 sentences the accuracy ranged from 78.5% to 83%. Table 5 shows a summary of the accuracy obtained from 50% to 90% of the learning ratio.
Figure 7 shows the recognition accuracy, averaged over 100 iterations, against the learning ratio for Malays versus Chinese. The results obtained for Malays versus Chinese are better than those for Chinese versus Indians. At the 50% learning ratio, the accuracy obtained was between 83.5% and 88.5%. Table 6 shows a summary of the accuracy obtained for 50% of the learning ratio.
Figure 8 shows the recognition accuracy, averaged over 100 iterations, against the learning ratio for Malays versus Indians. The accuracy obtained between the two cultures was 82% up to 88% for the learning ratio of 50% and above. Table 7 shows a summary of the accuracy obtained for 50% of the learning ratio.
Figure 9 shows the recognition accuracy, averaged over 100 iterations, against the learning ratio for Others versus Indians. The accuracy obtained between the two cultures was 72.9% up to 90.4% for the learning ratio of 50% and above. Table 8 shows a summary of the accuracy obtained for 50% of the learning ratio.
Figure 10 shows the recognition accuracy, averaged over 100 iterations, against the learning ratio for Others versus Chinese. The accuracy obtained between the two cultures was 65.3% up to 81% for the learning ratio of 50% and above. Table 9 shows a summary of the accuracy obtained for 50% of the learning ratio.
Figure 11 shows the recognition accuracy, averaged over 100 iterations, against the learning ratio for Others versus Malays. The accuracy obtained between the two cultures was 67.8% up to 81% for the learning ratio of 50% and above. Table 10 shows a summary of the accuracy obtained for 50% of the learning ratio.

Summary of Analysis Based on Culture
The results of the keystroke data analysis for the six class pairs based on culture are summarized in Table 11 below. Overall, Malays versus Chinese had the highest average accuracy rate, 86.02% at the 50% learning ratio. The accuracy increased with every additional 10% of learning ratio. This shows that the typing of users in the same group can be distinguished by category. The more learning data supplied to the system for identification, the higher the accuracy obtained.

Result Based on Education Level Using CGPA
This study uses CGPA as a benchmark to differentiate the educational achievement of a person. The CGPA was divided into two groups: 3.0 and above, and below 3.0. Based on the statistics in Figure 4, the number of volunteers with a CGPA of less than 3.0 was 60. Therefore, 60 volunteers with a CGPA of 3.0 and above were randomly selected to balance the data between the two classes during analysis. Figure 12 shows the recognition accuracy, averaged over 100 iterations, against the learning ratio for CGPA 3.0 and above versus below 3.0 using the 5 sentences. The results in Table 12 show that recognition performance ranged from 78% to 87% at the 50% learning ratio. Based on the classification results using educational level measured by CGPA, there were clear differences between the two groups. This may be due to the number of university students among the volunteers involved.

Result Based on Gender
Gender is the final soft biometric feature used for user classification using KD. The results in Figure 13 show that the accuracy of gender classification based on typing was between 62% and 80.9% at the 50% learning ratio. These results show that men and women cannot be reliably distinguished by typing style because the results are inconsistent. This may be due to similar preferences in these two groups. For example, a man may possess characteristics typical of a woman and vice versa.

Result Based on Region of Birth
Malaysia consists of 13 states and 3 federal territories. The states in Peninsular Malaysia can be grouped into 4 regions: North, East, Central and South of Peninsular Malaysia, alongside Sabah and Sarawak [39,40]. This research focused on the states in Peninsular Malaysia only, and each state was grouped according to the breakdown in Table 13. Based on the overall data in Figure 3 above, the statistical breakdown of data by region is: Northern (NN), 136; Eastern (EN), 43; Southern (SN), 33; and Central (CL), 33. Five records are classified as 'Others (OS)' because they are not included in the list of regions of birth studied. These five are volunteers from Saudi Arabia, Thailand and three from Indonesia. The Others category was not analyzed because the data set obtained is too small. From the statistics, the amount of keystroke data for the Central and Southern regions of Peninsular Malaysia is the lowest among the regions. Volunteers from the Northern region were the largest group of participants, i.e. 136. For analysis purposes, the number of participants in each region was made equal. For example, only 33 records randomly selected from the Northern and Eastern regions of Peninsular Malaysia were used for comparison against the Central region. This is to balance the amount of data to be analyzed using SVM. The same is done for the other regions of birth by using the lower number of records of the two classes compared. The total number of records analyzed for the class of each sentence is shown in Table 14.
Figure 14 shows the recognition accuracy, averaged over 100 iterations, against the learning ratio for the Northern versus Southern regions using the 5 sentences. The results were poor: at the 50% learning ratio, the accuracy only reached between 69% and 74% for 4 sentences, while only 1 sentence, "langkawi island", reached 75%. This means the typing patterns of volunteers from the Northern and Southern regions cannot be clearly distinguished. Table 15 shows a summary of the accuracy obtained for 50% of the learning ratio.
Figure 15 shows the recognition accuracy, averaged over 100 iterations, against the learning ratio for the Central versus Eastern regions using the 5 sentences. The results were quite good: at the 50% learning ratio, the accuracy for three sentences exceeded 80%, and only two sentences, "langkawi island" and "tunku abdul rahman", reached between 73% and 78%. Table 16 shows a summary of the accuracy obtained for 50% of the learning ratio.
Figure 16 shows the recognition accuracy, averaged over 100 iterations, against the learning ratio for the Eastern versus Southern regions using the 5 sentences. The accuracy obtained between the two classes was 75.02% up to 87.44% for the learning ratio of 50% and above. Table 17 shows a summary of the accuracy obtained for 50% of the learning ratio.
Figure 17 shows the recognition accuracy, averaged over 100 iterations, against the learning ratio for the Northern versus Central regions using the 5 sentences. The accuracy obtained between the two classes was 85% up to 93% for the learning ratio of 50% and above. Table 18 shows a summary of the accuracy obtained for 50% to 90% of the learning ratio.
Figure 18 shows the recognition accuracy, averaged over 100 iterations, against the learning ratio for the Central versus Southern regions using the 5 sentences. The accuracy obtained between the two classes was 70.65% up to 83.21% for the learning ratio of 50% and above. Table 19 shows a summary of the accuracy obtained for 50% to 90% of the learning ratio.
Figure 19 shows the recognition accuracy, averaged over 100 iterations, against the learning ratio for the Northern versus Eastern regions using the 5 sentences. Table 20 shows that the results obtained between the two classes can be categorized as good because three of the five sentences tested had more than 80% accuracy, while the remaining two sentences had 77.07% and 79.71% accuracy.

Summary Analysis Based on Region of Birth
The results of the keystroke data analysis for the six class pairs based on region of birth are summarized in Table 21. Overall, typing classification using region of birth gave a rather impressive result: at the 50% learning ratio, the lowest accuracy rate was 73.79%, while the highest accuracy was 91.17%. The most distinguishable typing styles were for the North versus Central and North versus East categories, since at the 50% SVM learning ratio the accuracy obtained was over 80%.

Summary
It can be concluded from the results obtained that soft biometric elements can be applied in keystroke dynamics for several categories. This classification shows that the combination of soft biometrics and KD can be used as an additional security feature in system authentication. The system can compare the user profile detected from keyboard use with the profile registered in the system. The results are expected to help other researchers study the different aspects of soft biometrics that can be used to classify users by typing. The results show that the typing patterns best identified via soft biometrics are culture and region of birth. The best classification for culture was obtained from the Malay vs. Chinese category, with an average accuracy of 88.52% at the 90% learning ratio. The best classification for region of birth was obtained from the Northern vs. Central category, with an average accuracy of 91.17% at the 90% learning ratio. This shows that there are clear differences in typing patterns between cultures and between regions of birth. This may be because the written and spoken languages used by each culture are quite different.
The results of this study can be used in everyday environments, especially in the field of computer forensics. With a database that records the typing patterns of each group of people, KD can help the authorities identify groups of cyber-criminals who commit offenses. In addition, KD can be used to control access to any system that uses a username and password, where it can serve as a second security filter after the username and password. To further enhance the study of KD and soft biometrics, future researchers can use other soft biometric elements in the study of KD and use different identification techniques such as Fuzzy Logic and Neural Networks.