A Predictive Cyber Threat Model for Mobile Money Services

Abstract: Mobile Money Services (MMS), enabled by the wide adoption of mobile phones, offer an opportunity for financial inclusion for the unbanked in developing nations. Meanwhile, cybercrime risks are increasing in both frequency and severity. They are aggravated by the inadequate security practices of service providers and by the criminal intent of some customers and potential customers who seek to undermine the system for financial gain. Predicting potential mobile money cyber threats affords the opportunity to implement countermeasures before cybercriminals exploit them to impact mobile money assets or perpetrate financial cybercrime. However, traditional security techniques are too broad to address these emerging threats to Mobile Financial Services (MFS). Furthermore, the existing body of knowledge is not adequate for predicting threats associated with the mobile money ecosystem. Thus, there is a need for an effective analytical model based on intelligent software defence mechanisms to detect and prevent these cyber threats. In this study, a dataset was collected via interviews with mobile money practitioners, and the Synthetic Minority Oversampling Technique (SMOTE) was applied to handle the class imbalance problem. A predictive model based on a Machine Learning (ML) technique was developed and evaluated to detect and screen out suspicious customers with cyber threat potential during the onboarding process for MMS in developing nations. To test the proposed model's effectiveness in detecting and classifying fraudulent MMS applicant intent, it was trained under various configurations: binary or multiclass, with or without SMOTE. The Python programming language was employed for the simulation and evaluation of the proposed model. The results showed that ML algorithms are effective for modelling and automating the prediction of cyber threats on MMS.
In addition, the results showed that the logistic regression classifier with SMOTE provided the best classification performance among the logistic regression configurations tested. Such a classification model supports secure MMS, which is a key deciding factor in the adoption and acceptance of mobile money as a cash substitute, especially among the unbanked population.


Introduction
Innovations such as mobile money, enabled by the proliferation of mobile phones, have created an exceptional opportunity for the financial inclusion of large unbanked populations in developing nations [1]. Due to a lack of financial incentives, traditional nationalised banks do not have branches in villages. Furthermore, with the ubiquitous presence of mobile devices, the number of connections to cyberspace has increased astronomically. The number of World Unbanked Adults (WUA) is 1.7 billion, with nearly half (46%) living in less developed countries, 80% of whom are Sub-Saharan Africans [2]. However, mobile phone penetration is rapidly increasing in these countries¹. Hence, Mobile Financial Services (MFS) applications are among the most promising mobile applications in the developing world [3][4]. The advent of and increased access to mobile devices have created opportunities for various self-service innovations in cyberspace, such as MFS solutions, mobile money, and mobile commerce. Such innovations have brought financial instruments to many unbanked populations in developing countries and, as such, have been a major contributor to the financial inclusion of the unbanked in these emerging markets [5][6]. Consequently, Sub-Saharan Africans are responsible for 75% of global Mobile Money Service (MMS) transactions². Figure 1 illustrates the adoption of MMS by the unbanked. Mobile money is an excellent alternative for bridging the Financial Inclusion (FI) gap in mobile commerce [7]. According to [7], a service qualifies as MMS if it is used for money transfers and for making and receiving payments via mobile phone, is accessible to the unbanked, and provides a network of physical transactional points (e.g., agents) outside of bank branches and ATMs; MMS exclude mobile banking and payment services such as Apple Pay and Google Wallet.
Despite the potential opportunities of MMS in terms of adoption, the fear of losing money to cybercriminals remains a major concern among customers. The innovation is expected to be widely adopted, but perceived trust and security awareness of the service remain its principal adoption determinants [8][9]. Knowledge of the security of the environment, and of frameworks to uncover and detect mobile money cyber threats in developing nations, is underrepresented in the literature [10]. Furthermore, fraudsters are becoming more innovative and find loopholes in new security controls very quickly. The risks of cybercrime are increasing, becoming more widespread, and worsening. They are aggravated by the poor security practices of service providers and by the criminal mind-set of many customers or potential customers whose goal is to compromise the system for financial gain.
Traditional threat modelling and standard security requirements for mobile payment solutions such as mobile money for the unbanked are no longer effective or comprehensive enough to curb cybercrime, because they are based on standard checklists (e.g., PCI-DSS, ITSEC) and implement standard protocols (e.g., SSL, DNSSEC). Thus, security measures are limited to the implementer's responses to each checklist item and to the standard security requirements, whose shortcomings are evident in the rising number of fraud-related cases on MFS even where those requirements are met. As a result, tools, methods, or models that automate the prediction of these cyber threats would be useful in addressing MMS's cyber threat challenges: a threat that can be predicted can be decided on, and the desired action can then be taken. Models that predict potential mobile money cyber threats provide an opportunity to act on such threats before cybercriminals exploit them to impact mobile money assets or perpetrate financial cybercrime. Although joint efforts from industry and government stakeholders have culminated in the publication of standards, frameworks, and guidelines to mitigate the risks of cybercrime, for example by the National Institute of Standards and Technology, MFS security incidents are still on the rise. For example, in Nigeria, MFS fraud cases increased by 3,015% between 2015 and 2016, and Nigeria lost N12.3 billion between 2014 and 2017. MFS fraud in Nigeria in 2018 was also the highest in the preceding four years [11].
In many successful cyberattacks against MFS, the role of humans cannot be overemphasized: a person can be the originator, the medium, or the actual executor of the attack. Hence, MMS providers' methods and processes for onboarding, modifying, and terminating customers have important security implications. If a customer's intention could be predicted from the information supplied at any stage of the customer management process, such as customer onboarding or Subscriber Identity Module (SIM) registration, modification, or termination, MMS security would be strengthened. For instance, most developing nations' MMS providers use the mandatory SIM registration as KYC (Know Your Customer). Some organizations have implemented a second stage of validation, such as manually going through each customer record (eyeballing) after registration to determine whether the customer has fraudulent intent before activating the MMS. This second-stage validation process is tedious, ineffective, and inefficient. Therefore, for human-vectored cyber threat prevention to be effective, countermeasures must be robust and intelligent enough to predict and prevent such threats [12]. There is thus a need to focus on the onboarding stage of MMS activation and to build predictive models that detect and prevent cyber threats vectored via the onboarding process without manual human checks. This need motivates the present study.
The remainder of the paper is organised as follows: Section 2 discusses related works, Section 3 presents the ML technique used in identifying and forecasting cyber threats associated with MMS, Section 4 covers the findings, and Section 5 gives the conclusion.

Related Works
Secure MMS is a major determinant of mobile money adoption [9]. With the rapidly increasing use of the World Wide Web and the Internet, the volume and complexity of MMS security problems have grown. The upsurge of security flaws has significantly degraded the Quality of Service (QoS) of MFS. Critical cyber security issues relating to mobile payment systems include identity theft, agent-driven fraud, sharing of personal identification numbers (PINs), phishing, vishing, and authentication attacks [13][14]. Fraud cases that threaten the security of MMS include false transactions and the misuse of PINs [15]. Mitigation measures proposed in the literature include better access controls, customer awareness campaigns, agent training on acceptable practices, strict measures against fraudsters, service providers' monitoring of high-value transactions, and the creation of an extensive legal document for operating MMS. Studies found that MMS operators are aware of the need to improve mobile money security and that such improvements will enable operators to protect themselves, their customers, and their agents, and will assist in the successful provision of MMS. It was stressed that a mobile money-enabling environment should be properly regulated to avoid potential risks. Thus, the user management or activation process of MMS for customer onboarding, for example via SIM registration, requires thorough regulatory monitoring and considerable research attention to uncover the risks inherent in the process.
Studies on mobile money financial crimes have also been conducted³, with the goal of providing guidelines on regulatory policy and frameworks [16][17]. It was established that regulations to fight crime should not impede MMS adoption, but should instead adapt the traditional financial system's methods of combating crime to the mobile money industry in appropriate ways. The use of "frameworks, standards, and countermeasures" for MFS to mitigate cybercrime threats faces many challenges, as no workable solution has specifically been provided, especially in the context of developing nations [11].
There have been few studies that have focused on formal methodologies for threat modelling of network systems, such as mobile money solutions, as available for specific software systems [18]. For threat modelling purposes, a network system is viewed via a network model for threat analysis, which allows analysts to determine communications between computers with different roles [18][19][20]. While threat modelling of network-based solutions and its methodology are relatively scarce in the literature, it is more common to find works that provide threat models for specific software applications. A software application threat model can be modelled using a Data Flow Diagram (DFD) to describe the system. Examples of such targeted, specific application threat modelling include threat analysis of on-line banking systems by combining the STRIDE threat model and the threat tree analysis [21], the threat analysis of Web services and grids [22][23], and the threat modelling of identity federation protocols [24][25]. A goal-oriented approach to security threat modelling and analysis has been applied to model different systems, for instance, using visual model elements to explicitly capture threat-related concepts [26].
In mobile money solution services, research has been conducted on strengthening MFS technical security countermeasures [27][28][29] and improving MFS security [30]. Some of these techniques are structural equation modelling [31], biometric techniques [32], two-factor authentication [8], quantitative analysis of subject matter experts (SME) [33], and a host of others. Biometric techniques were proposed for providing the highest security to mobile payments in e-banking, particularly at the wireless transmission level. In the model, the image of a fingerprint is captured in real time and sent to the server for authorization. A fuzzy logic-based fingerprint matching algorithm was used on the server side for authorization [32]. The detection of fraud in mobile banking was also investigated with user input patterns when mobile banking services are being used, as well as the transaction pattern. The study's findings revealed that user input and transaction pattern data contain information that can be used to identify a specific user, allowing abnormal transactions to be detected [34].
A probabilistic model used to derive an adaptive threshold algorithm for detecting anomalous transactions was reported in [35]. The model was optimized with the Baum-Welch and hybrid Posterior-Viterbi algorithms. A credit card transaction dataset was simulated, and the model was trained on it and used to predict fraud. Finally, the proposed model was evaluated using different metrics. The results showed that the detection model performed well for credit card anomalous transaction detection; however, this has not been established for MMS. A framework was used to study the banking environment's information systems as they relate to information security initiatives, with the Kenyan banking sector as the case study [36]. The research objectives were to identify common banking information system vulnerabilities; analyse and define gaps in existing frameworks in order to evaluate banking programme initiatives and security; develop a framework for evaluating security programmes for the banking industry; and validate the developed security investment framework. The findings revealed that people pose the greatest threat to information systems, and customer security awareness was identified as a major barrier to security effectiveness. The increased risk exposure in banks was also traced to fraud, careless or unaware employees, and internal attacks. The study concluded that the alignment of people, process, and technology is very important in transforming an organisation's information security.
Meanwhile, these traditional mechanisms for preventing fraud in MMS are not effective. They do not provide adequate security, as cybercrime issues persist in MFS, and when such security serves as the last resort, user experience is often impacted [37]. The frameworks, standards, and countermeasures intended to mitigate cybercrime threats to MFS face many challenges, as no workable solution has been specifically provided, especially in the context of developing nations. This is reflected in the survey conducted in [30], in which MFS was the least preferred method of payment compared to instruments such as payment cards and cheques [12]. There is also a dearth of literature on AI-based detection and prevention models for cyber threats associated with MMS. The security issues facing MFS in the developing world are peculiar and have not received focused attention in the literature [38].
Artificial Intelligence (AI) is now widely used in cybersecurity, where it can be trained to generate threat warnings, recognise novel malware strains, and safeguard sensitive data for organisations in different domains. Therefore, defending mobile money solution services against cyber threats in real time using AI-based cyber threat detection and prediction models is imperative.
Thus, an attempt is made in this study to develop a predictive cyber threat model for MMS by employing ML techniques.

Methodology
This work focused on predictive models for the detection and prevention of MMS cyber threats vectored via the customer life cycle management process in developing nations, using Nigeria as a case study. The study considered only mobile phone subscriber biodata registration details for MMS, with respect to the following research questions: What security threats are involved in customer management across different mobile money solution ecosystems? And how can predictive models be built to detect and prevent these cyber threats based on intelligent software defence mechanisms? To answer them, an ML algorithm was employed to formulate the predictive model for cyber threat detection and prevention.

Architectural Modelling
A conceptual model of the mobile money customer life-cycle management process, built from information gathered in technical interviews, revealed that the customer onboarding process is an integral part of the customer life cycle and of security management. In this study, the customer management life cycle was defined as comprising three basic activities: Customer Creation (C), Modification (M), and Blocking or Termination (B), or CMB. These were defined as follows with respect to the totality of MMS subscribed to by the customer at a given instance of time.
Customer Creation (C): encompasses the entire onboarding or customer setup process, from SIM registration to MMS activation on the mobile money system. This study focused on customer creation or customer registration, otherwise called the Subscriber Identity Module (SIM) registration process, through which cyber threats are vectored.
Modification (M): is any change to a customer's profile in the system, such as a SIM exchange for a customer, also known as a SIM swap.
Blocking or Termination (B): also known as customer service termination or removal. The customer termination function disconnects the customer's MMS from the system. To achieve financial inclusion for unbanked customers, mobile money relies heavily on customer SIM registration details, which serve as "Know Your Customer" (KYC) information.
To ease the cyber threat challenges of the existing model, i.e., the eyeballing model shown in Figure 2, the proposed predictive model architecture shown in Figure 3 was developed. It uses ML techniques together with intelligent software agents that notify system administrators when cyber threats are detected. A supervised ML algorithm was used to detect anomalies in customer biodata registration records for both new and existing customers during SIM registration and MMS activation, and its predictions flag cyber threats or anomalous customer data records. From Figure 3, the customer creation process flow can be summarised as follows.
Step 1: Customer approaches a Subscriber Identity Module (SIM) registration agent to purchase a SIM and requests SIM registration from a GSM service provider's designated agents. This registration is activated on the GSM network and stored in the customer database or Customer Relationship Management (CRM) database. A customer may be an adversary or a genuine customer.
Step 2: The customer dials the USSD code for mobile money registration after SIM activation on the GSM network.
Step 3: The mobile money system then pulls the KYC information from the customer database to fulfil the requirements for registering the customer for MMS.
Taking the components as threat vectors, the cyber threats induced by the adopted customer management approach for the mobile money system can be expressed as an overall threat profile, denoted $P_t$, and summarised mathematically in Equation 1:
$$P_t = f(C_t, M_t, B_t) \qquad (1)$$
where:
$P_t$ = the threat profile for the mobile money solution.
$C_t$ = the threat profile elicited by the customer creation and SIM registration processes.
$M_t$ = the threat profile induced by the customer modification process, e.g., SIM swaps.
$B_t$ = the threat profile resulting from a customer service blockage or termination process.
If the function in Equation 1 is subjected to a continuous probability density distribution, then the threat profile probability can assume a non-negative value in the interval from a to b, where a = 0 and b = 1.

Description of the Proposed Model
In the proposed model shown in Figure 4, the analytics module (i.e., the ML module) examined the incoming SIM registration data records of mobile money applicants, existing or new, and classified them according to the presence of cyber threats. In this operation, clean records were flagged as compliant, while unclean records were flagged as non-compliant, and both were eventually stored in a permanent repository. In real-life operations, this is usually the organization's Customer Relationship Management (CRM) database repository.
The second ML analytics module is another layer for deeper analysis, and this module scans and classifies both new and existing customers for mobile money eligibility. The system administrator reviews the classified records and/or updates the rules database as required.
The predictive model was formulated to check incoming online applicants' registration data in real time for cyber threats. A supervised ML algorithm was employed to determine whether a mobile money applicant's incoming registration or activation data record is legitimate. The classification algorithm assigns applicants' registrations to compliant (non-fraudulent) and non-compliant (fraudulent) records. The non-compliant records are those the predictive ML algorithm considers suspicious for cyber threats. If an applicant's record is compliant, the MMS is activated; if not, it is flagged as an anomalous registration, and the customer is rejected for MMS activation. The processed applicants' records were subsequently added to the historical data records database for future algorithm learning, and the intelligent agents logged the anomalous registrations. The flow diagram is shown in Figure 5.
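The decision flow described above can be sketched in a few lines of Python. This is only an illustrative outline: the class name `OnboardingGate`, its fields, and the `applicant_id` key are hypothetical, and the predicted label is passed in directly rather than produced by a trained classifier.

```python
from dataclasses import dataclass, field

COMPLIANT = "compliant"
NON_COMPLIANT = "non-compliant"


@dataclass
class OnboardingGate:
    """Hypothetical gate that applies a classifier's verdict to each applicant."""
    history: list = field(default_factory=list)      # processed records kept for future learning
    anomaly_log: list = field(default_factory=list)  # anomalous registrations for administrators

    def process(self, record: dict, predicted_label: str) -> str:
        # Every processed record joins the historical data used for retraining.
        self.history.append((record, predicted_label))
        if predicted_label == COMPLIANT:
            return "activate"
        # Non-compliant records are logged and the applicant is rejected.
        self.anomaly_log.append(record)
        return "reject"


gate = OnboardingGate()
decisions = [
    gate.process({"applicant_id": 1}, COMPLIANT),
    gate.process({"applicant_id": 2}, NON_COMPLIANT),
]
print(decisions)
```

In a deployment, the label would come from the trained model, and the anomaly log would feed the administrator notifications mentioned above.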

Framework for Implementation of the Predictive Analytical Model
An implementation design of the analytical model is presented in Figure 6. The framework is designed to detect mobile money application fraud in real time by applying ML models to flag anomalous records among incoming mobile money applicants' biodata registration details. It uses the Apache Spark stack, namely the Spark Streaming module for collecting online registration data and the Spark ML module for building, training, and retraining the predictive model. ML packages in the Python programming language (scikit-learn) provide the routines used to train and retrain the ML algorithms that build the predictive model. Retraining keeps the model up to date, so that registration data can be analysed in real time and fraudulent customer registrations can be flagged and rejected. Data from different sources are pre-processed into a format that ML algorithms can work on. The historical SIM registration data, with the right predictive features selected and stored in a database, are used by the ML algorithms to generate the predictive models.

Dataset Collection, Pre-processing and Analysis
Five (5) million dataset records of applicants' registrations for MMS were assembled following interviews with mobile money practitioners at Nigerian telecoms providers about issues with the registration of customers' data for mobile money. Because providing customer transaction details is considered a breach of confidentiality, the practitioners masked and transformed the majority of the features in the dataset. Sample masked records of valid and invalid registrations were obtained, and the generated dataset was highly imbalanced, as there were more valid registrations than invalid ones.

Data Preprocessing
The following pre-processing steps were carried out:
Dataset creation and cleaning: The datasets were generated based on the features in the sample data using the Python Faker library. Faker generates fake data, which can be used to anonymize data taken from a production service for confidentiality reasons or to generate large quantities of data. Irrelevant features and those with null values were removed.
Features Selection: Regardless of the classification algorithm used, a feature selection procedure was performed on all factors suggested by mobile money technical and business experts as the most likely to affect the fraudulent behaviour of mobile money applicants. This resulted in 13 factors for each applicant; examples include surname, first name, gender, mother's maiden name, region, customer reputation, agent reputation, and agent identity.

Bag-of-Words (BoW):
The dataset comprises many string (character) features, such as applicants' names, regions, or addresses, which are not directly usable by ML algorithms that work only on numeric data. Hence, the BoW approach was used to convert string features to a numeric representation.
Figure 6. A framework of the proposed model
Classification Rules: Rules were developed for the different predictive indicators in the dataset features so that each applicant's historical record could be labelled as fraudulent or non-fraudulent. The term "non-fraudulent" refers to mobile money applicant records that are valid and compliant for MMS activation. The fraudulent records were grouped into two categories according to the cyber threat risk potential of the customer, i.e., high and low.
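As a rough illustration of the Bag-of-Words conversion mentioned above, the following pure-Python sketch maps string values to count vectors. The function names and the sample names are made up for the example, and whitespace tokenisation is an assumption; the study's actual BoW settings are not stated.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct lower-cased token to a column index (sorted for stability)."""
    tokens = {tok for text in texts for tok in text.lower().split()}
    return {tok: idx for idx, tok in enumerate(sorted(tokens))}


def to_count_vector(text, vocabulary):
    """Return token counts for `text`, ordered by vocabulary index."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # tokens outside the vocabulary are ignored
            vector[vocabulary[tok]] = n
    return vector


# Illustrative name strings only, not records from the study's dataset.
names = ["Ada Obi", "Obi Obi", "Chidi Ada"]
vocab = build_vocabulary(names)
vectors = [to_count_vector(name, vocab) for name in names]
print(vocab)
print(vectors)
```

Each string becomes a fixed-length numeric row, which is the form a classifier such as logistic regression can consume.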

Handling the Data Imbalance Problem
The numbers of observations in the majority and minority classes of the acquired dataset were not equally distributed. Random oversampling could lead to overfitting and, consequently, biased classification. To avoid classification on imbalanced classes, the widely used Synthetic Minority Over-sampling Technique (SMOTE) [39] was employed. SMOTE is an oversampling approach that synthesizes new minority class samples: it operates in the feature space and interpolates between minority (positive) instances that lie close together.
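The interpolation idea behind SMOTE can be sketched as follows. This is a simplified pure-Python stand-in written for illustration, not the library routine used in the study; the function name `smote_sketch`, the neighbour count `k`, and the sample points are assumptions.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority samples by interpolating between near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


# Illustrative two-feature minority samples.
minority_samples = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_samples = smote_sketch(minority_samples, n_synthetic=6)
print(len(new_samples))
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already occupied by the minority class rather than duplicating records, which is what distinguishes this approach from random oversampling.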

Historical Records Labelling, Fraud Score Calculation, and Risk Categorisation Design
For predictive classification, the dataset records were labelled as either "non-fraudulent intent" or "fraudulent intent" applicants. The fraudulent category was further broken down into two categories based on risk profile, high and low, as defined in Table 1. The table shows the fraud score and category based on the weight of fraud intent, which determines the final status of the customer record as follows:
Fraud Rating: A fraud rating was assigned per issue rule per feature in the applicant's data record for likely potential cyber threat issues.
Fraud Score: Each record's fraud score was calculated from the fraud ratings assigned to its features under the rules defined in Table 2, and the label for each dataset record was determined from the resulting sum, as expressed mathematically in Equation 2. Hence, for a record with features i = 1, ..., n, the fraud score was calculated thus:
$$\text{Fraud Score} = \sum_{i=1}^{n} \text{Fraud Rating}_i \qquad (2)$$
That is, the sum of the fraud ratings per issue was compared with the risk range in Table 1 to arrive at the risk category. This was used to determine the critical level of the issues per feature per applicant registration record, and hence the label and the risk category for the potential cyber threat per record.
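The scoring and categorisation step can be illustrated with a short Python sketch. The threshold constants and the per-feature ratings below are placeholders chosen for the example; the actual risk ranges and rule ratings are those defined in Tables 1 and 2, which are not reproduced here.

```python
# Placeholder risk ranges (assumed for illustration; see Table 1 for the real ones).
LOW_RISK_MIN = 1
HIGH_RISK_MIN = 6


def fraud_score(ratings):
    """Sum the per-feature fraud ratings of one registration record."""
    return sum(ratings)


def risk_category(score):
    """Map a fraud score to a label using the placeholder ranges above."""
    if score < LOW_RISK_MIN:
        return "non-fraudulent"
    if score < HIGH_RISK_MIN:
        return "fraudulent-low"
    return "fraudulent-high"


# Placeholder per-feature ratings for three records (0 means the rule passed).
records = {
    "record_a": [0, 0, 0, 0],
    "record_b": [1, 0, 2, 0],
    "record_c": [3, 2, 0, 4],
}
labels = {name: risk_category(fraud_score(r)) for name, r in records.items()}
print(labels)
```

The same two functions yield both the binary label (fraudulent or not) and the risk category used for multiclass labelling.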

Formulation of the Predictive Model
For modelling cyber threat prediction during MMS activation via customer onboarding or SIM registration, a Logistic Regression ML algorithm model was used. This approach entails predicting the continuous value of one field (the target) from a set of values of the other fields (attributes or features).
A regression model usually produces a continuous prediction value, which here takes the form of a probability. ML classifiers require a training corpus of M input/output pairs $(x^{(i)}, y^{(i)})$. Logistic regression is a probabilistic, supervised statistical learning model that uses the logistic curve for fraud detection. There are data on a dummy dependent variable $y_i$ (with values of 1 and 0) and a column vector of explanatory variables $x_i$ (including a 1 for the intercept term) for a sample of n cases (i = 1, ..., n). The logistic regression model is shown in Equation 3:
$$P(y_i = 1) = \frac{\exp(\beta x_i)}{1 + \exp(\beta x_i)} \qquad (3)$$
where $\beta$ is a row vector of coefficients. Taking natural logarithms, the model may be written in "logit" form as Equation 4:
$$\ln\left(\frac{P(y_i = 1)}{1 - P(y_i = 1)}\right) = \beta x_i \qquad (4)$$
The goal of maximum likelihood estimation is to find the set of values for $\beta$ that maximizes the likelihood function of the model. Hence, this model used a dependent variable, $y_i$, for each mobile money applicant i, representing the occurrence of fraudulent intent (1 = fraud; 0 = non-fraudulent or compliant registration), and the features x of the applicant's SIM or mobile money registration biodata record as explanatory variables. The fraudulent category was divided into two based on risk quantification levels: low risk (class 1) and high risk (class 2).
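To make the logistic model and its maximum likelihood objective concrete, the following sketch evaluates the predicted probability and the log-likelihood for a toy sample. The coefficient and feature values are arbitrary illustrations, and the function names are hypothetical.

```python
import math


def predict_probability(beta, x):
    """P(y = 1 | x) under the logistic model; x includes a leading 1 for the intercept."""
    z = sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))  # algebraically equal to exp(z) / (1 + exp(z))


def log_likelihood(beta, X, y):
    """Log of the likelihood that maximum likelihood estimation maximises over beta."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_probability(beta, xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total


# Illustrative values only: an intercept plus two encoded features.
beta = [-1.0, 2.0, 0.5]
X = [[1, 0, 0], [1, 1, 0], [1, 1, 2]]
y = [0, 1, 1]
probs = [predict_probability(beta, xi) for xi in X]
print(probs)
print(log_likelihood(beta, X, y))
```

A coefficient vector that fits the labels better gives a higher (less negative) log-likelihood than the all-zero vector, which is the quantity an optimiser increases during training.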
The logistic curve takes values between 0 and 1, so it can be interpreted as the probability of class membership and used to predict the occurrence of cyber threats or fraudulent intent in a customer MMS registration or activation request, as shown in Equation 5:
$$P(y = 1) = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon)}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon)} \qquad (5)$$
Defining: $p = P(y = 1)$ and $1 - p = P(y = 0)$.
Logistic regression is known to be sensitive to class imbalance in the dataset, which may bias the classifier and impair its predictive accuracy, precision, and sensitivity. In this study, to investigate the effect of the class imbalance problem, the logistic regression algorithm for predicting mobile money threats was designed in various configurations based on two variants:
Classification Configurations: These comprise the Binary Classification Configuration, which classified the mobile money applicants' dataset as compliant (class 0) or non-compliant (class 1), and the Multiclass Classification Configuration, which classified the applicants' biodata into three distinct categories of cyber threat risk: compliant, where no cyber threat exists (class 0); low risk (class 1); and high risk (class 2).
Dataset Distribution: This involves imbalanced data (NO-SMOTE), which are the datasets acquired with unequal class distribution, and balanced data (SMOTE), which are the processed datasets using the Synthetic Minority Over-Sampling Technique (SMOTE).
This yielded four logistic regression variants, one per combination of configurations, as shown in Table 3, and hence a total of four (4) predictive models. The models were trained over many iterations on the historical dataset of SIM registrations for new applicants and existing customers.
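The four variants arise from crossing the two classification configurations with the two dataset distributions, which can be enumerated as in the sketch below. The helper `to_binary` reflects an assumption drawn from the class definitions above, namely that the binary configuration merges the low-risk and high-risk classes into a single non-compliant class; the label list is illustrative.

```python
from itertools import product

CLASSIFICATION = ("binary", "multiclass")
DISTRIBUTION = ("NO-SMOTE", "SMOTE")


def to_binary(label):
    """Collapse low risk (1) and high risk (2) into one non-compliant class (1)."""
    return 0 if label == 0 else 1


# One (classification, distribution) pair per predictive model.
configurations = list(product(CLASSIFICATION, DISTRIBUTION))

# Illustrative multiclass labels and their binary counterparts.
multiclass_labels = [0, 1, 2, 0, 2]
binary_labels = [to_binary(label) for label in multiclass_labels]

for classification, distribution in configurations:
    print(classification, distribution)
```

Each pair would be trained and evaluated separately so that the effect of SMOTE can be compared within the binary and the multiclass settings.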

Results and Discussions
The data for the entire sample were subjected to logistic regression analysis, with the applicant's mobile money registration biodata record details serving as the predictor variables and the applicant's mobile money registration status as the dependent variable. The simulation of the logistic regression-based predictive models for the detection and prevention of mobile money cyber threats during the customer registration process was carried out using the Python programming language with its data analysis and ML libraries (Pandas, NumPy, SciPy, Scikit-Learn, Matplotlib). A Jupyter notebook was used for the coding, while Pandas, a data analysis library, was used for the pre-processing of the dataset. The dataset was split into 70% training and 30% testing according to accepted heuristics (other split values yielded similar results). The detailed results are presented as follows:

Data Pre-processing Output
During the data collection stage, the Python faker library was used to generate 1000 names, which were then combined with 8000 harvested Nigerian names. Pseudonymized data records gathered from practitioners in the field, covering both valid and invalid registrations, were used as the basis for simulating a larger number of valid and invalid records. Bag-of-Words (BoW) in Python was used to transform names and other string features in the dataset into numeric values. A label encoder in Python was used to encode the categorical variables in the dataset, e.g., male = 1, female = 0, while others were assigned category codes. The dataset was then expanded by simulation to five million records, of which 25 thousand were sampled for subsequent simulations.
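The two encoding steps mentioned above can be sketched in plain Python as follows. The helper names `fit_label_encoder` and `bag_of_words`, as well as the sample values, are hypothetical; the study's own pre-processing code is not shown in the paper, so this is an illustration of the general techniques only. Assigning codes in sorted order is an assumption that happens to reproduce the female = 0, male = 1 mapping quoted in the text.

```python
from collections import Counter


def fit_label_encoder(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def bag_of_words(texts):
    """Return (vocabulary, count vectors) for lowercase whitespace tokens."""
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical categorical feature.
genders = ["male", "female", "female", "male"]
gender_codes = fit_label_encoder(genders)  # {"female": 0, "male": 1}
encoded_genders = [gender_codes[g] for g in genders]

# Hypothetical name strings (placeholders, not records from the dataset).
names = ["Ada Obi", "Obi Chinedu"]
vocabulary, name_vectors = bag_of_words(names)
```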
To conserve processing resources, the dataset for building the analytical model was sampled from the 5 million records after each record had been labelled according to the labelling rules. The distribution histograms of the imbalanced sampled dataset and of the dataset balanced by applying the Synthetic Minority Over-Sampling Technique (SMOTE) are presented in Figures 7 and 8, respectively. After pre-processing, the ML classifiers were trained on the dataset both with and without the SMOTE operation, in order to observe the differences in performance under the two scenarios and to select the best algorithm.
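To make the rebalancing step concrete, the snippet below is a simplified sketch of the core SMOTE idea: each synthetic minority record is placed on the line segment between a minority record and one of its nearest minority neighbours. The function name `smote_oversample`, the value of `k`, and the toy points are assumptions; the paper does not state which SMOTE implementation or parameters were used.

```python
import random


def smote_oversample(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating toward neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(
            key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j]))
        )
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy example: 6 majority records versus 3 minority records.
majority_count = 6
minority_points = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]

new_points = smote_oversample(
    minority_points, n_synthetic=majority_count - len(minority_points), k=2
)
balanced_minority = minority_points + new_points
```

Because the synthetic points are interpolations, they stay within the range spanned by the existing minority records rather than duplicating them exactly.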
Other pre-processing activities included generating applicants' names with the faker library and converting string features in the dataset to numeric values in a format that can be processed by an ML algorithm.

Simulation Results
The proposed predictive algorithm was simulated on the defined dataset to determine the effectiveness of the logistic regression classifier for cyber threat prediction during the mobile money applicant onboarding process. The total dataset was divided into eighty percent for training and twenty percent for testing. For validation, the split was performed using the train_test_split function of the scikit-learn library.
Simulations were used to systematically explore the behaviour of the logistic regression-based classifier, with both its binary and multiclass classification capabilities, in classifying the applicants' records as compliant (0) or into one of two levels of fraudulent registration status: low risk (1) and high risk (2). The simulation experiments were grouped as follows:
Group I: Binary classification algorithm configurations with and without SMOTE for dataset rebalancing. Experiment I investigated the binary ability of the classifier to predict multi-level cyber threat categories when SMOTE was applied to the dataset beforehand, while Experiment II investigated its binary ability when SMOTE was not applied.
Group II: Multiclass classification algorithm configurations with and without SMOTE for dataset rebalancing. Experiment I investigated the multiclass ability of the classifier when SMOTE was applied to the dataset, while Experiment II investigated its multiclass ability when SMOTE was not applied.
Based on the binary and multiclass classifications in Tables 5 and 6, respectively, the simulation results are presented as confusion matrices in Figures 9 and 10. In the binary classification, the number of True Positives (TP) in the NO-SMOTE configuration is higher than in the SMOTE configuration, whereas in the multiclass classification the SMOTE configuration has the higher number of TP. The predicted probabilities are shown in Figures 11 and 12. These results showed that the Logistic Regression algorithm predicts better with two-class classifications (i.e., binary) and not with multiple classes of compliant (0), low-risk (1), and high-risk (2). This implies that the balance of the dataset affected the predictive capability of the classifier across its four (4) variants.
Accuracy: With the imbalanced dataset (NO-SMOTE), a prediction accuracy of 72% was observed with the default binary classification feature of the algorithm, while multiclass classification gave an accuracy of 69%. With the balanced dataset (SMOTE), the accuracy of binary classification dropped to 42%, while the accuracy of multiclass classification increased to 72%. This implies that the binary classification feature of the logistic regression classifier predicts effectively with the imbalanced dataset, while its multiclass classification feature predicts well with the balanced dataset.
MCC: The Matthews Correlation Coefficient (MCC) of the predictions was evaluated. Among the four logistic regression experiments performed, the best classifier configuration was multiclass with SMOTE, which had the highest MCC of 58%, compared with MCCs of 27%, 15%, and 16% for the other experiments.
Precision, Recall, and F1-Score: As presented in Table 5, the precision for the binary LR experiment without SMOTE was 0.76 and the recall, or sensitivity, was 0.48, compared with a precision of 0.86 and a recall of 0.72 for the multiclass logistic regression model.
ROC and Predicted Probabilities: The Receiver Operating Characteristic (ROC) Area Under Curve (AUC) of 0.84 for multiclass logistic regression was also higher than for all other logistic regression experiments.
The evaluation results showed that SMOTE improves the performance of the multiclass logistic regression classifier, and that the multiclass configuration with SMOTE gave the best performance overall.
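For reference, the evaluation measures discussed above can all be derived from the four cells of a binary confusion matrix. The snippet below is a minimal sketch with hypothetical counts (they are not the study's results) and uses the standard definitions of accuracy, precision, recall, specificity, F1-score, and MCC.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute common classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # also called sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical confusion-matrix counts for illustration only.
example = binary_metrics(tp=40, fp=10, fn=20, tn=30)
```

MCC uses all four cells, which is why it is often preferred over accuracy when the classes are imbalanced. For the multiclass configurations, per-class values of these measures are typically computed one class against the rest and then averaged.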

Validation Results
Model validation was carried out to benchmark the cyber threat compliance and correctness of the applicant's record against existing methodologies. The predictive cyber threat model was benchmarked against the manual eyeballing process used by human agents to verify the integrity of mobile money applicants' registration data during onboarding. Figure 15 depicts the schematic view of the human agent and ML algorithm predictive models. The result showed that the ML algorithm performed better and faster than the manual eyeballing method currently in use. The two methods are described in detail as follows:
Manual Eyeballing Method: Eyeballing each record after customer registration, before final acceptance for mobile money registration, was the main method in use to validate the textual correctness and completeness of a customer record. According to expert interviews, manual validation of a sample customer record takes 8-10 minutes on average. This was further expressed as Average Eyeballing Duration and Eyeballing Accuracy. In an extensive eyeballing operation, suppose each agent is given a total of 100 customer records per day to eyeball manually. If validating a record takes $t$ seconds, then validating $n$ records will take $Q$ seconds per agent: $Q = n \times t$. From the observations of five agents in Table 6, each eyeballing 20 records for a total of 100 observations, the average eyeballing duration per record is 9.98 minutes, which is far more than the 0.4 minutes needed to validate 25 thousand records when using Random Forest algorithms. Also, in the simulation with the proposed predictive algorithm, the accuracy was 91%. By contrast, with the manual eyeballing method the human agent becomes fatigued, which degrades the accuracy of the method. From expert interviews and observations, the accuracy is always below 71%, and the number of records validated is typically 100 per day per agent. This is far less than the ML output per second.
Machine Learning Methods: These provide improved efficiency, a lower error rate, and better data quality. From the results obtained above, it is evident that the average processing time for a record is a fraction of a second.
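The arithmetic behind this comparison can be worked through in a few lines, using Q = n × t with the figures reported above. Durations are kept in minutes here because that is how the observations are reported; the variable names are illustrative.

```python
def total_validation_time(n_records, time_per_record):
    """Q = n * t: total time for one agent to validate n records."""
    return n_records * time_per_record


# Manual eyeballing: 100 records per day at 9.98 minutes per record.
manual_minutes = total_validation_time(100, 9.98)  # 998 minutes
manual_hours = manual_minutes / 60                 # about 16.6 hours

# Automated validation: 25 thousand records reported in 0.4 minutes.
ml_seconds_per_record = 0.4 * 60 / 25_000          # 0.00096 seconds
manual_seconds_per_record = 9.98 * 60              # 598.8 seconds

# Ratio of per-record processing times (manual over automated).
speedup = manual_seconds_per_record / ml_seconds_per_record
```

On these figures, eyeballing 100 records at the observed average would take one agent more than 16 hours, while the automated per-record time is under one millisecond.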

Conclusion
This research aimed to develop a predictive cyber-threat model that detects and prevents any suspicious customer with cyber threat potential during the onboarding process of mobile money customer lifecycle management in developing nations, using Nigeria as a case study. Logistic regression, an ML algorithm, was used to prevent suspicious customers from joining the MMS by predicting the customers' intent and thus screening out customers with financially criminal intent. A dataset was collected, and the Python programming language was used for data analysis, data conversion, and transformation functions. SMOTE was applied to handle the class imbalance of the dataset. The effectiveness of the logistic regression-based predictive algorithm for cyber threat prediction from customer data details during the mobile money onboarding process was determined by evaluating the predictive model across the dataset balance statuses and classification configurations.
Thus, the proposed model for the mobile money initiative will provide a sustainable drive for financial inclusion and cashless policies, as well as accelerate mobile service adoption in developing nations. As shown in this study, the adoption of this predictive model would go a long way toward reducing financial fraud in MFS. Based on this conclusion, MMS providers should consider using this predictive cyber threat model to prevent suspicious customers from being onboarded onto mobile money platforms.
Meanwhile, this study focused only on mobile phone subscriber biodata registration details for MMS. Other components of the customer lifecycle management process, such as modification (SIM SWAP), customer biometrics, profile modification, and customer termination processes, were not explored as cyber threat vectors for MMS. Further research will be conducted to investigate the performance of other classifiers capable of predicting the likelihood of a cyber threat or of an applicant's fraudulent intent during the MMS onboarding or service activation process, with the goal of determining the best ML model for the predictive model solution. Furthermore, investigations will be carried out using Ordinary Differential Equation (ODE) models or epidemiological models to predict cyber threat spread rates in MFS or other domains.

Table 6. Eyeballing duration per record (minutes) observed for five agents, 20 records each.

| Record | Agent 1 | Agent 2 | Agent 3 | Agent 4 | Agent 5 |
|---|---|---|---|---|---|
| 1 | 10 | 6 | 7 | 10 | 9 |
| 2 | 8 | 7 | 11 | 11 | 7 |
| 3 | 10 | 8 | 12 | 12 | 8 |
| 4 | 10 | 13 | 10 | 13 | 9 |
| 5 | 10 | 11 | 14 | 15 | 10 |
| 6 | 8 | 12 | 12 | 16 | 11 |
| 7 | 10 | 10 | 15 | 12 | 7 |
| 8 | 10 | 8 | 13 | 15 | 10 |
| 9 | 8 | 11 | 12 | 11 | 7 |
| 10 | 10 | 12 | 15 | 14 | 8 |
| 11 | 10 | 10 | 10 | 13 | 9 |
| 12 | 7 | 6 | 8 | 14 | 10 |
| 13 | 10 | 5 | 9 | 10 | 9 |
| 14 | 10 | 7 | 13 | 9 | 9 |
| 15 | 10 | 6 | 7 | 16 | 11 |
| 16 | 6 | 8 | 8 | 14 | 8 |
| 17 | 10 | 9 | 9 | 12 | 9 |
| 18 | 10 | 5 | 7 | 14 | 8 |
| 19 | 10 | 8 | 8 | 10 | 9 |
| 20 | 10 | 11 | 9 | 10 | 10 |