The Theory of Probabilistic Hierarchical Learning for Classification

Abstract: Providing computers with the ability to classify has remained at the core of artificial intelligence. Its applications have now made inroads into nearly every walk of life, spanning healthcare, education, defence, economics, linguistics, sociology, literature, transportation, agriculture, and industry. To our understanding, most of the problems we face can be formulated as classification problems. Therefore, any novel contribution in this area has great potential for real-world application. This paper proposes a novel way of learning from classification datasets.


Introduction
A principal faculty of human consciousness is its ability to classify. It is through classification that we become aware of things in existence by distinguishing them from one another. Providing this ability to machines now constitutes a major part of artificial intelligence. Its application has touched every aspect of human life, including but not limited to renewable energy e.g. [1], chemometrics e.g. [2], cyber security e.g. [3], natural language processing e.g. [4], finance e.g. [5], microbiology e.g. [6], ecology e.g. [7], and healthcare e.g. [8]. Therefore, any novel way through which machines could learn to classify has enormous potential for application in a wide variety of areas. The theory of probabilistic hierarchical learning for classification was introduced in a conference paper [9]. It was moulded into a theory after a series of earlier papers [10][11][12][13]. This theory introduces the hierarchical model of learning. In hierarchical learning, multiple models are learnt hierarchically over their subdomains, each containing elements from several flat classes. The model in each hierarchy is learnt on a subset of the training set. The configuration of that subset is also decided during training in the corresponding hierarchy. Therefore, a model and its application subdomain are learnt together. Since this subdomain may contain elements from several classes, the subset is not part of a class hierarchy; it is part of a hierarchy of subdomains corresponding to a hierarchy of learnt models. The interesting thing about these models is that they are error-free over their respective subdomains.
Hierarchical learning should not be confused with hierarchical classification or any of its contexts, such as natural hierarchical classification (NHC) or methodological hierarchical classification (MHC). Hierarchical learning does not have hierarchical classes like NHC, such as the classification of all biological organisms on earth e.g. [14] or the hierarchical classification of diseases by the World Health Organization. The classes in hierarchical learning are flat and cannot be represented through a directed acyclic graph e.g. [15] as can be done in NHC [16]. Hierarchical learning is not even an artificial hierarchy of classes such as the one used in MHC, where the classes are flat but classification itself is performed hierarchically. One example of this kind is Hierarchical Support Vector Machines [17], where the hierarchies are decided manually. MHC is also done by the automated generation of meta-classes, such as in a handwriting character recognition system [18].
Hierarchical learning should not be confused with ensemble learning [19], which also deals with multiple models but differs from hierarchical learning in several ways, based on the method of learning, the domains of the models, the error handling, and the application methodology. Ensemble learning models are not hierarchically learnt, their domain covers the whole training set, their errors are averaged, and they can be applied simultaneously in parallel. In hierarchical learning, by contrast, each model has a domain that is a unique subset of the training set, it is error-free over that domain, and the models can only be applied sequentially.
Hierarchical learning is a supervised classification learning method, but it should not be confused with other supervised classification learning methods such as Neural Networks e.g. [20][21][22], Decision Trees e.g. [23] and Naïve Bayes e.g. [24], as none of them follows the scheme of hierarchical learning. The multiple hierarchies in hierarchical learning should not be confused with the multiple layers of learning in, for instance, Deep Learning e.g. [20][21]. This is because the number of layers in these networks is set prior to learning, whereas the number of hierarchies is not decided prior to learning; it is part of the learning process instead. Therefore, the hierarchical structure in hierarchical learning is a learnt model rather than a predefined structure such as in a Recurrent Neural Network e.g. [22]. Since hierarchical learning uses a probabilistic model for class discrimination, as most linear e.g. [25] and nonlinear e.g. [26] discriminants do, it is termed the theory of probabilistic hierarchical learning. An interesting feature of its probabilistic model is the flexibility that allows for negative probabilistic values, a concept first introduced in an entirely different field, i.e., quantum mechanics [27].
Earlier, the theory was emphasised through mathematical means. In this paper, the theory is redefined and revised by introducing four mathematical principles: the principle of successive bifurcation, the principle of two-tier discrimination, the principle of class membership and the principle of selective data normalisation. The first principle provides the structure for the hierarchical application of learning, the second principle supports this hierarchical structure with mathematical logistics, the third principle tailors the rule of probabilistic class membership to suit the hierarchical structure, and the fourth principle applies data normalisation in a selective way, which again helps hierarchical learning. These four strong mathematical principles lend the theory greater fidelity, and its application is now extended from 5 to 10 datasets.
The rest of the paper is divided into eight sections. Section 2 introduces the principle of successive bifurcation, section 3 covers the principle of two-tier discrimination, section 4 discusses the principle of class membership, and section 5 details the principle of selective data normalisation. The theory's ability to develop error-free models is tested in section 6. Section 7 presents experimental results on the generalisation ability of the theory. The generalising ability of the theory is compared with the literature in section 8. Lastly, conclusions and future work are presented in section 9.

Principle of Successive Bifurcation
The concept behind hierarchical learning is creating a set of disjoint subsets whose union is the full training set and developing a simple model corresponding to each of these subsets. A model is learnt using the unclassified samples of the training set, and a domain is then assigned to the model, consisting of the samples that it classifies correctly. The rest of the samples are retained in the unclassified set of the training set to train another model in the next hierarchy. This means that at each hierarchy, the learnt model bifurcates the training set into classified and unclassified subsets. Therefore, hierarchical learning follows the principle of successive bifurcation of the training set. This principle is visualised in Figure 1. The principle of successive bifurcation described above generalises the hierarchical learning procedure. At any given point during hierarchical learning, the four statements (a-d) of equation 1 must be satisfied.

Postulate 1
The principle of successive bifurcation can be used as a tool to replace a high complexity model with several simpler models each with a constrained domain representing a unique subset of the training set such that union of all constrained domains equals the training set.

Implementation
Figure 2 presents a flowchart showing the implementation of the principle of successive bifurcation. The algorithm learns from the training set, which results in its bifurcation into the classified and the unclassified subsets. If the unclassified subset is not empty and has samples belonging to more than one class, then the unclassified set is used as the training set in the next hierarchy. In the figure, X refers to a subset containing members from only one class.
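The bifurcation loop of Figure 2 can be sketched in code. The sketch below is a minimal Python rendering of the principle, not the paper's evolutionary learner; `learn_model` and `classify` are hypothetical placeholders supplied by the caller, and a guard is added to stop if a hierarchy classifies nothing.

```python
import numpy as np

def hierarchical_learn(X, y, learn_model, classify):
    """Sketch of the principle of successive bifurcation.

    learn_model(X, y) -> model   : trains one simple model on a subset
    classify(model, X) -> labels : predicted class labels
    Returns the list of (model, domain_indices) pairs, one per hierarchy,
    plus the final remainder (empty, or samples of a single class).
    """
    hierarchy = []                        # learnt models with their domains
    unclassified = np.arange(len(y))      # start with the full training set
    while len(unclassified) > 0 and len(set(y[unclassified].tolist())) > 1:
        model = learn_model(X[unclassified], y[unclassified])
        pred = classify(model, X[unclassified])
        correct = pred == y[unclassified]
        if not correct.any():             # guard: no progress this hierarchy
            break
        hierarchy.append((model, unclassified[correct]))   # classified subset
        unclassified = unclassified[~correct]              # unclassified subset
    return hierarchy, unclassified
```

Any simple per-hierarchy learner can be plugged in; the loop ends when the unclassified remainder is empty or contains a single class.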

Principle of Two-tier Discrimination
The phrase two-tier discrimination refers to the two-step discrimination function of each model. Each model must perform a two-step discrimination: first, discrimination in terms of the classification of samples, and second, discrimination in terms of partitioning the training set into two subsets, that is, the subset within its domain and the subset outside its domain. Expression 2 encapsulates this whole idea.
where P(c_j, s_h, M_i) = probability that the sample s_h is a member of class c_j w.r.t. model M_i, and ψ = size of the set partition. Expression 2 says that if the probability P(c_j, s_h, M_i) is the greatest among all the classes by a minimum margin equal to the size of the set partition, then the sample s_h is a member of class c_j and lies within the classified subset C_i; otherwise it belongs to the unclassified subset U_i. The size of the set partition is the greatest margin by which the model M_i could misclassify a sample during a training session. It can be calculated through equation 3.
From equation 3, it can be seen that ψ, the size of the set partition, is the maximum difference, among all the training samples, between the largest and second-largest probabilities of membership, where the largest probability belongs to a wrongly assigned class.
The model in expression 2 describes a state in hierarchical learning at any given hierarchy H_i, where only two possibilities are available: either the sample under test is classified correctly, or its classification is postponed to the next hierarchy level. This process continues until the last hierarchy, where it attains a state in which either U_n = ∅ or U_n ∈ R, where R contains samples belonging to one class only. The class R can be called a remainder class. The remainder class has no potential to introduce errors into the classification, as it is distinguishable by the second tier of discrimination based on set partitioning. Therefore, this process can only end in the accurate classification of all samples. In the case of U_n ∈ R, the set partition ψ > 0; otherwise, if U_n = ∅, then ψ = 0.
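Equation 3 can be sketched as follows, assuming the per-class membership probabilities of the training samples are collected in a matrix (the matrix layout is an assumption of this sketch):

```python
import numpy as np

def set_partition_size(P, y):
    """Size of the set partition psi (sketch of equation 3).

    P[h, j] = membership probability of sample h for class j under the
    current model; y[h] = true class index.  psi is the largest margin
    (top minus second-top probability) over samples whose top-probability
    class is wrong, i.e. the greatest margin by which the model
    misclassifies during training.  psi = 0 when nothing is misclassified.
    """
    top = P.argmax(axis=1)
    sorted_P = np.sort(P, axis=1)
    margin = sorted_P[:, -1] - sorted_P[:, -2]   # largest minus second largest
    wrong = top != y                             # top class is a wrong class
    return margin[wrong].max() if wrong.any() else 0.0
```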

Postulate 2
Incorporating a set partition into the framework of probabilistic class membership is here referred to as the principle of two-tier discrimination. The principle can eliminate the misclassification of samples completely during hierarchical training, making the hierarchical learning model error-free.

Implementation
Please refer to Figure 2. The flowchart between arrow 1 and arrow 4 does not show any discriminatory rules for the classified and unclassified sets. Replace it with the flowchart in Figure 3, which shows the two-tier discrimination.
It can be seen in Figure 3 that, in the first tier, it is checked whether the sample under investigation obeys expression 2. If it does not, the 'otherwise' clause of expression 2 is materialised, and the sample is sent to the unclassified set to postpone its classification to the next hierarchy. However, if it obeys expression 2, then in the second tier it is checked whether it is classified correctly. If it is classified correctly, it is sent to the classified set; otherwise, the set partitioning margin is reset for the learning algorithm to restart the classification for the current hierarchy. The symbols used in the figure refer to the symbols described in expression 2 and equation 3.
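A minimal sketch of the first-tier check of expression 2, assuming the membership probabilities are precomputed; whether the margin comparison is strict or non-strict is not settled by the text, so a strict comparison is used here:

```python
import numpy as np

def two_tier_discriminate(P, psi):
    """First tier of expression 2 (sketch): a sample is classified only when
    its best class probability beats every other class by at least the set
    partition psi; otherwise its classification is postponed to the next
    hierarchy.  Returns (labels, classified_mask); labels[h] is -1 for
    postponed samples.
    """
    order = np.argsort(P, axis=1)
    best, second = order[:, -1], order[:, -2]
    idx = np.arange(len(P))
    margin = P[idx, best] - P[idx, second]
    classified = margin > psi                # strict margin: within the domain
    labels = np.where(classified, best, -1)  # -1 marks the unclassified subset
    return labels, classified
```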

Principle of Class Membership
From the description presented in section 3 of the principle of two-tier discrimination (expression 2), it can be seen that the decision about class membership in hierarchical learning is largely based on probability. The question now arises how to model this probability so that it can be useful in the assignment of class membership. Equation 4 provides such a framework, where λ = range parameter, computed in the theorem given later in the appendix.
σ_(f,j,i) = estimated standard deviation of the samples belonging to class c_j w.r.t. model M_i. The standard deviation σ_(f,j,i) can be estimated through equation 7. It should be noted that in equation 4, if either of the conditions x_(f,h) < min_(f,j,i) or x_(f,h) > max_(f,j,i) holds, the probability is pushed into the negative zone, which shows the flexibility of this model to accept negative probabilities, a concept quite well established in quantum mechanics [27]. This shows that the probabilistic model of equation 4 depends on the relative closeness of a sample with respect to the class means. It should be noted that this measure of relative closeness depends heavily on the computation of the class boundaries. The computation of a class boundary, in turn, depends entirely on the estimates of the minimum and maximum of the class, and these estimates largely depend on the range parameter mentioned in equation 6 of the model. Obviously, this range parameter is different at different hierarchical levels, as the size of the training subset continues to become smaller with each subsequent hierarchical level. Therefore, this parameter should be set according to the size of the training subset. Furthermore, as we move to the higher levels of hierarchy, the subsets not only become smaller but their sample spread also becomes larger, as the remaining samples are only those that failed to fit the earlier models. This necessitates estimating the maximum possible value of this parameter to capture the structure of the subset. We have proved in the theorem presented in the appendix that the range parameter for maximal spread is equal to √(n − 1), where n is the number of samples in the subset.
Experiments have shown that the value of the range parameter as computed above is only advantageous at the tail-end hierarchies, where subset sizes are substantially curtailed. For the rest of the hierarchies, its value trends around a fixed value. Therefore, the final value of the range parameter is as shown in equation 8.
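Assuming per-feature class statistics, the class-boundary computation of equations 5-7 can be sketched as below; the default λ = √(n − 1) is the maximal-spread value from the appendix, and the capping rule of equation 8 is not reproduced here:

```python
import numpy as np

def class_boundaries(samples, lam=None):
    """Class minimum and maximum estimates (sketch of equations 5-7).

    The boundaries sit lam standard deviations either side of the class
    mean; lam defaults to sqrt(n - 1), the maximal-spread value proved in
    the appendix.  The population form of the standard deviation is used,
    matching the appendix derivation.
    """
    n = len(samples)
    mu = np.mean(samples)                        # equation 5: class mean
    sigma = np.std(samples)                      # equation 7: std deviation
    if lam is None:
        lam = np.sqrt(n - 1)                     # appendix theorem
    return mu - lam * sigma, mu + lam * sigma    # equation 6: min and max
```

For the maximally spread set (one sample at 0, the rest at 1), the lower boundary returned with the default λ is exactly 0, as the appendix requires.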

Postulate 3
The principle of class membership, based on the relative closeness of a sample to the class mean, provides a convenient estimate of the probability of class membership, provided that the class boundaries are configured carefully according to the sample size. This probability is used to assign classes to samples according to expression 2.

Principle of Selective Data Normalisation
The principle of selective data normalisation involves two steps. The first is the development of a mechanism that decides whether data normalisation is needed. The second step decides how it should be done. Let us introduce the notion of the range ratio, which can help with the decision whether to normalise the data. The range ratio is the ratio between the maximum and minimum of the data belonging to a feature. The range ratio γ can be computed as shown in equation 9.

γ = max(abs(f_min), abs(f_max)) / min(abs(f_min), abs(f_max))    (9)

Now let us consider a feature as potent if its range ratio is greater than or equal to two, and as impotent otherwise. If at least 50% of the features of the dataset are potent, then there is no need to normalise the data; otherwise, data normalisation is needed. Equation 10 provides the method of data normalisation.
x_m = d_min + (x_0 − f_min) · (d_max − d_min) / (f_max − f_min)    (10)

where,
x_0 = original value of a feature of a sample
x_m = modified value of a feature of a sample
f_min = minimum value of a feature among all samples
f_max = maximum value of a feature among all samples
d_min = minimum value over all features among all samples
d_max = maximum value over all features among all samples
Equation 10 proportionally distributes the values from d_min to d_max, and it is applied only to the features selected for normalisation, i.e., the impotent features with γ < 2.

Postulate 4
The principle of selective data normalisation is based on the range of values of the different features in the dataset, which can be utilised to decide whether the dataset needs a data normalisation procedure.

Implementation
Follow the steps below.
• Compute the range ratio  according to equation 9 for each of the features.
• Compute the number of potent features N_p with γ ≥ 2.
• If N_p ≥ 0.5 · N_f then stop (N_f = total number of features).
• Choose the features to be normalised, i.e., those with γ < 2.
• Apply data normalisation to the chosen features according to equation 10.
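The steps above can be sketched as follows; the choice to rescale exactly the impotent features, and the reading of equation 10 as a proportional rescale into the dataset's global range, follow the descriptions in this section and are assumptions of this sketch:

```python
import numpy as np

def selective_normalise(X):
    """Principle of selective data normalisation (sketch of equations 9-10).

    A feature is potent when its range ratio gamma >= 2 (equation 9).  If
    at least half of the features are potent, no normalisation is applied;
    otherwise the impotent features are rescaled proportionally into the
    global range [d_min, d_max] of the whole dataset (equation 10).
    """
    X = X.astype(float).copy()
    f_min, f_max = X.min(axis=0), X.max(axis=0)      # per-feature extremes
    d_min, d_max = X.min(), X.max()                  # global extremes
    hi = np.maximum(np.abs(f_min), np.abs(f_max))
    lo = np.minimum(np.abs(f_min), np.abs(f_max))
    gamma = np.divide(hi, lo, out=np.full_like(hi, np.inf), where=lo != 0)
    potent = gamma >= 2
    if potent.sum() >= 0.5 * X.shape[1]:
        return X                                     # normalisation not needed
    for f in np.where(~potent)[0]:                   # impotent features only
        span = f_max[f] - f_min[f]
        if span > 0:                                 # equation 10 rescale
            X[:, f] = d_min + (X[:, f] - f_min[f]) * (d_max - d_min) / span
    return X
```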

Error-free Model Test
An error-free model test is a test prepared to verify whether the models developed through hierarchical learning can be error-free, as claimed in postulate 2 and represented by expression 2. An error-free model should accurately classify the whole dataset, i.e., it should not misclassify any of the samples of the dataset. Therefore, accurate classification of some well-known datasets through a chain of simpler models trained in hierarchical order may validate postulate 2 of the theory. Some challenging real-world datasets from the UCI repository were chosen to test this hypothesis. The feature and class descriptions of those datasets are given in Table 1. In Table 1, column 1 gives the name of the dataset, its domain and reference. Column 2 provides the feature description in the sequence in which the features appear in the dataset. Column 3 contains the class names and, finally, column 4 states the number of samples in each class, along with the total number of samples in the dataset. The datasets are alphabetically sorted. It should be noted that in Table 1 only 10 features are shown for the Breast Cancer Wisconsin Diagnostic dataset [30]; these are computed from a digitised image of a fine needle aspirate of a breast mass. The dataset comprises 10 image features to characterise the cell nuclei. However, the dataset has been augmented to comprise 30 attributes, including mean attribute values, standard deviations, and largest deviations from the mean values.
An evolutionary algorithm [10][11][12][13] was used to train models for the proposed hierarchical learning method, realised in Microsoft Visual Studio C/C++. It should be noted that the data normalisation procedure was not applied for the error-free model test. The program was tried on the 10 datasets described in Table 1. The method was tried 30 times on each dataset using random seeds. It should be noted that the whole dataset was taken as the training set. The trained models were later used to classify the same dataset. Each random trial ended in accurate classification of the dataset. The description of the models is given in Table 2.
Table 2 presents, for each of the ten datasets (Table 1), the model with the least number of hierarchies among the 30 random trials. Column 1 gives the name of the dataset. Column 2 provides the number of hierarchies. Column 3 shows the corresponding hierarchy level. The mathematical model in that hierarchy is presented in column 4. Column 5 reveals the number of samples classified by the model, whereas column 6 mentions either the number of unclassified samples or the number of samples belonging to the one last remaining class. Finally, column 7 states the size of the set partition. The datasets are presented in the same order as in Table 1. The feature numbers in the models (column 4) correspond to the feature numbers shown in Table 1. The following procedure can be followed to verify the models presented in Table 2:
• The model value of each sample in the relevant subset of the training set should be calculated by putting the feature values of each sample into the models shown in Table 2.
• Classification of each sample should be done using the principle of class membership. Its implementation procedure is given in section 4.2.
If readers find that any of the models given in Table 2 misclassifies any of the samples of the datasets, they should report it to the author along with the details.
From the results shown in Table 2, one finds that the proposed method is able to classify four datasets, namely Acute Inflammation Nephritis, Acute Inflammation Urinary, Balance Scale and Wine, in just one hierarchy, whereas Banknote Authentication, Iris and Breast Cancer Wisconsin Diagnostic are classified in only two hierarchies and, finally, Seeds, User Knowledge Modelling and Car Evaluation are classified in three, five, and 13 hierarchies respectively. The accurate classification of the datasets in multiple hierarchies shows the ability of the proposed theory to model complex nonlinear datasets with high precision.

Generalising Ability Test
Models are normally tested on two criteria. The first concerns their accuracy on the data on which they are trained (section 6). The second concerns their generalising ability, i.e., their accuracy on data on which they were not trained. This section deals with the second scenario. Experiments to evaluate the generalising ability of hierarchical learning involve using two disjoint subsets of the dataset for training and testing the model, respectively. The training sets are used to learn models, which are later tried on the test set. This division into training and test sets is done entirely at random. Furthermore, to cross-validate the results, the roles of the training and test sets are also interchanged to remove any bias. In addition, this is done several times to produce average performance results. In this work, this standard procedure is strictly followed to test the generalising ability of hierarchical models. The datasets are randomly divided into training and test sets of equal size, their roles are also reversed, and this procedure is repeated for 30 independent random trials. The 10 datasets used in these experiments were the same as those used in the experiments described in section 6. The results of these experiments are reported in Table 3, in which columns 1 and 2 identify the dataset. Columns 3 and 4 list the best and the worst results, respectively, achieved among the 30 random trials. Column 5 lists the average results of the 30 random trials. Column 6 states what percentage of the results among the 30 random trials were 100% accurate classifications. The average results reveal that the proposed method achieves more than 90% correct results on each dataset.
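The evaluation protocol just described (random equal-sized split, role reversal, repeated independent trials) can be sketched as a harness; `train_fn` and `test_fn` are placeholders for the hierarchical learner and its classifier:

```python
import numpy as np

def generalisation_test(X, y, train_fn, test_fn, trials=30, seed=0):
    """Sketch of the generalising-ability protocol: random 50/50 split,
    train on one half and test on the other, swap the roles of the two
    halves, and repeat for several independent random trials.
    Returns (worst, best, average) accuracy over all runs.
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(trials):
        idx = rng.permutation(len(y))
        a, b = idx[: len(y) // 2], idx[len(y) // 2:]
        for tr, te in [(a, b), (b, a)]:      # role reversal removes bias
            model = train_fn(X[tr], y[tr])
            scores.append(np.mean(test_fn(model, X[te]) == y[te]))
    return min(scores), max(scores), float(np.mean(scores))
```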

Comparison with Contemporary Methods
We compare the proposed method with state-of-the-art methods for which results on the above datasets have been reported. Unfortunately, there is none that has been tried on all ten datasets. However, one recently published paper on Random Forest [36] has tested five popular ensemble methods on nine of the ten datasets (all except Breast Cancer Wisconsin Diagnostic). We also found another paper on classification trees [37], which has been tested on nine of the ten datasets (all except User Knowledge Modelling). Table 4 presents the comparison of the results of the proposed method with those of these methods. Columns 1 and 2 of the table identify the dataset; columns 3-9 state the average results of the seven methods (as referred to in the table) on the ten datasets in terms of the percentage of correct predictions. At the bottom of the table there are three additional rows, i.e., rows 11-13. Row 11 provides the overall average over all ten datasets. The number of random trials is mentioned in row 12 and, finally, row 13 reveals the size of the training set as a percentage of the size of the complete dataset. The results show that hierarchical learning achieves an overall average of 95.37% against the best average of the other schemes, 94.68%. Furthermore, hierarchical learning was trained on a much smaller proportion of the datasets, i.e., 50%, compared to the competing methods trained on 75%-90% of the datasets. In addition, the computational time of the other schemes ranges within 5-15 minutes against an average of less than one minute in the case of hierarchical learning. Further, the other schemes used much faster application-optimised cloud-computing environments, compared to the personal laptop used to produce the results presented here with the proposed method. Accounting for all these facts, one may regard the results produced by the proposed method as competitive.

Conclusion and Future Work
This is the sixth paper in the series of papers on hierarchical learning. It theorises the method of hierarchical learning through four principles, i.e., the principle of successive bifurcation, the principle of two-tier discrimination, the principle of class membership and the principle of selective data normalisation. The first principle proposes that multiple simpler models can emulate the effect of a more complex model when put together hierarchically. The second principle separates the datapoints in terms of classes as well as domain. The third principle establishes the class membership rule at different hierarchy levels. The last principle articulates the rules for data normalisation. The presented method is supported not only on mathematical grounds but also by empirical results on ten popular real-world classification datasets taken from the UCI repository. On these datasets, accurately classifying nonlinear discriminant models were produced; details of some of those models are given in section 6. The procedure to evaluate the accuracy of those models is also detailed. The generalising ability of the hierarchical method was also tested on the same datasets. The technique produced more than 95% correct results on average while trained on only 50% of the samples. Interestingly, the average of the worst results over 30 random trials on all the datasets also turns out to be greater than 92%, which is a commendable result. The method performs competitively when compared with the results of other state-of-the-art methods. Despite all the above success, the technique still needs further theoretical enhancements for wider applicability across a large spectrum of datasets, which are currently under investigation.

Appendix

The class minimum is placed λ standard deviations below the mean:

min = μ − λ · σ    (11)

Rearranging the variables:

λ = (μ − min) / σ    (12)

Applying limits as the minimum approaches 0:

lim (min→0) λ = μ / σ    (13)

The samples of a set are maximally spread when one of the samples is closest to the mean while the rest of the samples are at the farthest point from the mean. Let us assume that the maximum distance between the points is unity. Since in equation 13 the minimum lies at 0, the maximum should be at 1. Following our assumption of maximal spread, let one sample lie at the minimum 0 and n − 1 samples lie at the maximum 1. By substituting these values in equation 5, the mean of the point set can simply be calculated, as shown in equation 14:

μ = (n − 1) / n    (14)
We can compute the standard deviation by substituting the value of the mean from equation 14 into equation 7, as shown in equation 15:

σ = √(n − 1) / n    (15)
Substituting the values of the mean (equation 14) and the standard deviation (equation 15) into equation 13, we get the value of the range parameter, as shown in equation 16:

λ = √(n − 1)    (16)
Equation 16 proves the theorem statement.
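The theorem can also be checked numerically; this sketch recomputes λ = μ/σ of equation 13 for the maximally spread set and compares it with √(n − 1):

```python
import numpy as np

def range_parameter(n):
    """Range parameter lambda = mu / sigma (equation 13) for the maximally
    spread set: one sample at 0, n - 1 samples at 1."""
    samples = np.array([0.0] + [1.0] * (n - 1))
    mu = samples.mean()          # equation 14: (n - 1) / n
    sigma = samples.std()        # equation 15: sqrt(n - 1) / n
    return mu / sigma

# lambda equals sqrt(n - 1) for every n, as equation 16 states
for n in (2, 5, 10, 100):
    assert abs(range_parameter(n) - np.sqrt(n - 1)) < 1e-9
```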

Figure 1. Principle of successive bifurcation

Figure 1 depicts the training set, which is partitioned into subsets C_1, …, C_n by the models M_1, …, M_(n-1). Each model M_i in hierarchy H_i bifurcates the training set into classified C_i and unclassified U_i subsets. The last subset C_n is either the NULL subset or a subset containing samples from one class only, illustrating the end of the hierarchical classification procedure.

Figure 2. Principle of successive bifurcation implementation

Figure 3. Principle of two-tier discrimination implementation

Follow the steps below:
• Compute the mean based on equation 5.
• Compute the standard deviation based on equation 7.
• Compute the range parameter based on equation 8.
• Compute the minimum and maximum of the class based on equation 6.
• Compute the probability from equation 4.
a. M_(i-1) precedes M_i, whose domain is the subset U_(i-1), with the class set Y as its codomain.
b. At H_1, the training set T is the set of all samples, including C_1 and U_1.
c. At H_i, the classified portion of the training set is the union of all sets classified at levels 1 through i.
d. At H_n, the total number of hierarchies is always one less than the total number of subsets.

Table 1. Class and Feature Description of datasets

Table 2. Accurate models of classification datasets

Table 3. The Classification Results

Table 4. Comparison with Literature