## ML Algorithm Series

### Classification, Clustering, Decision Trees, and the Concept of Entropy as a Splitting Criterion (Information Gain)

ex: A mortgage company wants to assess the risk of loan applicants; a utility company wishes to classify potential defaulters to design preventive measures; an online retailer wants to give away budget coupons for maximum utilization; a process manufacturer wants to optimize operational equipment efficiency… the cases are plenty: fraud detection, target marketing, performance prediction, manufacturing, and healthcare, the world of classification in machine learning.

Data classification is a two-step process: 1. a learning step and 2. a classification step. The classification model is constructed in the learning step, and the model is then used to predict the class labels for given data. In the first step a classifier is built describing a predetermined set of data made up of database tuples and their associated class *labels*. Note that the class label values are discrete, unordered, and categorical. The individual training tuples are randomly sampled from the analytical data asset(s). Because the training data is provided with class labels, the technique is also called supervised learning, meaning the *learning* of the classifier happens under supervision: the algorithm is told to which class each training tuple belongs. In unsupervised learning this concept is called *clustering*, in which neither the class label of each training tuple nor the number of classes to be learned is known to the algorithm in advance.

*Cluster analysis* is a widely used data discretization technique, applied to discretize a numeric attribute into clusters or groups. Clustering takes the distribution of a numeric attribute into consideration, along with the closeness of data points, to produce high-quality discretization. Clustering can also be used to generate a concept hierarchy for a numeric attribute by using a top-down splitting or bottom-up merging strategy, spawning each node into a concept cluster. We will discuss clustering in more depth in future sections, but for now let us comprehend the decision tree classification technique.
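To make the discretization idea concrete, here is a minimal sketch of a one-dimensional k-means (Lloyd's) loop that bins a numeric attribute into groups; the age values and initial centers are made up for illustration only.

```python
# Minimal 1-D k-means sketch for discretizing a numeric attribute.
# The ages and initial centers below are hypothetical illustrative values.
def kmeans_1d(values, centers, iters=10):
    for _ in range(iters):
        # assign each value to its nearest center
        groups = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(v - c))
            groups[nearest].append(v)
        # recompute each center as the mean of its group
        centers = [sum(g) / len(g) for g in groups.values() if g]
    labels = [min(range(len(centers)), key=lambda i: abs(v - centers[i]))
              for v in values]
    return centers, labels

ages = [23, 25, 31, 44, 47, 52, 66, 71, 88]
centers, labels = kmeans_1d(ages, centers=[30, 50, 80])
print(labels)   # cluster id per value, e.g. three age groups
print(centers)  # one representative center per group
```

Each resulting cluster can then serve as one interval (or one node of a concept hierarchy) for the numeric attribute.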

**Decision Tree Induction**: a statistical learning technique which classifies data into classes and is illustrated like a flowchart representing a tree structure. In this model the entire training data set is fed into the root, an *attribute* that plays a critical role in classification; the model then classifies the data set by flowing through a *query* structure until it reaches a *leaf*. TDIDT (top-down induction of decision trees) is the most common strategy used to learn a decision tree from data. The learning process is supervised because the model constructs the tree from class-labeled training tuples. In this technique each cluster at a node can be further decomposed into several clusters, forming a lower-level hierarchy. Three well-known decision tree algorithms are *ID3 (Iterative Dichotomiser 3), C4.5, and CART*. C4.5 is an extension of ID3 that accounts for unavailable values, continuous attribute ranges, and pruning of the decision tree. *Information gain* is the statistical measure used in ID3, whereas *gain ratio* is the measure used in C4.5. Both have a close relationship with *entropy* (ref. statistical mechanics: https://en.wikipedia.org/wiki/Entropy). In machine learning the *Shannon* theory of entropy is widely adopted to measure the uncertainty in a random variable: *Shannon entropy* quantifies the uncertainty in a random variable in terms of the expected value of the information present in a given message.
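The "flow through queries until a leaf" idea can be sketched with a toy tree built from nested dictionaries; the attribute names and branch values here are hypothetical, not the article's dataset.

```python
# A toy decision tree: internal nodes query an attribute, leaves hold a
# class label. Attribute names and values are hypothetical illustrations.
tree = {
    "attribute": "age",
    "branches": {
        "<65":   {"label": "no"},
        "65-85": {"attribute": "genetic_risk",
                  "branches": {"high": {"label": "yes"},
                               "low":  {"label": "no"}}},
        ">85":   {"label": "yes"},
    },
}

def classify(node, tuple_):
    # flow through the query structure until a leaf is reached
    while "label" not in node:
        node = node["branches"][tuple_[node["attribute"]]]
    return node["label"]

print(classify(tree, {"age": "65-85", "genetic_risk": "high"}))  # -> yes
print(classify(tree, {"age": "<65"}))                            # -> no
```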

*eq:* H(X) = ∑i P(xi) I(xi) = −∑i P(xi) logb P(xi). For example, if we toss a *fair coin*, the probability of heads or tails is 1/2; in such a maximally uncertain environment, one complete bit of information is produced by every toss outcome. If it is an *unfair or biased coin*, the entropy is lower, reaching its minimum of zero when the outcome is certain.
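The fair-versus-biased coin comparison can be checked directly with a few lines of Python using the standard library's `math.log2` (base b = 2, so entropy is in bits):

```python
from math import log2

def shannon_entropy(probs):
    # H(X) = -sum over i of p_i * log2(p_i); zero-probability terms contribute 0
    return -sum(p * log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))  # fair coin -> 1.0 bit
print(shannon_entropy([0.9, 0.1]))  # biased coin -> about 0.469 bits
print(shannon_entropy([1.0, 0.0]))  # certain outcome -> 0.0 bits
```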

Decision tree classifiers are very popular because the learning and classification steps are simple and fast, they do not require domain knowledge or parameter setting to construct a classifier, and they are highly useful for exploratory knowledge discovery. Decision trees can also be easily converted to classification rules, and attribute selection measures are used to best partition the tuples into distinct classes. Finally, when a decision tree is built, many of its branches may reflect noise or outliers; *tree pruning* is applied to identify and remove such noisy branches and improve classification accuracy.

*Classification algorithm (DTI)*

Using DTI to assist in the prediction of Alzheimer's: class-labeled training tuples (dataset "D")

The above table represents a training set D of class-labeled tuples, randomly selected from the ResearchGate website, for predicting Alzheimer's disease. In this dataset each attribute is discretized, and continuous values have been generalized. The class label, Alzheimer's disease, has two values (*yes* or *no*); therefore there are two distinct classes (m = 2).

Let C1 == yes and C2 == no; there are 9 tuples of class C1 and 8 tuples of class C2. A root node N is created for the tuples in D. The goal of DTI is now to find the splitting criterion for these tuples, for which we must compute the information gain for each attribute.

1) Compute the expected information (entropy) of D using the equation

*Info*(D) = −∑ [ pi log2(pi) ] for i ranging from 1 to m.

*Info*(D) = −9/17 log2(9/17) − 8/17 log2(8/17) = 0.4858 + 0.5118 ≈ 0.998 bits

2) Now let us compute the expected information requirement for each attribute, starting with age. For age < 65, the class tuples are C1 == 4 and C2 == 3; for age 65-85, C1 == 3 and C2 == 2; for age > 85, C1 == 3 and C2 == 2.

so the expected information needed to classify a tuple in D, if the tuples are partitioned according to *age*, is

**Infoage(D) = 7/17 (−4/7 log2(4/7) − 3/7 log2(3/7)) + 2 × { 5/17 (−3/5 log2(3/5) − 2/5 log2(2/5)) } ≈ 0.977 bits**, so Gain(age) = Info(D) − Infoage(D) ≈ 0.998 − 0.977 = 0.021 bits.
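These two calculations can be reproduced with a short script; it uses only the standard library and the class counts exactly as stated above (9 vs 8 overall, and the per-partition counts given for age).

```python
from math import log2

def entropy(counts):
    # entropy in bits of a class distribution given as raw counts
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# overall class distribution: 9 "yes" vs 8 "no"
info_d = entropy([9, 8])

# weighted entropy after partitioning on age:
# 7 tuples (4/3 split), and two partitions of 5 tuples (3/2 split) each
info_age = (7 / 17) * entropy([4, 3]) + 2 * (5 / 17) * entropy([3, 2])

gain_age = info_d - info_age
print(round(info_d, 3), round(info_age, 3), round(gain_age, 3))
```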

Similarly, after computing *Gain(GC)*, *Gain(VD)*, and *Gain(BI)* in bits, the attribute with the highest information gain is selected as the splitting attribute at the root node of the decision tree.

*Looks like you read the entire article! I would appreciate it if you could check my assumption: does age have the highest information gain?*

By Kiran Balijepalli