Decision Trees and Evolutionary Programming
By ai-depot | December 4, 2002
Recursive Partitioning
The question that comes to mind at this point is, "How do we develop a decision tree from any given data matrix?" While we could sit down and manually construct a decision tree based upon the data we have, this would be very difficult even for the simplest of data. It is not unusual in some fields to have a data matrix consisting of thousands of rows and thousands of columns. Manually constructing a decision tree from a training set of this magnitude would be extremely laborious. Furthermore, we would be likely to identify random relationships that describe the training data well but do not represent the true underlying relationships between features and classes. A decision tree of this sort would be "over-fit" to the training data and not useful for outside predictions due to its dependency on small, random correlations in the training data.
To address the problem of decision tree construction, an algorithm known as Recursive Partitioning was developed. While many different versions of Recursive Partitioning (RP) exist, each with its own unique details, the overall methodology is the same regardless of the exact implementation. Figure 2 illustrates the general methodology involved in Recursive Partitioning.
Figure 2: Recursive Partitioning
To understand the RP process, a few basic concepts must first be understood. The first of these is the concept of splitting (partitioning) the training set. During the process of decision tree induction, we have to consider what questions we will ask in order to direct an observation down the appropriate path. For simplicity, let's assume that each potential question has a true or false answer; thus, any particular node will have at most two paths leading from it to the next node(s). Every possible value of every possible feature within the training set represents a potential split. The result is that we can go down the right path or the left path based upon the data, and at each node we effectively split the data into two independent groups; this is partitioning. For example, let us consider a training set which has purely numerical data. The features will be called Xn and the possible values for those features will be called Ym. As a result, every question that could be asked can take the form, "Is Xn less than or equal to Ym?" The answer directs us down the appropriate path, e.g., if Xn is less than or equal to Ym, then go left, otherwise go right. Once we have two new nodes (child nodes) linked to a previous node (the parent node), we can repeat the process for each child node independently, using only the observations present in that child; this is the recursive step.
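To make a single split concrete, here is a minimal sketch in Python. The function name `split_node`, the tuple-based rows, and the example values are illustrative assumptions rather than anything specified in the article.

```python
def split_node(rows, feature_index, threshold):
    """Answer "Is X_n <= Y_m?" for every observation and partition the
    rows into a left group (answer is yes) and a right group (answer is no)."""
    left = [row for row in rows if row[feature_index] <= threshold]
    right = [row for row in rows if row[feature_index] > threshold]
    return left, right

# Each row is (feature values..., class label); the last element is the class.
rows = [(2.7, 1.0, 'A'), (1.3, 3.1, 'A'), (3.6, 2.9, 'B'), (7.5, 0.5, 'B')]

# "Is X_0 <= 3.0?" -> the first two rows go left, the other two go right.
left, right = split_node(rows, feature_index=0, threshold=3.0)
print(left)
print(right)
```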
The next concept is how we choose which question to ask. Remember that we could ask a question of the above form for every possible value of every possible feature within our training set. To choose the most appropriate question, we use a measure called a purity measure to decide which split is the best of all the available choices. Many purity measures exist. The simplest (but definitely not the best) is the absolute purity of the classes in the resulting nodes, expressed as a percent purity, and it will serve to illustrate the process. The purity measure answers the question, "Based upon a particular split, how well did we separate the classes from each other?" We calculate this purity measure for every possible split and choose the split that gives the highest value.
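The sketch below illustrates that exhaustive search, scoring each candidate split by a size-weighted percent purity of the two children. The weighting scheme and the helper names (`percent_purity`, `best_split`) are my assumptions; practical implementations typically score splits with measures such as Gini impurity or information gain instead.

```python
from collections import Counter

def percent_purity(rows):
    """Fraction of rows in this group that belong to its majority class."""
    if not rows:
        return 0.0
    counts = Counter(row[-1] for row in rows)
    return max(counts.values()) / len(rows)

def best_split(rows):
    """Try every value of every feature as a threshold and keep the split
    whose children have the highest size-weighted percent purity."""
    n_features = len(rows[0]) - 1          # last element is the class label
    best = None                            # (score, feature_index, threshold)
    for f in range(n_features):
        for threshold in set(row[f] for row in rows):
            left = [r for r in rows if r[f] <= threshold]
            right = [r for r in rows if r[f] > threshold]
            score = (len(left) * percent_purity(left) +
                     len(right) * percent_purity(right)) / len(rows)
            if best is None or score > best[0]:
                best = (score, f, threshold)
    return best

rows = [(2.7, 1.0, 'A'), (1.3, 3.1, 'A'), (3.6, 2.9, 'B'), (7.5, 0.5, 'B')]
print(best_split(rows))   # (1.0, 0, 2.7): "Is X_0 <= 2.7?" cleanly separates A from B
```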
Finally, we must have some type of stopping criterion. If we were to allow the splitting process to continue until each leaf held only one observation, we would have a perfect tree! This situation, however, is very dangerous if we want to use the resulting decision tree for prediction: a tree of this sort is most likely over-fit to the training data and will not perform well on new data. To avoid this, we introduce stopping criteria to halt the recursive partitioning process. Much like purity measures, stopping criteria come in many different forms (see the sketch after this list), including:
- A maximum number of nodes in the tree. Once this maximum is reached, the process is halted.
- A minimum number of observations in a node. If a node contains no more than this minimum number of observations, it is not partitioned further and becomes a leaf.
- A threshold for the purity measure. If a node's purity exceeds this threshold, it is already pure enough and no further partitioning is attempted, regardless of how many observations it contains.
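The sketch below combines the three checks into a single stopping test. The function name `should_stop` and the default thresholds are illustrative assumptions, not values prescribed by the algorithm.

```python
from collections import Counter

def percent_purity(rows):
    """Fraction of rows belonging to the majority class (assumes a non-empty node)."""
    counts = Counter(row[-1] for row in rows)
    return max(counts.values()) / len(rows)

def should_stop(rows, n_nodes_so_far,
                max_nodes=50, min_observations=5, purity_threshold=0.95):
    """Return True when this node should become a leaf instead of being split."""
    if n_nodes_so_far >= max_nodes:               # the tree is already large enough
        return True
    if len(rows) <= min_observations:             # too few observations to split
        return True
    if percent_purity(rows) >= purity_threshold:  # node is already pure enough
        return True
    return False
```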
Once we have determined that we will stop the procedure, we have effectively reached a leaf in our tree. Based upon the observations that have made it through the tree to that leaf, we can assign it a class value. One common way to do this is to determine which class is present in the greatest numbers and use that class as the leaf's class value. This is simply a majority-rules voting method and, like all other aspects of tree induction, many different methods exist for determining a leaf's class label.
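A small sketch of that majority-rules vote follows; the helper name `leaf_label` and the toy data are illustrative, and ties are simply broken by whichever class was encountered first.

```python
from collections import Counter

def leaf_label(rows):
    """Assign the leaf the class that occurs most often among its observations."""
    counts = Counter(row[-1] for row in rows)
    return counts.most_common(1)[0][0]

leaf = [(2.7, 1.0, 'A'), (1.3, 3.1, 'A'), (3.6, 2.9, 'B')]
print(leaf_label(leaf))   # 'A' wins the vote 2 to 1
```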
While the above procedure for decision tree induction can be modified and fine-tuned for the task at hand, the general procedure remains the same regardless of the choice of purity measure or stopping criteria. Additionally, while this method is used extensively throughout the field of pattern recognition, a potential flaw exists. At each node we have decided to split, we look only at the immediate effects on the children. As mentioned above, we examine every possible way to partition that node's observations and choose the one which gives the best purity in the immediate children. It is conceivable, however, that a split which looks sub-optimal in the immediate children could open up much better choices later in the tree, choices we cannot see by examining the immediate children alone. In other words, a locally sub-optimal split may improve the quality of the choices available for later splits.
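Putting the pieces together, here is a self-contained sketch of the greedy induction loop; the comment in `best_split` marks the step where only the immediate children are scored, which is exactly where the flaw described above lives. All names, thresholds, and the toy data are illustrative assumptions, not a definitive implementation.

```python
from collections import Counter

def percent_purity(rows):
    """Fraction of rows belonging to the majority class of this group."""
    if not rows:
        return 0.0
    counts = Counter(row[-1] for row in rows)
    return max(counts.values()) / len(rows)

def best_split(rows):
    """Greedy step: score every candidate split by the purity of the
    immediate children only, ignoring anything deeper in the tree."""
    n_features = len(rows[0]) - 1
    best = None
    for f in range(n_features):
        for t in set(r[f] for r in rows):
            left = [r for r in rows if r[f] <= t]
            right = [r for r in rows if r[f] > t]
            if not left or not right:
                continue                     # skip splits that move nothing
            score = (len(left) * percent_purity(left) +
                     len(right) * percent_purity(right)) / len(rows)
            if best is None or score > best[0]:
                best = (score, f, t, left, right)
    return best

def build_tree(rows, min_observations=2, purity_threshold=0.99):
    """Recursively partition `rows` until a stopping criterion fires."""
    if (len(rows) <= min_observations or
            percent_purity(rows) >= purity_threshold):
        return {'leaf': Counter(r[-1] for r in rows).most_common(1)[0][0]}
    split = best_split(rows)
    if split is None:                        # no usable split: make a leaf
        return {'leaf': Counter(r[-1] for r in rows).most_common(1)[0][0]}
    _, f, t, left, right = split
    return {'question': (f, t),              # "Is X_f <= t?"
            'left': build_tree(left, min_observations, purity_threshold),
            'right': build_tree(right, min_observations, purity_threshold)}

rows = [(2.7, 1.0, 'A'), (1.3, 3.1, 'A'), (3.6, 2.9, 'B'),
        (7.5, 0.5, 'B'), (4.1, 2.2, 'B'), (0.9, 1.8, 'A')]
print(build_tree(rows))
```

On this toy data the procedure finds the split "Is X0 <= 2.7?" and immediately produces two pure leaves; on harder data the same greedy choice can miss splits that would only pay off further down the tree.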