Discussing Decision Trees: What Makes a Good Split?

Image by Editor | ChatGPT

Introduction

It’s no secret that most advanced artificial intelligence solutions today are predominantly based on impressively powerful and complex models like transformers, diffusion models, and other deep learning architectures. However, models of more modest complexity, such as decision trees and random forests (which combine multiple decision trees to perform a single predictive task), remain a very popular and effective solution for building predictive systems across diverse scenarios, especially in business contexts. From financial fraud detection to customer churn prediction and demand forecasting for supply chain optimization, decision tree-based models are widely used in real-world applications.

But what are the underlying mechanisms that make decision trees so well-suited for various predictive tasks? And what criteria are internally used to construct them? Specifically, how are nodes recursively split as the tree-shaped structure is formed? This article takes a closer look at the inner workings of decision trees, focusing on how branches are created through deliberate, data-driven splitting (spoiler: it certainly doesn’t happen at random).

How Decision Trees Are Built

To gain an initial understanding of how decision trees are built and, consequently, how nodes containing training data examples are split, consider the example illustrated in the image below:

Homogeneity is the guiding principle for building a good decision tree
Image by Author

Suppose we have a big box containing balls of two different colors, and we want to distribute them into two smaller boxes. The requirement is to divide the set of balls into two subsets that are each as homogeneous as possible; that is, with as little mixing of colors as possible. Three options are presented for performing this “split” of the original box into two smaller boxes: option A, option B, and option C. Which would you choose?

If you chose option B, then we are aligned. And best of all, you have already captured much of the essence of how decision trees are built.

The process of building or training a decision tree starts with a single node, visually represented at the top of the image shown below. Just like a Galton Board with tiny beans poised at the top, that initial node encompasses the entire set of training data. For example, if we take the popular iris dataset and intend to build a decision tree for classifying iris observations into one of their three possible species (classes), the training process starts with all 150 instances of the dataset in the top node.

An example decision tree built on the iris dataset
Image by Author
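
For readers who want to reproduce a tree like the one in the figure, here is a minimal sketch using scikit-learn’s DecisionTreeClassifier; the random_state value is arbitrary, and the exact tree you obtain may differ slightly from the figure depending on your scikit-learn version.

```python
# A sketch of training a decision tree on the iris dataset with scikit-learn
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the 150 iris instances (50 per class) that fill the root node
iris = load_iris()
X, y = iris.data, iris.target

# Fit a CART-style decision tree on the full training set
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X, y)

# Print the learned hierarchy of splitting conditions as plain text
print(export_text(tree, feature_names=iris.feature_names))
```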

Here’s the catch, and the reason why we identified option B as the best choice in the earlier example: class homogeneity is the key criterion to pursue in the process of splitting the root node of a decision tree and gradually growing it. We need to ensure data instances are gradually separated into nodes that are as homogeneous as possible. Obviously, the starting point is far from homogeneous, with an initial, full training set that contains 150 instances equally distributed into three classes, with 50 instances per class.

Making Good Splits

So, how do we split a node containing, say, 150 instances, into two “smaller” nodes, each containing a subset of these instances? This is done by defining a splitting condition, associated with an attribute in the dataset, and a threshold value that will send every instance to one side of the split or the other, depending on its value. For example, by looking at the information at the top of the root node, we can observe the condition:

[ petal length (cm) <= 2.45 ]

This condition is used to split the 150 instances, such that those whose ‘petal length’ value is smaller than or equal to 2.45 cm are sent to one of the “child nodes,” and those with a petal length larger than 2.45 cm are sent to the other child node. This results in 50 instances in the first child node and 100 in the second.
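
To see this split in action outside of any tree-building code, here is a small sketch that applies the [ petal length (cm) <= 2.45 ] condition directly to the iris feature matrix (assuming scikit-learn’s bundled copy of the dataset, where petal length is the third column).

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# 'petal length (cm)' is the third column of the iris feature matrix
petal_length = X[:, 2]

# Apply the root split: petal length <= 2.45 cm goes to the left child,
# everything else goes to the right child
left_mask = petal_length <= 2.45
print("Left child: ", left_mask.sum(), "instances")    # 50 (all of them setosa)
print("Right child:", (~left_mask).sum(), "instances")  # 100
```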

Earlier I said “nothing happens at random” when building a decision tree, and here’s what I meant: the splitting condition involving the petal length attribute and the threshold value of 2.45 is not chosen arbitrarily, but results from applying a decision tree-building algorithm called CART (Classification And Regression Trees), which searches for and identifies the splitting condition that maximizes class homogeneity or, put another way, minimizes node impurity. Impurity is a measure of heterogeneity among the instances contained in a node of the decision tree, and there are several ways to quantify it: entropy, borrowed from information theory, and the Gini index are among the most commonly used.
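
As a concrete illustration, here is a small sketch of how the Gini index and entropy of a node could be computed from its per-class instance counts; the gini and entropy helpers are purely illustrative, not part of any library.

```python
import numpy as np

def gini(counts):
    """Gini index of a node, given its per-class instance counts."""
    p = np.asarray(counts, dtype=float)
    p /= p.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Entropy (in bits) of a node, given its per-class instance counts."""
    p = np.asarray(counts, dtype=float)
    p /= p.sum()
    p = p[p > 0]  # skip empty classes to avoid log(0)
    return -np.sum(p * np.log2(p))

# The iris root node holds 50 instances of each of the three classes
print(gini([50, 50, 50]))     # ~0.667, the maximum for three balanced classes
print(entropy([50, 50, 50]))  # ~1.585 bits

# A perfectly homogeneous node evaluates to zero under both measures
print(gini([50, 0, 0]), entropy([50, 0, 0]))
```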

A decision tree algorithm like CART evaluates a large space of possible [attribute – threshold] conditions to split a node of the tree into two sub-nodes or child nodes, and chooses the split that would maximize the impurity reduction; that is, “by splitting the colored balls in the large box into two smaller boxes, which condition yields the most substantial shift from a single, diverse box towards two more homogeneous boxes?” That’s impurity reduction: moving from less homogeneous to more homogeneous nodes as we progress in building the tree by growing new nodes and performing successive splits.
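
The sketch below mimics, in a simplified way, the kind of search described above for a single feature: it scores candidate thresholds on petal length by their Gini impurity reduction and picks the best one. The helper functions are illustrative; a real implementation such as CART repeats this search over all features, not just one.

```python
import numpy as np
from sklearn.datasets import load_iris

def gini(y):
    """Gini index of a set of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_reduction(y, feature_values, threshold):
    """Drop in weighted Gini impurity produced by a [feature <= threshold] split."""
    left = y[feature_values <= threshold]
    right = y[feature_values > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
    return gini(y) - weighted

iris = load_iris()
petal_length, y = iris.data[:, 2], iris.target

# Candidate thresholds: midpoints between consecutive sorted unique values
values = np.unique(petal_length)
candidates = (values[:-1] + values[1:]) / 2

best = max(candidates, key=lambda t: impurity_reduction(y, petal_length, t))
print(best)  # ~2.45 for this feature, matching the root split shown earlier
```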

If you look at the value = [X, Y, Z] lists in each node of the decision tree visualized earlier, which indicate the distribution of instances per class in that node, you can see that as we descend through the tree, the distribution becomes increasingly concentrated around a single class, which is essentially the objective to achieve in building the tree.

On a final note, not everything is completely black or white, and there are always practical nuances to take into account. For example, exhaustively seeking full homogeneity in every part of the grown tree is usually not a good idea in practice. In fact, overgrowing a tree for the sake of achieving total homogeneity often means the model is memorizing the training data rather than learning from it in a balanced way. If you are fairly familiar with machine learning, you may know what this means: the risk of overfitting. Thus, while the main guiding principle for splitting nodes in a decision tree is seeking class homogeneity, this process must be applied carefully to avoid ending up with a model that fails to generalize to future unseen data because it excessively memorizes the training data.
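
In practice, this is typically handled by limiting tree growth. Below is a minimal sketch comparing a fully grown tree against a deliberately constrained one via cross-validation; the specific hyperparameter values (max_depth=3, min_samples_leaf=5) are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

# A fully grown tree (pure leaves) versus a deliberately constrained one
models = {
    "fully grown": DecisionTreeClassifier(random_state=42),
    "constrained": DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                          random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean cross-validated accuracy = {scores.mean():.3f}")
```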

Wrapping Up

This article examined a key aspect of how decision tree-based models in machine learning work: the process of splitting nodes to build and grow a decision tree for predictive tasks like classification and regression. Using gentle language and without going deeply technical, we demystified the ins and outs of constructing a decision tree capable of making accurate predictions based on a dynamically defined hierarchy of rules or conditions.


