Decision Tree Learning - Python for Integrated Circuits (http://www.globalsino.com/ICs/) - An Online Book
Like the name suggests, a decision tree model breaks the data down by asking a series of questions. A decision tree can be used for both binary and multi-class classification tasks. In binary classification, a decision tree typically has two possible outcomes at each node, leading to two branches, and the final leaves of the tree represent the two classes being predicted. Most libraries (including scikit-learn) implement binary trees, i.e., trees in which each split produces exactly two branches.

Decision tree classifiers are attractive models when interpretability matters. A decision tree is a tree-like model in which an internal node represents a decision based on the value of a particular feature, a branch represents the outcome of that decision, and a leaf node represents the final output or class label.

Decision trees are generally considered non-linear models. The decision boundaries they create are non-linear because they consist of a series of axis-parallel splits. In contrast, linear models, such as linear regression or linear support vector machines, create linear decision boundaries: the decision boundary of a linear model is a hyperplane that separates the input space into different regions. Because decision trees can capture complex relationships between the input features and the output, they allow for more flexible and intricate decision boundaries. Ensembles of decision trees, such as Random Forests or Gradient Boosted Trees, combine the strengths of multiple trees to create even more powerful models.

Figure 4313a shows an example of a decision tree.

Figure 4313a. Example of a decision tree. (Code)

In a decision tree, splits are made at internal nodes to divide the data into subsets based on the values of the input features. The decision on where to make these splits is a fundamental aspect of building the tree, and it depends on the choice of a feature and a threshold value. Here is how the process of making splits in a decision tree works:
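As a concrete picture of such a tree, the following minimal sketch (assuming scikit-learn and its bundled iris dataset, which are illustrative choices rather than part of the original example) fits a shallow classifier and prints the learned question sequence, making the axis-parallel, feature-versus-threshold structure explicit:

# Minimal sketch (assumes scikit-learn is installed): fit a shallow decision
# tree and print its learned rules as a series of feature/threshold questions.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(data.data, data.target)

# Each printed line is an axis-parallel split of the form "feature <= threshold".
print(export_text(clf, feature_names=list(data.feature_names)))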
The goal of making these splits is to create a decision tree that effectively segments the data into homogeneous subsets, where the data points within each subset are similar in terms of the target variable. This allows the tree to make accurate predictions for new, unseen data points by following the decision rules defined by the tree's structure. The selection of features and thresholds, as well as the stopping criteria, is determined by the specific algorithm used to build the decision tree (e.g., CART, ID3, C4.5) and by the problem type (classification or regression). The choice of these parameters greatly influences the structure and performance of the resulting tree.

The choice of where to make a split is determined by optimizing a splitting criterion that measures how well the split separates the data into more homogeneous subsets. The specific splitting criterion depends on whether you are building a classification tree or a regression tree: for classification trees, the typical criteria are Gini impurity and entropy (information gain); for regression trees, they are mean squared error (MSE) and mean absolute error (MAE).
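These criteria map directly onto the criterion parameter of scikit-learn's tree estimators. The sketch below (the dataset choices are illustrative assumptions) builds one classification tree and one regression tree with different criteria:

# Minimal sketch (assumes scikit-learn): choosing the splitting criterion for a
# classification tree (Gini impurity or entropy) and a regression tree
# (squared error or absolute error).
from sklearn.datasets import load_iris, load_diabetes
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: criterion="gini" (default) or "entropy".
X_c, y_c = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X_c, y_c)
print("classification training accuracy:", clf.score(X_c, y_c))

# Regression tree: criterion="squared_error" (MSE) or "absolute_error" (MAE).
X_r, y_r = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3, random_state=0)
reg.fit(X_r, y_r)
print("regression training R^2:", reg.score(X_r, y_r))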
Choosing the candidate split with the highest LogWorth is not a standard splitting criterion in most decision tree implementations. Instead, decision trees are typically built using the criteria mentioned above, with the goal of reducing impurity or misclassification error.
Likewise, the candidate split with the highest RMSE (root mean squared error) is not used as the splitting criterion in regression trees; the focus is on minimizing the error variance or the absolute error.

That said, the choice of splitting criterion can vary with the context and the goals of the analysis. While Gini impurity, entropy, MSE, and MAE are standard in most decision tree algorithms, other criteria, such as LogWorth, are sometimes used. LogWorth is not a standard splitting criterion in decision trees, but it can serve as a custom or domain-specific metric for a particular problem. LogWorth is often associated with statistical hypothesis testing or feature selection: it measures the significance of a predictor variable in explaining the variation in the response variable, so using it as a splitting criterion prioritizes variables that have a strong statistical association with the response. Whether this is appropriate depends on the nature of the data, the problem at hand, and the goals of the analysis; it might be suitable when the goal is to identify the most influential predictors, but it is not a standard approach in decision tree algorithms, and its effectiveness needs to be evaluated within the context of the specific problem and dataset.

Figure 4313b plots the months and latitudes that are suitable for skiing around the world.
Figure 4313b. Months and latitudes suitable for skiing around the world. (Code)

Figure 4313c plots the decision tree built from the month and latitude data of Figure 4313b.
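As a rough illustration of the LogWorth idea discussed above (commonly defined as -log10 of the p-value of a statistical test associated with a candidate split), the following sketch scores candidate latitude thresholds with a chi-square test. The synthetic data, the threshold values, and the use of SciPy are illustrative assumptions, not part of the original example:

# Illustrative sketch only: LogWorth (-log10 of the p-value) for a candidate
# split, scored with a chi-square test of independence between the split
# indicator and the class label. This is not a built-in scikit-learn criterion;
# the data and thresholds below are made up for illustration.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
latitude = rng.uniform(-90, 90, size=200)        # hypothetical feature
ski_ok = (np.abs(latitude) > 40).astype(int)     # hypothetical binary label
ski_ok ^= (rng.random(200) < 0.1)                # add some label noise

def logworth(feature, label, threshold):
    """-log10(p) of a chi-square test for the split feature < threshold."""
    left = feature < threshold
    table = np.array([
        [np.sum(label[left] == 0), np.sum(label[left] == 1)],
        [np.sum(label[~left] == 0), np.sum(label[~left] == 1)],
    ])
    _, p_value, _, _ = chi2_contingency(table)
    return -np.log10(p_value)

for t in (-40, 0, 40):
    print(f"threshold {t:>4}: LogWorth = {logworth(latitude, ski_ok, t):.2f}")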
:"Split" concept in decision tree: 1. Node Splitting: In a decision tree, each internal node represents a decision based on a certain feature, and the branches emanating from that node represent the possible values of that feature. The split is essentially a decision rule that divides the dataset into subsets based on the values of a certain attribute/feature. 2. Splitting Criteria: The decision on how to split a node is based on a certain criteria or metric. Common criteria include Gini impurity, information gain, or variance reduction, depending on whether you're dealing with classification or regression trees. 3. Example (Classification):. For instance, in a classification decision tree, a split might involve checking if a certain feature is greater than a threshold. If true, the data goes down one branch; if false, it goes down another. In decision tree, the split operation is represented as, A split can be used in the process of splitting a parent region in a decision tree. Each node corresponds to a region in the input space, and a split at a node divides that region into two child regions. The split is defined as a function of the feature number and the threshold . This split creates two subsets of the parent region : , which consists of instances where the -th feature is less than , and , which consists of instances where the -th feature is greater than or equal to . This formulation ensures that the parent region is partitioned into two disjoint sets based on the chosen feature and threshold. This is a fundamental step in the construction of decision trees, where the goal is to recursively split the data into subsets until certain criteria are met (e.g., a predefined depth is reached or a minimum number of samples in a node). The misclassification loss for a region is defined as 1−max(p^c), where is the proportion of examples in that belong to class , where, max(p^c) represents the maximum proportion of examples in belonging to any single class. The misclassification loss is designed to quantify the error in prediction for a given region. It penalizes the model based on the highest proportion of misclassified examples in that region. The subtraction from 1 ensures that the loss is minimized when the maximum proportion (max(p^c)) is close to 1, indicating that the dominant class in the region is correctly predicted. In decision tree training, the goal is to find the best splits that minimize the misclassification loss. When growing a tree, the algorithm evaluates different splits on different features and thresholds to find the one that minimizes the weighted sum of misclassification losses for the resulting child regions. While misclassification loss is one possible choice, other loss functions like Gini impurity or entropy are also commonly used in decision tree algorithms. The choice of loss function may depend on factors such as interpretability, computational efficiency, or specific goals of the modeling task. When constructing decision trees, the goal is to find splits that result in child regions ( and R2) with lower misclassification loss. The process involves evaluating potential splits on different features and thresholds and selecting the split that minimizes the overall loss. Mathematically, this is expressed as: where,:
The objective is to find a split that results in child regions with the minimum total misclassification loss. This process is performed recursively, creating a tree structure where each node represents a region and each split further refines the regions until a stopping criterion is met. By selecting splits that decrease the overall misclassification loss, the decision tree improves its predictive accuracy and separates the data into groups that are homogeneous with respect to the target variable. In decision tree construction, the focus is on minimizing the misclassification loss in the child regions (R_1 and R_2) rather than the loss in the parent region (R_p): the parent region will be split into two child regions, and it is the reduction of misclassification loss in these child regions that improves the overall model. The specific split is chosen by evaluating candidate splits and selecting the one that minimizes the overall misclassification loss; each split aims to improve the model's predictive ability by creating more homogeneous child regions with respect to the target variable.

Decision trees are fairly high-variance models. Variance in machine learning refers to a model's sensitivity to the specific training data it has seen. High variance often means that the model is very flexible and can fit the training data very closely, but it may not generalize well to new, unseen data. Decision trees are known for capturing complex relationships in the training data, which can lead to high variance: each decision node makes a decision based on a particular feature, and the tree can become very deep and complex, especially if it is allowed to grow without constraints. High variance is a double-edged sword. On the one hand, it allows the model to learn intricate patterns in the training data, making it capable of fitting complex relationships; on the other hand, this flexibility can lead to overfitting, where the model becomes too tailored to the training data and performs poorly on new data.

To mitigate the high variance of decision trees, techniques such as pruning or ensemble methods (e.g., Random Forests or Gradient Boosted Trees) are used. Pruning, the traditional regularization technique for decision trees, trims away branches that do not provide a significant improvement in predictive performance on a validation set; this prevents the model from fitting the noise or idiosyncrasies of the training data and improves its ability to generalize. Ensemble methods combine multiple decision trees to reduce overfitting and improve generalization.

About the time complexity during the training and testing (inference) phases: for a tree of depth h built on n training examples with d features, each example appears in O(h) nodes and each node evaluates candidate splits over the d features, so training typically costs on the order of O(n·d·h); at test time, a prediction only follows a single root-to-leaf path, so inference costs O(h) per example.
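As a minimal sketch of the pruning idea discussed above (assuming scikit-learn; the dataset and the ccp_alpha values are illustrative assumptions), cost-complexity pruning trades training fit for better generalization:

# Minimal sketch (assumes scikit-learn): cost-complexity pruning as one common
# way to reduce the variance / overfitting of a decision tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree typically fits the training data (almost) perfectly.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("unpruned:", full.score(X_train, y_train), full.score(X_test, y_test))

# Larger ccp_alpha prunes more aggressively, trading training fit for
# generalization (up to a point).
for alpha in (0.001, 0.01, 0.05):
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"ccp_alpha={alpha}:", pruned.score(X_train, y_train), pruned.score(X_test, y_test))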
The advantages of decision trees are:

1. They are easy to interpret: the learned model can be read as a sequence of feature/threshold questions.
2. They can capture non-linear relationships between the input features and the target.
3. They handle both classification and regression tasks.
4. Inference is fast, since a prediction only follows a single root-to-leaf path.
Decision trees, while popular and powerful, do have some disadvantages:

1. They are high-variance models and overfit easily unless they are pruned or otherwise regularized.
2. They are unstable: small changes in the training data can produce a very different tree.
3. They are built greedily, one split at a time, so the resulting tree is not guaranteed to be globally optimal.
4. Their axis-parallel decision boundaries can require very deep trees to approximate relationships that a linear model captures with a single hyperplane.