Decision Tree Learning - Python for Integrated Circuits (http://www.globalsino.com/ICs/) - An Online Book
Like the name suggests, a decision tree model breaks the data down by asking a series of questions. A decision tree can be used for both binary and multi-class classification tasks. In binary classification, a decision tree typically has two possible outcomes at each node, leading to two branches, and the final leaves of the tree represent the two classes being predicted. Most libraries (including scikit-learn) implement binary trees, i.e., trees in which each split produces exactly two branches.

Decision tree classifiers are attractive models when interpretability matters. A decision tree is a tree-like model in which an internal node represents a decision based on the value of a particular feature, a branch represents the outcome of that decision, and a leaf node represents the final output or class label.

Decision trees are generally considered non-linear models. The decision boundaries they create are non-linear because they consist of a series of axis-parallel splits. In contrast, linear models, such as linear regression or linear support vector machines, create linear decision boundaries: the decision boundary of a linear model is a hyperplane that separates the input space into different regions. Because decision trees can capture complex relationships between the input features and the output, they allow for more flexible and intricate decision boundaries. Ensembles of decision trees, such as Random Forests or Gradient Boosted Trees, combine the strengths of multiple trees to create even more powerful models.

Figure 4313a shows an example of a decision tree.

Figure 4313a. Example of a decision tree. (Code)

In a decision tree, splits are made at internal nodes to divide the data into subsets based on the values of the input features. The decision on where to make these splits is a fundamental aspect of building the tree, and it depends on the choice of a feature and a threshold value. Here is how the process of making splits in a decision tree works:
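As a concrete picture of such a tree, the following minimal sketch (assuming scikit-learn and its bundled iris dataset, which are illustrative choices rather than part of the original example) fits a shallow classifier and prints the learned question sequence, making the axis-parallel, feature-versus-threshold structure explicit:

# Minimal sketch (assumes scikit-learn is installed): fit a shallow decision
# tree and print its learned rules as a series of feature/threshold questions.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(data.data, data.target)

# Each printed line is an axis-parallel split of the form "feature <= threshold".
print(export_text(clf, feature_names=list(data.feature_names)))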
The goal of making these splits is to create a decision tree that effectively segments the data into homogeneous subsets, where the data points within each subset are similar in terms of the target variable. This allows the tree to make accurate predictions for new, unseen data points by following the decision rules defined by the tree's structure. The selection of features and thresholds, as well as the stopping criteria, is determined by the specific algorithm used to build the decision tree (e.g., CART, ID3, C4.5) and by the problem type (classification or regression). The choice of these parameters greatly influences the structure and performance of the resulting tree.

The choice of where to make a split is determined by optimizing a splitting criterion that measures how well the split separates the data into more homogeneous subsets. The specific splitting criterion depends on whether you are building a classification tree or a regression tree: for classification trees, the typical criteria are Gini impurity and entropy (information gain); for regression trees, they are mean squared error (MSE) and mean absolute error (MAE).
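These criteria map directly onto the criterion parameter of scikit-learn's tree estimators. The sketch below (the dataset choices are illustrative assumptions) builds one classification tree and one regression tree with different criteria:

# Minimal sketch (assumes scikit-learn): choosing the splitting criterion for a
# classification tree (Gini impurity or entropy) and a regression tree
# (squared error or absolute error).
from sklearn.datasets import load_iris, load_diabetes
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: criterion="gini" (default) or "entropy".
X_c, y_c = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X_c, y_c)
print("classification training accuracy:", clf.score(X_c, y_c))

# Regression tree: criterion="squared_error" (MSE) or "absolute_error" (MAE).
X_r, y_r = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3, random_state=0)
reg.fit(X_r, y_r)
print("regression training R^2:", reg.score(X_r, y_r))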
Choosing the candidate split with the highest LogWorth is not a standard splitting criterion in most decision tree implementations. Instead, decision trees are typically built using the criteria mentioned above, with the goal of reducing impurity or misclassification error.
Likewise, the candidate split with the highest RMSE (root mean squared error) is not used as the splitting criterion in regression trees; the focus is on minimizing the error variance or the absolute error.

That said, the choice of splitting criterion can vary with the context and the goals of the analysis. While Gini impurity, entropy, MSE, and MAE are standard in most decision tree algorithms, other criteria, such as LogWorth, are sometimes used. LogWorth is not a standard splitting criterion in decision trees, but it can serve as a custom or domain-specific metric for a particular problem. LogWorth is often associated with statistical hypothesis testing or feature selection: it measures the significance of a predictor variable in explaining the variation in the response variable, so using it as a splitting criterion prioritizes variables that have a strong statistical association with the response. Whether this is appropriate depends on the nature of the data, the problem at hand, and the goals of the analysis; it might be suitable when the goal is to identify the most influential predictors, but it is not a standard approach in decision tree algorithms, and its effectiveness needs to be evaluated within the context of the specific problem and dataset.

Figure 4313b plots the months and latitudes that are suitable for skiing around the world.
Figure 4313b. Months and latitudes suitable for skiing around the world. (Code)

Figure 4313c plots the decision tree built from the month and latitude data of Figure 4313b.
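As a rough illustration of the LogWorth idea discussed above (commonly defined as -log10 of the p-value of a statistical test associated with a candidate split), the following sketch scores candidate latitude thresholds with a chi-square test. The synthetic data, the threshold values, and the use of SciPy are illustrative assumptions, not part of the original example:

# Illustrative sketch only: LogWorth (-log10 of the p-value) for a candidate
# split, scored with a chi-square test of independence between the split
# indicator and the class label. This is not a built-in scikit-learn criterion;
# the data and thresholds below are made up for illustration.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
latitude = rng.uniform(-90, 90, size=200)        # hypothetical feature
ski_ok = (np.abs(latitude) > 40).astype(int)     # hypothetical binary label
ski_ok ^= (rng.random(200) < 0.1)                # add some label noise

def logworth(feature, label, threshold):
    """-log10(p) of a chi-square test for the split feature < threshold."""
    left = feature < threshold
    table = np.array([
        [np.sum(label[left] == 0), np.sum(label[left] == 1)],
        [np.sum(label[~left] == 0), np.sum(label[~left] == 1)],
    ])
    _, p_value, _, _ = chi2_contingency(table)
    return -np.log10(p_value)

for t in (-40, 0, 40):
    print(f"threshold {t:>4}: LogWorth = {logworth(latitude, ski_ok, t):.2f}")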
:"Split" concept in decision tree: 1. Node Splitting: In a decision tree, each internal node represents a decision based on a certain feature, and the branches emanating from that node represent the possible values of that feature. The split is essentially a decision rule that divides the dataset into subsets based on the values of a certain attribute/feature. 2. Splitting Criteria: The decision on how to split a node is based on a certain criteria or metric. Common criteria include Gini impurity, information gain, or variance reduction, depending on whether you're dealing with classification or regression trees. 3. Example (Classification):. For instance, in a classification decision tree, a split might involve checking if a certain feature is greater than a threshold. If true, the data goes down one branch; if false, it goes down another. In decision tree, the split operation is represented as, A split can be used in the process of splitting a parent region in a decision tree. Each node corresponds to a region in the input space, and a split at a node divides that region into two child regions. The split is defined as a function of the feature number and the threshold . This split creates two subsets of the parent region : , which consists of instances where the -th feature is less than , and , which consists of instances where the -th feature is greater than or equal to . This formulation ensures that the parent region is partitioned into two disjoint sets based on the chosen feature and threshold. This is a fundamental step in the construction of decision trees, where the goal is to recursively split the data into subsets until certain criteria are met (e.g., a predefined depth is reached or a minimum number of samples in a node). The misclassification loss for a region is defined as 1−max(p^c), where is the proportion of examples in that belong to class , where, max(p^c) represents the maximum proportion of examples in belonging to any single class. The misclassification loss is designed to quantify the error in prediction for a given region. It penalizes the model based on the highest proportion of misclassified examples in that region. The subtraction from 1 ensures that the loss is minimized when the maximum proportion (max(p^c)) is close to 1, indicating that the dominant class in the region is correctly predicted. In decision tree training, the goal is to find the best splits that minimize the misclassification loss. When growing a tree, the algorithm evaluates different splits on different features and thresholds to find the one that minimizes the weighted sum of misclassification losses for the resulting child regions. While misclassification loss is one possible choice, other loss functions like Gini impurity or entropy are also commonly used in decision tree algorithms. The choice of loss function may depend on factors such as interpretability, computational efficiency, or specific goals of the modeling task. When constructing decision trees, the goal is to find splits that result in child regions ( and R2) with lower misclassification loss. The process involves evaluating potential splits on different features and thresholds and selecting the split that minimizes the overall loss. Mathematically, this is expressed as: where,:
The objective is to find a split that results in child regions with the minimum total misclassification loss. This process is performed recursively, creating a tree structure where each node represents a region and each split further refines the regions until a stopping criterion is met. By selecting splits that decrease the overall misclassification loss, the decision tree improves its predictive accuracy and separates the data into groups that are homogeneous with respect to the target variable. In decision tree construction, the focus is on minimizing the misclassification loss in the child regions (R_1 and R_2) rather than the loss in the parent region (R_p): the parent region will be split into two child regions, and it is the reduction of misclassification loss in these child regions that improves the overall model. The specific split is chosen by evaluating candidate splits and selecting the one that minimizes the overall misclassification loss; each split aims to improve the model's predictive ability by creating more homogeneous child regions with respect to the target variable.

Decision trees are fairly high-variance models. Variance in machine learning refers to a model's sensitivity to the specific training data it has seen. High variance often means that the model is very flexible and can fit the training data very closely, but it may not generalize well to new, unseen data. Decision trees are known for capturing complex relationships in the training data, which can lead to high variance: each decision node makes a decision based on a particular feature, and the tree can become very deep and complex, especially if it is allowed to grow without constraints. High variance is a double-edged sword. On the one hand, it allows the model to learn intricate patterns in the training data, making it capable of fitting complex relationships; on the other hand, this flexibility can lead to overfitting, where the model becomes too tailored to the training data and performs poorly on new data.

To mitigate the high variance of decision trees, techniques such as pruning or ensemble methods (e.g., Random Forests or Gradient Boosted Trees) are used. Pruning, the traditional regularization technique for decision trees, trims away branches that do not provide a significant improvement in predictive performance on a validation set; this prevents the model from fitting the noise or idiosyncrasies of the training data and improves its ability to generalize. Ensemble methods combine multiple decision trees to reduce overfitting and improve generalization.

About the time complexity during the training and testing (inference) phases: for a tree of depth h built on n training examples with d features, each example appears in O(h) nodes and each node evaluates candidate splits over the d features, so training typically costs on the order of O(n·d·h); at test time, a prediction only follows a single root-to-leaf path, so inference costs O(h) per example.
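As a minimal sketch of the pruning idea discussed above (assuming scikit-learn; the dataset and the ccp_alpha values are illustrative assumptions), cost-complexity pruning trades training fit for better generalization:

# Minimal sketch (assumes scikit-learn): cost-complexity pruning as one common
# way to reduce the variance / overfitting of a decision tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree typically fits the training data (almost) perfectly.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("unpruned:", full.score(X_train, y_train), full.score(X_test, y_test))

# Larger ccp_alpha prunes more aggressively, trading training fit for
# generalization (up to a point).
for alpha in (0.001, 0.01, 0.05):
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"ccp_alpha={alpha}:", pruned.score(X_train, y_train), pruned.score(X_test, y_test))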
The advantages of decision trees are:

1. They are easy to interpret: the learned model can be read as a sequence of feature/threshold questions.
2. They can capture non-linear relationships between the input features and the target.
3. They handle both classification and regression tasks.
4. Inference is fast, since a prediction only follows a single root-to-leaf path.
Decision trees, while popular and powerful, do have some disadvantages:

1. They are high-variance models and overfit easily unless they are pruned or otherwise regularized.
2. They are unstable: small changes in the training data can produce a very different tree.
3. They are built greedily, one split at a time, so the resulting tree is not guaranteed to be globally optimal.
4. Their axis-parallel decision boundaries can require very deep trees to approximate relationships that a linear model captures with a single hyperplane.