Isolation Forest algorithm

- Python Automation and Machine Learning for ICs -
- An Online Book -

Python Automation and Machine Learning for ICs http://www.globalsino.com/ICs/

=================================================================================

Isolation Forest is an algorithm used for anomaly detection in machine learning. [1] The primary goal of the Isolation Forest algorithm is to isolate anomalies or outliers in a dataset.

The key idea behind the Isolation Forest algorithm is to build tree structures in which anomalies are isolated into individual leaves after only a few splits. The algorithm exploits the fact that anomalies are typically rare and differ from the majority of the data points, and it proceeds as follows:

Random Selection of Features and Splitting:
- Randomly select a feature, then pick a random split value between the minimum and maximum of that feature.
- Repeat this process on each resulting subset until a tree structure is formed.
Recursive Partitioning:
- Continue partitioning the data into two subsets based on random feature splits until individual data points are isolated in leaf nodes.
Anomaly Score Calculation:
- Anomalies or outliers are expected to be isolated with fewer splits than normal data points. The isolation path length, i.e., the number of edges traversed from the root to reach a data point in the tree, is the basis of the anomaly score.
- Shorter path lengths indicate anomalies.
Ensemble of Trees:
- Build multiple trees in this manner to form an ensemble. The anomaly score for each data point is averaged or aggregated over all the trees.
Anomaly Detection:
- The final anomaly score for a data point can be compared to a threshold to determine whether it is an outlier or not.
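
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the names build_tree, path_length, and mean_path_length are hypothetical, the depth limit of 10 is an assumed value, and the leaf-size correction used in Liu et al. [1] is omitted for brevity.

```python
import random

def build_tree(points, rng, depth=0, max_depth=10):
    """Grow one isolation tree by recursive random partitioning."""
    # Stop once a point is isolated or the (assumed) depth limit is hit.
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))      # pick a random feature
    lo = min(p[feature] for p in points)
    hi = max(p[feature] for p in points)
    if lo == hi:                                 # simplification: stop on a constant feature
        return {"size": len(points)}
    split = rng.uniform(lo, hi)                  # pick a random split value
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, rng, depth + 1, max_depth),
        "right": build_tree(right, rng, depth + 1, max_depth),
    }

def path_length(tree, x):
    """Count the edges traversed from the root to the leaf that contains x."""
    edges = 0
    while "feature" in tree:
        branch = "left" if x[tree["feature"]] < tree["split"] else "right"
        tree = tree[branch]
        edges += 1
    return edges

def mean_path_length(forest, x):
    """Average the path length of x over all trees in the ensemble."""
    return sum(path_length(t, x) for t in forest) / len(forest)

rng = random.Random(0)
# Toy data: a dense cluster around the origin plus one distant point.
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
center = (0.0, 0.0)
data = cluster + [outlier]

forest = [build_tree(data, rng) for _ in range(100)]
print("outlier mean path length:", mean_path_length(forest, outlier))
print("center mean path length: ", mean_path_length(forest, center))
```

On this toy data the distant point should reach a leaf after far fewer splits, on average, than a point in the middle of the cluster, which is the property the anomaly score relies on.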

In anomaly detection using the Isolation Forest algorithm, anomalies are identified from the isolation score of each data point, which measures how easily the point can be separated from the rest of the data. The direction of the score depends on the convention: in the original formulation by Liu et al. [1], the score lies between 0 and 1 and values close to 1 indicate anomalies, whereas scikit-learn flips the sign, so that lower values indicate points more likely to be anomalies.

For Isolation Forest, the decision function is often based on the concept of path length. The intuition is that anomalies will have shorter average path lengths in the trees built by the algorithm. The isolation score for a data point is computed as follows:

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$ ------------------------------------------- [3699a]

where,

$h(x)$ is the path length of data point $x$ in a tree.
$E(h(x))$ is the average path length of $x$ over all trees in the forest.
$c(n)$ is a normalization factor that depends on the number of data points $n$.
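
As a worked example of the score, the sketch below assumes the standard form from Liu et al. [1], s(x, n) = 2^(-E(h(x))/c(n)), with c(n) = 2H(n-1) - 2(n-1)/n and the harmonic number approximated as H(i) ≈ ln(i) + 0.5772156649. The function names c and anomaly_score are hypothetical, and the special cases for n ≤ 2 follow a common convention that is assumed here.

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Normalization factor: average path length of an unsuccessful
    binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA    # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path, n):
    """s(x, n) = 2 ** (-E(h(x)) / c(n)) for n >= 2."""
    return 2.0 ** (-mean_path / c(n))

n = 256
print(anomaly_score(c(n), n))     # mean path equal to c(n) gives 0.5
print(anomaly_score(2.0, n))      # short path: score moves toward 1
print(anomaly_score(15.0, n))     # long path: score moves toward 0
```

A point whose average path length equals c(n) scores exactly 0.5; shorter paths push the score toward 1 (more anomalous) and longer paths push it toward 0.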

The decision function is then derived from the isolation score:

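$$\text{decision\_function}(x) = 0.5 - s(x, n)$$ --------------------------------------- [3699b]

where,

$s(x, n)$ is the isolation score of point $x$ from Equation 3699a. It lies between 0 and 1 and grows as $x$ becomes more likely to be an anomaly, although it is a score rather than a calibrated probability. The offset of 0.5 is the one scikit-learn applies with its default contamination="auto" setting.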

Lower values of the decision function in Equation 3699b indicate a higher likelihood of being an anomaly. Figure 3699 shows the Isolation Forest algorithm applied to anomaly detection. In the Python script using scikit-learn's IsolationForest, the decision function is available as the decision_function method, and the related score_samples method returns the negative of the anomaly score; in both cases, lower values mark more anomalous points. The specific decision threshold for classifying a point as an anomaly depends on the application and can be adjusted based on the desired level of sensitivity to anomalies, for example through the contamination parameter.
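
A minimal usage sketch with scikit-learn follows. It is not the script behind Figure 3699; the synthetic data, the two injected outliers, and the parameter values are assumptions chosen only to show how decision_function, score_samples, and predict relate.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# Assumed toy data: 200 points around the origin plus two distant points.
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_outliers = np.array([[8.0, 8.0], [-7.0, 9.0]])
X = np.vstack([X_normal, X_outliers])

model = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
model.fit(X)

decision = model.decision_function(X)   # lower = more abnormal, negative = outlier
raw = model.score_samples(X)            # negative of the anomaly score
labels = model.predict(X)               # -1 for outliers, 1 for inliers

# decision_function is score_samples shifted by the fitted offset.
print(np.allclose(decision, raw - model.offset_))
print("decision values of injected outliers:", decision[-2:])
print("labels of injected outliers:", labels[-2:])
```

Setting contamination to a float instead of "auto" moves the threshold so that roughly that fraction of the training data is flagged, which is one way to tune sensitivity.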

Figure 3699. Isolation Forest algorithm for anomaly detection: (a) data with anomalies, and (b) Isolation Forest anomaly detection.

============================================

[1] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou, Isolation Forest, 2008.

=================================================================================