Introduction to Isolation Forests
Isolation Forests (IFs), presented in Liu1 et. al (2012), are a popular algorithm used for outlier classification. In a very simplified way, the method consists of building an ensemble of Isolation Trees (ITs) for a given data set and observations are deemed anomalies if they have short adjusted average path lengths on the ITs.
ITs, which will be covered shortly, have several properties in common with a fundamental data structure: the Binary Search Tree (BSTs). In a very simplified way, BSTs are a special instance of tree structures where keys are kept in such an order that a node search is performed by iteratively (or recursively) choosing a left or right branch based on a quantitative comparison (e.g. lesser or greater). Node insertion is performed by doing a tree search, using the method described previously, until reaching an external node, where the new node will be inserted. This allows for efficient node searches since, on average, half the tree will not be visited. To illustrate this assume the values
One of the properties of BSTs is that, with randomly generated data, the path between the root node and the outliers will typically be shorter. We can see from the illustration below that, with our example data, the path length for (say) 7 is twice the length than for the suspicious value of 54. This property will play an important role in the IF algorithm, as we will see further on.
{{< figure src="/site/images/isolationforests/bst_path_length.png" alt="bst_path_length.png" >}}Isolation Trees
Since ITs are the fundamental component of IFs, we will start by describing their building process. We start by defining
To build an isolation tree
Based on the cut-off point
and return an internal node having an isolation tree with left and right nodes as respectively
To illustrate this (and the general method of identifying anomalies in a two dimensional feature space,
The particular realisation of this simulation looks like this:
{{< figure src="/site/images/bbdtrees/gaussian_data.png" alt="data.png" >}}Below we illustrate the building of a single IT (given the data), illustrating the feature split point
This is a modal window.
In order to perform anomaly detection (e.g. observation scoring) we will then use the IT equivalent of the BST unsuccessful search heuristics. An external node termination in an IT is equivalent to a BST unsuccessful search. Given an observation
This technique amounts to partitioning the feature space randomly until feature points are “isolated”. Intuitively, points in high density regions will need more partitioning steps, whereas anomalies (by definition away from high density regions) will need fewer splits. Since the building of the ITs is performed in a randomised fashion and using a subsample of the data, this density predictor can be average over a number of ITs, the Isolation Forest.
Intuitively, this could be done by calculating the average path length for our
where
Denoting
Given these quantities we can then, finally, calculate the anomaly score,
with
- Parameters
As mentioned in Liu3 et. al (2012), the empirical subsampling size
Now that we know how to implement an IF algorithm and calculate an anomaly score, we will try to visualise the anomaly score distribution in the vicinity of the simulated data. To do so, we simply create a two dimensional lattice enclosing our data an iteratively calculate
The above steps fully define a naive isolation forest algorithm, which when applied to the previously simulated data, result in 88% of the anomalies being correctly identified.
{{< figure src="/site/images/isolationforests/detection.png" alt="detection.png" >}}Thanks for reading! If you have any questions or comments, please let me know on Mastodon or Twitter.
Footnotes
Liu, F. T., Ting, K. M., & Zhou, Z. (2012). Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(1), 1–39.↩︎
Liu, F. T., Ting, K. M., & Zhou, Z. (2012). Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(1), 1–39.↩︎
Liu, F. T., Ting, K. M., & Zhou, Z. (2012). Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(1), 1–39.↩︎