Anomaly Detection Part 2: Isolation Forest
What is Isolation Forest?
Isolation Forest is an unsupervised anomaly detection algorithm that isolates observations by randomly selecting a feature and then randomly selecting a split value for that feature. It works on the principle that anomalies are "few and different," meaning that they are easier to isolate than normal points.
Anomalies (outliers) are more easily separated because they exist in low-density regions, and thus are isolated quickly with fewer splits. Normal points require more splits to be isolated because they lie in dense regions. Fig. 2 depicts this schematically. An Isolation Forest consists of several Isolation Trees.
The Algorithm
Given a sample of data points X, the Isolation Forest algorithm builds an Isolation Tree (iTree), T, using the following steps:
1. Randomly select an attribute q and a split value p between the minimum and maximum values of q in X.
2. Divide X into two subsets using the rule q < p: points satisfying the rule form the left subtree of T, and the remaining points form the right subtree.
3. Repeat steps 1–2 recursively until either the current node contains only one sample or all samples at the current node have identical values.
The algorithm then repeats steps 1–3 multiple times to create several Isolation Trees, producing an Isolation Forest. Based on how Isolation Trees are produced and the properties of anomalous points, we can say that most anomalous points will be located closer to the root of the tree since they are easier to isolate when compared to normal points.
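The recursion above is compact enough to sketch directly. Below is a minimal, illustrative Python implementation of a single iTree; the function names (build_itree, path_length) and the max_depth height limit are my own choices for the sketch, and a real forest would build many such trees on random sub-samples of the data.

```python
import random

def build_itree(X, depth=0, max_depth=10):
    """Recursively build an isolation tree over a list of
    equal-length feature vectors X (a minimal sketch)."""
    # Stop when the node isolates a single point, all points are
    # identical, or the height limit is reached.
    if len(X) <= 1 or depth >= max_depth or all(x == X[0] for x in X):
        return {"size": len(X), "depth": depth}

    q = random.randrange(len(X[0]))              # step 1: random attribute
    lo, hi = min(x[q] for x in X), max(x[q] for x in X)
    if lo == hi:                                 # attribute constant here;
        return {"size": len(X), "depth": depth}  # simplification: stop
    p = random.uniform(lo, hi)                   # step 1: random split value

    return {                                     # step 2: divide X by q < p
        "attr": q,
        "split": p,
        "left": build_itree([x for x in X if x[q] < p], depth + 1, max_depth),
        "right": build_itree([x for x in X if x[q] >= p], depth + 1, max_depth),
    }

def path_length(tree, x):
    """Number of edges x traverses from the root until it reaches
    an external node, i.e. h(x)."""
    if "attr" not in tree:
        return tree["depth"]
    branch = "left" if x[tree["attr"]] < tree["split"] else "right"
    return path_length(tree[branch], x)
```

Averaging path_length over many such trees, and normalizing by the factor c(n) derived later in this article, yields the anomaly score: anomalies should show short average path lengths.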
Background on Trees
We will look at some basic concepts regarding trees, with a focus on binary trees. Refer to Fig. 3 for a schematic.
Key Terminology for a Tree
Path Length: The path length h(x) of a point x is the number of edges x traverses from the root node until the traversal terminates at an external node.
Height: The height of a node ‘u’ is the length of the longest path from ‘u’ to a leaf node.
Depth: The depth of a node ‘u’ is the length of the path from the root to ‘u’.
Level: The set of all nodes at the same depth.
Parent Node: The immediate predecessor of a node is called its parent.
Child Node: The immediate successor of a node is called its child.
Root Node: The topmost node, which has no parent, is called the root node.
Leaf or External Node: A node which has no child nodes.
Sibling: Children of the same parent.
Internal Node: A node with at least one child node.
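These definitions map directly onto a small data structure. The following sketch (the Node class and helper names are mine, for illustration) computes the height and depth of binary tree nodes, and is reused in later snippets:

```python
class Node:
    """A binary tree node; left/right set to None mark missing children."""
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def height(node):
    """Length of the longest path from node down to a leaf."""
    if node.left is None and node.right is None:
        return 0
    children = [c for c in (node.left, node.right) if c is not None]
    return 1 + max(height(c) for c in children)

def depth(root, target, d=0):
    """Length of the path from root to target, or None if absent."""
    if root is None:
        return None
    if root is target:
        return d
    found = depth(root.left, target, d + 1)
    return found if found is not None else depth(root.right, target, d + 1)

# A tiny example: root at depth 0, its grandchildren at depth 2.
root = Node("a", Node("b", Node("d"), Node("e")), Node("c"))
assert height(root) == 2 and depth(root, root.left.right) == 2
```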
Binary Tree Concepts
Ordered Trees: An ordered tree is a tree for which the order of the children of each node is considered important.
Binary Trees: A binary tree is an ordered tree in which each node has at most 2 children, called the left and right children.
Fig. 4 below depicts edge cases in Binary Trees, and Fig. 5 shows a small binary tree vis-à-vis the concept of an extended binary tree.
Full and Complete Binary Trees:
A Binary Tree is called full if every node has either 0 or 2 children. If the first ‘d − 1’ levels of a Binary Tree of height ‘d’ are filled and level ‘d’ is partially filled from left to right, then the tree is called complete. Fig. 6 depicts a Perfect Binary Tree.
Perfect Binary Trees:
If all levels of a Binary Tree of height ‘d’ are completely filled, the tree is called perfect. Refer to Fig. 7 for a schematic example.
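Using the Node and height helpers sketched earlier, these properties can be checked mechanically (an illustrative sketch; testing perfection via the node count is my own formulation):

```python
def is_full(node):
    """Full: every node has either 0 or 2 children."""
    if node is None:
        return True
    if (node.left is None) != (node.right is None):
        return False
    return is_full(node.left) and is_full(node.right)

def count_nodes(node):
    if node is None:
        return 0
    return 1 + count_nodes(node.left) + count_nodes(node.right)

def is_perfect(root):
    """Perfect: all levels filled, i.e. 2**(h + 1) - 1 nodes at height h."""
    return count_nodes(root) == 2 ** (height(root) + 1) - 1
```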
Key Theorem around Binary Trees
Full Binary Tree Theorem
In a non-empty full binary tree, the number of internal nodes is 1 less than the number of leaves. Refer to Fig. 8.
Proof: By induction.
Let L(n) denote the number of leaves in a full binary tree with n internal nodes; we claim L(n) = n + 1.
Base case: L(0) = 0 + 1 = 1, since a tree with no internal nodes consists of a single root node, which is a leaf.
Induction step: Assume L(i) = i + 1 for all i < n. Given a tree T with n internal nodes, remove two sibling leaves, so that their parent, previously an internal node, becomes a leaf. The resulting tree T1 has n − 1 internal nodes, and by the induction hypothesis L(n − 1) = (n − 1) + 1 = n leaf nodes.
Now restore the removed leaves to T1. One leaf node becomes an internal node again and two new leaf nodes are added.
Thus L(n) = n + 2 − 1 = n + 1 leaf nodes, completing the induction.
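The theorem is also easy to check empirically. The sketch below grows random full binary trees by repeatedly turning a leaf into an internal node with two leaf children (names and construction are mine, for illustration):

```python
import random

def random_full_tree(n_internal):
    """Grow a random full binary tree with n_internal internal nodes."""
    root = Node("leaf")
    leaves = [root]
    for _ in range(n_internal):
        node = leaves.pop(random.randrange(len(leaves)))
        node.left, node.right = Node("leaf"), Node("leaf")
        leaves += [node.left, node.right]
    return root

def count(node):
    """Return (n_internal, n_leaves) for a full binary tree."""
    if node.left is None and node.right is None:
        return (0, 1)
    li, ll = count(node.left)
    ri, rl = count(node.right)
    return (li + ri + 1, ll + rl)

for n in range(8):
    internal, leaves = count(random_full_tree(n))
    assert leaves == internal + 1   # Full Binary Tree Theorem
```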
Average Height of a Binary Tree
Consider ‘n’ points from which a full Binary Tree may be constructed (refer to Fig. 9), and let ‘h’ be the height of the tree (i.e., the distance between the root and a leaf node).
Now consider a GP series counting the number of nodes level by level:
· Number of terms = h + 1 (levels 0 through h)
· First term ‘a’ = 1 (the root)
· Common ratio ‘r’ = 2 (each level doubles the node count)
The sum of this GP series gives the total number of nodes:
n = a(r^(h+1) − 1)/(r − 1) = 2^(h+1) − 1
Solving for the height, h = log2(n + 1) − 1; that is, the height of a full binary tree grows only logarithmically with the number of points, which is why the path length of a typical (normal) point is on the order of log(n).
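A quick numeric sanity check of this relation (an illustrative snippet):

```python
import math

# A perfect binary tree of height h has n = 2**(h + 1) - 1 nodes,
# so h = log2(n + 1) - 1 recovers the height exactly.
for h in range(10):
    n = 2 ** (h + 1) - 1
    assert math.log2(n + 1) - 1 == h
```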
Internal and External Path Length of a Binary Tree
The sum of the depths of the internal nodes is the internal path length I; likewise, the sum of the depths of the external nodes is the external path length E. For the tree shown in Fig. 10, we have two internal nodes at depth 1 (the root, at depth 0, contributes nothing), so the internal path length is I = 2. For the same tree, we have 4 external nodes, each at depth 2, so the external path length is E = 8. In general, for a binary tree with n internal nodes, E = I + 2n; here, 8 = 2 + 2 × 3.
Proof by Induction:
Base case: a tree with 0 internal nodes is a single external node, for which I = E = 0, and the relation E = I + 2n holds trivially.
Induction step: Consider a tree with n internal nodes and n + 1 external nodes for which E = I + 2n holds. Add two child nodes to one of the external nodes, at depth ‘d’, so that the resulting tree has n + 1 internal nodes and n + 2 external nodes (Fig. 11). The internal path length grows by d, since the converted node is now internal, while the external path length loses d and gains 2(d + 1), a net increase of d + 2. Thus the new external path length is E + d + 2 = (I + d) + 2(n + 1), which is again the internal path length plus twice the number of internal nodes.
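The relation E = I + 2n can also be verified directly on the Fig. 10 example, reusing the Node class from above (path_lengths is my own helper name):

```python
def path_lengths(node, depth=0):
    """Return (internal path length, external path length, n_internal);
    leaves play the role of external nodes here."""
    if node.left is None and node.right is None:
        return (0, depth, 0)
    li, le, ln = path_lengths(node.left, depth + 1)
    ri, re, rn = path_lengths(node.right, depth + 1)
    return (li + ri + depth, le + re, ln + rn + 1)

# The tree of Fig. 10: a root whose two children each have two leaves.
root = Node("r", Node("a", Node("x"), Node("y")),
                 Node("b", Node("u"), Node("v")))
I, E, n = path_lengths(root)
assert (I, E, n) == (2, 8, 3)
assert E == I + 2 * n   # matches the relation proved above
```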
Unsuccessful Search in a Binary Tree
An unsuccessful search traverses a path starting at the root node and ending at an external node. Refer to Fig. 12.
Average Length of Unsuccessful search in a Binary Tree
Consider the Binary Tree shown in Fig. 13. We have n = 6 internal nodes and n + 1 = 7 external or dummy nodes. Since every unsuccessful search terminates at one of the n + 1 external nodes, the average length of an unsuccessful search is the external path length divided by the number of external nodes: E/(n + 1) = (I + 2n)/(n + 1), where I is the internal path length of the tree.
Deriving Anomaly Score
Explanation of the formula for the normalizing factor c(n)
Since iTrees have a structure equivalent to binary search trees, the average path length of an unsuccessful search, derived above, estimates the average h(x) for a sample of n points [1]:
c(n) = 2H(n − 1) − 2(n − 1)/n
where H(i) is the harmonic number, which can be estimated as ln(i) + 0.5772156649 (Euler's constant). The anomaly score of an instance x is then defined as
s(x, n) = 2^(−E(h(x))/c(n))
where E(h(x)) is the average of h(x) over a collection of Isolation Trees. From this definition:
· when E(h(x)) → c(n), s → 0.5;
· when E(h(x)) → 0, s → 1;
· and when E(h(x)) → n − 1, s → 0.
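These formulas transcribe directly into Python (a sketch; the function names are mine):

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search over n points,
    c(n) = 2H(n - 1) - 2(n - 1)/n, with H(i) ~ ln(i) + Euler's constant."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """s(x, n) = 2 ** (-E(h(x)) / c(n))."""
    return 2.0 ** (-avg_path_length / c(n))

# The three limiting cases listed above:
n = 256
print(anomaly_score(c(n), n))    # E(h(x)) -> c(n)  gives s = 0.5
print(anomaly_score(0.01, n))    # E(h(x)) -> 0     gives s -> 1
print(anomaly_score(n - 1, n))   # E(h(x)) -> n - 1 gives s -> 0
```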
Using the anomaly score ‘s’, we are able to make the following assessment:
• If instances return s very close to 1, then they are definitely anomalies,
• If instances have s much smaller than 0.5, then they are quite safe to be regarded as normal instances, and
• If all the instances return s ≈ 0.5, then the entire sample does not really have any distinct anomaly.
Code Examples
In my next article, I will do a thorough code-based analysis. Readers are encouraged to look up references [3], [4] for code examples.
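In the meantime, here is a minimal sketch using scikit-learn's IsolationForest (assuming scikit-learn and NumPy are available; the toy data is illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 200 normal points around the origin plus 5 far-away anomalies.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=6.0, high=8.0, size=(5, 2))
X = np.vstack([normal, outliers])

clf = IsolationForest(n_estimators=100, random_state=42).fit(X)
labels = clf.predict(X)        # +1 = normal, -1 = anomaly
scores = clf.score_samples(X)  # lower values = more anomalous
print(labels[-5:], scores[-5:])  # the injected outliers
```

Note that scikit-learn's score_samples returns a negated anomaly measure, so smaller values indicate stronger anomalies.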
REFERENCES
[1] F. T. Liu, K. M. Ting, and Z.-H. Zhou, "Isolation Forest," 2008 Eighth IEEE International Conference on Data Mining (2008).