In the world of data, and especially in environments where we want to extract value from it, the term “anomaly” comes up constantly. An anomaly is nothing more than a record within our dataset that does not follow the expected behaviour.
Anomaly detection has a lot of use cases, from cyber security (intrusion) to medicine (image processing), industrial maintenance, fraud detection and also the Internet of Things (monitoring). A good anomaly detection system can reduce business costs in many industries, but we often find ourselves facing questions such as:
What algorithm do I choose to detect anomalies? How do I make my model available to the other business units?
This article will discuss one of the most powerful solutions of recent years in terms of anomaly detection accuracy. Not only powerful, but also efficient and easy to implement, Isolation Forest has become one of the go-to algorithms for detecting anomalies within the Data Science community.
Will this algorithm be able to do the job of many people monitoring a process and/or data flow in order to find strange or atypical records? The answer is definitely yes! The truth is that, as long as the data meets a minimum quality bar, we are dealing with a classic case of robot-replacing-human. Let us begin.
A Forest of Solitudes
To understand how Isolation Forest works, we have to see how a decision tree concludes that a point is anomalous. The steps that a tree performs are:
- Taking the record to be evaluated, together with its variables;
- Choosing a variable at random, and a random split value between that variable's minimum and maximum;
- Creating a node or branch: depending on whether the record's value is greater or less than the split value, we repeat the exercise on a narrower interval, the split value becoming the new maximum or minimum of the branch created;
- Executing the third step until further branching is not possible and the point being evaluated is isolated.
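The procedure above can be sketched in a few lines of Python. This is a toy, single-variable version of one isolation tree; the function name and the data are illustrative, not from the project:

```python
import random

def isolation_depth(point, sample, low=None, high=None, depth=0, rng=random):
    """Toy, single-variable version of one isolation tree: counts how many
    random splits are needed before `point` is alone in its interval."""
    if low is None:
        low, high = min(sample), max(sample)
    others = [x for x in sample if low <= x <= high and x != point]
    if not others or low == high:
        return depth                      # the point is isolated
    split = rng.uniform(low, high)        # random cut inside the interval
    if point < split:                     # keep the side containing the point
        high = split
    else:
        low = split
    return isolation_depth(point, sample, low, high, depth + 1, rng)

data = [10, 11, 10.5, 9.8, 10.2, 50]      # 50 is the odd one out
print(isolation_depth(50, data, rng=random.Random(1)))    # depth for the anomaly
print(isolation_depth(10.2, data, rng=random.Random(1)))  # depth for an inlier (larger on average)
```

Averaged over many random runs, the depth for 50 is consistently smaller than for the points clustered around 10: the outlier is isolated almost immediately.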
Thus, the fewer branches the tree needs to isolate the point, the more anomalous it is. And if we build, say, 100 trees and average the number of branches needed for each value, we get a fairly robust approximation of the degree of anomaly of each observation or record in our dataset. Without going into mathematical formulations, that is how Isolation Forest works. A big advantage of this algorithm over other methods is that it does not rely on distance, similarity or density measures over the dataset, which are usually very expensive computationally. The complexity of Isolation Forest grows linearly thanks to sub-sampling: each tree is built on a random subset of the data. It is therefore able to scale to large datasets, even with many irrelevant variables.
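This averaging over many trees is exactly what off-the-shelf implementations do. A minimal sketch with scikit-learn's IsolationForest (used here as a stand-in; the data and figures are illustrative) shows how the resulting score reflects how easy a point is to isolate:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # well-behaved observations
X_out = np.array([[6.0, 6.0]])                     # an obvious anomaly

# 100 trees, each grown on a random sub-sample of 256 records
forest = IsolationForest(n_estimators=100, max_samples=256, random_state=42)
forest.fit(X)

# score_samples: the lower the score, the easier the point was to isolate
print(forest.score_samples(X_out))  # markedly lower than for typical points
print(forest.score_samples(X[:1]))
```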
Furthermore, we can use it in supervised as well as unsupervised mode. In the latter case, we would have to agree with the Product Owner or a business manager on the percentage of observations in our data that should be considered anomalies.
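That agreed percentage maps directly onto the contamination parameter exposed by libraries such as scikit-learn and PyOD. A small sketch, where the 2% figure is an assumed example rather than anything from the project:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# 490 normal observations plus 10 injected anomalies (~2% of the data)
X = np.vstack([rng.normal(0, 1, size=(490, 2)),
               rng.uniform(4, 6, size=(10, 2))])

# contamination encodes the share of anomalies agreed with the business
forest = IsolationForest(contamination=0.02, random_state=0)
labels = forest.fit_predict(X)    # -1 = anomaly, 1 = normal
print(int((labels == -1).sum()))  # roughly 2% of the 500 points are flagged
```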
Putting the Forest to Work
The most severe headaches of the data scientist often arise not from extensive formulas or from selecting the best hyperparameters, but from the way a model is put into production. Let's get down to business.
With the right combination of Python frameworks like Flask and the fantastic Docker, we can work wonders. In this case we create an API, a simple application, that we can ask whether the data we're sending is anomalous or not. To do this, we will use a famous cardiology dataset from this repository and the great PyOD framework.
Click here to see the project we created. First of all, we must have two scripts inside our program: one that takes care of training and another that takes care of prediction and the API service. We will use Flask as our server library, which allows us to create simple applications that expose endpoints to the world. Here we will run the application on our local server, but, as mentioned, we can use Docker and a simple deployment on one of the public clouds (Amazon Web Services, Microsoft Azure or Google Cloud) to really open our application to the corresponding business unit, or simply to the rest of the world.
Furthermore, we are creating a folder structure that allows us to store data. Again, this serves as a template structure, but ideally the data should be hosted in an appropriate database; we would then simply create a configuration file with the connection strings. The same goes for the model we train: it should be stored as a binary in dedicated storage, but we are hosting it inside the folder structure to show the full functionality of Flask. Without further ado, the training script looks like this.
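Since the script itself lives in the linked project, here is a hedged sketch of what such a training step can look like. The file paths, the params.json keys, and the use of scikit-learn's IsolationForest in place of PyOD's IForest are all assumptions:

```python
# train.py -- hedged sketch of the training step (paths and JSON keys are illustrative)
import json
import joblib
import numpy as np
from sklearn.ensemble import IsolationForest  # stand-in for pyod.models.iforest.IForest

def train(config_path="config/params.json", model_path="models/iforest.joblib"):
    # hyperparameters (and, in a real setup, connection strings) live in a JSON file
    with open(config_path) as f:
        params = json.load(f)
    # here the data would come from the data folder or a database
    X = np.load(params["train_data"])
    model = IsolationForest(
        n_estimators=params.get("n_estimators", 100),
        contamination=params.get("contamination", 0.1),
        random_state=params.get("random_state", 42),
    )
    model.fit(X)
    # persist the trained model; the API loads this binary later
    joblib.dump(model, model_path)
    return model_path
```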
We see that we have a single place from which we take the hyperparameters. This JSON can be extended to hold the connection strings to our databases, the paths where our model is hosted, and further training execution settings. We train the model with the training data and simply save it. It is this model that the application later loads into memory in order to make the inference. It can be seen here.
The ScoringService class is in charge of loading the model and making the inference. The other functions define the endpoints we have created. The main function creates the Home of the application, so if we run the program, the application will already be up! Going to http://0.0.0.0:5000/ takes us to the Home of the application. If we keep looking at the script, we'll see that the api_predict function is just a POST method: it receives data, invokes the ScoringService class to perform the inference, and returns the appropriate prediction for the data received. Simple, isn't it?
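As the real script lives in the linked project, here is a hedged sketch of what such a prediction service can look like. The class name comes from the description above; the routes, model path and payload shape are assumptions:

```python
# app.py -- hedged sketch of the prediction/API script (names are illustrative)
import joblib
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

class ScoringService:
    """Loads the trained model once and keeps it in memory for inference."""
    model = None

    @classmethod
    def get_model(cls, path="models/iforest.joblib"):
        if cls.model is None:
            cls.model = joblib.load(path)   # assumed location of the trained binary
        return cls.model

@app.route("/")
def home():
    # the Home of the application: a quick way to see the service is up
    return "Anomaly detection API up and running"

@app.route("/predict", methods=["POST"])
def api_predict():
    # receives data, invokes ScoringService and returns the prediction
    payload = request.get_json(force=True)
    X = np.array(payload["data"], dtype=float)
    preds = ScoringService.get_model().predict(X)  # -1 = anomaly, 1 = normal
    return jsonify({"predictions": preds.tolist()})

# to serve it locally: app.run(host="0.0.0.0", port=5000)
```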
Once we see that our application is running, we can test it by running the following test script (which tests the API, not the model!).
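A hedged sketch of such a test script, where the endpoint URL and the payload shape are assumptions:

```python
# test_api.py -- hedged sketch: checks that the API answers, not that the model is good
import json
from urllib import request as urlrequest

def build_payload(records):
    """Serialise a batch of records into the JSON body the API expects (assumed shape)."""
    return json.dumps({"data": records}).encode("utf-8")

def query_api(records, url="http://0.0.0.0:5000/predict"):
    req = urlrequest.Request(
        url,
        data=build_payload(records),
        headers={"Content-Type": "application/json"},
    )
    with urlrequest.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires the API to be running):
#   print(query_api([[0.1, 0.2], [9.9, 9.9]]))
```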
The Forest in the Cloud
Practically all public clouds provide similar auto-scaling services for enormous data volumes, and there are many example implementations of similar algorithms. In AWS, for example, the managed SageMaker machine-learning service offers a variant of the Isolation Forest. Azure also has an interesting API for detecting anomalies in time series. The possibilities are endless! The move to the cloud is a real paradigm shift: we now have all the computing power we want, or certainly the minimum required, to build serverless applications that provide access to our API at virtually no cost.
Image: Unsplash | Rishi Deep