Robust Random Cut Forest chart
Overview
During the training phase, the algorithm builds a forest of trees that represents normal behavior. When a new point arrives, it is inserted into the trees, which distorts them. The amount of reorganization needed to stabilize the trees is translated into an anomaly score called codisplacement. Anomalies are identified by applying a threshold to this score.
When should I use this chart?
This chart is less affected by data periodicity than the Triple exponential chart, so it is more versatile. The algorithm used by the Robust Random Cut Forest chart is suitable for data flows that don't necessarily have a constant period.
What data do I need for this chart?
This chart is meant to be used with time series, that is, univariate numerical data flows with a constant time step and an associated timestamp.
The option to create this chart will be disabled unless your query contains a temporal grouping with no fields added as arguments. Furthermore, your query must contain a numeric field (for example, you can aggregate your data and add a count).
For example, this is a correct query that would enable the Robust Random Cut Forest chart option in the search window:
from demo.ecommerce.data
group every 1h
select count() as count
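
The constant time step requirement can also be checked on exported query results outside the platform. The following Python sketch is only an illustration: the `has_constant_step` helper and the sample timestamps are hypothetical and are not part of the product.

```python
from datetime import datetime, timedelta


def has_constant_step(timestamps, step):
    """Return True if every pair of consecutive timestamps is exactly `step` apart."""
    return all(b - a == step for a, b in zip(timestamps, timestamps[1:]))


# Hypothetical hourly buckets, as a "group every 1h" query would produce.
start = datetime(2024, 1, 1)
hourly = [start + timedelta(hours=i) for i in range(5)]

# The same series with the 02:00 bucket missing.
with_gap = hourly[:2] + hourly[3:]

print(has_constant_step(hourly, timedelta(hours=1)))    # True
print(has_constant_step(with_gap, timedelta(hours=1)))  # False
```

A series that fails this check contains missing points, which need special handling before the chart can work properly.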
Creating a Robust Random Cut Forest chart
Working with Robust Random Cut Forest charts
The data analysis performed by this chart can be divided into two phases: training and evaluation.
The points used for training are those under the green band, which you can modify by dragging the band or by using the options at the top of the chart, as explained below. Every point outside the training band is evaluated by the algorithm. After training on part of the data in a specified series, the chart predicts potential anomalies and marks them as red points.
Anomalous points
It is advisable, as far as possible, to avoid including visible anomalous points in the training set.
After selecting the required period, click the Train button to get the results.
If there is no data at the beginning of the selected time period, you will not be able to train on your data. This is because data is interpolated only up to the first gap: a gap stops both the interpolation and the training at that point.
You can configure the following options at the top of the chart window:
Option | Description |
---|---|
Shingle | Number of points in the sliding window, which is used to convert a univariate variable into a multivariate one. The default value is 1. This parameter controls how quickly the algorithm adapts to and learns changes in the incoming data flow. A low shingle (short sliding window) makes the algorithm more flexible: it forgets older data faster and learns changes quickly. A high shingle value adapts to changes more slowly and remembers old behavior longer. You need to find the right balance between flexibility in detecting data changes (anomalies) and the undesired tendency to learn anomalies too quickly and thus fail to detect them in the future. The shingle value is roughly equivalent to the notion of the period and is often chosen as a multiple of the period. |
Number of trees | A higher number of trees leads to better estimation results. However, the improvement decreases as the number of trees increases: at a certain point, the gain in prediction performance from additional trees is lower than the cost in computation time of learning them. The default value is 40. |
Tree size | A higher tree size improves the capacity to learn more about the data. As with the Number of trees parameter, at a certain point, the cost in computation time will increase considerably. The default value is 256. |
Time decay | Expected age at which a random sample point should expire and be replaced by a tree. It is used to calculate the internal decay factor, which is 1/time decay. It must be greater than or equal to the tree size. The default value is 256. |
Initialize points | Number of points used to train the model. This value can also be edited by dragging the light green band. The default value is 100. Keep in mind that you may not get proper results if you enter a low value, since the algorithm will not be able to learn enough. However, selecting the widest possible range is also risky, since it may cause overfitting. |
Threshold | Limit on the algorithm score from which points are considered anomalies. After training the model, you can update the threshold by clicking the Update threshold button next to this option or by dragging the red horizontal line. The default value is 10. The threshold is a constant value that decides how strict you want the system to be with the anomalies it produces. Increasing the threshold makes the chart detect only very clear anomalies. This is useful when only extremely unusual data points matter and ambiguous ones can be discarded. By contrast, if the threshold is set to a low value, every point that differs slightly from normal behavior is flagged as an anomaly. This is often appropriate when tiny variations are of extreme importance (for example, anomalies in a quantity that causes a deadly disease). In this case, even though many false positives will be produced, you would prefer to catch every point that differs in any respect. The right value is different for every series, so a deep understanding of the underlying problem is required before setting it. |
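
The Shingle, Time decay, and Threshold options can be illustrated with a short Python sketch. It is not the chart's implementation: the function names and sample values are hypothetical, the scores are made up rather than computed by a random cut forest, and treating "above the threshold" as strictly greater than is an assumption.

```python
def shingle(values, size):
    """Turn a univariate series into overlapping windows of `size` points."""
    if size < 1:
        raise ValueError("shingle size must be at least 1")
    return [tuple(values[i:i + size]) for i in range(len(values) - size + 1)]


def decay_factor(time_decay):
    """Internal decay factor as described in the table: 1 / time decay."""
    return 1.0 / time_decay


def flag_anomalies(scores, threshold):
    """Indices of points whose score is above the threshold (assumed strict)."""
    return [i for i, score in enumerate(scores) if score > threshold]


series = [3, 4, 3, 5, 40, 4]

# A shingle of 3 converts each group of 3 consecutive values into one 3-dimensional point.
print(shingle(series, 3))
# [(3, 4, 3), (4, 3, 5), (3, 5, 40), (5, 40, 4)]

# With the default time decay of 256, the decay factor is 1/256.
print(decay_factor(256))  # 0.00390625

# Made-up anomaly scores, compared against the default threshold of 10.
scores = [1.2, 0.8, 2.5, 14.0, 9.9]
print(flag_anomalies(scores, 10))  # [3]
```

With a shingle of 1, each window contains a single value, so the series stays effectively univariate. Lowering the threshold in the last call to 2 would also flag index 2, which shows how a lower value reports more points as anomalies.
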
You can zoom in to specific parts of the series by clicking a point in the top chart and dragging to the required ending point. You can also use the sliders in the bottom chart to specify the required part of the series. To go back to the default zoom, return the sliders to the beginning and end of the bottom chart or click the All button in the Zoom area.
Handling missing points
The robust random cut forest chart needs a regular and uninterrupted flow of data points in order to work properly, so missing values need special handling to make the chart work. There are two possible causes for missing points: