Robust Random Cut Forest chart

Overview

During the training phase, a forest of trees that represents normal behavior is created. When a new point arrives, it is inserted into the trees, which distorts them. The amount of reorganization needed to stabilize each tree is translated into an anomaly score called codisplacement. Anomalies are identified by applying a threshold to this score.

When should I use this chart?

This chart is less affected by data periodicity than the Triple exponential chart, so it is more versatile. The algorithm used by the Robust Random Cut Forest chart is suitable for data flows that don't necessarily have a constant period.

What data do I need for this chart?

This chart is meant to be used with time series, that is, univariate numerical data flows that have a constant time step and an associated timestamp.

The option to create this chart will be disabled unless your query contains a temporal grouping with no columns added as arguments. Furthermore, your query must contain a numeric column (for example, you can aggregate your data and add a count).

For example, this is a correct query that would enable the Robust Random Cut Forest chart option in the search window:

from demo.ecommerce.data group every 1h select count() as count
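
If you export the query results and want to confirm that the series has the constant time step described above, a check along these lines may help. This is only a minimal sketch: the function name and the sample timestamps are hypothetical, and it assumes that timestamps are given as epoch seconds.

```python
from typing import Sequence


def has_constant_step(timestamps: Sequence[int]) -> bool:
    """Return True if consecutive timestamps are always the same distance apart."""
    # Collect the distinct gaps between neighbouring timestamps.
    steps = {later - earlier for earlier, later in zip(timestamps, timestamps[1:])}
    # Zero or one distinct gap means the step is constant.
    return len(steps) <= 1


# Hypothetical epoch timestamps (seconds), one point per hour.
hourly = [0, 3600, 7200, 10800]
# The same series with the 7200 point missing.
with_hole = [0, 3600, 10800]

print(has_constant_step(hourly))     # True
print(has_constant_step(with_hole))  # False
```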

Creating a robust random cut forest chart

Working with Robust Random Cut Forest charts

The data analysis performed by this chart can be divided into two phases: training and evaluation.

The points used for training are those under the green band, which can be modified by dragging the band or by using the options at the top of the chart, as explained below. Every point outside the training band is evaluated by the algorithm. After training on that part of the series, the chart predicts potential anomalies in the rest of the data and marks them as red points.

It is advisable, as far as possible, to avoid including visible anomalous points in the training set.

After selecting the required period, click the Train button to get the results.

You can configure the following options at the top of the chart window:

Shingle

Corresponds to the number of points in the sliding window. The sliding window is used to convert a univariate variable to a multivariate one. The default value is 1.

This parameter controls how fast the algorithm adapts to and learns changes in the incoming data flow. A low shingle (short sliding window) means that the algorithm is more flexible to changes, forgets older data faster, and tends to learn changes quickly. A high shingle value, on the other hand, adapts more slowly to changes and remembers old behavior longer. You need to find the right balance between flexibility in detecting data changes (anomalies) and the undesired tendency to learn anomalies too quickly and thus fail to detect them as such in the future.

The shingle value is somewhat equivalent to the notion of the period, and it is often chosen as a multiple of the period.
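
To illustrate how a sliding window converts a univariate series into a multivariate one, here is a minimal sketch. The `shingle` function and the sample values are hypothetical and are not taken from the chart's implementation; they only show the general idea.

```python
from typing import List, Sequence


def shingle(series: Sequence[float], size: int) -> List[List[float]]:
    """Turn a univariate series into overlapping windows of `size` points."""
    if size < 1:
        raise ValueError("size must be at least 1")
    # Each window starts one step after the previous one.
    return [list(series[i:i + size]) for i in range(len(series) - size + 1)]


values = [10, 12, 11, 13, 40]

print(shingle(values, 1))  # [[10], [12], [11], [13], [40]]
print(shingle(values, 3))  # [[10, 12, 11], [12, 11, 13], [11, 13, 40]]
```

With a shingle of 1, each point is evaluated on its own. With a shingle of 3, each point is evaluated together with the two points that precede it, so the unusual value 40 is seen in the context of its neighbors.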

Number of trees

A higher number of trees leads to better estimation results. However, the improvement decreases as the number of trees increases. In other words, beyond a certain point, the benefit in prediction performance from learning more trees is lower than the cost in computation time of learning them. The default value is 40.

Tree size

A higher tree size improves the capacity to learn more about the data. As with the Number of trees parameter, at a certain point, the cost in computation time will increase considerably. The default value is 256.

Time decay

Expected age at which a random sample point should expire and be replaced in a tree. It is used to calculate the internal decay factor, which is 1 / time decay. It must be greater than or equal to the tree size. The default value is 256.
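
As a worked example of the relationship between time decay, tree size, and the decay factor, consider the following sketch. The function name is hypothetical, and the check simply mirrors the constraint stated above; it is not the chart's own code.

```python
def decay_factor(time_decay: int = 256, tree_size: int = 256) -> float:
    """Return 1 / time_decay after checking the constraint against tree_size."""
    if time_decay < tree_size:
        raise ValueError("time decay must be greater than or equal to the tree size")
    return 1.0 / time_decay


print(decay_factor())      # 0.00390625 (default values)
print(decay_factor(1024))  # 0.0009765625
```

A larger time decay gives a smaller decay factor.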

Initialize points

Number of points used to train the model. This value can also be edited by dragging the light green band. The default value is 100.

Keep in mind that you may not get proper results if you enter a low value in this parameter since the algorithm would not be able to learn enough. However, indicating the widest possible range may also be dangerous, since it may cause overfitting.

Threshold

Limit on the algorithm score from which the points are considered anomalies. After training the model, you can update the threshold by clicking the Update threshold button next to this option or by dragging the red horizontal line. The default value is 10.

The threshold is a constant value that determines how strict you want the system to be with the anomalies it produces. Increasing the threshold makes the chart detect only very clear anomalies. This is useful in situations where only extremely unusual data points matter and ambiguous ones can be discarded. By contrast, if the threshold is set to a low value, every point that differs slightly from normal behavior is flagged as an anomaly. This is appropriate when tiny variations are extremely important (for example, anomalies in a quantity that causes a deadly disease). In this case, even though many false positives will be produced, you would prefer to catch every point that is different in any respect. The right value is different for every series, so a deep understanding of the underlying problem is required before setting it.
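
The effect of raising or lowering the threshold can be sketched as follows. The scores are made-up values, the function name is hypothetical, and the sketch assumes that a point counts as an anomaly when its score is greater than or equal to the threshold.

```python
from typing import List, Sequence


def flag_anomalies(scores: Sequence[float], threshold: float = 10.0) -> List[int]:
    """Return the positions of the points whose score reaches the threshold."""
    # Assumption: a score equal to the threshold already counts as an anomaly.
    return [i for i, score in enumerate(scores) if score >= threshold]


# Hypothetical anomaly scores for six evaluated points.
scores = [2.1, 3.4, 11.8, 4.0, 25.3, 9.7]

print(flag_anomalies(scores))        # [2, 4] with the default threshold of 10
print(flag_anomalies(scores, 20.0))  # [4], only the clearest anomaly remains
print(flag_anomalies(scores, 3.0))   # [1, 2, 3, 4, 5], almost everything is flagged
```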

You can zoom in to specific parts of the series by clicking a point in the top chart and dragging to the required ending point. You can also use the sliders in the bottom chart to specify the required part of the series. To go back to the default zoom, return the sliders to the beginning and end of the bottom chart or click the All button in the Zoom area.

Handling missing points

The robust random cut forest chart needs a regular and uninterrupted flow of data points in order to work properly, so missing values need special handling to make the chart work. There are two possible causes for missing points:

There are events that don't exist

A model cannot be trained on a data series that contains holes caused by non-existing events, so the chart tries to interpolate the missing points. Each interpolated value is based on the average of the n previous points, which allows the chart to work in real time. When interpolation occurs, gaps are filled with purple dots to indicate that you are viewing generated values.

The maximum number of consecutive missing points to be interpolated is 5. If this value is exceeded, you will not be able to train the model. An error will appear when clicking the Train button and holes will be marked in the chart with pinkish bands.
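
The interpolation rule described above can be sketched roughly as follows. This is an illustration under several assumptions, not the chart's actual implementation: the function name is hypothetical, the value n = 3 is chosen arbitrarily, missing points are represented as `None`, and previously interpolated values are assumed to count toward later averages.

```python
from typing import List, Optional, Sequence


def fill_gaps(values: Sequence[Optional[float]], n: int = 3, max_gap: int = 5) -> List[float]:
    """Replace each missing point with the average of the n previous points."""
    filled: List[float] = []
    gap = 0  # length of the current run of missing points
    for value in values:
        if value is None:
            gap += 1
            if gap > max_gap:
                raise ValueError("too many consecutive missing points to interpolate")
            if not filled:
                raise ValueError("no previous points available to interpolate from")
            # Average the last n points (fewer if the series has just started).
            window = filled[-n:]
            value = sum(window) / len(window)
        else:
            gap = 0
        filled.append(value)
    return filled


print(fill_gaps([1.0, 2.0, 3.0, None, 5.0]))   # [1.0, 2.0, 3.0, 2.0, 5.0]
print(fill_gaps([2.0, 4.0, None, None], n=2))  # [2.0, 4.0, 3.0, 3.5]
```

A run of six or more missing points raises an error in this sketch, which mirrors the limit of 5 consecutive interpolated points.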

It is advisable to keep the lowest possible rate of points to be interpolated in order to get the most out of the chart. Keep in mind that interpolated points are simply a prediction generated by the chart, and are not real. 

This problem may be solved by regrouping your query using a greater grouping period.

The data is not fully downloaded

Interpolation only works with events that don't exist. The chart will never interpolate values from data yet to be loaded. These areas are represented on the chart as gray bands. In this case, the chart will only evaluate up to the first gap.

There must be enough events to train the model and evaluate at least one point before the gap starts. Otherwise, you will be notified. Click the Download more button in the warning message to download the data required for the widget to train, or do it manually by activating the Load all events option in the Event loading indicator of the search window.

If you wish to fill certain areas where gaps are located, you can do so by clicking on the event timeline at the top of the search window. Learn more about loading data in the search window here.