
baselineScorer

Score an event based on the past behavior of similar events. If an event is anomalous, a higher score is assigned.

Operator Usage in Easy Mode

  1. Click + on the parent node.
  2. Enter the Baseline Scorer operator in the search field and select the operator from the Results to open the operator form.
  3. In the Input Table drop-down, enter or select the table containing the data to run this operator on.
  4. In the Group By Field, enter the name of the column by which to group the rows.
  5. In the Metric Field, enter the column name that contains the metric to be used for scoring.
  6. In the Baseline Table drop-down, enter or select the name of the baseline table.
  7. Click Run to view the result.
  8. Click Save to add the operator to the playbook.
  9. Click Cancel to discard the operator form.

Usage Details

``` {text}
baselineScorer(eventTable, groupByField, metricField, baselineTable)
```

**Input**  
`eventTable`: name of a step that contains the events to be scored  
`groupByField`: name of a grouping field (or lookup field) in a table  
`metricField`: name of a metric field in a table  
`baselineTable`: name of a step that contains historical events

For example, suppose table `XYZ` contains the fields `time`, `ip`, and `bytes_downloaded`. The baseline scorer operator identifies which IP addresses downloaded more or fewer bytes than they did in the past.

The operator compares the `XYZ` table with `historicalTable`, which contains the historical events. Using a statistical analysis, it determines whether the bytes downloaded by a particular IP address are out of range. If the value is within range, the score is zero. If it is out of range, the score reflects how far the value falls from the range seen in the past, up to a maximum score of 10.

In the example: `XYZ` is an argument for `eventTable`, `ip` for `groupByField`, `bytes_downloaded` for `metricField`, and `historicalTable` for `baselineTable`. 

Example: `baselineScorer(XYZ, "ip", "bytes_downloaded", historicalTable)`
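The scoring behavior described above (zero inside the historical range, a growing score outside it, capped at 10) can be illustrated with a small Python sketch. This is not the operator's actual algorithm, which is not documented here. The function `range_score`, the use of the historical minimum and maximum as the "range", and the normalization by range width are all assumptions made for illustration.

```python
def range_score(value, history, max_score=10.0):
    """Hypothetical scoring rule, NOT the operator's real formula.

    Returns 0.0 when `value` lies inside the historical [min, max] range,
    otherwise a score that grows with the distance from the range and is
    capped at `max_score`.
    """
    low, high = min(history), max(history)
    if low <= value <= high:
        return 0.0
    # Distance from the nearest edge of the historical range.
    distance = low - value if value < low else value - high
    # Assumed normalization: scale by the range width (1.0 if history is flat).
    width = (high - low) or 1.0
    return min(max_score, max_score * distance / width)


# Made-up bytes_downloaded history for a single IP address.
history = [100, 120, 150]
in_range = range_score(130, history)       # inside the range
slightly_out = range_score(175, history)   # 25 above a range of width 50
far_out = range_score(400, history)        # far above the range, capped
print(in_range, slightly_out, far_out)
```

Under these assumptions, a value inside the range scores 0.0, a value half a range-width above the maximum scores 5.0, and a value far outside the range is capped at 10.0.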

**Output**  
The baseline scorer operator returns `eventTable` with its original columns plus two extra columns, `lhub_score` and `lhub_confidence_score`:

`lhub_score`: computed score  
`lhub_confidence_score`: confidence in the score, based on the number of samples. A value of 100 means there are enough samples to calculate the score; a value below 100 means fewer samples were available. The operator scores events regardless of the number of samples, so use `lhub_confidence_score` to judge how much to trust each score.
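To make the idea of a sample-based confidence value concrete, here is a minimal Python sketch. The operator's real confidence formula and sample threshold are not documented here, so both the linear ramp and the `required_samples` parameter are hypothetical and chosen only for illustration.

```python
def confidence_score(sample_count, required_samples):
    """Hypothetical confidence measure, NOT the operator's real formula.

    Ramps linearly from 0 to 100 as the number of historical samples
    approaches `required_samples` (an assumed threshold), then stays at 100.
    """
    if required_samples <= 0:
        raise ValueError("required_samples must be positive")
    if sample_count < 0:
        raise ValueError("sample_count must not be negative")
    return min(100.0, 100.0 * sample_count / required_samples)


# With an assumed threshold of 20 samples:
print(confidence_score(5, 20))   # a quarter of the assumed threshold
print(confidence_score(20, 20))  # threshold reached
print(confidence_score(80, 20))  # more samples do not exceed 100
```

The point of the sketch is only that the score itself is always produced, while the confidence value tells you how much history stood behind it.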

## Example

`tableA` contains a baseline (history) of the number of files downloaded by each user. `tableB` contains today's data. `baselineScorer` compares today's data with the history to determine whether a user downloaded noticeably more or fewer files than usual (an anomaly).

**Input** 

**tableA**:

<style>
  th {
    border: 1px solid #cccccc;
    background-color: #eeeeee;
    padding: 8px 5px 8px 5px;
    text-align: left
  }
</style>

<div><table>
<thead>
<tr>
<th>id</th>
<th>user</th>
<th>download_count</th>
</tr>
</thead>
<tbody>
<tr><td>x1</td><td>emil</td><td>12</td></tr>
<tr><td>x2</td><td>emil</td><td>22</td></tr>
<tr><td>x3</td><td>monica</td><td>32</td></tr>
<tr><td>x4</td><td>monica</td><td>35</td></tr>
</tbody>
</table></div>

**tableB**:


<table>
<thead>
<tr>
<th>id</th>
<th>user</th>
<th>download_count</th>
</tr>
</thead>
<tbody>
<tr><td>v1</td><td>monica</td><td>25</td></tr>
<tr><td>v2</td><td>emil</td><td>15</td></tr>
<tr><td>v3</td><td>emil</td><td>50</td></tr>
</tbody>
</table>


**LQL command**

``` {sql}
baselineScorer(tableB, "user", "download_count", tableA)
```

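**Output**

<table>
<thead>
<tr>
<th>id</th>
<th>user</th>
<th>download_count</th>
<th>lhub_score</th>
<th>lhub_confidence_score</th>
</tr>
</thead>
<tbody>
<tr><td>v1</td><td>monica</td><td>25</td><td>5.0</td><td>100%</td></tr>
<tr><td>v2</td><td>emil</td><td>15</td><td>0.0</td><td>100%</td></tr>
<tr><td>v3</td><td>emil</td><td>50</td><td>10.0</td><td>100%</td></tr>
</tbody>
</table>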

User "emil" downloaded many more files than usual in row v3, which receives the maximum score. "monica" downloaded fewer files than usual, but her activity in v1 was less out of range than that of "emil" in v3.

lhub_confidence_score is 100% in each case, indicating that there are enough samples for high confidence.
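The grouping logic behind this example can be sketched in plain Python. The sketch below only determines whether each row of `tableB` falls below, within, or above the range seen for the same user in `tableA`. It does not reproduce the operator's `lhub_score` values, since the real scoring formula is not documented here. The helper names `group_ranges` and `classify`, the `position` column, and the use of the historical minimum and maximum as the range are assumptions.

```python
from collections import defaultdict

# Data copied from the tableA (history) and tableB (today) examples.
table_a = [
    {"id": "x1", "user": "emil", "download_count": 12},
    {"id": "x2", "user": "emil", "download_count": 22},
    {"id": "x3", "user": "monica", "download_count": 32},
    {"id": "x4", "user": "monica", "download_count": 35},
]
table_b = [
    {"id": "v1", "user": "monica", "download_count": 25},
    {"id": "v2", "user": "emil", "download_count": 15},
    {"id": "v3", "user": "emil", "download_count": 50},
]


def group_ranges(baseline_rows, group_by_field, metric_field):
    """Collect the (min, max) of the metric for each group in the baseline."""
    values = defaultdict(list)
    for row in baseline_rows:
        values[row[group_by_field]].append(row[metric_field])
    return {key: (min(v), max(v)) for key, v in values.items()}


def classify(event_rows, group_by_field, metric_field, ranges):
    """Return each event row with an extra hypothetical 'position' column."""
    result = []
    for row in event_rows:
        bounds = ranges.get(row[group_by_field])
        if bounds is None:
            position = "no baseline"  # group never seen in the history
        elif row[metric_field] < bounds[0]:
            position = "below"
        elif row[metric_field] > bounds[1]:
            position = "above"
        else:
            position = "within"
        result.append({**row, "position": position})
    return result


ranges = group_ranges(table_a, "user", "download_count")
labeled = classify(table_b, "user", "download_count", ranges)
for row in labeled:
    print(row["id"], row["user"], row["download_count"], row["position"])
```

With the example data, v1 (monica) falls below her historical range, v2 (emil) falls within his, and v3 (emil) falls above it, which matches the zero score for v2 and the non-zero scores for v1 and v3 in the output table.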