buildTermCorpus

There are many cases where we have a small set (10 to 2,000) of past text examples and want to check whether new text is close to these saved examples. Machine learning techniques such as classification are not appropriate here because there is not enough data to train an accurate model.

Operator Usage in Easy Mode

  1. Click + on the parent node.
  2. Enter the Build Term Corpus operator in the search field and select the operator from the Results to open the operator form.
  3. In the Table drop-down, enter or select the table from which to create the model.
  4. In the Model Name field, enter a name for the model.
  5. In the Column field, enter the name of the column that contains the text from which to extract TF-IDF features.
  6. In the Columns to Keep drop-down, select the columns to store with the corpus as enrichment, to be retrieved when the corpus is used in matchSimilarFromCorpus.
  7. Optional. Click Add More to add values for minimum and maximum TF parameters.
  8. Click Run to view the result.
  9. Click Save to add the operator to the playbook.
  10. Click Cancel to discard the operator form.

Usage Details

Once the corpus is built, we want to match the saved examples against an incoming stream of events to determine how close new events are to what we have already observed, and to retrieve enrichment about those past examples.
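To make the idea of "closeness" concrete, here is a minimal Python sketch of TF-IDF matching against a small set of saved texts. It is only an illustration under stated assumptions, not the operator's implementation: the tokenizer, the smoothed IDF formula, the use of cosine similarity, the sample texts, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphanumeric tokens; the operator's tokenizer may differ.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_corpus(texts):
    """Return (idf, vectors) for a small list of saved example texts."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    # Document frequency: in how many saved texts each term appears.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF so terms present in every text keep a non-zero weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]
    return idf, vectors


def vectorize(text, idf):
    # Terms never seen in the saved corpus are ignored.
    counts = Counter(tokenize(text))
    return {t: tf * idf[t] for t, tf in counts.items() if t in idf}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical saved examples and one incoming event.
saved = [
    "failed login from unknown host",
    "disk usage above threshold",
    "new repository created by admin",
]
idf, vectors = build_corpus(saved)

incoming = vectorize("failed login from admin host", idf)
scores = [cosine(incoming, v) for v in vectors]
best = max(range(len(saved)), key=lambda i: scores[i])
print(saved[best], round(scores[best], 3))
```

Because the comparison relies only on term weights derived from the saved texts, it needs no labeled training set, which is why this kind of matching suits corpora that are too small for a classifier.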

LQL Command

```
buildTermCorpus(table: TableReference, modelName: String, column: String, columnsToKeep: String[], parameters: Double*)
```

**Input Parameters**  
_table_ (TableReference)    -   The table from which to create the model  
_modelName_ (String)    -   Name of the model  
_column_    (String)    -   Name of the column that contains the text from which to extract TF-IDF features  
_columnsToKeep_ (String\[]) -   List of columns to store with the corpus as enrichment, to be retrieved when the corpus is used in matchSimilarFromCorpus  
_parameters_    (Double)    -   minDF and minTF parameters; the default value for both is 1.0 (include 100% of terms). First parameter will always be set for minTF. For example:  
buildTermCorpus(table, model, corpus, [columnsToKeep], 0.5, 0.4) // minDF = 0.5, minTF = 0.4

**Output**  
If the operator runs successfully, it returns a success message confirming that the model has been stored.

## Example

**Input**  
table = github_logs

<style>
  th {
    border: 1px solid #cccccc;
    background-color: #eeeeee;
    padding: 8px 5px 8px 5px;
    text-align: left
  }
</style>

<div><table>
<thead>
<tr>
<th>corpus</th>
<th>label</th>
<th>domain</th>
</tr>
</thead>
<tbody>
<tr><td>a b c, d e f, g, h, i, j</td><td>x</td><td>google</td></tr>
<tr><td>aa b, c, d ee, ff, gg, hh, i, jj</td><td>y</td><td>facebook</td></tr>
<tr><td>k, l, m, n, o, p, q</td><td>z</td><td>apple</td></tr>
</tbody>
</table></div>

LQL command
``` {sql}
buildTermCorpus(inputTable, "corpusModel", "corpus", ["label", "domain"])
// table = inputTable
// column with the text used to build the model = corpus
// columns to keep, returned when a match is found = label and domain
// minDF and minTF use their default values
```

Output

RESULT
'Successfully created model and stored into <> file'
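
As a rough illustration of what the example stores, the Python sketch below turns the three input rows into corpus entries that carry their kept columns alongside their terms. The entry layout, the comma-and-whitespace tokenizer, and the variable names are assumptions made for illustration; the format of the stored model file is not described here.

```python
import re
from collections import Counter

# Rows copied from the example input table.
rows = [
    {"corpus": "a b c, d e f, g, h, i, j", "label": "x", "domain": "google"},
    {"corpus": "aa b, c, d ee, ff, gg, hh, i, jj", "label": "y", "domain": "facebook"},
    {"corpus": "k, l, m, n, o, p, q", "label": "z", "domain": "apple"},
]
columns_to_keep = ["label", "domain"]


def tokenize(text):
    # Assumption: terms are separated by commas and/or whitespace.
    return [t for t in re.split(r"[,\s]+", text) if t]


# Hypothetical entry layout: term counts plus the enrichment columns.
entries = [
    {
        "terms": Counter(tokenize(row["corpus"])),
        "enrichment": {col: row[col] for col in columns_to_keep},
    }
    for row in rows
]

# Document frequency: in how many rows each term appears.
document_frequency = Counter(term for e in entries for term in e["terms"])
shared_terms = sorted(t for t, n in document_frequency.items() if n > 1)

print(shared_terms)
print(entries[0]["enrichment"])
```

Keeping `label` and `domain` next to each entry is what would let a later match return that enrichment together with the similarity result.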