How Devo indexes data

One of the biggest challenges facing big data solutions isn't how to collect and store large amounts of data, but rather how to quickly find the "needles in the haystack". To accelerate searches for very specific information, the Devo Platform uses a unique two-level system for indexing data. 

Tag index 

This is the primary index used to locate the requested data across multiple data nodes. This index is used for every query and every time you call upon a table in the finder.

The primary function of the tag index is to identify the files that contain the events you are requesting.

Every query calls a data table, and every table is associated internally with one or, sometimes, several tags. Each event is saved in a file whose path identifies the complete event tag, the domain, and the date. The tag index can thus quickly isolate the files containing the requested data.
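As a rough illustration of the idea, the sketch below models a tag index as a lookup keyed by domain, tag, and date. It is not Devo's implementation: the table name, tags, domain, path layout, and the locate_files helper are all hypothetical.

```python
from datetime import date

# Hypothetical tag index: (domain, tag, day) -> file path. All values are invented.
TAG_INDEX = {
    ("acme", "firewall.vendor.traffic", date(2024, 5, 1)): "acme/firewall.vendor.traffic/2024-05-01",
    ("acme", "firewall.vendor.traffic", date(2024, 5, 2)): "acme/firewall.vendor.traffic/2024-05-02",
    ("acme", "web.apache.access", date(2024, 5, 1)): "acme/web.apache.access/2024-05-01",
}

# Hypothetical mapping from a table to the tag (or tags) behind it.
TABLE_TAGS = {"firewall.traffic": ["firewall.vendor.traffic"]}


def locate_files(domain, table, start, end):
    """Return the files that can hold events for a table within a date range."""
    tags = set(TABLE_TAGS.get(table, []))
    return sorted(
        path
        for (dom, tag, day), path in TAG_INDEX.items()
        if dom == domain and tag in tags and start <= day <= end
    )
```

Calling locate_files("acme", "firewall.traffic", date(2024, 5, 2), date(2024, 5, 2)) narrows the search to a single file before any event content is read.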

This index sometimes works in combination with the token index.

Token index

This second index contains all tokens identified in the data saved across all data nodes. An internal indexing task runs regularly to scan the recently ingested events to identify all tokens and add them to this index. This is an inverted index, which means that every token is mapped to the individual events in which the token has been found.

The primary function of the token index is to identify the events that contain the data you seek within the files already identified by the tag index.
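A minimal sketch of an inverted index follows, assuming a simplified tokenizer that only splits on non-alphanumeric ASCII characters. The build_token_index function and the sample events are invented for illustration and do not reflect how Devo stores the index.

```python
import re
from collections import defaultdict


def build_token_index(events):
    """Map every token to the set of event ids in which it appears."""
    index = defaultdict(set)
    for event_id, raw in enumerate(events):
        # Simplified tokenizer: maximal runs of ASCII letters and digits.
        for token in re.findall(r"[0-9A-Za-z]+", raw):
            index[token].add(event_id)
    return dict(index)


# Invented sample events.
EVENTS = [
    "access denied for dev01",
    "us.dev01.web started",
    "dev013 rebooted",
]
INDEX = build_token_index(EVENTS)
```

Here INDEX["dev01"] points at the first two events only, because dev013 is a different token.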

So, what's a token?

A token is simply a string of alphanumeric characters delimited by non-alphanumeric ASCII characters (such as symbols and spaces) in the raw event as it was delivered to Devo. Devo also recognizes as a token any value that matches an IPv4, IPv6, or MAC address-like format. Therefore, Devo will identify not only 10.0.1.2 and aa:bb:cc:dd as tokens, but also their component parts (10, 0, 1, 2, aa, bb, cc, and dd), because these parts are delimited by ASCII symbols (the periods and colons).
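The rule above can be approximated with a few regular expressions. This is only a sketch under stated assumptions: the patterns are simplified guesses at what "address-like" means, IPv6 is left out for brevity, the underscore is treated as a delimiter, and the tokenize function is hypothetical rather than Devo's actual tokenizer.

```python
import re

# Assumption: tokens are maximal runs of ASCII letters and digits.
ALNUM = re.compile(r"[0-9A-Za-z]+")
# Simplified, assumed patterns for address-like values (IPv6 omitted).
IPV4_LIKE = re.compile(r"(?<![\w.])\d{1,3}(?:\.\d{1,3}){3}(?![\w.])")
MAC_LIKE = re.compile(r"(?<![\w:])[0-9A-Fa-f]{2}(?::[0-9A-Fa-f]{2})+(?![\w:])")


def tokenize(raw):
    """Return the set of tokens found in a raw event string."""
    tokens = set(ALNUM.findall(raw))       # component parts and plain words
    tokens.update(IPV4_LIKE.findall(raw))  # whole IPv4-like values
    tokens.update(MAC_LIKE.findall(raw))   # whole MAC-like values
    return tokens
```

For the string "src=10.0.1.2 mac=aa:bb:cc:dd", this sketch yields the whole addresses as well as src, mac, 10, 0, 1, 2, aa, bb, cc, and dd.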

[Figure: excerpt from a firewall event. All of the substrings identified as tokens are highlighted in green. A complete standard IP address, also recognized and indexed as a token, is highlighted in blue.]

Since almost all raw data sent to Devo uses spaces or other ASCII symbols as separators between field values, the first segment of an event (up to the first ASCII symbol) is also identified as a token. In the example above, this is the token access.

When is the token index used?

Once the tag index has located the relevant files in Devo's data nodes, this index may be used to accelerate the location of the events that you're looking for. Whenever a query launches a search for a string, the query engine determines if the token index should be consulted. It does this based upon the LINQ operation used and how the string to search for is formatted. However, only those operations designed to identify string values can trigger the use of this index.

In addition, there are three ways that matches can be found in the token index:

  1. By searching for an entire token. 

  2. By searching for tokens that begin with a specified prefix.

  3. By searching for tokens that end with a specified suffix.
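The three kinds of match can be sketched against a toy inverted index as follows. The index contents and the lookup function are invented for illustration, and the linear scan over tokens is only for brevity; it says nothing about how Devo resolves prefix or suffix searches internally.

```python
# Toy token index: token -> ids of the events that contain it (invented data).
TOKEN_INDEX = {
    "dev01": {1, 2},
    "dev013": {3},
    "xdev01": {4},
    "web": {2},
}


def lookup(index, text, mode):
    """Return the ids of events with a token matching text in the given mode."""
    if mode == "exact":
        matching = [text] if text in index else []
    elif mode == "prefix":
        matching = [t for t in index if t.startswith(text)]
    elif mode == "suffix":
        matching = [t for t in index if t.endswith(text)]
    else:
        raise ValueError(f"unknown mode: {mode}")
    events = set()
    for token in matching:
        events |= index[token]
    return events
```

With this data, an exact lookup of dev01 reaches events 1 and 2, a prefix lookup also reaches event 3 (dev013), and a suffix lookup also reaches event 4 (xdev01).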

These are the LINQ operations that always use the token index, regardless of how the search string is formatted: 

Operation name

Case sensitivity

Description

Contains tokens (toktains)

Case-sensitive

This operation assumes that the string to search for is a token and therefore always uses the token index. It is a case-sensitive operation, however, so searching for Banana is not the same as searching for banana.

For example, toktains(message, "dev01") will return events where the message field contains dev01 as a token, not as just any substring. It will return events that contain us.dev01.web or simply dev01, but not dev013 or xdev01, because in the last two cases dev01 is not a token.

However, if the optional left-extended and right-extended Boolean arguments are used, toktains will treat the search string differently. For example: 

toktains(message, "dev01", true, false)

will return events where a token ends with dev01. Setting the left-extended argument to true tells the query engine that the search string is not a complete token and that it is preceded by additional alphanumeric characters.

Starts with (startswith)

Case-sensitive

This operation assumes that the string to search for is the beginning of a token and therefore always uses the token index. Like toktains, this operation is case-sensitive.

This returns events containing tokens that start with the specified string.

Ends with (endswith)

Case-sensitive

This operation assumes that the string to search for is the end of a token and therefore always uses the token index. Like toktains, this operation is case-sensitive.

This returns events containing tokens that end with the specified string.

Equal (eq, =)

Case-sensitive

Equal - case insensitive (eqic)

Case-insensitive

Since these operations look for an exact match of the string to search for, they always use the token index. While eq is case-sensitive, eqic can be used when the search should ignore case.

These return events containing tokens that exactly match the specified string (either regarding or disregarding case).
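To make the toktains matching rules described above concrete, here is a small Python model of them. It is an illustration, not the LINQ implementation. Several points are assumptions: the simplified tokenizer, that right-extended mirrors left-extended, that a token equal to the search string still counts as ending (or starting) with it, and that setting both arguments matches the string anywhere inside a token.

```python
import re


def toktains(text, needle, left_extended=False, right_extended=False):
    """Model of token matching; case-sensitive like the operation it imitates."""
    # Simplified tokenizer: maximal runs of ASCII letters and digits.
    for token in re.findall(r"[0-9A-Za-z]+", text):
        if left_extended and right_extended:
            hit = needle in token            # assumption: anywhere inside a token
        elif left_extended:
            hit = token.endswith(needle)     # token ends with the search string
        elif right_extended:
            hit = token.startswith(needle)   # assumption: mirror of left-extended
        else:
            hit = token == needle            # the search string is a whole token
        if hit:
            return True
    return False
```

In this model, toktains("us.dev01.web", "dev01") is true, toktains("xdev01", "dev01") is false, and toktains("xdev01", "dev01", True, False) is true.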

These are the LINQ operations that sometimes use the token index, depending upon how the search string is formatted: 

Operation name

Case sensitivity

Description

Contains (has, ->)

Case-sensitive

Contains - case insensitive (weakhas)

Case-insensitive

These operations will use the token index if the search string contains all or part of a token. That is, the search string must contain an alphanumeric string bounded on the left, the right, or both by a non-alphanumeric ASCII symbol.

To illustrate this, let's look at how the following query filter would be handled by the query engine.

has(message, "us.dev01.web")

The token index will be used to accelerate the search for:

  • Tokens that end in us.

  • The token dev01.

  • Tokens that start with web.

The results are the subset of the events identified by the token index search that contain the full search string.

However, unlike toktains, which assumes all search strings are tokens, has and weakhas will only use the token index if the search string contains at least one non-alphanumeric ASCII symbol.

Is in (`in`, <-)

Case-sensitive

Is in - case insensitive (weakin)

Case-insensitive

Like has and weakhas, these operations will use the token index when the search string contains non-alphanumeric symbols that define a complete or partial token. Here are some examples:

  • weakin("john", email) - Does not trigger a search of the token index. The results will include all events containing the string or substring "john".

  • weakin(".john", email) - Triggers a search of the token index to find events that contain a token that begins with "john".

  • weakin(".johnson@", email) - Triggers a search of the token index to find events where "johnson" occurs as a token. The results will contain only those events where "johnson" is preceded by a period (.) and followed by an @ symbol.
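The way a search string is broken into token index lookups, as described for has, weakhas, in, and weakin, can be sketched as below. The plan_index_lookups function is hypothetical and only models the decomposition in the examples above. It ignores case handling and the final step of filtering for the full search string.

```python
import re


def plan_index_lookups(search):
    """Split a search string into (mode, text) lookups against a token index."""
    lookups = []
    for match in re.finditer(r"[0-9A-Za-z]+", search):
        # A piece is bounded on a side when a non-alphanumeric symbol sits there.
        bounded_left = match.start() > 0
        bounded_right = match.end() < len(search)
        if bounded_left and bounded_right:
            lookups.append(("exact", match.group()))   # a complete token
        elif bounded_left:
            lookups.append(("prefix", match.group()))  # the token may continue to the right
        elif bounded_right:
            lookups.append(("suffix", match.group()))  # the token may continue to the left
        # A piece bounded on neither side gives the index nothing to work with.
    return lookups
```

For "us.dev01.web" this yields a suffix lookup for us, an exact lookup for dev01, and a prefix lookup for web. For "john" it yields no lookups at all, which corresponds to the case where the token index is not consulted.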

 

Learn more about the LINQ operations mentioned in the article.