
Overview

Logs generated by most AWS services (CloudTrail, VPC Flow Logs, Elastic Load Balancer, etc.) can be exported as blob objects to S3. Many third-party services have adopted the same approach, so it has become a common pattern across many technologies. The Devo Professional Services and Technical Acceleration teams maintain a base collector that uses this S3 pattern to collect logs and that can be customized for whichever technologies a customer stores logs from in S3.

This guide walks through setting up your AWS infrastructure so that the collector integration works out of the box:

  • Sending data to S3 (this guide uses CloudTrail as the data source service)

  • Setting up S3 event notifications to SQS

  • Enabling SQS and S3 access using a cross-account IAM role

  • Gathering information to be provided to Devo for collector setup
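
To make the S3-to-SQS step more concrete, the sketch below shows how a consumer might read the bucket name and object key out of one S3 event notification delivered through SQS. It is an illustration only and not the collector's code: the function name `s3_objects_from_sqs_body` and the sample bucket and key are hypothetical, and the message is assumed to be a raw S3 event notification (not wrapped by SNS).

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_sqs_body(body):
    """Return (bucket, key) pairs named in one S3 event notification body."""
    message = json.loads(body)
    objects = []
    # Test events sent when a notification is first configured carry no "Records".
    for record in message.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys are URL-encoded in the notification payload.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


# Hypothetical message body, trimmed to the fields used above.
sample_body = json.dumps({
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "example-log-bucket"},
            "object": {"key": "AWSLogs/123456789012/CloudTrail/file+name.json.gz"},
        },
    }]
})
print(s3_objects_from_sqs_body(sample_body))
```

Each pair identifies one object that a collector would then download from S3 and process.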

General architecture diagram

[Image: general architecture diagram]

Requirements

  • Access to S3, SQS, IAM, and CloudTrail services

  • Permissions to send data to S3

  • Knowledge of log format/technology type being stored in S3

 

Devo collector features

| Feature | Details |
| --- | --- |
| Allow parallel downloading (multipod) | Allowed |
| Running environments | Collector Server, On Premise |
| Populated Devo events | Table |
| Flattening Preprocessing | No |

Data sources

| Data source | Description | Collector service name | Devo table | Available from release |
| --- | --- | --- | --- | --- |
| Any | Theoretically any source you send to an SQS can be collected | | | v1.0.0 |
| CONFIG LOGS | | aws_sqs_config | cloud.aws.configlogs.events | v1.0.0 |
| AWS ELB | | aws_sqs_elb | web.aws.elb.access | v1.0.0 |
| AWS ALB | | aws_sqs_alb | web.aws.alb.access | v1.0.0 |
| CISCO UMBRELLA | | aws_sqs_cisco_umbrella | sig.cisco.umbrella.dns | v1.0.0 |
| CLOUDFLARE LOGPUSH | | aws_sqs_cloudflare_logpush | cloud.cloudflare.logpush.http | v1.0.0 |
| CLOUDFLARE AUDIT | | aws_sqs_cloudflare_audit | cloud.aws.cloudflare.audit | v1.0.0 |
| CLOUDTRAIL | | aws_sqs_cloudtrail | cloud.aws.cloudtrail.* | v1.0.0 |
| CLOUDTRAIL VIA KINESIS FIREHOSE | | aws_sqs_cloudtrail_kinesis | cloud.aws.cloudtrail.* | v1.0.0 |
| CLOUDWATCH | | aws_sqs_cloudwatch | cloud.aws.cloudwatch.logs | v1.0.0 |
| CLOUDWATCH VPC | | aws_sqs_cloudwatch_vpc | cloud.aws.vpc.flow | v1.0.0 |
| CONTROL TOWER | VPC Flow Logs, CloudTrail, CloudFront, and/or AWS Config logs | aws_sqs_control_tower | | v1.0.0 |
| FDR | | aws_sqs_fdr | edr.crowdstrike.cannon | v1.0.0 |
| GUARD DUTY | | aws_sqs_guard_duty | cloud.aws.guardduty.findings | v1.0.0 |
| GUARD DUTY VIA KINESIS FIREHOSE | | aws_sqs_guard_duty_kinesis | cloud.aws.guardduty.findings | v1.0.0 |
| IMPERVA INCAPSULA | | aws_sqs_incapsula | cef0.imperva.incapsula | v1.0.0 |
| LACEWORK | | aws_sqs_lacework | monitor.lacework. | v1.0.0 |
| PALO ALTO | | aws_sqs_palo_alto | firewall.paloalto.[file-log_type] | v1.0.0 |
| ROUTE 53 | | aws_sqs_route53 | dns.aws.route53 | v1.0.0 |
| OS LOGS | | aws_sqs_os | box.[file-log_type].[file-log_subtype].us | v1.0.0 |
| SENTINEL ONE FUNNEL | | aws_sqs_s1_funnel | edr.sentinelone.dv | v1.0.0 |
| S3 ACCESS | | aws_sqs_s3_access | web.aws.s3.access | v1.0.0 |
| VPC LOGS | | aws_sqs_vpc | cloud.aws.vpc.flow | v1.0.0 |
| WAF LOGS | | aws_sqs_waf | cloud.aws.waf.logs | v1.0.0 |

Options

See examples of common configurations here: General S3 Collector Configuration Examples and Recipes
The configurable options outlined in the README of the GitLab repository are reproduced here. See the GitLab repository for specific examples in each subdirectory.

  • direct_mode -- true or false (default is false). Set to true if the logs are sent directly to the queue without using S3.

  • file_field_definitions -- a dictionary mapping variable names (which you choose) to lists of parsing rules.
    Each parsing rule has an operator with its own accompanying keys. Parsing rules are applied in the order they are listed in the configuration.

    • The "split" operator takes an "on" and an "element": the filename is split into pieces on the character or character sequence specified by "on", and the piece at the index specified by "element" is extracted, as in the example.

    • The "replace" operator takes a "to_replace" and a "replace_with".

    • For example, if your filename were "server_logs/12409834/ff.gz", this configuration would store the log_type as "serverlogs":

"file_field_definitions": 
{
	"log_type": [{"operator": "split", "on": "/", "element": 0}, {"operator": "replace", "to_replace": "_", "replace_with": ""}]
}
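
The way these parsing rules are applied can be sketched in Python as follows. This is a minimal illustration under assumptions, not the collector's implementation: the function names `apply_parsing_rules` and `extract_file_fields` are hypothetical, and only the "split" and "replace" operators described above are handled.

```python
def apply_parsing_rules(filename, rules):
    """Apply "split" and "replace" rules to a filename, in the order listed."""
    value = filename
    for rule in rules:
        if rule["operator"] == "split":
            # Same behaviour as Python's str.split followed by indexing.
            value = value.split(rule["on"])[rule["element"]]
        elif rule["operator"] == "replace":
            value = value.replace(rule["to_replace"], rule["replace_with"])
        else:
            raise ValueError(f"unsupported operator: {rule['operator']}")
    return value


def extract_file_fields(filename, file_field_definitions):
    """Build one variable per entry in file_field_definitions."""
    return {
        name: apply_parsing_rules(filename, rules)
        for name, rules in file_field_definitions.items()
    }


file_field_definitions = {
    "log_type": [
        {"operator": "split", "on": "/", "element": 0},
        {"operator": "replace", "to_replace": "_", "replace_with": ""},
    ]
}
print(extract_file_fields("server_logs/12409834/ff.gz", file_field_definitions))
# {'log_type': 'serverlogs'}
```

Because the rules run in order, the "replace" rule here sees only "server_logs" (the output of the "split" rule), not the whole filename.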
  • filename_filter_rules -- a list of rules for filtering out entire files.

  • encoding -- takes one of the following strings: "gzip", "none", "parquet"

  • ack_messages -- whether to delete messages from the queue after processing. Takes a boolean value; if not specified, the default is true. We recommend leaving this out of the config. If it is present, check carefully whether it is set to true or false.

  • file_format -- takes a dictionary with the following keys

    • type -- a string specifying which processor to use

      • single_json_object -- logs are stored as/in a JSON object

        • single_json_object_processor config options: "key" (string: the key under which the list of logs is stored). See cloudtrail_collector for an example.

          • config: {"key": "log"}
            fileobj:  {..."log": {...}}
      • unseparated_json_processor -- logs are stored as/in JSON objects that are written to a text file with no separator

        • unseparated_json config options: "key" (string: where the log is stored), "include" (dict: maps the names of keys outside the inner part that should be included, and which can be renamed). If there is no key, that is, the whole JSON object is the desired log, set "flat": true. See aws_config_collector for an example.

          • fileobj:  {...}{...}{...}
      • text_file_processor -- logs are stored as text files, potentially with lines and fields separated by, e.g., newlines and commas

        • text_file config options: include options for how lines and records are separated (e.g. newline, tab, comma). Good for CSV-style data.

      • line_split_processor -- logs are stored in a newline-separated file; works more quickly than separated_json_processor

        • config options: "json": true or false. Setting json to true assumes that the logs are newline-separated JSON and allows the collector to parse them, which enables record-field mapping.

      • separated_json_processor -- logs are stored as many JSON objects with some kind of separator between them

        • config options: specify the separator, e.g. "separator": "||". The default is newline if it is not specified.

          • fileobj:  {...}||{...}||{...}
      • jamf_processor -- special processor for JAMF logs

      • aws_access_logs_processor -- special processor for AWS access logs

      • windows_security_processor -- special processor for Windows Security logs

      • vpc_flow_processor -- special processor for VPC Flow logs

      • json_line_arrays_processor -- processor for unseparated JSON objects that are spread over multiple lines of a single file

        • fileobj:  {...}{...}
          {...}{...}{...}
          {...}
      • dict_processor -- processor for logs that come as Python dictionary objects, i.e. in direct mode

    • config -- a dictionary of information the specified file_format processor needs
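
To illustrate the difference between the separated and unseparated JSON layouts described above, here is a short Python sketch using only the standard library. It is not the collector's processor code: the function names are hypothetical, and splitting on the separator is assumed to be safe (that is, the separator never appears inside a log value).

```python
import json


def split_separated_json(text, separator="\n"):
    """Parse JSON objects joined by a separator, e.g. {...}||{...}||{...}."""
    return [json.loads(chunk) for chunk in text.split(separator) if chunk.strip()]


def split_unseparated_json(text):
    """Parse JSON objects written back to back, e.g. {...}{...}{...}."""
    decoder = json.JSONDecoder()
    records, pos = [], 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over it here.
        if text[pos].isspace():
            pos += 1
            continue
        # raw_decode returns the parsed object and the index where it ended.
        record, pos = decoder.raw_decode(text, pos)
        records.append(record)
    return records


print(split_separated_json('{"a": 1}||{"a": 2}', separator="||"))
print(split_unseparated_json('{"a": 1}{"a": 2}\n{"a": 3}'))
```

Because the second function also steps over newlines between objects, it handles the multi-line layout shown for json_line_arrays_processor as well.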

  • record_field_mapping -- a dictionary in which each key defines a variable that is parsed out of each record (and which may be referenced later in filtering).
    For example, we may want to parse a value and call it "type" by reading "type" from a certain key in the record (which may be several layers deep).

    {"type": {"keys": ["file", "type"], "operations": []}}

    keys is the list of keys to follow, in order, to reach the value in the record. It handles nesting by defining a path through the data. Suppose we have logs that look like this:

    {"file": {"type": {"log_type": 100}}}

    To get the log_type, list all the keys needed to reach it in the JSON, in order:

    keys: ["file", "type", "log_type"]

    In many cases you will probably only need one key.

    For example, in flat JSON that isn't nested:

    {"log_type": 100, "other_info": "blah", ...}

    Here you would just specify keys: ["log_type"]. A few operations (like split and replace) are supported to further alter the parsed value.

    The record_field_mapping snippet shown earlier grabs whatever is located at log["file"]["type"] and names it "type". record_field_mapping defines variables by taking them from logs, and these variables can then be used for filtering. Suppose you have a log in JSON format like this, which will be sent to Devo:

    {"file": {"value": 0, "type": "security_log"}}

    Specifying "type" in the record_field_mapping allows the collector to extract that value, "security_log", and save it as type.

    Now suppose you want to change the tag dynamically based on that value. You could change the routing_template to something like my.app.datasource.[record-type]. In the case of the log above, it would be sent to my.app.datasource.security_log.

    Now suppose you want to filter out (not send) any records that have the type security_log. You could write a line_filter_rule as follows:

    {"source": "record", "key": "type", "type": "match", "value": "security_log"}

    We specify the source as "record" because we want to use a variable from the record_field_mapping. We specify the key as "type" because that is the name of the variable we defined. We specify the type as "match" because we want to filter out any record matching this rule. And we specify the value as "security_log" because we do not want to send any records whose type equals "security_log".

    The split operation behaves the same as running the Python split function on a string.

    Suppose you have a filename "logs/account_id/folder_name/filename" and you want to save the account_id as a variable to use for tag routing or filtering.

    You could write a file_field_definition like this:

    "account_id": [{"operator": "split", "on": "/", "element": 1}]

    This stores a variable called account_id by taking the entire filename, splitting it into pieces wherever it finds a forward slash, and taking the element at index 1. In Python it would look like:

    filename.split("/")[1]
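
The variable extraction and tag routing described above can be sketched in Python as follows. This is a minimal illustration, not the collector's actual code: the function names `get_by_key_path`, `extract_record_fields`, and `build_tag` are hypothetical, the "operations" list is ignored, and placeholders are assumed to have exactly the form [record-name] or [file-name].

```python
import re


def get_by_key_path(record, keys):
    """Follow a list of keys, in order, through nested dictionaries."""
    value = record
    for key in keys:
        value = value[key]
    return value


def extract_record_fields(record, record_field_mapping):
    """Build the record variables named in a record_field_mapping."""
    return {
        name: get_by_key_path(record, spec["keys"])
        for name, spec in record_field_mapping.items()
    }


def build_tag(routing_template, record_vars, file_vars):
    """Replace [record-x] and [file-x] placeholders with variable values."""
    sources = {"record": record_vars, "file": file_vars}

    def substitute(match):
        source, name = match.group(1), match.group(2)
        return str(sources[source][name])

    return re.sub(r"\[(record|file)-([^\]]+)\]", substitute, routing_template)


record = {"file": {"value": 0, "type": "security_log"}}
mapping = {"type": {"keys": ["file", "type"], "operations": []}}
record_vars = extract_record_fields(record, mapping)
file_vars = {"account_id": "logs/account_id/folder_name/filename".split("/")[1]}

print(record_vars)  # {'type': 'security_log'}
print(build_tag("my.app.datasource.[record-type]", record_vars, file_vars))
# my.app.datasource.security_log
```

Keeping record variables and file variables in separate dictionaries mirrors the "record" and "file" sources that the filter rules refer to.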
  • routing_template -- a string defining how to build the tag that each message is sent to, e.g.
    "my.app.wow.[record-type].[file-log_type]" -- if the "type" extracted during record_field_mapping were "null", the [record-type] placeholder would be replaced with "null", giving a tag that starts with "my.app.wow.null"

  • line_filter_rules -- a list of lists of rules for filtering out individual records so that they are not sent to Devo.
    For example:

"line_filter_rules": [
    [
        {"source": "record", "key": "type", "type": "doesnotmatch", "value": "ldap"}
    ],
    [
        {"source": "file", "key": "main-log_ornot", "type": "match", "value": "main-log"},
        {"source": "record", "key": "type", "type": "match", "value": "kube-apiserver-audit"}
    ]
]

This set of rules could be expressed in pseudocode as follows:
if record.type != "ldap" OR (file.main-log_ornot == "main-log" AND record.type == "kube-apiserver-audit"):
    do_not_send_record()
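
The same logic can be written as a small runnable Python sketch: rules inside one inner list are combined with AND, and the inner lists are combined with OR. The function names are hypothetical, and treating "match" as an exact string comparison is an assumption made for illustration; the collector itself may compare values differently.

```python
def rule_matches(rule, record_vars, file_vars):
    """Evaluate one rule against the record or file variables."""
    variables = record_vars if rule["source"] == "record" else file_vars
    actual = variables.get(rule["key"])
    if rule["type"] == "match":
        return actual == rule["value"]  # assumption: exact comparison
    if rule["type"] == "doesnotmatch":
        return actual != rule["value"]
    raise ValueError(f"unsupported rule type: {rule['type']}")


def should_filter_out(line_filter_rules, record_vars, file_vars):
    """True if every rule in at least one inner list matches."""
    return any(
        all(rule_matches(rule, record_vars, file_vars) for rule in group)
        for group in line_filter_rules
    )


line_filter_rules = [
    [{"source": "record", "key": "type", "type": "doesnotmatch", "value": "ldap"}],
    [
        {"source": "file", "key": "main-log_ornot", "type": "match", "value": "main-log"},
        {"source": "record", "key": "type", "type": "match", "value": "kube-apiserver-audit"},
    ],
]

# An "ldap" record from a file that is not the main log is kept.
print(should_filter_out(line_filter_rules, {"type": "ldap"}, {"main-log_ornot": "other"}))
# False
```

With these example rules, any record whose type is not "ldap" is dropped by the first inner list alone.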

(Internal) Notes + Debugging
Config can include "debug_mode": true to print out some useful information as logs come in.
For local testing it is useful to set "ack_messages" to false, so that you can try processing without consuming messages from the queue. Be careful to remove this setting or set it to true when launching the collector. If it is not set, the default is to ack messages.

If something seems wrong at launch, you can set the following in the collector parameters / job config.

"debug_mode": true,
"do_not_send": true,
"ack_messages": false

This prints out data as it is processed, stops messages from being acked, and skips the final step of actually sending the data. That way you can see whether something is breaking without the customer receiving wrongly formatted or repeated data, and without consuming from the queue and losing data.
