
Overview

| Data source | Description | Collector service name | Devo table | Available from |
| --- | --- | --- | --- | --- |
| Any | Any source you send to an SQS queue can be collected. | - | - | v1.0.0 |
| CONFIG LOGS | - | aws_sqs_config | cloud.aws.configlogs.events | v1.0.0 |
| AWS ELB | - | aws_sqs_elb | web.aws.elb.access | v1.0.0 |
| AWS ALB | - | aws_sqs_alb | web.aws.alb.access, web.aws.alb.connection | v1.0.0 |
| CISCO UMBRELLA | - | aws_sqs_cisco_umbrella | sig.cisco.umbrella.dns | v1.0.0 |
| CLOUDFLARE LOGPUSH | - | aws_sqs_cloudflare_logpush | cloud.cloudflare.logpush.http | v1.0.0 |
| CLOUDFLARE AUDIT | - | aws_sqs_cloudflare_audit | cloud.aws.cloudflare.audit | v1.0.0 |
| CLOUDTRAIL | - | aws_sqs_cloudtrail | cloud.aws.cloudtrail.* | v1.0.0 |
| CLOUDTRAIL VIA KINESIS FIREHOSE | - | aws_sqs_cloudtrail_kinesis | cloud.aws.cloudtrail.* | v1.0.0 |
| CLOUDWATCH | - | aws_sqs_cloudwatch | cloud.aws.cloudwatch.logs | v1.0.0 |
| CLOUDWATCH VPC | - | aws_sqs_cloudwatch_vpc | cloud.aws.vpc.flow | v1.0.0 |
| CONTROL TOWER | VPC Flow Logs, CloudTrail, CloudFront, and/or AWS Config logs | aws_sqs_control_tower | - | v1.0.0 |
| FDR | - | aws_sqs_fdr | edr.crowdstrike.cannon | v1.0.0 |
| FDR LARGE | Use this service if the FDR files are too large to pull and the standard FDR service fails. | aws_sqs_fdr_large | edr.crowdstrike.cannon | - |
| GUARD DUTY | - | aws_sqs_guard_duty | cloud.aws.guardduty.findings | v1.0.0 |
| GUARD DUTY VIA KINESIS FIREHOSE | - | aws_sqs_guard_duty_kinesis | cloud.aws.guardduty.findings | v1.0.0 |
| IMPERVA INCAPSULA | - | aws_sqs_incapsula | cef0.imperva.incapsula | v1.0.0 |
| JAMF | - | aws_sqs_jamf | my.app.[file-log_type].logs | v1.0.0 |
| KUBERNETES | - | aws_sqs_kubernetes | my.app.kubernetes.events | v1.0.0 |
| LACEWORK | - | aws_sqs_lacework | monitor.lacework | v1.0.0 |
| PALO ALTO | - | aws_sqs_palo_alto | firewall.paloalto.[file-log_type] | v1.0.0 |
| RDS | Relational database audit logs | aws_sqs_rds | cloud.aws.rds.audit | v1.1.1 |
| ROUTE 53 | - | aws_sqs_route53 | dns.aws.route53 | v1.0.0 |
| OS LOGS | - | aws_sqs_os | box.[file-log_type].[file-log_subtype].us | v1.0.0 |
| SENTINEL ONE FUNNEL | - | aws_sqs_s1_funnel | edr.sentinelone.dv | v1.0.0 |
| S3 ACCESS | - | aws_sqs_s3_access | web.aws.s3.access | v1.0.0 |
| VPC LOGS | - | aws_sqs_vpc | cloud.aws.vpc.flow | v1.0.0 |
| WAF LOGS | - | aws_sqs_waf | cloud.aws.waf.logs | v1.0.0 |

For each setup, you can use this general config:

{
  "global_overrides": {
    "debug": false
  },
  "inputs": {
    "sqs_collector": {
      "id": "34523",
      "enabled": true,
      "credentials": {
        "aws_cross_account_role": "if provided",
        "aws_external_id": "if needed/supplied"
      },
      "region": "us-east-2",
      "base_url": "https://sqs.us-east-2.amazonaws.com/",
      "sqs_visibility_timeout": 120
      "sqs_wait_timeout": 20
      "sqs_max_messages": 1
      "ack_messages": false
      "direct_mode": false
      "do_not_send": false
      "compressed_events": false
      "debug_md5": false,
      "services": {
        "aws_sqs_kubernetes": {
          "encoding": "gzip",
          "type": "unseparated_json_processor",
          "config": {
            "key": "logEvents"
          }
        }
      }
    }
  }
}

The available services are listed in the table above. Every part of a service definition is overridable, so if you need to change the encoding, for example, you can do so freely. You can also leave a service empty, as in "service_name": {}.
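For example, a services block that overrides the encoding for one service and leaves another at its defaults might look like this (a sketch; the service names come from the table above and the gzip override is purely illustrative, not a required setting):

"services": {
  "aws_sqs_cloudtrail": {
    "encoding": "gzip"
  },
  "aws_sqs_waf": {}
}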

Custom services or overrides

For a custom service or override, the config can look like this:

"services": {
  "custom_service": {
  "file_field_definitions": {},
  "filename_filter_rules": [],
  "encoding": "parquet",
  "file_format": {
    "type": "line_split_processor",
    "config": {"json": true}
  },
  "record_field_mapping": {},
  "routing_template": "my.app.ablo.backend",
  "line_filter_rules": []
  }
}

The main things you need:

  • file_format specifies the type of processor used to parse the files.

  • routing_template defines the Devo tag the events are sent to.

Collectors that need custom tags

aws_sqs_alb

  • web.aws.alb.access.SQS_REGION.SQS_ACCID (see the sketch after this list)

  • Fill in SQS_REGION and SQS_ACCID with the queue's region and account ID.

aws_sqs_elb

  • web.aws.elb.access.SQS_REGION.SQS_ACCID

  • Fill in SQS_REGION and SQS_ACCID with the queue's region and account ID.

aws_sqs_rds

  • cloud.aws.rds.audit.SQS_REGION.SQS_ACCID

  • Fill in SQS_REGION and SQS_ACCID with the queue's region and account ID.

  • These placeholders can instead carry information about the database the logs come from; they do not have to be the region and account ID.
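A minimal sketch of such an override for aws_sqs_alb, since every part of a service is overridable. The region us-east-2 and account ID 123456789012 are hypothetical placeholders:

"services": {
  "aws_sqs_alb": {
    "routing_template": "web.aws.alb.access.us-east-2.123456789012"
  }
}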

Types of processors

unseparated_json_processor

Use this when the events arrive as JSON objects written one after another with no separator.

split_or_unseparated_processor

Determines whether the log is split by \n or not, and processes it accordingly.

aws_access_logs_processor

For AWS access logs split by \n.

single_json_object_processor

For files containing a single JSON object.

separated_json_processor

For JSON objects separated by a configurable delimiter (newline by default).

bluecoat_processor

For the Bluecoat recipe.

json_object_to_linesplit_processor

Splits a JSON object into records by a configured value.

json_array_processor

For files containing a JSON array of events.

json_line_arrays_processor

For unseparated JSON objects spread across multiple lines of a single file.

jamf_processor

For Jamf log processing.

parquet_processor

For Parquet-encoded files.

guardduty_processor

For GuardDuty findings.

vpc_flow_processor

For VPC Flow logs.

alt_vpc_flow_processor

Alternate processor for VPC Flow logs.

kolide_processor

For the Kolide service.

json_array_vpc_processor

For VPC Flow logs delivered as JSON arrays.

rds_processor

For the RDS audit log service.
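To use one of these processors, set it as the type in a service's file_format block. A minimal sketch using single_json_object_processor; the "log" key follows the example given later on this page:

"file_format": {
  "type": "single_json_object_processor",
  "config": {"key": "log"}
}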

Tagging

Tagging can be done in many different ways. One way tagging works is by using the file field definitions:

  "file_field_definitions": {
    "log_type": [
      {
        "operator": "split",
        "on": "/",
        "element": 2
      }
    ]
  },

Take a filename such as cequence-data/cequence-devo-6x-NAieMI/detector/... as an example. The definition above splits the filename on "/" and takes the element at index 2. Indexing starts at 0, as with arrays, so:

  • 0 = cequence-data

  • 1 = cequence-devo-6x-NAieMI

  • 2 = detector

"routing_template": "my.app.test_cequence.[file-log_type]"

Our final tag is my.app.test_cequence.detector
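Putting the two pieces together, the relevant fragment of the service config for this example would look roughly like this (a sketch assembled from the snippets above):

"file_field_definitions": {
  "log_type": [
    {"operator": "split", "on": "/", "element": 2}
  ]
},
"routing_template": "my.app.test_cequence.[file-log_type]"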

Options for filtering

direct_mode

Allowed values are true or false (the default is false). Set to true if the logs are sent directly to the queue without using S3.

file_field_definitions

Defined as a dictionary mapping variable names (which you choose) to lists of parsing rules.

Each parsing rule has an operator with its own keys. Parsing rules are applied in the order they are listed in the configuration.

  • The split operator uses the on and element keys. The filename is split on the character or character sequence specified in the on key, and the piece at the specified element index is extracted, as in the example below.

  • The replace operator uses the to_replace and replace_with keys.

For example, if your filename is server_logs/12409834/ff.gz, this configuration would store the log_type as serverlogs:

"file_field_definitions": 
{
	"log_type": [{"operator": "split", "on": "/", "element": 0}, {"operator": "replace", "to_replace": "_", "replace_with": ""}]
}

filename_filter_rules

A list of rules to filter out entire files.

encoding

Takes a string. The values, from most common to least common: gzip, none, parquet, latin-1.

ack_messages

Controls whether messages are deleted from the queue after processing. Takes boolean values; if not specified, the default is true. We recommend leaving this out of the config. If you do see it in a config, pay close attention to whether it is on or off.

file_format

type - A string specifying which processor to use.

single_json_object - Logs are stored as/in a JSON object.

single_json_object_processor config options:

  • key - (string) The key where the list of logs is stored.

config: {"key": "log"}
fileobj:  {..."log": {...}}

unseparated_json_processor - Logs are stored as/in JSON objects, which are written in a text file with no separator.

unseparated_json config options:

  • key - (string) where the log is stored

  • include - (dict) maps names of keys outside the inner part that should be included in each record; these keys can be renamed.

If there is no key (that is, the whole JSON object is the desired log), set "flat": true.

See aws_config_collector for example:

fileobj:  {...}{...}{...}
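For the flat case described above (no key; the whole JSON object is the log), the config would presumably be just:

config: {"flat": true}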

text_file_processor - logs are stored as text files, potentially with lines and fields separated by, e.g., commas and newlines.

text_file config options: options for how lines and records are separated (e.g. newline, tab, comma); good for CSV-style data.

line_split_processor – logs are stored in a newline-separated file; works more quickly than separated_json_processor.

config options: "json": true or false. Setting json to true assumes the logs are newline-separated JSON and allows the collector to parse them, which enables record_field_mapping.

separated_json_processor – logs are stored as many JSON objects with some kind of separator.

config options: specify the separator, e.g. "separator": "||". The default is newline if not specified.

fileobj:  {...}||{...}||{...}
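A sketch of the corresponding file_format block for the "||"-separated example above:

"file_format": {
  "type": "separated_json_processor",
  "config": {"separator": "||"}
}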

jamf_processor – special processor for JAMF logs

aws_access_logs_processor – special processor for AWS access logs

windows_security_processor – special processor for Windows Security logs

vpc_flow_processor – special processor for VPC Flow logs

json_line_arrays_processor – processor for unseparated json objects that are on multiple lines of a single file.

fileobj:  {...}{...}
{...}{...}{...}
{...}

dict_processor – processor for logs that come as Python dictionary objects, i.e. in direct mode.

config - a dictionary of information the specified file_format processor needs

record_field_mapping

A dictionary in which each key defines a variable that can be parsed out of each record (and may be referenced later in filtering). For example, we may want to parse something out and call it "type", by getting "type" from a certain key in the record (which may be nested multiple layers deep).

{"type": {"keys": ["file", "type"]},	"operations": []	}

keys is a list of the keys in the record to look into to find the value; it handles nesting (essentially defining a path through the data). Suppose we have logs that look like this:

{"file": {"type": {"log_type": 100}}}

So if we want to get the log_type, we should list, in order, all the keys needed to walk through the JSON:

keys: ["file", "type", "log_type"]

In many cases you will probably only need one key, e.g. in flat JSON that isn't nested:

{"log_type": 100, "other_info": "blah", ...}

Here you would just specify keys: ["log_type"].

A few operations are supported that can be used to further alter the parsed information (such as split and replace). The snippet above would grab whatever is located at log["file"]["type"] and name it "type". record_field_mapping defines variables by taking them from logs, and these variables can then be used for filtering.

Let's say you have a log in JSON format like this, which will be sent to Devo:

{"file": {"value": 0, "type": "security_log"}}

Specifying "type" in the record_field_mapping allows the collector to extract that value, "security_log", and save it as type. Now let's say you want to change the tag dynamically based on that value. You could change the routing_template to something like my.app.datasource.[record-type]; in the case of the log above, the record would be sent to my.app.datasource.security_log. Now let's say you want to filter out (not send) any records whose type is security_log. You could write a line_filter_rule as follows:

{"source": "record", "key": "type", "type": "match", "value": "security_log" } We specified the source as record because we want to use a variable from the record_field_mapping. We specified the key as “type” because that is the name of the variable we defined. We specify type as “match” because any record matching this rule we want to filter out. And we specify the value as security_log because we specifically do not want to send any records with the type equalling “security_log” The split operation is the same as if you ran the python split function on a string.

Let’s say you have a filename “logs/account_id/folder_name/filename” and you want to save the account_id as a variable to use for tag routing or filtering.

You could write a file_field_definition like this:

"account_id": [{"operator": "split", "on": "/", "element": 1}]

This stores a variable called account_id by taking the entire filename, splitting it into pieces wherever it finds a forward slash, and then taking the element at position one. In Python it would look like:

filename.split("/")[1]

routing_template

A string defining how to build the tag used to send each message, e.g. "my.app.wow.[record-type].[file-log_type]". If the "type" extracted during record_field_mapping were "null", the record would be sent to a tag starting with "my.app.wow.null".

line_filter_rules

  • A list of lists of rules for filtering out individual records so they do not get sent to Devo. For example:

"line_filter_rules": [
	[{
        "source": "record",
        "key": "type",
        "type": "doesnotmatch",
        "value": "ldap"
      }],
    [
      {"source": "file", "key": "main-log_ornot", "type": "match", "value": "main-log"},
      {"source": "record", "key": "type", "type": "match", "value": "kube-apiserver-audit"},
    ]
  ]

This set of rules could be expressed in pseudocode as follows:

if record.type != "ldap" OR (file.main-log_ornot == "main-log" AND record.type == "kube-apiserver-audit"):
    do_not_send_record()
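Putting the security_log example from the record_field_mapping discussion together with the routing and filtering options above, the relevant service fragment might look roughly like this (a sketch combining snippets already shown on this page; my.app.datasource.[record-type] is the illustrative tag used earlier):

"record_field_mapping": {
  "type": {"keys": ["file", "type"], "operations": []}
},
"routing_template": "my.app.datasource.[record-type]",
"line_filter_rules": [
  [{"source": "record", "key": "type", "type": "match", "value": "security_log"}]
]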

(Internal) Notes + Debugging
Config can include "debug_mode": true to print out some useful information as logs come in.
For local testing it is useful to set "ack_messages" to false, so you can try processing without consuming from the queue. Be careful to remove this or set it to true when launching the collector; the default is to ack messages when it is not set.

If something seems wrong at launch, you can set the following in the collector parameters / job config:

"debug_mode": true,
"do_not_send": true,
"ack_messages": false

This will print out data as it is being processed, stop messages from being acked, and, at the last step, not actually send the data. That way you can see whether something is breaking without the customer receiving wrongly formatted or repeated data, and without consuming from the queue and losing data.
