GitHub collector

Overview

Take into account that version 2.0.0 contains breaking changes.

The configuration in this article is only valid for v2.x versions. If you are using v1.x, please upgrade to v2.x and review this configuration carefully to deploy the collector successfully.

GitHub is a version control platform that allows you to track changes to your codebase, flag bugs and issues for follow-up, and manage your product's build process. It simplifies working with other people and makes it easy to collaborate on projects. Team members can work on files and easily merge their changes with the master branch of the project. The GitHub API provides data about a hosted code repository, ranging from commits to pull requests and comments.

The Devo GitHub collector enables customers to retrieve data from the GitHub API into Devo to query, correlate, analyze, and visualize it, enabling Enterprise IT and Cybersecurity teams to make the most impactful decisions at the petabyte scale.

API limitations

The main API limitation is the rate limit established by GitHub.

Like several other APIs, the GitHub API has a rate limit: applications such as collectors can make only a limited number of API requests per hour. When the rate limit is hit, the collector pauses for the time determined by the GitHub API, until the next period starts (for instance, the collector may have to pause for 1000 or 2000 seconds). The limit is shared across the whole API, so the collector stops receiving events and sending them to Devo for all the services. The result is a delay in event reception: some events don’t arrive at Devo until several minutes later.

The rate limit depends on the section of the API used. From the collector's point of view, there are two kinds of GitHub API services: by Organization and by Repository.

  • For Organization services, the rate limit is 1750 requests per hour.

  • For Repository services, the default limit is 5000 requests per hour, but Enterprise Cloud accounts may have a higher limit, up to 15000 requests per hour.

These numbers can seem high, and they are enough for a small corporation or an individual, but they are very limiting when extracting all the data available from some corporations on GitHub. The reason is the way data has to be extracted from GitHub.

  • Services by Organization (such as audit) are simple: one call can return the event items for the entire organization. Usually, the information is divided into several pages (maybe 100 or more), and each page counts as one request. The rate limit is at risk only when extracting data that spans several months or years, which can amount to tens of thousands of events.

  • For services by Repository (commits, events, forks), it is not possible to gather everything in one call. First, we ask for the list of repositories (for example, 148), and then, for each repository, we ask for the new items. This has to be done for each service, even if there is no data for that repository. If there is data, all its pages have to be downloaded, so more requests are needed.

Let's look at an example. Imagine that we have 200 code repositories in GitHub (some corporations have thousands) and we activate 10 services in the collector. Then we’ll need at least 2000 requests for each data pull, even if there is no new data. Additional requests may be needed to load all the data pages for each repository.

If we do a data pull every minute, the limit can be depleted in a few minutes, even when the limit is 15000.
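The arithmetic above can be sketched in a few lines of Python. This is a rough estimate only: the function names are hypothetical, and the calculation counts one request per repository and service, ignoring the repository listing and any extra result pages.

```python
def min_requests_per_pull(repositories: int, services: int) -> int:
    """Lower bound: one request per repository for each active service."""
    return repositories * services


def minutes_until_depleted(hourly_limit: int, requests_per_pull: int,
                           request_period_in_seconds: int) -> float:
    """Minutes of pulling before the hourly quota is used up."""
    pulls_available = hourly_limit / requests_per_pull
    return pulls_available * request_period_in_seconds / 60


per_pull = min_requests_per_pull(repositories=200, services=10)
print(per_pull)                                      # 2000
print(minutes_until_depleted(15000, per_pull, 60))   # 7.5
```

With 200 repositories, 10 services, and a pull every 60 seconds, even a 15000-request quota lasts about seven and a half minutes under these assumptions.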

When the limit is reached, the GitHub API returns a header indicating how long the collector should pause to recover. The collector writes the warning “API Rate Limit Exceeded, waiting for X seconds” to the log and pauses for the given time.
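As an illustration of that pause, the sketch below derives a wait time from response headers. It assumes the `retry-after` and `x-ratelimit-reset` headers documented for the GitHub REST API; the function name `seconds_to_wait` is hypothetical, and this is not the collector's actual implementation.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(headers: Mapping[str, str], now: Optional[float] = None) -> int:
    """Return how long to pause, based on rate-limit response headers."""
    # Header names are case-insensitive, so normalize before the lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    if "retry-after" in lowered:
        # retry-after carries the number of seconds to wait.
        return max(0, int(lowered["retry-after"]))
    if "x-ratelimit-reset" in lowered:
        # x-ratelimit-reset is the UTC epoch second at which the quota resets.
        current = time.time() if now is None else now
        return max(0, int(float(lowered["x-ratelimit-reset"]) - current))
    return 0


wait = seconds_to_wait({"X-RateLimit-Reset": "1700001000"}, now=1700000000)
print(f"API Rate Limit Exceeded, waiting for {wait} seconds")
```

Passing `now` explicitly keeps the function easy to test; in real use it would default to the current time.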

So it is important to configure the collector carefully. There are two ways to avoid hitting the rate limit:

  • Optimize the number of services in execution: carefully select only those services that need to be monitored.

  • Decrease the pull frequency: instead of trying to get data from all the services every minute (the default setting), use the request_period_in_seconds parameter with values of 600, 1200, or more seconds. Prioritize services related to alerts, such as Audit, and call other services every 10 minutes or more. Some services could be called once or twice a day.
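To see the effect of longer pull periods, the following sketch estimates the minimum hourly request count for a set of by-repository services. The plan and the function name are hypothetical; the service names come from the data sources listed in this article, and the estimate again assumes one request per repository per pull.

```python
def hourly_requests(repositories: int, periods_in_seconds: dict) -> int:
    """Estimate the minimum requests per hour for by-repository services."""
    total = 0.0
    for period in periods_in_seconds.values():
        pulls_per_hour = 3600 / period
        total += repositories * pulls_per_hour
    return round(total)


# Hypothetical plan: frequent pulls for alert-related data, rare pulls elsewhere.
plan = {
    "dependabot_alerts": 600,   # every 10 minutes
    "commits": 1200,            # every 20 minutes
    "stargazers": 43200,        # twice a day
}
print(hourly_requests(repositories=200, periods_in_seconds=plan))  # 1817
```

The same three services pulled every 60 seconds would need at least 36000 requests per hour for 200 repositories, well above any of the limits mentioned above.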

There is a workaround that can increase the number of available requests. GitHub counts requests against the account that owns the Personal Access Token, so it is possible to create multiple accounts and use one account per service to be monitored. Those accounts must belong to the Organization from which the data will be pulled. GitHub only allows one free account per person, so the additional accounts would be paid accounts.

Migration from 2.x.x to 3.0.0

If you are using the audit service, the migration to version v3.0.0 requires a small manual step. There is a change in the service name and parameters. In version 3.0.0:

  • The audit service has been removed.

  • A new service, organization_audit, has been added, including some improvements.

The old audit service uses a pagination method that doesn't allow choosing an initial date for event gathering. The new organization_audit service uses a different method, so the since parameter can now be used, as in other services. Unfortunately, the new method cannot use the pagination token from the old method, so an initial date has to be configured by the user.

The procedure for migration is:

  1. Stop the old 2.x.x service and take note of the time.

  2. Edit the config file. Delete the audit service and add a new organization_audit service with a since value that is the time when the old collector was stopped.

For instance, if we have this configuration for 2.x.x:

"audit": { "request_period_in_seconds": "60", "persistence_reset_date": "2002-03-11" },

change it to:

"organization_audit": { "request_period_in_seconds": "60", "since": "2024-05-22T23:00:00Z" }

where 2024-05-22T23:00:00Z is the time when we stopped the old collector.

  3. Change the image to the new version 3.0.0 and restart the collector.

The new organization_audit service uses the same endpoint; events keep the same format and are stored in vcs.github.organization.audit, as before.
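The configuration change in step 2 can also be sketched programmatically. The helper below is hypothetical and only illustrates the transformation on an in-memory dictionary of services; it assumes the old period is kept and falls back to "60" when none is set.

```python
from datetime import datetime, timezone


def migrate_audit_service(services: dict, stopped_at: datetime) -> dict:
    """Replace the old audit entry with organization_audit and a since date."""
    migrated = {name: cfg for name, cfg in services.items() if name != "audit"}
    old = services.get("audit")
    if old is not None:
        migrated["organization_audit"] = {
            # Assumption: keep the old period, defaulting to "60" if absent.
            "request_period_in_seconds": old.get("request_period_in_seconds", "60"),
            # since is the moment the old collector was stopped, in UTC.
            "since": stopped_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        }
    return migrated


old_services = {
    "audit": {"request_period_in_seconds": "60", "persistence_reset_date": "2002-03-11"},
}
stopped = datetime(2024, 5, 22, 23, 0, 0, tzinfo=timezone.utc)
print(migrate_audit_service(old_services, stopped))
```

The persistence_reset_date field is dropped on purpose, since the new service starts from the since value instead.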

Configuration requirements

To run this collector, there are some configurations detailed below that you need to take into account.

| Configuration | Details |
|---|---|
| Token | You’ll need to create an access token to authenticate the collector on the GitHub server. |

Refer to the Vendor setup section to learn more about these configurations.

Devo collector features

| Feature | Details |
|---|---|
| Allow parallel downloading (multipod) | not allowed |
| Running environments | collector server, on-premise |
| Populated Devo events | standard |

Data sources

| Data source | Description | GitHub API endpoint | Collector service name | Type | Devo table | Available from release |
|---|---|---|---|---|---|---|
| Collaborators | Information about collaborators. | /repos/{owner}/{repo}/collaborators (docs: Repositories - GitHub Docs; permissions: metadata:read) | collaborators | repository | vcs.github.repository.collaborators | v1.0.0 |
| Commits | Commits made in the repository. | /repos/{owner}/{repo}/commits (docs: Repositories - GitHub Docs; permissions: contents:read) | commits | repository | vcs.github.repository.commits | v1.0.0 |
| Forks | Forks created in the repository. | /repos/{owner}/{repo}/forks (docs: Repositories - GitHub Docs; permissions: metadata:read) | forks | repository | vcs.github.repository.forks | v1.0.0 |
| Events | Information about the different events, such as resource creations or deletions. | /repos/{owner}/{repo}/events (docs: Events - GitHub Docs; permissions: metadata:read) | events | repository | vcs.github.repository.events | v1.0.0 |
| Issue comments | Comments made in every issue. | /repos/{owner}/{repo}/comments (docs: Issue comments - GitHub Docs; permissions: issues:read or pull_requests:read) | issue_comments | repository | vcs.github.repository.issue_comments | v1.0.0 |
| Subscribers | Information about the different users subscribed to one repository. | /repos/{owner}/{repo}/subscribers (docs: Watching - GitHub Docs; permissions: metadata:read) | subscribers | repository | vcs.github.repository.subscribers | v1.0.0 |
| Pull requests | Pull requests made in the repository. | /repos/{owner}/{repo}/pulls and /repos/{owner}/{repo}/pulls/{pull_number}/commits (docs: Pulls - GitHub Docs; permissions: pull_requests:read) | pull_requests | repository | vcs.github.repository.pull_requests and vcs.github.repository.pull_request_commits | v1.0.0 |
| Subscriptions | Repositories you are subscribed to. | /repos/{owner}/{repo}/subscription (docs: Activity - GitHub Docs; permissions: metadata:read) | subscriptions | repository | vcs.github.repository.subscriptions | v1.0.0 |
| Releases | Information about releases made in the repository. | /repos/{owner}/{repo}/releases (docs: Repositories - GitHub Docs; permissions: contents:read) | releases | repository | vcs.github.repository.releases | v1.0.0 |
| Stargazers | Information about users who star repositories, marking them as favorites. | /repos/{owner}/{repo}/stargazers (docs: Starring - GitHub Docs; permissions: metadata:read) | stargazers | repository | vcs.github.repository.stargazers | v1.0.0 |
| SSO Authorizations | Single sign-on authorization. | /orgs/{org}/credential-authorizations (docs: Organizations - GitHub Docs; permissions: organization_administration:read) | sso_authorizations | organization | vcs.github.organization.sso_authorizations | v1.0.0 |
| Webhooks | Webhooks created in the organization. | /orgs/{org}/hooks (docs: Organizations - GitHub Docs; permissions: admin:org_hook) | webhooks | organization | vcs.github.organization.webhooks | v1.0.0 |
| Dependabot Alerts | GitHub sends Dependabot alerts when it detects that your repository uses a vulnerable dependency or malware. | /repos/{owner}/{repo}/dependabot/alerts (docs: Dependabot alerts - GitHub Docs; permissions: vulnerability_alerts:read) | dependabot_alerts | repository | vcs.github.organization.dependabot_alerts | v2.0.0 |
| Dependabot Secrets | Lists all secrets available in an organization without revealing their encrypted values. | /orgs/{org}/dependabot/secrets (docs: Dependabot secrets - GitHub Docs; permissions: admin:org) | dependabot | organization | vcs.github.organization.dependabot | v2.0.0 |
| Actions | GitHub Actions for a repository. | /repos/{owner}/{repo}/actions/runs (docs: Workflow runs - GitHub Docs; permissions: actions:read) | actions | repository | vcs.github.repository.actions | v2.0.0 |
| CodeScan | Code scanning is a feature that you use to analyze the code in a GitHub repository to find security vulnerabilities and coding errors. | /repos/{owner}/{repo}/code-scanning/alerts (docs: Code Scanning - GitHub Docs; permissions: security_events:read) | codescan | repository | vcs.github.repository.codescan | v2.0.0 |
| Enterprise Audit | Enterprise audit events. | /enterprises/{enterprise}/audit-log (docs: REST API endpoints for organizations - GitHub Docs; permissions: admin:enterprise, read:audit_log, read:enterprise) | enterprise_audit | enterprise | vcs.github.enterprise.audit | v2.0.0 |
| Organization Audit | Organization audit events. | /orgs/{org}/audit-log (docs: REST API endpoints for organizations - GitHub Docs; permissions: read:audit_log) | organization_audit | organization | vcs.github.organization.audit | v3.0.0 |

For more information on how the events are parsed, visit our page.

Vendor setup

Personal access token authentication

To retrieve the data, we need to create an access token to authenticate the collector on the GitHub server.

GitHub App installation authentication

Authorization with SAML

What is SAML authorization?

SAML (Security Assertion Markup Language) is a markup language for security assertions that provides a standardized way to tell external applications and services that users are who they claim to be. SAML uses single sign-on (SSO) technology and allows you to authenticate a user once and then communicate that authentication to multiple applications.

Authorizing a personal access token

To use SAML, you need to authorize the token for personal use. There are two ways:

  • Authorize an existing token.

  • Create a new token and authorize it.

Minimum configuration required for basic pulling

Although this collector supports advanced configuration, the fields required to retrieve data with basic configuration are defined below.

| Setting | Details |
|---|---|
| token | Set your access token, created in the GitHub console, here. |
| username | Set your username here. |
| private_key_path | Set here the path to the .pem file that stores your private key. |
| private_key_base64 | Set here the private key file encoded in base64. |
| app_id | Set here the ID of your installed app. |
| organization | Use this parameter to define the name of the organization that owns the repository. |

Accepted authentication methods

| Authentication method | URL | Token | Username | Private key | App ID | Organization |
|---|---|---|---|---|---|---|
| Personal Access Token | OPTIONAL (default is https://api.github.com/) | REQUIRED | REQUIRED | NOT REQUIRED | NOT REQUIRED | REQUIRED |
| GitHub App installation | OPTIONAL (default is https://api.github.com/) | NOT REQUIRED | NOT REQUIRED | REQUIRED | REQUIRED | REQUIRED |
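The requirements for each authentication method can be checked before deploying. The sketch below is only an illustration: the method identifiers and the function name are hypothetical, the setting names are the ones listed under the minimum configuration, and it assumes that either private_key_path or private_key_base64 satisfies the "Private key" requirement.

```python
# Required settings per authentication method (method names are hypothetical).
REQUIRED_FIELDS = {
    "personal_access_token": ["token", "username", "organization"],
    "github_app_installation": ["app_id", "organization"],
}
# Assumption: either one of these provides the private key.
PRIVATE_KEY_FIELDS = ("private_key_path", "private_key_base64")


def missing_settings(method: str, settings: dict) -> list:
    """List the required settings that are absent or empty for a method."""
    missing = [name for name in REQUIRED_FIELDS[method] if not settings.get(name)]
    if method == "github_app_installation":
        if not any(settings.get(key) for key in PRIVATE_KEY_FIELDS):
            missing.append("private_key_path or private_key_base64")
    return missing


print(missing_settings("personal_access_token", {"token": "<token>"}))
# ['username', 'organization']
```

An empty list means the configuration carries every setting marked as REQUIRED for the chosen method.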

Run the collector

Once the data source is configured, you can either send us the required information if you want us to host and manage the collector for you (Cloud collector), or deploy and host the collector on your own machine using a Docker image (On-premise collector).

Collector services detail

This section is intended to explain how to proceed with specific actions for services.

Verify data collections

Once the collector has been launched, it is important to check that the ingestion is performed properly. To do so, go to the collector’s logs console.

This service has the following components:

| Component | Description |
|---|---|
| Setup | The setup module is in charge of authenticating the service and managing the token expiration when needed. |
| Puller | The puller module is in charge of pulling the data in an organized way and delivering the events via SDK. |

Setup output

A successful run has the following output messages for the setup module:

INFO InputProcess::GithubApiserverBasePullerSetup(example_collector,github#444,actions#predefined,all) -> The token/header/authentication is defined
INFO InputProcess::GithubApiserverBasePullerSetup(example_collector,github#444,actions#predefined,all) -> The token/header/authentication is valid
INFO InputProcess::GithubApiserverBasePullerSetup(example_collector,github#444,actions#predefined,all) -> The user whatever-user belongs to whatever-company
INFO InputProcess::GithubApiserverBasePullerSetup(example_collector,github#444,actions#predefined,all) -> Finalizing the execution of setup()
INFO InputProcess::GithubApiserverBasePullerSetup(example_collector,github#444,actions#predefined,all) -> Setup for module <GithubDataPullerActions> has been successfully executed
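When checking logs in bulk, the final setup message can be searched for automatically. The helper below is a hypothetical sketch that relies only on the wording of the sample output above; it is not part of the collector.

```python
def setup_succeeded(log_text: str) -> bool:
    """True if any log line reports a successfully executed setup."""
    return any(
        "Setup for module" in line and "has been successfully executed" in line
        for line in log_text.splitlines()
    )


sample = (
    "INFO InputProcess::GithubApiserverBasePullerSetup(example_collector,"
    "github#444,actions#predefined,all) -> Setup for module "
    "<GithubDataPullerActions> has been successfully executed"
)
print(setup_succeeded(sample))  # True
```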

This service lists all workflows that run for a repository in GitHub. All events of this service are ingested into the table vcs.github.repository.actions.

Verify data collections

Puller Output

A successful initial run has the following output messages for the puller module:

After a successful execution of the collector (that is, no error logs were found), you should be able to see the following log message:

Restart the persistence

This service makes use of persistence. To restart the persistence, the since parameter must be changed in the user configuration. This field indicates the date from which to start pulling data. For further details, go to the settings section.
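As a minimal sketch of that change, the helper below updates the since value of one service in an in-memory configuration. The function name is hypothetical, and the date format is assumed to match the since example shown in the migration section (2024-05-22T23:00:00Z).

```python
from datetime import datetime


def with_new_since(services: dict, service_name: str, since: str) -> dict:
    """Return a copy of the services config with a new since date."""
    # Raises ValueError if the date is not in the assumed format.
    datetime.strptime(since, "%Y-%m-%dT%H:%M:%SZ")
    updated = dict(services)
    updated[service_name] = {**services[service_name], "since": since}
    return updated


services = {"actions": {"request_period_in_seconds": "600", "since": "2024-01-01T00:00:00Z"}}
print(with_new_since(services, "actions", "2024-05-22T23:00:00Z"))
```

The original dictionary is left untouched, so the previous since value remains available if the change has to be reverted.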

Code scanning is a feature that you can use to analyze the code in a GitHub repository to find security vulnerabilities and coding errors. This service returns the codescan results for each repository in case it is enabled.

Verify data collection

Puller output

A successful initial run has the following output messages for the puller module:

After successful execution of the collector, you should be able to see the following log message:

Restart the persistence

This service makes use of persistence. To restart the persistence, the since parameter must be changed in the user configuration. This field indicates the date from which to start pulling data. For further details, go to the settings section.

This service gets a list of collaborators for each repository in GitHub.

Verify data collection

Puller output

A successful initial run has the following output messages for the puller module:

After the successful execution of the collector, you should be able to see the following log message:

Restart the persistence

This service makes use of persistence. To restart the persistence, the since parameter must be changed in the user configuration. This field indicates the date from which to start pulling data. For further details, go to the settings section.

Collector operations

This section is intended to explain how to proceed with specific operations of this collector.

Change log

Release

Released on

Release type

Details

Recommendations

v3.0.0

Jul 2, 2024

NEW FEATURE
IMPROVEMENT
BUG FIXING

New features

  • Audit service replaced by the new, improved Organization Audit service

  • Requests now send Time Zone UTC explicitly

Bug fix

  • INT-2482 Github events delayed (Organization Audit)

  • INT-2177 Github collector does not specify timezone

Improvements

  • Upgraded DCSDK from 1.11.1 to 1.12.1

    • Added new sender for relay in house + TLS

    • Added persistence functionality for gzip sending buffer

    • Added Automatic activation of gzip sending

    • Improved behaviour when persistence fails

    • Upgraded DevoSDK dependency

    • Fixed console log encoding

    • Restructured python classes

    • Improved behaviour with non-utf8 characters

    • Decreased default size value for internal queues (Redis limitation, from 1GiB to 256MiB)

    • New persistence format/structure (compression in some cases)

    • Removed dmesg execution (It was invalid for docker execution)

Recommended version

v2.3.0

May 7, 2024

NEW FEATURE
IMPROVEMENT
BUG FIXING

New features

  • Added new Enterprise Audit service

Bug fix

  • Fix missing parentheses

Improvements

  • Upgraded DCSDK from 1.10.0 to 1.11.1:

    • Introduced pyproject.toml

    • Added requirements-dev.txt

    • Fixed error in pyproject.toml related to project scripts endpoint

    • Updated DevoSDK to v5.1.9

    • Fixed some bug related to development on MacOS

    • Added an extra validation and fix when the DCSDK receives a wrong timestamp format

    • Added an optional config property to use the Syslog timestamp format in a strict way

    • Updated DevoSDK to v5.1.10

    • Fix for SyslogSender related to UTF-8

    • Enhanced troubleshooting: trace standardization, and some new traces have been introduced.

    • Introduced a mechanism to detect "Out of Memory killer" situations.

    • Changed default number for connection retries (now 7)

    • Fix for Devo connection retries

    • Added extra check for not valid message timestamps

Update

v2.1.0

Aug 14, 2023

IMPROVEMENT

  • Upgraded DCSDK from 1.4.4 to 1.9.1:

    • Store lookup instances into DevoSender to avoid creation of new instances for the same lookup

    • Ensure service_config is a dict into templates

    • Ensure special characters are properly sent to the platform

    • Changed log level to some messages from info to debug

    • Changed some wrong log messages

    • Upgraded some internal dependencies

    • Changed queue passed to setup instance constructor

    • Ability to validate collector setup and exit without pulling any data

    • Ability to store in the persistence the messages that couldn't be sent after the collector stopped

    • Ability to send messages from the persistence when the collector starts and before the puller begins working

    • Ensure special characters are properly sent to the platform

    • Added a lock to enhance sender object

    • Added new class attrs to the setstate and getstate queue methods

    • Fix sending attribute value to the setstate and getstate queue methods

    • Added log traces when queues are full and have to wait

    • Added log traces of queues time waiting every minute in debug mode

    • Added method to calculate queue size in bytes

    • Block incoming events in queues when there are no space left

    • Send telemetry events to Devo platform

    • Upgraded internal Python dependency Redis to v4.5.4

    • Upgraded internal Python dependency DevoSDK to v5.1.3

    • Fixed obfuscation not working when messages are sent from templates d

    • New method to figure out if a puller thread is stopping

    • Upgraded internal Python dependency DevoSDK to v5.0.6

    • Improved logging on messages/bytes sent to Devo platform

    • Fixed wrong bytes size calculation for queues

    • New functionality to count bytes sent to Devo Platform (shown in console log)

    • Upgraded internal Python dependency DevoSDK to v5.0.4

    • Fixed bug in persistence management process, related to persistence reset

    • Aligned source code typing to be aligned with Python 3.9.x

    • Inject environment property from user config

    • Obfuscation service can now be configured from user config and module definition

    • Obfuscation service can now obfuscate items inside arrays

Update

v2.0.0

Oct 28, 2022

NEW FEATURE
IMPROVEMENT
BUG FIXING
VULNS

New features

  • Actions data source: Lists all Github Actions “workflows runs” for a repository.

  • CodeScan data source: List alerts after analyzing the code in a repository to find security vulnerabilities and coding errors.

  • Dependabot Alerts data source: List alerts created due to detecting vulnerable dependencies or malware.

  • Dependabot Secrets data source: List all secrets available in an organization without revealing their encrypted values.

  • A new Rate Limiter service has been added providing a higher granularity.

  • A feature has been added to initialize the persistence of services in a granular way through the configuration file.

Improvements

  • The pulling logic for all the services has been improved, reducing the risk of duplicates.

  • Improved error management for connection issues.

  • Upgrade underlying Devo Collector SDK from v1.1.3 to v1.4.1.

  • Upgraded the underlying DevoSDK package to v3.6.4 and dependencies, this upgrade increases the resilience of the collector when the connection with Devo or the Syslog server is lost. The collector can reconnect in some scenarios without running the auto reboot service.

  • Support for stopping the collector when a GRACEFULL_SHUTDOWN system signal is received.

  • Re-enabled the logging to Devo.collector.out for Input threads.

  • Added functionality for detecting some system signals for starting the controlled stopping.

  • Added log traces for knowing system memory usage and execution environment status. Added more details in logs.

  • Added a new template functionality to ease the development of collectors (not used by this collector).

  • Refactored source code structure.

  • The Docker container exits with the proper error code.

  • Reduced the probability of hitting a DevoSDK bug in which "sender" is null.

  • When an exception is raised by the Collector Setup, the collector retries after 5 seconds. For consecutive exceptions, the waiting time is multiplied by 5 until it hits 1800 seconds, which is the maximum waiting time allowed. No maximum retries are applied.

  • When an exception is raised by the Collector Pull method, the collector retries after 5 seconds. For consecutive exceptions, the waiting time is multiplied by 5 until it hits 1800 seconds, which is the maximum waiting time allowed. No maximum retries are applied.

  • When an exception is raised by the Collector pre-pull method, the collector retries after 30 seconds. No maximum retries are applied.

Bug fixing

  • Fixed pagination and persistence bugs when pulling thousands of target repositories and events.

  • Fixed a bug in the Webhook data source that prevented complete downloading.

  • Fixed bugs related to ingestion outages.

Vulnerabilities mitigation

  • CVE-2022-1664

  • CVE-2021-33574

  • CVE-2022-23218

  • CVE-2022-23219

  • CVE-2019-8457

  • CVE-2022-1586

  • CVE-2022-1292

  • CVE-2022-2068

  • CVE-2022-1304

  • CVE-2022-1271

  • CVE-2021-3999

  • CVE-2021-33560

  • CVE-2022-29460

  • CVE-2022-29458

  • CVE-2022-0778

  • CVE-2022-2097

  • CVE-2020-16156

  • CVE-2018-2503

Update
(breaking release)