To run this collector, take the configurations detailed below into account.
Configuration | Details
Token | You’ll need to create an access token to authenticate the collector on the GitHub server.
Configuration
Refer to the Vendor setup section to learn more about these configurations.
Overview
GitHub is a version control platform that allows you to track changes to your codebase, flag bugs and issues for follow-up, and manage your product's build process. It simplifies working with other people and makes it easy to collaborate on projects. Team members can work on files and easily merge their changes with the master branch of the project. The GitHub API provides data about a hosted code repository, ranging from commits to pull requests and comments.
The Devo GitHub collector enables customers to retrieve data from the GitHub API into Devo to query, correlate, analyze, and visualize it, enabling Enterprise IT and Cybersecurity teams to make the most impactful decisions at the petabyte scale.
For more information on how the events are parsed, visit our page.
Vendor setup
To retrieve the data, we need to create an access token to authenticate the collector on the GitHub server.
SAML Authentication
If you are using SAML authentication in your account, you’ll need to authorize your token after generating it. Check how to do it in this article.
What is SAML authorization?
Support for SAML authorization was added to the collector in v1.2.0. SAML (Security Assertion Markup Language) is a markup language for security assertions that provides a standardized way to tell external applications and services that a user is who they claim to be. SAML uses single sign-on (SSO) technology and allows you to authenticate a user once and then communicate that authentication to multiple applications.
Authorizing a personal access token
To use SAML, you need to authorize the token for personal use. There are two ways:
Authorize an existing token.
Create a new token and authorize it.
Minimum configuration required for basic pulling
Although this collector supports advanced configuration, the fields required to retrieve data with basic configuration are defined below.
Setting | Details
Token | Setup requires the access token you created in the GitHub console.
Username | Setup requires your username.
Organization | Use this parameter to define the name of the organization that owns the repository.
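As a minimal sketch of the settings above, the check below reports which of the three required fields are missing or empty before the collector is launched. The key names (token, username, organization) and the function name are hypothetical and only mirror the setting names in this section; they are not taken from the collector's actual configuration schema.

```python
# Hypothetical key names that mirror the settings listed above.
REQUIRED_FIELDS = ("token", "username", "organization")


def missing_required_fields(config: dict) -> list:
    """Return the required fields that are absent or empty in config."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(config.get(field) or "").strip()
    ]


# Placeholder values for illustration only.
example_config = {
    "token": "<your-access-token>",
    "username": "whatever-user",
    "organization": "",
}
print(missing_required_fields(example_config))
```

An empty result means all three fields are filled in; it does not prove that the token itself is valid, which is checked by the setup module at run time.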
Accepted authentication methods
Authentication Method | URL | Token | Username | Organization
Personal Access Token | REQUIRED | REQUIRED | REQUIRED | REQUIRED
Run the collector
Once the data source is configured, you can either send us the required information if you want us to host and manage the collector for you (Cloud collector), or deploy and host the collector on your own machine using a Docker image (On-premise collector).
API Limitations
By default, the rate limit defined by GitHub is 5,000 requests per hour per authenticated user. This limit can change depending on the type of account. GitHub Enterprise Cloud accounts may have higher limits, up to 15,000 requests per hour.
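To put these quotas in perspective, the short sketch below converts an hourly request limit into the average spacing between requests that keeps a client within it. The function name is hypothetical, and the calculation is plain arithmetic; it does not describe how the collector's own rate limiter is implemented.

```python
def min_seconds_between_requests(requests_per_hour: int) -> float:
    """Average spacing (in seconds) that stays within an hourly request quota."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    return 3600 / requests_per_hour


for limit in (5_000, 15_000):
    spacing = min_seconds_between_requests(limit)
    print(f"{limit} requests/hour -> one request every {spacing:.2f} s on average")
```

With the default limit this works out to one request every 0.72 seconds on average, and to one every 0.24 seconds at 15,000 requests per hour.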
Collector services detail
This section is intended to explain how to proceed with specific actions for services.
Verify data collections
Once the collector has been launched, check that ingestion is working properly. To do so, go to the collector’s logs console.
This service has the following components:
Component | Description
Setup | The setup module is in charge of authenticating the service and managing the token expiration when needed.
Puller | The puller module is in charge of pulling the data in an organized way and delivering the events via SDK.
Setup output
A successful run has the following output messages for the setup module:
INFO InputProcess::GithubApiserverBasePullerSetup(example_collector,github#444,actions#predefined,all) -> The token/header/authentication is defined
INFO InputProcess::GithubApiserverBasePullerSetup(example_collector,github#444,actions#predefined,all) -> The token/header/authentication is valid
INFO InputProcess::GithubApiserverBasePullerSetup(example_collector,github#444,actions#predefined,all) -> The user whatever-user belongs to whatever-company
INFO InputProcess::GithubApiserverBasePullerSetup(example_collector,github#444,actions#predefined,all) -> Finalizing the execution of setup()
INFO InputProcess::GithubApiserverBasePullerSetup(example_collector,github#444,actions#predefined,all) -> Setup for module <GithubDataPullerActions> has been successfully executed
This service lists all workflow runs for a repository in GitHub. All events of this service are ingested into the table vcs.github.repository.actions.
Verify data collection
Puller output
A successful initial run has the following output messages for the puller module:
INFO InputProcess::GithubDataPullerActions(github,444,actions,predefined,all) -> Starting the execution of pre_pull()
INFO InputProcess::GithubDataPullerActions(github,444,actions,predefined,all) -> Reading persisted data
INFO InputProcess::GithubDataPullerActions(github,444,actions,predefined,all) -> No changes have been made in saved state. Returning saved state: {'pulling_date_from_config': '1640995200.0', 'last_pulled_date': '1641104263.0', 'ids': [1645452345]}
INFO InputProcess::GithubDataPullerActions(github,444,actions,predefined,all) -> GithubDataPullerActions(github,444,actions,predefined,all) Finalizing the execution of pre_pull()
INFO InputProcess::GithubDataPullerActions(github,444,actions,predefined,all) -> Starting data collection every 60 seconds
INFO InputProcess::GithubDataPullerActions(github,444,actions,predefined,all) -> Pull Started
INFO InputProcess::GithubDataPullerActions(github,444,actions,predefined,all) -> The collector will start pulling data since 2022-01-02T06:17:43Z
INFO InputProcess::GithubDataPullerActions(github,444,actions,predefined,all) -> Total number of repositories: 2
INFO InputProcess::GithubDataPullerActions(github,444,actions,predefined,all) -> Tag: vcs.github.api.repository.actions
INFO InputProcess::GithubDataPullerActions(github,444,actions,predefined,all) -> (Partial) Statistics for this pull cycle Number of requests made: 3; Number of events received: 1; Number of duplicated events filtered out: 1; Number of events generated and sent: 0; Average of events per second: 0.000.
...
INFO InputProcess::GithubDataPullerActions(github,444,actions,predefined,all) -> (Partial) Statistics for this pull cycle Number of requests made: 3; Number of events received: 1; Number of duplicated events filtered out: 1; Number of events generated and sent: 0; Average of events per second: 0.000.
After a successful collector execution (that is, no error logs were found), you should see the following log message:
INFO InputProcess::GithubDataPullerActions(github,444,actions,predefined,all) -> Statistics for this pull cycle Number of requests made: 30; Number of events received: 317; Number of duplicated events filtered out: 11; Number of events generated and sent: 306; Average of events per second: 23.234.
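When checking many log lines, it can help to extract the counters programmatically. The sketch below parses the statistics message shown above with a regular expression and checks that the number of events sent equals the events received minus the filtered duplicates. The function name and the pattern are illustrative assumptions based only on the sample log line; the exact log format may differ between collector versions.

```python
import re

# Matches pairs such as "Number of requests made: 30" or
# "Average of events per second: 23.234" in the sample log line.
STAT_PATTERN = re.compile(r"((?:Number|Average) of [a-z ]+): (\d+(?:\.\d+)?)")


def parse_pull_statistics(log_line: str) -> dict:
    """Extract the counters from a 'Statistics for this pull cycle' log line."""
    return {name: float(value) for name, value in STAT_PATTERN.findall(log_line)}


line = (
    "INFO InputProcess::GithubDataPullerActions(github,444,actions,predefined,all) -> "
    "Statistics for this pull cycle Number of requests made: 30; "
    "Number of events received: 317; Number of duplicated events filtered out: 11; "
    "Number of events generated and sent: 306; Average of events per second: 23.234."
)
stats = parse_pull_statistics(line)

# Events sent should equal events received minus filtered duplicates.
consistent = (
    stats["Number of events received"]
    - stats["Number of duplicated events filtered out"]
    == stats["Number of events generated and sent"]
)
print(stats)
print("Counters are consistent:", consistent)
```

For the sample line, 317 events received minus 11 duplicates gives the 306 events reported as sent, so the consistency check passes.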
Restart the persistence
This service makes use of persistence. To restart the persistence, change the since parameter in the user configuration. This field indicates the date from which to start pulling data. For further details, go to the settings section.
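The saved state printed in the puller output above stores its dates as numeric strings. Assuming these are epoch seconds in UTC (an inference from the sample log, where last_pulled_date corresponds to the "pulling data since" timestamp), the sketch below converts them to a readable form so that you can confirm where the collector will resume. The function name is hypothetical.

```python
from datetime import datetime, timezone


def epoch_to_iso(epoch_seconds: str) -> str:
    """Convert an epoch-seconds string to an ISO 8601 UTC timestamp."""
    moment = datetime.fromtimestamp(float(epoch_seconds), tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


# Values copied from the saved state in the sample puller output.
saved_state = {
    "pulling_date_from_config": "1640995200.0",
    "last_pulled_date": "1641104263.0",
}
for key, value in saved_state.items():
    print(key, "->", epoch_to_iso(value))
```

The second value converts to 2022-01-02T06:17:43Z, which matches the date reported in the "The collector will start pulling data since" log line.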
This service gets the audit log (a sequence of activities) for an organization in GitHub. This service starts collecting 90 days back from the moment the persistence is reset. All events of this service are ingested into table vcs.github.organization.audit.
Verify data collection
Puller output
A successful initial run has the following output messages for the puller module:
After a successful collector execution (that is, no error logs were found), you should see the following log message. However, this service takes a long time to finish, as it generates a huge number of events and starts pulling from 90 days back:
Restart the persistence
This service makes use of persistence. To restart the persistence, change the persistance_reset_date parameter in the user configuration. This field indicates the date from which to start pulling data. For further details, go to the settings section.
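To estimate how far back the audit service will reach after a reset, the sketch below subtracts the 90-day window described above from the moment of the reset. The function name and the example date are illustrative assumptions; the collector's own calculation may round or truncate differently.

```python
from datetime import datetime, timedelta, timezone


def audit_start_date(reset_moment: datetime, days_back: int = 90) -> datetime:
    """Earliest date the audit service is expected to pull after a reset."""
    return reset_moment - timedelta(days=days_back)


# Hypothetical reset moment used only for illustration.
reset_moment = datetime(2022, 4, 1, tzinfo=timezone.utc)
print(audit_start_date(reset_moment).isoformat())
```

For a reset on April 1, 2022, the window starts on January 1, 2022.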
Code scanning is a feature that you can use to analyze the code in a GitHub repository to find security vulnerabilities and coding errors. This service returns the code scanning results for each repository where the feature is enabled.
Verify data collection
Puller output
A successful initial run has the following output messages for the puller module:
After successful execution of the collector, you should be able to see the following log message:
Restart the persistence
This service makes use of persistence. To restart the persistence, change the since parameter in the user configuration. This field indicates the date from which to start pulling data. For further details, go to the settings section.
This service gets a list of collaborators for each repository in GitHub.
Verify data collection
Puller output
A successful initial run has the following output messages for the puller module:
After the successful execution of the collector, you should be able to see the following log message:
Restart the persistence
This service makes use of persistence. To restart the persistence, change the since parameter in the user configuration. This field indicates the date from which to start pulling data. For further details, go to the settings section.
Collector operations
This section is intended to explain how to proceed with specific operations of this collector.
Change log for v1.x.x
Release
Released on
Release type
Details
Recommendations
v2.0.0
Oct 28, 2022
NEW FEATURE, IMPROVEMENT, BUG FIXING, VULNS
New features:
Actions data source: Lists all GitHub Actions workflow runs for a repository.
CodeScan data source: Lists alerts after analyzing the code in a repository to find security vulnerabilities and coding errors.
Dependabot Alerts data source: Lists alerts created due to detecting vulnerable dependencies or malware.
Dependabot Secrets data source: Lists all secrets available in an organization without revealing their encrypted values.
A new Rate Limiter service has been added providing a higher granularity.
A feature has been added to initialize the persistence of services in a granular way through the configuration file.
Improvements:
The pulling logic for all the services has been improved, reducing the risk of duplicates.
Improved error management for connection issues.
Upgraded the underlying Devo Collector SDK from v1.1.3 to v1.4.1.
Upgraded the underlying DevoSDK package to v3.6.4 and its dependencies. This upgrade increases the resilience of the collector when the connection with Devo or the Syslog server is lost. The collector can reconnect in some scenarios without running the auto reboot service.
Support for stopping the collector when a GRACEFULL_SHUTDOWN system signal is received.
Re-enabled the logging to Devo.collector.out for Input threads.
Added functionality for detecting some system signals for starting the controlled stopping.
Added log traces for knowing system memory usage and execution environment status. Added more details in logs.
Added new template functionality to ease collector development (not used by this collector).
Refactored source code structure.
The Docker container exits with the proper error code.
Reduced the likelihood of hitting a DevoSDK bug in which "sender" is null.
When an exception is raised by the Collector Setup, the collector retries after 5 seconds. For consecutive exceptions, the waiting time is multiplied by 5 until it reaches 1800 seconds, which is the maximum waiting time allowed. No maximum number of retries is applied.
When an exception is raised by the Collector Pull method, the collector retries after 5 seconds. For consecutive exceptions, the waiting time is multiplied by 5 until it reaches 1800 seconds, which is the maximum waiting time allowed. No maximum number of retries is applied.
When an exception is raised by the Collector pre-pull method, the collector retries after 30 seconds. No maximum retries are applied.
Bug Fixing:
Fixed pagination and persistence bugs when pulling thousands of target repositories and events.
Fixed a bug in the Webhook data source that prevented complete downloading.
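The retry behavior described in the change log for the Setup and Pull exceptions can be illustrated with a short calculation. This is a sketch of the stated rule only (start at 5 seconds, multiply by 5, cap at 1800 seconds); the function name is hypothetical, and treating the cap as a hard ceiling applied to each step is an assumption rather than a description of the collector's source code.

```python
def retry_delays(attempts: int, initial: int = 5, factor: int = 5, maximum: int = 1800) -> list:
    """Waiting time in seconds before each consecutive retry."""
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(min(delay, maximum))
        # Grow the delay for the next consecutive failure, never above the cap.
        delay = min(delay * factor, maximum)
    return delays


print(retry_delays(6))
```

Under these assumptions the waiting times are 5, 25, 125, and 625 seconds, after which every further retry waits the maximum of 1800 seconds.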