GitHub collector
Overview
Take into account that version 2.0.0 contains breaking changes.
The configuration in this article is only valid for v2.x versions. If you are using a v1.x release, please upgrade to v2.x and carefully review this configuration file to deploy the collector successfully.
GitHub is a version control platform that allows you to track changes to your codebase, flag bugs and issues for follow-up, and manage your product's build process. It simplifies the process of working with other people and makes it easy to collaborate on projects: team members can work on files and easily merge their changes with the master branch of the project. The GitHub API provides data about a hosted code repository, ranging from commits to pull requests and comments.
The Devo GitHub collector enables customers to retrieve data from the GitHub API into Devo to query, correlate, analyze, and visualize it, enabling Enterprise IT and Cybersecurity teams to make the most impactful decisions at the petabyte scale.
API limitations
The main API limitation is the rate limit established by GitHub.
Like several other APIs, the GitHub API has a rate limit: applications such as collectors are allowed to make a limited number of API requests per hour. Hitting the rate limit forces the collector to stop for the time determined by the GitHub API until the next period begins (for instance, 1000 or 2000 seconds). The whole API shares this limit, so the collector stops receiving and sending events to Devo for all services. The result is a delay in event reception: some events don't arrive at Devo until several minutes later.
The rate limit depends on the section of the API used. From the collector's point of view, there are two kinds of GitHub API services: by Organization and by Repository.
For Organization services, the rate limit is 1750 requests per hour.
For Repository services, the default limit is 5000 requests per hour, but Enterprise Cloud accounts may have a higher limit, up to 15000 requests per hour.
These numbers may seem high, and they are enough for a small corporation or an individual, but they are very limiting when extracting all the available data from some corporations on GitHub. The reason lies in how data must be extracted from GitHub.
Organization services (such as audit) are simple: one call can return all the event items for the entire organization. Usually, the information is split across many pages (possibly 100 or more), and each page costs one request. Only when extracting data spanning several months or years, which can amount to tens of thousands of events, is there a risk of reaching the rate limit.
For Repository services (commits, events, forks), it is not possible to gather everything in one call. First, the collector asks for the list of repositories and then, for each repository, asks for the new items. This must be done for each service, even if there is no new data for that repository. If there is data, all the result pages must be downloaded, which requires additional requests.
Let's look at an example. Imagine that we have 200 code repositories in GitHub (some corporations have thousands) and we activate 10 services in the collector. Then we need at least 2000 requests for each data pull, even if there is no new data; additional requests are needed to load all the data pages of each repository.
If we do a data pull every minute, the limit can be depleted in a few minutes even when it is 15000: at 2000 requests per pull, the quota is exhausted after roughly 7 pulls, that is, about 7 minutes.
When the limit is reached, the GitHub API returns rate limit headers that tell you how long the collector should wait to recover. The collector writes a warning in the log, “API Rate Limit Exceeded, waiting for X seconds“, and stops for the given time.
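For reference, when the limit is exceeded the GitHub REST API answers with a 403 (or 429) status and rate limit headers similar to the ones below (the values shown are illustrative); x-ratelimit-reset contains the UTC epoch second at which the quota is restored:
HTTP/2 403
x-ratelimit-limit: 5000
x-ratelimit-remaining: 0
x-ratelimit-reset: 1716418800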
So it is important to configure the collector carefully. There are two ways to avoid hitting the rate limit:
Optimize the number of services in execution and carefully select the services that really need to be monitored.
Decrease the pull frequency. Instead of trying to get data from every service each minute (the default setting), use the request_period_in_seconds parameter with values of 600, 1200, or more seconds. Prioritize alert-related services such as audit and call the other services every 10 or more minutes; some services could be called once or twice a day (see the sketch after this list).
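As a minimal sketch of this tuning, the service blocks below reuse the request_period_in_seconds parameter in the same format as the migration example further down. The organization_audit service name is real; commits and forks are only illustrative and may not match the service names in your configuration:
"organization_audit": {
"request_period_in_seconds": "600"
},
"commits": {
"request_period_in_seconds": "3600"
},
"forks": {
"request_period_in_seconds": "43200"
}
With these assumed values, audit events are pulled every 10 minutes, commits every hour, and forks twice a day, which keeps the number of requests per hour far below the limit.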
There is a workaround that can somewhat increase the number of available requests. GitHub allows the creation of one personal access token per account, so it is possible to create multiple accounts and use one account per service to be monitored. Those accounts must belong to the organization from which the data will be pulled. Since GitHub only allows having one free account, the additional accounts would be paid accounts.
Migration from 2.x.x to 3.0.0
If you are using the audit service, the migration to version 3.0.0 requires a small manual intervention, since there is a change in the service name and parameters. In version 3.0.0:
The audit service has been removed.
A new organization_audit service has been added, including some improvements.
The old audit service used a pagination method that did not allow choosing an initial date for event gathering. The new organization_audit service uses a different method, so the since parameter can now be used, as in other services. Unfortunately, the new method cannot reuse the pagination token from the old one, so an initial date has to be configured by the user.
The procedure for migration is:
Stop the old 2.x.x service and take note of the time.
Edit the config file. Delete the audit service and add a new organization_audit service with a since value set to the time when the old collector was stopped.
For instance, if we have this configuration for 2.x.x:
"audit": {
"request_period_in_seconds": "60",
"persistence_reset_date": "2002-03-11"
},
change it to:
"organization_audit": {
"request_period_in_seconds": "60",
"since": "2024-05-22T23:00:00Z"
}
where 2024-05-22T23:00:00Z is the time when we stopped the old collector.
Change the image to the new version 3.0.0 and restart the collector.
The new organization_audit service uses the same endpoint; events use the same format and are stored in vcs.github.organization.audit as before.
Configuration requirements
To run this collector, there are some configurations detailed below that you need to take into account.
Configuration | Details |
---|---|
Token | You’ll need to create an access token to authenticate the collector on the GitHub server. |
Refer to the Vendor setup section to know more about these configurations.
Devo collector features
Feature | Details |
---|---|
Allow parallel downloading | |
Running environments | |
Populated Devo events | |
Data sources
Data Source | Description | GitHub API endpoint | Collector service name | Type | Devo table | Available from release |
---|---|---|---|---|---|---|
Collaborators | Information about collaborators. | | | | | |
Commits | Commits made in the repository. | | | | | |
Forks | Forks created in the repository. | | | | | |
Events | Information about the different events, such as resource creations or deletions. | | | | | |
Issue comments | Comments made in every issue. | | | | | |
Subscribers | Information about the different users subscribed to one repository. | | | | | |
Pull requests | Pull requests made in the repository. | | | | | |
Subscriptions | Repositories you are subscribed to. | | | | | |
Releases | Information about releases made in the repository. | | | | | |
Stargazers | Information about users who star repositories, marking them as favorites. | | | | | |
SSO Authorizations | Single sign-on authorizations. | | | | | |
Webhooks | Webhooks created by the organization. | | | | | |
Dependabot Alerts | GitHub sends Dependabot alerts when it detects that your repository uses a vulnerable dependency or malware. | Dependabot alerts - GitHub Docs | | | | |
Dependabot Secrets | Lists all secrets available in an organization without revealing their encrypted values. | Dependabot secrets - GitHub Docs | | | | |
Actions | GitHub Actions for a repository. | | | | | |
CodeScan | Code scanning is a feature that you use to analyze the code in a GitHub repository to find security vulnerabilities and coding errors. | | | | | |
Enterprise Audit | Enterprise audit events. | REST API endpoints for organizations - GitHub Docs | | | | |
Organization Audit | Organization audit events. | REST API endpoints for organizations - GitHub Docs | organization_audit | | vcs.github.organization.audit | 3.0.0 |
For more information on how the events are parsed, visit our page.
Vendor setup
Personal access token authentication
To retrieve the data, we need to create an access token to authenticate the collector on the GitHub server.
GitHub App installation authentication
Authorization with SAML
What is SAML authorization?
SAML is a markup language for security assertions that provides a standardized way to tell external applications and services that a user is who they claim to be. SAML uses single sign-on (SSO) technology and allows you to authenticate a user once and then communicate that authentication to multiple applications.
Authorizing a personal access token
To use SAML, you need to authorize your personal access token. There are two ways:
Authorize an existing token.
Create a new token and authorize it.
Minimum configuration required for basic pulling
Although this collector supports advanced configuration, the fields required to retrieve data with basic configuration are defined below.
Setting | Details |
---|---|
| Set here your access token created in the GitHub console. |
| Set here your username. |
| Set here the path to the .pem file that stores your private key. |
| Set here the private key file encoded in base64. |
| Set here the ID of your installed app. |
| Use this parameter to define the name of the organization that owns the repository. |
Accepted authentication methods
Authentication method | URL | Token | Username | Private key | App ID | Organization |
---|---|---|---|---|---|---|
Personal Access Token | OPTIONAL | REQUIRED | REQUIRED | NOT REQUIRED | NOT REQUIRED | REQUIRED |
GitHub App installation | OPTIONAL | NOT REQUIRED | NOT REQUIRED | REQUIRED | REQUIRED | REQUIRED |
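As an illustration only, a credentials block for the Personal Access Token method could look like the sketch below. All key names (credentials, token, username, organization) are hypothetical placeholders, so check the exact parameter names supplied with your collector package before using them:
"credentials": {
"token": "ghp_xxxxxxxxxxxxxxxxxxxx",
"username": "my-github-user",
"organization": "my-organization"
}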
Run the collector
Once the data source is configured, you can either send us the required information if you want us to host and manage the collector for you (Cloud collector), or deploy and host the collector in your own machine using a Docker image (On-premise collector).
Collector services detail
This section is intended to explain how to proceed with specific actions for services.
Collector operations
This section is intended to explain how to proceed with specific operations of this collector.
Change log
Release | Released on | Release type | Details | Recommendations |
---|---|---|---|---|
| Jul 2, 2024 | NEW FEATURE | New features, bug fixes, and improvements. | |
| May 7, 2024 | NEW FEATURE | New features, bug fixes, and improvements. | |
| Aug 14, 2023 | IMPROVEMENT | | |
| Oct 28, 2022 | NEW FEATURE | New features, improvements, bug fixing, and vulnerabilities mitigation. | |