Purpose
The Microsoft Azure collector retrieves data from Azure cloud computing services. Common uses are:
...
Features | Details |
---|---|
Allow parallel downloading (multipod) | The vm_metrics service cannot work in multipod mode. If you want to use the event_hubs service in multipod mode, you must not include a vm_service in the same collector. |
Running environments | |
Populated Devo events | |
Flattening pre-processing | |
Allowed source events obfuscation | |
Collector services detail
This section explains how to perform specific actions for each service.
...
Event Hubs (event_hubs)
General principles
Understanding the following principles of Azure Event Hubs is crucial:
Consumer Groups: A single event hub can have multiple consumer groups, each representing a separate view of the event stream.
Checkpointing: The SDK supports checkpoint mechanisms to balance the load among consumers for the same event hub and consumer group. Supported mechanisms include:
Azure Blob Storage Checkpoint: It is recommended to use one container per consumer group per event hub.
Partition Restrictions: Azure Event Hubs limits the number of partitions based on the event hub tier. For quotas and limits, refer to the official documentation.
Configuration options
Devo supports various configurations to cater to different Azure setups.
Event Hubs authentication configuration
Event Hubs authentication can be performed with connection strings or with client credentials (by assigning the Azure Event Hubs Data Receiver role). When both are configured, the connection string configuration takes precedence.
...
Required parameters
...
Connection string configuration
...
event_hub_connection_string
event_hub_name
Yaml

```yaml
inputs:
  azure_event_hub:
    id: 100001
    enabled: true
    services:
      event_hubs:
        queues:
          queue_a:
            event_hub_name: event_hub_value
            event_hub_connection_string: event_hub_connection_string_value
```
Json

```json
"inputs": {
  "azure_event_hub": {
    "id": 100001,
    "enabled": true,
    "services": {
      "event_hubs": {
        "queues": {
          "queue_a": {
            "event_hub_name": "event_hub_value",
            "event_hub_connection_string": "event_hub_connection_string_value"
          }
        }
      }
    }
  }
}
```
...
Client credentials configuration
...
event_hub_name
namespace
credentials.client_id
credentials.client_secret
credentials.tenant_id
Yaml

```yaml
inputs:
  azure_event_hub:
    id: 100001
    enabled: true
    credentials:
      client_id: client_id_value
      client_secret: client_secret_value
      tenant_id: tenant_id_value
    services:
      event_hubs:
        queues:
          queue_a:
            namespace: namespace_value
            event_hub_name: event_hub_name_value
```
Json

```json
"inputs": {
  "azure_event_hub": {
    "id": 100001,
    "enabled": true,
    "credentials": {
      "client_id": "client_id_value",
      "client_secret": "client_secret_value",
      "tenant_id": "tenant_id_value"
    },
    "services": {
      "event_hubs": {
        "queues": {
          "queue_a": {
            "namespace": "namespace_value",
            "event_hub_name": "event_hub_name_value"
          }
        }
      }
    }
  }
}
```
Configuration considerations
...
Multi-pod mode
...
While multi-pod mode is supported and offers the highest possible throughput for the collector, it requires a specific configuration to ensure that the collector operates efficiently and does not send duplicate events to Devo (see below). In most cases, multi-pod mode is unnecessary.
High Throughput: Multi-pod mode allows potentially the highest throughput. It is recommended for scenarios in which the user has more partitions than a single collector instance can support.
Consumer Client Thread Limit: The user should specify a client_thread_limit to ensure that the collector uses load balancing instead of explicitly assigning partition IDs to the consumer clients.
In load-balancing mode, having fewer consumer clients than partitions is allowed, but less efficient, as some consumer clients will fetch events from multiple partitions.
In load-balancing mode, having more consumer clients than partitions is also allowed, but less efficient, as some consumer clients will not be assigned any partitions.
The most efficient design ensures there are as many consumer clients as there are partitions, distributed amongst the pods. The easiest way to achieve this is to set the client_thread_limit to 1 and create as many pods as there are partitions.
Azure Blob Storage Checkpointing: Required for multi-pod mode.
Warning: Running in multi-pod mode with local checkpointing will result in duplicate events being sent to Devo, because the load-balancing operation has no visibility of the other pods' checkpoints.
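The sizing guidance above can be sketched numerically. The helper below is illustrative only (it is not collector code); it mimics how a load balancer would spread partitions across pods, each running one consumer client when client_thread_limit is 1.

```python
# Illustrative sketch: round-robin partition assignment across pods.
# With client_thread_limit=1, each pod runs one consumer client, so
# pods == partitions yields the most efficient 1:1 assignment.
def distribute_partitions(partition_count: int, pod_count: int) -> dict:
    """Assign partition IDs to pods round-robin, mimicking load balancing."""
    assignment = {pod: [] for pod in range(pod_count)}
    for partition in range(partition_count):
        assignment[partition % pod_count].append(partition)
    return assignment

# 32 partitions, 32 pods: exactly one partition per consumer client.
even = distribute_partitions(32, 32)
assert all(len(parts) == 1 for parts in even.values())

# 32 partitions, 8 pods: each consumer client must fetch from 4 partitions.
uneven = distribute_partitions(32, 8)
assert all(len(parts) == 4 for parts in uneven.values())
```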
...
Standard mode
...
Both checkpointing options are supported. In standard mode, the collector automatically creates one consumer client thread per partition per event hub.
If the event hubs you wish to fetch data from have more partitions than a single instance can support (for example, 100 event hubs with 32 partitions each would make the collector attempt to create 3,200 consumer clients), create multiple collector instances and configure each one to fetch from a subset of the desired event hubs.
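The subset strategy above can be sketched with a short helper. This is an illustrative calculation, not part of the collector: it divides a list of event hub names across a chosen number of collector instances and checks the resulting consumer client count per instance.

```python
# Illustrative sketch: split event hubs across collector instances so each
# instance creates a manageable number of consumer clients.
def split_event_hubs(hub_names: list, instance_count: int) -> list:
    """Divide event hubs round-robin into one subset per collector instance."""
    subsets = [[] for _ in range(instance_count)]
    for i, name in enumerate(hub_names):
        subsets[i % instance_count].append(name)
    return subsets

hubs = [f"hub-{n}" for n in range(100)]   # 100 event hubs, 32 partitions each
subsets = split_event_hubs(hubs, 10)      # 10 collector instances
# Each instance now handles 10 hubs x 32 partitions = 320 consumer clients,
# instead of one instance attempting all 3,200.
assert all(len(s) * 32 == 320 for s in subsets)
```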
Internal process and deduplication method
The collector uses the event_hubs service to pull events from Azure Event Hubs. Each queue in the event_hubs service represents an event hub that is polled for events.
...
Collector deduplication mechanisms
...
Events are deduplicated using the duplicated_messages_mechanism parameter. There are two methods available:
Local Deduplication: Ensures that subsequent duplicate events from the same event hub are not sent to Devo. This method operates individually within each consumer client.
Global Deduplication: Uses a shared cache across all event hub consumers for a given collector. As events are ingested into Devo, the collector checks whether the event has already been consumed by another event hub consumer; if so, the event is not sent to Devo. The global cache tracks the last 1000 events for each consumer client.
If the global deduplication method is selected, the collector automatically applies the local deduplication method as well.
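The two deduplication modes can be sketched as bounded caches of recently seen event IDs. The class and function names below are illustrative assumptions, not the collector's internals; only the 1000-event cache size comes from the text above.

```python
from collections import deque

class DedupCache:
    """Remembers the last `capacity` event IDs seen (sketch, not collector code)."""
    def __init__(self, capacity: int = 1000):
        self._order = deque()
        self._seen = set()
        self._capacity = capacity

    def is_duplicate(self, event_id: str) -> bool:
        if event_id in self._seen:
            return True
        self._seen.add(event_id)
        self._order.append(event_id)
        if len(self._order) > self._capacity:
            self._seen.discard(self._order.popleft())  # evict oldest ID
        return False

# Local mode: one cache per consumer client.
local_a, local_b = DedupCache(), DedupCache()
# Global mode: a single cache shared by every consumer client,
# applied on top of the local check as described above.
global_cache = DedupCache()

def should_send(local: DedupCache, event_id: str) -> bool:
    return not local.is_duplicate(event_id) and not global_cache.is_duplicate(event_id)

assert should_send(local_a, "evt-1") is True    # first sighting: send to Devo
assert should_send(local_b, "evt-1") is False   # already seen globally: drop
```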
...
Checkpointing mechanisms
...
The collector offers two distinct methods for checkpointing, each designed to prevent the re-fetching of events from Azure Event Hubs. These mechanisms ensure efficient event processing by maintaining a record of the last processed event in each partition.
...
Local Persistence Checkpointing
Overview: By default, the collector employs local persistence checkpointing. This method keeps track of the last event offset within each partition of an Event Hub, ensuring events are processed once without duplication.
How It Works: As the collector consumes messages from an Event Hub, it records the offset of the last processed event locally. On subsequent pulls from the Event Hub, the collector resumes processing from the next event after the last recorded offset, effectively skipping previously processed events.
Use Case: Ideal for single-instance deployments where all partitions of an Event Hub are managed by a single collector instance.
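The offset-skipping behaviour described above can be sketched as follows. This is a simplified model under stated assumptions, not the collector's implementation: an in-memory dictionary stands in for the local persistence layer, and all names are hypothetical.

```python
class LocalCheckpointStore:
    """Tracks the last processed offset per (event hub, partition) - sketch only."""
    def __init__(self):
        self._offsets = {}

    def last_offset(self, hub: str, partition: int) -> int:
        return self._offsets.get((hub, partition), -1)

    def commit(self, hub: str, partition: int, offset: int) -> None:
        self._offsets[(hub, partition)] = offset

def process_new_events(store, hub, partition, events):
    """Yield only events past the checkpoint, committing each offset as we go."""
    for offset, payload in events:
        if offset <= store.last_offset(hub, partition):
            continue  # already processed on a previous pull: skip it
        store.commit(hub, partition, offset)
        yield payload

store = LocalCheckpointStore()
batch = [(0, "a"), (1, "b"), (2, "c")]
assert list(process_new_events(store, "hub-1", 0, batch)) == ["a", "b", "c"]
# A re-delivered batch containing one new event resumes after offset 2.
batch2 = [(1, "b"), (2, "c"), (3, "d")]
assert list(process_new_events(store, "hub-1", 0, batch2)) == ["d"]
```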
Azure Blob Storage Checkpointing
...
Overview: As an alternative to local persistence, the collector can be configured to use Azure Blob Storage for checkpointing. This approach leverages Azure's cloud storage to maintain event processing state.
...
Configuration:
Option 1: Specify both an Azure Blob Storage account and container name. This method requires the collector to have appropriate access permissions to the specified Blob Storage account.
Option 2: Provide an Azure Blob Storage connection string and container name. This method is straightforward and recommended if you have the connection
string readily available.
...
Benefits:
Multi-pod Support: Enables the collector to operate in a distributed environment, such as Kubernetes, where multiple instances (pods) of the collector can run concurrently. Checkpointing data stored in Azure Blob Storage ensures that each instance has access to the current state of event processing, facilitating efficient load balancing and event partition management.
Durability: Utilizes Azure Blob Storage's durability and availability features to safeguard checkpointing data against data loss or corruption.
...