
Basic concepts

Technologies

Technologies are logical, application-level wrappers of data sources. By managing them as a building block within Service Operations it is possible to extend their technical implementation with additional information, resources (synthetic datasets, related Activeboards), or simply by identifying the models that make use of them.


Technologies in Service Operations do not necessarily rely on the existence of a collector or parser; they only require the definition of an ingestion point where data is placed or searched. In other words, Service Operations only cares about where the data is uploaded within the internal data structures of a Devo domain, and does not implement the integration or parsing itself.

The most common usage of technologies is for the ingestion of synthetic datasets—pre-uploaded data samples that can be used for rapid testing of certain data sources, or for the injection of specific events to simulate a certain condition, such as a security threat.

Service models

The concept of service models lies at the very heart of Service Operations. In the context of Service Operations, a model is a schematic representation of a customer reality in the form of a hierarchical set of relationships involving entities. These entities capture the essence of any measurable aspect of the customer reality and therefore have a state that they can report over time. Models, alongside other general configurations related to available logics, such as runtime behavior, form the so-called maps that ultimately instruct the Service Operations core to implement a customer scenario.

The following diagram depicts a simple service model:


 

In this example, there is an entity ("Server") whose status depends on a number of sub-entities, in this case metrics or KPIs. The status of the server in the model is determined by the combination of those three metrics. Arrows and their directions represent both dependency and impact rules: child nodes impact the status of the parent node, and each parent node's status depends on a combination of its child nodes' individual statuses.
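
As a rough illustration of the hierarchy described above, the model can be thought of as a tree of entities. The following is a minimal sketch only: the `Entity` class and the "CPU" and "Memory" metric names are hypothetical and are not part of any Devo API.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A node in a service model (hypothetical structure, not a Devo API)."""
    name: str
    children: list["Entity"] = field(default_factory=list)

    def leaves(self) -> list["Entity"]:
        """Return the metric-level nodes that sit below this entity."""
        if not self.children:
            return [self]
        found = []
        for child in self.children:
            found.extend(child.leaves())
        return found


# The "Server" entity depends on three metrics (metric names are illustrative).
server = Entity("Server", [Entity("CPU"), Entity("Memory"), Entity("Disk")])
leaf_names = [leaf.name for leaf in server.leaves()]
```

Extending the model to a larger scenario then amounts to nesting `server` under a higher-level entity.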

This model can be iterated and extended to represent more complex scenarios:


 

Service Operations builds on this foundation to translate business and/or operational realities into discrete, identifiable, contextualized, and measurable items. The result is a highly extensible and versatile machine that can summarize the status of a number of heterogeneous entities in a single pane, and then provide all the tools to diagnose and pinpoint the root cause of any issues. Since entities and their definitions are industry- and purpose-agnostic, Service Operations can work and provide value in virtually any scenario: IT operations, security, business processes, application monitoring, and so forth.

Entities status

As mentioned before, all entities in a model are qualified by a condition or status. This status value can vary over time, and typically comes from the result of each entity's associated query, together with a definition of the expected values for that entity ("thresholds"). Based on the combination of those two elements, an entity in Devo has a "normal", "warning", or "critical" status at any given time.

Following the same example, the Disk metric will have different values over time, refreshed every sampling period. The status of the disk metric is therefore determined by each of those values in combination with the specific thresholds defined for it, which establish what a "normal" versus an "abnormal" value is. If the disk reports a value that violates the warning or critical threshold, the metric takes the corresponding status: "warning" or "critical".
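
The threshold comparison can be sketched in a few lines. This is an illustrative example, not the product's implementation: the function name, the threshold values, and the assumption that higher values are worse are all hypothetical.

```python
def evaluate_status(value: float, warning: float, critical: float) -> str:
    """Map a sampled value to a status, assuming higher values are worse."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "normal"


# Hypothetical disk usage samples (percent), one per sampling period.
samples = [62.0, 81.5, 93.0]
statuses = [evaluate_status(v, warning=80.0, critical=90.0) for v in samples]
```

Each sampling period yields one value and therefore one status, so the metric's status can change from period to period.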

Using the impact/propagation rules defined for the model, the abnormal condition for the disk metric might or might not affect the overall status of the server. In this case, we assume the model implements such a relationship and therefore the overall status of the server is impacted by it, making its status transition to "warning".
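
One simple way to picture such a propagation rule is "the parent takes the worst status among the children that impact it". The sketch below assumes that rule purely for illustration; real models may combine child statuses differently, and the `propagate` function and metric names are hypothetical.

```python
SEVERITY = {"normal": 0, "warning": 1, "critical": 2}


def propagate(child_statuses: dict[str, str], impacting: set[str]) -> str:
    """Return the parent status as the worst status among impacting children."""
    relevant = [s for name, s in child_statuses.items() if name in impacting]
    return max(relevant, key=SEVERITY.__getitem__, default="normal")


metrics = {"CPU": "normal", "Memory": "normal", "Disk": "warning"}

# The model links Disk to the server, so the server becomes "warning".
server_status = propagate(metrics, impacting={"CPU", "Memory", "Disk"})

# If the model did not link Disk to the server, the server would stay "normal".
isolated_status = propagate(metrics, impacting={"CPU", "Memory"})
```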


 

Service status

As this process repeats from the leaves up to the parent nodes in the tree, Service Operations can determine the overall status of a service (represented as a datacenter entity in the same example). In doing so, it delivers two important benefits:

1. Determines the root cause of a problem reported at the top level ("datacenter") based on the impact and correlation rules set in the model.

2. Reduces the amount of superfluous information by aggregating all related events (individual violations of thresholds) into a single, actionable element.
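
The roll-up and root-cause steps above can be sketched as two tree traversals. This is a simplified illustration under stated assumptions: the `Node` class, the "Network" entity, and the worst-of-children roll-up rule are hypothetical and do not reflect the actual correlation logic.

```python
from dataclasses import dataclass, field

SEVERITY = {"normal": 0, "warning": 1, "critical": 2}


@dataclass
class Node:
    name: str
    status: str = "normal"  # set directly on leaves, computed for parents
    children: list["Node"] = field(default_factory=list)


def roll_up(node: Node) -> str:
    """Set each parent's status to the worst status of its children."""
    if node.children:
        child_statuses = [roll_up(child) for child in node.children]
        node.status = max(child_statuses, key=SEVERITY.__getitem__)
    return node.status


def root_causes(node: Node) -> list[str]:
    """Collect the abnormal leaves that explain an abnormal node."""
    if node.status == "normal":
        return []
    if not node.children:
        return [node.name]
    causes = []
    for child in node.children:
        causes.extend(root_causes(child))
    return causes


server = Node("Server", children=[Node("CPU"), Node("Memory"), Node("Disk", "warning")])
datacenter = Node("Datacenter", children=[server, Node("Network")])

overall = roll_up(datacenter)     # status reported at the top level
causes = root_causes(datacenter)  # leaves responsible for that status
```

Here the single "warning" reported for the datacenter stands in for every individual threshold violation beneath it, while `causes` points back to the leaf that triggered it.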

Incidents

The entities' status feature provides an immediate diagnosis of the status of a given service (or set of services), or of its behavior in the past, using the status, impact, and correlation rules defined in the model. Incidents provide a different angle for analysis and try to answer two fundamental questions: why, and how relevant.

Incidents provide several values in the process of determining the root cause of problems and their potential impact:

  1. Incidents consolidate under a common entity a potentially large number of events that would be difficult to isolate and diagnose without proper contextualization. That way, discrete alarms can be managed much more efficiently, minimizing false positives/negatives, data noise, and alarm storms.

  2. Service Operations automatically tries to determine the root cause or causes of incidents by designating a set of low-level KPIs, providing immediately actionable information that can help pinpoint and fix the source of issues more quickly.

  3. Together with the analysis of symptoms and causes, incidents are evaluated in terms of the potential impact they have on the overall service. Incidents are therefore qualified with a severity level based on the relative impact the detected issues inflict on, for example, the end users of the monitored service.

  4. Incidents are actionable elements by definition, meaning that the Devo platform can initiate actions triggered by them.
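
The consolidation described in the list above can be pictured as grouping raw events by the service they affect. The sketch below is an assumption-laden illustration: the `Event` fields, the `consolidate` function, and the "worst status wins" severity rule are hypothetical and are not how Service Operations necessarily computes severity.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    service: str  # top-level entity the event rolls up to
    kpi: str      # low-level KPI that raised the event
    status: str   # "warning" or "critical"


def consolidate(events: list[Event]) -> dict[str, dict]:
    """Group events per service into one incident with causes and severity."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event.service].append(event)

    incidents = {}
    for service, items in grouped.items():
        has_critical = any(e.status == "critical" for e in items)
        incidents[service] = {
            "event_count": len(items),
            "root_cause_kpis": sorted({e.kpi for e in items}),
            # Assumption: severity is the worst status seen among the events.
            "severity": "critical" if has_critical else "warning",
        }
    return incidents


events = [
    Event("datacenter", "Disk", "warning"),
    Event("datacenter", "Disk", "critical"),
    Event("datacenter", "CPU", "warning"),
]
incidents = consolidate(events)
```

Three separate alarms become one incident that names the KPIs involved and carries a single severity.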

Incidents are created through the detection of certain types of events. The following list summarizes all event types supported as of today:

  • Status thresholds violation

  • Dynamic thresholds violation

  • Status flapping

  • Alerts

  • Anomalies (through external jobs)

  • Forecasting (through external jobs)
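
As an example of one of these event types, status flapping can be understood as an entity changing status too often within a window of samples. The following sketch assumes that definition; the function name and the `max_changes` parameter are hypothetical, and the actual detection criteria may differ.

```python
def is_flapping(history: list[str], max_changes: int = 3) -> bool:
    """Flag a status history as flapping when it changes state too often."""
    changes = sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)
    return changes > max_changes


# One transition over the window: a genuine status change, not flapping.
steady = ["normal", "normal", "warning", "warning", "warning"]

# Five transitions over the window: the status keeps bouncing back and forth.
noisy = ["normal", "warning", "normal", "warning", "normal", "warning"]
```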

Actions

Incidents can be associated with actions, which allows closed-loop actions to be defined and executed automatically through the native alerting mechanisms in the Devo platform. This way, the datacenter incident reported by Service Operations could be linked to an automatic action (such as filing a Jira ticket or performing a remediation step).