
The following diagram depicts a simple service model:

[Diagram: simple service model]

In this example, there is an entity ("Server") whose status depends on a number of sub-entities, in this case metrics or KPIs. The status of the server in the model is determined by the combination of those three metrics. Arrows and their directions represent both dependency and impact rules: child nodes impact the status of their parent node, and each parent node's status depends on a combination of its child nodes' individual statuses.
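As a rough illustration of the parent/child structure described above, the following Python sketch represents a model as a tree. The `Entity` class, the `describe` helper, and the metric names other than "Disk" are assumptions made for this example and are not part of Service Operations.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    """A node in a service model (hypothetical structure, for illustration only)."""
    name: str
    children: List["Entity"] = field(default_factory=list)

    def is_metric(self) -> bool:
        # Leaf nodes stand for metrics or KPIs; inner nodes depend on their children.
        return not self.children


# "CPU" and "Memory" are placeholder names; only "Disk" appears in the text.
server = Entity("Server", [Entity("CPU"), Entity("Memory"), Entity("Disk")])


def describe(entity: Entity, depth: int = 0) -> List[str]:
    """Return one indented line per entity, parents before children."""
    lines = ["  " * depth + entity.name]
    for child in entity.children:
        lines.extend(describe(child, depth + 1))
    return lines


print("\n".join(describe(server)))
```

Because each child is itself an `Entity`, the same structure nests to any depth, which is what allows a model to be extended to more complex scenarios.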

This model can be iterated and extended to represent more complex scenarios:

[Diagram: extended service model]

Service Operations builds on this foundation to translate business and/or operational realities into discrete, identifiable, contextualized, and measurable items. The result is a highly extensible and versatile mechanism that can summarize the status of a number of heterogeneous entities in a single pane, and then provide the tools to diagnose and pinpoint the root cause of any issues. Since entities and their definitions are industry- and purpose-agnostic, Service Operations can provide value in virtually any scenario: IT operations, security, business processes, application monitoring, and so forth.

...

As mentioned before, all entities in a model are qualified by a condition or status. This status value can vary over time, and typically comes from the result of each entity's associated query, together with a definition of the expected values for that entity ("thresholds"). Based on the combination of those two elements, an entity in Devo has a "normal", "warning", or "critical" status at any given time.

Following the same example, the Disk metric will have different values over time, refreshed every sampling period. The status of the Disk metric is therefore defined by each of those values in combination with the specific thresholds defined for it, which establish what a "normal" versus an "abnormal" value is. If the disk has a value that violates the warning threshold, the overall status for that metric becomes "warning"; if it violates the critical threshold, the status becomes "critical".
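The threshold evaluation described above can be sketched as follows. This is a minimal example that assumes higher values are worse and that each metric has one warning and one critical threshold; the function name, the sample values, and the threshold values are hypothetical.

```python
def evaluate_status(value: float, warning: float, critical: float) -> str:
    """Map a sampled value to a status, assuming higher values are worse."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "normal"


# Hypothetical disk-usage samples (percent), one per sampling period.
samples = [62.0, 85.5, 97.1]
statuses = [evaluate_status(v, warning=80.0, critical=95.0) for v in samples]
print(statuses)
```

The critical threshold is checked first so that a value violating both thresholds is reported with the more severe status.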

Using the impact/propagation rules defined for the model, the abnormal condition for the disk metric might or might not affect the overall status of the server. In this case, we assume the model implements such a relationship and therefore the overall status of the server is impacted by it, making its status transition to "warning".
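One simple way to picture such a propagation rule is a "worst child wins" policy, sketched below. This is only an assumed rule for illustration: the text does not specify which impact rules a model uses, and the dictionary layout and names here are hypothetical.

```python
# Ordering used to compare statuses; higher means more severe.
SEVERITY = {"normal": 0, "warning": 1, "critical": 2}


def propagate(node: dict) -> str:
    """Return a node's status: its own for leaves, the worst child status otherwise."""
    children = node.get("children", [])
    if not children:
        return node["status"]
    return max((propagate(child) for child in children), key=SEVERITY.__getitem__)


server = {
    "name": "Server",
    "children": [
        {"name": "CPU", "status": "normal"},      # placeholder metric name
        {"name": "Memory", "status": "normal"},   # placeholder metric name
        {"name": "Disk", "status": "warning"},
    ],
}

print(propagate(server))
```

With the Disk metric in "warning" and the other metrics in "normal", the server transitions to "warning", matching the example in the text.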

[Diagram: status propagation from the Disk metric to the Server entity]

Service status

As this process repeats from the leaves up to the parent nodes in the tree, Service Operations can determine the overall status of a service (represented as a datacenter entity in the same example). In doing so, it provides two important benefits:

1. Determines the root cause of a problem reported at the top level ("datacenter") based on the impact and correlation rules set in the model.

2. Reduces the amount of superfluous information by aggregating all related events (individual violations of thresholds) into a single, actionable element.
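Both benefits can be illustrated with a small sketch that walks a model from the top-level entity down to its abnormal leaf metrics and reports them as a single summary. It is a simplification that assumes every abnormal leaf is a root-cause candidate; the actual impact and correlation rules are not described here, and the tree contents and function name are hypothetical.

```python
def abnormal_leaves(node: dict, path: tuple = ()) -> list:
    """Collect the paths to leaf metrics whose status is not 'normal'."""
    path = path + (node["name"],)
    children = node.get("children", [])
    if not children:
        return [path] if node["status"] != "normal" else []
    found = []
    for child in children:
        found.extend(abnormal_leaves(child, path))
    return found


datacenter = {
    "name": "datacenter",
    "children": [
        {"name": "Server", "children": [
            {"name": "CPU", "status": "normal"},
            {"name": "Disk", "status": "warning"},
        ]},
        # Hypothetical second branch, added only to show that healthy
        # branches do not appear in the summary.
        {"name": "Network", "children": [
            {"name": "Latency", "status": "normal"},
        ]},
    ],
}

causes = abnormal_leaves(datacenter)
summary = {
    "service": datacenter["name"],
    "root_causes": [" > ".join(path) for path in causes],
}
print(summary)
```

The summary is one element for the whole service, however many individual threshold violations occurred underneath it.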

...

Incidents

While the entity status feature provides an immediate diagnosis of the current status of a given service (or set of services), or of its behavior in the past, using the status, impact, and correlation rules defined in the model, incidents provide a different angle for analysis that tries to answer two fundamental questions: why, and how relevant.

Incidents add value in several ways when determining the root cause of problems and their potential impact:

  1. Incidents consolidate under a common entity a potentially large number of events that would be difficult to isolate and diagnose without proper contextualization. In that way, discrete alarms can be managed much more efficiently, minimizing false positives/negatives, data noise, and alarm storms.

  2. Service Operations automatically tries to determine the root cause or causes of incidents by designating a set of low-level KPIs, providing immediately actionable information that can help pinpoint and fix the source of issues more quickly.

  3. Together with the analysis of symptoms and causes, incidents are evaluated in terms of their potential impact on the overall service. Incidents are therefore qualified with a severity level based on the relative impact the detected issues have on, for example, the end users of the monitored service.

  4. Incidents are actionable elements by definition, meaning that the Devo platform can initiate actions triggered by them.
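The consolidation idea in the first point can be sketched as a simple grouping step. This is an illustration under stated assumptions, not how Service Operations builds incidents: the event fields, the choice to group by service, and the `consolidate` function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical raw events: individual threshold violations.
events = [
    {"service": "datacenter", "metric": "Disk", "status": "warning"},
    {"service": "datacenter", "metric": "Disk", "status": "warning"},
    {"service": "datacenter", "metric": "CPU", "status": "critical"},
]


def consolidate(raw_events: list) -> list:
    """Group raw events into one incident per service."""
    grouped = defaultdict(list)
    for event in raw_events:
        grouped[event["service"]].append(event)
    incidents = []
    for service, items in grouped.items():
        incidents.append({
            "service": service,
            "event_count": len(items),
            # Distinct metrics involved, as candidate low-level KPIs.
            "metrics": sorted({e["metric"] for e in items}),
        })
    return incidents


print(consolidate(events))
```

Three separate alarms collapse into one incident that still records which metrics were involved.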

Incidents are created through the detection of certain types of events. The following list summarizes all event types supported as of today:

...

Incidents can be associated with actions, enabling closed-loop action definition and automatic execution through the native alerting mechanisms in the Devo platform. This way, the data center incident reported by Service Operations could be linked to an automatic action, such as filing a Jira ticket or performing automatic remediation.
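The link between an incident and an action can be pictured as a registry of handlers keyed by service. The sketch below is purely illustrative and does not use the Devo alerting mechanisms or any Jira API; the `on_incident` decorator, the `dispatch` function, and the incident fields are assumptions.

```python
# Maps a service name to the handlers that run when it reports an incident.
actions = {}


def on_incident(service: str):
    """Register a handler for incidents reported on the given service."""
    def register(handler):
        actions.setdefault(service, []).append(handler)
        return handler
    return register


@on_incident("data center")
def file_ticket(incident: dict) -> str:
    # Placeholder: a real handler would call a ticketing system here.
    return f"ticket filed for {incident['service']} ({incident['severity']})"


def dispatch(incident: dict) -> list:
    """Run every handler registered for the incident's service."""
    return [handler(incident) for handler in actions.get(incident["service"], [])]


print(dispatch({"service": "data center", "severity": "warning"}))
```

Incidents for services with no registered handler produce an empty result, so unlinked incidents are simply left for manual handling.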