Basic concepts

Technologies

Technologies are logical, application-level wrappers around data sources. Managing them as building blocks within Service Operations makes it possible to extend their technical implementation with additional information and resources (such as synthetic datasets or related Activeboards), or simply to identify the models that use them.

Technologies in Service Operations do not necessarily rely on the existence of a collector or parser. They only require the definition of an ingestion point where the data is placed and from which it can be searched. In other words, Service Operations only cares about where the data sits within the internal data structures of a Devo installation. It does not implement the process of getting the data itself, and delegates that to other functions (such as Relays).

Models

Models lie at the very heart of Service Operations. In this context, a model is a schematic representation of a customer reality as a hierarchical set of relationships between entities. Each entity captures a measurable aspect of that reality and therefore has a state that it can report over time.

The following diagram depicts a simple model:

In this example, there is an entity (SERVER) whose status depends on a number of sub-entities, which in this case are metrics or KPIs (Key Performance Indicators). The status of the server in the model is therefore determined by the combination of those three metrics. Arrows and their directions convey both dependency and impact: child nodes affect the status of the parent node and, conversely, the status of a parent node depends on the individual statuses of all its children.
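
As a rough illustration of the dependency and impact directions described above, the model can be written down as a list of child-to-parent edges. This is only a sketch, not how Service Operations stores models: the `EDGES` list and `build_index` helper are hypothetical, and since the text names only the Disk metric, `CPU` and `Memory` are placeholder names for the other two.

```python
from collections import defaultdict

# Each edge points from a child node to the parent whose status it impacts.
# Only "Disk" is named in the text; "CPU" and "Memory" are placeholders.
EDGES = [("CPU", "SERVER"), ("Memory", "SERVER"), ("Disk", "SERVER")]


def build_index(edges):
    """Return (children_of, parent_of) lookups for child -> parent edges."""
    children_of = defaultdict(list)
    parent_of = {}
    for child, parent in edges:
        children_of[parent].append(child)  # dependency: parent depends on child
        parent_of[child] = parent          # impact: child impacts parent
    return dict(children_of), parent_of


children_of, parent_of = build_index(EDGES)
print(children_of["SERVER"])  # the nodes SERVER depends on
print(parent_of["Disk"])      # the node that Disk impacts
```

Reading the same edges in both directions gives the two views the arrows stand for: what a node depends on, and what it impacts.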

This model can be iterated and extended to represent more complex scenarios:

Service Operations builds on this foundation to translate business and operational realities into discrete, identifiable, contextualized and measurable items. The result is a highly extensible and versatile mechanism that can summarize the status of a number of heterogeneous entities in a single pane, and then provide the tools to diagnose and pinpoint the root cause of any issues.

Entities status

As introduced before, all entities in a model are qualified by a condition or status. This status can vary over time, and is typically derived from a status query together with a definition of the expected values for the query result. Based on the combination of those two elements, an entity in Devo has one of three statuses at any given time: NORMAL, WARNING or CRITICAL.

Following the same example as before, the Disk metric takes different values over time, refreshed at every sampling period. The status of the Disk metric is determined by comparing each of those values against the thresholds defined for it, which establish what counts as a normal or abnormal value. If Disk has a value that violates the warning threshold, WARNING becomes the status of that metric.
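
The threshold comparison can be sketched as a small function. This is a minimal illustration under stated assumptions, not the product's implementation: the `evaluate_status` function, the threshold values and the sample values are all hypothetical, and the sketch assumes that higher values are worse (as with disk usage).

```python
def evaluate_status(value, warning, critical):
    """Map a sampled value to a status, assuming higher values are worse."""
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "NORMAL"


# Hypothetical disk-usage samples (percent), one per sampling period.
samples = [62.0, 78.5, 83.0, 96.2]
statuses = [evaluate_status(v, warning=80.0, critical=95.0) for v in samples]
print(statuses)
```

The critical threshold is checked first so that a value violating both thresholds is reported with the more severe status.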

Depending on the impact and propagation rules defined for the model, the abnormal condition of the Disk metric may or may not affect the overall status of the Server. In this case, we assume the model implements such a relationship, so Service Operations would automatically set the status of Server to WARNING (orange).

Service status

Repeating the previous procedure for every entity and metric, Service Operations determines the overall status of a service (represented as a Datacenter entity in the same example) by analyzing the model that defines it. Starting from the leaf nodes of the hierarchical tree, and through impact measurement analysis, parent node statuses are calculated and propagated upwards until the root node of each tree is reached.
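
The bottom-up calculation can be sketched as a post-order walk of the tree. The sketch below assumes the simplest possible propagation rule, where a parent takes the worst status among its children. Actual models define their own impact and propagation rules, so this rule, the `Node` class and the `propagate` function are illustrative assumptions only, as are the `CPU` and `Memory` metric names.

```python
from dataclasses import dataclass, field
from typing import List

# Ordering used to compare statuses; a higher number is more severe.
SEVERITY = {"NORMAL": 0, "WARNING": 1, "CRITICAL": 2}


@dataclass
class Node:
    name: str
    status: str = "NORMAL"  # used as-is for leaves, recomputed for parents
    children: List["Node"] = field(default_factory=list)


def propagate(node):
    """Compute child statuses first, then give the parent the worst of them."""
    if not node.children:
        return node.status
    node.status = max(
        (propagate(child) for child in node.children),
        key=SEVERITY.__getitem__,
    )
    return node.status


# Hypothetical model: Datacenter -> Server -> three metrics.
server = Node("Server", children=[
    Node("CPU"),
    Node("Memory"),
    Node("Disk", status="WARNING"),
])
datacenter = Node("Datacenter", children=[server])

print(propagate(datacenter))
```

With the Disk metric in WARNING and the other metrics in NORMAL, the walk marks Server and then Datacenter as WARNING, matching the example in the text.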

Incidents

While the entities status feature provides an immediate diagnosis of how a given service (or set of services) is behaving, or how it behaved in the past, using the status, impact, and correlation rules defined in the model, incidents provide a different angle for analysis. An incident is defined as a set of abnormal events that are related to each other in some way.

Incidents add value in three important ways when determining the root cause of problems and their potential impact:

Incidents consolidate under a common entity a potentially large number of events that would be difficult to isolate and diagnose without proper contextualization. This means that discrete alarms can be managed much more efficiently, minimizing false positives and negatives, data noise, and alarm storms.
Service Operations automatically tries to determine the root cause or causes of incidents by designating a set of low-level KPIs, providing immediately actionable information that can help pinpoint and fix the source of issues more effectively.
Together with the thorough analysis of symptoms and causes, incidents are evaluated in terms of their potential impact on the overall service. Incidents are thus qualified with a severity level based on the relative impact of the detected issues, for example on the end users of the monitored service.

As noted above, incidents are created when certain types of events are detected. The following list summarizes the event types currently supported:

Status thresholds violation
Dynamic thresholds violation
Status flapping
Alerts
Anomalies (through external jobs)
Forecasting (through external jobs)
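
As an illustration of one of these event types, status flapping is commonly understood as an entity changing status too often within a short period. The sketch below is not the Service Operations detection logic, which the text does not describe. The `count_transitions` and `is_flapping` functions and the `max_transitions` parameter are hypothetical.

```python
def count_transitions(statuses):
    """Count how many times consecutive samples differ in status."""
    return sum(1 for prev, cur in zip(statuses, statuses[1:]) if prev != cur)


def is_flapping(statuses, max_transitions=3):
    """Flag a window of samples as flapping if it changes status too often."""
    return count_transitions(statuses) > max_transitions


# Hypothetical status histories, one entry per sampling period.
unstable = ["NORMAL", "WARNING", "NORMAL", "WARNING", "NORMAL", "CRITICAL"]
stable = ["NORMAL", "NORMAL", "WARNING", "WARNING"]

print(is_flapping(unstable), is_flapping(stable))
```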

A design principle of Service Operations is that the incident detection mechanism must be modular and extensible. This means that users can pick the modules to use for any given model and configure them to suit their needs. In addition, Devo will create and integrate new modules as part of the product roadmap.

Actions

Incidents can be associated with actions, which allows closed-loop actions to be defined and executed automatically. In future releases of the product, Service Operations will provide built-in action definition and monitoring. Today, actions are defined through the native alert mechanisms of the Devo platform.