Incidents viewer

Introduction

Purpose

The incidents viewer section gathers all information related to the incidents detected by Service Operations (initially only in the front end, but in the future through back-end jobs). This information can be used to understand the nature of the problems affecting the modeled services in a specific map, how those problems relate to each other, and what their root cause is, and it serves as a starting point for further exploration.

Use case highlights

This module provides the following set of use cases:

  • Comprehensive list of incidents affecting the service or services.

  • Breakdown of incidents in terms of their related events (i.e., atomic, individually detectable issues).

  • Events timeline representation.

  • Correlation of incident data for full diagnostics and root cause determination.

  • Detection of related incidents.

  • Incidents scoring based on impact level calculations.

  • Access to raw events for further diagnostics.

Incidents viewer

The Incidents viewer module provides a fully detailed display of the detected incidents affecting the overall status and/or performance of the services mapped in the model under analysis.

In Service Operations, incidents are defined as logical groups of events, each of which corresponds to an anomalous condition or status experienced by the entities and metrics in the model. For example, a single incident can encapsulate all threshold violations of a key metric in the service, such as CPU consumption, over a period of time.
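The grouping idea described above can be illustrated with a minimal sketch. This is not how Service Operations builds incidents internally; the `Event` and `Incident` classes, the `group_events` function, the entity name and the 300-second gap rule are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Event:
    entity: str      # entity in the model (hypothetical name)
    metric: str      # metric that violated its threshold
    timestamp: int   # seconds since an arbitrary start
    value: float


@dataclass
class Incident:
    entity: str
    metric: str
    events: List[Event] = field(default_factory=list)


def group_events(events: List[Event], max_gap: int = 300) -> List[Incident]:
    """Group events of the same entity and metric into incidents.

    Assumed rule: a new incident starts when the gap to the previous
    event of the same (entity, metric) pair exceeds max_gap seconds.
    """
    latest: Dict[Tuple[str, str], Incident] = {}
    incidents: List[Incident] = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.entity, ev.metric)
        current = latest.get(key)
        if current is None or ev.timestamp - current.events[-1].timestamp > max_gap:
            current = Incident(ev.entity, ev.metric)
            latest[key] = current
            incidents.append(current)
        current.events.append(ev)
    return incidents


# Four CPU threshold violations: three close together, one an hour later.
violations = [
    Event("web-server-1", "cpu", 0, 91.0),
    Event("web-server-1", "cpu", 60, 95.0),
    Event("web-server-1", "cpu", 120, 93.0),
    Event("web-server-1", "cpu", 3600, 97.0),
]
incidents = group_events(violations)
```

With these sample values, the first three violations fall into one incident and the last one opens a second incident, so the user sees two items instead of four separate events.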

By grouping all related events under the same logical element, incidents accomplish the following goals:

  • Summarize multiple disconnected events into a single, more manageable and actionable piece of information.

  • Contextualize disparate data inputs through event enrichment techniques that rely on the model topology, the type/subtype of the data analyzed, and the defined metadata fields.

  • Reduce data noise / data fatigue by aggregating and simplifying the information reported to the application users.

Through incident aggregation, users get a more accurate, summarized understanding of the overall status of their services.

Incidents list

The left-hand side of the incidents viewer module implements a vertical list of all incidents detected by Service Operations in the specified time range, where one incident in the list appears as selected. The default sorting order for the list is time-based, where the most recent incidents appear at the top of the list. Each incident is listed with the following information fields:

  • Top level affected entity name, timestamp and incident criticality level in color code (green = low impact, orange = medium impact and red = high impact).

  • Top level entity current status.

  • Service impact estimation, which determines the incident criticality level.
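The default ordering and the color code described above can be sketched as follows. The `IncidentRow` class, its field names and the sample rows are assumptions made for this example only; they do not reflect the actual data model of the application.

```python
from dataclasses import dataclass
from typing import List

# Color code as described in the list above.
IMPACT_COLORS = {"low": "green", "medium": "orange", "high": "red"}


@dataclass
class IncidentRow:
    entity: str      # top level affected entity (hypothetical names below)
    timestamp: int   # seconds since an arbitrary start
    impact: str      # "low", "medium" or "high"
    status: str      # current status of the top level entity


def default_order(rows: List[IncidentRow]) -> List[IncidentRow]:
    """Most recent incidents first, as in the default list sorting."""
    return sorted(rows, key=lambda r: r.timestamp, reverse=True)


def color_for(row: IncidentRow) -> str:
    """Map the estimated impact to its list color."""
    return IMPACT_COLORS[row.impact]


rows = [
    IncidentRow("checkout", 100, "low", "normal"),
    IncidentRow("payments", 300, "high", "critical"),
    IncidentRow("search", 200, "medium", "warning"),
]
ordered = default_order(rows)
```

Here `ordered` starts with the "payments" row, which would be rendered in red.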

It is possible to search, filter and/or apply different sorting criteria for the incidents in the list by accessing the different tools in the heading part of this section.

  • Search box: narrows the list down to the incidents whose descriptions match the text entered in the search component.

  • Filtering / sorting criteria selector: opens up a modal dialogue that allows the user to choose from a number of criteria to filter the list (incident status, owner, etc.). Furthermore, different sorting mechanisms are provided in the same modal menu as options.
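As a rough illustration of how the two tools above combine, the sketch below applies a text search and then a criteria filter to a list of incidents. The `IncidentRow` class, the `search` and `filter_by` helpers and the sample data are hypothetical, and the case-insensitive substring match is an assumption about the matching behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentRow:
    description: str
    status: str
    owner: str


def search(rows: List[IncidentRow], text: str) -> List[IncidentRow]:
    """Keep rows whose description contains the text (case-insensitive)."""
    needle = text.casefold()
    return [r for r in rows if needle in r.description.casefold()]


def filter_by(rows: List[IncidentRow], **criteria: str) -> List[IncidentRow]:
    """Keep rows whose fields equal every given criterion, e.g. status="open"."""
    return [
        r for r in rows
        if all(getattr(r, name) == wanted for name, wanted in criteria.items())
    ]


rows = [
    IncidentRow("High CPU on web-server-1", "open", "alice"),
    IncidentRow("Slow queries on db-1", "open", "bob"),
    IncidentRow("CPU spike on web-server-2", "closed", "alice"),
]
cpu_rows = search(rows, "cpu")
open_cpu_rows = filter_by(cpu_rows, status="open")
```

Searching for "cpu" keeps the two CPU incidents, and filtering by an open status then leaves only the first one.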

Incident drill-down

Any time an incident is selected in the list, such as by clicking on its description in the incidents list, all related information about it is brought up on the right-hand side of the UI. This section, called incident drill-down, is populated with all high and low level information gathered by Service Operations on the incident, and aims to facilitate the understanding of the incident itself by trying to answer why and when the incident was triggered, how it can be fixed, or how much it affects the overall service.

The UI in this section differentiates two main areas:

  1. Events timeline: Shows the full list of events that Service Operations considers to be part of the incident.

  2. Incident analysis: Multi-section element that organizes the gathered data in terms of high-level details, events details, impact assessment and related incidents.

The events timeline widget projects all individual entities that are considered part of the incident on a time series graphical representation, in which the colored line shows the status of the entity over time. This way it is possible to know the status changes of that entity over a period of time.

As for the four incident analysis tabs available in the second half of this section, their purpose is as follows:

  1. Details: Provides a more descriptive view of the selected incident, showing its reason, determined root causes and impact, if any. Also, it lists all associated data structures in Devo and provides a direct link to the related information by clicking on the RUN button.

  2. Related incidents: This section lists any other incidents currently active in the platform that Service Operations has determined to be related in some way to the current one. Incidents are shown in a table and are navigable, i.e., clicking on a description row makes Service Operations switch to the selected incident and its corresponding information.

  3. Events: This table lists all events that Service Operations has determined to be causing the incident. In Service Operations, an event is an individual abnormal behavior or situation detected by the application. For example, a transition from a normal status to a warning one is considered an event, as is a flapping condition of a status over a period of time. The events section shows all those events with some additional information for each of them, and can send them all to a monitor for comparison purposes.

  4. Impact: This section analyzes how much the incident is affecting the overall service as defined in the model. Impact is calculated using the impact definitions in the entities' configuration forms, as explained in the models configuration section of this documentation.
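The two kinds of events mentioned for the Events tab, a status transition and a flapping condition, can be illustrated with a small sketch. This is not the detection logic used by Service Operations; the function names, the sample data and the thresholds (at least 3 changes within 600 seconds) are assumptions made for the example.

```python
from typing import List, Tuple

Sample = Tuple[int, str]            # (timestamp in seconds, status)
Transition = Tuple[int, str, str]   # (timestamp, old status, new status)


def transitions(samples: List[Sample]) -> List[Transition]:
    """Return one entry for every change of status between samples."""
    changes: List[Transition] = []
    for (_, prev), (ts, curr) in zip(samples, samples[1:]):
        if curr != prev:
            changes.append((ts, prev, curr))
    return changes


def is_flapping(samples: List[Sample], window: int = 600, min_changes: int = 3) -> bool:
    """True if min_changes status changes occur within window seconds.

    Both thresholds are assumed values, not product defaults.
    """
    times = [ts for ts, _, _ in transitions(samples)]
    for i in range(len(times) - min_changes + 1):
        if times[i + min_changes - 1] - times[i] <= window:
            return True
    return False


samples = [
    (0, "normal"),
    (60, "warning"),
    (120, "normal"),
    (180, "warning"),
    (240, "warning"),
]
changes = transitions(samples)
flapping = is_flapping(samples)
```

In this sample the status changes three times within two minutes, so each change is a transition event and the series as a whole qualifies as flapping under the assumed thresholds.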