Incidents viewer
Introduction
Purpose
The incidents viewer section gathers all information related to the incidents detected by Service Operations. This information can be used to understand the nature of the problems affecting all modeled services in a specific map, as well as their relationships, the root cause of issues, and other options to explore.
Use case highlights
This module provides the following set of use cases:
Comprehensive list of incidents affecting the service or services.
Breakdown of incidents in terms of their related events (atomic, individually detectable issues).
Events timeline representation.
Correlation of incident data for full diagnostics and root cause determination.
Incident scoring based on impact level calculations.
Access to raw events for further diagnostics.
Incidents viewer
The Incidents viewer module provides a full-detail display of the detected incidents affecting the overall status and/or performance of the services mapped in the model under analysis.
In Service Operations, incidents are defined as logical groups of events, each of which corresponds to an anomalous condition or status experienced by the entities and metrics in the model. For example, a single incident can encapsulate all threshold violations of a key metric in the service, such as CPU consumption, over a period of time.
By grouping all related events under the same logical element, incidents accomplish the following goals:
Summarize multiple disconnected events into a single, more manageable and actionable piece of information.
Contextualize disparate data inputs through event enrichment techniques.
Reduce data noise / data fatigue by aggregating and simplifying the information reported to the application users.
Through incident aggregation, users get a more accurate and summarized understanding of the overall status of their services.
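The grouping idea above can be illustrated with a minimal Python sketch. It does not reproduce the correlation logic used by Service Operations, which is not described here. It assumes a hypothetical rule: events on the same entity and metric belong to the same incident as long as consecutive events are no more than `max_gap` apart. The names `Event`, `Incident`, `group_events` and `max_gap` are illustrative only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Event:
    entity: str          # entity that experienced the anomaly
    metric: str          # metric that violated its threshold
    timestamp: datetime


@dataclass
class Incident:
    entity: str
    metric: str
    events: list = field(default_factory=list)


def group_events(events, max_gap=timedelta(minutes=15)):
    """Group events on the same entity and metric into incidents.

    A new incident starts when the gap to the previous event for that
    entity/metric pair exceeds max_gap (an assumed rule, not the product's).
    """
    incidents = []
    latest = {}  # (entity, metric) -> most recent incident for that pair
    for event in sorted(events, key=lambda e: e.timestamp):
        key = (event.entity, event.metric)
        current = latest.get(key)
        if current is None or event.timestamp - current.events[-1].timestamp > max_gap:
            current = Incident(event.entity, event.metric)
            incidents.append(current)
            latest[key] = current
        current.events.append(event)
    return incidents
```

With this rule, three CPU threshold violations on one host within a few minutes collapse into a single incident, while a violation an hour later opens a new one.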
Incidents list
The left-hand side of the incidents viewer module implements a vertical list of all incidents detected by Service Operations in the specified time range, where one incident in the list appears as selected. The default sorting order for the list is time-based, where the most recent incidents appear at the top of the list. Each incident is listed with the following information fields:
Top level affected entity name, timestamp and incident criticality level.
Service impact estimation, which determines the incident-criticality level.
Summary
The summary shows basic information about the incidents: the total number of incidents, followed by the number of incidents per severity level, color coded (white = low impact, orange = medium impact, red = high impact).
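As a rough illustration of how such a summary could be derived, the following Python sketch counts incidents per severity level and attaches the color code listed above. The function name `summarize` and the lowercase level names are assumptions made for the example, not part of the product.

```python
from collections import Counter

# Color coding as listed in the summary description.
SEVERITY_COLORS = {"low": "white", "medium": "orange", "high": "red"}


def summarize(severity_levels):
    """Return the total and the per-level counts for a list of severity levels."""
    counts = Counter(severity_levels)
    return {
        # The total includes incidents whose level has no color code.
        "total": len(severity_levels),
        "by_level": {
            level: {"count": counts.get(level, 0), "color": color}
            for level, color in SEVERITY_COLORS.items()
        },
    }
```

Incidents with a level outside the three color-coded ones still count toward the total in this sketch, but are not broken down further.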
Incidents detail
Severity value: A value from 0 to 100 that specifies the severity of the incident. When no impact information is available, the severity does not apply and the value is N/A.
Incident status: The status of the incident, which reflects the actions taken by the user. An incident is New when it is created. Once a user has seen the incident and it is under investigation, the user can change the status to ACK. When the actions needed to resolve the incident have been reviewed and taken, the user can change the status to Closed. An incident can also be set to Archived if it does not need to be investigated. Finally, the user can delete an incident, which sets its status to Deleted.
Incident date: The date when the incident was created.
Severity indicator: The name of the entity affected by the incident.
Total events: The number of events associated with this incident.
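The incident statuses described above can be modeled as a small enumeration. The Python sketch below is illustrative only: it assumes that New, ACK and Closed form the main review path and treats Archived and Deleted as side exits, which the text above does not spell out. `IncidentStatus` and `advance` are hypothetical names.

```python
from enum import Enum


class IncidentStatus(Enum):
    NEW = "New"
    ACK = "ACK"
    CLOSED = "Closed"
    ARCHIVED = "Archived"
    DELETED = "Deleted"


# Main review path; Archived and Deleted are treated as side exits here,
# which is an assumption of this sketch.
_NEXT = {
    IncidentStatus.NEW: IncidentStatus.ACK,
    IncidentStatus.ACK: IncidentStatus.CLOSED,
}


def advance(status):
    """Return the next status on the New -> ACK -> Closed path."""
    if status not in _NEXT:
        raise ValueError(f"{status.value} has no next status on this path")
    return _NEXT[status]
```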
It is possible to search, filter and/or apply different sorting criteria for the incidents in the list by accessing the different tools in the heading part of this section.
Filtering / sorting criteria selector: Opens a modal dialog that allows the user to choose from a number of criteria to filter the list (incident status, owner, and so on). The same modal also offers different sorting options.
Sort by: Timestamp, status, severity or action.
Order: By ascending (from past to present) or descending (from present to past) date.
By search text: Filter the list of incidents to those whose descriptions match the text entered in the search component.
By state: New, ACK, Closed, Archived.
By severity: High, Medium, Low, Unknown, No impact.
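The combined effect of these filters can be sketched in Python as follows. This is a simplified model of the behavior, not the application's code: `IncidentRow` and `filter_and_sort` are hypothetical names, the search is assumed to be a case-insensitive substring match, and only date ordering is shown.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class IncidentRow:
    description: str
    status: str        # "New", "ACK", "Closed" or "Archived"
    severity: str      # "High", "Medium", "Low", "Unknown" or "No impact"
    timestamp: datetime


def filter_and_sort(rows, text="", statuses=None, severities=None, descending=True):
    """Apply the search text, state and severity filters, then sort by date.

    A filter set to None is not applied. descending=True lists the most
    recent incidents first, matching the default order of the list.
    """
    needle = text.lower()
    selected = [
        row for row in rows
        if needle in row.description.lower()
        and (statuses is None or row.status in statuses)
        and (severities is None or row.severity in severities)
    ]
    return sorted(selected, key=lambda row: row.timestamp, reverse=descending)
```

For example, `filter_and_sort(rows, text="cpu", statuses={"New", "ACK"})` keeps only open incidents whose description mentions "cpu", newest first.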
Incidents Drill-Down
Incidents overview
Severity score: A value from 0 to 100. The severity score depends on the number of users impacted by the incident.
Average service impact: The average service impact, expressed as a percentage.
Max service impact: The maximum number of impacted users. For example: 40 of 200 users have been impacted by this incident.
Last service impact: The number of impacted users at the last moment the app received information about the incident. For example: 20 of 200 users were impacted in the last update of the incident.
Entities involved: The number of entities involved in the incident.
Status: The current status of the incident. There is a button to change the status of the incident from ‘New’ to ‘ACK’ to ‘Closed’.
Owner: The owner of the incident. There is a button to change and choose new owners of the incident.
Creation date: The date when the incident was created.
Last event date: The date of the last event.
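To make the three service impact fields concrete, the Python sketch below computes them from a chronological series of impact updates. How Service Operations actually calculates these values is not documented here; the sketch assumes the average is the mean of the per-update impact percentages, and `service_impact` is a hypothetical helper.

```python
def service_impact(samples):
    """Compute average, max and last service impact.

    samples: chronological list of (impacted_users, total_users) tuples,
    one per incident update; total_users is assumed to be greater than 0.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    percentages = [100.0 * impacted / total for impacted, total in samples]
    return {
        # Mean of the per-update percentages (assumed definition).
        "average_pct": sum(percentages) / len(percentages),
        # Update with the highest number of impacted users.
        "max": max(samples, key=lambda sample: sample[0]),
        # Most recent update received for the incident.
        "last": samples[-1],
    }
```

For the updates `[(10, 200), (40, 200), (20, 200)]`, the max service impact is 40 of 200 users and the last service impact is 20 of 200 users, in line with the examples in the field descriptions.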
Modal to change status
Modal to change the owner
Involved data tables
This section lists the tables involved in the events that make up the incident. It shows the name of the entity related to each affected table, which you can access by clicking the 'search' icon.
Events timelines
This section shows the full list of events that Service Operations considers to be part of the incident.
Type of severity: A color indicator that changes depending on whether the severity is high, medium or low.
Entity name: The name of the entity affected by the event.
Query: The result of the query at the moment the event occurred.
Table: The table that contains the data for this event.