Customer Event Management

Event - Event is a change of state which has significance for the management of
a resource or service.

The term Event is also used to mean an alert or notification created by any service,
resource or monitoring tool.

Events typically require operations personnel to take actions, and often lead to
Incidents being logged.

Resource - Resources represent physical and non-physical components used to construct services. They
are drawn from the Application, Computing and Network domains, and include, for example, network
elements, software, IT systems, and technology components(1).

Resource information is recorded in the CMDB and is maintained throughout its lifecycle by
the Configuration Management process. The integrity of the whole set of resources for a specific
customer is under the control of the Assurance organization. Capacity and performance of the resources
present in the infrastructure are managed by the Capacity & Performance Management process.

This concept is aligned with the Configuration Item definition from ITIL (any component that needs to be
managed in order to deliver the services to the end users (2)). It also covers the ITIL Asset definition
(assets of a service provider, including anything that could contribute to the delivery of a service (3)).

Service - Service is a means of delivering value to Customers by facilitating outcomes customers want to
achieve without the ownership of specific costs and risks(1).

Three aspects of this definition are applied in MSTOP:

 The first is related to the activities every MS organization delivers to another MS organization
under the scope of a Managed Services contract and in accordance with MS Service Functions and
Elements. For example, the 1st Level Operations and 1st Level Incident Management activities
provided from a Global MS Delivery Center to an MSIP are a service.
 The second aspect is associated with the complete Managed Services that are delivered by Ericsson
to its Customer (Operator).
 The last aspect is the "process" definition of service, which is related to the parts that are
developed to be provided to the end user within products. In this definition services can be
included in multiple products, packaged differently, with different prices, etc. (2), always constituting
a specific and well-defined way of providing value to the end user. Services depend
on resources.

This third aspect can be illustrated by considering a Post-Paid voice subscription (product) where voice,
data, customer service, billing, etc. are all services that are packaged into it to be provided to the end
user. Each of these services relies on the correct and efficient performance of one or multiple resources
(e.g. voice service requires switches, base station controllers, base stations, radios, the software inside
each of these resources, etc.).

Services (in this third aspect) and resources are usually the scope of a Managed Services contract.

Event Management: A process designed to efficiently interpret any detectable or discernible
occurrence (event) that has significance to the management of the customer's (operator's) infrastructure,
evaluating its dimension and impact, and to initiate the appropriate control action. To be efficient,
this process depends on knowing the status of the infrastructure and detecting any deviation from
normal or expected operational levels. This process is thus highly dependent on good monitoring and
control systems, which can be divided into (1):

 Active monitoring tools that poll key resources to determine their status and availability. Any
exceptions will generate alerts that need to be communicated to the appropriate tool or team
for action
 Passive monitoring tools that detect and correlate operational alerts or communications
generated by resources
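
The active monitoring approach described above can be illustrated with a minimal Python sketch. It is not part of MSTOP or of any monitoring tool: the `Alert` type, the `poll_resources` function and the probe callables are hypothetical names introduced here, and a real probe would query the resource itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alert:
    resource: str
    detail: str


def poll_resources(probes: Dict[str, Callable[[], bool]]) -> List[Alert]:
    """Poll each resource once and return one alert per exception found.

    `probes` maps a resource name to a hypothetical status check that
    returns True when the resource is available.
    """
    alerts = []
    for name, probe in probes.items():
        try:
            available = probe()
        except Exception as exc:
            # A probe that cannot complete is reported as an exception too.
            alerts.append(Alert(name, f"probe failed: {exc}"))
            continue
        if not available:
            alerts.append(Alert(name, "resource unavailable"))
    return alerts


# Illustrative probes only (assumed resource names).
example_probes = {
    "switch-01": lambda: True,
    "bsc-02": lambda: False,
}
```

Calling `poll_resources(example_probes)` yields a single alert for `bsc-02`, which would then be passed to the appropriate tool or team for action.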

At the present stage the New MSTOP Event Management process depends on simple correlations
executed by the 1st Level Operations technician or engineer, mainly between alarms, changes,
maintenance works or other incidents. Further efficiency will be achieved by automating correlation and
selected operational activities.

Event Management provides the entry point of many operational processes and activities. In addition, it
provides a way of comparing actual performance and behavior against design standards and SLAs.

The objectives of this process are:

 To ensure that any meaningful event is detected as soon as possible
 To ensure that standardized methods and procedures are used to initiate the correct control
action
 To prevent unimportant events from flooding the areas responsible for control actions, causing an
important event to be "lost"
 To continuously execute real-time performance (2) and capacity monitoring, and to provide data for
the related processes and activities (1)
 To correctly handle external incident tickets (TT) raised by the Customer Service Desk, Help
Desk or Customer Care, or by other internal areas

Scope: Event Management can be applied to any aspect of the MS delivery that needs to be controlled
and which can be automated:

 Resources
 Environmental conditions (e.g. fire and smoke detection)
 Security (e.g. intrusion detection)
 Normal activity (e.g. tracking the use of an application or the performance of a server) (1)

It also covers the treatment of calls originating from the Customer's Service Desk, Help Desk or Customer
Care Center.

Event Management covers the actions (standard or automated) to handle events other than the ones
classified as exceptions, which are treated by the Incident Management process. EM will correlate events
in order to confirm that a new incident has been identified, excluding outages caused by planned activities.
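
The exclusion of outages caused by planned activities can be sketched as follows. This is an illustrative Python example only: the `Alarm` and `PlannedActivity` records and the matching rule (same resource, alarm raised inside the activity window) are assumptions made here, not the correlation logic of any MSTOP tool.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class PlannedActivity:
    resource: str
    start: datetime
    end: datetime


@dataclass
class Alarm:
    resource: str
    raised_at: datetime


def is_explained_by_planned_work(alarm: Alarm, activities: List[PlannedActivity]) -> bool:
    """True when the alarm was raised on a resource during its planned window."""
    return any(
        act.resource == alarm.resource and act.start <= alarm.raised_at <= act.end
        for act in activities
    )


def candidate_incidents(alarms: List[Alarm], activities: List[PlannedActivity]) -> List[Alarm]:
    """Keep only the alarms that no planned activity accounts for."""
    return [a for a in alarms if not is_explained_by_planned_work(a, activities)]
```

An alarm raised on a resource during its maintenance window is dropped from the candidate list, while an alarm on the same resource after the window, or on a different resource, is kept for further handling.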

Event Management is the entry point of incident notifications related to Services and Resources, i.e.
under the scope of 1st Level Operations and 2nd Level Operations incident management.

Basic concepts related to this process are:

 Alarm
 Event
 Incident

Event Management's basic function lies in the correct identification of meaningful notifications. It is thus
paramount that what the MS Assurance organization is looking for in the Customer's
infrastructure be clearly stated, and that the supporting systems and the action procedures be defined
beforehand. To achieve this, four basic steps are performed: detection, filtering,
categorization and correlation.

As ITIL states, events occur continuously, but not all of them are detected or registered. It is therefore
important that everybody involved in designing, developing, managing and supporting services and
resources for a Customer (operator) understands what types of events need to be detected.

New detection rules may cause an overload of notifications, which may cause important events to go
unnoticed. It is thus important that filtering criteria also be determined, taking into consideration the
specific characteristics of every resource and service.

After events are filtered a category based on significance is attributed to the event, as follows:

 Informational – This refers to an event that does not require any action and does not represent
an exception. They are typically stored in the system or service log files and kept for a
predetermined period. Informational events are typically used to check on the status of a device
or service, or to confirm the successful completion of an activity. Informational events can also be
used to generate statistics (such as the number of users logged on to an application during a
certain period) and as input into investigations (such as which jobs completed successfully before
the transaction processing queue hung).
 Warning – A warning is an event that is generated when a service or resource is approaching a
threshold. Warnings are intended to notify the appropriate person, process or tool so that the
situation can be checked and the appropriate action taken to prevent an exception (1). Warnings
are not raised for service outage or resource failure. Thresholds are usually defined during
service design, with the appropriate level to gauge the status of a system. Warnings could be
generated when certain levels are reached for disk storage, network activity, processor utilization,
and so on. In our example, if one or more interconnection routes reach the pre-defined utilization
threshold, it would be considered a warning event as it might start to impact on the service
performance - if it is an isolated peak it might be discarded as informational but if it signals a
consistent traffic increase it must be carefully treated.
 Exception – An exception means that a service or device is currently operating abnormally
(however that has been defined). Typically, this means that an OLA and SLA have been
breached and the business is being impacted. Exceptions could represent a total failure, impaired
functionality or degraded performance (1). ITIL reinforces that an exception does not always
represent an incident: for example, an exception could be generated when an unauthorized
device is discovered on the network. MSTOP, on the other hand, considers that an exception will
always be treated as an incident, prompting the Incident Management process. It
acknowledges that, once an event is classified as an exception, the probability of facing an incident is much
higher than that of the other possible cases, which justifies taking the necessary actions to restore
the service to the acceptable level. Still using the same example, when the interconnection path
starts consistently losing calls due to congestion, we have an exception event.
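
The three categories can be illustrated with the interconnection route example used above. The Python sketch below is a simplification under stated assumptions: the function names, the `losing_calls` flag and the 0.8 utilization threshold are hypothetical, since real thresholds are defined during service design.

```python
from enum import Enum


class Category(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


def categorize_route(utilization: float, losing_calls: bool,
                     warning_threshold: float = 0.8) -> Category:
    """Categorize an interconnection route event by significance.

    `warning_threshold` is an assumed value used only for illustration.
    """
    if losing_calls:
        # The route is operating abnormally: calls are lost to congestion.
        return Category.EXCEPTION
    if utilization >= warning_threshold:
        # The route is approaching its limit but is still carrying traffic.
        return Category.WARNING
    return Category.INFORMATIONAL


def needs_incident(category: Category) -> bool:
    """MSTOP treats every exception as an incident."""
    return category is Category.EXCEPTION
```

A route at 85% utilization without call loss is categorized as a warning, and the same route once it starts losing calls becomes an exception that prompts Incident Management.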

Event Management and Capacity & Performance Monitoring

As part of the detection of events, Event Management provides the means for monitoring the real-time
capacity and performance (2) of the customer's (operator's) infrastructure. Unlike the
activity performed by Capacity & Availability Management, which is concerned with the network
capacity and performance needed to meet the customer's present and future business demand, the activity
executed under Assurance is focused on monitoring the current status of the network to detect
any abnormal behavior or to use as input to event correlation, with little regard to the
future conditions of the infrastructure.

Event vs Alarm

An alarm is quite similar to an event. In fact, an alarm is always an event, as it
is some occurrence that is detected in a resource or service. The converse is not true,
as not all events are alarms: some events might be the result of resource state polling, for example.
MSTOP currently considers alarms as the main process input.

Event vs Incident

An incident always refers to a failure that causes service interruption or quality degradation. A
detected event may be classified as an incident whenever it causes (or threatens to cause) a
service disruption. Event Management provides the means to detect an incident early, so as to speed up
its restoration and avoid breaches of the SLA.

Challenges

There are a number of challenges that might be encountered:

 The primary challenge for this process is to have the necessary tools to provide
comprehensive monitoring of the customer's (operator's) services and resources, and also to obtain the
necessary effort to exploit the benefits of the tools (1)
 Setting the correct level of filtering. Setting the level of filtering incorrectly can result in either
being flooded with relatively insignificant events, or not being able to detect relatively important
events until it is too late (1)
 Having the correct dedication to monitoring and event handling may be difficult, as resources tend
to be directed to handling incidents rather than warning or informational events
 Obtaining the correct definitions of thresholds from equipment vendors, from third-party integrators
or from internal areas (as per Solution Development and Retirement) may prove to be a
challenge, as information is sometimes missing or not treated as a priority

Critical Success Factors

 One of the most important CSFs is achieving the correct level of filtering. This is complicated by
the fact that the significance of events changes. For example, a user logging into a system today is
normal, but if that user leaves the organization and tries to log in, it is a security breach (1)
 1st Level Operations staff have appropriate knowledge and understanding of alarms reported by
the FMS (Fault Management System) and of deviations detected through active monitoring
 Information related to all network changes is available to 1st Level Operations and is
consulted in a disciplined manner as needed
 The rejection rate for External Trouble Tickets recorded by the Operator’s Service Desk or Customer
Care Center is kept to a minimum
 New services are designed with Event Management in mind - the exact targets and mechanisms
for monitoring should be specified and agreed during the Availability Management, Capacity &
Performance Management and Solution Development and Retirement processes
 Constant trial and error to verify the effectiveness of the filtering rules

Security Requirements

The following security controls in ISR (Information Security Requirements) are associated with this
process:

Control  Description

12.4.1   Event logs recording user activities, exceptions, faults and information security events should
         be produced, kept and regularly reviewed.
16.1.2   Information security events should be reported through appropriate management channels
         as quickly as possible.
16.1.3   Employees and contractors using the organization’s information systems and services should
         be required to note and report any observed or suspected information security weaknesses
         in systems or services.
16.1.4   Information security events should be assessed and it should be decided if they are to be
         classified as information security incidents.
16.1.6   Knowledge gained from analyzing and resolving information security incidents should be
         used to reduce the likelihood or impact of future incidents.

Privacy Requirements

The following privacy controls are applicable to this process. More information can be found at Baseline
Privacy Requirements (BPR).

Control  Description

1.2.3    Personal Information shall only be collected for the purposes specified in the
         contract with the Data Controller.
1.2.4    Use, retention and disposal of private information must follow BPR.

Key Performance Indicators

KPIs to measure the effectiveness of the process from the operator's and the MS organization's viewpoints:

KPI    Description

EM001  Number of events acknowledged (for historical comparisons)
EM002  Percentage of events acknowledged compared to the number of reported events
EM003  Number of Escalations for non-detected events
EM004  Number of duplicated events reported in Fault Management System (for historical comparisons)
EM005  Percentage of duplicated events reported on FMS (Fault Management System) compared to total
       number of events reported
EM006  Number of rejected External Trouble Tickets due to incomplete information
EM007  Number of rejected External Trouble Tickets raised for planned activities
EM008  Mean time to report Events
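
The two percentage KPIs (EM002 and EM005) are simple ratios over the number of reported events. A minimal Python sketch follows; the function names are hypothetical, and returning 0.0 when no events were reported is a choice made here for illustration, not something the process defines.

```python
def percentage(part: int, total: int) -> float:
    """Return part/total as a percentage; 0.0 when total is zero (assumed convention)."""
    if total == 0:
        return 0.0
    return 100.0 * part / total


def em002(acknowledged_events: int, reported_events: int) -> float:
    """EM002: percentage of events acknowledged compared to reported events."""
    return percentage(acknowledged_events, reported_events)


def em005(duplicated_events: int, reported_events: int) -> float:
    """EM005: percentage of duplicated events on FMS compared to reported events."""
    return percentage(duplicated_events, reported_events)
```

For example, 45 acknowledged events out of 50 reported gives an EM002 of 90%, and 5 duplicated events out of 200 reported gives an EM005 of 2.5%.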

Service Functions, Organization & Roles

Service Functions

The Event Management process is mainly performed by the 1st Level Operations service function. The 2nd Level
Operations service function is the key interface for 1st Level Operations, whose services are invoked
while executing the Event Management process.

Organization

This process is performed by the 1st Level Operations organization.

The 1st Level Operations function under the Assurance organization is responsible for the execution of proactive
and reactive maintenance activities to ensure that services provided to customers are continuously
available and performing to SLA performance levels, grouping the activities that require 24x7 execution.

It performs continuous resource status and performance monitoring to promptly detect possible failures. It
is responsible for handling trouble reports or calls received from the Operator's Customer Care Center or
Service Desk, or from the internal Service Desk, and for ensuring 1st level resolution of incidents. It is also
responsible for incident lifecycle management, following up on the TT from its creation
until the incident has been solved and the TT closed.

This SF is responsible for initiating the Incident Management process and for following the escalation path
agreed in the WLA. It is also responsible for following up on change execution, approving the start of execution
activities and ensuring that the activity has not negatively impacted the customer's infrastructure. It is
also responsible for engaging an external supplier when that supplier is part of the affected service or part of the
functional escalation.

Roles

No specific role is necessary for this process.

Tools

Event Management uses the MSDP Fault Management System (One FM) to obtain the status of the
infrastructure resources, mainly via alarms. Specific tools for performance monitoring and correlation are
under discussion.

Responsibility Matrix

Roles (matrix columns, left to right): 1st Level Operations; 2nd Level Operations; Ops Support / MS DM;
Change Manager; Operator's Service Desk or Customer Care Center; MSIP OA

Activity                                                        Assignments (in column order)

Customer specific detection and thresholds definitions          A/R
Detection and Filtering Rules Definition                        R, A
Active Monitoring Execution                                     A/R
Network Surveillance                                            A/R
Provide Change Execution Information                            I, A/R, I
Secure relevance and accurateness of External Trouble Tickets   R, S, A
Real time Performance Monitoring                                A/R

Customer Incident Management

Incident - An unplanned interruption to a service or a reduction in the quality of a
service. Failure of a resource that has not yet affected service is also an incident (for
example, the failure of a redundant module) (1). An incident is the effect or potential effect
of the error upon the end user (2).

A resolution or incident work-around should be established as quickly as possible in
order to restore the service to the end users with minimum disruption (2). Incident
Management is the process that focuses on fast and effective service restoration.

Incident Management is the process group designed to restore normal service operations (within the SLA
limits) as quickly as possible and to minimize the adverse impact on the Customer's business operations,
thus ensuring that the best possible levels of service quality and availability are maintained (1).
This process also defines the functional and hierarchic escalation procedures related to Incident
Management activities.

The following picture shows the simplified view of the process activities, interfaces and roles:

The objectives of this process are to ensure:

 That all incidents are resolved as fast as possible, taking into full consideration the severity of the
incident and the WLA requirements
 That focus on the incident resolution is not taken away by non-incident-related activities
 Optimum re-use of incident resolution knowledge (from multiple customers' networks and
services)
 That overall business risk is minimized
 That the correct solution or most efficient work-around for minimizing the negative impact on the
customer's services is applied
 That the Maintainability Performance is improved, resulting in improved availability
 That communication flow is effectively managed, allowing Ericsson and the customer to make informed
business decisions

Scope: IM deals with failures and capacity or performance breaches - any event that disrupts or can
potentially disrupt a service - as a result of the Event Management processes. Incidents may be
found by event detections (such as alarms), communicated directly by a customer (the Customer's users or
the Customer's employees as Enterprise customers) or by the Customer's Service Desk or Customer
Care Center, but they are initially handled according to the Event Management processes. This is done to
ensure that all necessary information is collected and that the event is verified and correlated; the
Incident Management process is only initiated when an exception is identified. It is important to
emphasize that when an incident is reported or identified with higher severity, the Event Management
process offers a quick initiation of Incident Management, avoiding any delays in the incident handling
activities.

Incident Management deals with incidents on the customer's resources or services, directly related to the
service functions 1st Level Operations and 2nd Level Operations - and with MSTOP horizontal level 1
process groups Service Management & Operations and Resource Management & Operations. Incidents
related to service functions Customer Problem Management and Help Desk - and with MSTOP horizontal
level 1 process group Customer Relationship Management - are not under the scope of this process and
are described in the Service Desk processes (see picture below).

Unlike the ITIL specification of Incident Management, service requests are not handled by this
process but by the Event Management process, with the aim of increasing the focus on incident handling
activities within this process.

Basic Concepts
Basic concepts related to this process are:

 Incidents
 Work-Arounds
 E-CAB
 Incident Prioritization
 Hierarchic
 Known Error Database

Incident Management deals with the restoration of services and resources affected by incidents.
It is a reactive process in the sense that it is triggered by an outage or an outage threat.
Nevertheless, the time required to restore the service or to cease a threat must be reduced to a
minimum, and thus this process benefits from re-using knowledge in the best way, either by accessing
the Known Error Database or by consulting incident resolution models with proven results from previous
incident resolutions.

Because time is a critical factor when the customer's infrastructure is facing an outage, it is important that
timescales defining the different incident priorities are agreed beforehand and are readily available to all
staff involved in the incident handling activities. The IM process defines four different priority levels,
whose corresponding definitions must be included in the WLA for the specific contract / customer.
These four priority levels are described below:

PRIORITY  INCIDENT MANAGEMENT         MSIP OA INVOLVEMENT         PROBLEM MANAGEMENT

Minor     No mandatory involvement    No mandatory involvement    No mandatory root cause
          on restoration              on restoration              investigation (2)

Medium    No mandatory involvement    No mandatory involvement    No mandatory root cause
          on restoration              on restoration              investigation (2)

Major     Optional involvement        Optional involvement        Optional investigation
          (WLA or per escalation)     (WLA or per escalation)     (WLA) (2)

Critical  Mandatory involvement       Mandatory involvement       Mandatory root cause
                                                                  investigation (2)

From the table above it is easy to see that critical incidents have a specific procedure, following the ITIL
recommendation, to cope with their shorter timescales and greater impact. In this scenario the participation
of both the Incident Manager and MSIP Operations Assurance is paramount to the success of the process.

Another important aspect is that the priority is always a result of the urgency to restore the service and the
impact of the incident on the customer's business. The usual priority is pre-defined in the WLA based mainly
on the impact of the outage, which roughly drives the urgency to have it corrected. It is important to note
that the priority of the incident can change during its lifecycle due to changes in the circumstances: impact
and urgency may increase or decrease, and this must be reflected in the incident priority. The definition of
VIP resources or services usually increases the priority of the incident due to the need for a more urgent
response, and must be clearly established beforehand.

The general priority coding system takes into consideration both urgency and impact to define the target
resolution time:

                  IMPACT
                  High      Medium    Low

URGENCY  High     Critical  Major     Medium
         Medium   Major     Medium    Minor
         Low      Medium    Minor     Planned (3)
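
The priority coding matrix above can be expressed as a lookup table. The Python sketch below is illustrative only: the `PRIORITY_MATRIX` and `incident_priority` names are hypothetical, and the values are copied from the matrix as (urgency, impact) pairs.

```python
# Keys are (urgency, impact); values are the resulting incident priority.
PRIORITY_MATRIX = {
    ("high", "high"): "Critical",
    ("high", "medium"): "Major",
    ("high", "low"): "Medium",
    ("medium", "high"): "Major",
    ("medium", "medium"): "Medium",
    ("medium", "low"): "Minor",
    ("low", "high"): "Medium",
    ("low", "medium"): "Minor",
    ("low", "low"): "Planned",
}


def incident_priority(urgency: str, impact: str) -> str:
    """Look up the priority for a given urgency and impact (case-insensitive)."""
    key = (urgency.strip().lower(), impact.strip().lower())
    if key not in PRIORITY_MATRIX:
        raise ValueError(f"unknown urgency/impact combination: {urgency!r}, {impact!r}")
    return PRIORITY_MATRIX[key]
```

Because priority can change during the incident lifecycle, the lookup would be repeated whenever urgency or impact is reassessed: an incident that starts at medium urgency and low impact (Minor) becomes Major if urgency and impact rise to high and medium.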

Incident vs Problem

In ITIL terms, an incident is an unplanned interruption to a service or a reduction in its quality,
and a problem is the underlying cause of one or more incidents. While critical incidents are
sometimes referred to as problems, this is a wrong association. Incidents and problems are
different concepts: an incident will remain an incident forever. It might grow in impact or priority,
from a minor to perhaps a critical incident, but it will never "become" a problem. Likewise,
a problem will always remain a problem.

Incident Management vs Problem Management

As a result of the different nature of the incident and problem concepts, the processes
derived from them also differ in timing and purpose. The IM process needs to respond
quickly, and thus external interferences are minimized. When the cause of the
incident needs to be investigated at the same time, the Problem Management process
is involved as well, but the incident manager must ensure that service restoration and
underlying cause investigation are kept separate.

Challenges

 The ability to detect incidents as early as possible. This requires education of the technical
team involved with the infrastructure operations, and an efficient Event Management process and
tools.
 Reinforcing that incident models must be created based on successful incident restorations,
that an accurate Known Error Database must be maintained, and that these knowledge bases are
effectively used during incident investigation & diagnosis. This will enable Incident Management
staff to learn from previous incidents and also to track the status of resolutions (4).
 Full support of the MSIP Operations Assurance (MSIP OA) function and its active participation in
high priority incidents, securing the management of the customer's expectations and the flow of
information towards the customer's management organization, and deciding on the internal
escalation to support the customer. MSIP OA can also contribute decisively to the Incident
Management activities by bringing its deep knowledge of the customer's business requirements.
 Availability of an accurate Configuration Management System to determine relationships between
resources, check the infrastructure topology and refer to the history of resources (4).
 Integration with the Service Level Management procedure (under the Contract
Management process). This will provide the information (as per the WLA definition) to assist the
organizations involved in incident management in correctly assessing the impact and urgency of incidents and the
timelines, and in defining and executing escalation procedures. SLM will also benefit from
the information learned during Incident Management, for example in determining whether service
level performance targets are realistic and achievable.
 The interworking and the performance of the local organizations involved in incident resolution.
While the GSC Assurance organization is ultimately responsible for the performance of the incident
restoration, it is mandatory that the support organizations in the regions (Field Operations and
Deployment) also have clear targets to work to, associated with the overall incident management
SLA, and that their performance is maintained within these targets.

Critical Success Factors

The following factors will be critical for successful Incident Management (4):

 An efficient Event Management process
 Clearly defined targets to work to, as defined in the WLA
 Clear delivery solution, with well-defined responsibilities distributed among the service delivery
units and the MSIP according to the MS Blueprint and Service Functions
 Strong and empowered Incident Managers in the GSC
 Integrated support tools to drive and control the process
 OLAs that are capable of influencing and shaping the correct behavior of all support staff
 Active involvement of the MSIP OA in high priority incident resolution