
Can your organisation deal with a sudden and immediate loss of its IT systems? And can you avoid it?

A guide to removing user downtime as part of a core business initiative

Includes a practical check-list for assessing the effectiveness of your proposed High Availability, Business Continuity or Disaster Recovery solution

By Neil Robertson

© Copyright of the Neverfail Group 2005


FOREWORD

It is no secret that technology moves forward at a seemingly ever increasing pace. As a result, it becomes progressively more difficult for senior management to keep up to date with what is technically possible and therefore what has become commercially desirable.

It is a fact that almost every organisation is continually increasing its dependency on IT to do business. Critical applications are woven into the very fabric of day-to-day business operations, yet the senior management of the vast majority of organisations still fail to even review their vulnerability, exposure and liability to IT downtime.

The options available to organisations for IT Disaster Recovery (DR), High Availability (HA) and Business Continuity (BC) planning have fundamentally changed in what is achievable, whilst the complexity and cost of deployment / on-going management have reduced significantly.

The IT industry has been built on the delivery of solutions that are faster, easier to use and cheaper than their predecessors. This process delivers better value to all businesses, whilst making the solution available to an ever increasing market. For the larger organisations this means better protection at lower cost. For the smaller organisation, it means that vastly superior protection is not only available to them for the first time, but is financially justifiable and, perhaps more importantly, commercially desirable.

This book will help corporate executives and senior IT staff to re-evaluate their core strategy in the use, protection and availability of the critical IT systems that underpin the moment-by-moment running of their business.

It will also help IT staff to cut through the marketing spin and address the critical issues during the purchasing process of systems and software that claim to eliminate user-downtime.

The emphasis has moved from protecting data to protecting the productivity of the user; whether that user is a member of your staff, or one of your customers, suppliers or prospects, and whether they are internal, on-line or accessing applications and information on your intranet, extranet or website.

Every business recognises its increasing reliance on IT for the day-to-day operation of its business. Now, management has a clear choice on whether they are prepared to recognise and quantify their corporate exposure to downtime and, based on those findings, tolerate downtime or not.

The significance of that decision is relevant to your shareholders, stakeholders, employees, partners, customers and prospects, who look to you to protect and advance their respective interests.

Neil Robertson
April 2005
CONTENTS

Foreword
A Glossary of Terms
Why do we tolerate Downtime?
Quantification of Downtime Risk
High Availability or Disaster Recovery?
Does Downtime really Matter?
Who is Responsible and therefore who is to Blame?
Next Steps
Critical Server Identification
Data Protection
High Availability and Disaster Recovery: Selection criteria
Architecture: Basic Structure of the HA / DR solution
Reliability & Monitoring
Application Software Protection / Auxiliary Software Protection
Switch / Failover Criteria and Performance
Bandwidth Considerations
Summary
The Critical Components required to remove Downtime
About the Author

www.neverfailgroup.com
A GLOSSARY OF TERMS

Users:
The company's most valuable assets and probably greatest overhead: your staff. Users also include prospects, customers and suppliers that access IT information and services via email, extranet and your website.

Downtime:
The disconnection of users from the software and data they require in order to work effectively (for whatever reason).

User Downtime:
The real cost of server downtime: lost user productivity, broken commitments, reduced performance and missed expectations.

Planned Downtime:
Deliberately disconnecting the users from the software and data they require in order to work effectively.

Unplanned Downtime:
Randomly disconnecting the users from the software and data they require in order to work effectively.

Critical Server:
A server that enables user access to software and data that is fundamental to the user in their moment-by-moment operation and in the moment-by-moment operation of the business.

Critical Server Downtime:
The failure of the business and the users to undertake and execute critical actions required in the moment-by-moment operation of the business.

Repeated Critical Server Downtime:
The career threatening, sleep inhibiting, worst nightmare scenario, because whenever it happens, for some users, customers, suppliers and stakeholders, it will be the worst possible and most damaging moment.

High Availability (HA) Strategy:
Ensuring that the users either remain connected, or can be reconnected to a working application in the shortest time possible.

Disaster Recovery (DR) Strategy:
Ensuring that the users either remain connected, or can be reconnected to a working application in the event of a disaster such as fire, flood, hurricane, terrorism (etc), in the shortest time possible.

Disaster:
Re-definition: the disconnection of users from a working critical application / data for an extended period of time, for WHATEVER reason.
WHY DO WE TOLERATE DOWNTIME?

Within your business, if you could remove the risk of any downtime at zero cost, with zero complexity, zero additional resource, and with zero overhead, wouldn't you do it for every server in your business?

Surely, logically, it would be stupid not to!

The ONLY commercially justifiable reason for not protecting every server against downtime is that historically it has been seen as too expensive and too complex when compared to the probable risk and perceived cost of downtime.

Two dynamics have changed.

The first is that business dependency on critical applications is increasing and will continue to do so. Therefore the cost and risk associated with downtime is also increasing.

The second is the significant reduction in the complexity and cost of IT solutions that remove the threat of downtime.

The objective of this book is to help senior management to re-define their priorities and objectives in maximising the benefits of technology within their businesses, whilst removing the threat and cost of user-downtime for the most critical IT services.

The world has changed.

Any downtime is now a matter of choice - provided that management have the knowledge and information to make that choice and the will to execute on that decision.
QUANTIFICATION OF DOWNTIME RISK

Is downtime something that happens to a server, or something that happens to the user?

The answer is both, but which is more important?

The physical problem is a technical issue that requires fixing. This is relatively low cost and low risk. A small number of people can eventually get a system operational again in a matter of days with varying degrees of data loss. In terms of cost of resource and equipment, this is inexpensive.

The commercial damage is the impact on the users and the consequences of that. The disconnection of users from critical applications and data can seriously damage a business. This is high cost and high risk.

The downtime is the period during which the users are unable to access the software and / or data they require in order to work effectively. Therefore we can commercially define downtime as:

"The period of time required to re-connect the users to a working application / data."

This definition may seem obvious, but a detailed market survey showed that almost all organisations focus on protecting data as the cornerstone of a high availability / disaster recovery solution. User re-connection is a secondary objective and therefore a secondary purchasing requirement. This focus on data is a reflection of the heritage of HA / DR solutions, as that is what the vast majority of them do.

Yet the real risk of significant damage to a business from downtime is ONLY the impact on the user and their ability to function effectively, or at all, without access to the critical application and the most up-to-date data.

Eliminating user-downtime delivers the optimum solution for both High Availability and Disaster Recovery strategies. The commercial objective is to do so at a cost and a level of simplicity that makes it highly relevant and financially justifiable to all.

The delivery of such a solution starts at the design stage, not as an after-thought to a 'heritage' data protection solution.

That is how 'next generation' products are born.
HIGH AVAILABILITY OR DISASTER RECOVERY?

The definitions of High Availability (HA) and Disaster Recovery (DR) have evolved over time to mean different things.

The most common understanding is that HA addresses the need to improve the reliability and availability of critical IT systems in the normal day-to-day working environment.

DR, on the other hand, is the planning and execution of a process that enables the restoration of critical IT systems after a catastrophic event, such as the total loss of a premises through fire, flood, hurricane, tornado, earthquake, terrorism (etc).

It is obvious, both statistically and logically, that a total failure of a critical server for whatever reason is far more likely to happen in the normal day-to-day working environment than the occurrence of a disaster that destroys the office.

So it would be reasonable to assume that providing protection to critical servers would focus first on the higher, more common risk of HA. Yet, in practice, the opposite is true.

Most organisations have formal DR plans for their IT systems (even if it is just the off-site copy of a daily backup) whilst few have any HA plans. This usually reflects the focus of the senior management and the availability of budget.

The purpose of both the HA and the DR plan is to get the users working as fast as possible after a failure, with the minimal loss of data. Yet more often than not, the method of achieving this is to focus all the attention and budget on an electronic copy of data, leaving the complex process of building the environment to take that data until after the disaster has occurred.

The difference between HA and DR is the location of the recovery server. In an HA environment, it is in the same location. In a DR environment, it is in a second, separate location. If your business has two locations, the difference in cost should only be the cost of bandwidth.

If you are looking for HA or DR, make sure you are buying both.

If you want to quantify the level of HA / DR you are purchasing, measure it by the time the user is disconnected from the critical application and data.

A good example of an HA / DR requirement would be in the use of email. This is now accepted as fundamental to business in much the same way as a phone system is. There is no argument that email downtime is damaging, however the impact of downtime after a disaster can literally kill a business.

If your primary place of business was destroyed, how long do you want to wait until you restore the communication between the key constituents of your business?

• Staff
• Customers and Prospects
• Suppliers
• Shareholders, Stakeholders
• Market & Press
• Insurance organisation and builders

The full recovery process might take months to complete, but the communication needs to be uninterrupted. You need all of your people on-line and communicating comfort, clarity and contingency actions, instantly.
For most organisations, the recovery process commences with an IT person holding a digital tape of yesterday's data, a box of software CDs of operating systems, application software, anti-virus software, some notes on the original configuration (if you're lucky) and a blank hardware server. Usually, that means you are at least 36 – 48 hours away from re-connecting email to anybody.

That is enough time for all parties to form an opinion on whether or not to risk continuing to trade with you, based on what little they know and what they have heard and read. If you are not communicating with them, the information is probably coming from your competitors.

One disaster precipitates another, both of which have a disastrous impact upon your business.

Yet the loss of the same service for an extended period of time in a normal trading environment is also a disaster and can deliver a similarly disastrous impact.

The ideal scenario is to protect the critical applications and servers to ensure that the user remains connected to a working application / data, whatever happens, perhaps locally in HA mode, perhaps in a second location offering HA and DR.

Despite the obvious sense this makes, the message is not reaching the organisation's decision makers and budget holders. They are living in the past, because the information they have is simply out of date.

DOES DOWNTIME REALLY MATTER?

The IT industry has grown up with an acceptance of the 'break-fix mentality'. This is essentially based upon the premise that 'It is going to break. Then we are going to fix it.'

In the 1980s and 1990s, few IT systems were truly critical and downtime was an acceptable fact of life. That statement is no longer true for a growing number of software solutions in almost every company.

The impact of downtime will change based upon a number of criteria:

1. The number of users of the software application / data.

2. The level of dependency of those users on that application to perform their jobs.

3. The importance of the activity of the user to the moment-by-moment running of the business.

4. The ability of the user to "catch up" on lost time through out of hours work, or whether the downtime cannot be re-captured (i.e. loss of a telesales function for a period).

5. The importance and value of the role of the users affected.

6. The implications of those users not working effectively, namely:
   a. The direct cost of the users' lost productivity.
   b. The risk / impact elsewhere, both internally and externally, of that lost productivity.

For example, for most organisations, email represents over 70% of all their communication, internally and externally. If it stops working for a couple of minutes, most organisations can live with that.
However, if email is lost for an entire day the impact on the business is going to be significant. Email is used in every aspect of the day-to-day operation of the business, internally and externally. Marketing, sales, management, services, finance, partners, customers and suppliers will all be affected.

It becomes obvious that not all IT systems are equal, as some are far more important than others. Whilst downtime on some applications is an inconvenience, for others it is a disaster in terms of productivity and potential real commercial damage.

One size does not fit all.

Therefore, should one HA / DR strategy fit every server?

Probably not!

In order to determine the commercial argument on the financial justification and commercial need, we need to think through the justification rationale. Let's do some mathematics:

Simple Justification

This cost justification is deliberately conservative in all aspects of the maths.

Assume that Microsoft Exchange email is being used for 30% of the working day.

Assume that there are 200 such users in the company.

Assume that the role of these users spans the organisation, covering Senior Management, Marketing, Sales, Finance, Production and Services.

Assume the average annual 'fully loaded' cost of an individual is £75,000 (for most organisations this is extremely low). A simple process to get the average cost per full time employee is to divide the total overhead of the business for last year by the average number of employees last year.

Assume each individual works 40 hours a week for 50 weeks, which equates to 2,000 hours per annum. Then each hour costs the company £37.50.

A four hour system outage has a quantifiable cost of:

200 users x 30% usage x 4 hours = 60 x 4 = 240 lost hours
240 x £37.50 = £9,000 for one 4 hour outage

But this is not the real cost, as it is easy to argue that the user could find something else to do (although a quick survey of your users will deliver a very different message).
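If you want to run the same arithmetic with your own figures, the calculation above can be expressed as a short Python script. This is only a sketch of the direct-cost sum described in the text; the 200 users, 30% usage, £75,000 fully loaded cost and 2,000 working hours are the illustrative assumptions used above, so substitute your own numbers:

def hourly_cost(annual_fully_loaded_cost, hours_per_year=2000):
    """Cost of one employee-hour, e.g. 75,000 / 2,000 = 37.50."""
    return annual_fully_loaded_cost / hours_per_year

def direct_outage_cost(users, usage_fraction, outage_hours, cost_per_hour):
    """Direct cost of the productive hours lost during an outage."""
    lost_hours = users * usage_fraction * outage_hours   # 200 x 0.30 x 4 = 240
    return lost_hours * cost_per_hour                     # 240 x 37.50 = 9,000

rate = hourly_cost(75000)                                 # £37.50 per hour
cost = direct_outage_cost(users=200, usage_fraction=0.30,
                          outage_hours=4, cost_per_hour=rate)
print(f"Direct cost of one 4-hour outage: £{cost:,.2f}")  # £9,000.00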
The real cost is the consequence of 240 lost hours of optimum productivity in one four hour period during which 80% of all communication has ceased.

During that time, all those users are unable to carry out the tasks and activities they are aware of, or to address the demands and requests that they did not receive during this period.

The commercial consequences of this lost time are almost impossible to accurately calculate, as they are dependent on what didn't take place for every individual.

For example, if customer service calls are captured within your extranet, but transported by email, no extranet service calls will be received for 4 hours. This may well exceed your service level agreements.

An existing customer may decide to cancel their relationship due to this experience, but action it on the renewal date six months later.

A critical quote may get delayed, or a "red hot" enquiry has a slow response. Given the diversity of the moment-by-moment use of email within almost all organisations, the list of potential disasters caused by downtime is very long.

There are three questions that arise when reviewing your strategy for downtime:

1. What is a reasonable direct cost that can be placed against user-downtime of a critical IT server?

2. What is a reasonable estimate of the indirect cost / RISK of lost revenue, damaged reputation, legal liability (etc) of user-downtime for 2 / 4 / 8 / 24 / 48+ hours?

3. Who is responsible in your business for undertaking that calculation and making the recommendations of appropriate action and priority?

If you can't name the individual, or identify the documentation that this individual has produced and regularly maintains, the answer is probably no-one.

Without wishing to "scaremonger", having zero knowledge of your business' exposure to critical IT infrastructure failure could be seen as anything from poor management through to negligence.

That is a real risk.
WHO IS RESPONSIBLE AND THEREFORE WHO IS TO BLAME?

The 'break-fix mentality' has compounded the ownership problem, as it has allowed us to ignore the issue.

The mentality of "it breaks and we fix it" may be acceptable for low-level non-critical servers, but it is no longer commercially sensible for critical applications.

It is remarkable that whilst we have all adopted technology into the very fabric of our businesses, most organisations have no reporting process, knowledge or understanding of the regularity or implications of downtime.

A major outage of a CRM solution may take the entire telesales activity offline for a day, stop web-based enquiries from reaching the sales department, remove the ability to take new and manage existing support calls, and stop the sales staff undertaking planned negotiation activities and sending out quotes. But who is counting that cost?

What becomes immediately apparent is that the real cost to the business is potentially going to be many tens of thousands of pounds.

If the sales and customer services department could cross charge another department for the lost productivity, and make up the revenue shortfall through inter-departmental billing, which department would have to pay?

Would that bill be based on the tangible damage of lost man-time, or would it include the real cost of lost customers who don't buy; lost customers who don't renew their service contracts at some later stage; or intangibles such as the damaged reputation of the business?

If there is not an individual who has the responsibility for reporting on the potential risk of downtime, for making a formal effort to quantify the likely costs and for looking for cost effective solutions to minimise that risk, then I am sorry to say that the 'Three monkey rule' applies:

If a company's board of directors can't hear it, see it and refuse to discuss it, then the threat does not exist.

The alternative scenario is that the threat has never been brought to the board's attention. It should come as no surprise that the alternative scenario is usually the way the board will act after a disaster.

If there is an individual with the responsibility for ensuring user-uptime, you may want to give them a copy of this book.

If not, then I recommend a copy is given to each member of the board and investors. Better still, provide a report on the risk and support it with the book! (They may even thank you at some later date.)
NEXT STEPS

Let us assume that you fully recognise that user-downtime is bad for business and warrants re-investigation.

The key criterion to consider is that the risk / consequential cost of user-downtime is greater than the cost of protecting against it, making protection a commercial necessity.

This is driven by the fact that your business's dependency on IT and the risk of downtime have grown, whilst the cost and complexity of removing downtime have reduced.

The level of urgency of deployment should be driven by the recognition and quantification of the size of the risk / commercial exposure.

How are you going to identify and prioritise the criticality of your servers?

Critical Server Status:

A survey of the business activities of an organisation will usually quickly reveal the business IT applications that are critical to the smooth operation of the business.

The first issue to recognise is that the criticality is always driven by the nature of the application, not the data. The most common and obvious applications include:

• Email
• Document Management Systems
• Customer Relationship Management
• Sales Force Automation
• Support & Help Desk
• Critical business applications
• Intranet / Extranet / Web applications

It is also important to recognise that the level of criticality increases with the period of downtime.

For example, losing a non-critical application for a day is a nuisance, but not commercially dangerous. In some cases even losing the data from today and resorting to last night's backup tape is acceptable.

However, with critical applications the level of risk and probable commercial damage increases exponentially the longer they are unavailable. Being one hour late may be redeemable whereas being 48 hours late is probably not.

The recovery time from the failure of a critical server is dependent on what went wrong, when it happened and the steps required to deliver a fully operational solution re-connected to the users.

It is almost impossible to determine the likely period of downtime for any given server, so rather than try to determine every possible eventuality and what would be required to recover, review the probable commercial risk and damage associated with a given length of downtime.

Keep in mind that a 60 minute outage is often as likely as a 12 hour outage.
CRITICAL SERVER IDENTIFICATION

Produce a list of servers and the applications they run.

Consider the implications of downtime for each of these AFTER a discussion with a sample of users and their manager. Good decisions are made on good information, so it is worth getting feedback.

In a recent survey, over 75% of email users concluded that loss of email was more stressful than divorce. That kind of feedback provides some insight into users' dependency on applications, yet most email servers remain unprotected.

Suggestions for calculating the likely costs of downtime

Detail the number of users for each application and the number of hours that they would use it in a normal day.

Add a rough valuation of the cost per hour for that group (£37.50 is a commonly used generic total employee cost).

Downtime is a major variable in terms of risk and likely cost, so calculate a number of downtime periods:

15 / 30 / 60 minutes
1 / 4 / 8 / 16 / 24 / 48 hours

Calculate the total lost hours of productivity and the direct costs for each downtime period.

The real cost is in the lost productivity and time sensitivity of the user, and that depends on the use of the application. For example, a telesales operation may go offline for a day. That is lost time that cannot be "caught up" and would represent a loss of 0.4% of revenues per annum from this source alone (one working day out of roughly 250).

Put an estimate of loss against the lost productivity of those unproductive hours based on the purpose and use of the software application and user feedback.

This is very simplistic, but you will notice that the real cost of downtime is probably much higher than you anticipated, and the critical applications are clearly identified.

Rather than look at and address every single server at once, focus on the most critical and address the highest area of risk.
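As a rough illustration of the costing exercise just described, the sketch below tabulates the direct cost of a range of outage lengths for a handful of applications. The application names, user counts and daily usage hours are invented examples rather than figures from the text; only the £37.50 hourly rate and the suggested outage periods come from the guide:

HOURLY_RATE = 37.50                                # generic employee cost per hour
OUTAGE_HOURS = [0.25, 0.5, 1, 4, 8, 16, 24, 48]    # 15/30/60 minutes, then hours

# (application, number of users, hours of use in a normal 8-hour day)
applications = [
    ("Email (Exchange)",  200, 2.4),               # roughly 30% of the working day
    ("CRM / telesales",    40, 6.0),
    ("Support help desk",  15, 7.0),
]

for name, users, hours_per_day in applications:
    usage_fraction = hours_per_day / 8.0
    print(f"\n{name}")
    for outage in OUTAGE_HOURS:
        lost_hours = users * usage_fraction * outage
        cost = lost_hours * HOURLY_RATE
        print(f"  {outage:>5.2f} h outage: {lost_hours:7.1f} lost hours = £{cost:,.0f}")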
DATA PROTECTION

Every server already has some form of protection, even if it is just a regular tape backup. The purpose of this is to provide a copy of the data that can be used in the event of a failure that permanently removes access to that data. Whilst all data updates that took place after that backup was completed have been lost, at least the company can revert to "yesterday's" copy.

Data is obviously important, but it is the first and most basic form of availability. Tape backup offers the slowest form of recovery available and therefore resides at the bottom of the high availability and disaster recovery food chain.

If the object is to get a user connected to a working application and its data as quickly as possible, then tape backup is as far away as you can get (even assuming that the backup tape actually works, which is no guarantee based on tape failure statistics).

If there is an area of misrepresentation by the high availability and disaster recovery industry providers, then it would be in a clear quantification of what is really required in time, effort and risk to achieve full recovery from a failure.

Full recovery is defined as the users being connected to the working application on a complete set of data.

If the only protection of a critical server's data is a tape backup, then the next question is whether there is any means of capturing and re-entering the data changes that have taken place since that backup. For some applications, the information can be sourced and re-entered; for many more, the data is forever lost.

Backup tapes are prone to failure when they are most needed. The time taken to restore the data onto a server can run to 12 or more hours. Everything that was done between the tape backup and the failure will, at best, need to be done again; at worst it will be permanently lost.

If you add the probable risk / cost of the loss of a day's data to the cost of the downtime itself, then the true risk / liability of downtime becomes significant.

However, to get data to work, you need a server, operating system, application and connectivity to a network as a minimum.

The "real-time" data replication software market has grown rapidly to address the loss of data since last backup. It is proven, inexpensive and well understood.

Data replication is an ideal solution for less critical servers where user-downtime for a few hours or even a day is acceptable.

But the goal for a critical server is that the user remains working whatever and whenever failure occurs, using the very latest and most up-to-date dataset, without any action being required by anyone. This level of protection is a natural upgrade to data replication products.

The protection of data is a given, but it is the non-productive user that represents the risk and cost.
HIGH AVAILABILITY AND DISASTER RECOVERY: SELECTION CRITERIA

The objective is to keep the users seamlessly connected to a working application / data irrespective of the nature of failure, with very low costs, minimal disruption and minimal risk.

To re-cap: historically, HA and DR have been seen as two different components of protection. HA addresses the day-to-day issues (and the likelihood that critical systems will fail) whilst DR is a costly commercial necessity to protect against a much less likely threat of disaster, usually associated with a 'loss of site'.

What has changed is that the level of criticality of a growing number of IT systems means that any downtime is a disaster, and therefore a disaster can occur without the loss of a site.

The result is that a comprehensive HA and DR solution should be one and the same thing. The only question is how it is deployed.

So what are the critical components of a full HA / DR solution that ensures that the user remains connected to a working application, irrespective of the nature of failure of hardware, operating system, application software or network?

The objective of the next section is to help IT management identify and review the critical components of integrated HA / DR solution providers' offerings.

The areas covered include:

• Architecture: basic structure of the combined HA / DR solution
• Reliability & monitoring
• Data protection
• Specific application software protection
• Switch and fail-over – switch and fail-back
• LAN and WAN bandwidth considerations

The HA (High Availability) component is required to address user-downtime locally, whilst the DR (Disaster Recovery) component is off-site protection required to address a site failure.
ARCHITECTURE: BASIC STRUCTURE OF THE HA / DR SOLUTION

A comprehensive HA and DR solution should enable both local and, if required, off-site protection against downtime from a single solution. That solution can be implemented locally and stretched to cover DR if and when required, without change or cost.

The critical objective is that the user remains seamlessly connected to the working application irrespective of the nature of failure of hardware, operating system, application or network, without any intervention by the user or the IT department.

In an ideal world, if one thing breaks, then you have another, just like it, available to take over immediately. This type of solution is available but historically required the two systems to be identical.

That approach is OK, until you are in an environment where everything is constantly changing, thousands of times a second.

The common approach to this has been to protect a single aspect, usually the data, through replication of data onto a tape, or by replicating real-time to another hard disk.

Replication involves placing another (sometimes identical, sometimes similar) server alongside the "primary" server and copying data real-time from one server to the other. For convenience, we will call these the primary server and the secondary server.

The advantage of this approach is that the amount of data lost in a failure is significantly reduced, potentially to zero, as there is almost immediate access to aspects of the up-to-date data at the time of failure. The limitation is that all other changes to the environment are not being captured, let alone updated.

Changes to the operating system, the database management software, the application, the anti-virus protection, even the network, will all often need to be addressed before the secondary server can be relied upon to function correctly.

That of course assumes that somebody knows exactly what has changed and what relevance it has to the replicated data before restarting the secondary. This is very rarely the case, as detailed change control on IT servers is expensive in time and effort to maintain.

Data replication is a significant improvement on tape backup, as it removes the data loss from last backup to the point of failure. But, more often than not, it is still a long way from a working user, as so much more is required to ensure that the full system is restored and the user can commence work again.

True Pair Architecture

In an ideal world, the primary and secondary servers would remain completely in sync, every update that occurred on one happening on the other.
This basic architecture is a form of 'cluster' where, in the initial state, the primary is active, connected to the network and servicing the users, whilst the secondary is passive, invisible to the network but operational.

To maximise the value of this architecture, you would want to address a number of requirements:

It is important to remove the historic requirement that delivery of a true pair needs 100% identical hardware. Identical hardware is expensive to provide at the outset and difficult / expensive to maintain going forward.

The advantage of the ability to use dissimilar hardware is the ability to use current equipment for both the primary and, if available, for the secondary. It also removes complexities in upgrades and maintenance on either server in the future.

As a true pair, it becomes possible for the passive server to undertake comprehensive monitoring of the entire primary server environment.

The advantage is the ability to automatically undertake intelligent pre-emptive actions to address problems as they arise, ensure the smooth operation of the primary and only switch or fail over if absolutely required.

As a true pair, we want the ability to switch back and forth between the servers at will, with a single key depression, with minimal to zero user disruption and guaranteed zero data loss.

The advantage is the ability to undertake maintenance work on the primary at will, to test upgrades to the environment without risk and, once tested, to upgrade the secondary.

As a true pair, the solution can be deployed in a LAN, extended LAN or WAN environment.

The advantage is the immediate delivery of an HA solution when implemented locally, while supporting a DR solution when the pair are split across two locations, delivering both an HA and a DR solution.

This approach provides a technology platform that addresses the basic requirement of ensuring there is an operational server and data available to the user, whatever happens.

However, the most significant characteristic of this architecture is the speed of switchover, failover and switch back.

The architecture enables a completely automated switch or failover in between 1 and 4 minutes.

As important, and often overlooked, a switch or failover requires no action from the users; they are automatically reconnected to the secondary as it becomes active.

Automated user re-connection is fundamental. How would you notify your email users to log out and log back in again when email is down? It's going to be a busy hour or two on the phone!

The architecture delivers the foundation; the strategy is about user-uptime.
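To make the "true pair" idea concrete, here is a deliberately simplified Python sketch of the active / passive roles described above. It is a conceptual model, not any vendor's implementation: the point is simply that users always follow whichever server is currently active, and a switchover is just a controlled swap of roles.

from dataclasses import dataclass

@dataclass
class Server:
    name: str
    active: bool = False     # active = visible to the network and serving users

class TruePair:
    def __init__(self, primary: Server, secondary: Server):
        self.primary, self.secondary = primary, secondary
        self.primary.active = True            # initial state: primary active, secondary passive

    @property
    def serving(self) -> Server:
        """The server users are currently connected to."""
        return self.primary if self.primary.active else self.secondary

    def switchover(self) -> None:
        """Swap roles. A real product would first flush any replication
        backlog so that the swap involves zero data loss."""
        self.primary.active = not self.primary.active
        self.secondary.active = not self.secondary.active

pair = TruePair(Server("primary"), Server("secondary"))
print("Users connect to:", pair.serving.name)   # primary
pair.switchover()                               # e.g. planned maintenance on the primary
print("Users connect to:", pair.serving.name)   # secondary
pair.switchover()                               # switch back once maintenance is complete
print("Users connect to:", pair.serving.name)   # primary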
RELIABILITY & MONITORING

Prevention is better than cure. The majority of failures relate to reliability issues, many of which might have been addressed if they had been discovered in time.

Delivering a credible HA / DR solution has to address downtime prevention, or it is no better than offering a safety net to a high-wire act. You still fall immediately, and you still have to take the time to recover back to where you were prior to the crash.

In its most basic form, reliability has to address the current and on-going health of the entire protected server environment. Ideally, reliability and monitoring will address both the primary and secondary server and be pro-active in ensuring on-going health and performance as the first step in providing HA / DR.

A critical failure of the hardware, operating system, application, supporting applications (i.e. anti-virus software) or network will take all the users off-line.

Yet each of these components is changing all the time. Hardware utilisation is changing, with memory and disk usage evolving on a moment-by-moment basis. Operating systems are now updated automatically via the web with service packs and "hot fixes", as are many applications such as anti-virus, spam filters and security software. Users join and leave the environment / company, and there are many examples of staff that have left the company regaining access to critical systems after a restore from backup, the subsequent action of their removal never taking place again.

There is a real and significant overhead in the day-to-day review and maintenance of servers to make sure they remain healthy and can meet the changing demands.

But very few IT departments have the resources required to maintain full change control on critical servers, so that there is a documented recovery process to ensure they are able to rebuild the primary, undertake all the housekeeping activities and be certain that the protected data will function on a restored system.

The consequence is that a relatively minor failure can take many hours, and sometimes days, to fix as the IT staff try to sort out the mess.

So it obviously makes a great deal of sense to automate this entire function of change control and monitoring, providing a very valuable tool to fully utilised IT staff and significantly reducing the likelihood of downtime.

There are two other reasons why this approach is useful.

The first is that experience has shown that only a few organisations can easily provide an accurate, comprehensive and complete report on the profile of the server they wish to protect. The provision of a simple utility to perform this function automatically removes the costs and resource requirement of a manual investigation.
It ensures that the implementation process can be planned and executed successfully, within timescale and budget, as there can be no surprises.

The second reason is that, more often than not, a primary server will benefit from a detailed review. In the majority of cases the server's health can be significantly improved with a small amount of work once the issues have been clearly identified.

We now have a server environment in prime health and we want to keep it there. But if that becomes impossible, the solution needs to know about potential downtime threats and take the agreed steps to address them effectively. This requires pro-active monitoring functionality. In the past, monitoring solutions have often been passive, delivering alerts but taking no action.

A classic example in a market leading data replication product is that nothing happens at all when the critical application fails. Because the software only checks to see if the primary server is still turned on, automated failover does not happen, as it has no way of knowing the status of the application.

That leads to the 3am phone call where the users are unable to gain access to software (from wherever they may be), yet the high availability solution has failed to even notice, let alone take action (even if that action is just to let you know!).

When looking at this area, the focus of investigation should be built around the ability of the IT management to determine the initial status and health of the critical server. It should also review what is being checked in the live environment, what manual and automated actions to take initially on the identification of a problem, and what steps to take should it re-occur a second and even a third time.

Would you be confident that, should a disaster occur in the middle of the night, the system will take all the necessary steps to try and fix the issue automatically and then, only if necessary, initiate a switch over without disrupting the users or losing data? Or perhaps your preferred action is an email / text / paging / phone message to allow human intervention.

It is about choice. At what point and by what means should the solution notify the IT department when a problem has been encountered? There is no single answer; it is a matter of choice, and that choice should be dependent on the nature of the problem. Complete flexibility.

Reliability and monitoring functionality is not a "nice to have". Both are essential to maximise the effectiveness of your current IT resources, whilst keeping your users operational.
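The escalation logic described above can be illustrated with a small sketch: check the application itself rather than just pinging the server, attempt an in-place repair a limited number of times, and only then initiate a switchover, notifying the IT department at whatever level has been agreed. The check, repair, switchover and notify functions below are placeholders standing in for real product functionality, not an actual API:

import time

def application_healthy() -> bool:
    # Placeholder: in practice this would verify that the protected
    # application answers a test request, not merely that the server
    # responds to a network ping.
    return False                      # simulate a persistent failure for the demo

def restart_application() -> None:
    print("  ...attempting in-place repair (restart the service)")

def switch_over_to_secondary() -> None:
    print("  ...controlled switchover to the secondary server")

def notify(level: str, message: str) -> None:
    # Placeholder: email / SMS / pager, depending on the agreed policy.
    print(f"[{level.upper()}] {message}")

def monitor(max_repair_attempts: int = 2, interval_seconds: float = 0.5) -> None:
    failures = 0
    while True:
        if application_healthy():
            failures = 0              # healthy again; reset the counter
        else:
            failures += 1
            if failures <= max_repair_attempts:
                notify("warning", f"Check failed ({failures}); trying automatic repair")
                restart_application()
            else:
                notify("critical", "Repair attempts exhausted; failing over")
                switch_over_to_secondary()
                break
        time.sleep(interval_seconds)

monitor()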
APPLICATION SOFTWARE PROTECTION / AUXILIARY SOFTWARE PROTECTION

A normal server will have a number of applications running in order to provide full service to the users.

For example, a Microsoft® Exchange server will always have anti-virus software. Usually there will also be anti-spam and backup software, but each environment is different.

When a failure occurs, the secondary server must be able to provide all the services of the primary with a similar level of security. A live Microsoft® Exchange server operating without fully updated virus protection is probably more dangerous and damaging than no service at all.

HA / DR solutions have to protect both primary and supporting applications and ensure that whichever server is providing the service to the user, it is a true pair and therefore fully protected. However, this is more complex than it may seem.

Protection of primary applications (the main activity) and auxiliary applications (required for security etc) is touted by many HA and DR providers.

In some upmarket clustering environments, it requires significant change by the authors of the application to make their products "cluster aware" for a particular clustering vendor. Having done this once, it then needs to be maintained. As a result, the cost of the application increases dramatically or, alternatively (and more commonly), the software authors don't offer "cluster aware" versions of their software.

In the replication software market, the delivery is often through a consulting engineer undertaking bespoke scripting "on the fly" at the user's site, on their critical server. More often than not, the scripting is never documented or electronically captured, and the result is an ever growing number of bespoke sites.

This may be a sale-able solution, but it is rarely understood by the purchaser what is really being offered at the time of sale. Once discovered, it is too late. The result is the widely reported problems that occur through undocumented bespoke software: for support services, on-going consulting, upgrades etc.

The consulting / bespoke approach can work, but it assumes that everything then remains the same. All too often the consequence is that the failover protection fails in the moment of need.

If the architecture recognises the requirement to deliver application protection, then the process should be very simple and should enable the rapid development of application modules that protect specific applications.

These application modules are products that can be developed, fully tested, swiftly deployed and easily maintained. They can be produced without any change to the software application, simplifying all aspects of support.

As important, they can be updated to address changes that take place at a future date and made available to all within a normal support agreement.

The Reliability and Monitoring services can be utilised to manage each application in the most efficient and effective manner, as the modules utilise a common methodology.

But most importantly, the solution ensures that changes and upgrades to the environment are maintained on both the primary and secondary servers, whether active or passive, whether local or remote.
The result is a standard product approach that cuts out both the cost and the risk of undocumented bespoke software and leverages the greatest asset of the software industry: write once, sell many times and charge accordingly.

This area of HA and DR solutions is a potential minefield. A quick review of marketing collateral will show that many organisations seem to offer a fully productised, "out of the box", application-specific solution.

Yet a detailed review and some smart questioning will show that in most cases these are consulting driven engagements around a misleading marketing message. The problem for a prospective purchaser is that this aspect of HA / DR represents such a significant competitive advantage that disadvantaged suppliers are compelled to "over market" their solutions.

Does the supplier have a price list that includes products for the protection of a number of software applications, including the primary applications such as Microsoft® Exchange, File Server (etc)?

Does that price list include secondary applications such as anti-virus software?

Does their literature identify this as a critical component of their solution and provide all the relevant information?

If still in doubt, ask the very specific question of how application protection is going to be achieved, and possibly find someone other than a salesperson to ask. (Nothing against sales people, as more often than not they are selling what they believe; it is just not always complete in detail. Technical staff can offer greater clarity.)

SWITCH / FAILOVER CRITERIA AND PERFORMANCE

As has been highlighted, a primary requirement of HA / DR solutions should be that the user keeps working, whatever happens.

There are a number of reasons why it may be necessary or desirable to move processing from the primary server to the secondary (and back again):

1. The irrecoverable failure of a critical component of the primary server for whatever reason. This can be the hardware, operating system, application, auxiliary applications, network, site etc.

2. The enablement of maintenance work, such as hardware or software upgrades on the primary server (followed by full testing), without loss of service to the users.

3. The regular testing of the solution as part of a best practice methodology on HA / DR strategy.

In each of these scenarios, the user should remain operational with minimal, if any, disruption.

We have already covered the objective of avoiding a switch or failover in the Reliability and Monitoring section. If it is going to happen, what are the characteristics that would make the process as quick and painless as possible, with minimal risk?

Surprisingly, the very first requirement is that the solution allows a seamless switch back to be undertaken with the same level of automation and with minimal to zero disruption. Whilst this may seem to be a secondary requirement when the disaster occurs, it will become the next most urgent step in every failure, as the system needs to be fully protected again as soon as possible, with minimal disruption, after every failure.
For many solutions, an automated failover will only occur when the primary computer has a catastrophic failure and can no longer be reached by a basic network 'ping' from the secondary. This approach fails to address any scenario in which the user becomes unable to use the system, yet the server still responds to a "dumb ping". Whilst this is often referred to as a fully automated failover, such a claim is obviously overstated or under-explained.

There is a solution in the market that offers "automated failover and fail-back" using the "dumb ping" method to instigate a failover, and then requires the secondary server to be off-line during the data restoration to the primary server after a failure. With a large data set, this could mean 6 – 12 hours of user-downtime to get back to the primary. If a switchover was used to avoid downtime on an upgrade, having 12 hours of downtime to get back to the primary is hardly a sensible solution. But if you didn't ask...

An automatic switch occurs when a failure has occurred that cannot be corrected successfully by the reliability / monitoring activity. In every event, there should be zero data loss in a switchover process, as by definition it is a controlled process. Any data backlog on the failing active server would be transmitted to the passive server, and the failing server is shut down cleanly, avoiding any further damage that might make repairs more difficult.

The passive server starts up its services, connects to the network and the users are automatically connected to it. The length of time taken in this process is critical. A user may not notice any disruption for a minute or two whilst this process is taking place. However, a five minute or longer outage is going to lead to dozens of problem calls.

A manual switchover / switch back occurs in the same way, with two exceptions. The first is that it is on demand, and the second is that the time taken is dependent on the 'delta' of data between the pair. In a short maintenance activity, the delta may just be a couple of minutes' worth; however, a complete disk failure will require a full sync and verify process. What is important is that the process is fully automatic and the final switch will not occur until the data is synced and verified.

Again, discuss the process in detail, ask to see it in action, and quantify exactly what will happen and how long it may take. All too often purchasers discover the questions they should have asked only when the solution disappoints, and it is rarely ever a good time to explain this type of problem to senior management.

Think about this scenario:

It is midnight and the IT department is at home in bed. The CEO is burning the midnight oil to complete sign off on the final proposal for a major tender that has to be delivered by 9.00am the following morning.

And the Microsoft® Exchange server stops working (for whatever reason).

There are two ways forward from this point.

In one, someone in IT gets the call at 12.10am from the CEO, instantly followed by all the effort and stress that goes with resolving that particular scenario.

The other is that the CEO never knows that the email service failed in the first place, because the service is maintained automatically.

It is not a hard decision to decide which is the optimal, 'less stress', solution.

It is also clear which one avoided the potential disaster of the tender not being sent at all and the business opportunity being lost.
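The controlled switchover just described is essentially a fixed sequence of steps whose total duration decides whether users notice anything at all. The sketch below simply walks through that sequence and times it; the step names and the sleep calls are illustrative stand-ins, not measurements of any particular product:

import time

def controlled_switchover(steps) -> float:
    """Run each step in order and return the total elapsed time in seconds."""
    start = time.monotonic()
    for description, action in steps:
        print(f"- {description}")
        action()
    return time.monotonic() - start

steps = [
    ("Flush the replication backlog to the passive server", lambda: time.sleep(0.2)),
    ("Shut the active server down cleanly",                 lambda: time.sleep(0.2)),
    ("Start the protected services on the passive server",  lambda: time.sleep(0.2)),
    ("Assume the network identity so users reconnect automatically", lambda: time.sleep(0.2)),
]

elapsed = controlled_switchover(steps)
print(f"Switchover completed in {elapsed:.1f} seconds "
      "(a minute or two passes unnoticed; five minutes generates problem calls)")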
BANDWIDTH CONSIDERATIONS

BANDWIDTH BOTTLENECK

A solution based upon a replication model has implications for available bandwidth, both in a LAN and particularly in a WAN environment.

The volume of data to be replicated should not be a mystery, as long as a utility is provided to measure it accurately. It is very important that all data volume information is captured, as it is all too easy to miss critical activities such as a relatively simple “defrag”, which is massively system and data intensive.

The risk of data loss is equal to the “lag” between data being processed on the active server and that data reaching the passive server; the lag is caused by any bandwidth bottleneck that slows transmission, and it is usually determined entirely by the amount of bandwidth available.

LAN

In a LAN environment it is possible to use the existing network to carry this data volume; however, this is often not ideal. Connecting to the current network through a single NIC represents a single point of failure, and it is rarely desirable to add significant data volumes to existing networks unnecessarily.

It makes sense to have a low-cost dedicated channel, particularly if this is duplicated to remove any single point of failure. Very high local bandwidth with zero impact on the current network ensures absolute minimum data loss in the event of a catastrophic server failure.

WAN

Moving to a WAN or extended LAN, however, raises the issue of the cost of bandwidth. It is essential to have an accurate picture of the data volumes generated in order to cost the bandwidth requirement accurately.

The cost of bandwidth will continue to fall in the future, but this will be countered by the likelihood of increasing volumes of data, so this is a long-term expense and deserves close attention in the buying process.

The decision on bandwidth is a practical one. If the bandwidth does not cover peak requirements, a backlog of data will build up on the active server and would be lost if a catastrophic failure took place. But bandwidth that does cover peak data requirements is going to be expensive, with much of it idle most of the time.
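The trade-off between link size, backlog and data at risk can be modelled with a few lines of arithmetic. The sketch below is a simplified illustration, not the measurement utility referred to above; the function names and the traffic figures in the example are assumptions chosen purely to show the shape of the calculation, and it includes an optional compression factor.

```python
def backlog_after_burst(change_rate_mbps: float,
                        link_mbps: float,
                        burst_minutes: float,
                        compression_ratio: float = 1.0) -> float:
    """Megabits left queued on the active server after a burst of changes
    (for example a defrag) that exceeds what the link can carry.
    compression_ratio = 0.3 means the replicated stream is squeezed to 30%
    of its raw size, roughly the 60-80% saving quoted for compression tools."""
    effective_rate = change_rate_mbps * compression_ratio
    surplus = max(effective_rate - link_mbps, 0.0)
    return surplus * burst_minutes * 60

def catch_up_minutes(backlog_mbits: float,
                     link_mbps: float,
                     steady_change_rate_mbps: float = 0.0,
                     compression_ratio: float = 1.0) -> float:
    """How long the passive server lags behind once the burst has finished."""
    spare = link_mbps - steady_change_rate_mbps * compression_ratio
    if spare <= 0:
        raise ValueError("the link never catches up at this change rate")
    return backlog_mbits / spare / 60

# Illustrative "what if": a 30-minute defrag generating 40 Mbit/s of changes
# replicated over a 10 Mbit/s WAN link, with and without ~70% compression.
for ratio in (1.0, 0.3):
    queued = backlog_after_burst(40, 10, 30, ratio)
    print(ratio, round(queued), round(catch_up_minutes(queued, 10, 2, ratio), 1))
```

Running the example shows why compression and an honest view of peak activity matter: the same 30-minute burst leaves either roughly 54,000 Mbit or roughly 3,600 Mbit queued on the active server, with a catch-up time of about 113 minutes versus about 6, and that queue is also the data at risk if the server is lost mid-burst.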

How can bandwidth requirements be optimised?

The most obvious route is to reduce the volume of data being transmitted, as this has a direct relationship to the bandwidth requirement. A data compression tool can reduce the volume transmitted by 60-80%, which means 60-80% less bandwidth is required.

The ability to review the data backlog and the likely clearance / catch-up time allows the most sensible commercial decision on bandwidth. The ability to play “what if” scenarios on actual data requirements with different bandwidths ensures the most cost-effective decision, both initially and throughout the life of the solution as your requirements (and the cost of bandwidth) change.

SUMMARY

Downtime is about users, not systems and data.

User downtime represents very considerable cost and real business risk for a growing number of business applications, but not all.

An HA and DR strategy should not be represented by a “one size fits all” approach. A mixed strategy from tape backup through replication to zero user-downtime makes commercial sense. Invest where the risk is greatest.

The process of identifying and quantifying risk should be a standard business practice that is repeated on a regular basis, as the level of risk will constantly change.

If the solution to remove downtime were free, everyone would have it because it would be stupid not to.

A complete implemented HA and DR solution costs from just £6,000.

If this cost is considerably less than the cost and risk of one extended period of user-downtime on a critical server during the next 3-5 years, “Don’t you think you should do something about it?”

THE CRITICAL COMPONENTS REQUIRED TO REMOVE DOWNTIME:

1. The first and most important is the silent inclusion of the phrase “without adding undue cost and complexity” to each and every aspect of the solution. The goal, after all, is to make this solution available to the widest possible market.

2. Undertake a detailed review of business critical applications and identify the largest area of risk to your business.

3. Quantify a value of that risk for cost justification purposes.

4. Obtain management buy-in to the value of a High Availability and Disaster Recovery strategy.

5. Avoid the common mistake of budgeting to protect everything and getting nothing. Select the greatest area of risk and address that. A step-by-step process is less risk than all or nothing!

6. Commence a procurement process that reflects the size of risk / urgency of obtaining protection from downtime, both in terms of budget and the decision-making process. Downtime is probably the greatest area of risk to the business.

7. Ensure that the focus of every aspect of the solution is geared to maintaining user connectivity to a working application / data, and not just protecting data.

8. Minimise the probability of a failure in the first place with the creation and maintenance of a healthy “self healing” environment that fixes minor failures on the fly, before they lead to downtime (a minimal sketch of this idea follows the list).

9. Ensure there is “no single point of failure” within the solution.

10. Ensure that application protection is achieved through the use of products and not delivered through consulting and undocumented bespoke software unique to your installation.

11. Fully automate the process of switchover and failover so that users continue working without any action by the user or the IT department.

12. Ensure the recovery process to get back to the repaired primary server is fully automated (on demand) and involves minimal to zero user-downtime.

13. Enable planned maintenance to be undertaken without causing downtime.

14. Minimise data loss to zero for a controlled switchover and to a minimum for catastrophic failures.

15. Enable the solution to address both LAN and WAN deployment within a single solution, thereby addressing the HA requirements and enabling DR cover on demand.

16. Make the decision and deploy.

17. Repeat as appropriate for other critical applications / servers over time.
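As a deliberately simplified illustration of points 8 and 11 (fixing minor failures on the fly, with no action required from the user or the IT department), the sketch below watches a list of protected services and restarts any that have stopped. It is a hypothetical example, not a description of any particular product: the service names are invented, and the systemctl commands are used only for brevity; a Windows deployment would query the Service Control Manager instead.

```python
import subprocess
import time

# Hypothetical list of protected services -- in practice, the applications
# identified as business critical in steps 2 and 3 of the checklist.
PROTECTED_SERVICES = ["example-mail.service", "example-database.service"]

def is_running(service: str) -> bool:
    """True if the service reports itself active (systemd shown for brevity)."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", service])
    return result.returncode == 0

def restart(service: str) -> None:
    """The in-place 'self healing' fix: restart a stopped service."""
    subprocess.run(["systemctl", "restart", service], check=False)

def watch(poll_seconds: int = 30) -> None:
    """Fix minor failures on the fly, with no action needed from users or IT."""
    while True:
        for service in PROTECTED_SERVICES:
            if not is_running(service):
                restart(service)
        time.sleep(poll_seconds)
```

A real high-availability product would escalate to a full switchover when restarts keep failing; that escalation is the automatic switch described earlier and is deliberately left out of the sketch.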

ABOUT THE AUTHOR

Neil Robertson has more than 26 years’ experience within the IT industry. From joining the computer marketplace in 1979 with Olivetti’s electro-mechanical accounting systems, his experience spans the advent of word processing, spreadsheets, FAX machines and accounting software for small to medium sized businesses, the rise of Windows-based business applications and the domination of the Internet in today’s business world.

After leaving Olivetti, he founded Team Systems Group in 1983, rapidly becoming the UK market leader for PC-based technology. Team was sold to Misys Plc in 1989, where he became an Operating Board Director with the role of Chief Executive of a number of Misys companies.

Having left Misys, he joined Kewill Systems Plc to head up their newly formed Great Plains Dynamics operation. He engineered the re-purchase of the distribution rights by Great Plains and set up its first off-shore subsidiary in the UK. Great Plains became the recognised leader in its market sector, signing a global distribution arrangement with Siebel Systems in 1999 to offer CRM solutions through the Great Plains channel.

He left Great Plains in 2000 to found 30/30 Vision, an eCRM consultancy that offered a unique insight into maximising the benefits of CRM and avoiding the common pitfalls. 30/30 Vision was sold to a global enterprise in August 2002.

In September 2002, Neil joined the Neverfail Group as Group CEO with the mandate to migrate the business from a successful hardware-based disaster recovery company into a global software business. That remains the job at hand, and this book is part of that process.

Contact Neverfail

Neil has written four other books over the past decade. In each case, these books enabled senior management to quickly grasp, understand and put new technology and methodologies into practice.

Tricks of the Trade – a buyer’s guide to financial software (1996)

Tricks of the Trade II – an FD’s guide to cut through the sales pitch to get at the critical facts (1997)

E-business: The Invisible Revolution – a decision maker’s guide to the business benefits of e-business (1998)

The 29 Most Common Mistakes in purchasing and implementing eCRM (2001)

Copyright of the Neverfail Group 2005

All rights reserved. Except for the quotation of short passages for the purposes of criticism and review, no part of this publication may be reproduced, stored in a retrieval system or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior permission of the author or his agents.

The right of Neil Robertson to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

Any registered trademarks used are acknowledged and recognised as being the property of the organisations to whom they belong.
