Business Continuity and Disaster Recovery
MODULE-III
Preface
The purpose of this book is to give an overview of Business Continuity
Planning and its implementation. It covers topics such as the need and
importance of BCDR, types of disasters, disaster recovery, BCP and
governance, industry standards supporting BCP and DRP, and the benefits of
BCP and DR.
This book first introduces the basics of the Business Continuity Plan, the
BCP process, and the steps for developing a Business Continuity Plan. It then
provides in-depth coverage of BCP/DR and recovery technology and disk system
fault tolerance.
Table of Contents

Business Continuity and Disaster Recovery

Chapter 1: BCP and Secure Processes
1.1 Introduction
1.2 Management Commitment
1.3 PDCA
1.4 Conclusion

Chapter 2: Business Continuity and Disaster Recovery
2.1 Introduction
2.2 Need of BCDR
2.3 Types of disasters
2.4 BCP and DRP Differences and Similarities
2.5 Components of BCP/Disaster Recovery
2.6 BCP and Governance
2.7 Industry Standards supporting BCP and DRP
2.8 Benefits of BCP and DR

Chapter 3: BC/DR Planning

Chapter 4: BCP/DR Plan Development and Implementation

Chapter 5: BCP/DR and Recovery Technology
5.1
5.2
5.3
5.4
5.5
5.6
5.7

Chapter 6
6.1
6.2
6.3 OS
6.4
6.5
6.6
6.7
Chapter 1
BCP and Secure Processes
Objective
1.1 Introduction
1.2 Management Commitment
1.3 PDCA
1.4 Conclusion
1.1 Introduction
ISO 27001 has 11 domains, which address the key areas of information
security management. It covers the following areas:
Security policy
Organizing information security
Asset Management
Human Resource Security
Physical and Environmental security
Communication and operation management
Access Control
Information System Acquisition, Development and maintenance
Information Security Incident Management
Business Continuity Management
Compliance
It has a total of 133 best practices covering all 11 domains. These best
practices are controls for achieving the objectives of IT security
management. ISO 27001 uses the PDCA model for its implementation. PDCA is a
cyclic model that must be sustained over the long run with solid backing and
dedication from management. It ensures that the correct components are
engaged, evaluated, monitored, and improved on a continuous basis.
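The continuous PDCA loop described above can be sketched as a simple repeating cycle. This is only an illustrative sketch of the iteration idea, not part of the standard itself:

```python
from itertools import cycle

# The four PDCA phases, repeated indefinitely during ISMS implementation.
PDCA_PHASES = ["Plan", "Do", "Check", "Act"]

def pdca_iterations(n_cycles):
    """Yield (cycle_number, phase) pairs for n_cycles full PDCA cycles."""
    phases = cycle(PDCA_PHASES)
    for i in range(n_cycles * len(PDCA_PHASES)):
        yield i // len(PDCA_PHASES) + 1, next(phases)

# Two full cycles: each pass through the loop is expected to improve the ISMS.
for cycle_no, phase in pdca_iterations(2):
    print(cycle_no, phase)
```

Each completed cycle feeds its CHECK findings into the ACT phase, which becomes the input for the next PLAN phase.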
1.2 Management Commitment

1.3 PDCA
http://www.infosecwriters.com/text_resources/pdf/ISMS_VKumar.pdf
vi. Probability of Occurrence
With respect to each and every asset, it is important to find out the
probability of occurrence of a threat to that asset within the
organization. The probability of occurrence is required to understand the
frequency at which such failures occur. This is based on previous
experience and on examination of the current implementation. Usually,
probability is marked with flags such as High, Medium, and Low. Every
department head, or a knowledgeable person from the department, has to set
this probability. They also have to find the interdependent processes and
their effect in case of disruption.
vii. Risk Value
The risk value is then calculated. Risk is always calculated in terms of
numbers, i.e., in Rupees or the respective currency. The risk value is
calculated by identifying the possible threats that can impact
confidentiality, integrity, and availability (CIA), checking both the
impact and the frequency of impact.
For example, consider the threats to a mail server:
Power failures
Hardware failure
Fire
Virus attacks / Malicious code injection
Intruders (Hacking), Denial of Service (DoS attack)
Mail accidentally sent to a different recipient
Data corruption / data loss
Unauthorized access
Link failure
Natural calamities
Risk can be calculated with a single formula:
Risk = Vulnerability * Threat
The result of the risk value calculation is the input for the next phase,
i.e., the DO phase, to decide which asset should be treated on priority.
Usually, once the risk value calculation is done, the assets are ranked
from the highest-risk asset to the lowest-risk asset. Accordingly, the
further risk treatment methodology is selected.
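A minimal sketch of this calculation and ranking, assuming illustrative numeric scores for threat and vulnerability (the asset names and scores below are hypothetical examples, not taken from the text):

```python
# Hypothetical assets with threat and vulnerability scores (1 = low, 3 = high),
# following the text's formula: Risk = Vulnerability * Threat.
assets = {
    "mail server":  {"threat": 3, "vulnerability": 2},
    "file server":  {"threat": 2, "vulnerability": 2},
    "reception PC": {"threat": 1, "vulnerability": 1},
}

def risk_value(scores):
    return scores["vulnerability"] * scores["threat"]

# Rank assets from highest risk to lowest, as input for the DO phase.
ranked = sorted(assets, key=lambda name: risk_value(assets[name]), reverse=True)
for name in ranked:
    print(name, risk_value(assets[name]))
```

In practice the scores would come from the department heads' High/Medium/Low flags and the organization's own loss estimates in currency terms.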
In the Business Impact Analysis, the impact on the system due to various
failure events and other events is analyzed.
In this phase, training all employees on the policies, guidelines, and
procedures to be followed in case of a disaster, or for attempting
continuity of business, is most important.
There are different methods to pass this information on to end users; some
of them are explained below.
Train the trainer approach
At times it is very difficult to reach every user in an organization
(usually an organization with more than 500 employees), and tracking will
be a tedious process. In this method, a set of people (generally at the
middle management level) are trained, and they take the responsibility of
training their teams.
Without train the trainer approach
This method is generally used in smaller organizations. Here the training
program is conducted for each and every employee of the organization by the
same team of trainers.
Training Materials
Preparation of training materials should depend on the targeted audience. Split
the organization based on the following:
Senior Management
Middle Management
End Users
For a training session for senior management, make sure to include some
statistics from the vulnerability report and a comparison with previous
reports. The main focus should be to show the improvements that have been
achieved through this implementation.
End user training can be conducted by shooting a short film with some
in-house members acting in the video. The video can also include pictures
taken around the organization's premises that serve as examples of common
security breaches, and those pictures can also be used as screen savers.
The handbook, hand-outs and Information Security bulletin are additional
means to spread information to all employees.
ISO 27001 provides certain possible solutions for certain types of risks,
which can be referred to in the following table:
ISO 27001 Control(s)            Possible Solution
A.9.2.2                         UPS, generator
A.9.2.4                         AMCs
A.9.1.4                         Fire extinguishers, sprinklers; keep a phone
                                list of concerned persons at the required
                                location
A.10.4.1                        Anti-virus, anti-spam, spyware removal tool
A.6.2.1, A.6.2.3, A.10.6.1      Perimeter security devices, adequate
                                network controls
A.10.8.4                        Digital signatures
A.10.5.1                        Backup
A.11.2.2, A.11.2.4, A.11.5.2
A.9.1.4
The above is an example of how each identified threat can be mapped to ISO
27001 controls, and of how to find ways to minimize the risk.
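Such a mapping can be kept in a simple lookup structure. The threat-to-control pairings below are an illustrative reading of the table above (UPS for power failures, backup for data loss, and so on), not a normative mapping from the standard:

```python
# Illustrative mapping of identified threats to ISO 27001 control references
# and example countermeasures; the pairings are example data for this sketch.
threat_controls = {
    "Power failures":       (["A.9.2.2"], "UPS, generator"),
    "Hardware failure":     (["A.9.2.4"], "AMCs"),
    "Virus attacks":        (["A.10.4.1"], "Anti-virus, anti-spam, spyware removal tool"),
    "Data corruption/loss": (["A.10.5.1"], "Backup"),
}

def controls_for(threat):
    """Return the control references recorded for a threat, or an empty list."""
    refs, _solution = threat_controls.get(threat, ([], None))
    return refs

print(controls_for("Power failures"))
```

Keeping the mapping as data makes it straightforward to generate the risk treatment table and the Statement of Applicability from one source.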
While making BCDR applicable as per the ISO 27001 standard, the first step
is to prepare the Statement of Applicability.
Statement of Applicability (SoA)
The SoA is a document that states all of the ISO 27001 controls that are
applicable to a particular type of organization. A justification also needs
to be given for each control that has not been chosen for implementation.
This SoA document will be provided to clients and external trusted
authorities on demand, so that they can identify the level of
implementation of security practices in the organization.
Control Reference: A.9.2.2
Description: Power supplies
Implementation: Yes
Justification: The company has implemented UPS systems and also a dedicated
generator for the entire building.
Some of these controls require policies to support the implementation,
e.g., an anti-virus policy that defines how anti-virus software is to be
deployed across the organization, which tools are used, and how it is
monitored. The organization needs to make sure that all the policies are in
place, and also needs to document the operating procedures of all the
assets in the organization. This is very important.
c. CHECK Phase:
In this phase, the output of the DO phase is used as input for the CHECK
phase. The team has to check, verify, and audit all the controls which are
implemented for BCDR. The team has to check two important points:
1.
2.
control is presently giving. In the ACT phase, it is expected that all
non-conformities as well as non-compliances are complied with.
The ACT phase can also be termed a phase where post-audit checks are
confirmed.
Asset tags: make sure all your assets have been labeled as per your policy.
Mechanism to assess and improve user awareness among employees: there
should be a mechanism in place, or at least records maintained, for the
user awareness training conducted.
Mechanism (procedure) to record security incidents and their solutions:
there should be a process to record security incidents found and reported
by users; the action taken for those incidents and the learning from them
need to be documented.
Mechanism to store the logs of servers and other monitoring tools for
further reference: log retention needs to be defined and practiced.
Back-up and restore procedures to be in place. Tests of restoring data have
to be practiced and documented.
BCP needs to be documented. Any test done to check the BCP needs to be
documented with test results.
DR site should be defined and documented.
All cabling (power and data) should be adequately protected.
License management should be demonstrated: license management using some
tool, or recorded in an Excel file, should be produced. Audits will be
conducted to check whether the installed software matches what is mentioned
in the license management document.
Audit reports of VA, PT, and other audits conducted in the organization
should be adequately documented and measured, and improvements should be
projected for auditing.
Patch management and anti-virus management are recommended to be
centralized, with a dedicated person assigned to monitor this area. A
random audit should be conducted to check whether any machine has been
omitted from anti-virus or patch updates.
Every stage of PDCA requires awareness training, which must be conducted
for better implementation. ISO 27001 provides detailed guidance for
aligning the ISMS with the organization's risk profile. The system is built
through iterations of the PDCA cycle; each cycle improves the effectiveness
of the system. The ISMS's focus is on the confidentiality, integrity, and
availability of information.
Summary
ISO 27001 has 11 domains, which address key areas of the Information
Security Management.
ISO 27001 uses PDCA model for its implementation.
PDCA covers Plan-Do-Check-Act phases in implementing ISO 27001
from BCDR perspective.
Business Impact Analysis (BIA) is performed to analyze the impact on
the system due to various unprecedented events or incidents.
Risk is treated with the formula of 3T-1M, i.e., Risk Transfer, Risk
Treatment, Risk Tolerate, Risk Mitigate.
Risk management means making every attempt to minimize the level of risk
so that it can be accepted, cannot be treated further, or has been reduced
to a minimum after a certain level.
Training all employees on the policies, guidelines, and procedures to be
followed in case of disaster, or for attempting continuity of business, is
most important.
The audit is a part of monitoring and reviewing the implemented
process.
The ISMS's focus is on the Confidentiality, Integrity, and Availability of
information.
Chapter 2
Business Continuity and Disaster Recovery
Objective
2.1 Introduction
2.2 Need of BCDR
2.3 Types of disasters
2.4 BCP and DRP Differences and Similarities
2.5 Components of BCP/ Disaster Recovery
2.6 BCP and Governance
2.7 Industry Standards supporting BCP and DRP
2.8 Benefits of BCP and DR
2.1 Introduction
All of us know that threats are uninvited hurdles and problems that
directly result in monetary losses. Business continuity is the aim of every
organization, as they are in the market to make profits, whereas disasters,
sometimes natural and sometimes man-made, may or may not be avoidable in
certain circumstances. Prioritizing the IT and technology needs of a
business while ensuring the IT budget is being used effectively is arguably
the biggest challenge.
Business continuity and disaster recovery solutions designed specifically for
maintaining the ongoing operation of key business systems during a major
event have traditionally been achieved via full physical backups and total
replication. This most often doubled the cost of IT infrastructure needs, giving
BCDR a reputation for being expensive and often unattainable.5
http://www.itbusinessedge.com/cm/community/features/guestopinions/blog/have-you-bcdrd-yourbusiness/?cs=50061
2.2 Need of BCDR
http://www.iim-edu.org/executivejournal/Whitepaper_BCDR_Best_Practices.pdf
2.3 Types of Disasters
Disasters are broadly divided into natural disasters and man-made
disasters.
Natural disasters
Tornadoes
Floods
Blizzards
Earthquakes
Fire
Man-made Disasters
Man-made disasters can be classified as insurable or non-insurable, and as
technical or non-technical.
Disasters can take several different forms. Some primarily impact
individuals, e.g., hard drive meltdowns, while others have a larger,
collective impact.
Disasters such as power outages, floods, fires, storms, equipment failure,
sabotage, terrorism, or even epidemic illness can occur. Each of these can
at the very least cause short-term disruptions in normal business
operation, but recovering from the impact of many of the aforementioned
disasters can take much longer, especially if organizations have not made
preparations in advance. However, if proper preparations have been made,
the disaster recovery process does not have to be exceedingly stressful.
Instead, the process can be streamlined, but this facilitation of recovery
will only happen where preparations have been made. Organizations that take
the time to implement
Copyright Intelligent Quotient System Pvt. Ltd.
disaster recovery plans ahead of time often ride out catastrophes with minimal
or no loss of data, hardware, or business revenue. This in turn allows them to
maintain the faith and confidence of their customers and investors.7
Some disasters can be insured against, and the loss minimized. For example,
in case of a fire in a building, the insurance claim minimizes the loss of
the entire value of the building as well as the assets present in it. But
not all losses can be insured against. For example, a system administrator
formatted the hard drive while leaving the job, and the company lost its
entire data of the last 3 years, for which there was no backup. Such a loss
due to human behavior cannot be insured.
http://itfirstaid.ca/services/disaster-recovery/
Response: With the same example as above, the petrol pump should take out
transit insurance, install fire extinguishers, train the employees in the
emergency procedures, install smoke detectors, keep sand buckets ready,
etc.
Recovery: In case of an actual fire, the sand buckets and fire
extinguishers are to be used appropriately. Since all the employees are
trained and know how to execute the emergency recovery plan, recovery can
be done with minimum damage.
Mitigation: Either from disasters the organization itself has faced, or
from the industry to which it belongs, disasters can be anticipated, and
accordingly new plans to mitigate such threats can be made by the
management. This also reduces the huge cost of damage.
BCP/disaster recovery planning is the factor that makes the critical
difference between the organizations that can successfully manage crises
with minimal cost and effort and maximum speed, and those that are left
picking up the pieces for untold lengths of time, at whatever cost
providers decide to charge; organizations forced to make decisions out of
desperation.9
Detailed disaster recovery plans can prevent many of the heartaches and
headaches experienced by an organization in times of disaster. By having
practiced plans, not only for equipment and network recovery, but also plans
that precisely outline what steps each person involved in recovery efforts
should undertake, an organization can improve their recovery time and
minimize the time that their normal business functions are disrupted. Thus it
is vitally important that disaster recovery plans be carefully laid out and
regularly updated. Organizations need to put systems in place to regularly
train their network engineers and managers. Special attention should also be
paid to training any new employees who will have a critical role in the disaster
recovery process.
There are several options available for organizations to use once they decide to
begin creating their disaster recovery plan. The first and often most accessible
source a business can draw on would be to have any experienced managers
within the organization draw on the knowledge and experience they have to
help craft a plan that will fit the recovery needs specific to their unique
organization. For organizations that do not have this type of expertise in house,
there are a number of outside options that can be called on, such as trained
consultants and specially designed software.
One of the most common practices used by responsible organizations is a
disaster recovery plan template. While templates might not cover every need
specific to every organization, they are a great place from which to start
one's preparation. Templates help make the preparation process simpler and
more straightforward. They provide guidance and can even reveal aspects of
disaster recovery that might otherwise be forgotten.

9 http://www.disasterrecovery.org/disaster_recovery.html
The primary goal of any BCP/disaster recovery plan is to help the organization
maintain its business continuity, minimize damage, and prevent loss. Thus the
most important question to ask when evaluating a disaster recovery plan is,
"Will the plan work?" The best way to ensure the reliability of one's plan
is to practice it
regularly. Have the appropriate people actually practice what they would do to
help recover business function should a disaster occur. Also regular reviews
and updates of recovery plans should be scheduled. Some organizations find it
helpful to do this on a monthly basis so that the plan stays current and reflects
the needs an organization has today, and not just the data, software, etc., it
had six months ago.
IT Disaster and WAN Redundancy
One of the most common areas of vulnerability for organizations when a
disaster strikes is the loss of their WAN connectivity. Earthquakes, floods, and
acts of war can certainly disrupt the use of an organization's data lines. But
loss of WAN connectivity can happen even without a major catastrophe. Much
simpler threats such as the accidental cutting of data lines or equipment
failure can have the same devastating net result on connectivity. Whether the
cause is a construction mishap from the new building next door, or the effects
of a far more serious event like a flood, fire, or terrorist attack, if an
organization loses their connectivity their business continuity is often lost as
well, and they are functionally in a state of disaster.
The loss of WAN connectivity can have serious consequences for an
organization's daily business activities. Emails, financial transactions,
ERP/CRM systems, order placement and processing, are just a few of the
critical operations affected by WAN connectivity. If connectivity is lost these
activities can be severely slowed or halted altogether until data lines can be
recovered. Thus, having a functioning WAN system is critical for productive
business operation and should be an integral part of any disaster recovery
plan.
There are several methods available for organizations who want to ensure a
high availability of WAN connectivity as part of their disaster recovery plan. The
earliest techniques used to back up data lines were complex and cumbersome.
They used multiple data lines that were connected to a programmable router.
Complex programming allowed data to be passed over multiple connections
which helped reduce vulnerability to a single line and helped protect against
backbone failure. This technique, though far from streamlined, was better than
no back-up system at all and did help maintain at least some business
continuity.
Since that time the technology has evolved and a more elegant technique is
available. This new technique involves the use of intelligent devices that can
handle multiple data lines of different speeds from multiple providers
simultaneously. These devices, called Router Clustering Devices, intelligently
detect if a line, component or service is failing and then proceed to switch the
flow of data to other available and working lines. These advancements provide
better protection for an organization's data flow. They reduce the
potential mess of disaster recovery and, in turn, increase business
continuity when disasters do happen, without the complexity and
awkwardness of the old system.
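The failover behavior described for router clustering devices can be sketched at a very high level. This is an illustrative simulation of the switch-over logic only, not vendor code; the provider names are hypothetical:

```python
# High-level sketch of multi-line WAN failover: traffic flows over the first
# healthy line; when a line fails, the selector switches to the next working one.
class WanLine:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

def select_line(lines):
    """Return the first healthy line, or None if all lines are down."""
    for line in lines:
        if line.healthy:
            return line
    return None

lines = [WanLine("provider-A"), WanLine("provider-B"), WanLine("provider-C")]
assert select_line(lines).name == "provider-A"

lines[0].healthy = False          # primary line fails (e.g., a cut data line)
assert select_line(lines).name == "provider-B"   # traffic switches over
```

A real device would also probe line, component, and service health continuously and balance load across lines of different speeds and providers, but the core idea is the same: detect failure, redirect flow.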
2.4 BCP and DRP Differences and Similarities
2.5 Components of BCP/Disaster Recovery
1. Destructive measures
2. Response procedures and continuity of operations
3. Determination of backup requirements
4.
5.
6.
7.
8.
2.6 BCP and Governance
Not many years ago, when a business wanted to find ways to prepare itself
against disaster and ensure business continuity should catastrophe strike, the
bulk of the organization's time, money, and effort would be spent on ways that
disasters could (hopefully) be avoided altogether. Often the outcome of an
organization's search for ways to protect their most critical business
applications (in order to shore up their business continuity if disaster hit), was
that they found they could potentially avoid harm through the use of
redundant data lines.
The first step is to obtain the commitment of the management and all the
stakeholders towards the plan. They have to set down the objectives of the
plan, its scope, and the policies. An example of a decision on scope would
be whether the target is the entire organization or just some divisions, or
whether it covers only data processing or all the organization's services.
Management
provides sponsorship in terms of finance and manpower. They need to weigh
potential business losses versus the annual cost of creating and maintaining
the Business Continuity Planning. For this, they will have to find answers to
questions such as how much it would cost or how much would be considered
adequate.
Broadly, the objective of the Business Continuity Planning (BCP) for a business
can only be to identify and reduce risk exposures and to proactively manage
the contingency.
A BCP contains a governance structure often in the form of a committee that
will ensure senior management commitment and define senior management
roles and responsibilities.
The BCP senior management committee is responsible for the oversight,
initiation, planning, approval, testing and audit of the BCP. It also implements
the BCP, coordinates activities, approves the BIA survey, oversees the creation
of continuity plans and reviews the results of quality assurance activities.
The BCP committee is commonly co-chaired by the executive sponsor and the
coordinator.
2.7 Industry Standards supporting BCP and DRP
Summary
Chapter 3
BC/DR Planning
Objective
3.1
3.2
3.3
3.4
3.5
http://www.availability.com/resource/pdfs/dpro-100862.pdf
3.3
3.4
Data
Process
Network
People
Time required for process
Interdependencies of processes
The BCP mainly covers backing up data and providing system redundancy, but
this is only one small part of BCP. Disaster recovery also includes things
like transporting people to the proper place, developing ways of carrying
out automated tasks manually, documenting needed configurations, and
altering business processes to maintain critical functions.
Business continuity is also part of the security policy and program. Every
business organization exists to make profit; this is the rational objective
of every business organization, so plans are prepared to achieve this
objective. The main reason to develop the plans is to reduce the risk of
financial loss by improving the company's ability to recover and restore
operations. This includes the goal of mitigating the effects of the
disaster. Many companies feel that they do not have the time or resources
to devote to a disaster recovery plan. BCP is ultimately the responsibility
of top management. Disruptions in business need to be managed with wisdom
and foresight.

11 http://www.thelshgroup.com/Pages/ContinuityPlanningProcesses.aspx
The BCP policy can be designed by considering process management and
incident management.
3.4.4 Incident Management
Business activity is dynamic, so incidents and crises are also dynamic;
they need dynamic management, along with proactive action, and need
documentation. An incident is any unexpected event; it may or may not cause
damage. Depending on an estimation of the level of damage to the
organization, all incidents should be categorized. A classification system
could include the following categories: negligible, minor, major, and
crisis. Any such classification is dynamically provisional until the
incident is resolved.
These levels can be described as follows:
Negligible incidents: Negligible incidents are those causing no perceptible
or significant damage, such as very brief OS crashes with full information
recovery, momentary power outages with UPS backup, or other
non-catastrophic failures.
Minor events: Minor events are those that, while not negligible, produce no
negative material or financial impact.
Major incidents: Major incidents cause a negative material impact on
business processes and may affect other systems, departments or even outside
clients.
Crisis: Crisis is a major incident that can have serious material impact on the
continued functioning of the business and may also adversely impact other
systems or third parties. How serious they are depends on the industry and
circumstances, but severity is generally directly proportional to the time
elapsed from the inception of the incident to incident resolution.
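The four-level classification above can be expressed as a simple categorization function. The boolean inputs below are an illustrative reading of the level definitions, not a formal scheme from the text:

```python
def classify_incident(perceptible_damage, material_impact, threatens_continuity):
    """Return the incident category per the four levels in the text.

    perceptible_damage   - any noticeable damage at all
    material_impact      - negative material/financial impact on business processes
    threatens_continuity - serious impact on the continued functioning of the business
    """
    if threatens_continuity:
        return "crisis"
    if material_impact:
        return "major"
    if perceptible_damage:
        return "minor"
    return "negligible"

# A brief OS crash with full recovery: no perceptible damage at all.
assert classify_incident(False, False, False) == "negligible"
# Noticeable, but no material or financial impact.
assert classify_incident(True, False, False) == "minor"
```

Because classification is provisional until resolution, an incident would be re-run through such a check as its estimated damage changes.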
Minor, major, and crisis incidents should be documented, classified, and
followed up on until corrected or resolved. This is a dynamic process, as a
major incident generally de-escalates for the time being or momentarily,
then may
Threats can take many forms, including malicious activity as well as natural
and technical disasters. Where possible, institutions should analyze a threat by
focusing on its impact on the institution, not the nature of the threat. For
example, the effects of certain threat scenarios can be reduced to business
disruptions that affect only specific work areas, systems, facilities (i.e.,
buildings), or geographic areas.
Additionally, the magnitude of the business disruption should consider a wide
variety of threat scenarios based upon practical experiences and potential
circumstances and events.
If the threat scenarios are not comprehensive, BCPs may be too basic and omit
reasonable steps that could improve business processes' resiliency to
disruptions.
Threat scenarios need to consider the impact of a disruption and probability of
the threat occurring. Threats range from those with a high probability of
occurrence and low impact to the institution (e.g., brief power interruptions), to
those with a low probability of occurrence and high impact on the institution
(e.g., hurricane, terrorism). High probability threats are often supported by
very specific BCPs. However, the most difficult threats to address are those
that have a high impact on the institution but a low probability of occurrence.
Using a risk assessment, BCPs may be more flexible and adaptable to specific
types of disruptions that may not be initially considered.
Likelihood levels can be defined as follows:

High: The threat's source is highly motivated and sufficiently capable, and
controls that prevent the vulnerability from being exercised are
ineffective.
Medium: The threat's source is motivated and capable, but controls are in
place that may impede a successful exercise of the vulnerability.
Low: The threat's source lacks motivation or capability, and controls are
in place to prevent or significantly impede the vulnerability from being
exercised.
Risk levels are determined by combining the likelihood of a threat with its
impact (adapted from NIST's Risk Management Guide for Information
Technology Systems).13
13 http://www.theiia.org/intAuditor/itaudit/archives/2007/may/understanding-the-risk-management-process/
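A NIST-style risk-level matrix combines the likelihood levels above with impact levels. The cell values in this sketch follow the common High/Medium/Low convention of NIST SP 800-30 and are an assumption, not values reproduced from the text:

```python
# Illustrative NIST SP 800-30 style risk matrix:
# (likelihood, impact) -> overall risk level.
RISK_MATRIX = {
    ("High", "High"): "High",     ("High", "Medium"): "Medium",   ("High", "Low"): "Low",
    ("Medium", "High"): "Medium", ("Medium", "Medium"): "Medium", ("Medium", "Low"): "Low",
    ("Low", "High"): "Low",       ("Low", "Medium"): "Low",       ("Low", "Low"): "Low",
}

def risk_level(likelihood, impact):
    """Look up the overall risk level for a threat scenario."""
    return RISK_MATRIX[(likelihood, impact)]

print(risk_level("High", "High"))    # a highly likely, high-impact scenario
print(risk_level("Low", "Medium"))   # a low-likelihood, medium-impact scenario
```

This is how a brief power interruption (high likelihood, low impact) and a hurricane (low likelihood, high impact) end up in different cells of the matrix even though both demand BCP attention.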
Name
Type of information and usage of the data
Classification of the data into criticality categories (e.g., Marginal,
Normal, Critical, Highly Critical)
14 http://www.analytix.co.za/Consulting/BusinessContinuityManagement.aspx
Critical IT Services
Name
Name
Dependencies of the critical IT Services upon the IT infrastructure
components (relationships between IT Services and IT infrastructure
components)
Which vulnerabilities, impairing the critical infrastructure components in
the event of a disaster, are imaginable?
Which consequences would a failure carry?
Which level of damage would be expected?
Type of risk
Based on which threat or vulnerability
Risk classification, e.g. Negligible, Marginal risk, temporarily
tolerable, Increased, still temporarily tolerable risk, High risk, not
16 http://www.thelshgroup.com/Pages/ContinuityPlanningProcesses.aspx
Summary
Chapter 4
BCP/DR Plan Development and Implementation
Objective
4.1 Purpose of BCP
4.2 BCP Methodology
4.3 BCP/DR Testing Techniques
4.4 BCP/DR Maintenance and Re-assessment of Plans
4.5 Features of good BCP
4.6 Data Recovery Strategies
4.7 Contents of Disaster Recovery Plan
Pre-project activities
Perform a Business Impact Assessment (BIA)
Risk Assessment
Determining Choices and Business Continuity Strategy
Developing and Implementing BCP
Test resumption and recovery plans
4.1.2 Pre-project Activities
Obtain executive support
Formally define the scope of the project
Choose project team members
Develop a project plan
Get a project manager
Develop a project charter
The BCP project has a number of essential tasks which are common to all
projects. These tasks include assembling the project team and appointing
the Project Manager and Deputy Project Manager. Team formation is an
important task for the success of the project.
The role of the Project Manager is as under:
Nowadays almost all activities are done using information technology, and
business functions spread across more than one department. It is necessary
that each department understands its role in the plan, and important that
each gives its support to maintain it. In case of a disaster, each has to
be prepared for a recovery process aimed at protection of critical
functions.
The committee consisting of senior officials from departments like HR, IT, Legal,
Business and Information Security needs to be instituted with the following
broad mandate:
Teams are formed with assigned responsibilities in the event of an incident. The following teams are created depending upon the size of the organization: teams are formed, as required, to cover the various aspects of BCP at the central office, at individual controlling offices, or at branch level. Which of these teams are constituted is based on need. The BCP project team should be carefully selected, and the selected members should be formally notified of their selection.
4.1.3 Critical Components of Business Continuity Management Framework
i.
BCP should evolve beyond the information technology realm and must
also cover people, processes and infrastructure
The methodology should provide for the safety and well-being of people in the organization / outside location at the time of the disaster.
Define response actions based on identified classes of disaster.
To arrive at the selected process resumption plan, one must consider the
risk acceptance for the organization, industry and applicable regulations
Personnel
Addresses or telephone numbers
Business Strategy
Location facilities and resources
Legislation
Processes new or withdrawn
Risk
Contractors, key customers
4.2.2 BCP/DRP Training
In order to provide the greatest benefit to users, gain the greatest return on
investment in a new system, and to be able to operate it effectively without
consulting support, it is critical to provide thorough and effective training.
Project Team members must become experts in the operation of the software,
and end users must become self-sufficient in its use. Executives should have
enough knowledge of the system to understand its capabilities and its
requirements for operations and on-going maintenance.17
4.2.3 BCP/DR Documentation
Documentation is critical to support end-users, to manage change to the
system throughout its lifetime, and to ensure consistent and appropriate use of
17
http://www.ohio.edu/sisrfp/OHIOSISProjectCharter.pdf
strategy for each system. The strategy is based ultimately on the IT budget. Therefore, RTO and RPO metrics need to fit with the available budget and the criticality of the business process/function.
Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are some of
the most important parameters of a disaster recovery or data protection plan.
These objectives guide the enterprises in choosing an optimal data backup or
restore plan.
In short, RPO is how much data you can afford to lose when things go wrong, and RTO is how long you're willing to go without service after a disaster.
For a large enterprise running SANs, the RTO and RPO targets are an hour or less: the more you pay, the lower the numbers. That can mean a large company spending big amounts is still willing to lose all the email sent to it for up to an hour after the system goes down, and to go without access to email for an hour as well. Enterprises without SANs may be literally trucking tapes back and forth between data centers, so as you can imagine their RPOs and RTOs can stretch into days. As for small businesses, often they just have to start over.
Prior to selecting a data recovery (DR) strategy, a DR planner should refer to the organization's BCP, which should indicate the key metrics of recovery point objective and recovery time objective for business processes.
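To make the interplay between these metrics concrete, here is a minimal Python sketch (the function name, timestamps and targets are invented for illustration): data loss is the gap since the last good backup, and downtime is the gap until service is restored; each is compared against its target.

```python
from datetime import datetime, timedelta

def check_recovery_targets(last_backup, failure_time, service_restored,
                           rpo: timedelta, rto: timedelta):
    """Compare an outage against RPO/RTO targets.

    Data loss is the interval between the last good backup and the
    failure; downtime is the interval between the failure and the
    restoration of service."""
    data_loss = failure_time - last_backup
    downtime = service_restored - failure_time
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }

# Example: hourly backups with one-hour RPO and RTO targets.
result = check_recovery_targets(
    last_backup=datetime(2024, 1, 1, 9, 0),
    failure_time=datetime(2024, 1, 1, 9, 45),
    service_restored=datetime(2024, 1, 1, 10, 30),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=1),
)
print(result["rpo_met"], result["rto_met"])  # True True: 45 min of loss, 45 min down
```

In practice the targets would come from the BIA, and the timestamps from backup catalogs and incident logs; the comparison logic is the same.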
4.6.3 A List of Common Strategies for Data Protection:
Performance
Availability
Security and Access Control
Conformance to standards to ensure Interoperability
Summary
****************************************************************************************
Chapter 5
BCP/DR and Recovery Technology
Objective
5.1 Fault Tolerance and Disaster Recovery
5.2 Assessing Fault Tolerance and Disaster Recovery Needs
5.3 Clustering Technologies
5.4 Power Management
5.5 Issues in implementing a DC /DR solution
Introduction
Technology never gives a perfect solution, and computers are no exception. They can have problems that affect their users' productivity, ranging from small errors to total system failure, known as catastrophic failure. Errors and failures can be the result of environmental problems, hardware and software failure, hacking (malicious, unauthorized use of a computer or a network), as well as natural disasters. Every organization needs to measure and minimize the impact of computer and network problems. These measures fall into two major categories: fault tolerance and disaster recovery.
5.1 Fault Tolerance and Disaster Recovery
Fault tolerance is the capability of a computer or a network system to respond
to a condition automatically, usually resolving it and thus reducing the impact
on the system. If fault tolerant measures have been implemented, it is unlikely
that a user would know that a problem existed.
Disaster recovery is the ability to get a system functional after a total system failure or site outage in the least amount of time. If enough fault tolerance methods are in place, disaster recovery, with its inherent risks, may never need to be invoked.
We need to deal with the following when we want to have fault tolerance to
support business continuity.
1. Assessing fault tolerance and disaster recovery needs
2. Power management
3. Disk system fault tolerance methods
4. Backup considerations
5. Virus protection
In this chapter, we are going to discuss the first two points. The last three will be discussed in the next chapter.
5.2 Assessing Fault Tolerance and Disaster Recovery Needs
First we need to find out what the critical processes for the organization are, and determine how long each system could afford to be nonfunctional (down). These determinations will dictate which fault tolerance and disaster recovery methods can be implemented and to what extent. The more vital the system, the greater lengths (and, thus, the greater expense) you should go to in order to protect it from downtime.
For example, banking organizations, insurance companies, government agencies, and airlines all run highly critical computer and network systems. Thus, they all have complex and expensive fault tolerance and disaster recovery systems in place.
Fault tolerance and disaster recovery are implemented by developing sites described as hot, warm, or cold.
Hot Sites
NLB automatically detects servers that are disconnected from the NLB cluster and then redistributes the client requests to the remaining active servers. NLB does not direct client requests to failed or inactive servers.
The algorithms behind NLB keep track of which servers are busy, so when a request comes in, it is sent to a server that can handle it. In the event of an individual server failure, NLB knows about the problem and can be configured to automatically redirect the connection to another server in the NLB cluster.
NLB also supports scalability. When the number of client requests increases, the number of host servers in the NLB cluster can also be increased in the server farm.
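The observable behaviour described above can be sketched in a few lines of Python. This is an illustrative model only: the server names and the least-connections rule are assumptions for the sketch, not NLB's actual distribution algorithm. The point it demonstrates is that failed servers are excluded and requests are redistributed among the remaining active nodes.

```python
def dispatch(servers, active, load):
    """Pick the active server with the fewest current connections.

    servers: list of server names; active: set of servers still
    responding; load: dict of current connection counts. Servers
    absent from `active` are skipped, so their requests are
    redistributed to the remaining nodes."""
    candidates = [s for s in servers if s in active]
    if not candidates:
        raise RuntimeError("no active servers in the cluster")
    return min(candidates, key=lambda s: load.get(s, 0))

servers = ["web1", "web2", "web3"]
active = {"web1", "web3"}            # web2 has stopped responding
load = {"web1": 12, "web2": 0, "web3": 4}
print(dispatch(servers, active, load))  # web3: least-busy active server
```

Adding a host to the cluster here is just another entry in `servers`, which mirrors the scalability point: more hosts means more candidates to spread requests across.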
the failed node was running. This allows the users to continue to access those
resources while the failed node is out of operation.
A typical configuration for a cluster would use a shared disk technology such
as RAID (Redundant Array of Inexpensive/Independent Disks) or SAN (Storage
Area Network) to share back-end data stores.
Failover Cluster
Failover Cluster Terminology
Nodes
The individual servers of the cluster are called Nodes. The Nodes are
connected by physical cables and cluster software.
Failover
When one of the servers in the cluster fails, another server node in the
cluster provides the applications or services. This process is known as
Failover.
Failback
When the server which dropped out of the cluster returns to service and
rejoins the cluster, the services or applications which previously failed
over to another node can now return to the server on which they
originally ran. This is called failback.
Quorum
Quorums are used to determine the number of failures that can be
tolerated within a cluster before the cluster itself has to stop functioning.
This is done to protect data integrity and to prevent problems that could
occur because of failed or failing communication between nodes.
Quorums describe the configuration of the cluster and contain
information about the cluster components such as network adapters,
storage, and the servers themselves.
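The node-majority idea behind quorums can be sketched as follows. This is a deliberately simplified model (Windows failover clustering, for example, supports several quorum modes, including disk and file-share witnesses); it shows only the basic rule that more than half of the configured nodes must be reachable for the cluster to keep running, which is what prevents two partitions from both serving data.

```python
def has_quorum(total_nodes: int, live_nodes: int) -> bool:
    """Node-majority quorum: the cluster keeps functioning only while
    more than half of the configured nodes are reachable."""
    return live_nodes > total_nodes // 2

def tolerated_failures(total_nodes: int) -> int:
    """Largest number of failed nodes the cluster can survive while
    still holding quorum."""
    return (total_nodes - 1) // 2

print(tolerated_failures(5))   # 2: a 5-node cluster survives two failures
print(has_quorum(5, 2))        # False: 2 of 5 nodes is not a majority
```

This is why clusters are usually built with an odd number of votes: adding a witness to a 4-node cluster raises the tolerated failures from 1 to 2.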
5.3.3 Geo-Cluster
Clusters can be deployed in a server farm in a single physical facility or in
different facilities geographically separated for added resiliency. The latter type
of cluster is often referred to as a geo-cluster. Geo-clusters became very popular as a tool to implement business continuance because they improve the time it takes for an application to be brought online after the servers in the primary site become unavailable, meaning that ultimately they improve the recovery time objective (RTO).
If the failover server does not receive a heartbeat from the active device in the specified interval, the failover
server considers the active server inactive, and the failover server comes online
(becomes active) and is now the active server. When the previously active server
comes back online, it starts sending out the heartbeat. The failover device,
which currently is responding to requests as the active server, hears the
heartbeat and detects that the active server is now back online. The failover
server then goes back into standby mode and starts listening to the heartbeat
of the active server again.
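A minimal sketch of this heartbeat decision, with invented interval and threshold values, might look like this in Python: the standby takes over once a number of heartbeat intervals pass in silence, and stands down again as soon as heartbeats resume.

```python
HEARTBEAT_INTERVAL = 1.0   # seconds between heartbeats (assumed value)
MISSED_LIMIT = 3           # missed heartbeats tolerated before failover

def standby_should_be_active(last_heartbeat: float, now: float) -> bool:
    """True when the standby should act as the active server.

    The standby comes online once MISSED_LIMIT intervals pass without
    a heartbeat, and returns to listening when heartbeats resume."""
    return (now - last_heartbeat) > MISSED_LIMIT * HEARTBEAT_INTERVAL

print(standby_should_be_active(0.0, 5.0))   # True: 5 s of silence, primary presumed down
print(standby_should_be_active(5.5, 6.0))   # False: heartbeat heard recently, stand down
```

Real implementations add hysteresis and fencing so that a briefly slow primary is not mistaken for a dead one, but the core decision is this comparison.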
Warm Site
A warm site is, quite logically, a compromise between hot and cold sites. These
sites will have hardware and connectivity already established, though on a
smaller scale than the original production site or even a hot site. Warm sites
will have backups on hand, but they may not be complete and may be between
several days and a week old. An example would be backup tapes sent to the
warm site by courier. The data and services are less critical than those in a hot
site. With hot site technologies, all fault-tolerance procedures are automatic and are controlled by the Network Operating System. Warm site technologies require a little more administrator intervention, but they aren't expensive. The most
commonly used warm site technology is a duplicate server. A duplicate server,
as its name suggests, is currently not being used and is available to replace
any server that fails. When a server fails, the administrator installs the new
server and restores the data; the network services are available to users with a
minimum of downtime. The administrator sends the failed server out to be
repaired. Once the repaired server comes back, it is now the spare server and is
available when another server fails. Using a duplicate server is a disaster
recovery method because the entire server is replaced in a shorter time than if
all the components had to be ordered and configured at the time of the system
failure. The major advantage of using duplicate servers rather than clustering
is that it's less expensive. A single duplicate server costs much less than a comparable cluster solution. Corporate networks don't often use duplicate servers, and that's because there are some major disadvantages associated with duplicate servers.
Cold Site
A cold site is the most inexpensive type of backup site for an organization to
operate. It does not include backed up copies of data and information from the
original location of the organization, nor does it include hardware already set
up. The lack of hardware contributes to the minimal startup costs of the cold
site, but requires additional time following the disaster to have the operation
running at a capacity close to that prior to the disaster. A cold site cannot
guarantee server uptime. Generally speaking, cold sites have little or no fault
tolerance and rely completely on efficient disaster recovery methods to ensure
data integrity. If a server fails, the IT personnel do their best to recover and fix
the problem. If a major component needs to be replaced, the server stays down
until the component is replaced. Errors and failures are handled as they occur.
Apart from regular system backups, no fault tolerance or disaster recovery
methods are implemented. This type of site has one major advantage: it is the
cheapest way to deal with errors and system failures. No extra hardware is
required (except hardware required for backing up).
Any disadvantages of implementing a cold site would stem from having an
application that cannot afford the downtime associated with service-affecting
faults and disasters.
The term nearline refers to a storage method that is neither online nor offline but somewhere in the middle, like tape backup. It involves material that is not likely to be needed except in cases of disaster recovery. While there is no one-to-one correspondence between any type of site (hot, warm, or cold) and nearline storage, which is not actively accessed during normal operation, you can see that nearline storage comes in handy when recovering from disasters in warm and cold sites.
5.4 Power Management
Surge Protectors
Battery backup systems protect computer systems from power failures. These systems use a battery to power the computer and its assorted peripherals. Activated by a power failure, these devices permit the user to save data and initiate a graceful shutdown of the system. They normally aren't used to run the system for an extended period.
There are two main types of battery backup systems:
drops below a factory preset threshold, the switching circuit switches from line
voltage to the battery and inverter. The battery and inverter power the outlets
(and, thus, the computers or devices plugged into them) until the switching
circuit detects that line voltage is present again at the correct levels. The
switching circuit then switches the outlets back to line voltage.
Line Conditioners
8. It is also recommended that the support infrastructure at the DC and DR, namely the electrical systems, air-conditioning environment and other support systems, have no single point of failure, and that a building management and monitoring system is in place to constantly and continuously monitor these resources. If it is specified that the solution has a high availability of 99.95 percent measured on a monthly basis and a mean time to restore of 2 hrs. in the event of any failure, this has to include the support systems also.
9. The data replication mechanism followed between DC and DR is asynchronous replication, implemented across the industry using either database replication techniques or storage-based replication techniques; these have relative merits and demerits. The RTO and RPO discussed earlier, along with the replication mechanism used and the data transfer required to be accomplished during peak load, will decide the bandwidth required between the DC and the DR. The RPO is directly related to the latency permissible for the transaction data from the DC to update the database at the DR. Therefore, the process implemented for the data replication requirement has to conform to the above, with no compromise to data and transaction integrity.
10. Given the need for drastically minimizing data loss during exigencies and enabling quick recovery and continuity of critical business operations,
organizations may need to consider near site DR architecture. Major
organizations with significant customer delivery channel usage and significant
participation in financial markets/payment and settlement systems may need
to have a plan of action for creating a near site DR architecture over the
medium term (say, within three years).
To address these issues, the following controls are required:
b. The time window for recovery is shrinking in the face of the demand for 24/365 operations. Some studies claim that around 30 percent of high-availability applications have to be recovered in less than three hours, and a further 45 percent within 24 hours, before losses become unsustainable; others claim that 60 percent of Enterprise Resource Planning (ERP) systems have to be restored in under 24 hours. This means that traditional off-site backup and restore methods are often no longer adequate. It simply takes too long to recover incremental and full image backups of various inter-related applications (backed up at different times), synchronize them and re-create the position as at the disaster. Continuous operation (data mirroring to off-site locations and standby computing and telecommunications) may be the only solution.
c. A risk assessment and business impact analysis should establish the
justification for continuity for specific IT and telecommunication services
and applications.
d. Achieving robust security (security assurance) is not a one-time activity. It cannot be obtained just by purchasing and installing suitable software and hardware. It is a continuous process that requires regular assessment of the security health of the organization and proactive steps to detect and fix any vulnerability. Every organization should have in place quick and reliable access to expertise for tracking suspicious behavior, monitoring users and performing forensics, along with adequate reporting to the authorities concerned.
5.5.2 Telecommunications Issues in BCP/DR
It is important to ensure that relevant links are in place and that
communications capability is compatible. The adequacy of voice and data
capacity needs to be checked. Telephonic communication needs to be
switched from the disaster site to the standby site. However, diverse
routing may be difficult to achieve since primary telecommunications
carriers may have an agreement with the same sub-carriers to provide
local access service, and these sub-carriers may also have a contract
with the same local access service providers. Financial institutions do
not have any control over the number of circuit segments that will be
needed, and they typically do not have a business relationship with any
of the sub-carriers. Consequently, it is important for financial
institutions to understand the relationship between their primary
telecommunications carrier and these various sub-carriers and how this
complex network connects to their primary and back-up facilities. To
determine whether telecommunications providers use the same sub-carrier or local access service provider, organizations may consider
performing an end-to-end trace of all critical or sensitive circuits to
search for single points of failure such as a common switch, router, PBX,
or central telephone office.
5.5.4 Outsourcing Risks
In theory, a commercial hot or warm standby site is available 24/365. It has staff skilled in assisting recovery. Its equipment is constantly kept up to date, while older equipment remains supported. It is always available for use and offers testing periods once or twice a year. The practice may be different. These days, organizations have a wide range of equipment from different vendors and different models from the same vendor. Not every commercial standby site is able to support the entire range of equipment that an organization may have. Instead, vendors form alliances with others, but this may mean that an organization's recovery effort is split between more than one standby site. The standby site may not have identical IT equipment; instead of the use of an identical piece of equipment, it will offer a partition on a compatible large computer or server. Operating systems and security packages may not be the same version as the client usually uses. These aspects may cause setbacks when attempting recovery of IT systems and applications, and weak change control at the recovery site could cause a disaster on return to the normal site.
Summary
Chapter 6
Disk System Fault Tolerance
Objective
6.1
6.2
6.3
6.4
6.5
6.6
6.7
Introduction
The primary requirement in an organization's network is disk storage, which is used to store organizational data.
Various storage technologies, such as direct-attached storage (DAS), network
attached storage (NAS), and storage-area networks (SANs) are used to store
organizational data. Stored data requires fault tolerance. Backup is used to
secure data in case disaster occurs. Data and computers must be protected
from virus attack.
6.1 Server Storage Technologies
The demand for server storage these days has increased manifold.
Consequently, server storage technologies have improved with time. Initially,
DAS technology was used to store data. However, DAS technology was used to
attach storage to only one server, thereby leading to inefficient utilization of
storage resources.
Network Attached Storage (NAS) technology was introduced mainly because of the limitations of DAS. NAS is a data storage technology which allows you to store data on a network storage location and provides data accessibility to multiple clients.
The storage area network (SAN) is an architecture that helps you to attach
remote storage devices to servers. These storage devices are attached in such a
manner that it appears as if they are attached locally to servers. SAN is not
restricted to a single server; instead, SAN storage can be moved from one server
to another.
6.2 Disk Systems Fault Tolerance
The hard disk is the basic storage device of a computer system. Compared to other hardware devices, the hard disk carries the maximum risk of failure (hard disk crash). When this happens, all data can be lost. Therefore, to keep data available and accessible, some method of fault tolerance must be implemented.
Disk Mirroring
Mirroring a drive means designating a hard disk drive in the computer as a
mirror or duplicate to another, specified drive. The two drives are attached to a
single disk controller. This disk fault tolerance feature is provided by most
Network Operating Systems (NOS). When the NOS writes data to the specified
drive, the same data is also written to the drive designated as the mirror. If the
first drive fails, the mirror drive is already online, and since it has a duplicate
of the information contained on the specified drive, the users won't know that a
disk drive in the server has failed. The NOS notifies the administrator that the
failure has occurred. The down side is that if the disk controller fails, neither
drive is available.
Disk Duplexing
As with mirroring, duplexing also saves data to a mirror drive. In fact, the only
major difference between duplexing and mirroring is that duplexing uses two
separate disk controllers (one for each disk). Thus, duplexing provides not only
a redundant disk, but a redundant controller as well. Duplexing provides fault
tolerance even if one of the controllers fails. There is now an extra disk
controller in the system. Mirroring is also known as RAID-1.
Disk Striping
From a performance point of view, writing data to a single drive is slow. When
three drives are configured as a single volume, information must fill the first
drive before it can go to the second, as well as fill the second before filling the
third. If the user configures that volume to use disk striping, the user will see a
definite performance gain. Disk striping breaks up the data to be saved to disk
into small portions and sequentially writes the portions to all disks
simultaneously in small areas called stripes. These stripes maximize
performance because all the read/write heads are working constantly. Notice
that the data is broken into sections and that each section is sequentially
written to a separate disk.
Striping data across multiple disks improves only performance; it does not
improve fault tolerance. To add fault tolerance to disk striping, it is necessary
to use parity. Disk striping is also known as RAID-0.
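As an illustration (the data, chunk size and disk count are invented for the example), the round-robin placement that striping performs can be modelled like this: consecutive stripe-sized chunks go to disk 0, 1, 2, and so on, wrapping around, so all read/write heads can work in parallel.

```python
def stripe_blocks(data: bytes, num_disks: int, stripe_size: int):
    """Distribute data across disks RAID-0 style: consecutive
    stripe_size chunks are written to disks in round-robin order."""
    disks = [bytearray() for _ in range(num_disks)]
    for i in range(0, len(data), stripe_size):
        chunk = data[i:i + stripe_size]
        disks[(i // stripe_size) % num_disks].extend(chunk)
    return [bytes(d) for d in disks]

disks = stripe_blocks(b"ABCDEFGHIJKL", num_disks=3, stripe_size=2)
print(disks)  # [b'ABGH', b'CDIJ', b'EFKL']
```

Note that the sketch shows only placement, not redundancy: losing any one of the three lists loses a third of the data, which is exactly why striping alone provides no fault tolerance.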
Parity Information
Parity, as it relates to disk fault tolerance, is a general term for the fault
tolerance information computed for each chunk of data written to a disk. This
parity information can be used to reconstruct missing data should a disk fail.
Striping can use parity or not, but if the striping technology doesn't use parity, the user won't gain any fault tolerance. When using striping with parity, the
parity information is computed for each block and written to the drive. The
advantage to using parity with striping is gaining fault tolerance. If any part of
the data gets lost or destroyed, the information can be rebuilt from the parity
information. The down side to using parity is that computing and writing parity
information reduces the total performance of a disk system that uses striping.
The parity information also reduces the total amount of free disk space.
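Parity for striped data is typically computed with XOR, which has the useful property that any one missing block can be recomputed from the surviving blocks plus the parity. The following sketch (with invented block contents) shows both the computation and the rebuild:

```python
def parity_block(blocks):
    """XOR equal-sized data blocks byte by byte. The same routine
    both computes parity and rebuilds a missing block, because
    XOR-ing the survivors with the parity yields the lost data."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Three data blocks on three disks, parity stored on a fourth.
d1, d2, d3 = b"AAAA", b"BBBB", b"CCCC"
p = parity_block([d1, d2, d3])

# Disk 2 fails: XOR the survivors with the parity to rebuild it.
rebuilt = parity_block([d1, d3, p])
print(rebuilt == d2)  # True
```

The extra XOR work on every write is the performance cost mentioned above, and the parity block itself is the disk-space cost.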
RAID 0 (Commonly used): This method is the fastest because all read/
write heads are constantly being used without the burden of parity or
duplicate data being written. A system using this method has multiple
disks, and the information to be stored is striped across the disks in
blocks without parity. This RAID level only improves performance; it
does not provide fault tolerance.
RAID 1 (Commonly used): This level uses two hard disks, one mirrored
to the other (commonly known as mirroring; duplexing is also an
implementation of RAID 1). This is the most basic level of disk fault
tolerance.
If the first hard disk fails, the second automatically takes over. No parity
or error-checking information is stored. Rather, each drive has duplicate
information of the other. If both drives fail, a new drive must be installed
and configured, and the data must be restored from a backup.
RAID 2: Individual bits are striped across multiple disks. One drive
(designated as the parity drive) in this configuration is dedicated to
storing parity data. If any data drive (a drive in this configuration that is
not the parity drive) fails, the data on that drive can be rebuilt from
parity data stored on the parity drive. At least three disk drives are
required in this configuration. This is not a commonly used
implementation.
RAID 3 (Commonly used): Data is striped across multiple hard drives
using a parity drive (similar to RAID 2). The main difference is that the
data is striped in bytes, not bits, as in RAID 2. This configuration is
popular because more data is written and read in one operation,
increasing overall disk performance.
RAID 4: This is similar to RAID 2 and 3 (striping with parity drive),
except data is striped in blocks, which facilitates fast reads from one
drive. RAID 4 is the same as RAID 0, with the addition of a parity drive.
This is not a popular implementation.
RAID 5 (Commonly used): The data and parity are striped across
several drives. This allows for fast writes and reads. The parity
information for data on one disk is stored with the data on another disk,
so if any one disk fails, the drive can be replaced, and its data can be
rebuilt from the parity data stored on the other drives. A minimum of
three disks is required. Five or more disks are most often used.
Initially, if you had to create any of the preceding volumes, you needed to first convert the basic disks into dynamic disks. However, Windows Server 2008 provides the Disk Management tool, which automatically converts a basic disk to a dynamic one while creating volumes.
A volume refers to a data storage area that is accessible by a file system, which may or may not be on a single partition of a hard disk. Apart from allowing different types of volumes to be created, dynamic disks also support extending volumes, shrinking volumes and creating mount points on a volume.
Spanned Volume
A spanned volume is a dynamic volume consisting of disk space on more than
one physical disk. If a simple volume is not a system volume or boot volume,
you can extend it across additional disks to create a spanned volume, or you
can create a spanned volume in unallocated space on a dynamic disk.
You can extend a spanned volume onto a maximum of 32 dynamic disks.
Spanned volumes are not fault tolerant. If one of the disks containing a
spanned volume fails, the entire volume fails, and all data on the spanned
volume becomes inaccessible.
[Figure: RAID 0 (Striping). Ref: thedatarescuecentre.com]
[Figure: RAID 1 (Mirrored volume). Ref: thewebshop.com]
6.3.6 RAID-5
It is also known as disk striping with parity. With disk striping with parity, you
can use three or more disks (with a maximum of 32) and data is striped across
all the disks with an additional block of error-correction called parity, which is
used to reconstruct the data in the event of a disk failure. RAID-5 has slower
write performance than the other RAID types because the OS must calculate
the parity information for each stripe that is written, but the read performance
is equivalent to RAID-0, because the parity information is not read. Like RAID-1, RAID-5 comes with additional cost considerations. For every RAID-5 set,
roughly an entire hard disk is consumed for storing the parity information.
For example, a minimum RAID-5 set requires three hard disks, and if those
hard disks are 300GB each, approximately 600GB of disk space is available to
the OS and 300GB is consumed by parity information, which equates to 33.3
percent of the available space. Similarly, in a five disk RAID-5 set of 300 GB
disks, approximately 1200 GB of disk space is available to the OS which means
that 20 percent of the total available space is consumed by the parity
information.
The capacity of the volume is limited to the smallest section of unallocated
space on any one of the disk that belongs to RAID-5 set. Suppose we want to
create a RAID-5 volume using three disks -disk1, disk2 and disk3. If Disk2 has
50GB of unallocated space, but Disks 1 and 3 have 100GB of unallocated
space, the stripe can use only 50GB of space on Disk1, Disk2 and Disk3. Thus
the space used on each disk in the volume is identical. Out of this space, the
entire space of one hard disk will be used to store parity information. So the
capacity of the RAID-5 volume created in this example will be 100GB.
In RAID-5, fault tolerance applies only to a single drive failure. If more than one drive fails, data is lost and can be recovered only by restoring it from a backup.
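The capacity arithmetic described in this section can be captured in a short helper (a sketch; the function name is invented): each member contributes only the size of the smallest disk, and one disk's worth of that space is consumed by parity.

```python
def raid5_capacity_gb(disk_sizes_gb):
    """Usable capacity of a RAID-5 set in GB.

    The stripe is limited by the smallest member disk, and one
    disk's worth of space is consumed by parity overall."""
    n = len(disk_sizes_gb)
    if n < 3:
        raise ValueError("RAID-5 needs at least three disks")
    usable_per_disk = min(disk_sizes_gb)
    return (n - 1) * usable_per_disk

print(raid5_capacity_gb([300, 300, 300]))   # 600: a third lost to parity
print(raid5_capacity_gb([300] * 5))         # 1200: 20 percent lost to parity
print(raid5_capacity_gb([100, 50, 100]))    # 100: matches the 50GB/100GB example above
```

The same formula reproduces the percentages quoted earlier: parity overhead is 1/n of the set, so it falls as disks are added.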
The following diagram shows the RAID-5 configuration involving 4 disks.
[Figure: RAID-5 volume. Ref: blog.everycity.co.uk]
Let us now see how the Disk Management utility in Microsoft Windows OS can
be used to perform the tasks mentioned above.
The top pane of the Disk Management console window actually displays volumes only for dynamic disks; on basic disks, the top pane
contains a list of the primary partitions and logical drives.
[Figure: Disk Management]
Each entry in the volume list contains the following information:
Volume: Specifies the drive letter and/or volume name
Layout: Specifies the volume type, such as simple, spanned, or striped
for volumes on dynamic disks, or partitions for basic disks.
Type: Specifies the type of disk on which the volume is located: basic or dynamic
File System: Specifies the file system that was used to format the
volume.
Status: Specifies the current status of the volume, using one of the
following values:
a. Failed - Indicates that the volume could not be started
b. Failed Redundancy - Indicates that a mirrored or RAID-5 volume is
no longer fault tolerant because of a disk failure
c. Formatting - Indicates that the volume is in the process of being
formatted
d. Healthy - Indicates that the volume is operating normally
e. Regenerating - Indicates that a RAID-5 volume is in the process of
re-creating data on a newly restored disk
Copyright Intelligent Quotient System Pvt. Ltd. |
The bottom pane of the Disk Management console window contains a graphical
view of the physical disks in the computer. For each disk, the view specifies the
following information:
Disk identifier - Specifies the number assigned to the disk by the system.
Hard disk identifiers begin with Disk 0, and CD-ROM identifiers begin with
CD-ROM 0.
Disk type - Specifies whether the disk is a basic disk, dynamic disk, CD-ROM,
or DVD-ROM.
Disk size - Specifies the total capacity of the disk.
Disk status - Specifies the current status of the disk, using one of the
following values:
a. Audio CD- Indicates that a CD-ROM or DVD-ROM drive contains an
audio CD.
b. Foreign- Indicates a dynamic disk that has been moved from another
computer but has not yet been imported into the current system's
configuration. Run the Import Foreign Disks command to access the
disk.
c. Initializing- Indicates that the disk is in the process of being converted
from a basic disk to a dynamic disk.
d. Missing- Indicates that a dynamic disk has been removed from the
computer, disconnected, or corrupted. Use the Reactivate Disk command
to access a previously disconnected disk.
e. No Media- Indicates that a CD-ROM, DVD-ROM, or removable disk drive
is currently empty.
f. Not Initialized- Indicates that the disk does not contain a valid
signature. Use Initialize Disk to activate the disk.
g. Online- Indicates that the disk is accessible and functioning normally.
h. Online (Errors) -Indicates that I/O errors have been detected on a region
of a dynamic disk.
i. Offline- Indicates that a dynamic disk is not accessible.
Fig showing the New Partition Wizard
If you create a primary partition, the wizard takes you through the process of
assigning a drive letter to the partition and formatting it, or you can choose to
perform these tasks later. If you create an extended partition, you must select
the Free Space area you just created and run the New Partition Wizard again,
this time opting to create a logical drive. You can create any number of logical
drives you want, until you have used all of the space in the extended partition.
Here again, the wizard enables you to format the logical drives as you create
them, or you can choose to format them later.
6.4.3 Converting a Basic Disk to a Dynamic Disk
If you want to use dynamic storage, you must convert a basic disk to a
dynamic disk before creating new volumes. However, when you convert the
system disk to dynamic storage, you must restart the system before you can
perform any further actions on the disk.
You can convert a basic disk to a dynamic disk at any time, even when you
have data stored on the disk. The structure of data on the disk is not modified,
so the existing data is not lost. However, the best practice when performing any
major disk manipulation is to back up your data first.
When you convert a basic disk that already contains partitions and logical
drives to a dynamic disk, those elements are converted to the equivalent
dynamic disk elements. In most cases, basic partitions and logical drives are
converted to simple volumes.
By default, the disk you chose when creating the volume appears in the
Selected list. All of the other dynamic disks in the computer appear in the
Available list. To add a disk to the volume, you make a selection in the
Available list and click Add.
You can add up to 32 disks to a spanned, striped, or RAID-5 volume; a mirrored
volume uses only two disks.
Once you have selected the disks you want to use to create the volume, you
must specify the volume's size. The process varies slightly, depending on the
type of volume you are creating:
Spanned volumes can use any amount of space from each of the drives.
For each of the disks in the selected list, you specify the amount of space (in
megabytes) that you want to add to the spanned volume. The Total Volume Size
in Megabytes (MB) field displays the combined space from all the selected
drives.
Striped, mirrored, and RAID-5 volumes must use the same amount of space on
each of the selected disks. After you select the disks you want to use for the
volume, the Select The Amount Of Space In MB option specifies the maximum
amount of space that each disk can contribute, which is determined by the
disk with the least amount of free space. When you change the amount of
space for one disk, the wizard changes the amount of space contributed by the
other disks to match.
The total size of the volume is also calculated differently for the various volume
types:
For a spanned volume, the total size of the volume is the number of megabytes
you specified for the selected disks combined.
For a striped volume, the total size of the volume is the number of megabytes
you specified, multiplied by the number of disks you selected.
For a mirrored volume, the total size of the volume is the number of megabytes
you specified. This is because each of the disks contains an identical copy of
the data on the other disks.
For a RAID-5 volume, the total size of the volume is the number of megabytes
you specified, multiplied by the number of disks you selected minus one. This
is because the RAID-5 volume uses one disk's worth of space to store the parity
for the rest of the disk array.
After you configure these parameters, the wizard enables you to assign a drive
letter to the volume and format it, so that it is ready to use.
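The sizing rules above can be summarized in a short sketch. This is illustrative Python only, not part of any Windows tool; the wizard applies these rules automatically:

```python
def volume_size_mb(volume_type, per_disk_mb):
    """Total size of a dynamic volume from the space contributed per disk.

    per_disk_mb is a list with one entry per member disk. Spanned volumes
    may mix sizes; striped, mirrored, and RAID-5 volumes use the same
    amount on every disk.
    """
    n = len(per_disk_mb)
    if volume_type == "spanned":
        return sum(per_disk_mb)             # combined space of all members
    if volume_type == "striped":
        return per_disk_mb[0] * n           # same amount striped across each disk
    if volume_type == "mirrored":
        return per_disk_mb[0]               # the second disk is an identical copy
    if volume_type == "raid5":
        return per_disk_mb[0] * (n - 1)     # one disk's worth stores parity
    raise ValueError("unknown volume type: " + volume_type)

print(volume_size_mb("spanned", [500, 300, 200]))   # 1000
print(volume_size_mb("striped", [500, 500, 500]))   # 1500
print(volume_size_mb("mirrored", [500, 500]))       # 500
print(volume_size_mb("raid5", [500, 500, 500]))     # 1000
```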
Working with Mirrored Volumes
A mirrored volume is also known as RAID-1 and requires exactly two disks.
A mirrored volume provides good performance along with excellent fault
tolerance. Two disks participate in a mirrored volume, and all data is written to
both volumes simultaneously. For the best possible fault tolerance, you should
use disks connected to separate host adapters. This creates a configuration
called duplexing, which provides better performance and enables the volume to
survive an adapter failure as well as a disk failure.
Converting a simple volume to a mirrored volume:
In addition to creating a new mirrored volume, you can also convert a simple
volume into a mirrored volume by selecting the simple volume and, on the
Action menu, pointing to All Tasks and selecting Add Mirror. You must have
another dynamic disk in the computer with sufficient unallocated space to hold
a copy of the simple volume you selected.
Once you have created the mirrored volume, the system begins copying data,
sector by sector, to the newly added disk. During that time, the volume status
is reported as resyncing.
Recovering from Mirrored Disk Failures:
The recovery process for a failed disk within a mirrored volume depends on the
type of failure. If a disk has experienced transient I/O errors, the volumes on
that disk are shown with a Failed Redundancy status.
After you correct the cause of the I/O error (perhaps a bad cable connection or
power supply), select the volume on the problematic disk and, on the Action
menu, point to All Tasks and select Reactivate Volume. Or you can select the
disk and choose Reactivate Disk. Reactivating brings the disk or volume back
online. The system then resynchronizes the disks.
If you want to stop mirroring, you have three choices, depending on what you
want the outcome to be:
Delete the volume: If you delete the volume, the volume and all the
information it contains are removed. The resulting unallocated space is
then available for new volumes.
Remove the mirror: If you remove the mirror, the mirror is broken and
the space on one of the disks becomes unallocated. The other disk
maintains a copy of the data that had been mirrored, but that data is of
course no longer fault tolerant.
Break the mirror: If you break the mirror, the mirror is broken but both
disks maintain copies of the data. The portion of the mirror that you
select when you select Break Mirror maintains the original mirrored
volume's drive letter, shared folders, paging file, and reparse points. The
secondary drive is given the next available drive letter.
If you have a mirrored volume in which one physical disk has failed completely
and must be replaced, you can't simply re-mirror the mirrored volume, even
though one of the disks in the mirror set no longer exists. You must first
remove the failed disk from the mirror set to break the mirror. Select the
volume and, on the Action menu, point to All Tasks and select Remove Mirror.
In the Remove Mirror dialog box, it is important to select the disk that is
missing. The disk you select is deleted when you click Remove Mirror, and the
remaining disk becomes a simple volume. Once the operation is complete, you
can select the simple volume and use the Add Mirror command to use the
replacement disk to create a new mirror volume.
Working with RAID
As mentioned earlier in this chapter, RAID is a series of fault tolerance
technologies that enable a computer or operating system to respond to a
catastrophic event, such as a hardware failure, so that no data is lost and work
in progress is not corrupted or interrupted. You can implement RAID fault
tolerance as either hardware or a software solution. In a hardware solution, a
RAID adapter handles the creation and regeneration of redundant information.
Some vendors implement RAID data protection directly in their hardware, as
with disk array adapter cards. Because these methods are vendor specific and
bypass the operating system's fault-tolerance software drivers, they offer
performance improvements over software implementations of RAID, like those
included in Windows Server 2003 and Server 2008.
Consider the following points when you decide whether to use a software or
hardware RAID implementation:
Windows Server 2003 and 2008 support three levels of RAID: RAID-0, RAID-1,
and RAID-5. Only RAID-1 and RAID-5 are fault tolerant.
The fault tolerance applies only to a single drive failure. If more than one disk
fails, data is lost and can be recovered only by restoring it from a backup.
Because RAID-5 volumes are created as native dynamic volumes from
unallocated space, you cannot convert any other type of volume into a RAID-5
volume without backing up that volume's data and restoring it into a newly
created RAID-5 volume.
If a single disk fails in a RAID-5 volume, the entire data store on the volume
remains accessible. During read operations, any missing data is regenerated on
the fly through a calculation involving remaining data and parity. Performance
is degraded during this time, and if a second drive fails, data is lost
irretrievably.
Once the failed drive is returned to service, you might need to use the Rescan
Disks command in Disk Management and then reactivate the volume on the
newly restored disk. The system then rebuilds missing data from the parity
information and repopulates the disk, leaving the volume fully functional and
fault tolerant again.
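The on-the-fly regeneration described above relies on XOR parity. The following is a minimal sketch in Python with made-up data blocks; a real RAID-5 implementation works at the block-device level and rotates parity across the member disks:

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Three data disks and the parity computed from them.
d1, d2, d3 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d1, d2, d3])

# If disk 2 fails, its contents are regenerated from the surviving disks
# and the parity -- the calculation performed during read operations
# while a single drive is down.
recovered = xor_blocks([d1, d3, parity])
print(recovered == d2)  # True
```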
6.6 Backup Considerations
Although an organization can never be completely prepared for every natural
disaster or human foible that can bring down the network, it can make sure
that a solid backup plan is in place to minimize the impact of lost data.
Even if the worst happens, you don't have to lose days or weeks of work,
provided that you have a solid plan in place. A backup plan is the set of
guidelines and schedules that determine which data should be backed up and
how often. A backup plan includes information such as:
What to back up
Where to back it up
When to back up
How often to back up
Who should be responsible for backups
Where media should be stored
How often to test backups
The procedure to follow in case of data loss
proper place. This keeps all user data centralized and makes it easy for the
administrator to back up the data.
Full Backup
In a full backup, all network data is backed up (without skipping any files).
This type of backup is straightforward: you simply tell the software
which servers (and, if applicable, workstations) to back up and where to back
up the data, and then start the backup. If you have to do a restore after a
crash, you have only one set of tapes to restore from (as many tapes as it took
to back up everything). Simply insert the most recent full backup into the drive
and start the restore.
Differential Backup
longer, but each differential backup takes much less time than a full backup.
This type of backup is used when the amount of time each day available to
perform a system backup (called the backup window) is smaller during the
week and larger on the weekend.
Notice that the amount of data becomes gradually larger every day as the
number of files that need to be backed up increases. Remember that the
archive bit isn't cleared each day, so by the end of the week the files that
changed at the beginning of the week may have been backed up several times,
even though they haven't changed since the first part of the week.
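The way the archive bit drives the different backup types can be sketched as follows. This is an illustrative Python model with made-up file names, not any Windows API: a full backup captures everything and clears the bit, an incremental backup captures changed files and clears the bit, and a differential backup captures changed files but leaves the bit set.

```python
def run_backup(files, kind):
    """Return the files captured, updating archive bits as each type does.

    files maps a file name to its archive bit (True = changed since the
    last backup that cleared the bit).
    """
    captured = [name for name, archive in files.items()
                if kind == "full" or archive]
    if kind in ("full", "incremental"):   # these backup types clear the bit
        for name in captured:
            files[name] = False
    # a differential backup leaves the bit set, so changed files keep
    # being captured until the next full backup
    return captured

files = {"a.doc": True, "b.xls": True}
print(run_backup(files, "full"))          # both files; bits cleared
files["a.doc"] = True                     # a.doc modified on Monday
print(run_backup(files, "differential"))  # a.doc only; bit stays set
print(run_backup(files, "differential"))  # a.doc again, as the text notes
```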
Incremental Backup
To access backup and recovery tools for Windows Server 2008, you must
install the Windows Server Backup, Command-line Tools, and Windows
PowerShell items that are available in the Add Features Wizard in Server
Manager. This installs the following tools:
Windows Server Backup Microsoft Management Console (MMC)
snap-in
Wbadmin command-line tool
Windows Server Backup cmdlets (Windows PowerShell commands)
6.6.3 Scheduled Backup
Scheduled backups are data backup processes which proceed
automatically on a scheduled basis without additional computer or user
intervention. The advantage of using scheduled backups instead of
manual backups is that a backup process can be run during off-peak
hours when data is unlikely to be accessed, precluding or reducing the
impact of backup downtime.
Scheduled backups allow you to automate the backup process. After you
set the schedule, Server Backup takes care of everything else. You can
set the schedule according to your organization's requirements.
6.6.4 Remote Backup
A remote, online, or managed backup service, sometimes marketed as
cloud backup, is a service that provides users with a system for the
backup and storage of computer files. Online backup providers are
companies that provide this type of service to end users (or clients). Such
backup services are considered a form of cloud computing.
Online backup systems are typically built around a client software
program that runs on a schedule, typically once a day, and usually at
night while computers aren't in use. This program typically collects,
compresses, encrypts, and transfers the data to the remote backup
service provider's servers or off-site hardware.
The Windows Server Backup tool can be used to connect to another
Windows Server 2008 computer and perform backup tasks as though the
backup were being performed on the local computer.
6.6.5 Offsite Backup
Offsite backups ensure that if the building that hosts your servers is
destroyed by flood, fire, or earthquake, your organization can still recover
its data.
Backup on Windows 7
In Windows 7, you are provided with the Backup and Restore window, which
contains various options that allow you to back up and restore the information
present on your system. While creating a backup of the important files, the first
thing that you should consider is the backup destination, that is, where the
backup should be stored. You can save the backup of your data on any of
these storage devices:
After deciding on the backup destination for your data, the next step is to
create the backup. Windows 7 supports the following two kinds of backup:
want to restore not only the saved files but also all the running
applications on the computer.
System Image is an exact image of the drive on which the
Windows operating system is installed, including system
settings, programs, and files. A system image helps in restoring
the contents of your computer, including the drive on which
Windows is installed. Therefore, if you use a system image
to restore your computer, you perform a complete
restoration of the system instead of restoring only selected files
and folders.
At the time of creating the System Image backup, a Windows
Image Backup folder is automatically created on the backup
media. A folder having the same name as your computer is
created in this backup folder to store the image of your system.
Two folders, namely the Catalog folder and the Backup folder,
are further created inside your computer folder.
Backup on Windows XP
In Windows XP, the Backup utility can be used to restore backups as
well as to create an Automated System Recovery (ASR) disk, which can be
used to repair the system after serious errors.
6.6.7 Volume Shadow Copy
Volume Shadow Copy (Volume Snapshot Service or Volume Shadow
Copy Service or VSS), is a technology included in Microsoft Windows that
allows taking manual or automatic backup copies or snapshots of data,
even if it has a lock, on a specific volume at a specific point in time over
regular intervals. It is implemented as a Windows service called the
Volume Shadow Copy service. Shadow Copy technology requires the file
system to be NTFS to be able to create and store shadow copies. Shadow
Copies can be created on local and external (removable or network)
volumes by any Windows component that uses this technology, such as
when creating a scheduled Windows Backup or automatic System
Restore point.
Snapshots have two primary purposes: they allow the creation of
consistent backups of a volume, ensuring that the contents cannot
change while the backup is being made; and they avoid problems with
file locking. By creating a read-only copy of the volume, backup programs
are able to access every file without interfering with other programs
writing to those same files.
The Volume Shadow Copy Service provides the backup infrastructure for
the Microsoft Windows XP, Microsoft Windows Server 2003, Windows 7
and Microsoft windows server 2008 operating systems, as well as a
mechanism for creating consistent point-in-time copies of data known as
shadow copies.
112
The VSS service automatically takes a snapshot of the files and folders
located on any volume or partition where the service has been enabled.
These snapshots include an image of the contents of the folder at a given
point in time. Depending on the space you make available to it, you
could have up to 512 different snapshots of a disk volume. And because
Microsoft has made a client component of VSS, the Previous Versions
client, available along with VSS, users and administrators can have
access to these snapshots.
On regular file servers, this means that once VSS is implemented, users
can recover any lost file by themselves, from the privacy of their own
desk. The Shadow Copy service is designed to assist in the process of
recovering previous versions of files without having to resort to backups.
VSS only works well if you have a lot of free space on your disks, but
nevertheless, it is a good solution and requires very little overhead to run.
Shadow copies can never be a replacement for backup, because files are
not backed up. So if a shadow copy is no longer available, the files are
not available either.
By default, Windows Server 2008 creates shadow copies twice a day: at
7:00 A.M. and noon. This schedule can be changed if you find that it does
not meet your requirements.
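The snapshot behavior described above can be illustrated with a toy copy-on-write model. This is a sketch in Python of the general idea only, not the actual VSS implementation, and all file names are made up: the snapshot is a view frozen at a point in time, while the live volume keeps changing.

```python
class Volume:
    """Toy volume with copy-on-write snapshots."""

    def __init__(self):
        self.blocks = {}      # live data: name -> contents
        self.snapshots = []   # each snapshot stores only overwritten blocks

    def write(self, name, data):
        # preserve the old contents in every snapshot that hasn't saved it yet
        for snap in self.snapshots:
            if name not in snap and name in self.blocks:
                snap[name] = self.blocks[name]
        self.blocks[name] = data

    def snapshot(self):
        snap = {}
        self.snapshots.append(snap)
        return snap

    def read_snapshot(self, snap, name):
        # the snapshot holds the frozen copy; otherwise the block is
        # unchanged and the live data is still valid for that point in time
        return snap.get(name, self.blocks.get(name))

vol = Volume()
vol.write("report.doc", "v1")
snap = vol.snapshot()            # like a 7:00 A.M. shadow copy
vol.write("report.doc", "v2")    # the user keeps editing afterward
print(vol.read_snapshot(snap, "report.doc"))  # the previous version, "v1"
print(vol.blocks["report.doc"])               # the live file, "v2"
```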
Sometimes, even the System Restore feature is not able to recover the lost data.
In such situations, you can implement any of the following three methods to
recover the lost data.
System Protection
Creates and saves information about the system files and settings in restore
points, which are created just before you start installing a program or device.
The system protection feature is turned on by default for the drive that holds
the operating system.
Specifies that you can either use a system image to restore your computer or
reinstall Windows. You can access these options from the Advanced Recovery
Methods window.
This rewrites the complete content of the system volume. To recover your
system using this feature, you need to first boot your system from the Windows 7
installation DVD-ROM. Go to the Advanced Boot Options screen by pressing the
F8 key and select the Repair Your Computer option. Next, select the Image
Recovery option and specify the location of the backup.
6.7 Virus Protection
A virus is a program that causes malicious changes in your computer and
makes copies of itself. Sophisticated viruses encrypt and hide themselves to
thwart detection. There are tens of thousands of viruses that your computer
can catch.
Viruses can shut down an entire corporation. The types vary, but the approach
to handling them does not. You need to install virus protection software on all
computer equipment. Workstations, personal computers, servers, and firewalls
all must have virus protection, even if they never connect to your network.
They can still get viruses from removable storage media or Internet downloads.
As viruses can cause great damage to your organization and data, it is
necessary to detect and eradicate viruses from computer and network.
Types of Viruses
Several types of viruses exist, but the popular ones are file viruses, macro (data
file) viruses, and boot sector viruses. Each type differs slightly in the way it
works and how it infects your system. Many viruses attack popular
applications such as Microsoft Word, Excel, and PowerPoint; these
applications are easy to use, and it's easy to create a virus for them. Because
writing a unique virus is considered a challenge to a bored programmer,
viruses are becoming more and more complex and harder to eradicate.
File Viruses
A file virus attacks executable application and system program files, such as
those ending in .COM, .EXE, and .DLL. Most of these types of viruses replace
some or all of the program code with their own. Only once the file is executed
can the virus cause its damage. This includes loading itself into memory and
waiting to infect other executables, further propagating its potentially
destructive effects throughout a system or network. Examples of file viruses
are Jerusalem and Nimda; although Nimda is usually seen as an Internet
worm, it may also infect common Windows files, as well as files with
extensions such as .HTML, .HTM, and .ASP.
Macro Viruses
A macro is a series of commands and actions that are used to automatically
perform operations without a user's intervention. Macro viruses use the Visual
Basic macro scripting language to perform malicious or mischievous functions
in data files created with Microsoft Office products. Macro viruses are the most
harmless but also the most annoying viruses. Since macros are easy to write,
macro viruses are among the most common viruses and are frequently found in
Microsoft Word and PowerPoint. They affect the file you are working on. For
example, you might be unable to save the file even though the Save function is
working, or you might be unable to open a new document; you can only open
a template. These viruses will not crash your system, but they are annoying.
Cap and Cap A are examples of macro viruses.
Boot Sector Viruses
Boot sector viruses get into the Master Boot Record (MBR). This is track one,
sector one on your hard disk and no applications are supposed to reside there.
At boot-up, the computer checks this section to find a pointer to the operating
system. If you have a multi-operating-system boot between various versions or
instances of Windows, this is where the pointers are stored. A boot sector virus
will overwrite the boot sector, thereby making it look as if there is no pointer to
your operating system. When you power up the computer, you will see a
Missing Operating System or Hard Disk Not Found error message. Monkey B,
Michelangelo, Stoned, and Stealth Boot are examples of boot sector viruses.
Nearly any virus that falls under one of these three categories can be
implemented as a Trojan horse. Just as the Greeks in legend attacked Troy by
hiding within a giant horse, a Trojan virus hides within other programs and is
launched when the program in which it is hiding is launched. DMSETUP.EXE
and LOVE-LETTER-FOR-YOU.TXT.VBS are examples of known Trojan Horses.
Displaying extensions for known file types can help you remain vigilant against
such naming tricks. These are only a few of the types of viruses out there.
Updating Antivirus Components
A typical antivirus program consists of two components:
The definition files
The engine
The definition files list the various viruses, their type, and their footprints and
specify how to remove them. More than 100 new viruses are found in the wild
each month. An antivirus program would be useless if it did not keep up with
all the new viruses. The engine accesses the definition files (or database), runs
the virus scans, cleans the files, and notifies the appropriate people and
accounts. Eventually viruses become so sophisticated that a new engine and
new technology are needed to combat them effectively.
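The two-component design described above can be illustrated with a toy scanner. This is an illustrative Python sketch only; the signatures and file contents are made up, and real engines use far more sophisticated matching than a byte search.

```python
# Toy "definition file": a name for each virus and its byte footprint.
# These signatures are hypothetical, invented purely for the example.
DEFINITIONS = {
    "EICAR-like-test": b"X5O!P%@AP",
    "Jerusalem-like": b"\xde\xad\xbe\xef",
}

def scan(data, definitions=DEFINITIONS):
    """The toy "engine": return the names of every signature found."""
    return [name for name, sig in definitions.items() if sig in data]

clean = b"just an ordinary document"
infected = b"header X5O!P%@AP payload"
print(scan(clean))     # no matches
print(scan(infected))  # the matching signature name
```

Updating the definition files corresponds to replacing DEFINITIONS with a newer set; updating the engine corresponds to replacing the scan logic itself.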
On-Demand Scans
An on-demand scan is a virus scan initiated by either a network administrator
or a user. You can manually or automatically initiate an on-demand scan.
Typically, you'd schedule a monthly on-demand scan, but you'll also want to do
an on-demand scan in the following situations:
After you first install the antivirus software.
When you upgrade the antivirus software engine.
When you suspect a virus outbreak.
Before you initiate an on-demand scan, be sure that you have the latest
virus definitions. When you encounter a virus, scan all potentially affected
hard disks and any floppy disks that could be suspicious. Establish a
cleaning station, and quarantine the infected area. Ask all users in the
infected area to stop using their computers. Perform a scan and clean at the
cleaning station. Run a full scan and clean the entire system on all
computers in the office space.
On-Access Scans
An on-access scan runs in the background when you open a file or use a
program. For example, an on-access scan can run when you do any of the
following:
Insert a floppy disk
Download a file with FTP
Receive e-mail messages and attachments
View a web page
The scan slows the processing speed of other programs, but it is worth the
inconvenience.
A relatively new form of malicious attack makes its way to your computer
through ActiveX and Java programs (applets). These are miniature programs
that run on a web server or that you download to your local machine. Most
ActiveX and Java applets are safe, but some contain viruses or snoop
programs. The snoop programs allow a hacker to look at everything on your
hard drive from a remote location without your knowledge. Be sure that you
properly configure the on-access component of your antivirus software to check
and clean for all these types of attacks.
There is a host of great shareware and freeware available on the Internet today.
Titles include Microsoft Antispyware, Spybot Search & Destroy and Ad-Aware,
as well as Windows Update.
Many programs will not install unless you disable the on-access portion of your
antivirus software. This is dangerous if the program has a virus. Your safest
bet is to do an on-demand scan of the software before installation. Disable
on-access scanning during installation, and then reactivate it when the
installation is complete.
Emergency Scans
In an emergency scan, only the operating system and the antivirus program are
running. An emergency scan is called for after a virus has invaded your system
and taken control of a machine. In this situation, insert your antivirus
emergency boot disk and boot the infected computer from it. Then scan and
clean the entire computer.
Another possibility is to use an emergency scan website like
housecall.trendmicro.com. It allows you to scan your computer over a
high-speed Internet connection without using an emergency disk.
Software Revisions
Patches, fixes, service packs, and updates are all the same thing: free software
revisions. These are intermediary solutions until a new version of the product
is released. They may solve a particular problem, as does a security patch, or
change the way your system works, as does an update. You can apply a
so-called hot patch without rebooting your computer; in other cases, applying a
patch requires that the server go down.
Necessity of patches
Because patches are designed to fix problems, it would seem that you would
want to download the most current patches and apply them immediately. That
is not always the best thing to do. Patches can sometimes cause problems with
existing, older software. Different opinions exist regarding the application of the
newest patches. The first opinion is to keep your systems only as up-to-date as
necessary to keep them running. This is the "if it isn't broken, don't fix it"
approach. After all, the point of a patch is to fix your software. Why fix it if it
isn't broken? The other opinion is to keep the software as up-to-date as
possible because of the additional features that a patch will sometimes provide.
You must choose the approach that is best for your situation.
Where to Get Patches
Patches are available from several locations:
The manufacturer's website
The manufacturer's CD or DVD
The manufacturer's support subscriptions on CD or DVD
The manufacturer's bulletin
You'll notice in every case that the source of the patch, regardless of the
medium being used to distribute it, is the manufacturer. You cannot be sure
that patches available through online magazines, other companies, and
shareware websites are safe. Also, patches for the operating system are
sometimes included when you purchase a new computer.
How to Apply Patches
Just as you always need to plan for an upgrade, you need to plan for a patch.
Never blindly install patches (or any other new software) without examining the
potential impact on the network. Although patches are designed to fix known
problems, they may create new ones. It is best to try patches on a test network
or system before installing them on all systems on the network.
Summary
Case Study
The World Trade Center Disaster: Who Was Prepared?
A little after 8 a.m. on Tuesday morning, September 11, 2001, four
cross-country passenger jetliners with loaded fuel tanks were hijacked. One
was crashed into a section of the Pentagon; another plunged into the
Pennsylvania countryside when passengers prevented the hijackers from
hitting their target. The other two planes were crashed into New York City's
two World Trade Center (WTC) towers, ultimately causing them to collapse
and killing nearly 3,000 people.
All WTC offices were destroyed, a total of over 15 million square feet of office
space, an area equal to all Atlanta office space. Some of the nearby buildings,
including the World Financial Center (WFC), the American Express Building,
and 1 Liberty Plaza, were badly damaged and were immediately evacuated.
Some may have to be demolished. With the New York Stock Exchange (NYSE)
located so very close, the WTC area was the center of global finance and many
nearby financial firms were also adversely affected. Also affected were many
other companies, such as Lufthansa Airlines, and New York recruiting firm
Digital Market Research Inc., which lost telephone service and contact with
customers for a number of days because their telecommunications providers
were located in or near the WTC complex.
The financial industry's equipment loss was immense. The Tower Group
technology research company estimated that the securities firms alone would
spend up to $3.2 billion just to replace computer equipment. Much of the WTC
IT and telecommunications equipment was underground and was destroyed by
the collapsed debris. Tower calculates replacements will include 16,000
trading-desk workstations, 34,000 PCs, 8,000 servers, plus large numbers of
information computer terminals, printers, storage devices, and network hubs
and switches. Setting up this equipment will cost an additional $1.5 billion.
The most vital issue for many companies was their loss of staff. Few recovery
plans anticipated such a catastrophe. Organizations that were directly hit did
not even know who in their companies had survived or where they were
because hardly any kept secure, accessible lists of employees or contact
information. The New York Board of Trade (NYBT), which had its trading floor
in the WTC where it dealt in such commodities as coffee, orange juice, cocoa,
sugar and cotton, had to call all employees, one by one. Often survivors
couldn't be reached because area telephone facilities were destroyed while any
working circuits were overloaded. A few companies had considered some staff
problems. The Nasdaq stock exchange, with headquarters at nearby 1 Liberty
Plaza, had required many managers to carry two cell phones in case both the
regular telephone network and one cell phone failed. It also required every
employee, from the chairman on down, to carry a card with the crisis line
number.
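The staffing lesson above (keep secure, offsite-accessible lists of employees, with more than one way to reach each person) can be sketched as a small roster check. The names, numbers, and field layout below are fictional illustrations, not any firm's actual records.

```python
# Sketch of a crisis contact roster of the kind the text says most
# firms lacked on 9/11, plus a check for employees who would become
# unreachable if a single phone network failed.

import json

roster = [
    {"name": "A. Trader", "desk": "cocoa", "phones": ["212-555-0101", "917-555-0101"]},
    {"name": "B. Clerk",  "desk": "sugar", "phones": ["212-555-0102"]},
]

def unreachable_by_single_network(roster):
    """Employees listed with only one number: the loss of one phone
    network leaves no way to contact them."""
    return [p["name"] for p in roster if len(p["phones"]) < 2]

# Keep a serialized copy away from the primary site so it survives
# the loss of the building (here, just a JSON string).
offsite_copy = json.dumps(roster)

print(unreachable_by_single_network(roster))  # ['B. Clerk']
```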
Copyright Intelligent Quotient System Pvt. Ltd. |
Disaster recovery companies did provide some work space for their customers.
Comdisco had seven WTC customers, and it made space available for 3,000
customer employees, enabling those companies to continue operations. Some
recovery companies, including SunGard, made available tractor-trailers
equipped with portable data centers. Not all plans worked. Barclays Bank had
planned for evacuating its 1,200-person investment-banking unit to its disaster
recovery site in New Jersey, but the site proved to be too small for so many
employees. Moreover, the bridges and tunnels crossing the Hudson River were
immediately closed, so most employees could not get there. Fortunately, Barclays
was able to shift much of its work to its London, Hong Kong, and Tokyo offices,
although the time differences forced those workers to do double shifts.
Data loss is extremely critical, often requiring extensive planning. Many
organizations already relied on disaster recovery companies such as SunGard,
Comdisco and Recall, which offer office space, computers, and
telecommunications equipment when disasters occur. "Cold site" recovery
requires the companies to back up their own data onto tapes, storing them
offsite. If a disaster occurs, the organizations transport their backup tapes to
the recovery sites where they load and boot their applications from scratch
using their backup tapes. Although the cold site approach is relatively
inexpensive, restoring data can be slow, often taking up to 24 hours. If the
tapes are stored at the affected site or relatively close by, all data may be
permanently lost, which could put some companies out of business. Moreover
the data for all activity since the last backup will be lost.
"Hot site" backups can solve some problems, but it could cost some companies
as much as $1 million monthly. A hot site is located offsite where a reserve
computer continually creates a mirror image of the production computer's
data. Should a data disaster occur, the company can quickly switch over to the
backup computer and continue to operate. If the production site itself is
destroyed, the staff will actually go to the hot site to operate.
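The cold-site and hot-site trade-offs just described reduce to three numbers: how much data can be lost (recovery point), how long until operations resume (recovery time), and cost. The figures in this sketch are illustrative assumptions echoing the text's estimates, not vendor quotes.

```python
# Sketch: choosing between the recovery options described above by
# matching their worst-case data loss and downtime against what the
# business can tolerate. Costs and hours are illustrative only.

from dataclasses import dataclass

@dataclass
class RecoveryOption:
    name: str
    monthly_cost_usd: int   # illustrative figure
    rpo_hours: float        # worst-case window of lost data
    rto_hours: float        # worst-case time to resume operations

cold_site = RecoveryOption("cold site", 20_000, rpo_hours=24, rto_hours=24)
hot_site  = RecoveryOption("hot site", 1_000_000, rpo_hours=0.0, rto_hours=0.5)

def cheapest_meeting(options, max_rpo, max_rto):
    """Cheapest option that still meets the recovery objectives,
    or None if nothing qualifies."""
    ok = [o for o in options
          if o.rpo_hours <= max_rpo and o.rto_hours <= max_rto]
    return min(ok, key=lambda o: o.monthly_cost_usd) if ok else None

# A firm that can tolerate losing a day of data can use the cold site;
# a firm that cannot must pay for the hot site.
choice = cheapest_meeting([cold_site, hot_site], max_rpo=24, max_rto=24)
print(choice.name)  # cold site
```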
While many companies lost a lot of data in the attack, a recent Morgan Stanley
technology team report said the WTC was "probably one of the best-prepared
office facilities from a systems and data recovery perspective." Lower
Manhattan's extraordinary data security concern erupted in 1993 when a large
bomb exploded in the subterranean parking area of the WTC in a terrorist
attack. Six people were killed and more than 1,000 were injured. Realizing how
vulnerable they were, many companies took steps to protect themselves.
Pressures for emergency planning further increased as companies faced the
feared Y2K problems. As a result, the data for many organizations were
relatively well protected when the 9/11 WTC attack occurred. Let us look at
how some organizations responded to the attack.
Prior to 1993, to protect itself, the NYBT had contracted with SunGard Data
Systems Inc. for "cold site" disaster recovery. After the 1993 bombing it decided
to establish its own hot site. It rented a computer and trading floor space in
Queens for $300,000 annually. It hired Comdisco to help it set up the hot
backup site, which it hoped to never have to use despite the expense. After the
attack the NYBT quickly moved its operations to Queens and began trading on
September 17, along with the NYSE, Nasdaq, and the other exchanges that had
not suffered direct hits.
Sometimes backups are too limited. Most disaster recovery companies and
their clients have been too focused on recovery of mainframes and needed
extensive help to recover midrange systems and servers. Moreover, backups are
often stored in the same office or site and so are useless if the location is
destroyed. For example the Board of Trade backed up only some servers and
PCs, and those backups were stored in a fireproof safe in the WTC where they
are now buried beneath many thousands of tons of rubble.
Giant bond trader Cantor Fitzgerald occupied several top floors in one of the
WTC buildings and lost its offices and perhaps 700 of its 1000 American staff.
No company could have adequately planned for the magnitude of its disaster.
However Cantor was almost immediately able to shift its functions to its
Connecticut and London offices and its surviving U.S. traders began settling
trades by telephone. Despite its enormous losses, the company amazingly
resumed operations in just two days, partly with the help of backup
companies, software and computer systems. One reason for its rapid recovery
was Recall, Cantor's disaster recovery company. Recall had up-to-date Cantor
data because it had been picking up Cantor backup tapes three to five times
daily. Moreover, in 1999 Cantor had started switching much of its trading to
eSpeed, its fully automated online system. Investors were attracted partly
because users of eSpeed were given a 10% discount. After the WTC disaster
Peter DaPuzzo, a founder and head of Cantor Fitzgerald, decided that the
company would not replace any of the over 100 lost bond traders. Instead the
company switched its entire bond trading to eSpeed.
America's oldest bank, the Bank of New York (BONY), is a critical hub for
securities processing because it is one of the largest custodians and clearing
institutions in the United States. Half the trading in U.S. government bonds
moves through its settlement system. The bank also handles around 140,000
fund transfers totaling $900 billion every day. Since the bank facilitates the
transfer of cash between buyers and sellers, any outage or disruption of its
systems would leave some firms short of anticipated cash already promised to
others. BONY was under extraordinary pressure to keep running at full speed.
BONY operations were heavily concentrated in downtown Manhattan, very
close to the World Trade Center. The bank is headquartered at 1 Wall Street,
almost abutting the WTC and had two other sites on Barclay and Church
Streets that were even closer. These buildings housed 5,300 employees plus
the bank's main computer center. On September 11, the bank lost the two
closest sites and their equipment. The bank had arranged for its computer
processing to revert to centers outside New York in case of emergency, but it
was not able to follow its plan. The World Trade Center attack had heavily
damaged a major Verizon switching station at 140 West Street serving 3 million
data circuits in lower Manhattan. The loss of this switching station left BONY
without any bandwidth for transmitting voice and data communications to
downtown New York, and the bank struggled to find ways to connect with
customers.
The bank's disaster recovery plan called for paper check processing to be
moved from its financial district computer center to its Cherry Hill, New Jersey
facility. With communication so disrupted, BONY management decided Cherry
Hill was too distant and moved the functions to its closer center in Lodi, New
Jersey. However, that center lacked machines for its lockbox business, in
which it opens envelopes that contain bill payments, deposits checks, and
reads payment stubs to credit the right accounts.
The bank had deliberately planned to have different levels of backup for different
functions. The bank's government bond processing was backed up by a second
computer that could take over on a moment's notice. No such backup existed
for the bank's 350 automated teller machines. The bank rationalized that its
customers could use other banks' machines in case of a problem and its
customers were forced to do that. Even the backup system for the government
bond business did not work properly because the communication lines between
its backup sites and clients' backup sites were often of low capacity and had
not been fully tested and debugged. For example BONY's required connection
to the Government Securities Clearing Corporation, a central component of the
government bond market, failed so tapes had to be driven to them for several
days. Trades were properly posted but clients could not obtain timely reports
on their positions. The bank had also established redundant
telecommunications facilities in case of problems with one line, but they turned
out to be routed through the same physical phone facilities. John Costas, the
president and COO of UBS Warburg, explained "We've all learned that when we
have backup lines, we should know a lot more about where they run."
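The lesson in Costas's remark, that "redundant" circuits must be audited for shared physical routing, can be sketched as a simple path-diversity check. The circuit routes below are hypothetical illustrations of BONY's failure mode, not its actual network records.

```python
# Sketch: find physical facilities (beyond the shared endpoints) that
# every "redundant" circuit passes through. Each one is a single
# point of failure of the kind that took down both BONY lines when
# one Verizon switching station was destroyed.

circuits = {
    "primary": ["1 Wall St", "140 West St CO", "Cherry Hill"],
    "backup":  ["1 Wall St", "140 West St CO", "Lodi"],
}

def shared_facilities(paths):
    """Facilities common to every circuit's intermediate route;
    endpoints are excluded since all circuits must terminate there."""
    routes = [set(path[1:-1]) for path in paths.values()]
    common = set.intersection(*routes)
    return sorted(common)

print(shared_facilities(circuits))  # ['140 West St CO']
```

Truly diverse circuits return an empty list; anything else names the choke points that make the redundancy illusory.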
As a result the Bank of New York's customers expecting funds from the Bank of
New York didn't receive them on time and had to borrow emergency cash for
the Federal Reserve. Yet Thomas A. Renyi, the Bank of New York's chairman,
expressed pride in how the bank had responded. He said "Our longstanding
disaster recovery plans worked, and they worked in the extreme." It will be
months before BONY can return to its computer center at 101 Barclay Street
and the bank is working with IBM on where to locate an interim computer
center and ways to improve its backup systems.
The Nasdaq stock exchange seems to have had more success. It has no trading
floor anywhere but instead is a vast distributed network with over 7,000
workstations at about 2,500 sites, all connected to its network through at least
20 points of presence (POPs). The POPs in turn are doubly or triply connected
to its main network and data centers in Connecticut and Maryland. Nasdaq's
headquarters at 1 Liberty Plaza were heavily damaged. Its operational staff and
its press and broadcast functions are housed in its Times Square building. On
September 11 (Tuesday), Nasdaq opened at 8am as usual, but it closed at
9:15am, and did not open again until the following Monday, when the NYSE
and other exchanges resumed trading. NASDAQ was well prepared for the
disaster with its highly redundant setup. It even had many cameras and
monitoring systems so that the company would know what actually happened
if a disaster or other crisis should strike. Nasdaq had even purposely
established a very close relationship with WorldCom, its telecommunications
provider, and it had made sure WorldCom had access to different networks for
the purpose of redundancy.
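Redundant connectivity of the kind Nasdaq relied on can be verified programmatically: every site should still reach a data center after the loss of any single point of presence (POP). The toy topology below is an illustration, not Nasdaq's actual network.

```python
# Sketch: check that each site survives the failure of any one of its
# POPs, i.e. that some remaining POP still reaches a data center.
# Sites, POPs, and centers here are hypothetical.

site_to_pops = {
    "broker-A": {"pop-nyc", "pop-nj"},
    "broker-B": {"pop-nyc", "pop-phl", "pop-bos"},
}
pop_to_centers = {
    "pop-nyc": {"ct-center", "md-center"},
    "pop-nj":  {"ct-center", "md-center"},
    "pop-phl": {"md-center"},
    "pop-bos": {"ct-center"},
}

def survives_any_single_pop_failure(site):
    """True if the site can still reach some data center after the
    loss of any one of its POPs."""
    pops = site_to_pops[site]
    for failed in pops:
        remaining = pops - {failed}
        if not any(pop_to_centers[p] for p in remaining):
            return False
    return True

print(survives_any_single_pop_failure("broker-A"))  # True
```

A site connected through only one POP fails this check immediately, which is exactly the audit that distinguishes a doubly connected design from one that merely looks redundant.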
At first Nasdaq established a command center at its Times Square office, but
the collapse of the WTC buildings destroyed Nasdaq's telephone switches
connected to that office, and so the essential staff members were quickly moved
to a nearby hotel. Management immediately addressed the personnel situation,
creating an executive locator system in Maryland with everyone's names and
telephone numbers and a list of the still missing. Next they evaluated the
physical situation (what was destroyed, what ceased to work, where work
could proceed) while finding offices for the 127 employees who worked near
the WTC. Next they started to evaluate the regulatory and trading industry
situations and the conditions of Nasdaq's trading companies. The security staff
was placed on high alert to search for attempted penetration of the building or
the network.
On Wednesday September 12 Nasdaq management determined that 30 of the
300 firms they called would not be able to open the next day, 10 of which
needed to operate out of backup centers. Management assigned some of its
own staff to work with all 30 of them to help solve their problems. The next day
they learned that the devastated lower Manhattan telecommunications would
not be ready to support a Nasdaq opening the following day. They decided to
postpone Nasdaq's opening until Monday, September 17. On Saturday and
again on Sunday they successfully ran industry-wide testing. On Monday, only
six days after the attack, Nasdaq opened and successfully processed 2.7 billion
shares, by far its largest volume ever.
Nasdaq found its distributed systems worked very well, while its rapid
recovery validated its decision to maintain two separate network topologies.
Moreover, while Nasdaq lost
no senior staff, the company had three dispersed management sites, and had it
lost one, the company could still operate because of the leadership at its two
remaining sites. Nasdaq also realized its extensive crisis management
rehearsals for both Y2K and the conversion to decimals had proven vital,
verifying the need to schedule more rehearsals regularly.