
11gR2 New feature - Storage Failover cases

Test Category: ASM Voting Files destructive Test / Sanity Check

Test 1

Target: Storage 1

Preconditions:
- Normal redundancy ASM Diskgroup with at least three Failgroups, which are on separate storage paths.

Steps:
1- Make sure the Voting files are in the ASM diskgroup;
2- If the VF are not in an ASM Diskgroup, use crsctl replace votedisk +DGNAME to migrate the VF to the ASM Diskgroup (a command sketch follows this test);
3- Remove physical access to one of the ASM failgroups, for example by removing the disk itself, cables, or switches;
4- After the long disk I/O timeout, run crsctl query css votedisk to check the voting file status;
5- Restore the access and check the Voting file status.

Variants:
None
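
A minimal sketch of steps 1-2, assuming the Grid Infrastructure tools are on the PATH and that +VOTEDG stands in for the target normal-redundancy diskgroup:

    # Check where the voting files currently reside:
    crsctl query css votedisk

    # If they are not in an ASM diskgroup, migrate them
    # (+VOTEDG is a placeholder diskgroup name):
    crsctl replace votedisk +VOTEDG

    # Confirm the voting files show ONLINE in the new diskgroup:
    crsctl query css votedisk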

[HW-STOR-15b]

Test Category: ASM Voting Files destructive Test / Sanity Check

Preconditions:
- Normal redundancy ASM Diskgroup with at least three Failgroups, which are on separate storage paths.

Steps:
1- Make sure the Voting files are in the ASM diskgroup;
2- If the VF are not in an ASM Diskgroup, use crsctl replace votedisk +DGNAME to migrate the VF to the ASM Diskgroup;
3- Remove physical access to a majority of the ASM failgroups, for example by removing the disk itself, cables, or switches;
4- CSS will mark the disks as stale. Since there is no quorum, the clusterware will stop;
5- After the nodes reboot, restore access. Restart the clusterware using crsctl start crs.

In 11.2.0.2, if the node doesn't reboot after CSSD terminates, use crsctl stop crs -f to stop the remaining clusterware processes, restore the physical access to a majority of the voting disks, then manually use crsctl start crs to start the CRS stack (a command sketch follows this test).
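
A minimal sketch of that 11.2.0.2 recovery path, run as root on the node whose CSSD terminated without a reboot; the order of operations follows the note above:

    # Force-stop whatever clusterware processes are still running:
    crsctl stop crs -f

    # (Restore physical access to a majority of the voting disks here.)

    # Restart the stack and confirm the voting files are back ONLINE:
    crsctl start crs
    crsctl query css votedisk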

[HW-STOR-16a]

Test Category: ASM OCR destructive Tests / Sanity Check

Preconditions:
- Normal redundancy ASM Diskgroup whose failgroups are on separate storage paths.

Steps:
1- Make sure only the OCR file is in this ASM diskgroup;
2- If not, use ocrconfig -add +DGNAME to add the ASM OCR and use ocrconfig -delete to remove the other OCR files (a command sketch follows this test);
3- Remove physical access to one of the ASM failgroups, for example by removing the disk itself, cables, or switches;
4- Run ocrcheck as a privileged user to check the ASM OCR status;
5- Restore the access to the physical storage and check the OCR status again.
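
A minimal sketch of steps 2 and 4, run as root; +OCRDG and the old OCR location /cluster/ocr_old are placeholders for your environment:

    # Show the current OCR locations and integrity:
    ocrcheck

    # Add the ASM diskgroup as an OCR location and drop the old one
    # (placeholder locations; adjust to the actual configuration):
    ocrconfig -add +OCRDG
    ocrconfig -delete /cluster/ocr_old

    # Re-check: only the ASM location should remain and the integrity check should pass:
    ocrcheck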

[HW-STOR-16b]

Test Category: ASM OCR Mirror test / Sanity Check

Preconditions:
- At least two normal redundancy ASM Diskgroups, with their failure groups on separate storage paths.
- At least two OCRs stored on different diskgroups.

Steps:
1- Back up the OCR file;
2- Remove physical access to one diskgroup that stores an OCR, for example by removing the disk itself, cables, or switches;
3- Use crsctl to create/start/stop/remove dummy resources, so that there are some write operations to the OCR (a command sketch follows this test);
4- CRSD should mark the bad device and stop accessing it;
5- Wait for 15 minutes and check all the resources;
6- Restore the access to the physical storage and check the OCR status again.
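
One way to generate the OCR write traffic in step 3 is a throw-away resource, sketched below; the resource name dummy_res, its no-op action script, and the cluster_resource type are illustrative, not mandated by the test:

    # Create a no-op action script for the dummy resource:
    printf '#!/bin/sh\nexit 0\n' > /tmp/dummy_action.sh
    chmod +x /tmp/dummy_action.sh

    # Each of these operations writes to the OCR:
    crsctl add resource dummy_res -type cluster_resource -attr "ACTION_SCRIPT=/tmp/dummy_action.sh"
    crsctl start resource dummy_res
    crsctl stop resource dummy_res
    crsctl delete resource dummy_res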



Expected Test Outcome

Clusterware:
- If more than three failgroups are used, the VF should be reallocated to another failgroup. Otherwise its VF status should change to OFFLINE.

Clusterware:
- CSS will evict the nodes because there is no quorum. This occurs as soon as the voting disk operations incur I/O failures and the total availability falls below quorum.
- On reboot, the system should restart and wait without CSS, also because there is no quorum.
- After quorum is restored, crsctl start crs should restore the clusterware stacks. A reboot should not be required.
- For 11.2.0.2, if all CRS resources and ASM/RDBMS processes are cleaned up prior to CSSD terminating, the node won't reboot after CSSD terminates. Otherwise, the node will still reboot.
- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.

Actual Test Outcome

Clusterware:
- No impact

Clusterware:
- No impact
- For 11gR2, collect crsctl stat res -t in a 60-second loop from the beginning to the end of the run. Attach the output for auditing (a collection sketch follows).
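
A minimal sketch of that 60-second collection loop; the output file name is illustrative, and the loop is stopped by hand when the run ends:

    # Collect the cluster resource status every 60 seconds for the whole run;
    # stop with Ctrl-C when the test run finishes:
    while true
    do
        date >> /tmp/crs_stat_audit.log
        crsctl stat res -t >> /tmp/crs_stat_audit.log 2>&1
        sleep 60
    done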

Oracle File Corruption or I/O Fencing Cases


Test 1 (Storage - 1)

Action: Fence or corrupt a subset of Oracle Clusterware voting disk mirrors ==> Any RAC host

Note: Not an 11gR2 new feature; the VF are not in an ASM diskgroup.

Preconditions:
- Use an odd number of voting disks on separate storage paths. When certifying NAS, CFS, or shared volume managers, create the voting disks on this media.
- Initiate client workloads.

Steps:
1- For 11gR2, the voting file is automatically backed up in the OCR, so there is no need to back up the VF.
2- Remove physical access to the voting disk, for example by removing the disk itself, cables, or switches. Do NOT use dd to simulate this test.
3- CSS will mark the disk as stale. Wait for the long disk voting interval. Use crsctl query css votedisk to get the VF status.
4- Restore access. CSS will recognize that the voting disk is now accessible.

Variants:
Test 2 (Storage - 2)

Action: Fence or corrupt a majority of Oracle Clusterware voting disk mirrors ==> Any RAC host

Note: Not an 11gR2 new feature; the VF are not in an ASM diskgroup.

Preconditions:
- Use an odd number of voting disks on separate storage paths. When certifying NAS, CFS, or shared volume managers, create the voting disks on this media.
- Initiate client workloads.

Steps:
1- For 11gR2, the voting file is automatically backed up in the OCR, so there is no need to back up the VF.
2- Remove physical access to a majority of the voting disks, for example by removing the disk itself, cables, or switches. If the voting disks cannot be isolated at the disk level, do NOT use dd to simulate this test.
3- CSS will mark the disks as stale. Since there is no quorum, the clusterware will stop.
4- After the nodes reboot, restore access. Restart the clusterware using crsctl start crs.

In 11.2.0.2, if the node doesn't reboot after CSSD terminates, use crsctl stop crs -f to stop the remaining clusterware processes, restore the physical access to a majority of the voting disks, then manually use crsctl start crs to start the CRS stack.

Test 3 (Storage - 3)
Action: Make an I/O error on a subset of the Oracle Cluster Registry (OCR) files with Oracle Clusterware mirrors ==> OCR master node

Note: Not an 11gR2 new feature; the OCR is not in an ASM diskgroup.

Preconditions:
- Use OCR mirrors on separate storage paths. When certifying NAS, CFS, or shared volume managers, create the OCR mirrors on this media.
- Initiate client workloads.
- Induce stress conditions: I/O stress.
- Identify the OCR master.
- Back up the OCR file.

Steps:
1- Remove physical access to the OCR mirror disk, for example by removing the disk itself, cables, or switches.
2- Use crsctl to create/start/stop/remove dummy resources, so that there are some write operations to the OCR.
3- OCRD/CRSD should mark the bad device and stop accessing it.
4- Wait for 15 minutes and check all the resources.
5- Use crsctl to create/start/stop/remove dummy resources again, so that there are more write operations to the OCR.
6- You can now re-enable the OCR mirror disk.

Variants:
Var 1. Reboot the CSS master and the other nodes both while the OCR master is offline and after it is restored.
Test 4 (Storage - 4)

Action: Fence or corrupt a random set of Oracle datafiles (e.g. on different tablespaces) ==> Any RAC host

Preconditions:
- Initiate client workloads.

Steps:
1- Back up the Oracle datafile filesystem.
2- Corrupt a random set of (one or more) Oracle datafiles by removing disks or overwriting disk headers.
3- After the RAC database fails, replace the backed-up datafiles before restarting the RAC instances (an RMAN sketch follows this test).
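
If the backup was taken with RMAN, the restore in step 3 might look like the sketch below; the datafile number 7 is a placeholder, and a third-party restore is equally acceptable per the expected outcome:

    # From an RMAN session connected to the affected database (RMAN> prompt shown):
    RMAN> sql 'alter database datafile 7 offline';
    RMAN> restore datafile 7;
    RMAN> recover datafile 7;
    RMAN> sql 'alter database datafile 7 online';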

Test 5 (Storage - 5)

Action: Corrupt a multiplexed redo member ==> Any RAC host

Preconditions:
- Set up multiplexed Oracle redo logs on separate storage paths.
- Initiate client workloads.

Steps:
1- Physically corrupt one redo member of a group.
2- Force redo log operations to this group (a sketch follows this test).
3- Repair the media and continue redo operations until the redo log member is brought online.

Variants:
Var 1. Repeat this test for each redo log member number (first, second, etc.) in the current group.
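
A minimal sketch of forcing redo activity onto the group and watching the damaged member from a SQL*Plus SYSDBA session; the views are the standard V$LOG/V$LOGFILE, and the corrupted member is expected to show INVALID until it is repaired:

    # From SQL*Plus as SYSDBA on one instance (SQL> prompt shown):
    SQL> alter system switch logfile;
    SQL> alter system switch logfile;
    SQL> select group#, member, status from v$logfile order by group#;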



Expected Test Outcome

NAS/SAN includes ASM, SLVM, and CFS:

Clusterware:
- Extra voting disk files protect Oracle Clusterware from media failures and human errors.
- CSS marks the voting disk as STALE while it is unavailable.
- For 11gR2, collect crsctl stat res -t in a 60-second loop from the beginning to the end of the run. Attach the output for auditing.
- After restoring access, use crsctl query css votedisk to check the VF status.

NAS/SAN:
- None

Clusterware:
- CSS will evict the nodes because there is no quorum. This occurs as soon as the voting disk operations incur I/O failures and the total availability falls below quorum.
- On reboot, the system should restart and wait without CSS, also because there is no quorum.
- After quorum is restored, crsctl start crs should restore the clusterware stacks. A reboot should not be required.
- For 11.2.0.2, if all CRS resources and ASM/RDBMS processes are cleaned up prior to CSSD terminating, the node won't reboot after CSSD terminates. Otherwise, the node will still reboot.
- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.

Actual Test Outcome

NAS/SAN includes ASM, SLVM, and CFS:

Clusterware:
- Multiple OCR files protect Oracle Clusterware from cluster suicide. There should be no interruption to clusterware operations.
- For 11gR2, collect crsctl stat res -t in a 60-second loop from the beginning to the end of the run. Attach the output for auditing.

RAC:
- When using RMAN, use the RMAN restore operations to resume normal operations without a complete database rebuild. Otherwise restore the data files from the third-party clusterware.

RAC:
- The redo members should prevent blocking of redo I/O.
- Measure the brownout from a client perspective as redo log members switch. Also measure the time that the system continues without restoring the recovered member once the disk is repaired.
- No RAC instance crash is reported.

RAC Host-to-Storage NIC/HBA Failures


Test 1 (Storage - 1)

Action: Fail one of two bonded NICs for storage (NAS) or HBAs (SAN) ==> CRS/OCR master

Preconditions:
- Initiate client workloads.
- Induce stress conditions: multiple CRS actions in progress.
- Identify the vendor and CRS masters. See if these can be separated.

Steps:
1- Disable one of the two bonded NICs/HBAs on the CRS master.
2- Wait at least CSS DISK TIMEOUT seconds (default 200s; in this test, please wait at least 600 seconds). Use crsctl query css votedisk and ocrcheck to get the voting disk and OCR status (a command sketch follows this test).
3- Re-enable the NIC/HBA.

Variants:
Var 1. Repeat for a subset of the voting disks below quorum.
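
A minimal sketch of the wait-and-check in step 2, assuming the Grid Infrastructure tools are on the PATH; the 600-second wait comes from the test description:

    # Confirm the configured CSS disk timeout (default 200 seconds):
    crsctl get css disktimeout

    # Wait at least the 600 seconds this test asks for, then check status:
    sleep 600
    crsctl query css votedisk
    ocrcheck            # run as root for the full integrity check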
Test 2 (Storage - 2)

Action: Fail both bonded NICs for storage, or both HBAs ==> CRS/OCR master

Preconditions:
- Initiate client workloads.
- Induce stress conditions: multiple CRS actions in progress.
- Identify the vendor and CRS masters. See if these can be separated.

Steps:
1- Disable both bonded storage NICs/HBAs on the current CRS master.
2- Wait at least CSS DISK TIMEOUT seconds.
3- Determine the interim time (in seconds) that database I/Os freeze, if any.
4- Re-enable both NICs/HBAs.

In 11.2.0.2, if the node doesn't reboot after CSSD terminates, use crsctl stop crs -f to stop the remaining clusterware processes, re-attach both storage cables, then manually use crsctl start crs to start the CRS stack.

Variants:
Var 1 - Do Step 4 before the voting disk times out (200s default, as shown in the CSS log) and use crsctl query css votedisk to get the voting file status;
- The RAC host resumes I/O activity as usual.
Var 2 - Do Step 4 after the voting disk times out (as shown in the CSS log);
- The RAC host suicides and rejoins the cluster upon reboot.



Expected Test Outcome

NAS/SAN includes OCR residing on ASM, SLVM, and CFS:

Clusterware:
- The CRS/OCR master does not fail, either after failing or after restoring I/O connectivity.
- No node should be evicted and no CRS resources should go OFFLINE.

RAC:
- The HBA should redirect I/O down the redundant path.
- ASM and RAC should remain fully functional.
- For 11gR2, collect crsctl stat res -t in a 60-second loop from the beginning to the end of the run. Attach the output for auditing.

NAS/SAN includes OCR residing on ASM, SLVM, and CFS:

RAC (Var 2):
- No report of complete cluster failures/reboots.
- Oracle Clusterware resources managed by the CRS master should either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener, and singleton services.
- For 11gR2, collect crsctl stat res -t in a 60-second loop from the beginning to the end of the run. Attach the output for auditing.
- For 11.2.0.2, if all CRS resources and ASM/RDBMS processes are cleaned up prior to CSSD terminating, the node won't reboot after CSSD terminates. Otherwise, the node will still reboot.
- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.

Actual Test Outcome

RAC Host-to-Storage Cable Failures


Test 1 (Storage - 1)

Action: Remove the primary single storage cable or single path to the SAN or NAS storage unit ==> Volume group master if different from the vendor clusterware master nodes

Preconditions:
- Single redundant storage paths.
- Initiate client workloads.
- Induce stress conditions: low free swap, I/O saturation.
- Identify the master nodes for the vendor clusterware, CSS, and OCR.

Steps:
1- Physically remove the primary storage cable or primary switch in the I/O path to the storage unit.
2- Determine the interim time (in seconds) that storage I/Os freeze, if any (note the SCSI timeout and SCSI retry settings). Wait at least CSS DISK TIMEOUT seconds (default 200s; in this test, please wait at least 600s). Use crsctl query css votedisk and ocrcheck to get the voting disk and OCR status.
3- Restore the storage path to the primary unit.

Variants:
Var 1 - Remove the secondary I/O path; wait less than the disk timeout; restore the secondary I/O path.
Var 2 - If a third-party volume manager is used in the storage, remove the primary storage cable from the volume manager master node (in lieu of the vendor clusterware node).
Var 3 - Repeat this test removing a single storage path to the OCR master node concurrent with node failure at the vendor/CSS master node. This test is important for metro and stretch clusters.

Test 2 (Storage - 2)

Action: ALL redundant storage paths: remove both storage cables anywhere in the network or SAN I/O path ==> Repeat for each of the vendor, CSS, and OCR master nodes

Preconditions:
- Initiate client workloads.
- Induce stress conditions: low free swap, high CPU.
- Identify the master nodes for the vendor clusterware, CSS, and OCR.
- Sanity check.

Steps:
1- Physically remove all redundant I/O paths to the master node. Repeat this at the host and at the storage switch.
2- Restore the I/O paths after the master is evicted and rebooted.

In 11.2.0.2, if the node doesn't reboot after CSSD terminates, use crsctl stop crs -f to stop the remaining clusterware processes, restore the I/O paths, then manually use crsctl start crs to start the CRS stack.

3- Wait 600s to ensure there are no evictions due to the join operations.

Variants:
Var 1 - If a third-party volume manager is used in the storage, remove the redundant I/O paths to this node.
Var 2 - Repeat this test removing all redundant storage paths to the OCR master node concurrent with node failure at the vendor/CSS master node. This test is important for metro and stretch clusters.

Test 3 (Storage - 3)

Action: Remove the primary + secondary storage cables anywhere in the network path ==> T multiple RAC hosts

Preconditions:
- Initiate client workloads.
- Identify a set of T = N-1 RAC hosts (N = number of clustered database hosts), including the vendor, CSS, and OCR master nodes.
- Sanity check.

Steps:
1- Physically remove both the primary and secondary storage cables, or another redundant component in the I/O path, from the current vendor and CSS masters.
2- Start monitoring the survivors' ocssd.log and wait until the vendor clusterware starts reconfiguration (a monitoring sketch follows this test).
3- When reconfiguration starts, repeat Step 1 against the surviving nodes until there is only one surviving node left. Include parallel evictions.

In 11.2.0.2, if the node doesn't reboot after CSSD terminates, use crsctl stop crs -f to stop the remaining clusterware processes, re-attach both storage cables, then manually use crsctl start crs to start the CRS stack.

4- Allow the evicted nodes to rejoin the cluster.
5- Restore the I/O paths and start the clusterware stacks on all nodes that are now without stacks running. Do this concurrently.
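
A minimal sketch of the ocssd.log monitoring in step 2, assuming the standard 11.2 Grid Infrastructure log layout under $GRID_HOME and that a reconfiguration message is what you are watching for:

    # On each surviving node, follow CSS activity and watch for reconfiguration
    # messages (log path follows the usual 11.2 layout; adjust if different):
    tail -f $GRID_HOME/log/$(hostname -s)/cssd/ocssd.log | grep -i reconfig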

Expected Test Outcome

NAS/SAN includes ASM, SLVM, and CFS:

Clusterware:
- No clusterware heartbeat, clusterware resource, or network failures occur. Multipathing takes care of routing I/O to the secondary I/O path, or back to the (restored) primary storage network (if applying the variant).

RAC:
- No impact on the stability of RAC hosts. For 11g, hang incidents are resolved.
- Uninterrupted cluster-wide I/O operations.
- For 11gR2, collect crsctl stat res -t in a 60-second loop from the beginning to the end of the run. Attach the output for auditing.

NAS/SAN includes ASM, SLVM, and CFS:
- For CSS and the vendor clusterware, the master node is evicted. When the OCR can be isolated, e.g. on separate SANs or NAS devices, the backup OCR mirror survives.
- No cluster failures/reboots, including when nodes rejoin.

Actual Test Outcome

RAC:
- No data corruption reported from surviving nodes.
- Zero impact on the stability of surviving RAC hosts.
- Oracle Clusterware resources managed by the CRS master should either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener, and singleton services.
- For 11gR2, collect crsctl stat res -t in a 60-second loop from the beginning to the end of the run. Attach the output for auditing.
- For 11.2.0.2, if all CRS resources and ASM/RDBMS processes are cleaned up prior to CSSD terminating, the node won't reboot after CSSD terminates. Otherwise, the node will still reboot.
- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.

NAS/SAN includes ASM, SLVM, and CFS:
- Each of the failed nodes reboots, followed by cluster reconfigurations, and then successfully rejoins the cluster.
- No data corruption or I/O interruption from surviving nodes.
- Specifically ensure NO I/O INTERRUPTION when the nodes rejoin.

RAC:
- Surviving RAC hosts should remain stable.
- Uninterrupted cluster-wide I/O operations.
- No report of complete cluster failures/reboots, including after nodes join.
- Oracle Clusterware resources managed by the CRS master should either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener, and singleton services.
- For 11gR2, collect crsctl stat res -t in a 60-second loop from the beginning to the end of the run. Attach the output for auditing.
- For 11.2.0.2, if all CRS resources and ASM/RDBMS processes are cleaned up prior to CSSD terminating, the node won't reboot after CSSD terminates. Otherwise, the node will still reboot.
- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.
