
11gR2 New feature - Storage Failover cases

Test Category: ASM Voting Files destructive Test / Sanity Check

Test 1

Target: Storage 1

Preconditions:
- Normal redundancy ASM Diskgroup with at least three Failgroups, which are on separate storage paths.

Steps:
1- Make sure the Voting files are in the ASM diskgroup;
2- If the VF are not in an ASM Diskgroup, use crsctl replace votedisk +DGNAME to migrate the VF to the ASM Diskgroup (a command sketch follows this test);
3- Remove physical access to one of the ASM failgroups, for example by removing the disk itself, cables, or switches;
4- After the long disk I/O timeout, run crsctl query css votedisk to check the voting file status;
5- Restore the access and check the Voting file status.

Variants:
None
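
A minimal sketch of steps 1-2, assuming the Grid Infrastructure tools are on the PATH and that +VOTEDG stands in for the target normal-redundancy diskgroup:

    # Check where the voting files currently reside:
    crsctl query css votedisk

    # If they are not in an ASM diskgroup, migrate them
    # (+VOTEDG is a placeholder diskgroup name):
    crsctl replace votedisk +VOTEDG

    # Confirm the voting files show ONLINE in the new diskgroup:
    crsctl query css votedisk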

[HW-STOR-15b]

Test Category: ASM Voting Files destructive Test / Sanity Check

Preconditions:
- Normal redundancy ASM Diskgroup with at least three Failgroups, which are on separate storage paths.

Steps:
1- Make sure the Voting files are in the ASM diskgroup;
2- If the VF are not in an ASM Diskgroup, use crsctl replace votedisk +DGNAME to migrate the VF to the ASM Diskgroup;
3- Remove physical access to a majority of the ASM failgroups, for example by removing the disk itself, cables, or switches;
4- CSS will mark the disks as stale. Since there is no quorum, the clusterware will stop;
5- After the nodes reboot, restore access. Restart the clusterware using crsctl start crs.

In 11.2.0.2, if the node doesn't reboot after CSSD terminates, use crsctl stop crs -f to stop the remaining clusterware processes, restore the physical access to a majority of the voting disks, then manually use crsctl start crs to start the CRS stack (a command sketch follows this test).
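
A minimal sketch of that 11.2.0.2 recovery path, run as root on the node whose CSSD terminated without a reboot; the order of operations follows the note above:

    # Force-stop whatever clusterware processes are still running:
    crsctl stop crs -f

    # (Restore physical access to a majority of the voting disks here.)

    # Restart the stack and confirm the voting files are back ONLINE:
    crsctl start crs
    crsctl query css votedisk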

[HW-STOR-16a]

Test Category: ASM OCR destructive Tests / Sanity Check

Preconditions:
- Normal redundancy ASM Diskgroup whose failgroups are on separate storage paths.

Steps:
1- Make sure only the OCR file is in this ASM diskgroup;
2- If not, use ocrconfig -add +DGNAME to add the ASM OCR and use ocrconfig -delete to remove the other OCR files (a command sketch follows this test);
3- Remove physical access to one of the ASM failgroups, for example by removing the disk itself, cables, or switches;
4- Run ocrcheck as a privileged user to check the ASM OCR status;
5- Restore the access to the physical storage and check the OCR status again.
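
A minimal sketch of steps 2 and 4, run as root; +OCRDG and the old OCR location /cluster/ocr_old are placeholders for your environment:

    # Show the current OCR locations and integrity:
    ocrcheck

    # Add the ASM diskgroup as an OCR location and drop the old one
    # (placeholder locations; adjust to the actual configuration):
    ocrconfig -add +OCRDG
    ocrconfig -delete /cluster/ocr_old

    # Re-check: only the ASM location should remain and the integrity check should pass:
    ocrcheck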

[HW-STOR-16b]

Test Category: ASM OCR Mirror test / Sanity Check

Preconditions:
- At least two normal redundancy ASM Diskgroups, with their failure groups on separate storage paths.
- At least two OCRs stored on different diskgroups.

Steps:
1- Back up the OCR file;
2- Remove physical access to one diskgroup that stores an OCR, for example by removing the disk itself, cables, or switches;
3- Use crsctl to create/start/stop/remove dummy resources, so that there are some write operations to the OCR (a command sketch follows this test);
4- CRSD should mark the bad device and stop accessing it;
5- Wait for 15 minutes and check all the resources;
6- Restore the access to the physical storage and check the OCR status again.
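
One way to generate the OCR write traffic in step 3 is a throw-away resource, sketched below; the resource name dummy_res, its no-op action script, and the cluster_resource type are illustrative, not mandated by the test:

    # Create a no-op action script for the dummy resource:
    printf '#!/bin/sh\nexit 0\n' > /tmp/dummy_action.sh
    chmod +x /tmp/dummy_action.sh

    # Each of these operations writes to the OCR:
    crsctl add resource dummy_res -type cluster_resource -attr "ACTION_SCRIPT=/tmp/dummy_action.sh"
    crsctl start resource dummy_res
    crsctl stop resource dummy_res
    crsctl delete resource dummy_res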



Expected Test Outcome

Clusterware:
- If more than three failgroups are used, the VF should be reallocated to another failgroup. Otherwise its VF status should change to OFFLINE.

Clusterware:
- CSS will evict the nodes because there is no quorum. This occurs as soon as the voting disk operations incur I/O failures and the total availability falls below quorum.
- On reboot, the system should restart and wait without CSS, also because there is no quorum.
- After quorum is restored, crsctl start crs should restore the clusterware stacks. A reboot should not be required.
- For 11.2.0.2, if all CRS resources and ASM/RDBMS processes are cleaned up prior to CSSD terminating, the node won't reboot after CSSD terminates. Otherwise, the node will still reboot.
- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.

Actual Test Outcome

Clusterware:
- No impact

Clusterware:
- No impact
- For 11gR2, collect crsctl stat res -t in a 60-second loop from the beginning to the end of the run. Attach the output for auditing (a collection sketch follows).
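
A minimal sketch of that 60-second collection loop; the output file name is illustrative, and the loop is stopped by hand when the run ends:

    # Collect the cluster resource status every 60 seconds for the whole run;
    # stop with Ctrl-C when the test run finishes:
    while true
    do
        date >> /tmp/crs_stat_audit.log
        crsctl stat res -t >> /tmp/crs_stat_audit.log 2>&1
        sleep 60
    done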

Oracle File Corruption or I/O Fencing Cases


Test 1 (Storage - 1)

Action: Fence or corrupt a subset of Oracle Clusterware voting disk mirrors ==> Any RAC host

Note: Not an 11gR2 new feature; the VF are not in an ASM diskgroup.

Preconditions:
- Use an odd number of voting disks on separate storage paths. When certifying NAS, CFS, or shared volume managers, create the voting disks on this media.
- Initiate client workloads.

Steps:
1- For 11gR2, the voting file is automatically backed up in the OCR, so there is no need to back up the VF.
2- Remove physical access to the voting disk, for example by removing the disk itself, cables, or switches. Do NOT use dd to simulate this test.
3- CSS will mark the disk as stale. Wait for the long disk voting interval. Use crsctl query css votedisk to get the VF status.
4- Restore access. CSS will recognize that the voting disk is now accessible.

Variants:
Test 2 (Storage - 2)

Action: Fence or corrupt a majority of Oracle Clusterware voting disk mirrors ==> Any RAC host

Note: Not an 11gR2 new feature; the VF are not in an ASM diskgroup.

Preconditions:
- Use an odd number of voting disks on separate storage paths. When certifying NAS, CFS, or shared volume managers, create the voting disks on this media.
- Initiate client workloads.

Steps:
1- For 11gR2, the voting file is automatically backed up in the OCR, so there is no need to back up the VF.
2- Remove physical access to a majority of the voting disks, for example by removing the disk itself, cables, or switches. If the voting disks cannot be isolated at the disk level, do NOT use dd to simulate this test.
3- CSS will mark the disks as stale. Since there is no quorum, the clusterware will stop.
4- After the nodes reboot, restore access. Restart the clusterware using crsctl start crs.

In 11.2.0.2, if the node doesn't reboot after CSSD terminates, use crsctl stop crs -f to stop the remaining clusterware processes, restore the physical access to a majority of the voting disks, then manually use crsctl start crs to start the CRS stack.

Test 3 (Storage - 3)
Action: Make an I/O error on a subset of the Oracle Cluster Registry (OCR) files with Oracle Clusterware mirrors ==> OCR master node

Note: Not an 11gR2 new feature; the OCR is not in an ASM diskgroup.

Preconditions:
- Use OCR mirrors on separate storage paths. When certifying NAS, CFS, or shared volume managers, create the OCR mirrors on this media.
- Initiate client workloads.
- Induce stress conditions: I/O stress.
- Identify the OCR master.
- Back up the OCR file.

Steps:
1- Remove physical access to the OCR mirror disk, for example by removing the disk itself, cables, or switches.
2- Use crsctl to create/start/stop/remove dummy resources, so that there are some write operations to the OCR.
3- OCRD/CRSD should mark the bad device and stop accessing it.
4- Wait for 15 minutes and check all the resources.
5- Use crsctl to create/start/stop/remove dummy resources again, so that there are more write operations to the OCR.
6- You can now re-enable the OCR mirror disk.

Variants:
Var 1. Reboot the CSS master and the other nodes both while the OCR master is offline and after it is restored.
Test 4 (Storage - 4)

Action: Fence or corrupt a random set of Oracle datafiles (e.g. on different tablespaces) ==> Any RAC host

Preconditions:
- Initiate client workloads.

Steps:
1- Back up the Oracle datafile filesystem.
2- Corrupt a random set of (one or more) Oracle datafiles by removing disks or overwriting disk headers.
3- After the RAC database fails, replace the backed-up datafiles before restarting the RAC instances (an RMAN sketch follows this test).
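
If the backup was taken with RMAN, the restore in step 3 might look like the sketch below; the datafile number 7 is a placeholder, and a third-party restore is equally acceptable per the expected outcome:

    # From an RMAN session connected to the affected database (RMAN> prompt shown):
    RMAN> sql 'alter database datafile 7 offline';
    RMAN> restore datafile 7;
    RMAN> recover datafile 7;
    RMAN> sql 'alter database datafile 7 online';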

Test 5 (Storage - 5)

Action: Corrupt a multiplexed redo member ==> Any RAC host

Preconditions:
- Set up multiplexed Oracle redo logs on separate storage paths.
- Initiate client workloads.

Steps:
1- Physically corrupt one redo member of a group.
2- Force redo log operations to this group (a sketch follows this test).
3- Repair the media and continue redo operations until the redo log member is brought online.

Variants:
Var 1. Repeat this test for each redo log member number (first, second, etc.) in the current group.
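
A minimal sketch of forcing redo activity onto the group and watching the damaged member from a SQL*Plus SYSDBA session; the views are the standard V$LOG/V$LOGFILE, and the corrupted member is expected to show INVALID until it is repaired:

    # From SQL*Plus as SYSDBA on one instance (SQL> prompt shown):
    SQL> alter system switch logfile;
    SQL> alter system switch logfile;
    SQL> select group#, member, status from v$logfile order by group#;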



Expected Test Outcome

NAS/SAN includes ASM, SLVM, and CFS:

Clusterware:
- Extra voting disk files protect Oracle Clusterware from media failures and human errors.
- CSS marks the voting disk as STALE while it is unavailable.
- For 11gR2, collect crsctl stat res -t in a 60-second loop from the beginning to the end of the run. Attach the output for auditing.
- After restoring access, use crsctl query css votedisk to check the VF status.

NAS/SAN:
- None

Clusterware:
- CSS will evict the nodes because there is no quorum. This occurs as soon as the voting disk operations incur I/O failures and the total availability falls below quorum.
- On reboot, the system should restart and wait without CSS, also because there is no quorum.
- After quorum is restored, crsctl start crs should restore the clusterware stacks. A reboot should not be required.
- For 11.2.0.2, if all CRS resources and ASM/RDBMS processes are cleaned up prior to CSSD terminating, the node won't reboot after CSSD terminates. Otherwise, the node will still reboot.
- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.

Actual Test Outcome

NAS/SAN includes ASM, SLVM, and CFS:

Clusterware:
- Multiple OCR files protect Oracle Clusterware from cluster suicide. There should be no interruption to clusterware operations.
- For 11gR2, collect crsctl stat res -t in a 60-second loop from the beginning to the end of the run. Attach the output for auditing.

RAC:
- When using RMAN, use the RMAN restore operations to resume normal operations without a complete database rebuild. Otherwise restore the data files from the third-party clusterware.

RAC:
- The redo members should prevent blocking of redo I/O.
- Measure the brownout from a client perspective as redo log members switch. Also measure the time that the system continues without restoring the recovered member once the disk is repaired.
- No RAC instance crash is reported.

RAC Host-to-Storage NIC/HBA Failures


Test 1 (Storage - 1)

Action: Fail one of two bonded NICs for storage (NAS) or HBAs (SAN) ==> CRS/OCR master

Preconditions:
- Initiate client workloads.
- Induce stress conditions: multiple CRS actions in progress.
- Identify the vendor and CRS masters. See if these can be separated.

Steps:
1- Disable one of the two bonded NICs/HBAs on the CRS master.
2- Wait at least CSS DISK TIMEOUT seconds (default 200s; in this test, please wait at least 600 seconds). Use crsctl query css votedisk and ocrcheck to get the voting disk and OCR status (a command sketch follows this test).
3- Re-enable the NIC/HBA.

Variants:
Var 1. Repeat for a subset of the voting disks below quorum.
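
A minimal sketch of the wait-and-check in step 2, assuming the Grid Infrastructure tools are on the PATH; the 600-second wait comes from the test description:

    # Confirm the configured CSS disk timeout (default 200 seconds):
    crsctl get css disktimeout

    # Wait at least the 600 seconds this test asks for, then check status:
    sleep 600
    crsctl query css votedisk
    ocrcheck            # run as root for the full integrity check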
Test 2 (Storage - 2)

Action: Fail both bonded NICs for storage, or both HBAs ==> CRS/OCR master

Preconditions:
- Initiate client workloads.
- Induce stress conditions: multiple CRS actions in progress.
- Identify the vendor and CRS masters. See if these can be separated.

Steps:
1- Disable both bonded storage NICs/HBAs on the current CRS master.
2- Wait at least CSS DISK TIMEOUT seconds.
3- Determine the interim time (in seconds) that database I/Os freeze, if any.
4- Re-enable both NICs/HBAs.

In 11.2.0.2, if the node doesn't reboot after CSSD terminates, use crsctl stop crs -f to stop the remaining clusterware processes, re-attach both storage cables, then manually use crsctl start crs to start the CRS stack.

Variants:
Var 1 - Do Step 4 before the voting disk times out (200s default, as shown in the CSS log) and use crsctl query css votedisk to get the voting file status;
- The RAC host resumes I/O activity as usual.
Var 2 - Do Step 4 after the voting disk times out (as shown in the CSS log);
- The RAC host suicides and rejoins the cluster upon reboot.



Expected Test Outcome

NAS/SAN includes OCR residing on ASM, SLVM, and CFS:

Clusterware:
- The CRS/OCR master does not fail, either after failing or after restoring I/O connectivity.
- No node should be evicted and no CRS resources should go OFFLINE.

RAC:
- The HBA should redirect I/O down the redundant path.
- ASM and RAC should remain fully functional.
- For 11gR2, collect crsctl stat res -t in a 60-second loop from the beginning to the end of the run. Attach the output for auditing.

NAS/SAN includes OCR residing on ASM, SLVM, and CFS:

RAC (Var 2):
- No report of complete cluster failures/reboots.
- Oracle Clusterware resources managed by the CRS master should either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener, and singleton services.
- For 11gR2, collect crsctl stat res -t in a 60-second loop from the beginning to the end of the run. Attach the output for auditing.
- For 11.2.0.2, if all CRS resources and ASM/RDBMS processes are cleaned up prior to CSSD terminating, the node won't reboot after CSSD terminates. Otherwise, the node will still reboot.
- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.

Actual Test Outcome

RAC Host-to-Storage Cable Failures


Test 1 (Storage - 1)

Action: Remove the primary single storage cable or single path to the SAN or NAS storage unit ==> Volume group master if different from the vendor clusterware master nodes

Preconditions:
- Single redundant storage paths.
- Initiate client workloads.
- Induce stress conditions: low free swap, I/O saturation.
- Identify the master nodes for the vendor clusterware, CSS, and OCR.

Steps:
1- Physically remove the primary storage cable or primary switch in the I/O path to the storage unit.
2- Determine the interim time (in seconds) that storage I/Os freeze, if any (note the SCSI timeout and SCSI retry settings). Wait at least CSS DISK TIMEOUT seconds (default 200s; in this test, please wait at least 600s). Use crsctl query css votedisk and ocrcheck to get the voting disk and OCR status.
3- Restore the storage path to the primary unit.

Variants:
Var 1 - Remove the secondary I/O path; wait less than the disk timeout; restore the secondary I/O path.
Var 2 - If a third-party volume manager is used in the storage, remove the primary storage cable from the volume manager master node (in lieu of the vendor clusterware node).
Var 3 - Repeat this test removing a single storage path to the OCR master node concurrent with node failure at the vendor/CSS master node. This test is important for metro and stretch clusters.

Test 2 (Storage - 2)

Action: ALL redundant storage paths: remove both storage cables anywhere in the network or SAN I/O path ==> Repeat for each of the vendor, CSS, and OCR master nodes

Preconditions:
- Initiate client workloads.
- Induce stress conditions: low free swap, high CPU.
- Identify the master nodes for the vendor clusterware, CSS, and OCR.
- Sanity check.

Steps:
1- Physically remove all redundant I/O paths to the master node. Repeat this at the host and at the storage switch.
2- Restore the I/O paths after the master is evicted and rebooted.

In 11.2.0.2, if the node doesn't reboot after CSSD terminates, use crsctl stop crs -f to stop the remaining clusterware processes, restore the I/O paths, then manually use crsctl start crs to start the CRS stack.

3- Wait 600s to ensure there are no evictions due to the join operations.

Variants:
Var 1 - If a third-party volume manager is used in the storage, remove the redundant I/O paths to this node.
Var 2 - Repeat this test removing all redundant storage paths to the OCR master node concurrent with node failure at the vendor/CSS master node. This test is important for metro and stretch clusters.

Test 3 (Storage - 3)

Action: Remove the primary + secondary storage cables anywhere in the network path ==> T multiple RAC hosts

Preconditions:
- Initiate client workloads.
- Identify a set of T = N-1 RAC hosts (N = number of clustered database hosts), including the vendor, CSS, and OCR master nodes.
- Sanity check.

Steps:
1- Physically remove both the primary and secondary storage cables, or another redundant component in the I/O path, from the current vendor and CSS masters.
2- Start monitoring the survivors' ocssd.log and wait until the vendor clusterware starts reconfiguration (a monitoring sketch follows this test).
3- When reconfiguration starts, repeat Step 1 against the surviving nodes until there is only one surviving node left. Include parallel evictions.

In 11.2.0.2, if the node doesn't reboot after CSSD terminates, use crsctl stop crs -f to stop the remaining clusterware processes, re-attach both storage cables, then manually use crsctl start crs to start the CRS stack.

4- Allow the evicted nodes to rejoin the cluster.
5- Restore the I/O paths and start the clusterware stacks on all nodes that are now without stacks running. Do this concurrently.
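
A minimal sketch of the ocssd.log monitoring in step 2, assuming the standard 11.2 Grid Infrastructure log layout under $GRID_HOME and that a reconfiguration message is what you are watching for:

    # On each surviving node, follow CSS activity and watch for reconfiguration
    # messages (log path follows the usual 11.2 layout; adjust if different):
    tail -f $GRID_HOME/log/$(hostname -s)/cssd/ocssd.log | grep -i reconfig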

Expected Test Outcome

NAS/SAN includes ASM, SLVM, and CFS:

Clusterware:
- No clusterware heartbeat, clusterware resource, or network failures occur. Multipathing takes care of routing I/O to the secondary I/O path, or back to the (restored) primary storage network (if applying the variant).

RAC:
- No impact on the stability of RAC hosts. For 11g, hang incidents are resolved.
- Uninterrupted cluster-wide I/O operations.
- For 11gR2, collect crsctl stat res -t in a 60-second loop from the beginning to the end of the run. Attach the output for auditing.

NAS/SAN includes ASM, SLVM, and CFS:
- For CSS and the vendor clusterware, the master node is evicted. When the OCR can be isolated, e.g. on separate SANs or NAS devices, the backup OCR mirror survives.
- No cluster failures/reboots, including when nodes rejoin.

Actual Test Outcome

RAC:
- No data corruption reported from surviving nodes.
- Zero impact on the stability of surviving RAC hosts.
- Oracle Clusterware resources managed by the CRS master should either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener, and singleton services.
- For 11gR2, collect crsctl stat res -t in a 60-second loop from the beginning to the end of the run. Attach the output for auditing.
- For 11.2.0.2, if all CRS resources and ASM/RDBMS processes are cleaned up prior to CSSD terminating, the node won't reboot after CSSD terminates. Otherwise, the node will still reboot.
- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.

NAS/SAN includes ASM, SLVM, and CFS:
- Each of the failed nodes reboots, followed by cluster reconfigurations, and then successfully rejoins the cluster.
- No data corruption or I/O interruption from surviving nodes.
- Specifically ensure NO I/O INTERRUPTION when the nodes rejoin.

RAC:
- Surviving RAC hosts should remain stable.
- Uninterrupted cluster-wide I/O operations.
- No report of complete cluster failures/reboots, including after nodes join.
- Oracle Clusterware resources managed by the CRS master should either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener, and singleton services.
- For 11gR2, collect crsctl stat res -t in a 60-second loop from the beginning to the end of the run. Attach the output for auditing.
- For 11.2.0.2, if all CRS resources and ASM/RDBMS processes are cleaned up prior to CSSD terminating, the node won't reboot after CSSD terminates. Otherwise, the node will still reboot.
- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.
