Clusterware Test
[Test Code]
Action: Test 1
Target: Storage 1

[HW-STOR-15b]
Preconditions:
Steps:
1- Make sure the voting files are in an ASM diskgroup.

[HW-STOR-16a]
Preconditions:

[HW-STOR-16b]
Preconditions:
Sanity Check
Steps:
1- Back up the OCR file.
2- Remove physical access to one diskgroup that stores OCR, for example by removing the disk itself, cables, or switches.
3- Use crsctl to create/start/stop/remove dummy resources (so that there are some write operations to OCR).
4- CRSD should mark the bad device and stop accessing it.
5- Wait for 15 minutes and check all the resources.
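Step 3 above can be sketched as a helper function. This is only a sketch: the resource name `test_dummy` and the action-script path are placeholders, and crsctl must be run with sufficient privileges on a live cluster.

```shell
# ocr_write_cycle creates, starts, stops, and deletes a placeholder
# resource so that CRSD performs write operations against the OCR.
# The resource name and action-script path are illustrative only.
ocr_write_cycle() {
    crsctl add resource test_dummy -type cluster_resource \
        -attr "ACTION_SCRIPT=/tmp/dummy_action.sh" || return 1
    crsctl start resource test_dummy  || return 1
    crsctl stop resource test_dummy   || return 1
    crsctl delete resource test_dummy || return 1
}
```

Invoke it repeatedly while the diskgroup is inaccessible; each call forces CRSD to touch the OCR.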
Clusterware:
- CSS will evict the nodes because there is no quorum. This will occur immediately once the voting disk operations incur IO failures and the total availability falls below quorum.
- On reboot, the system should restart and wait without CSS, also because there is no quorum.
- After quorum is restored, crsctl start crs should restore the clusterware stacks. A reboot should not be required.
- For 11.2.0.2, if all CRS resources and the ASM and RDBMS processes are cleaned up prior to CSSD terminating, the node won't reboot after CSSD terminates. Otherwise, the node will still reboot.
Clusterware:
- No impact
Clusterware:
- No impact
[Test Code]
Action: Test 1
Target: Storage - 1
Preconditions:
Fence or corrupt a subset of Oracle Clusterware voting disk mirrors ==>
3- CSS will mark the disk as stale. Wait for the long disk voting interval. Use crsctl query css votedisk to get the voting file (VF) status.
4- Restore access. CSS will recognize that the voting disk is now accessible.
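The status check in steps 3 and 4 can be scripted. A minimal sketch, assuming the 11.2 output format of crsctl query css votedisk (one line per voting file beginning with "<n>. <STATE>"); the helper name `check_votedisks` is chosen here for illustration:

```shell
# check_votedisks reads "crsctl query css votedisk" output on stdin and
# exits non-zero if any voting file reports a state other than ONLINE.
check_votedisks() {
    awk '/^ *[0-9]+\./ && $2 != "ONLINE" { bad = 1 } END { exit bad }'
}
# Typical use on a cluster node:
# crsctl query css votedisk | check_votedisks && echo "all ONLINE" || echo "stale/offline VF"
```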
Variants:
Test 2
Storage - 2
Preconditions:
Fence or corrupt a majority of Oracle Clusterware voting disk mirrors ==>
3- CSS will mark the disk as stale. Since there is no quorum, the clusterware will stop.
4- After the nodes reboot, restore access. Restart the clusterware using crsctl start crs.
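The recovery step can be sketched as a small helper, to be run as root on each rebooted node. This assumes a plain stack restart followed by a health check; the function name is introduced here for illustration.

```shell
# restart_stack brings the clusterware stack back up on one node after
# quorum has been restored, then reports daemon health.
restart_stack() {
    crsctl start crs || return 1
    crsctl check crs   # reports CRS, CSS, and EVM daemon status
}
```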
Test 3
Storage - 3
Steps:
Not an 11gR2 new feature; OCR not in an ASM diskgroup.

Test 4
Storage - 4
Preconditions:
Fence or corrupt a random set of Oracle datafiles (e.g. on different tablespaces) ==>
Any RAC host
Steps:
1- Back up the Oracle datafile filesystem.
2- Corrupt a random set of (one or more) Oracle datafiles by removing disks or overwriting disk headers.
3- After the RAC database fails, replace the backed-up datafiles before restarting the RAC instances.
Test 5
Storage - 5
Preconditions:
Corrupt a multiplexed redo member
Steps:
Clusterware:
- Extra voting disk files protect Oracle Clusterware from media failures and human errors.
- CSS marks the voting disk as STALE while it is unavailable.
- Run crsctl stat res -t in a 60s loop from the beginning to the end of the run. Attach the output for auditing.
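The 60-second audit loop can be sketched as below. `monitor_crs` is a helper name chosen here, and the `CRS_STAT_CMD` override exists only so the command can be substituted for a dry run; neither is part of the original procedure.

```shell
# monitor_crs appends a timestamp plus the full resource status to a log
# every <interval> seconds, for <count> iterations.
monitor_crs() {
    count=$1; interval=$2; log=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        date >> "$log"
        ${CRS_STAT_CMD:-crsctl stat res -t} >> "$log" 2>&1
        i=$((i + 1))
        [ "$i" -lt "$count" ] && sleep "$interval"
    done
    return 0
}
# Covering a one-hour test window, e.g.: monitor_crs 60 60 /tmp/crs_audit.log
```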
NAS/SAN:
- None
Clusterware:
- CSS will evict the nodes because there is no quorum. This will occur immediately once the voting disk operations incur IO failures and the total availability falls below quorum.
- On reboot, the system should restart and wait without CSS, also because there is no quorum.
- After quorum is restored, crsctl start crs should restore the clusterware stacks. A reboot should not be required.
Actual Test Outcome
Clusterware:
- Multiple OCR files protect Oracle Clusterware from cluster suicide. There should be no interruption to clusterware operations.
- For 11gR2, collect crsctl stat res -t in a 60s loop from the beginning to the end of the run. Attach the output for auditing.
RAC:
- When using RMAN, use RMAN restore operations to resume normal operation without a complete database rebuild. Otherwise restore the datafiles from the third-party clusterware.
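The RMAN path can be sketched as the helper below. This is an assumption-laden sketch: the datafile number passed in is a placeholder, and a usable RMAN backup plus a mounted or open instance are presumed.

```shell
# restore_datafile takes one corrupted datafile offline, restores it from
# backup, recovers it, and brings it back online via RMAN.
# The datafile number ($1) is supplied by the tester.
restore_datafile() {
    rman target / <<EOF
SQL 'ALTER DATABASE DATAFILE $1 OFFLINE';
RESTORE DATAFILE $1;
RECOVER DATAFILE $1;
SQL 'ALTER DATABASE DATAFILE $1 ONLINE';
EOF
}
# e.g. restore_datafile 7
```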
RAC:
- The redo members should prevent blocking of redo IO.
- Measure the brownout from a client perspective as redo log members switch. Also measure the time that the system continues without restoring the recovered member once the disk is repaired.
[Test Code]
Action: Test 1
Target: Storage - 1
Preconditions:
Steps:
1- Disable one of the two bonded NICs/HBAs from the CRS master.
2- Wait at least CSS DISK TIMEOUT seconds (default 200s; in this test, please wait at least 600 seconds). Use crsctl query css votedisk and ocrcheck to get the voting disk and OCR status.
3- Enable the NIC/HBA.
Variants:
Var 1. Repeat for a subset of the voting disks below quorum.
Test 2
Storage - 2
Preconditions:
Steps:
1- Disable both bonded storage NICs/HBAs from the current CRS master.
2- Wait at least CSS DISK TIMEOUT seconds.
3- Determine the interim time (in seconds) that database I/Os freeze, if any.
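One way to measure the freeze window in step 3 is a periodic read probe. A sketch only: the device path, probe count, and the `PROBE_INTERVAL` override are placeholders introduced here, not part of the original procedure.

```shell
# probe_io reads one 512-byte block from the device once per interval and
# logs a Unix timestamp after each successful read; a gap between
# consecutive timestamps in the log is the I/O freeze window.
probe_io() {
    dev=$1; count=$2; log=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        dd if="$dev" of=/dev/null bs=512 count=1 2>/dev/null && date +%s >> "$log"
        i=$((i + 1))
        sleep "${PROBE_INTERVAL:-1}"
    done
    return 0
}
# e.g. probe_io /dev/mapper/ocr_disk 600 /tmp/io_probe.log
```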
Variants:
Var1 - Do Step 4 before the voting disk times out (200s default, as shown in the CSS log) and use crsctl query css votedisk to get the voting files status.
- The RAC host resumes I/O activity as usual.
Clusterware:
- The CRS/OCR master does not fail, either after failing or after restoring IO connectivity. No node should be evicted, and no CRS resources should go offline.
RAC:
- The HBA should redirect IO down the redundant path.
RAC:
Var2 -
Actual Test Outcome
- Run crsctl stat res -t in a 60s loop from the beginning to the end of the run. Attach the output for auditing.
- For 11.2.0.2, if all CRS resources and the ASM and RDBMS processes are cleaned up prior to CSSD terminating, the node won't reboot after CSSD terminates. Otherwise, the node will still reboot.
[Test Code]
Action: Test 1
Target: Storage - 1
Preconditions:
Steps:
1- Physically remove the primary storage cable or primary switch in the IO path to the storage unit.
2- Determine the interim time (in seconds) that storage I/Os freeze, if any. (Note SCSI timeout and SCSI retry settings.) Wait at least CSS DISK TIMEOUT seconds; default 200s. In this test, please wait at least 600s.
Variants:
Var1 - Remove the secondary IO path; wait less than the disk timeout; restore the secondary IO path.
Var2 - If a third-party volume manager is used in the storage, remove the primary storage cable from the volume manager master node (in lieu of the vendor clusterware node).
Var3 - Repeat this test removing a single storage path to the OCR master node, concurrent with node failure at the vendor/CSS master node. This test is important for metro and stretch clusters.
Test 2
Storage - 2
Preconditions:
Steps:
Sanity Check
Variants:
Test 3
Storage 3
Preconditions:
5- Restore the IO paths and start the clusterware stacks on all nodes that are now without stacks running. Do this concurrently.
Cable Failures
Expected Test Outcome
RAC:
- No impact on the stability of RAC hosts. For 11g, hang incidents are resolved.
RAC: