
Veritas Volume Manager 3.X/4.X Troubleshooting

This TOI addresses Veritas Volume Manager 3.X/4.X troubleshooting

TOI Objectives

• Objective 1 Booting/Veritas Not Starting Issues


• Objective 2 Troubleshooting Exercise #1
• Objective 3 Troubleshooting Exercise #2
• Objective 4 Troubleshooting Exercise #3
• Objective 5 Troubleshooting Exercise #4
• Objective 6 Troubleshooting Exercise #5
• Objective 7 Troubleshooting Exercise #6
• Objective 8 Troubleshooting Exercise #7
• Objective 9 Split Brain
Objective 1 Booting/Veritas Not Starting Issues

VxVM = Veritas Volume Manager

#1 OS/Boot disk and mirror are under Veritas Volume Manager 3.x (4.0 is different) control and
system is not booting

1. Gather Information
- history, what led up to this
- any hardware replaced
- error messages
- what version of VxVM
- tried booting off the mirror or clone disks yet

What issues can we identify just from asking the right questions?

2. Isolate issue (OS, Hardware or VxVM) by bypassing VxVM (Basic Unencapsulation)

- at ok prompt
ok> printenv use-nvramrc?
use-nvramrc? = false
ok> setenv use-nvramrc? true # if false
ok> setenv auto-boot? false # keep system from booting to Solaris

- reset the system


ok> reset-all

- boot cdrom -s

- mount the root file system to /a

From which disk?


How do I know which is primary boot disk and/or the mirror?
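
A minimal sketch, assuming the primary boot disk is c0t0d0 with root on slice 0 (verify the actual device first, e.g. with format and prtvtoc):

# fsck -y /dev/rdsk/c0t0d0s0
# mount /dev/dsk/c0t0d0s0 /a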

- make backup copies of following files


cp /a/etc/system /a/etc/system.vx-rootdev
cp /a/etc/vfstab /a/etc/vfstab.vx
cp /a/etc/vx/volboot /a/etc/vx/volboot.org

- edit system file


cd /a/etc/
grep -v rootdev system > system.no-rootdev
cp system.no-rootdev system

- edit vfstab file converting it from volume to partition based

How can we verify vfstab.prevm is accurate? What c#t#d# should be used?
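
As an illustration of the conversion (assuming root on c0t0d0s0; the real devices must be confirmed against /etc/vfstab.prevm and the disk's VTOC), a line such as:

/dev/vx/dsk/rootvol /dev/vx/rdsk/rootvol / ufs 1 no -

becomes:

/dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 / ufs 1 no -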

- Create a file called /a/etc/vx/reconfig.d/state.d/install-db.

What other file/directory can keep VxVM from starting?
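
The file can be created with touch:

# touch /a/etc/vx/reconfig.d/state.d/install-db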

- cd; umount /a; fsck the OS file systems on this disk


- init 0; then boot -s first
(make sure you boot off the disk you edited)

If you are unsure which disk is the primary boot disk, you can use
'luxadm set_boot_dev -y /dev/dsk/c?t?d?s?' to set the boot device in the OBP.

- If the system boots ok, then ctrl-d to continue booting to run level 3.
(this helps verify whether the underlying hardware and OS are in a good state)

3. If the system successfully boots then we need to see if we can manually start VxVM.

- vxiod set 10; vxconfigd -d; vxdctl init; vxdctl enable

- cd /etc/vx; diff volboot volboot.org


(to check for differences that may be a factor in the boot issue)

What are some of the things in the volboot file that can cause boot/VxVM starting issues?

- vxdctl mode to verify if VxVM started
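
For example, a successful start reports:

# vxdctl mode
mode: enabled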

4. If we can manually start VxVM, then check to see if there are OS mirrors, and whether the mirror disk is
available.

IMPORTANT: If the boot disk contains mirrored volumes, one must take all the mirrors offline for
those volumes, except for the one on the boot disk. Offlining a mirror prevents VxVM from ever
performing a recovery on that plex. This step is critical in preventing data corruption.

What's the difference between offlining, disassociating, and detaching mirrors?

- vxdisk -o alldgs -e list # to check disk status

What issues can be seen in vxdisk list output that can cause boot/starting issues?

- vxprint -htrLg rootdg # rootdg disk group should be imported and all volumes should be in
a DISABLED state.

For example, if the boot disk is c0t0d0 with vxprint output as follows:


# vxprint -htg rootdg
...
v rootvol root DISABLED ACTIVE 1026000 PREFER rootvol-01
pl rootvol-01 rootvol DISABLED ACTIVE 1026000 CONCAT - RW
sd rootdisk-B0 rootvol-01 rootdisk 8378639 1 0 c0t0d0 ENA
sd rootdisk-02 rootvol-01 rootdisk 0 1025999 1 c0t0d0 ENA
pl rootvol-02 rootvol DISABLED ACTIVE 1027026 CONCAT - RW
sd rootmir-06 rootvol-02 rootmir 0 1027026 0 c0t1d0 ENA
...

- In this case the rootvol-02 plex should be offlined as it resides on c0t1d0:


vxmend -g rootdg off rootvol-02
- Start all volumes:
vxrecover -ns
- Start any recovery operations on volumes if needed:
vxrecover -bs
5. At this point the OS is booted, VxVM has been started successfully, and the mirrors are offlined.

- next step is to re-enable VxVM and reboot to test if VxVM will start automatically

What can cause VxVM to start manually but not automatically on boot up?

6. Once any debugging actions and/or other operations are completed, VxVM can be re-enabled
with the following steps.

- cp /etc/system.vx-rootdev /etc/system
- cp /etc/vfstab.vx /etc/vfstab
- rm /etc/vx/reconfig.d/state.d/install-db
- cd /; ls -l | grep -i vx # if you see a /VXVM_.... type directory, rename it to old.VXVM
- sync; sync; init 0
- at ok prompt boot off of same disk

7a. If system boots up:

- check for status of volumes and plexes and address any issues
vxinfo -g rootdg -p | egrep -iv "started|active"

- verify all file systems mounted and system is operating correctly

- online mirrors and start recovery operations on the mirrors that were just onlined.

- Example:
vxmend -g rootdg on rootvol-02
vxrecover -bs

7b. If the system does NOT boot up, use the following document to enable debug logging of the boot.

Document ID: 17461  Title: Veritas Volume Manager: How to log error messages

Also send the customer the following link or document so they can generate a vxexplorer and send it to us and Veritas:

Document ID: 243150


http://support.veritas.com/docs/243150

vxexplorer: How to download, execute, and send it to VERITAS Technical Support


#2 System boots to run level 3 but Veritas Volume Manager does not start, causing
loss of access to volumes containing file systems/data

1. Delete/rename the following files if present and either reboot or manually start VxVM (see procedure
above)

- /etc/vx/reconfig.d/state.d/install-db

- cd /; ls -l | grep -i vx # if you see a /VXVM_.... type directory, rename it to old.VXVM

2. If these files are not there check for:

- disks with private region

If there are a lot of disks, you can try using this script:

#! /usr/bin/ksh
# Dump the VTOC of every disk seen by the OS to one file for review.
file="/tmp/diskvtoc.out"
y=0
rm -f $file
for x in `/bin/ls /dev/rdsk/c*s2`
do
echo "" >> $file
echo "" >> $file
let y="$y"+1
echo "number:" $y >> $file; echo $x >> $file
echo "#########################" >> $file
/bin/ls -l $x >> $file 2>&1
echo "" >> $file
/usr/sbin/prtvtoc $x >> $file 2>&1
echo "*************************" >> $file
echo "" >> $file
echo "" >> $file
done

- Can the private regions be read? Are there disks in the rootdg disk group? Does the host id
match the /etc/vx/volboot file? Are the group ids of all the rootdg disks the same?

You can try this script as well.

#! /usr/bin/ksh
# Dump the VxVM private region contents of every disk (via vxprivutil) to one file for review.
file="/tmp/vx.out"
y=0
rm -f $file
for x in `/bin/ls /dev/rdsk/c*s2`
do
echo "" >> $file
echo "" >> $file
let y="$y"+1
echo "number:" $y >> $file; echo $x >> $file
echo "#########################" >> $file
/bin/ls -l $x >> $file 2>&1
echo "" >> $file
/usr/lib/vxvm/diag.d/vxprivutil list $x >> $file 2>&1
echo "*************************" >> $file
echo "" >> $file
echo "" >> $file
done
3. Use vxdctl init <hostid> to change the volboot file. DO NOT EDIT IT MANUALLY.

Note: You can edit the volboot file if it becomes necessary. Just keep in mind that vxconfigd will not
recognize it if the format is changed and/or the file size is no longer 512 bytes.
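
One quick sanity check after any change, based on the 512-byte requirement above:

# ls -l /etc/vx/volboot   # size should still be 512 bytes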

4. otherwise:

Document ID: 17461  Title: Veritas Volume Manager: How to log error messages

Also send the customer the following link or document so they can generate a vxexplorer and send it to us and Veritas:

Document ID: 243150


http://support.veritas.com/docs/243150

vxexplorer: How to download, execute, and send it to VERITAS Technical Support


Objective 2 Troubleshooting Exercise #1 Part A

c1t0d0 (rootdisk) was proactively removed from Veritas and was replaced, but vxdiskadm option 5
completes with no errors (or at least none that are noticed) without bringing the disk out of the removed
state.

Why?

What would you do to fix this?

What would you do to keep this from happening again?

# vxdisk list
DEVICE TYPE DISK GROUP STATUS
c1t0d0s2 sliced - - error
c1t0d0s2 sliced - - error
c1t0d0s2 sliced - - offline
c1t1d0s2 sliced disk01 rootdg online
c3t0d0s2 sliced - - offline
c3t0d89s2 sliced datadg01 datadg online
c3t0d90s2 sliced datadg02 datadg online
c3t0d146s2 sliced - - online
c3t0d217s2 sliced datadg03 datadg online
c3t0d218s2 sliced datadg04 datadg online
c4t0d0s2 sliced - - offline
c4t0d89s2 sliced - - offline
c4t0d90s2 sliced - - offline
c4t0d146s2 sliced - - offline
c4t0d217s2 sliced - - online
c4t0d218s2 sliced - - online
- - rootdisk rootdg removed was:c1t0d0s2
Objective 2 Troubleshooting Exercise #1 Part B

Same customer reboots but system fails to come up and gets following message.

"The file just loaded does not appear to be executable"

Why?

What would you do to get around this?

What would you do to keep this from happening again?

NOTE: see following page for needed outputs

Searching for disks...done


AVAILABLE DISK SELECTIONS:
0. c1t0d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
/pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w2100000c50ac76dc,0
1. c1t1d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
/pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w2100002037e3d51b,0
2. c3t0d0 <EMC-SYMMETRIX-5267 cyl 14 alt 2 hd 15 sec 64>
/pci@8,700000/fibre-channel@1/sd@0,0
...

Specify disk (enter its number):

# eeprom

auto-boot?=true
boot-command=boot
boot-file: data not available.
boot-device=/pci@8,600000/SUNW,qlc@4/fp@0,0/disk@w2100002037e3e93e,0:a disk
nvramrc=devalias net /pci@8,700000/network@3
devalias vx-disk01 /pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w2100002037e3d51b,0:a
devalias vx-rootdisk /pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w2100002037e3e93e,0:a
error-reset-recovery=boot
Objective 2 Troubleshooting Exercise #1 Part C

The same customer's system is now booted; vxdiskadm option 5 completes (no errors) without
bringing the disk out of the removed state.

# vxdisk list
DEVICE TYPE DISK GROUP STATUS
c1t0d0s2 sliced - - online
c1t1d0s2 sliced disk01 rootdg online
c3t0d0s2 sliced - - error
c3t0d89s2 sliced datadg01 datadg online
c3t0d90s2 sliced datadg02 datadg online
c3t0d146s2 sliced - - online
c3t0d217s2 sliced datadg03 datadg online
c3t0d218s2 sliced datadg04 datadg online
c4t0d0s2 sliced - - error
c4t0d89s2 sliced - - error
c4t0d90s2 sliced - - error
c4t0d146s2 sliced - - error
c4t0d217s2 sliced - - online
c4t0d218s2 sliced - - online
- - rootdisk rootdg removed was:c1t0d0s2

When we try to manually bring back c1t0d0 (rootdisk) we get the following error.

# vxdg -g rootdg -k adddisk rootdisk=c1t0d0s2


vxvm:vxdg: ERROR: associating disk-media rootdisk with c1t0d0s2:
Disk public region is too small

Why?

How would you fix this?

NOTE: see following pages for needed outputs


# vxprint -htg rootdg
DG NAME NCONFIG NLOG MINORS GROUP-ID
DM NAME DEVICE TYPE PRIVLEN PUBLEN STATE
RV NAME RLINK_CNT KSTATE STATE PRIMARY DATAVOLS SRL
RL NAME RVG KSTATE STATE REM_HOST REM_DG REM_RLNK
V NAME RVG KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
SV NAME PLEX VOLNAME NVOLLAYR LENGTH [COL/]OFF AM/NM MODE

dg rootdg default default 0 992973065.1025.ralph

dm disk01 c1t1d0s2 sliced 2888 71124291 -


dm rootdisk - - - - REMOVED

sd c1t0d0s2Priv - rootdisk 4194828 2888 PRIVATE - RMOV

v crash - ENABLED ACTIVE 4194828 ROUND - fsgen


pl crash-01 crash DISABLED REMOVED 4194828 CONCAT - WO
sd c1t0d0s2-04 crash-01 rootdisk 22875101 4194828 0 - RMOV
pl crash-02 crash ENABLED ACTIVE 4194828 CONCAT - RW
sd disk01-01 crash-02 disk01 0 4194828 0 c1t1d0 ENA

v home - ENABLED ACTIVE 44057250 ROUND - fsgen


pl home-01 home DISABLED REMOVED 44057250 CONCAT - WO
sd c1t0d0s2-03 home-01 rootdisk 27069929 44057250 0 - RLOC
pl home-02 home ENABLED ACTIVE 44057250 CONCAT - RW
sd disk01-02 home-02 disk01 4194828 44057250 0 c1t1d0 ENA

v rootvol - ENABLED ACTIVE 4194828 ROUND - root


pl rootvol-01 rootvol DISABLED REMOVED 4194828 CONCAT - RW
sd c1t0d0s2-B0 rootvol-01 rootdisk 4194827 1 0 - RMOV
sd c1t0d0s2-02 rootvol-01 rootdisk 0 4194827 1 - RLOC
pl rootvol-02 rootvol ENABLED ACTIVE 4194828 CONCAT - RW
sd disk01-03 rootvol-02 disk01 48252078 4194828 0 c1t1d0 ENA

v swapvol - ENABLED ACTIVE 16579971 ROUND - swap


pl swapvol-01 swapvol DISABLED REMOVED 16579971 CONCAT - WO
sd c1t0d0s2-01 swapvol-01 rootdisk 4197716 16579971 0 - RMOV
pl swapvol-02 swapvol ENABLED ACTIVE 16579971 CONCAT - RW
sd disk01-04 swapvol-02 disk01 52446906 16579971 0 c1t1d0 ENA

v var - ENABLED ACTIVE 2097414 ROUND - fsgen


pl var-01 var DISABLED REMOVED 2097414 CONCAT - WO
sd c1t0d0s2-05 var-01 rootdisk 20777687 2097414 0 - RLOC
pl var-02 var ENABLED ACTIVE 2097414 CONCAT - RW
sd disk01-05 var-02 disk01 69026877 2097414 0 c1t1d0 ENA
#Root disk partition
* /dev/rdsk/c1t0d0s2 partition map
*
* Dimensions:
* 512 bytes/sector
* 107 sectors/track
* 27 tracks/cylinder
* 2889 sectors/cylinder
* 24622 cylinders
* 24620 accessible cylinders
*
* Flags:
* 1: unmountable
* 10: read-only
*
* First Sector Last
* Partition Tag Flags Sector Count Sector Mount Directory
2 5 01 0 71127180 71127179
3 15 01 0 2889 2888
4 14 01 2889 71124291 71127179
<root@ralph:/tmp>
#

#Mirror disk partition

* /dev/rdsk/c1t1d0s2 partition map


*
* Dimensions:
* 512 bytes/sector
* 107 sectors/track
* 27 tracks/cylinder
* 2889 sectors/cylinder
* 24622 cylinders
* 24620 accessible cylinders
*
* Flags:
* 1: unmountable
* 10: read-only
*
* Unallocated space:
* First Sector Last
* Sector Count Sector
* 71127180 4272095083 48254966
*
* First Sector Last
* Partition Tag Flags Sector Count Sector Mount Directory
0 2 00 48254967 4194828 52449794
1 3 01 52449795 16579971 69029765
2 5 01 0 71127180 71127179
3 15 01 0 2889 2888
4 14 01 2889 71124291 71127179
6 7 00 69029766 2097414 71127179
# vxdisk list c1t0d0s2
Device: c1t0d0s2
devicetag: c1t0d0
type: sliced
hostid:
disk: name= id=1108863756.2353.ralph
group: name= id=
info: privoffset=1
flags: online ready private autoconfig autoimport
pubpaths: block=/dev/vx/dmp/c1t0d0s4 char=/dev/vx/rdmp/c1t0d0s4
privpaths: block=/dev/vx/dmp/c1t0d0s3 char=/dev/vx/rdmp/c1t0d0s3
version: 2.1
iosize: min=512 (bytes) max=2048 (blocks)
public: slice=4 offset=0 len=71124291
private: slice=3 offset=1 len=2888
update: time=1108863756 seqno=0.1
headers: 0 248
configs: count=1 len=2112
logs: count=1 len=320
Defined regions:
config priv 000017-000247[000231]: copy=01 offset=000000 disabled
config priv 000249-002129[001881]: copy=01 offset=000231 disabled
log priv 002130-002449[000320]: copy=01 offset=000000 disabled
<root@ralph:/>
#
#
#
# vxdisk list c1t1d0s2
Device: c1t1d0s2
devicetag: c1t1d0
type: sliced
hostid: ralph
disk: name=disk01 id=992973519.1091.ralph
group: name=rootdg id=992973065.1025.ralph
flags: online ready private autoconfig autoimport imported
pubpaths: block=/dev/vx/dmp/c1t1d0s4 char=/dev/vx/rdmp/c1t1d0s4
privpaths: block=/dev/vx/dmp/c1t1d0s3 char=/dev/vx/rdmp/c1t1d0s3
version: 2.1
iosize: min=512 (bytes) max=2048 (blocks)
public: slice=4 offset=0 len=71124291
private: slice=3 offset=1 len=2888
update: time=1108862981 seqno=0.153
headers: 0 248
configs: count=1 len=2112
logs: count=1 len=320
Defined regions:
config priv 000017-000247[000231]: copy=01 offset=000000 enabled
config priv 000249-002129[001881]: copy=01 offset=000231 enabled
log priv 002130-002449[000320]: copy=01 offset=000000 enabled
Objective 2 Troubleshooting Exercise #1 Answer Sheet
Part A

c1t0d0 (rootdisk) was proactively removed from Veritas and was replaced, but vxdiskadm option 5
completes (no errors) without bringing the disk out of the removed state.

Why? Duplicate entries in vxdisk list.

What would you do to fix this? Reboot, or follow infodoc 70929 (luxadm offline).

What would you do to keep this from happening again? Install patch 113201-05.
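
A rough sketch of the non-reboot approach (the exact steps are in infodoc 70929; the stale path is assumed here to be c1t0d0):

# luxadm -e offline /dev/dsk/c1t0d0s2   # offline the stale device path
# devfsadm -C                           # clean up dangling /dev links
# vxdctl enable                         # have VxVM rescan its device list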

Part B

The same customer reboots but the system fails to come up.

Why? boot-device is set to the WWN of the replaced disk.

What would you do to get around this? Boot off the mirror.

What would you do to keep this from happening again? Use the luxadm command to set boot-device to the
new disk's WWN:

# luxadm -v set_boot_dev /devices/pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w2100002037e3d51b,0:a

Current boot-device = /pci@8,600000/SUNW,qlc@4/fp@0,0/disk@w2100002037e3e93e,0:a disk


New boot-device = /pci@8,600000/SUNW,qlc@4/fp@0,0/disk@w2100002037e3d51b,0:a
Do you want to change boot-device to the new setting? (y/n) y
<root@ralph:/tmp>

# eeprom boot-device
boot-device=/pci@8,600000/SUNW,qlc@4/fp@0,0/disk@w2100002037e3d51b,0:a
<root@ralph:/tmp>
#

Part C

The same customer's system is now booted; vxdiskadm option 5 completes (no errors) without bringing the
disk out of the removed state.

When we try to manually bring back c1t0d0 (rootdisk) we get the following error.

# vxdg -g rootdg -k adddisk rootdisk=c1t0d0s2

vxvm:vxdg: ERROR: associating disk-media rootdisk with c1t0d0s2:

Disk public region is too small

Why?

From what I could gather from the Veritas Knowledge Base: for an encapsulated primary boot disk where the
special subdisk protecting the private region is created, if something other than "rootdisk" is used for the boot
disk name, or something other than "rootvol" for the root file system volume, vxdiskadm option 5 will not
work, since Veritas will not ignore the special subdisk, in this case called "c1t0d0s2Priv".

We can see from the vxprint output that the dm is listed as rootdisk, but the subdisks in rootvol are named after c1t0d0.

How would you fix this? 2 options to recover

#1 Document ID: 73801

Title: VERITAS Volume Manager 3.2 or higher: During replacement of a disk, the error "Disk public region
is too small" is displayed

# vxdg -g <dg> -p -k adddisk rootdisk=c#t#d#

Using this command can be helpful, especially on a root disk, because the subdisk offsets get skewed,
disallowing the same subdisks to be created on the new disk in the same locations as on the previous disk.
The '-p' option disregards the offsets of the subdisks' starting points, allowing the subdisks to be created on
the replacement disk.

#2 This will clear up the naming issue and resolve the problem:

Disassociate the original boot disk plexes and recursively remove them; then remove the subdisk "c1t0d0s2Priv",
then the rootdisk; then initialize the rootdisk and remirror.

# vxplex -f -g rootdg dis rootvol-01

# vxedit -rf rm rootvol-01

# vxplex -f -g rootdg dis crash-01

# vxedit -rf rm crash-01

# vxplex -f -g rootdg dis home-01

# vxedit -rf rm home-01

# vxplex -f -g rootdg dis swapvol-01

# vxedit -rf rm swapvol-01

# vxplex -f -g rootdg dis var-01

# vxedit -rf rm var-01

# vxedit -sf rm c1t0d0s2Priv

# vxedit -rf rm rootdisk

# vxdctl enable
# vxprint -htg rootdg

DG NAME NCONFIG NLOG MINORS GROUP-ID


DM NAME DEVICE TYPE PRIVLEN PUBLEN STATE
RV NAME RLINK_CNT KSTATE STATE PRIMARY DATAVOLS SRL
RL NAME RVG KSTATE STATE REM_HOST REM_DG REM_RLNK
V NAME RVG KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
SV NAME PLEX VOLNAME NVOLLAYR LENGTH [COL/]OFF AM/NM MODE

dg rootdg default default 0 992973065.1025.ralph

dm disk01 c1t1d0s2 sliced 2888 71124291 -

v crash - ENABLED ACTIVE 4194828 ROUND - fsgen


pl crash-02 crash ENABLED ACTIVE 4194828 CONCAT - RW
sd disk01-01 crash-02 disk01 0 4194828 0 c1t1d0 ENA

v home - ENABLED ACTIVE 44057250 ROUND - fsgen


pl home-02 home ENABLED ACTIVE 44057250 CONCAT - RW
sd disk01-02 home-02 disk01 4194828 44057250 0 c1t1d0 ENA

v rootvol - ENABLED ACTIVE 4194828 ROUND - root


pl rootvol-02 rootvol ENABLED ACTIVE 4194828 CONCAT - RW
sd disk01-03 rootvol-02 disk01 48252078 4194828 0 c1t1d0 ENA

v swapvol - ENABLED ACTIVE 16579971 ROUND - swap


pl swapvol-02 swapvol ENABLED ACTIVE 16579971 CONCAT - RW
sd disk01-04 swapvol-02 disk01 52446906 16579971 0 c1t1d0 ENA

v var - ENABLED ACTIVE 2097414 ROUND - fsgen


pl var-02 var ENABLED ACTIVE 2097414 CONCAT - RW
sd disk01-05 var-02 disk01 69026877 2097414 0 c1t1d0 ENA
<root@ralph:/>
#>

Then initialize the disk back in and mirror using vxdiskadm option 6.

What would you do to keep this from happening again?

Use the default (reserved) names for the boot disk & OS volumes, and no hot-sparing on the rootdisk.
Objective 3 Troubleshooting Exercise #2 Part A

The customer attempted to grow a file system and volume from 2 GB to 12 GB since it was almost full.

Command typed in by the customer...

vxresize -bx -F vxfs -g iodg ccs 25161728 iodg03 iodg04

df -k shows it succeeded, but the volume is still 2 GB in size, and large files cannot be created in the file system.

They can create small files, though.

df -k
Filesystem kbytes used avail capacity Mounted on
/dev/vx/dsk/iodg/ccs 12580864 1071444 10790204 10% /ccs

12,580,864 KB (12 GB)

Disk group: iodg


...
v ccs - ENABLED ACTIVE 4194304 SELECT - fsgen
pl ccs-01 ccs ENABLED ACTIVE 4195072 CONCAT - RW
sd iodg01-02 ccs-01 iodg01 4195072 4195072 0 Disk_10 ENA
pl ccs-02 ccs ENABLED ACTIVE 4195072 CONCAT - RW
sd iodg02-02 ccs-02 iodg02 16778496 4195072 0 Disk_44 ENA

4194304 sectors (vxprint's default unit) x 512 bytes per sector = 2,147,483,648 bytes (2 GB)

Why?

What would you do to fix this?


Objective 3 Troubleshooting Exercise #2 Answer Sheet

Actions taken:

tried vxdctl enable and vxconfigd -k -x cleartempdir - no change

checked the file system's original creation options using "mkfs -F vxfs -m"

unix> mkfs -F vxfs -m /dev/vx/rdsk/iodg/ccs

mkfs -F vxfs -o ninode=unlimited,bsize=1024,version=5,inosize=256,logsize=16384,nolargefiles /dev/vx/rdsk/iodg/ccs 25161728

unix>

checked SunSolve and checked patches

Veritas recommended first reducing the file system size back to the original size:

root@ndwest # cd /
root@ndwest # fsadm -F vxfs -b 4194304 /ccs
UX:vxfs fsadm: INFO: /dev/vx/rdsk/iodg/ccs is currently 25161728 sectors - size will be reduced

root@ndwest # df -k /ccs
Filesystem kbytes used avail capacity Mounted on
/dev/vx/dsk/iodg/ccs 2097152 1072970 960293 53% /ccs

root@ndwest # cd /ccs
root@ndwest # touch john
root@ndwest # rm john
root@ndwest # cd /

Next, try to grow the volume by itself, then the file system separately:

root@ndwest # vxassist -g iodg growto ccs 25161728 iodg03 iodg04


vxvm:vxassist: WARNING: dm:iodg03: No disk space matches specification
vxvm:vxassist: WARNING: dm:iodg04: No disk space matches specification

The command completed despite the warning messages, and vxprint shows the correct new volume size:

vxprint -ht
v ccs - ENABLED ACTIVE 25161728 SELECT - fsgen
pl ccs-01 ccs ENABLED ACTIVE 25162112 CONCAT - RW
sd 603e_503-03 ccs-01 603e_503 4195072 4195072 0 Disk_10 ENA
sd 603W_1299-01 ccs-01 603W_1299 0 20967040 4195072 Disk_52 ENA
pl ccs-02 ccs ENABLED ACTIVE 25162112 CONCAT - RW
sd 603W_1022-03 ccs-02 603W_1022 16778496 4195072 0 Disk_44 ENA
sd 603E_299-01 ccs-02 603E_299 0 20967040 4195072 Disk_53 ENA

Next, the SSE successfully grew the file system as well.
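
A sketch of that file system grow, using the same fsadm syntax as the shrink above (the new size in sectors matches the grown volume):

root@ndwest # fsadm -F vxfs -b 25161728 /ccs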

Issue resolved.
Objective 4 Troubleshooting Exercise #3

Customer temporarily lost access to several fibre disks, and is now unable to access several file systems even
though df -k shows them mounted.

gemini:#vxdisk -o alldgs list


DEVICE TYPE DISK GROUP STATUS
c0t0d0s2 sliced rootdisk rootdg online
c0t8d0s2 sliced rootmirror rootdg online nohotuse
c1t0d0s2 sliced disk02 rootdg online
c1t1d0s2 sliced disk03 rootdg online
c1t2d0s2 sliced disk04 rootdg online
c2t8d0s2 sliced disk05 rootdg online
c2t9d0s2 sliced disk07 rootdg online
c2t10d0s2 sliced disk06 rootdg online
c6t0d0s2 sliced - - error
c6t0d1s2 sliced whsdthordg01 whsdthordg online
c6t0d2s2 sliced whsdthordg02 whsdthordg online
c6t0d3s2 sliced - - error
c6t0d4s2 sliced - (prodthordg) online
c6t0d5s2 sliced - - error
c6t0d6s2 sliced - (prodthordg) online
c6t0d7s2 sliced - - error
c6t0d8s2 sliced - - error
c6t0d10s2 sliced prodthordg03 prodthordg online dgdisabled
c6t0d11s2 sliced prodthordg00 prodthordg online dgdisabled
c6t0d20s2 sliced tmp-new-oraapplokidg00 oraapplokidg online
c6t0d21s2 sliced tmp-new-ncamplokidg00 ncamplokidg online
c6t0d22s2 sliced tmp-new-data1lokidg00 data1lokidg online
c6t0d55s2 sliced - - error
c6t0d56s2 sliced - - error
c6t0d60s2 sliced data1thordg00 data1thordg online
gemini:#vxdg list
NAME STATE ID
rootdg enabled 1067639375.1025.gemini
data1lokidg enabled 1092461553.2008.loki.recycled-greetings.com
data1thordg enabled 1084465351.1614.thor-new
ncamplokidg enabled 1092461211.1995.loki.recycled-greetings.com
oraapplokidg enabled 1092461176.1986.loki.recycled-greetings.com
prodthordg disabled 1091985480.1979.thor-new
whsdthordg enabled 1091985481.1982.thor-new

gemini:#vxprint -htg prodthordg


vxvm:vxprint: ERROR: Disk group prodthordg: No such disk group

Why?
What would you do to fix this?
Objective 4 Troubleshooting Exercise #3 Answer Sheet

- grep prodthordg /etc/vfstab

- lockfs -fhv /mountpoint

- fuser -kc /mountpoint

- umount /mountpoint

- vxdg deport prodthordg

- vxdctl enable

- vxdg import prodthordg
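
After the import succeeds, the volumes still need to be started and the file systems remounted; a minimal sketch, assuming a hypothetical mount point /prodthor:

- vxrecover -g prodthordg -sb

- fsck -y -F vxfs /dev/vx/rdsk/prodthordg/<volume>

- mount -F vxfs /dev/vx/dsk/prodthordg/<volume> /prodthor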


Objective 5 Troubleshooting Exercise #4

The customer upgraded from Solaris 8 to Solaris 9 using Live Upgrade, then tried to boot off the disk with
Solaris 9 and got the following message:

...
SunOS Release 5.9 Version Generic_117171-02 64-bit

Copyright 1983-2003 Sun Microsystems, Inc. All rights reserved.

Use is subject to license terms.

WARNING: vxio: incompatible kernel version (5.9), expecting 5.8 <<<<<<<<<<<<

WARNING: forceload of misc/md_trans failed

WARNING: forceload of misc/md_raid failed


...

Why?

What would you do to fix this?

Objective 5 Troubleshooting Exercise #4 Answer Sheet

Why? Veritas is still using the Solaris 8 driver.

What would you do to fix this? Had the customer pkgrm & pkgadd VRTSvxvm so Veritas would use the
correct driver.
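
A rough sketch of that reinstall (assuming the VRTSvxvm package for the installed VxVM version is available in the current directory; protect/unencapsulate the boot disk first as described above):

# pkgrm VRTSvxvm
# pkgadd -d . VRTSvxvm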
Objective 6 Troubleshooting Exercise #5

An A5200 hung, and the customer had to power-cycle the array to regain access to the disks.
RAID5 volumes are in the "DISABLED ACTIVE" state and show several subdisks in the "RCOV"
state.

Review of the /var/adm/messages file and root emails from Veritas confirms loss of access
to several disks at the same time.

How would you address this issue so as to have the best chance of recovering data?

Volume "standby"

v standby - DISABLED ACTIVE 640092800 RAID - raid5


pl findata-01 standby DISABLED ACTIVE 640115520 RAID 11/32 RW
sd bcpsdg08-01 findata-01 bcpsdg08 0 64011573 0/0 c4t6d0 ENA
sd bcpsdg09-01 findata-01 bcpsdg09 0 64011573 1/0 c4t7d0 ENA
sd bcpsdg10-01 findata-01 bcpsdg10 0 64011573 2/0 c4t8d0 ENA
sd bcpsdg11-01 findata-01 bcpsdg11 0 64011573 3/0 c4t9d0 ENA
sd bcpsdg02-01 findata-01 bcpsdg02 0 64011573 4/0 c4t10d0 ENA
sd bcpsdg18-01 findata-01 bcpsdg18 0 64011573 5/0 c6t22d0 RCOV
sd bcpsdg19-01 findata-01 bcpsdg19 0 64011573 6/0 c6t23d0 RCOV
sd bcpsdg20-01 findata-01 bcpsdg20 0 64011573 7/0 c6t24d0 RCOV
sd bcpsdg21-01 findata-01 bcpsdg21 0 64011573 8/0 c6t25d0 RCOV
sd bcpsdg22-01 findata-01 bcpsdg22 0 64011573 9/0 c6t26d0 RCOV
sd bcpsdg17-02 findata-01 bcpsdg17 11556 64011573 10/0 c6t21d0 RCOV

Volume "swrdata"

v swrdata - DETACHED REPLAY 640092800 RAID - raid5


pl swrdata-01 swrdata ENABLED ACTIVE 640115520 RAID 11/32 RW
sd bcpsdg01-01 swrdata-01 bcpsdg01 0 64011573 0/0 c4t0d0 ENA
sd bcpsdg03-01 swrdata-01 bcpsdg03 0 64011573 1/0 c4t1d0 ENA
sd bcpsdg04-01 swrdata-01 bcpsdg04 0 64011573 2/0 c4t2d0 ENA
sd bcpsdg05-01 swrdata-01 bcpsdg05 0 64011573 3/0 c4t3d0 ENA
sd bcpsdg06-01 swrdata-01 bcpsdg06 0 64011573 4/0 c4t4d0 ENA
sd bcpsdg07-01 swrdata-01 bcpsdg07 0 64011573 5/0 c4t5d0 ENA
sd bcpsdg12-01 swrdata-01 bcpsdg12 0 64011573 6/0 c6t16d0 RCOV
sd bcpsdg13-01 swrdata-01 bcpsdg13 0 64011573 7/0 c6t17d0 RCOV
sd bcpsdg14-01 swrdata-01 bcpsdg14 0 64011573 8/0 c6t18d0 RCOV
sd bcpsdg15-02 swrdata-01 bcpsdg15 0 56901744 9/0 c6t19d0 RCOV
sd bcpsdg15-01 swrdata-01 bcpsdg15 64011573 7109829 9/56901744 c6t19d0 RCOV
sd bcpsdg16-01 swrdata-01 bcpsdg16 0 64011573 10/0 c6t20d0 RCOV
pl swrdata-02 swrdata ENABLED LOG 11556 CONCAT - RW <<<< log plex
sd bcpsdg08-02 swrdata-02 bcpsdg08 64011573 11556 0 c4t6d0 ENA
Objective 6 Troubleshooting Exercise #5 Answer Sheet

NOTE: This is from actual Sun customer case 64395069, where Veritas support was involved. It also references the
Veritas doc http://seer.support.veritas.com/docs/251793.htm and a Sun internal document on RAID5 volume
recovery procedures.

See Document ID: 79162  Title: VERITAS Volume Manager: Recovering a RAID5 Volume State After a
Disk Channel Failure

How would you address this issue so as to have the best chance of recovering data?

Actions taken to TRY to recover the RAID5 volumes:


- verify the connectivity issue is resolved and the OS (i.e. format) can see all disks in a good state

#### NOTE: do not run format > analyze > refresh on a RAID5 ####

- run vxconfigd -k -x cleartempdir and vxdctl enable to verify we have the most up-to-date info


- verify vxdisk list shows all relevant disks online now
- check vxprint -htrLg bcpsdg output to see if still have volumes with several subdisks in RCOV state
- run vxtask -l -g <dg> list to verify no current recovery processes running on it
- first try /etc/vx/bin/vxreattach -br <<< give it a few minutes to complete
- run vxtask -l -g <dg> list to see if any resyncs started
- if no change

# vxvol -g <diskgroup> -o delayrecover start <volume>

If this does not work, then use the force option. The document states not to force start the volume, but this is a
situation where VxVM lost access to several disks at the same time, which is not covered in the document.

# vxvol -g <diskgroup> -f -o delayrecover start <volume>

delayrecover

Does not perform any plex revive operations when starting a volume. Instead, the volume and any plexes
are enabled. This may leave some stale plexes, and may leave a mirrored volume in a special read-
writeback (NEEDSYNC) recover state that performs limited plex recovery for each read to the volume.

- check file system integrity

# fsck -y -F <file sys type> /dev/vx/rdsk/<diskgroup>/<volume>

- mount and check data in file system

- if the data is good, then a full backup is recommended immediately

NOTE: For RAID 0 volumes and RAID5 volumes without LOG plexes,
the use of the "-f" option of the vxvol command will be necessary
to start the volume; otherwise the log plex will have to be disassociated and then reattached
after starting the volume.

NOTE: For RAID 1+0 (known as Striped Pro and Concat Pro) volume types,
you will need to run "vxrecover -s -E <volume>" AFTER all underlying layered
volumes have been repaired.
$LOGNAME@sunbcpsunx001$PWD# vxvol -g bcpsdg -o delayrecover start standby
vxvm:vxvol: ERROR: Volume standby is not startable; some subdisks are unusable and the parity is stale
vxvm:vxvol: ERROR: Volume standby is invalid
$LOGNAME@sunbcpsunx001$PWD#

Veritas advised the customer to force start the volume by running:

vxvol -g <dg> -f -o delayrecover start standby

v standby - ENABLED NEEDSYNC 640092800 RAID - raid5


pl findata-01 standby ENABLED ACTIVE 640115520 RAID 11/32 RW
sd bcpsdg08-01 findata-01 bcpsdg08 0 64011573 0/0 c4t6d0 ENA
sd bcpsdg09-01 findata-01 bcpsdg09 0 64011573 1/0 c4t7d0 ENA
sd bcpsdg10-01 findata-01 bcpsdg10 0 64011573 2/0 c4t8d0 ENA
sd bcpsdg11-01 findata-01 bcpsdg11 0 64011573 3/0 c4t9d0 ENA
sd bcpsdg02-01 findata-01 bcpsdg02 0 64011573 4/0 c4t10d0 ENA
sd bcpsdg18-01 findata-01 bcpsdg18 0 64011573 5/0 c6t22d0 ENA
sd bcpsdg19-01 findata-01 bcpsdg19 0 64011573 6/0 c6t23d0 ENA
sd bcpsdg20-01 findata-01 bcpsdg20 0 64011573 7/0 c6t24d0 ENA
sd bcpsdg21-01 findata-01 bcpsdg21 0 64011573 8/0 c6t25d0 ENA
sd bcpsdg22-01 findata-01 bcpsdg22 0 64011573 9/0 c6t26d0 ENA
sd bcpsdg17-02 findata-01 bcpsdg17 11556 64011573 10/0 c6t21d0 ENA

The customer ran fsck, then cleared the flag, and was able to mount the vxfs file system in this volume.
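
A minimal sketch of that check-and-mount, assuming a hypothetical mount point /standby:

# fsck -y -F vxfs /dev/vx/rdsk/bcpsdg/standby
# mount -F vxfs /dev/vx/dsk/bcpsdg/standby /standby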

root@sunbcpsunx001/standby#vxvol -g bcpsdg -o delayrecover start swrdata


vxvm:vxvol: ERROR: Volume swrdata is not startable; Raid5 plex does not map the entire volume length

disassociated the log plex

then ran vxvol -g <dg> -f -o delayrecover start swrdata

the volume is now ENABLED NEEDSYNC

attached log plex
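
A sketch of those dissociate/start/reattach steps, assuming the plex names from the vxprint output above:

# vxplex -g bcpsdg dis swrdata-02
# vxvol -g bcpsdg -f -o delayrecover start swrdata
# vxplex -g bcpsdg att swrdata swrdata-02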

v swrdata - ENABLED NEEDSYNC 640092800 RAID - raid5


pl swrdata-01 swrdata ENABLED ACTIVE 640115520 RAID 11/32 RW
sd bcpsdg01-01 swrdata-01 bcpsdg01 0 64011573 0/0 c4t0d0 ENA
sd bcpsdg03-01 swrdata-01 bcpsdg03 0 64011573 1/0 c4t1d0 ENA
sd bcpsdg04-01 swrdata-01 bcpsdg04 0 64011573 2/0 c4t2d0 ENA
sd bcpsdg05-01 swrdata-01 bcpsdg05 0 64011573 3/0 c4t3d0 ENA
sd bcpsdg06-01 swrdata-01 bcpsdg06 0 64011573 4/0 c4t4d0 ENA
sd bcpsdg07-01 swrdata-01 bcpsdg07 0 64011573 5/0 c4t5d0 ENA
sd bcpsdg12-01 swrdata-01 bcpsdg12 0 64011573 6/0 c6t16d0 ENA
sd bcpsdg13-01 swrdata-01 bcpsdg13 0 64011573 7/0 c6t17d0 ENA
sd bcpsdg14-01 swrdata-01 bcpsdg14 0 64011573 8/0 c6t18d0 ENA
sd bcpsdg15-02 swrdata-01 bcpsdg15 0 56901744 9/0 c6t19d0 ENA
sd bcpsdg15-01 swrdata-01 bcpsdg15 64011573 7109829 9/56901744 c6t19d0 ENA
sd bcpsdg16-01 swrdata-01 bcpsdg16 0 64011573 10/0 c6t20d0 ENA
pl swrdata-02 swrdata ENABLED LOG 11556 CONCAT - RW
sd bcpsdg08-02 swrdata-02 bcpsdg08 64011573 11556 0 c4t6d0 ENA

fsck'd the file system and mounted it


Objective 7 Troubleshooting Exercise #6

vxvm:vxvol: ERROR: Volume <vol_name> has no CLEAN or non-volatile ACTIVE plexes

Starting a volume reports the error above. The vxprint output shows that the plexes for the volume are in
"DISABLED RECOVER" state.

What would you do to fix this?

Here is an example:
The disk group dg01 has 2 volumes, apps and home. Trying to start all the volumes reported the following
error:

# vxvol -g dg01 startall


vxvm:vxvol: ERROR: Volume home has no CLEAN or non-volatile ACTIVE plexes

# vxprint -g dg01 -th <== Showed the following


...
dg dg01 2 2 123000 1021305687.1295.obp1

dm appsdisk c0t1d0s2 sliced 11555 71112735 -


dm appsmirror c1t1d0s2 sliced 11555 71112735 -
dm homedisk c2t0d0s2 sliced 14135 35349424 -
dm homemirror c3t0d0s2 sliced 14135 35349424 -

v apps - ENABLED ACTIVE 70840320 SELECT - fsgen


pl apps-01 apps ENABLED ACTIVE 70841169 CONCAT - RW
sd appsdisk-01 apps-01 appsdisk 0 70841169 0 c0t1d0s2 ENA
pl apps-02 apps ENABLED ACTIVE 70841169 CONCAT - RW
sd appsmirror-01 apps-02 appsmirror 0 70841169 0 c1t1d0s2 ENA

v home - DISABLED ACTIVE 16896000 SELECT - fsgen


pl home-01 home DISABLED RECOVER 16897232 CONCAT - RW
sd homedisk-01 home-01 homedisk 0 16897232 0 c2t0d0 ENA
pl home-02 home DISABLED RECOVER 16897232 CONCAT - RW
sd homemirror-01 home-02 homemirror 0 16897232 0 c3t0d0 ENA
Objective 7 Troubleshooting Exercise #6 Answer Sheet

- If both of the plexes in a volume are in RECOVER state, it is recommended to stop one plex, force start
the volume, and check the file system data. Similarly, stop the volume, stop the previously started plex, start
the previously stopped plex, force start the volume, and check the file system data. Compare which data is
more recent and change the state of the plexes accordingly to synchronize.

# vxmend -g dg01 -o force off home-02

# vxmend -g dg01 -o force off home-01

# vxmend -g dg01 on home-01

# vxmend -g dg01 fix clean home-01

The volume will then start successfully using the cleaned plex:

# vxrecover -s -g dg01 home

# fsck -F <fs type> /dev/vx/rdsk/<diskgroup>/<volume>

# mount -F <fs type> /dev/vx/dsk/<diskgroup>/<volume> /mountpoint

check data

check other side by

# vxvol -g dg01 stop home

# vxmend -g dg01 -o force off home-01

# vxmend -g dg01 on home-02

# vxmend -g dg01 fix clean home-02

The volume will then start successfully using the cleaned plex:

# vxrecover -s -g dg01 home

# fsck -F <fs type> /dev/vx/rdsk/<diskgroup>/<volume>

# mount -F <fs type> /dev/vx/dsk/<diskgroup>/<volume> /mountpoint

check data

Then decide which side to resync from.
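
Once the good side is chosen (say home-01), a sketch of the final resync, following the same vxmend/vxrecover pattern used above:

# vxvol -g dg01 stop home
# vxmend -g dg01 on home-01          <== if home-01 is currently offline
# vxmend -g dg01 fix clean home-01
# vxmend -g dg01 fix stale home-02   <== mark the out-of-date plex for resync
# vxrecover -s -g dg01 home          <== starts the volume and resyncs home-02 from home-01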


Objective 8 Troubleshooting Exercise #7

Issue:
- getting the "...Disk public region is too small..." error when bringing a new/replacement
disk back under Veritas control

...
Select a removed or failed disk [<disk>,list,q,?] disk01

Select disk device to initialize [<address>,list,q,?] c0t11d0

The requested operation is to initialize disk device c0t11d0 and


to then use that device to replace the removed or failed disk
disk01 in disk group crackle_bkup.

Continue with operation? [y,n,q,?] (default: y)

Use fastresync for plex synchronization? [y,n,q,?] (default: n)

Use a default private region length for the disk?


[y,n,q,?] (default: y)

Replacement of disk disk01 in group crackle_bkup with device c0t11d0


failed.
vxvm:vxdg: ERROR: associating disk-media disk01 with c0t11d0s2:
Disk public region is too small

Replace a different disk? [y,n,q,?] (default: n)

Why?

What would you do to fix this?


# vxprint -htrLg crackle_bkup

dg crackle_bkup default default 126000 1101067439.1662.crackle

dm disk01 - - - - REMOVED
dm disk02 c0t13d0s2 sliced 3590 17674902 -

v swapvol2 - DISABLED ACTIVE 35352576 SELECT - fsgen


pl swapvol2-01 swapvol2 DISABLED REMOVED 35353395 CONCAT - RW
sd disk01-01 swapvol2-01 disk01 3590 17678493 0 - RMOV
sd disk02-01 swapvol2-01 disk02 0 17674902 17678493 c0t13d0 ENA

# prtvtoc /dev/rdsk/c0t11d0s2
* /dev/rdsk/c0t11d0s2 partition map
*
...
* First Sector Last
* Partition Tag Flags Sector Count Sector Mount Directory
2 5 00 0 17682084 17682083
3 15 01 3591 3591 7181
4 14 01 7182 17674902 17682083

# prtvtoc /dev/rdsk/c0t13d0s2
* /dev/rdsk/c0t13d0s2 partition map
...
* First Sector Last
* Partition Tag Flags Sector Count Sector Mount Directory
2 5 00 0 17682084 17682083
3 15 01 3591 3591 7181
4 14 01 7182 17674902 17682083

# vxconfigd -k -x cleartempdir

# /usr/sbin/vxdctl enable

# vxdisk -o alldgs list


DEVICE TYPE DISK GROUP STATUS
c0t10d0s2 sliced rootdisk rootdg online
c0t11d0s2 sliced - - online
c0t12d0s2 sliced rootmirr rootdg online
c0t13d0s2 sliced disk02 crackle_bkup online
c10t0d1s2 sliced CRACKLE01 CRACKLEdg online
c10t0d2s2 sliced CRACKLE02 CRACKLEdg online
...
c10t0d42s2 sliced CRACKLE42 CRACKLEdg online
- - disk01 crackle_bkup removed was:c0t11d0s2
Objective 8 Troubleshooting Exercise #7 Answer Sheet

2 options to recover

#1 Document ID: 73801

Title: VERITAS Volume Manager 3.2 or higher: During replacement of a disk, the error "Disk public region
is too small" is displayed

# vxdg -g <dg> -p -k adddisk rootdisk=c#t#d#

Using this command can be helpful, especially on a root disk, because the subdisk offsets get skewed,
disallowing the same subdisks to be created on the new disk in the same locations as on the previous disk.
The '-p' option disregards the offsets of the subdisks' starting points, allowing the subdisks to be created on
the replacement disk.

#2 Since the data was already gone, we were able to get around the issue by:

- deleting volume swapvol2, then disk01

- then reinitializing disk01 back into the disk group

- then recreating the volume

# vxedit -g crackle_bkup -fr rm swapvol2

# vxedit -g crackle_bkup rm disk01

# vxdisk list
DEVICE TYPE DISK GROUP STATUS
c0t10d0s2 sliced rootdisk rootdg online
c0t11d0s2 sliced - - error
c0t12d0s2 sliced rootmirr rootdg online
c0t13d0s2 sliced disk02 crackle_bkup online
c10t0d1s2 sliced CRACKLE01 CRACKLEdg online
...

# vxprint -htg crackle_bkup


dg crackle_bkup default default 126000 1101067439.1662.crackle
dm disk02 c0t13d0s2 sliced 3590 17674902 -

# ./vxdisksetup -i c0t11d0
# prtvtoc /dev/rdsk/c0t11d0s2
* Partition Tag Flags Sector Count Sector Mount Directory
2 5 00 0 17682084 17682083
3 15 01 3591 3591 7181
4 14 01 7182 17674902 17682083

# vxdg -g crackle_bkup adddisk disk01=c0t11d0

# vxassist -g crackle_bkup maxsize


Maximum volume size: 35348480 (17260Mb)

# vxassist -g crackle_bkup make swapvol2 35348480

# vxprint -htg crackle_bkup


dg crackle_bkup default default 126000 1101067439.1662.crackle
dm disk01 c0t11d0s2 sliced 3334 17674902 -
dm disk02 c0t13d0s2 sliced 3590 17674902 -
v swapvol2 - ENABLED ACTIVE 35348480 SELECT - fsgen
pl swapvol2-01 swapvol2 ENABLED ACTIVE 35349804 CONCAT - RW
sd disk01-01 swapvol2-01 disk01 0 17674902 0 c0t11d0 ENA
sd disk02-01 swapvol2-01 disk02 0 17674902 17674902 c0t13d0 ENA

# vxdisk list
DEVICE TYPE DISK GROUP STATUS
c0t10d0s2 sliced rootdisk rootdg online
c0t11d0s2 sliced disk01 crackle_bkup online
c0t12d0s2 sliced rootmirr rootdg online
...
Objective 9 Information on Split Brain
Cannot get disk(s) to re-attach (VxVM 4.X); vxreattach -r c#t#d# gives the following message:

VxVM vxdg ERROR V-5-1-10127 associating disk-media vistadg101 with c#t#d#s2:


Serial Split Brain detected. Run vxsplitlines
OR
VxVM vxconfigd NOTICE V-5-0-33 Split Brain. da id is 0.1, while dm id is 0.0 for DM <dg name>
VxVM vxdg ERROR V-5-1-587 Disk group <dg name>: import failed: Serial Split Brain detected. Run
vxsplitlines

Veritas Document ID: 269233 http://support.veritas.com/docs/269233


How to recover from a serial split brain

Details:
Background:
The Serial Split Brain condition arises because VERITAS Volume Manager (tm) increments the serial ID in
the disk media record of each imported disk in all the disk group configurations on those disks. A new serial
(SSB) ID has been included as part of the new disk group version=110 in Volume Manager 4 to assist with
recovery of the disk group from this condition. The value that is stored in the configuration database
represents the serial ID that the disk group expects a disk to have. The serial ID that is stored in a disk's
private region is considered to be its actual value.

If some disks went missing from the disk group (due to physical disconnection or power failure) and those
disks were imported by another host, the serial IDs for the disks in their copies of the configuration
database, and also in each disk's private region, are updated separately on that host. When the disks are
subsequently reimported into the original shared disk group, the actual serial IDs on the disks do not agree
with the expected values from the configuration copies on other disks in the disk group.

The disk group cannot be reimported because the databases do not agree on the actual and expected serial
IDs. You must choose which configuration database to use. This is a true serial split brain condition, which
Volume Manager cannot correct automatically. In this case, the disk group import fails, and the vxdg utility
outputs error messages similar to the following before exiting:

VxVM vxconfigd NOTICE V-5-0-33 Split Brain. da id is 0.1, while dm id is 0.0 for DM <dg name>
VxVM vxdg ERROR V-5-1-587 Disk group <dg name>: import failed: Serial Split Brain detected. Run
vxsplitlines

The import does not succeed even if you specify the -f flag to vxdg.

Although it is usually possible to resolve this conflict by choosing the version of the configuration database
with the highest valued configuration ID (shown as config_tid in the output from the vxprivutil dumpconfig
<device>), this may not be the correct thing to do in all circumstances.

To resolve conflicting configuration information, you must decide which disk contains the correct version
of the disk group configuration database. To assist you in doing this, you can run the vxsplitlines command
to show the actual serial ID on each disk in the disk group and the serial ID that was expected from the
configuration database. For each disk, the command also shows the vxdg command that you must run to
select the configuration database copy on that disk as being the definitive copy to use for importing the disk
group.
The following example shows the result of a JBOD losing access to one of the four disks in the disk group:
# vxdisk -o alldgs list
DEVICE TYPE DISK GROUP STATUS
c2t1d0s2 auto:cdsdisk - (dgD280silo1) online
c2t2d0s2 auto:cdsdisk d2 dgD280silo1 online
c2t3d0s2 auto:cdsdisk d3 dgD280silo1 online
c2t9d0s2 auto:cdsdisk d4 dgD280silo1 online
- - d1 dgD280silo1 failed was:c2t1d0s2

# vxreattach -c c2t1d0s2
dgD280silo1 d1

# vxreattach -br c2t1d0s2


VxVM vxdg ERROR V-5-1-10127 associating disk-media d1 with c2t1d0s2:
Serial Split Brain detected. Run vxsplitlines

# vxsplitlines -g dgD280silo1
VxVM vxsplitlines NOTICE V-5-2-2708 There are 1 pools.
The Following are the disks in each pool. Each disk in the same pool
has config copies that are similar.
VxVM vxsplitlines INFO V-5-2-2707 Pool 0.
c2t1d0s2 d1

To see the configuration copy from this disk, issue /etc/vx/diag.d/vxprivutil dumpconfig /dev/vx/dmp/c2t1d0s2

To import the disk group with the config copy from this disk, use the following command:

/usr/sbin/vxdg -o selectcp=1092974296.21.gopal import dgD280silo1

The following are the disks whose serial split brain (SSB) IDs don't match in this configuration copy:
d2

At this stage, you need to gain confidence prior to running the recommended command by generating the
following outputs:

In this example, the disk group split so that one disk (d1) appears to be on one side of the split. You can
specify the -c option to vxsplitlines to print detailed information about each of the disk IDs from the
configuration copy on a disk specified by its disk access name:

# vxsplitlines -g dgD280silo1 -c c2t3d0s2

VxVM vxsplitlines INFO V-5-2-2701 DANAME(DMNAME) || Actual SSB || Expected SSB


VxVM vxsplitlines INFO V-5-2-2700 c2t1d0s2( d1 ) || 0.0 || 0.0 ssb ids match
VxVM vxsplitlines INFO V-5-2-2700 c2t2d0s2( d2 ) || 0.1 || 0.0 ssb ids don't match
VxVM vxsplitlines INFO V-5-2-2700 c2t3d0s2( d3 ) || 0.1 || 0.0 ssb ids don't match
VxVM vxsplitlines INFO V-5-2-2700 c2t9d0s2( d4 ) || 0.1 || 0.0 ssb ids don't match
VxVM vxsplitlines INFO V-5-2-2706

This output can be verified by using vxdisk list on each disk. A summary is shown below:

# vxdisk list c2t1d0s2 # vxdisk list c2t3d0s2


Device: c2t1d0s2 Device: c2t3d0s2
disk: name= id=1092974296.21.gopal disk: name=d3 id=1092974311.23.gopal
group: name=dgD280silo1 id=1095738111.20.gopal group: name=dgD280silo1 id=1095738111.20.gopal
ssb: actual_seqno=0.0 ssb: actual_seqno=0.1
# vxdisk list c2t2d0s2 # vxdisk list c2t9d0s2
Device: c2t2d0s2 Device: c2t9d0s2
disk: name=d2 id=1092974302.22.gopal disk: name=d4 id=1092974318.24.gopal
group: name=dgD280silo1 id=1095738111.20.gopal group: name=dgD280silo1 id=1095738111.20.gopal
ssb: actual_seqno=0.1 ssb: actual_seqno=0.1

Note that even though some disks' SSB IDs might match, that does not necessarily mean those disks' config
copies have all the changes. In some other configuration copies, those disks' SSB IDs might not match.

To see the configuration from this disk, run


/etc/vx/diag.d/vxprivutil dumpconfig /dev/rdsk/c2t3d0s2 > dumpconfig_c2t3d0s2

If the other disks in the disk group were not imported on another host, Volume Manager resolves the
conflicting values of the serial IDs by using the version of the configuration database from the disk with the
greatest value for the updated ID (shown as update_tid in the output from /etc/vx/diag.d/vxprivutil
dumpconfig /dev/rdsk/<device>).

In this example, looking through the dumpconfig, there are the following update_tid and ssbid values:

dumpconfig c2t3d0s2 dumpconfig c2t9d0s2


config:tid=0.1058 config:tid=0.1059
dm d1 dm d1
update_tid=0.1038 update_tid=0.1059
ssbid=0.0 ssbid=0.0
dm d2 dm d2
update_tid=0.1038 update_tid=0.1038
ssbid=0.0 ssbid=0.0
dm d3 dm d3
update_tid=0.1053 update_tid=0.1053
ssbid=0.0 ssbid=0.0
dm d4 dm d4
update_tid=0.1053 update_tid=0.1059
ssbid=0.0 ssbid=0.1

Use the output from dumpconfig for each disk to determine which configuration to use; the config can be
viewed in vxprint format by running:

# cat dumpconfig_c2t3d0s2 | vxprint -D - -ht

Before deciding on which option to use for import, ensure the disk group is currently in a valid deport
state:

# vxdisk -o alldgs list


DEVICE TYPE DISK GROUP STATUS
c2t1d0s2 auto:cdsdisk - (dgD280silo1) online
c2t2d0s2 auto:cdsdisk - (dgD280silo1) online
c2t3d0s2 auto:cdsdisk - (dgD280silo1) online
c2t9d0s2 auto:cdsdisk - (dgD280silo1) online
At this stage, your knowledge of how the serial split brain condition came about may be a little clearer and
you should have chosen a configuration from one disk to be used to import the disk group. In this example,
the following command imports the disk group using the configuration copy from d2:
# /usr/sbin/vxdg -o selectcp=1092974302.22.gopal import dgD280silo1
Once the disk group has been imported, Volume Manager resets the serial IDs to 0 for the imported disks.
The actual and expected serial IDs for any disks in the disk group that are not imported at this time remain
unchanged.

# vxprint -htg dgD280silo1


dg dgD280silo1 default default 26000 1095738111.20.gopal
dm d1 c2t1d0s2 auto 2048 35838448 -
dm d2 c2t2d0s2 auto 2048 35838448 -
dm d3 c2t3d0s2 auto 2048 35838448 -
dm d4 c2t9d0s2 auto 2048 35838448 -

v SNAP-vol_db2silo1.1 - DISABLED ACTIVE 1024000 SELECT - fsgen


pl SNAP-vol_db2silo1.1-01 SNAP-vol_db2silo1.1 DISABLED ACTIVE 1024000 STRIPE 2/1024 RW
sd d3-01 SNAP-vol_db2silo1.1-01 d3 0 512000 0/0 c2t3d0 ENA
sd d4-01 SNAP-vol_db2silo1.1-01 d4 0 512000 1/0 c2t9d0 ENA
dc SNAP-vol_db2silo1.1_dco SNAP-vol_db2silo1.1 SNAP-vol_db2silo1.1_dcl
v SNAP-vol_db2silo1.1_dcl - DISABLED ACTIVE 544 SELECT - gen
pl SNAP-vol_db2silo1.1_dcl-01 SNAP-vol_db2silo1.1_dcl DISABLED ACTIVE 544 CONCAT - RW
sd d3-02 SNAP-vol_db2silo1.1_dcl-01 d3 512000 544 0 c2t3d0 ENA

v orgvol - DISABLED ACTIVE 1024000 SELECT - fsgen


pl orgvol-01 orgvol DISABLED ACTIVE 1024000 STRIPE 2/128 RW
sd d1-01 orgvol-01 d1 0 512000 0/0 c2t1d0 ENA
sd d2-01 orgvol-01 d2 0 512000 1/0 c2t2d0 ENA

# vxrecover -g dgD280silo1 -sb


# mount -F vxfs /dev/vx/dsk/dgD280silo1/orgvol /orgvol
UX:vxfs mount: ERROR: V-3-21268: /dev/vx/dsk/dgD280silo1/orgvol is corrupted. needs checking
# fsck -F vxfs /dev/vx/rdsk/dgD280silo1/orgvol
log replay in progress
replay complete - marking super-block as CLEAN
# mount -F vxfs /dev/vx/dsk/dgD280silo1/orgvol /orgvol
# df /orgvol
/orgvol (/dev/vx/dsk/dgD280silo1/orgvol): 1019102 blocks 127386 files
# vxdisk -o alldgs list

DEVICE TYPE DISK GROUP STATUS


c2t1d0s2 auto:cdsdisk d1 dgD280silo1 online
c2t2d0s2 auto:cdsdisk d2 dgD280silo1 online
c2t3d0s2 auto:cdsdisk d3 dgD280silo1 online
c2t9d0s2 auto:cdsdisk d4 dgD280silo1 online
# vxprint -htg dgD280silo1

dg dgD280silo1 default default 26000 1095738111.20.gopal

dm d1 c2t1d0s2 auto 2048 35838448 -


dm d2 c2t2d0s2 auto 2048 35838448 -
dm d3 c2t3d0s2 auto 2048 35838448 -
dm d4 c2t9d0s2 auto 2048 35838448 -

v SNAP-vol_db2silo1.1 - ENABLED ACTIVE 1024000 SELECT SNAP-vol_db2silo1.1-01 fsgen


pl SNAP-vol_db2silo1.1-01 SNAP-vol_db2silo1.1 ENABLED ACTIVE 1024000 STRIPE 2/1024 RW
sd d3-01 SNAP-vol_db2silo1.1-01 d3 0 512000 0/0 c2t3d0 ENA
sd d4-01 SNAP-vol_db2silo1.1-01 d4 0 512000 1/0 c2t9d0 ENA
dc SNAP-vol_db2silo1.1_dco SNAP-vol_db2silo1.1 SNAP-vol_db2silo1.1_dcl
v SNAP-vol_db2silo1.1_dcl - ENABLED ACTIVE 544 SELECT - gen
pl SNAP-vol_db2silo1.1_dcl-01 SNAP-vol_db2silo1.1_dcl ENABLED ACTIVE 544 CONCAT - RW
sd d3-02 SNAP-vol_db2silo1.1_dcl-01 d3 512000 544 0 c2t3d0 ENA

v orgvol - ENABLED ACTIVE 1024000 SELECT orgvol-01 fsgen


pl orgvol-01 orgvol ENABLED ACTIVE 1024000 STRIPE 2/128 RW
sd d1-01 orgvol-01 d1 0 512000 0/0 c2t1d0 ENA
sd d2-01 orgvol-01 d2 0 512000 1/0 c2t2d0 ENA
Supported Features in Disk Group Version 110

There is limited documentation on SSB. The preceding tech note is meant to clarify
some of the features.

The only reference to this subject is included in the VERITAS Volume Manager™ 4.0
Administrator’s Guide for Solaris.

The table below is from the Disk group support section.

Disk Group Version: 110

New Features Supported:
- Cross-platform Data Sharing (CDS)
- Device Discovery Layer (DDL) 2.0
- Disk Group Configuration Backup and Restore
- Elimination of rootdg as a Special Disk Group
- Full-Sized and Space-Optimized Instant Snapshots
- Intelligent Storage Provisioning (ISP)
- Serial Split Brain Detection
- Volume Sets (Multiple Device Support for VxFS)

Previous Version Features Supported: 20, 30, 40, 50, 60, 70, 80, 90

For more detail you can download the

VERITAS Volume Manager (tm) 4.0 Administrator's Guide for Solaris

From the following URL:

Public Link: http://support.veritas.com/docs/265469

When dealing with issues where the SSB id is referenced, you need to look at
the following 2 outputs:
# vxdg list sharedg

Group: sharedg
dgid: 1105340390.50.jacaranda
import-id: 33792.53
flags: shared
version: 110
alignment: 8192 (bytes)
local-activation: shared-write
cluster-actv-modes: jacaranda=sw sundar=sw
ssb: on
detach-policy: global
copies: nconfig=default nlog=default
config: seqno=0.1089 permlen=0 free=0 templen=0 loglen=0

# vxdisk list sdc3t12d0s5


Device: sdc3t12d0s5
devicetag: sdc3t12d0s5
type: simple
clusterid: sunjac
disk: name=sharedg125 id=1104366468.5014.jacaranda
group: name=sharedg id=1105340390.50.jacaranda
flags: online ready private foreign shared autoimport imported
pubpaths: block=/dev/simple/sdc3t12d0s5 char=/dev/rsimple/sdc3t12d0s5
version: 2.2
iosize: min=512 (bytes) max=2048 (blocks)
public: slice=0 offset=1025 len=4194551 disk_offset=4199692
private: slice=0 offset=1 len=1024 disk_offset=4199692
update: time=1108883968 seqno=0.48
ssb: actual_seqno=0.0
headers: 0 248
configs: count=1 len=727
logs: count=1 len=110
Defined regions:
config priv 000017-000247[000231]: copy=01 offset=000000 enabled
config priv 000249-000744[000496]: copy=01 offset=000231 enabled
log priv 000745-000854[000110]: copy=01 offset=000000 enabled
Case Example
new_quest# ./vxreattach -r c2t0d0
VxVM vxdg ERROR V-5-1-10127 associating disk-media vistadg101 with c2t0d0s2:
Serial Split Brain detected. Run vxsplitlines

vxdiskadm option 5 does not list the disk, thus it cannot be re-attached.

The customer is using VxVM 4.0 and the disks were initialized using the CDS format; see below.

Device: c2t0d0s2
devicetag: c2t0d0
type: auto
hostid: quest
disk: name= id=1086208967.1288.quest
group: name=vistadg1 id=1086209872.2043.quest
info: format=cdsdisk,privoffset=256,pubslice=2,privslice=2 <---cds format
flags: online ready private autoconfig autoimport
pubpaths: block=/dev/vx/dmp/c2t0d0s2 char=/dev/vx/rdmp/c2t0d0s2
version: 3.1
iosize: min=512 (bytes) max=256 (blocks)
public: slice=2 offset=2304 len=8391936 disk_offset=0
private: slice=2 offset=256 len=2048 disk_offset=0
update: time=1096861924 seqno=0.39
ssb: actual_seqno=0.0
headers: 0 240
configs: count=1 len=1280
logs: count=1 len=192
Defined regions:
config priv 000048-000239[000192]: copy=01 offset=000000 enabled
config priv 000256-001343[001088]: copy=01 offset=000192 enabled
log priv 001344-001535[000192]: copy=01 offset=000000 enabled
lockrgn priv 001536-001679[000144]: part=00 offset=000000
Multipathing information:
numpaths: 1
c2t0d0s2 state=enabled

Document Audience: INTERNAL


Document ID: 64313344
Title: NCN/G/vxvm4.0/ cannot re-attach disks/Serial Split Brain detected : Sun Fire 880
Synopsis: NCN/G/vxvm4.0/ cannot re-attach disks/Serial Split Brain detected; Solution: walked cu
thru procedure/see sol notes
Update Date: Wed Oct 20 21:40:29 MDT 2004

Case Number: 64313344


Geo Code (AMER, APAC, EMEA): AMER
Synopsis: NCN/G/vxvm4.0/ cannot re-attach disks/Serial Split Brain detected
Product: Sun Fire 880
HW Platform: Sun Fire 880
OS Version: Solaris 9 (S9)
Engineer: JOE KAUFMANN
Engineer Phone: 781-442-8635
Open Date: 20-Oct-2004 15:57:51

Close Date: 20-Oct-2004 21:40:08


Resolution: walked cu thru procedure/see sol notes
Disk group: kintanadg

dg kintanadg default default 28000 1087515421.304.quest


dm kintanadg01 c3t0d100s2 auto 2048 8391936 -
dm kintana06 c2t0d147s2 auto 2048 8391936 -
dm kintana07 c2t0d148s2 auto 2048 8391936 -
dm kintana08 - - - - NODEVICE
dm kintana12 - - - - NODEVICE
dm kintana13 - - - - NODEVICE
dm kintana14 c2t0d132s2 auto 2048 8391936 -
dm kintana15 c2t0d133s2 auto 2048 8391936 -
.
v vol01 - ENABLED ACTIVE 17825792 SELECT - fsgen
pl vol01-01 vol01 ENABLED ACTIVE 17825792 CONCAT - RW
sd kintanadg01-01 vol01-01 kintanadg01 0 8391936 0 c3t0d100 ENA
sd kintanadg02-01 vol01-01 kintanadg02 0 8391936 8391936 c3t0d101 ENA
sd kintanadg03-01 vol01-01 kintanadg03 0 1041920 16783872 c3t0d102 ENA

v vol02 - DISABLED ACTIVE 54525952 SELECT - fsgen


pl vol02-01 vol02 DISABLED NODEVICE 54525952 CONCAT - RW
sd kintana08-01 vol02-01 kintana08 0 8391936 0 - NDEV
sd kintana09-01 vol02-01 kintana09 0 8391936 8391936 c2t0d127 ENA
sd kintana10-01 vol02-01 kintana10 0 8391936 16783872 c2t0d128 ENA
sd kintana11-01 vol02-01 kintana11 0 8391936 25175808 c2t0d129 ENA
sd kintana12-01 vol02-01 kintana12 0 4171264 33567744 - NDEV
sd kintana02-01 vol02-01 kintana02 0 8393472 37739008 c2t0d149 ENA
sd kintana01-01 vol02-01 kintana01 0 8393472 46132480 c2t0d150 ENA

v vol03 - DISABLED ACTIVE 11534336 SELECT - fsgen


pl vol03-01 vol03 DISABLED NODEVICE 11534336 CONCAT - RW
sd kintana13-01 vol03-01 kintana13 0 8391936 0 - NDEV
sd kintana14-01 vol03-01 kintana14 0 3142400 8391936 c2t0d132 ENA

v vol04 - DISABLED ACTIVE 151091200 SELECT - fsgen


pl vol04-01 vol04 DISABLED NODEVICE 151091200 CONCAT - RW
sd kintana12-02 vol04-01 kintana12 4171264 4220672 0 - NDEV
sd kintana14-02 vol04-01 kintana14 3142400 5249536 4220672 c2t0d132 ENA
sd kintana15-01 vol04-01 kintana15 0 8391936 9470208 c2t0d133 ENA
sd kintana16-01 vol04-01 kintana16 0 8391936 17862144 c2t0d134 ENA
sd kintana17-01 vol04-01 kintana17 0 8391936 26254080 c2t0d135 ENA
sd kintana18-01 vol04-01 kintana18 0 8391936 34646016 c2t0d136 ENA
sd kintana19-01 vol04-01 kintana19 0 8391936 43037952 c2t0d137 ENA
sd kintana20-01 vol04-01 kintana20 0 8391936 51429888 c2t0d138 ENA
sd kintana21-01 vol04-01 kintana21 0 8391936 59821824 c2t0d139 ENA
sd kintana22-01 vol04-01 kintana22 0 8391936 68213760 c2t0d140 ENA
sd kintana23-01 vol04-01 kintana23 0 8391936 76605696 c2t0d141 ENA
sd kintana24-01 vol04-01 kintana24 0 8391936 84997632 c2t0d142 ENA
sd kintana25-01 vol04-01 kintana25 0 8391936 93389568 c2t0d143 ENA
sd kintana03-01 vol04-01 kintana03 0 8391936 101781504 c2t0d144 ENA
sd kintana04-01 vol04-01 kintana04 0 8391936 110173440 c2t0d145 ENA
sd kintana05-01 vol04-01 kintana05 0 8391936 118565376 c2t0d146 ENA
sd kintana06-01 vol04-01 kintana06 0 8391936 126957312 c2t0d147 ENA
sd kintana07-01 vol04-01 kintana07 0 8391936 135349248 c2t0d148 ENA
sd kintanadg03-02 vol04-01 kintanadg03 1041920 7350016 143741184 c3t0d102 ENA
Solution

Procedure to get the customer's volumes back:

1. vxdg list vistadg1

config disk c2t0d12s2 copy 1 len=1280 state=clean online <--- located a clean
online disk

2. vxdisk list c2t0d12s2 <---- do a vxdisk list on clean online disk and get
disk id#

e.g. disk id=1086209041.1375.quest

3. vxdg deport vistadg1 (deport entire disk group, make sure all volumes are
unmounted first)

4. vxdg -o selectcp=<disk id string from above> import <dgname> <--- import the
disk group using the clean disk id#

e.g. vxdg -o selectcp=1086209041.1375.quest import vistadg1

5. force start each individual volume of the disk group

e.g. vxvol -f -g vistadg1 start vol01

6. done
Many thanks to all who helped (reviewed & contributed), including:
- Dave Graham
- Mike Young
- Spencer Borck
- Larry Tyburczy
- Joel Garrett
- Jeff Huff
