VxVM Troubleshooting
TOI Objectives
#1 The OS/boot disk and its mirror are under Veritas Volume Manager 3.x control (4.0 is different) and
the system is not booting
1. Gather Information
- history: what led up to this?
- was any hardware replaced?
- error messages?
- what version of VxVM?
- has booting off the mirror or clone disks been tried yet?
What issues can we identify just from asking the right questions?
- at the ok prompt:
ok> printenv use-nvramrc?
use-nvramrc? = false
ok> setenv use-nvramrc? true # if it was false
ok> setenv auto-boot? false # keep the system from booting to Solaris
- boot cdrom -s
If you are unsure which disk is the primary boot disk, you can use
'luxadm -y set_boot_dev /dev/dsk/c?t?d?s?' to set the boot device in the OBP.
- If the system boots ok, then Ctrl-D to continue booting to run level 3.
(this helps verify whether the underlying hardware and OS are in a good state)
3. If the system successfully boots, then we need to see if we can manually start VxVM (a minimal manual-start sketch follows below).
What are some of the things in the volboot file that can cause VxVM startup issues?
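A minimal sketch of a manual VxVM start from single-user mode, assuming the install-db file is in place so VxVM did not start automatically:
vxconfigd -m disable # start the configuration daemon in disabled mode
vxdctl enable # scan the disks; rootdg should import if volboot and the private regions are sane
vxdisk list # verify the disk states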
4. If we can manually start VxVM, then check to see if there are OS mirrors, and if the mirror disk is
available.
IMPORTANT: If the boot disk contains mirrored volumes, you must take all the mirrors of those volumes
offline, except for the plex on the boot disk. Offlining a mirror prevents VxVM from ever
performing a recovery on that plex. This step is critical in preventing data corruption (see the example below).
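For example, to offline the second rootvol mirror (the plex name is illustrative; use the names shown by vxprint):
vxmend -g rootdg off rootvol-02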
What issues can be seen in vxdisk list output that can cause boot/starting issues?
- vxprint -htrLg rootdg # rootdg should be imported and all volumes should be in a DISABLED state
- the next step is to re-enable VxVM and reboot to test whether VxVM will start automatically.
What can cause VxVM to start manually but not automatically at boot? (some quick checks are sketched below)
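A few quick checks for the manual-vs-automatic question, as a sketch:
ls -l /etc/vx/reconfig.d/state.d/install-db # if this file exists, VxVM startup is disabled
grep -i vx /etc/system # the rootdev/forceload entries must be intact
vxdctl list # the hostid recorded in the volboot file must match this host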
6. Once any debugging actions and/or other operations are completed, VxVM can be re-enabled
with the following steps.
- cp /etc/system.vx-rootdev /etc/system
- cp /etc/vfstab.vx /etc/vfstab
- rm /etc/vx/reconfig.d/state.d/install-db
- cd /; ls -l | grep -i vx # if you see a /VXVM_.... type directory, rename it to old.VXVM
- sync; sync; init 0
- at the ok prompt, boot off of the same disk
- check the status of volumes and plexes and address any issues:
vxinfo -g rootdg -p | egrep -iv "started|active"
- online the mirrors and start recovery operations on the mirrors that were just onlined.
- Example:
vxmend -g rootdg on rootvol-02
vxrecover -bs
7b. If the system does NOT boot up, use the following document to enable debug logging of the boot process.
Also send the customer the following link or document so they can generate a vxexplorer and send it to us and Veritas.
- touch /etc/vx/reconfig.d/state.d/install-db # creating this file keeps VxVM from starting at boot
If there are a lot of disks, you can try using this script to capture each disk's VTOC:
#!/usr/bin/ksh
# Dump ls -l and prtvtoc output for slice 2 of every disk into one file.
file="/tmp/diskvtoc.out"
y=0
rm -f $file
for x in `/bin/ls /dev/rdsk/c*s2`
do
        echo "" >> $file
        echo "" >> $file
        let y=y+1 # running disk counter
        echo "number:" $y >> $file; echo $x >> $file
        echo "#########################" >> $file
        /bin/ls -l $x >> $file 2>&1
        echo "" >> $file
        /usr/sbin/prtvtoc $x >> $file 2>&1 # print the disk's VTOC
        echo "*************************" >> $file
        echo "" >> $file
        echo "" >> $file
done
- Can the private regions be read? Are there disks in the rootdg disk group? Does the host id
match the /etc/vx/volboot file? Are the group ids of all the rootdg disks the same?
#!/usr/bin/ksh
# Dump ls -l and vxprivutil output (VxVM private region contents) for slice 2 of every disk.
file="/tmp/vx.out"
y=0
rm -f $file
for x in `/bin/ls /dev/rdsk/c*s2`
do
        echo "" >> $file
        echo "" >> $file
        let y=y+1 # running disk counter
        echo "number:" $y >> $file; echo $x >> $file
        echo "#########################" >> $file
        /bin/ls -l $x >> $file 2>&1
        echo "" >> $file
        /usr/lib/vxvm/diag.d/vxprivutil list $x >> $file 2>&1 # read the private region
        echo "*************************" >> $file
        echo "" >> $file
        echo "" >> $file
done
3. Use 'vxdctl init <hostid>' to change the volboot file. DO NOT EDIT IT MANUALLY.
Note: You can edit the volboot file if it becomes absolutely necessary. Just keep in mind that
vxconfigd will fail to recognize it if the format is changed and/or the file size is no longer 512 bytes.
4. Otherwise:
Also send the customer the following link or document so they can generate a vxexplorer and send it to us and Veritas.
Objective 2 Troubleshooting Exercise #1 Part A
c1t0d0 (rootdisk) was proactively removed from Veritas control and was replaced, but vxdiskadm option 5
completes with no errors (or at least none that are noticed) without bringing the disk out of the removed
state.
Why?
# vxdisk list
DEVICE TYPE DISK GROUP STATUS
c1t0d0s2 sliced - - error
c1t0d0s2 sliced - - error
c1t0d0s2 sliced - - offline
c1t1d0s2 sliced disk01 rootdg online
c3t0d0s2 sliced - - offline
c3t0d89s2 sliced datadg01 datadg online
c3t0d90s2 sliced datadg02 datadg online
c3t0d146s2 sliced - - online
c3t0d217s2 sliced datadg03 datadg online
c3t0d218s2 sliced datadg04 datadg online
c4t0d0s2 sliced - - offline
c4t0d89s2 sliced - - offline
c4t0d90s2 sliced - - offline
c4t0d146s2 sliced - - offline
c4t0d217s2 sliced - - online
c4t0d218s2 sliced - - online
- - rootdisk rootdg removed was:c1t0d0s2
Objective 2 Troubleshooting Exercise #1 Part B
The same customer reboots, but the system fails to come up and gets the following message.
Why?
# eeprom
auto-boot?=true
boot-command=boot
boot-file: data not available.
boot-device=/pci@8,600000/SUNW,qlc@4/fp@0,0/disk@w2100002037e3e93e,0:a disk
nvramrc=devalias net /pci@8,700000/network@3
devalias vx-disk01 /pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w2100002037e3d51b,0:a
devalias vx-rootdisk /pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w2100002037e3e93e,0:a
error-reset-recovery=boot
Objective 2 Troubleshooting Exercise #1 Part C
The same customer system is now booted; vxdiskadm option 5 completes (no errors) without
bringing the disk out of the removed state.
# vxdisk list
DEVICE TYPE DISK GROUP STATUS
c1t0d0s2 sliced - - online
c1t1d0s2 sliced disk01 rootdg online
c3t0d0s2 sliced - - error
c3t0d89s2 sliced datadg01 datadg online
c3t0d90s2 sliced datadg02 datadg online
c3t0d146s2 sliced - - online
c3t0d217s2 sliced datadg03 datadg online
c3t0d218s2 sliced datadg04 datadg online
c4t0d0s2 sliced - - error
c4t0d89s2 sliced - - error
c4t0d90s2 sliced - - error
c4t0d146s2 sliced - - error
c4t0d217s2 sliced - - online
c4t0d218s2 sliced - - online
- - rootdisk rootdg removed was:c1t0d0s2
When we try to manually bring back c1t0d0 (rootdisk), we get the following error.
Why?
Part A
c1t0d0 (rootdisk) was proactively removed from Veritas control and was replaced, but vxdiskadm option 5
completes (no errors) without bringing the disk out of the removed state.
What would you do to fix this? Reboot, or follow infodoc 70929 (luxadm offline).
What would you do to keep this from happening again? Install patch 113201-05.
Part B
What would you do to keep this from happening again? Use the luxadm command to set boot-device to the
new disk's WWN (see the sketch below).
# eeprom boot-device
boot-device=/pci@8,600000/SUNW,qlc@4/fp@0,0/disk@w2100002037e3d51b,0:a
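A hedged sketch of setting the boot device with luxadm (the device path is illustrative; point it at the slice the new boot disk boots from):
luxadm -y set_boot_dev /dev/dsk/c1t1d0s0 # -y skips the confirmation prompt
eeprom boot-device # verify the new WWN-based device path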
Part C
The same customer system is now booted; vxdiskadm option 5 completes (no errors) without bringing the disk
out of the removed state.
When we try to manually bring back c1t0d0 (rootdisk), we get the following error.
Why?
From what I could gather from the Veritas Knowledge Base: for an encapsulated primary boot disk, where the
special subdisk protecting the private region is created, if something other than "rootdisk" is used for the boot
disk name, or something other than "rootvol" is used for the root filesystem volume, vxdiskadm option 5 will
not work, since Veritas will not ignore the special subdisk, in this case called "c1t0d0s2Priv".
We can see from the vxprint output that the dm line lists rootdisk, but the subdisks in rootvol are named
after c1t0d0.
The fix: disassociate the original bootdisk plexes and recursively remove them, then remove the subdisk
"c1t0d0s2Priv", then the rootdisk media record; then initialize rootdisk and remirror. A hedged sketch of
these steps follows.
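(Plex names are illustrative; use the names from vxprint. The remirror step assumes the stock vxrootmir helper.)
vxplex -g rootdg dis rootvol-01 # disassociate the bad bootdisk plex
vxedit -g rootdg -r rm rootvol-01 # recursively remove the plex and its subdisks
# repeat for the other bootdisk plexes (swapvol, usr, etc.)
vxedit -g rootdg rm c1t0d0s2Priv # remove the special private-region subdisk
vxdg -g rootdg rmdisk rootdisk # remove the disk media record
vxdisksetup -i c1t0d0 # reinitialize the physical disk
vxdg -g rootdg adddisk rootdisk=c1t0d0 # add it back under the reserved name
/etc/vx/bin/vxrootmir rootdisk # remirror the root volume onto it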
# vxdctl enable
# vxprint -htg rootdg
Prevention: use the default, reserved names for the boot disk and OS volume (rootdisk, rootvol), and allow no hot-sparing on the rootdisk.
Objective 3 Troubleshooting Exercise #2 Part A
The customer attempted to grow a file system and volume from 2 GB to 12 GB since it was almost full.
df -k shows it succeeded, but the volume is still 2 GB in size, and large files cannot be created in the file system.
df -k
Filesystem kbytes used avail capacity Mounted on
/dev/vx/dsk/iodg/ccs 12580864 1071444 10790204 10% /ccs
4194304 sectors (vxprint's default unit) x 512 bytes per sector = 2,147,483,648 bytes (2 GB)
That is, the file system was grown beyond the end of the underlying volume: df reports the file system size, while vxprint still shows a 2 GB volume.
Why?
Actions taken:
- checked the file system's original creation options using "mkfs -F vxfs -m"
- Veritas recommended first reducing the file system size back to the original size:
root@ndwest # cd /
root@ndwest # fsadm -F vxfs -b 4194304 /ccs
UX:vxfs fsadm: INFO: /dev/vx/rdsk/iodg/ccs is currently 25161728 sectors - size will be reduced
root@ndwest # df -k /ccs
Filesystem kbytes used avail capacity Mounted on
/dev/vx/dsk/iodg/ccs 2097152 1072970 960293 53% /ccs
root@ndwest # cd /ccs
root@ndwest # touch john
root@ndwest # rm john
root@ndwest # cd /
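The grow was then redone so that the volume and the file system stay in step. The exact command used is not in the case notes; a hedged sketch using vxresize, which grows the volume and the VxFS file system together:
/etc/vx/bin/vxresize -g iodg ccs 12g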
The grow command completed (even with messages), and vxprint shows the correct new volume size:
vxprint -ht
v ccs - ENABLED ACTIVE 25161728 SELECT - fsgen
pl ccs-01 ccs ENABLED ACTIVE 25162112 CONCAT - RW
sd 603e_503-03 ccs-01 603e_503 4195072 4195072 0 Disk_10 ENA
sd 603W_1299-01 ccs-01 603W_1299 0 20967040 4195072 Disk_52 ENA
pl ccs-02 ccs ENABLED ACTIVE 25162112 CONCAT - RW
sd 603W_1022-03 ccs-02 603W_1022 16778496 4195072 0 Disk_44 ENA
sd 603E_299-01 ccs-02 603E_299 0 20967040 4195072 Disk_53 ENA
Issue resolved.
Objective 4 Troubleshooting Exercise #3
The customer temporarily lost access to several fibre disks, and is now unable to access several file systems even
though df -k shows them mounted.
Why?
What would you do to fix this?
Objective 4 Troubleshooting Exercise #3 Answer Sheet
- umount /mountpoint
- vxdctl enable
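A sketch of that sequence (the mount point is illustrative):
umount /mountpoint # unmount the stale file system
vxdctl enable # have vxconfigd rescan; the disks should return to the online state
mount /mountpoint # remount from vfstab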
Objective 5 Troubleshooting Exercise #4
The customer upgraded from Solaris 8 to Solaris 9 using Live Upgrade; the customer then tried to boot off the disk with
Solaris 9 and got the following message:
...
SunOS Release 5.9 Version Generic_117171-02 64-bit
Why?
What would you do to fix this? Had the customer pkgrm and pkgadd VRTSvxvm so Veritas would then use the
correct driver.
Objective 6 Troubleshooting Exercise #5
The A5200 has hung; the customer had to power-cycle the array to regain access to the disks.
RAID5 volumes are in "DISABLED ACTIVE" state and show several disks in "RCOV" state.
Review of the /var/adm/messages file and root emails from Veritas confirms loss of access
to several disks at the same time.
How would you address this issue so as to have the best chance of recovering data?
Volume "swrdata"
NOTE: this is actual Sun customer case 64395069, in which Veritas support was involved; it also references the Veritas
doc http://seer.support.veritas.com/docs/251793.htm and a Sun internal document on RAID5 volume
recovery procedures.
See Document ID:79162, Title: VERITAS Volume Manager: Recovering a RAID5 Volume State After a
Disk Channel Failure.
How would you address this issue so as to have the best chance of recovering data?
#### NOTE: do not run format > analyze > refresh on a RAID5 ####
If this does not work, then use the force option, even though the document states not to force-start the volume,
because this is a situation where VxVM lost access to several disks at the same time, which is not covered in the document.
delayrecover
Does not perform any plex revive operations when starting a volume. Instead, the volume and any plexes
are enabled. This may leave some stale plexes, and may leave a mirrored volume in a special
read-writeback (NEEDSYNC) recovery state that performs limited plex recovery for each read to the volume.
NOTE: For RAID 0 volumes and RAID5 volumes without LOG plexes,
the use of the "-f" option of the vxvol command will be necessary
to start the volume; otherwise the log plex will have to be disassociated and then reattached
after starting the volume.
NOTE: For RAID 1+0, known as Striped Pro and Concat Pro volume types,
you will need to run "vxrecover -s -E <volume>" AFTER all underlying layered
volumes have been repaired.
$LOGNAME@sunbcpsunx001$PWD# vxvol -g bcpsdg -o delayrecover start standby
vxvm:vxvol: ERROR: Volume standby is not startable; some subdisks are unusable and the parity is stale
vxvm:vxvol: ERROR: Volume standby is invalid
$LOGNAME@sunbcpsunx001$PWD#
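Per the NOTEs above, the next step here is the force option; a sketch against the same volume:
vxvol -g bcpsdg -o delayrecover -f start standby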
The customer ran fsck, then cleared the flag, and was able to mount the VxFS file system on this volume.
Starting a volume reports the error above. The vxprint output shows that the plexes for the volume are in
"DISABLED RECOVER" state.
Here is an example:
The disk group dg01 has 2 volumes, apps and home. Trying to start all the volumes reported the following
error:
- If both of the plexes in a volume are in RECOVER state, it is recommended to stop one plex, force-start
the volume, and check the filesystem data. Then, similarly: stop the volume, stop the previously started plex, start
the previously stopped plex, force-start the volume, and check the filesystem data. Compare which data is more
recent and change the state of the plexes accordingly to synchronize (a sketch of this sequence follows).
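A hedged sketch of that alternate-plex check, assuming disk group dg01, volume home, and plexes home-01 and home-02:
vxmend -g dg01 off home-02 # take the second plex offline
vxmend -g dg01 fix clean home-01 # mark the remaining RECOVER plex clean
vxvol -g dg01 -f start home # force-start the volume from home-01
fsck -F vxfs /dev/vx/rdsk/dg01/home # check the data
vxvol -g dg01 stop home # now repeat the check with the other plex
vxmend -g dg01 off home-01
vxmend -g dg01 on home-02
vxmend -g dg01 fix clean home-02
vxvol -g dg01 -f start home
fsck -F vxfs /dev/vx/rdsk/dg01/home # compare which data is more recent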
# vxmend on home-01
The volume will then start successfully using the cleaned plex:
check data
# vxmend on home-02
The volume will then start successfully using the cleaned plex:
check data
Issue:
- getting the "...Disk public region is too small..." error when bringing a new/replacement
disk back under Veritas control
...
Select a removed or failed disk [<disk>,list,q,?] disk01
Why?
dm disk01 - - - - REMOVED
dm disk02 c0t13d0s2 sliced 3590 17674902 -
prtvtoc /dev/rdsk/c0t11d0s2
* /dev/rdsk/c0t11d0s2 partition map
*
...
* First Sector Last
* Partition Tag Flags Sector Count Sector Mount Directory
2 5 00 0 17682084 17682083
3 15 01 3591 3591 7181
4 14 01 7182 17674902 17682083
# prtvtoc /dev/rdsk/c0t13d0s2
* /dev/rdsk/c0t13d0s2 partition map
...
* First Sector Last
* Partition Tag Flags Sector Count Sector Mount Directory
2 5 00 0 17682084 17682083
3 15 01 3591 3591 7181
4 14 01 7182 17674902 17682083
# vxconfigd -k -x cleartempdir
# /usr/sbin/vxdctl enable
2 options to recover
#1 Document ID:73801
Title:VERITAS Volume Manager 3.2 or higher: During replacement of a disk, the error "Disk public region
is too small" is displayed
Using this command can be helpful, especially on a root disk, because the subdisk offsets get skewed,
disallowing the same subdisks to be created on the new disk in the same locations as on the previous disk.
The '-p' option disregards the offsets of the subdisks' starting points, allowing the subdisks to be created on
the replacement disk.
#2 Since the data was already gone, we were able to get around the issue by:
# vxdisk list
DEVICE TYPE DISK GROUP STATUS
c0t10d0s2 sliced rootdisk rootdg online
c0t11d0s2 sliced - - error
c0t12d0s2 sliced rootmirr rootdg online
c0t13d0s2 sliced disk02 crackle_bkup online
c10t0d1s2 sliced CRACKLE01 CRACKLEdg online
...
# ./vxdisksetup -i c0t11d0
# prtvtoc /dev/rdsk/c0t11d0s2
* Partition Tag Flags Sector Count Sector Mount Directory
2 5 00 0 17682084 17682083
3 15 01 3591 3591 7181
4 14 01 7182 17674902 17682083
# vxdisk list
DEVICE TYPE DISK GROUP STATUS
c0t10d0s2 sliced rootdisk rootdg online
c0t11d0s2 sliced disk01 crackle_bkup online
c0t12d0s2 sliced rootmirr rootdg online
...
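The re-add step itself falls inside the elided output above; reusing the old disk media name is typically done with vxdg -k adddisk (a hedged sketch):
vxdg -g crackle_bkup -k adddisk disk01=c0t11d0 # re-add the initialized disk under its old media name
vxrecover -g crackle_bkup -sb # restart and resync any volumes on it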
Objective 9 Information on Split Brain
Cannot get disk(s) to re-attach under VxVM 4.x; running ./vxreattach -r c#t#d# returns the following message:
Details:
Background:
The Serial Split Brain condition arises because VERITAS Volume Manager (tm) increments the serial ID in
the disk media record of each imported disk in all the disk group configurations on those disks. A new serial
(SSB) ID has been included as part of the new disk group version=110 in Volume Manager 4 to assist with
recovery of the disk group from this condition. The value that is stored in the configuration database
represents the serial ID that the disk group expects a disk to have. The serial ID that is stored in a disk's
private region is considered to be its actual value.
If some disks went missing from the disk group (due to physical disconnection or power failure) and those
disks were imported by another host, the serial IDs for the disks in their copies of the configuration
database, and also in each disk's private region, are updated separately on that host. When the disks are
subsequently reimported into the original shared disk group, the actual serial IDs on the disks do not agree
with the expected values from the configuration copies on other disks in the disk group.
The disk group cannot be reimported because the databases do not agree on the actual and expected serial
IDs. You must choose which configuration database to use. This is a true serial split brain condition, which
Volume Manager cannot correct automatically. In this case, the disk group import fails, and the vxdg utility
outputs error messages similar to the following before exiting:
VxVM vxconfigd NOTICE V-5-0-33 Split Brain. da id is 0.1, while dm id is 0.0 for DM <dg name>
VxVM vxdg ERROR V-5-1-587 Disk group <dg name>: import failed: Serial Split Brain detected. Run
vxsplitlines
The import does not succeed even if you specify the -f flag to vxdg.
Although it is usually possible to resolve this conflict by choosing the version of the configuration database
with the highest valued configuration ID (shown as config_tid in the output from the vxprivutil dumpconfig
<device>), this may not be the correct thing to do in all circumstances.
To resolve conflicting configuration information, you must decide which disk contains the correct version
of the disk group configuration database. To assist you in doing this, you can run the vxsplitlines command
to show the actual serial ID on each disk in the disk group and the serial ID that was expected from the
configuration database. For each disk, the command also shows the vxdg command that you must run to
select the configuration database copy on that disk as being the definitive copy to use for importing the disk
group.
The following example shows the result of a JBOD losing access to one of the four disks in the disk group:
# vxdisk -o alldgs list
DEVICE TYPE DISK GROUP STATUS
c2t1d0s2 auto:cdsdisk - (dgD280silo1) online
c2t2d0s2 auto:cdsdisk d2 dgD280silo1 online
c2t3d0s2 auto:cdsdisk d3 dgD280silo1 online
c2t9d0s2 auto:cdsdisk d4 dgD280silo1 online
- - d1 dgD280silo1 failed was:c2t1d0s2
# vxreattach -c c2t1d0s2
dgD280silo1 d1
# vxsplitlines -g dgD280silo1
VxVM vxsplitlines NOTICE V-5-2-2708 There are 1 pools.
The Following are the disks in each pool. Each disk in the same pool
has config copies that are similar.
VxVM vxsplitlines INFO V-5-2-2707 Pool 0.
c2t1d0s2 d1
To see the configuration copy from this disk, issue /etc/vx/diag.d/vxprivutil dumpconfig /dev/vx/dmp/c2t1d0s2
To import the diskgroup with the config copy from this disk, use the following command:
The following are the disks whose serial split brain (SSB) IDs don't match in this configuration copy:
d2
At this stage, you need to gain confidence prior to running the recommended command by generating the
following outputs:
In this example, the disk group split so that one disk (d1) appears to be on one side of the split. You can
specify the -c option to vxsplitlines to print detailed information about each of the disk IDs from the
configuration copy on a disk specified by its disk access name:
This output can be verified by using vxdisk list on each disk. A summary is shown below:
Note that even though some disks' SSB IDs might match, that does not necessarily mean that those disks' config
copies have all the changes. In some other configuration copies, those disks' SSB IDs might not match.
If the other disks in the disk group were not imported on another host, Volume Manager resolves the
conflicting values of the serial IDs by using the version of the configuration database from the disk with the
greatest value for the updated ID (shown as update_tid in the output from /etc/vx/diag.d/vxprivutil
dumpconfig /dev/rdsk/<device>).
In this example, looking through the dumpconfig output, there are the following update_tid and ssbid values:
Using the output from the dumpconfig for each disk, determine which config copy to use by running the
command:
Before deciding which option to use for the import, ensure the disk group is currently in a valid deported
state:
There is limited documentation on SSB. The previous tech note is meant to clarify
some of the features.
The only reference to this subject is in the VERITAS Volume Manager™ 4.0
Administrator's Guide for Solaris.
Disk Group Version   New Features Supported               Previous Version Features Supported
110                  Cross-platform Data Sharing (CDS)    20, 30, 40, 50, 60, 70, 80, 90
When dealing with issues where the SSB ID is referenced, you need to look at
the following 2 outputs:
# vxdg list sharedg
Group: sharedg
dgid: 1105340390.50.jacaranda
import-id: 33792.53
flags: shared
version: 110
alignment: 8192 (bytes)
local-activation: shared-write
cluster-actv-modes: jacaranda=sw sundar=sw
ssb: on
detach-policy: global
copies: nconfig=default nlog=default
config: seqno=0.1089 permlen=0 free=0 templen=0 loglen=0
vxdiskadm option 5 does not list the disk, thus we cannot re-attach it.
The customer is using VxVM 4.0, and the disks were initialized using the CDS format; see below.
Device: c2t0d0s2
devicetag: c2t0d0
type: auto
hostid: quest
disk: name= id=1086208967.1288.quest
group: name=vistadg1 id=1086209872.2043.quest
info: format=cdsdisk,privoffset=256,pubslice=2,privslice=2 <---cds format
flags: online ready private autoconfig autoimport
pubpaths: block=/dev/vx/dmp/c2t0d0s2 char=/dev/vx/rdmp/c2t0d0s2
version: 3.1
iosize: min=512 (bytes) max=256 (blocks)
public: slice=2 offset=2304 len=8391936 disk_offset=0
private: slice=2 offset=256 len=2048 disk_offset=0
update: time=1096861924 seqno=0.39
ssb: actual_seqno=0.0
headers: 0 240
configs: count=1 len=1280
logs: count=1 len=192
Defined regions:
config priv 000048-000239[000192]: copy=01 offset=000000 enabled
config priv 000256-001343[001088]: copy=01 offset=000192 enabled
log priv 001344-001535[000192]: copy=01 offset=000000 enabled
lockrgn priv 001536-001679[000144]: part=00 offset=000000
Multipathing information:
numpaths: 1
c2t0d0s2 state=enabled
config disk c2t0d12s2 copy 1 len=1280 state=clean online <--- located a clean online disk
2. vxdisk list c2t0d12s2 <---- do a vxdisk list on the clean online disk and get the disk id
3. vxdg deport vistadg1 (deport the entire disk group; make sure all volumes are unmounted first)
6. done
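Steps 4 and 5 are elided above; the import back using a chosen config copy is normally done with vxdg -o selectcp and the disk id gathered in step 2 (the value here is taken from the vxdisk list output above):
vxdg -o selectcp=1086208967.1288.quest import vistadg1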
Many thanks to all who helped (review &
contribute), including:
- Dave Graham
- Mike Young
- Spencer Borck
- Larry Tyburczy
- Joel Garrett
- Jeff Huff