100-002369-A
COURSE DEVELOPERS
Geoff Bergren, Margy Cassidy, Tomer Gurantz, Gene Henriksen, Kleber Saldanha

Copyright 2006 Symantec Corporation. All rights reserved. Symantec, the Symantec Logo, and VERITAS are trademarks or registered trademarks of Symantec Corporation or its affiliates in the U.S. and other countries. Other names may be trademarks of their respective owners. THIS PUBLICATION IS PROVIDED "AS IS" AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NONINFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID. SYMANTEC CORPORATION SHALL NOT BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES IN CONNECTION WITH THE FURNISHING, PERFORMANCE, OR USE OF THIS PUBLICATION. THE INFORMATION CONTAINED HEREIN IS SUBJECT TO CHANGE WITHOUT NOTICE. No part of the contents of this book may be reproduced or transmitted in any form or by any means without the written permission of the publisher.

VERITAS Cluster Server for UNIX, Fundamentals
Symantec Corporation
20330 Stevens Creek Blvd.
Cupertino, CA 95014
http://www.symantec.com
Table of Contents
Course Introduction
    VERITAS Cluster Server Curriculum ................ Intro-2
    Cluster Design ................................... Intro-4
    Lab Design for the Course ........................ Intro-5
Lesson 1: High Availability Concepts
    High Availability Concepts ....................... 1-3
    Clustering Concepts .............................. 1-7
    Clustering Prerequisites ......................... 1-14
Lesson 2: VCS Building Blocks
    VCS Terminology .................................. 2-3
    Cluster Communication ............................ 2-12
    VCS Architecture ................................. 2-17
Lesson 3: Preparing a Site for VCS
    Hardware Requirements and Recommendations ........ 3-3
    Software Requirements and Recommendations ........ 3-5
    Preparing Installation Information ............... 3-8
Lesson 4: Installing VCS
    Using the VERITAS Product Installer .............. 4-3
    VCS Configuration Files .......................... 4-7
    Viewing the Default VCS Configuration ............ 4-10
    Other Installation Considerations ................ 4-12
Lesson 5: VCS Operations
    Managing Applications in a Cluster Environment ... 5-3
    Common VCS Operations ............................ 5-5
    Using the VCS Simulator .......................... 5-16
Lesson 6: VCS Configuration Methods
    Starting and Stopping VCS ........................ 6-3
    Overview of Configuration Methods ................ 6-7
    Online Configuration ............................. 6-9
    Offline Configuration ............................ 6-16
    Controlling Access to VCS ........................ 6-19
Lesson 7: Preparing Services for VCS
    Preparing Applications for VCS ................... 7-3
    Performing One-Time Configuration Tasks .......... 7-5
    Testing the Application Service .................. 7-10
    Stopping and Migrating an Application Service .... 7-18
Lesson 8: Online Configuration
    Online Service Group Configuration ............... 8-3
    Adding Resources ................................. 8-6
    Solving Common Configuration Errors .............. 8-15
    Testing the Service Group ........................ 8-19
Copyright 2006 Symantec Corporation. All rights reserved.
Lesson 9: Offline Configuration
    Offline Configuration Procedures ................. 9-3
    Solving Offline Configuration Problems ........... 9-13
    Testing the Service Group ........................ 9-17
Lesson 10: Sharing Network Interfaces
    Parallel Service Groups .......................... 10-3
    Sharing Network Interfaces ....................... 10-7
    Using Parallel Network Service Groups ............ 10-11
    Localizing Resource Attributes ................... 10-14
Lesson 11: Configuring Notification
    Notification Overview ............................ 11-3
    Configuring Notification ......................... 11-6
    Using Triggers for Notification .................. 11-11
Lesson 12: Configuring VCS Response to Resource Faults
    VCS Response to Resource Faults .................. 12-3
    Determining Failover Duration .................... 12-9
    Controlling Fault Behavior ....................... 12-13
    Recovering from Resource Faults .................. 12-17
    Fault Notification and Event Handling ............ 12-19
Lesson 13: Cluster Communications
    VCS Communications Review ........................ 13-3
    Cluster Membership ............................... 13-6
    Cluster Interconnect Configuration ............... 13-8
    Joining the Cluster Membership ................... 13-14
    Changing the Interconnect Configuration .......... 13-19
Lesson 14: System and Communication Faults
    Ensuring Data Integrity .......................... 14-3
    Cluster Interconnect Failures .................... 14-6
Lesson 15: I/O Fencing
    Data Protection Requirements ..................... 15-3
    I/O Fencing Concepts and Components .............. 15-8
    I/O Fencing Operations ........................... 15-11
    I/O Fencing Implementation ....................... 15-19
    Configuring I/O Fencing .......................... 15-25
    Stopping and Recovering Fenced Systems ........... 15-28
Lesson 16: Troubleshooting
    Monitoring VCS ................................... 16-3
    Troubleshooting Guide ............................ 16-7
    Archiving VCS-Related Files ...................... 16-9
Course Introduction
Course Overview
Lesson 1: High Availability Concepts
Lesson 2: VCS Building Blocks
Lesson 3: Preparing a Site for VCS
Lesson 4: Installing VCS
Lesson 5: VCS Operations
Lesson 6: VCS Configuration Methods
Lesson 7: Preparing Services for VCS
Lesson 8: Online Configuration
Lesson 9: Offline Configuration
Lesson 10: Sharing Network Interfaces
Lesson 11: Configuring Notification
Lesson 12: Configuring VCS Response to Faults
Lesson 13: Cluster Communications
Lesson 14: System and Communication Faults
Lesson 15: I/O Fencing
Lesson 16: Troubleshooting
Course Overview
This training provides comprehensive instruction on the installation and initial configuration of VERITAS Cluster Server (VCS). The course covers principles and methods that enable you to prepare, create, and test VCS service groups and resources using tools that best suit your needs and your high availability environment. You learn to configure and test failover and notification behavior, cluster additional applications, and further customize your cluster according to specified design criteria.
Diagram: components required to provide the Web service: NIC eri0, Volume WebVol, Disk Group WebDG.
Cluster Design
Sample Cluster Design Input
A VCS design can be presented in many different formats with varying levels of detail. In some cases, you may have only information about the application services that need to be clustered and the desired operational behavior in the cluster. For example, you may be told that the application service uses multiple network ports and requires local failover capability among those ports before it fails over to another system. In other cases, you may receive the information you need as a set of service dependency diagrams with notes on various aspects of the desired cluster operations. If the design information you receive does not detail the resources, develop a detailed design worksheet before starting the deployment. Using a design worksheet to document all aspects of your high availability environment helps ensure that you are well-prepared to start implementing your cluster design. In this course, you are provided with a design worksheet showing sample values as a tool for implementing the cluster design in the lab exercises. You can use a similar format to collect all the information you need before starting deployment at your site.
Sample design worksheet fields:
Resource Definition: Service Group Name, Resource Name, Resource Type; Required Attributes: ResAttribute1, ResAttribute2, ...
Service Group: Required Attributes: SGAttribute1 value, SGAttribute2 value; Optional Attributes: SGAttribute3 value
Substitute your name, or a nickname, wherever tables or instructions indicate name in labs. Following this convention:
Simplifies lab instructions
Helps prevent naming conflicts with your lab partner
Lab Naming Conventions
To simplify the labs, use your name or a nickname as a prefix for cluster objects created in the lab exercises. This includes Volume Manager objects, such as disk groups and volumes, as well as VCS service groups and resources. Following this convention helps distinguish your objects when multiple students are working on systems in the same cluster and helps ensure that each student uses unique names. The lab exercises represent your name with the word name in italics. You substitute the name you select whenever you see the name placeholder in a lab step.
Classroom Values for Labs
Your instructor will provide the classroom-specific information you need to perform the lab exercises. You can record these values in your lab books using the tables provided, or your instructor may provide separate handouts showing the classroom values for your location. In some lab exercises, sample values may be shown in tables as a guide to the types of values you must specify. Substitute the values provided by your instructor to ensure that your configuration is appropriate for your classroom. If you are not sure of the configuration for your classroom, ask your instructor.
Typographic Conventions Used in This Course
The following tables describe the typographic conventions used in this course.

Typographic Conventions in Text and Commands
Convention: Courier New, bold
Element: Command input, both syntax and examples
Examples:
    To display the robot and drive configuration: tpconfig -d
    To display disk information: vxdisk -o alldgs list

Convention: Courier New, plain
Element: Command output; command names, directory names, file names, path names, user names, passwords, and URLs when used within regular text paragraphs
Examples:
    In the output: protocol_minimum: 40, protocol_maximum: 60, protocol_current: 0
    Locate the altnames directory. Go to http://www.symantec.com. Enter the value 300. Log on as user1.

Convention: Courier New, Italic
Element: Variables in command syntax and examples; variables in command input are Italic, plain; variables in command output are Italic, bold
Examples:
    To install the media server: /cdrom_directory/install
    To access a manual page: man command_name
    To display detailed information for a disk: vxdisk -g disk_group list disk_name

Convention: Quotation marks
Lesson Introduction
Lesson 1: High Availability Concepts
Lesson 2: VCS Building Blocks
Lesson 3: Preparing a Site for VCS
Lesson 4: Installing VCS
Lesson 5: VCS Operations
Lesson 6: VCS Configuration Methods
Lesson 7: Preparing Services for VCS
Lesson 8: Online Configuration
Lesson 9: Offline Configuration
Lesson 10: Sharing Network Interfaces
Lesson 11: Configuring Notification
Lesson 12: Configuring VCS Response to Faults
Lesson 13: Cluster Communications
Lesson 14: System and Communication Faults
Lesson 15: I/O Fencing
Lesson 16: Troubleshooting
Topics and objectives:
High Availability Concepts: Describe the merits of high availability in the data center environment.
Clustering Concepts: Describe how clustering is used to implement high availability.
High Availability Application Services: Describe how applications are managed in a high availability environment.
Clustering Prerequisites: Describe key requirements for a clustering environment.
Visibility: Who is making changes? Am I in compliance? How do I track usage and align with the business?
Control: How do I maintain standards? How can I pool servers and decouple apps?
Availability: How do I meet my disaster recovery requirements? How do I track and deliver against SLAs?
Causes of Downtime
Chart: causes of downtime, classified as planned or unplanned: Hardware 10%, Software 40%, People 15%.
Causes of Downtime
Downtime is defined as the period of time in which a user is unable to perform tasks in an efficient and timely manner due to poor system performance or system failure. The data in the graph shows reasons for downtime from a study published by the Institute of Electrical and Electronics Engineers (IEEE). It shows that hardware failures cause only about 10 percent of total system downtime. As much as 30 percent of all downtime is prescheduled, and most of this time is required due to the lack of system tools that enable online administration of systems. Another 40 percent of downtime is due to software errors; some of these errors are as simple as a database running out of space on disk and stopping its operations as a result.
Downtime can be more generally classified as either planned or unplanned. Examples of unplanned downtime include events such as server damage or application failure. Examples of planned downtime include times when the system is shut down to add hardware, upgrade the operating system, rearrange or repartition disk space, or clean up log files and memory. With an effective HA strategy, you can significantly reduce the amount of planned downtime: during planned hardware or software maintenance, a high availability product can enable manual failover while the upgrade or hardware work is performed.
Costs of Downtime
Chart: actual unplanned downtime per month: 9 hours, attributed to data corruption, component failure, and application failure.
Costs of Downtime
A Gartner study shows that large companies experienced a loss of between $954,000 and $1,647,000 (USD) per month for nine hours of unplanned downtime. In addition to the monetary loss, downtime also results in lost business opportunities and damaged reputation. Planned downtime is almost as costly as unplanned, but it can be significantly reduced by migrating a service to another server while maintenance is performed. Given the magnitude of the cost of downtime, the case for implementing a high availability solution is clear.
Levels of Availability
Diagram: increasing levels of availability: backup (NetBackup), data availability (VxVM/VxFS), local clustering (VCS), remote replication (VVR), remote clustering (GCO).
Levels of Availability
Data centers may implement different levels of availability, depending on their requirements.
Backup: At minimum, all data needs to be protected using an effective backup solution, such as VERITAS NetBackup.
Data availability: Local mirroring provides real-time data availability within the local data center. Point-in-time copy solutions protect against corruption. Online configuration keeps data available to applications while storage is expanded to accommodate growth.
Local clustering: After data protection is in place, the next level is a clustering solution, such as VERITAS Cluster Server (VCS), for application and server availability.
Remote replication: After implementing local availability, you can further ensure data availability in the event of a site failure by replicating data to a remote site. Replication can be application-, host-, or array-based.
Remote clustering: Implementing remote clustering ensures that the applications and data can be started at a remote site. The VCS Global Cluster Option supports remote clustering with automatic site failover capability.
Types of Clusters
High availability (HA) clusters
Parallel processing clusters
Load balancing clusters
High performance computing clusters
Fault tolerant clusters
Clustering Concepts
The term cluster refers to multiple independent systems connected into a management framework.

Types of Clusters
A variety of clustering solutions are available for various computing purposes.
HA clusters: Provide resource monitoring and automatic startup and failover.
Parallel processing clusters: Break large computational programs into smaller tasks executed in parallel on multiple systems.
Load balancing clusters: Monitor system load and distribute applications automatically among systems according to specified criteria.
High performance computing clusters: Use a collection of computing resources to enhance application performance.
Fault-tolerant clusters: Provide uninterrupted application availability.
Fault tolerance guarantees 99.9999 percent availability, or approximately 30 seconds of downtime per year. Six 9s (99.9999 percent) availability is appealing, but the costs of this solution are well beyond the affordability of most companies. In contrast, high availability solutions can achieve five 9s (99.999 percent) availability, less than five minutes of downtime per year, at a fraction of the cost. The focus of this course is VERITAS Cluster Server, which is primarily used for high availability, although it also provides some support for parallel processing and load balancing.
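The "nines" figures above are simple arithmetic. A minimal Python sketch (the function name is ours, used only for illustration) makes the downtime budgets concrete:

```python
def downtime_per_year(availability_pct):
    """Minutes of allowed downtime per year at a given availability percentage."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes in a (non-leap) year
    return (1 - availability_pct / 100.0) * minutes_per_year

# Five 9s: a little over five minutes of downtime per year.
print(round(downtime_per_year(99.999), 2))        # about 5.26 minutes
# Six 9s: roughly half a minute of downtime per year.
print(round(downtime_per_year(99.9999) * 60, 1))  # about 31.5 seconds
```

This is where the "approximately 30 seconds" and "less than five minutes" figures in the text come from.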
Local Cluster Configurations
Depending on the clustering solution you deploy, you may be able to implement a variety of configurations, enabling you to deploy your clustering solution to best suit your HA requirements and utilize existing hardware.
Active/Passive: An application runs on a primary or master server, and a dedicated redundant server stands by to take over in the event of a failure.
Active/Active: Each server is configured to run specific applications or services and essentially provides redundancy for its peer.
N-to-1: Applications fail over to the spare when a system crashes. When the failed server is repaired, applications must be moved back to their original systems.
N+1: Similar to N-to-1, applications restart on the spare after a failure. Unlike the N-to-1 configuration, after the failed server is repaired, it can become the redundant server.
N-to-N: An active/active configuration that supports multiple application services running on multiple servers. Each application service is capable of failing over to different servers in the cluster.
Campus and Global Cluster Configurations
Cluster configurations that enable data to be duplicated among multiple physical sites protect against site-wide failures.

Campus Clusters
The campus, or stretch, cluster environment is a single cluster stretched over multiple locations, connected by an Ethernet subnet for the cluster interconnect and a Fibre Channel SAN, with storage mirrored at each location. Advantages of this configuration are:
It provides local high availability within each site as well as protection against site failure.
It is a cost-effective solution; replication is not required.
Recovery time is short.
The data center can be expanded.
You can leverage existing infrastructure.

Global Clusters
Global clusters, or wide-area clusters, contain multiple clusters in different geographical locations. Global clusters protect against site failures by providing data replication and application failover to remote data centers. Global clusters are not limited by distance because cluster communication uses TCP/IP. Replication can be provided by hardware vendors or by a software solution, such as VERITAS Volume Replicator, for heterogeneous array support.
Lesson 1 High Availability Concepts
HA Application Services
Collection of all hardware and software components required to provide a service
All components moved together
Components started and stopped in order
Examples: Web servers, databases, and applications
Diagram components: Application, DB, FS, Vol, DG, IP, NIC. The application requires the database and an IP address; the database requires file systems; file systems require volumes; volumes require disk groups.
HA Application Services
An application service is a collection of hardware and software components required to provide a service, such as a Web site that end users access by connecting to a particular network IP address or host name. Each application service typically requires components of the following three types:
Application binaries (executables)
Network
Storage
If an application service needs to be switched to another system, all of its components must migrate together to re-create the service on the other system. These are the same components that the administrator must manually move from a failed server to a working server to keep the service available to clients in a nonclustered environment. Application service examples include:
A Web service consisting of a Web server program, IP addresses, associated network interfaces used to allow access to the Web site, a file system containing Web data files, and a volume and disk group containing the file system
A database service consisting of one or more IP addresses, database management software, a file system containing data files, a volume and disk group on which the file system resides, and a NIC for network access
Diagram: clients connect to the application service; the same stack of components (Application, DB, FS, Vol, DG, IP, NIC) is shown on two cluster systems, so the service can run on either one.
Local Application Service Failover
Cluster management software performs a series of tasks so that clients can access a service on another server when a failure occurs. The software must:
Ensure that data stored on disk is available to the new server, if shared storage is configured (Storage).
Move the IP address of the old server to the new server (Network).
Start the application on the new server (Application).
The process of stopping an application service on one system and starting it on another system in response to a fault is referred to as a failover.
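The failover tasks have an implicit ordering: storage must be available before the application starts, and a network interface must be up before an IP address moves onto it. The Python sketch below is an illustrative model of that ordering (the resource names and data structure are ours, not VCS code): resources start in dependency order on the new system and stop in the reverse order on the old one.

```python
# Hypothetical dependency graph for a database service:
# each resource lists the resources it requires.
REQUIRES = {
    "App": ["DB", "IP"],
    "DB":  ["FS"],
    "FS":  ["Vol"],
    "Vol": ["DG"],
    "IP":  ["NIC"],
    "DG":  [],
    "NIC": [],
}

def start_order(requires):
    """Topological order: every resource comes after the resources it requires."""
    order, seen = [], set()
    def visit(res):
        if res in seen:
            return
        seen.add(res)
        for dep in requires[res]:
            visit(dep)
        order.append(res)
    for res in sorted(requires):
        visit(res)
    return order

def failover_plan(requires):
    """Return (stop sequence on the old node, start sequence on the new node)."""
    up = start_order(requires)
    return list(reversed(up)), up

stop_seq, start_seq = failover_plan(REQUIRES)
print("stop: ", stop_seq)
print("start:", start_seq)
```

Running this shows the disk group coming online before the volume, file system, and database, and the NIC before the IP address, with the shutdown sequence being the exact reverse.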
Diagram: failover within the local cluster; replication between the local and remote sites.
Local and Global Failover
In a global cluster environment, the application services are generally highly available within a local cluster, so faults are first handled by the HA software, which performs a local failover. When HA methods such as replication and clustering are implemented across geographical locations, recovery procedures are started immediately at a remote location when a disaster takes down a site.
Application Requirements for Clustering
The most important requirements for an application to run in a cluster are crash tolerance and host independence. This means that the application should be able to recover after a crash to a known state, in a predictable and reasonable time, on two or more hosts. Most commercial applications today satisfy this requirement. More specifically, an application is considered well-behaved and can be controlled by clustering software if it meets the requirements shown in the slide.
Diagram: two independent links connecting the nodes (through Switch1 and Switch2), and two independent links from each server to storage.
Clustering Prerequisites
Hardware and Infrastructure Redundancy
All failovers cause some type of client disruption, and depending on your configuration, some applications take longer to fail over than others. For this reason, good design dictates that the HA software first try to fail over within the system, using agents that monitor local resources. Design as much resiliency as possible into the individual servers and components so that you do not have to rely on other hardware or software to cover a poorly configured system or application. Likewise, use all available resources to make individual servers as reliable as possible.

Single Point of Failure Analysis
Determine whether any single points of failure exist in the hardware, software, and infrastructure components within the cluster environment. Any single point of failure becomes the weakest link of the cluster: the application is equally inaccessible whether a client network connection fails or a server fails. Also consider the location of redundant components. Keeping redundant hardware in the same location is not as effective as placing the redundant component in a separate location. In some cases, the cost of redundant components outweighs the risk that the component will become the cause of an outage; for example, buying an additional expensive storage array may not be practical. Decisions about balancing cost versus availability need to be made according to your availability requirements.
External Dependencies
Avoid dependence on services outside the cluster, where possible.
Ensure redundancy of external services, if required.
Diagram: System A and System B connect through redundant switches to an NIS master and primary DNS server, and to an NIS slave and secondary DNS server.
External Dependencies
Whenever possible, it is good practice to eliminate or reduce reliance by high availability applications on external services. If it is not possible to avoid outside dependencies, ensure that those services are also highly available. For example, network name and information services, such as DNS (Domain Name System) and NIS (Network Information Service), are designed with redundant capabilities.
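As a concrete illustration, DNS redundancy on each cluster node can be as simple as listing more than one name server in /etc/resolv.conf, so that name resolution survives the loss of either server. The addresses below are placeholders from the documentation example range, not values from the course labs:

```
# /etc/resolv.conf: two DNS servers so name resolution
# survives the loss of either one (addresses are examples only)
nameserver 192.0.2.10
nameserver 192.0.2.11
```

The resolver tries the servers in the order listed, moving to the next entry when one does not respond.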
Lesson Summary
Key Points
Clustering is used to make business-critical applications highly available.
Local and global clusters can be used together to provide disaster recovery for data center sites.
Reference Materials
High Availability Design and Customization Using VERITAS Cluster Server course
VERITAS High Availability Fundamentals Web-based training
Lesson Introduction
Lesson 1: High Availability Concepts
Lesson 2: VCS Building Blocks
Lesson 3: Preparing a Site for VCS
Lesson 4: Installing VCS
Lesson 5: VCS Operations
Lesson 6: VCS Configuration Methods
Lesson 7: Preparing Services for VCS
Lesson 8: Online Configuration
Lesson 9: Offline Configuration
Lesson 10: Sharing Network Interfaces
Lesson 11: Configuring Notification
Lesson 12: Configuring VCS Response to Faults
Lesson 13: Cluster Communications
Lesson 14: System and Communication Faults
Lesson 15: I/O Fencing
Lesson 16: Troubleshooting
Topics:
VCS Terminology
Cluster Communication
VCS Architecture
VCS Cluster
VCS clusters consist of:
Up to 32 systems (nodes)
An interconnect for cluster communication
A public network for client connections
Shared storage accessible by each system
VCS Terminology
VCS Cluster
A VCS cluster is a collection of independent systems working together under the VCS management framework for increased service availability. VCS clusters have the following components:
Up to 32 systems, sometimes referred to as nodes or servers; each system runs its own operating system
A cluster interconnect, which allows for cluster communications
A public network, connecting each system in the cluster to a LAN for client access
Shared storage (optional), accessible by each system in the cluster that needs to run the application
Service Groups
Diagram: the WebSG service group contains Web, IP, NIC, Mount, Volume, and DiskGroup resources.
A service group is a container that enables VCS to manage an application service as a unit.
A service group is defined by:
Resources: Components required to provide the service
Dependencies: Relationships between components
Attributes: Behaviors for startup and failure conditions
Service Groups
A service group is a virtual container that enables VCS to manage an application service as a unit. The service group contains all the hardware and software components required to run the service, and it enables VCS to coordinate failover of the application service resources in the event of a failure or at the administrator's request. A service group is defined by these attributes:
The cluster-wide unique name of the group
The list of the resources in the service group, usually determined by which resources are needed to run a specific application service
The dependency relationships between the resources
The list of cluster systems on which the group is allowed to run
The list of cluster systems on which you want the group to start automatically
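As a preview of the configuration lessons later in the course, a service group like the WebSG example is ultimately described in the VCS configuration file, main.cf. The fragment below is a hand-written sketch of what such a group definition can look like; the system names, mount point, device name, and IP address are invented for illustration and are not values from the course labs:

```
group WebSG (
    SystemList = { sysA = 0, sysB = 1 }
    AutoStartList = { sysA }
    )

    DiskGroup WebDG (
        DiskGroup = WebDG
        )

    Volume WebVol (
        Volume = WebVol
        DiskGroup = WebDG
        )

    Mount WebMount (
        MountPoint = "/web"
        BlockDevice = "/dev/vx/dsk/WebDG/WebVol"
        FSType = vxfs
        FsckOpt = "-y"
        )

    NIC WebNIC (
        Device = eri0
        )

    IP WebIP (
        Device = eri0
        Address = "10.1.1.100"
        )

    WebVol requires WebDG
    WebMount requires WebVol
    WebIP requires WebNIC
```

Notice how the fragment captures all three parts of the definition above: the resources, the dependency relationships (the requires statements), and the group attributes (SystemList for where the group may run, AutoStartList for where it starts automatically).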
Parallel: Online on multiple cluster systems simultaneously; for example, Oracle Real Application Cluster (RAC).
Hybrid: Special-purpose service group used in replicated data clusters (RDCs) using VERITAS Volume Replicator.
Service Group Types
Service groups can be one of three types:
Failover: This service group runs on one system at a time in the cluster. Most application services, such as database and NFS servers, use this type of group.
Parallel: This service group runs simultaneously on more than one system in the cluster. This type of service group requires an application that can be started on more than one system at a time without threat of data corruption.
Hybrid (4.x and later): A hybrid service group is a combination of a failover service group and a parallel service group used in VCS 4.x (and later) replicated data clusters (RDCs), which use replication between systems at different sites instead of shared storage. This service group behaves as a failover group within a defined set of systems, and as a parallel group within a different set of systems. RDC configurations are described in the High Availability Using VERITAS Cluster Server for UNIX, Implementing Remote Clusters course.
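In main.cf terms, a group's behavior as failover or parallel is governed by the group's Parallel attribute (0, the default, for failover; 1 for parallel). A sketch with hypothetical group and system names:

```
group RACSG (
    SystemList = { train1 = 0, train2 = 1 }
    Parallel = 1    // group may be online on multiple systems at once
    )
```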
Resources
VCS resources:
Correspond to the hardware or software components of an application service Have unique names throughout the cluster Are always contained within service groups Are categorized as: Persistent: Always on Nonpersistent: Turned on and off
Recommendation: Choose names that reflect the service group name to easily identify all resources in that group; for example, WebIP in the WebSG group.
Resources
Resources are VCS objects that correspond to hardware or software components, such as the application, the networking components, and the storage components. VCS controls resources through these actions:
Bringing a resource online (starting)
Taking a resource offline (stopping)
Monitoring a resource (probing)
Resource Categories
Persistent (None): VCS can only monitor persistent resources; these resources cannot be brought online or taken offline. The most common example of a persistent resource is a network interface card (NIC), because it must be present but cannot be stopped. FileNone and ElifNone are other examples.
On-only: VCS brings the resource online if required, but does not stop the resource if the associated service group is taken offline. The ProcessOnOnly resource, for example, is used to start, but not stop, a process such as a daemon.
Nonpersistent, also known as on-off: Most resources fall into this category, meaning that VCS brings them online and takes them offline as required. Examples are Mount, IP, and Process. FileOnOff is a test version of this type of resource.
26 VERITAS Cluster Server for UNIX, Fundamentals
Copyright 2006 Symantec Corporation. All rights reserved.
Resource Dependencies
Resource dependencies:
Determine online order (children first) and offline order (parents first)
Have parent/child relationships; the parent depends on the child
Cannot be cyclical
(Diagram: a parent resource above a child resource; resources are brought online from the child upward and taken offline from the parent downward. Persistent resources, such as NIC, cannot be parents.)
Resource Dependencies
Resources depend on other resources because of application or operating system requirements. Dependencies are defined to configure VCS for these requirements.
Dependency Rules
These rules apply to resource dependencies:
A parent resource depends on a child resource. In the diagram, the Mount resource (parent) depends on the Volume resource (child). This dependency illustrates the operating system requirement that a file system cannot be mounted without the Volume resource being available.
Dependencies are homogeneous. Resources can only depend on other resources.
No cyclical dependencies are allowed. There must be a clearly defined starting point.
Resource Attributes
Resource attributes:
Define individual resource properties Are used by VCS to manage the resource Can be required or optional Have values that match actual components
Solaris example:
mount -F vxfs /dev/vx/dsk/WebDG/WebVol /Web
Resource Attributes
Resource attributes define the specific characteristics of individual resources. As shown in the slide, the resource attribute values for the sample resource of type Mount correspond to the UNIX command line used to mount a specific file system. VCS uses the attribute values to run the appropriate command or system call to perform an operation on the resource.
Each resource has a set of required attributes that must be defined in order to enable VCS to manage the resource. For example, the Mount resource on Solaris has four required attributes that must be defined for each resource of type Mount:
The directory of the mount point (MountPoint)
The device for the mount point (BlockDevice)
The type of file system (FSType)
The options for the fsck command (FsckOpt)
The first three attributes are the values used to build the UNIX mount command shown in the slide. The FsckOpt attribute is used if the mount command fails. In this case, VCS runs fsck with the specified options (-y, which means answer yes to all fsck questions) and attempts to mount the file system again.
Some resources also have optional attributes you can define to control how VCS manages a resource. In the Mount resource example, MountOpt is an optional attribute you can use to define options to the UNIX mount command. For example, if this is a read-only file system, you can specify -ro as the MountOpt value.
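Putting these attributes together, a Mount resource matching the slide's mount command might be written in main.cf as follows (the resource name is hypothetical, and the MountOpt line shows the optional read-only case discussed above):

```
Mount WebMount (
    MountPoint = "/Web"
    BlockDevice = "/dev/vx/dsk/WebDG/WebVol"
    FSType = vxfs
    FsckOpt = "-y"
    MountOpt = ro    // optional attribute; read-only example
    )
```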
Resource Types
Resource types:
Are classifications of resources
Specify the attributes needed to define a resource
Are templates for defining resource instances
Solaris syntax:
mount [-F FSType] [options] block_device mount_point
Resource Types and Type Attributes
Resources are classified by resource type. For example, disk groups, network interface cards (NICs), IP addresses, mount points, and databases are distinct types of resources. VCS provides a set of predefined resource types (some bundled, some add-ons), in addition to the ability to create new resource types.
Individual resources are instances of a resource type. For example, you may have several IP addresses under VCS control. Each of these IP addresses individually is a single resource of resource type IP.
A resource type can be thought of as a template that defines the characteristics or attributes needed to define an individual resource (instance) of that type. You can view the relationship between resources and resource types by comparing the mount command for a resource on the previous slide with the mount syntax on this slide. The resource type defines the syntax for the mount command. The resource attributes fill in the values to form an actual command line.
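As a sketch of the template idea, the Mount type definition in types.cf looks roughly like this (simplified; the shipped file contains additional attributes and static attributes):

```
type Mount (
    static str ArgList[] = { MountPoint, BlockDevice, FSType, MountOpt, FsckOpt }
    str MountPoint
    str BlockDevice
    str FSType
    str MountOpt
    str FsckOpt
    )
```

Each Mount resource instance then supplies concrete values for these attributes, as in the WebMount example on the previous page.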
Agents: How VCS Controls Resources Agents are processes that control resources. Each resource type has a corresponding agent that manages all resources of that resource type. Each cluster system runs only one agent process for each active resource type, no matter how many individual resources of that type are in use. Agents control resources using a defined set of actions, also called entry points. The four entry points common to most agents are: Online: Resource startup Offline: Resource shutdown Monitor: Probing the resource to retrieve status Clean: Killing the resource or cleaning up as necessary when a resource fails to be taken offline gracefully The difference between offline and clean is that offline is an orderly termination and clean is a forced termination. In UNIX, this can be thought of as the difference between exiting an application and sending the kill -9 command to the process. Each resource type needs a different way to be controlled. To accomplish this, each agent has a set of predefined entry points that specify how to perform each of the four actions. For example, the startup entry point of the Mount agent mounts a block device on a directory, whereas the startup entry point of the IP agent uses the ifconfig command to set the IP address on a unique IP alias on the network interface. VCS provides both predefined agents and the ability to create custom agents.
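To make the monitor entry point concrete, here is a hedged sketch of what a monitor script for a hypothetical custom agent might look like; this is not shipped VCS agent code, and the process name is made up. The VCS convention is that a monitor entry point exits with status 110 when the resource is online and 100 when it is offline:

```shell
#!/bin/sh
# Hedged sketch of a monitor entry point for a hypothetical custom agent.
# VCS convention: exit 110 = resource online, exit 100 = resource offline.

check_daemon() {
    # $1 is the process name to probe; pgrep succeeds if it is running
    if pgrep -x "$1" >/dev/null 2>&1; then
        echo online
    else
        echo offline
    fi
}

monitor() {
    if [ "$(check_daemon "$1")" = "online" ]; then
        return 110    # resource online
    else
        return 100    # resource offline
    fi
}
```

A real agent would receive the process name through the resource's ArgList attribute values rather than a hard-coded argument.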
Defines all VCS resource types for all bundled agents
Includes all supported UNIX platforms
Downloadable from http://support.veritas.com
Solaris
AIX
HP-UX
Linux
VERITAS Cluster Server Bundled Agents Reference Guide The VERITAS Cluster Server Bundled Agents Reference Guide describes the agents that are provided with VCS and defines the required and optional attributes for each associated resource type. VERITAS also provides additional application and database agents in an Agent Pack that is updated quarterly. Some examples of these agents are: Oracle NetBackup Informix iPlanet Select the Agents and Options link on the VERITAS Cluster Server page at www.veritas.com for a complete list of agents available for VCS. To obtain PDF versions of product documentation for VCS and agents, see the Support Web site at http://support.veritas.com.
Cluster Communication
The cluster interconnect provides a communication channel between nodes.
The interconnect:
Determines which nodes are affiliated by cluster ID
Uses a heartbeat mechanism
Maintains cluster membership: a single view of the state of each cluster node
Is also referred to as the private network
Cluster Communication
VCS requires a cluster communication channel between systems in a cluster to serve as the cluster interconnect. This communication channel is also sometimes referred to as the private network because it is often implemented using a dedicated Ethernet network. VERITAS recommends that you use a minimum of two dedicated communication channels with separate infrastructures (for example, multiple NICs and separate network hubs) to implement a highly available cluster interconnect. The cluster interconnect has two primary purposes:
Determine cluster membership: Membership in a cluster is determined by systems sending and receiving heartbeats (signals) on the cluster interconnect. This enables VCS to determine which systems are active members of the cluster and which systems are joining or leaving the cluster. In order to take corrective action on node failure, surviving members must agree when a node has departed. This membership needs to be accurate and coordinated among active members; nodes can be rebooted, powered off, faulted, and added to the cluster at any time.
Maintain a distributed configuration: Cluster configuration and status information for every resource and service group in the cluster is distributed dynamically to all systems in the cluster.
Cluster communication is handled by the Group Membership Services/Atomic Broadcast (GAB) mechanism and the Low Latency Transport (LLT) protocol, as described in the next sections.
Low-Latency Transport
Clustering technologies from Symantec use a high-performance, low-latency protocol for communications. LLT is designed for the high-bandwidth and low-latency needs of not only VERITAS Cluster Server, but also VERITAS Cluster File System, in addition to Oracle Cache Fusion traffic in Oracle RAC configurations. LLT runs directly on top of the Data Link Provider Interface (DLPI) layer over Ethernet and has several major functions:
Sending and receiving heartbeats over network links
Monitoring and transporting network traffic over multiple network links to every active system
Balancing the cluster communication load over multiple links
Maintaining the state of communication
Providing a transport mechanism for cluster communications
Group Membership Services/Atomic Broadcast (GAB)
GAB provides the following:
Group Membership Services: GAB maintains the overall cluster membership by way of its group membership services function. Cluster membership is determined by tracking the heartbeat messages sent and received by LLT on all systems in the cluster over the cluster interconnect. GAB messages determine whether a system is an active member of the cluster, joining the cluster, or leaving the cluster. If a system stops sending heartbeats, GAB determines that the system has departed the cluster.
Atomic Broadcast: Cluster configuration and status information are distributed dynamically to all systems in the cluster using GAB's atomic broadcast feature. Atomic broadcast ensures that all active systems receive all messages for every resource and service group in the cluster.
I/O Fencing
I/O fencing is a mechanism to prevent uncoordinated access to shared storage.
(Diagram: two nodes, each with a fencing module above GAB and LLT; after an interconnect failure, the losing node reboots.)
I/O fencing:
Monitors GAB for cluster membership changes Prevents simultaneous access to shared storage (fences off nodes) Is implemented as a kernel driver Coordinates with Volume Manager Requires hardware with SCSI-3 PR support
The Fencing Driver The fencing driver prevents multiple systems from accessing the same Volume Manager-controlled shared storage devices in the event that the cluster interconnect is severed. In the example of a two-node cluster displayed in the diagram, if the cluster interconnect fails, each system stops receiving heartbeats from the other system. GAB on each system determines that the other system has failed and passes the cluster membership change to the fencing module. The fencing modules on both systems contend for control of the disks according to an internal algorithm. The losing system is forced to panic and reboot. The winning system is now the only member of the cluster, and it fences off the shared data disks so that only systems that are still part of the cluster membership (only one system in this example) can access the shared storage. The winning system takes corrective action as specified within the cluster configuration, such as bringing service groups online that were previously running on the losing system.
HAD:
Runs on each cluster node Maintains resource configuration and state information Manages agents and service groups Is monitored by the hashadow daemon
(Diagram: the VCS stack on each cluster node: HAD, monitored by hashadow, above the fencing driver, GAB, and LLT.)
The High Availability Daemon The VCS engine, also referred to as the high availability daemon (had), is the primary VCS process running on each cluster system. HAD tracks all changes in cluster configuration and resource status by communicating with GAB. HAD manages all application services (by way of agents) whether the cluster has one or many systems. Building on the knowledge that the agents manage individual resources, you can think of HAD as the manager of the agents. HAD uses the agents to monitor the status of all resources on all nodes. This modularity between had and the agents allows for efficiency of roles: HAD does not need to know how to start up Oracle or any other applications that can come under VCS control. Similarly, the agents do not need to make cluster-wide decisions. This modularity allows a new application to come under VCS control simply by adding a new agentno changes to the VCS engine are required. On each active cluster system, HAD updates all the other cluster systems with changes to the configuration or status. In order to ensure that the had daemon is highly available, a companion daemon, hashadow, monitors had, and if had fails, hashadow attempts to restart had. Likewise, had restarts hashadow if hashadow stops.
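The hashadow supervision idea can be illustrated with a toy restart loop. This is illustrative shell only, not VCS source code, and the pid file path is made up:

```shell
# Toy sketch of the hashadow pattern: restart a daemon if its process
# has disappeared. A real implementation runs this check periodically.

PIDFILE=/tmp/toy_had.pid       # hypothetical pid file

supervise() {
    cmd="$1"
    pid="$(cat "$PIDFILE" 2>/dev/null || true)"
    # kill -0 probes whether the process exists without signaling it
    if [ -z "$pid" ] || ! kill -0 "$pid" 2>/dev/null; then
        $cmd &                  # (re)start the monitored daemon
        echo $! > "$PIDFILE"
    fi
}

rm -f "$PIDFILE"                # start from a clean state for the demo
supervise "sleep 60"            # first call starts the "daemon"
supervise "sleep 60"            # daemon is alive, so this call does nothing
```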
(Diagram: two cluster nodes, each running HAD over GAB and LLT, each with its own on-disk copy of main.cf.)
VCS Architecture
Maintaining the Cluster Configuration HAD maintains configuration and state information for all cluster resources in memory on each cluster system. Cluster state refers to tracking the status of all resources and service groups in the cluster. When any change to the cluster configuration occurs, such as the addition of a resource to a service group, HAD on the initiating system sends a message to HAD on each member of the cluster by way of GAB atomic broadcast, to ensure that each system has an identical view of the cluster. Atomic means that all systems receive updates, or all systems are rolled back to the previous state, much like a database atomic commit. The cluster configuration in memory is created from the main.cf file on disk in the case where HAD is not currently running on any cluster systems, so there is no configuration in memory. When you start VCS on the first cluster system, HAD builds the configuration in memory on that system from the main.cf file. Changes to a running configuration (in memory) are saved to disk in main.cf when certain operations occur. These procedures are described in more detail later in the course.
VCS Configuration Files
Configuring VCS means conveying to VCS the definitions of the cluster, service groups, resources, and resource dependencies. VCS uses two configuration files in a default configuration:
The main.cf file defines the entire cluster, including the cluster name, systems in the cluster, and definitions of service groups and resources, in addition to service group and resource dependencies.
The types.cf file defines the resource types.
Additional files similar to types.cf may be present if agents have been added. For example, if the Oracle enterprise agent is added, a resource types file, such as OracleTypes.cf, is also present. The cluster configuration is saved on disk in the /etc/VRTSvcs/conf/config directory, so the memory configuration can be re-created after systems are restarted.
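A default main.cf, before any application service groups are added, has roughly this shape (cluster and system names are hypothetical, and a real file stores the admin password in encrypted form):

```
include "types.cf"

cluster vcs_demo (
    UserNames = { admin = "password" }   // encrypted in a real file
    Administrators = { admin }
    )

system train1 (
    )

system train2 (
    )
```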
Lesson Summary
Key Points
HAD is the primary VCS process, which manages resources by way of agents. Resources are organized into service groups. Each system in a cluster has an identical view of the state of resources and service groups.
Reference Materials
High Availability Design and Customization Using VERITAS Cluster Server course
VERITAS Cluster Server Bundled Agents Reference Guide
VERITAS Cluster Server User's Guide
Next Steps
Your understanding of basic VCS architecture enables you to prepare your site for installing VCS.
Lesson Introduction
Lesson 1: High Availability Concepts Lesson 2: VCS Building Blocks Lesson 3: Preparing a Site for VCS Lesson 4: Installing VCS Lesson 5: VCS Operations Lesson 6: VCS Configuration Methods Lesson 7: Preparing Services for VCS Lesson 8: Online Configuration Lesson 9: Offline Configuration Lesson 10: Sharing Network Interfaces Lesson 11: Configuring Notification Lesson 12: Configuring VCS Response to Faults Lesson 13: Cluster Communications Lesson 14: System and Communication Faults Lesson 15: I/O Fencing Lesson 16: Troubleshooting
Topic
Hardware Requirements and Recommendations Software Requirements and Recommendations Preparing Installation Information
Hardware Requirements
entsupport.symantec.com
Hardware Compatibility List (HCL) Minimum configurations: Memory Disk space Cluster interconnect:
Redundant interconnect links Separate infrastructure (hubs, switches) No single point of failure
Hardware Recommendations
No single points of failure Redundancy for:
Public network interfaces and infrastructures HBAs for shared storage (Fibre or SCSI)
Networking
For a highly available configuration, each system in the cluster should have a minimum of two physically independent Ethernet connections for the public network. Using the same interfaces on each system simplifies configuring and managing the cluster.
Shared Storage
VCS is designed primarily as a shared data high availability product; however, you can configure a cluster that has no shared storage. For shared storage clusters, consider these recommendations:
One HBA minimum for shared and nonshared (boot) disks: To eliminate single points of failure, it is recommended to have two HBAs to connect to disks and to use dynamic multipathing software, such as VERITAS Volume Manager DMP.
Use multiple single-port HBAs or SCSI controllers rather than multiport interfaces to avoid single points of failure.
Shared storage on a SAN must reside in the same zone as all cluster nodes.
Data should be mirrored or protected by a hardware-based RAID mechanism. Use redundant storage and paths.
Include all cluster-controlled data in your backup planning, implementation, and testing.
For information about configuring SCSI shared storage, see the SCSI Controller Configuration for Shared Storage section in the Job Aids appendix.
Software Requirements
Determine supported software:
Operating system Patch level Volume management File system Applications
entsupport.symantec.com
Release notes and installation guide
Software Recommendations
Identical system software configuration:
Operating system version and patch level Kernel and networking Configuration files User accounts
Software Recommendations Follow these recommendations to simplify installation, configuration, and management of the cluster: Operating system: Although it is not a strict requirement to run the same operating system version on all cluster systems, doing so greatly reduces the complexity of installation and ongoing cluster maintenance. Configuration: Setting up identical configurations on each system helps ensure that your application services can fail over and run properly on all cluster systems. Application: Verify that you have the same revision level of each application you are placing under VCS control. Ensure that any application-specific user accounts are created identically on each system. Ensure that you have appropriate licenses to enable the applications to run on any designated cluster system.
Verify that systems are accessible using fully qualified host names.
Create an alias for the abort>go sequence (Solaris). Configure ssh or rsh.
Only required for the duration of the VCS installation procedure
No prompting permitted:
ssh: Set up public/private keys
rsh: Set up /.rhosts
Move /etc/issue or similar files
You can install systems individually if remote access is not allowed
System and Network Preparation Perform these tasks before starting VCS installation. Add directories to the PATH variable, if required. For the PATH settings, see the Installation guide for your platform. Verify that administrative IP addresses are configured on your public network interfaces and that all systems are accessible on the public network using fully qualified host names. For details on configuring administrative IP addresses, see the Job Aids appendix.
Solaris
Consider disabling the go sequence after Stop-A on Solaris systems. When a Solaris system in a VCS cluster is halted with the abort sequence (STOP-A), it stops producing VCS heartbeats. This causes other systems to consider this a failed node. Ensure that the only action possible after an abort is a reset. To ensure that you never issue a go function after an abort, create an alias for the go function that displays a message. See the VERITAS Cluster Server Installation Guide for the detailed procedure. Enable ssh or rsh to install all cluster systems from one system. If you cannot enable secure communications, you can install VCS on each system separately.
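For example, passwordless ssh for the installer can be set up along these lines; the key path and node name below are hypothetical, and you should follow the exact procedure in the installation guide for your platform:

```shell
# Generate a passphrase-less key pair for use only during installation.
rm -f /tmp/vcs_install_key /tmp/vcs_install_key.pub
ssh-keygen -q -t rsa -N "" -f /tmp/vcs_install_key

# Then copy the public key to each remote cluster system, for example:
#   ssh-copy-id -i /tmp/vcs_install_key.pub root@train2
```

Remove the keys (or the /.rhosts entries, if you used rsh) after the installation completes.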
SMTP server name and e-mail addresses SNMP Console name and message levels Root broker node for security
Default account:
User name: admin
Password: password
You can opt to configure additional cluster services during installation. VCS user accounts: Add accounts or change the default admin account. Managed host: Add cluster nodes to a Cluster Management Console management server as described in the Managed Hosts section. Local Cluster Management Console (Web GUI): Specify a network interface and virtual IP address on the public network to configure a highly available Web management interface for local cluster administration. Notification: Specify SMTP and SNMP information during installation to configure the cluster notification service. Broker nodes (4.1 and later): VCS can be configured to use VERITAS Security Services (VxSS) to provide secure communication between cluster nodes and clients, as described in the VERITAS Security Services section.
Managed Hosts
A managed host:
Can be any 4.x or 5.0 cluster system, any platform Is under control of 5.0 CMC Runs a console connector that communicates with CMC
This course covers local cluster management only.
Managed Hosts During VCS installation, you are prompted to select whether the systems in this cluster are managed hosts in a Cluster Management Console environment. Cluster Management Console (CMC) is a Web-based interface for managing multiple clusters at different physical locations, with cluster systems running on any operating system platform supported by VCS 4.x or 5.0. You can also use the CMC in local mode to manage only the local cluster. This is similar to the Web GUI functionality in pre-5.0 versions of VCS. Alternately, you can place cluster systems under CMC control by configuring a cluster connector, which enables the systems to be CMC-managed hosts. You can select the type of CMC functionality (or none at all) during VCS installation, or configure this after installation. During installation: If you select to use CMC for local cluster management, you must provide: A public NIC for each node A virtual IP address and netmask If you configure the cluster nodes as managed hosts, you must also configure the cluster connector by providing: The IP address or fully-qualified host name for the CMC server The CMC service account password The root hash of the management server This course covers local cluster management only. Refer to the product documentation for information about managed hosts and CMC.
Symantec recommends using a system outside the cluster to serve as the root broker node.
Symantec Product Authentication Service
VCS versions 4.1 and later can be configured to use Symantec Product Authentication Service (formerly named VERITAS Security Services or VxSS) to provide secure communication between cluster nodes and clients, including the Java and Web consoles. VCS uses digital certificates for authentication and uses SSL to encrypt communication over the public network. In secure mode, VCS uses platform-based authentication; VCS does not store user passwords. All VCS users are system users. After a user is authenticated, the account information does not need to be provided again to connect to the cluster (single sign-on).
Note: Security Services are in the process of being implemented in all VERITAS products.
VxSS requires one system to act as a root broker node. This system serves as the main registration and certification authority and should be a system that is not a member of the cluster. All cluster systems must be configured as authentication broker nodes, which can authenticate clients. Security can be configured after VCS is installed and running. For additional information on configuring and running VCS in secure mode, see Enabling and Disabling VERITAS Security Services in the VERITAS Cluster Server User's Guide.
Using a Design Worksheet You may want to use a design worksheet to collect the information required to install VCS as you prepare the site for VCS deployment. You can then use this worksheet later when you are installing VCS.
Lesson Summary
Key Points
Verify hardware and software compatibility and record information in a worksheet. Prepare cluster configuration values before you begin installation.
Reference Materials
VERITAS Cluster Server Release Notes VERITAS Cluster Server Installation Guide http://entsupport.symantec.com http://vlicense.veritas.com
See the next slide for lab assignments.
Labs and solutions for this lesson are located on the following pages. "Lab 3: Validating Site Preparation," page A-3. "Lab 3 Solutions: Validating Site Preparation," page B-3.
Lesson Introduction
Lesson 1: High Availability Concepts Lesson 2: VCS Building Blocks Lesson 3: Preparing a Site for VCS Lesson 4: Installing VCS Lesson 5: VCS Operations Lesson 6: VCS Configuration Methods Lesson 7: Preparing Services for VCS Lesson 8: Online Configuration Lesson 9: Offline Configuration Lesson 10: Sharing Network Interfaces Lesson 11: Configuring Notification Lesson 12: Configuring VCS Response to Faults Lesson 13: Cluster Communications Lesson 14: System and Communication Faults Lesson 15: I/O Fencing Lesson 16: Troubleshooting
Topic
Using the VERITAS Product Installer VCS Configuration Files Viewing the Default VCS Configuration Other Installation Considerations
VERITAS ships high availability and storage foundation products with a product installation utility that enables you to install these products using the same interface.
Viewing Installation Logs
At the end of every product installation, the installer creates three text files:
A log file containing any system commands executed and their output
A response file to be used in conjunction with the -responsefile option of the installer
A summary file containing the output of the VERITAS product installer scripts
These files are located in /opt/VRTS/install/logs. The name and location of each file (installer<timestamp>.log, .summary, and .response) are displayed at the end of each product installation. It is recommended that these logs be kept for auditing and debugging purposes.
The installvcs Utility The installvcs utility is used by the product installer to automatically install and configure a cluster. If remote root access is enabled, installvcs installs and configures all cluster systems you specify during the installation process. The installation utility performs these high-level tasks: Installs VCS packages on all the systems in the cluster Configures cluster interconnect links Brings the cluster up without any application services Make any changes to the new cluster configuration, such as the addition of any application services, after the installation is completed. For a list of software packages that are installed, see the release notes for your VCS version and platform. Options to installvcs The installvcs utility supports several options that enable you to tailor the installation process. For example, you can: Perform an unattended installation. Install software packages without configuring a cluster. Install VCS in a secure environment. Upgrade an existing VCS cluster. For a complete description of installvcs options, see the VERITAS Cluster Server Installation Guide.
Select root broker node.
Set up VCS user accounts.
Configure cluster connector or local CMC (Web GUI).
Configure SMTP and SNMP notification.
Install VCS packages.
Configure VCS.
Start VCS.
If you use the installer utility and select VCS from the product list, installvcs is started. You can also run installvcs directly from the command line. Using information you supply, the installvcs utility installs VCS and all bundled agents on each cluster system, installs a Perl interpreter, and sets up the LLT and GAB communication services. The utility also gives you the option to configure the cluster connector for managed hosts or the local Cluster Manager Console (Web Console), and to set up SNMP and SMTP notification features in the cluster. As you use the installvcs utility, you can review summaries to confirm the information that you provide. You can stop or restart the installation after reviewing the summaries. Installation of VCS packages takes place only after you have confirmed the information. However, you must remove partially installed VCS files before running the installvcs utility again. Note: The installation process may differ depending on the platform. For example, on HP-UX, the package installation occurs before the configuration questions. You need to reboot the systems after the package installation and then execute ./installvcs -configure to continue with the rest of the configuration questions as displayed on the flow chart. Check your product installation guide before you begin installing VCS to familiarize yourself with the procedure for your platform.
Installing VCS Updates
Updates for VCS are created periodically in the form of patches or maintenance packs to provide software fixes and enhancements. Before proceeding to configure your cluster, check the Technical Support Web site at http://entsupport.symantec.com for information about any updates that may be available. Download the latest update for your version of VCS according to the instructions provided on the Web site. The installation instructions for VCS updates are included with the update pack. Before you install an update, ensure that all prerequisites are met. At the end of the update installation, you may be prompted to run scripts to update agents or other portions of the VCS configuration. Continue through any additional procedures to ensure that the latest updates are applied.
VCS File Locations
The VCS installation procedure creates several directory structures.
Commands: /sbin, /usr/sbin, and /opt/VRTSvcs/bin
VCS engine and agent log files: /var/VRTSvcs/log
Configuration files: /etc and /etc/VRTSvcs/conf/config
Product documentation: /opt/VRTSvcs/doc
Note: For VCS 4.0, PDF files are located in /opt/VRTSvcsdc.
Installation log files: /opt/VRTS/install/logs
/etc/llthosts
Cluster host names with corresponding LLT node ID number
Communication Configuration Files
The installvcs utility creates these VCS communication configuration files:
/etc/llttab: The llttab file is the primary LLT configuration file and is used to set the cluster ID number, set system ID numbers, and specify the network device names used for the cluster interconnect.
/etc/llthosts: The llthosts file associates a system name with a unique VCS cluster node ID number for every system in the cluster. This file is the same on all systems in the cluster.
/etc/gabtab: This file contains the command line that is used to start GAB.
Cluster communication is described in detail later in the course.
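As a concrete illustration of the llthosts format described above (one "node-ID system-name" pair per line), the following sketch parses sample contents into a map. The sample text and the parser are illustrative only, not part of VCS.

```python
# Conceptual sketch (not Symantec code): parsing /etc/llthosts-style content,
# which maps an LLT node ID to a system name, one "ID name" pair per line.
sample_llthosts = """\
0 S1
1 S2
"""

def parse_llthosts(text):
    """Return {node_id: system_name} from llthosts-style content."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        node_id, name = line.split(None, 1)
        mapping[int(node_id)] = name
    return mapping

print(parse_llthosts(sample_llthosts))   # {0: 'S1', 1: 'S2'}
```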
The following cluster configuration files are added as a result of package installation: /etc/VRTSvcs/conf/config/types.cf /etc/VRTSvcs/conf/config/main.cf The installvcs utility modifies the main.cf file to configure the ClusterService service group, which includes the resources used to manage the Web-based local Cluster Management Console. VCS configuration files are discussed in detail throughout the course.
rpm -qa | grep -i vrts
View the VCS and communication configuration files.
Log on to the VCS local Cluster Management Console Web GUI using the virtual IP address specified during installation: http://IP_Address:8181/cmc
Viewing Status
View LLT status:
# lltconfig
llt is running
After installation is complete, you can check the status of VCS components. View VCS communications status on the cluster interconnect using LLT and GAB commands. This topic is discussed in more detail later in the course. For now, you can see that LLT and GAB are running using these commands:
lltconfig
llt is running
gabconfig -a
GAB Port Memberships
===============================================
Port a gen a36e003 membership 01
Port h gen fd57002 membership 01
View the cluster status:
hastatus -sum
-- SYSTEM STATE
-- System State
A S1 RUNNING
A S2 RUNNING
-- GROUP STATE
-- Group System
B ClusterService S1
B ClusterService S2
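The gabconfig output above has a regular shape: each "Port" line ends with a membership string whose digits are the member node IDs. The following sketch (illustrative only, not VCS code) shows how those digits map to node IDs.

```python
# Illustrative sketch: extracting port memberships from "gabconfig -a"-style
# output. The sample text mirrors the output shown above; the parsing logic
# is our own, not part of VCS.
sample = """\
GAB Port Memberships
===============================================
Port a gen a36e003 membership 01
Port h gen fd57002 membership 01
"""

def port_memberships(text):
    """Return {port: set_of_node_ids} from gabconfig-style output."""
    ports = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 6 and parts[0] == "Port" and parts[4] == "membership":
            # "membership 01" means node IDs 0 and 1 are members
            ports[parts[1]] = {int(d) for d in parts[5]}
    return ports

print(port_memberships(sample))  # {'a': {0, 1}, 'h': {0, 1}}
```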
Viewing Status
You can install the VCS Java-based Cluster Manager GUI during the cluster installation by selecting optional packages. You can also manually install Cluster Manager from the VRTScscm package on any supported system using the appropriate operating system installation utility. See the VERITAS Cluster Server Installation Guide for instructions specific to your platform. Installing the Java Console on Windows You can also install and use the VCS Java Console from a Windows workstation using the setup.exe file in the following directory on the CD:
Solaris: \windows\WindowsInstallers\WindowsClusterManager\EN
HP-UX: \windows\WindowsClusterManager\EN
Lesson Summary
Key Points
Use the VERITAS Common Product Installer to install VCS on UNIX systems. Familiarize yourself with the installed and running configuration.
Reference Materials
VERITAS Cluster Server Release Notes
VERITAS Cluster Server Installation Guide
http://support.veritas.com
http://vlicense.veritas.com
Lab diagram (systems train1 and train2 on the public network), showing the installation command by VCS version:
5.0: # ./installer -rsh
4.x: # ./installer
Pre-4.0: # ./installvcs
Labs and solutions for this lesson are located on the following pages. "Lab 4: Installing VCS," page A-13. "Lab 4 Solutions: Installing VCS," page B-13.
Course Overview
Lesson 1: High Availability Concepts Lesson 2: VCS Building Blocks Lesson 3: Preparing a Site for VCS Lesson 4: Installing VCS Lesson 5: VCS Operations Lesson 6: Preparing Services for VCS Lesson 7: VCS Configuration Methods Lesson 8: Online Configuration Lesson 9: Offline Configuration Lesson 10: Sharing Network Interfaces Lesson 11: Configuring Notification Lesson 12: Configuring VCS Response to Faults Lesson 13: Cluster Communications Lesson 14: System and Communication Faults Lesson 15: I/O Fencing Lesson 16: Troubleshooting
Topic
Managing Applications in a Cluster Environment
Common VCS Operations
Using the VCS Simulator
Key Considerations
To manage an application under VCS control: Use VCS to start and stop the application. or Direct VCS not to intervene:
1. Freeze the service group. 2. Perform administrative operations outside of VCS. 3. Unfreeze the service group to reenable VCS control.
You can mistakenly cause problems if you manipulate resources outside of VCS, such as forcing faults:
Causing failover and downtime
Preventing failover
VCS Management Tools
You can use any of the VCS interfaces to manage the cluster environment, provided that you have the proper VCS authorization. VCS user accounts are described in more detail in the VCS Configuration Methods lesson. The Web GUI in 5.0 is a component of Cluster Management Console and can run in two modes:
If a system is a CMC-managed host, the Web GUI runs in host mode, and the system runs a connector process that communicates with a CMC management server.
A system that is not configured as a CMC-managed host runs the Web GUI in local mode.
For details about the requirements for running the graphical user interfaces (GUIs), see the VERITAS Cluster Server Installation Guide and the VERITAS Cluster Server User's Guide.
Note: You cannot use the Simulator to manage a running cluster configuration.
Displaying Logs
VCS log file location: /var/VRTSvcs/log
HAD (engine) log: engine_A.log
Java GUI command log:
Is useful for learning the CLI Can be used to create batch files Is cleared when you log out of the GUI
Logs
Displaying Logs
The engine log is located in /var/VRTSvcs/log/engine_A.log. You can view this file with standard UNIX text-file utilities such as tail, more, or vi. You can also display the engine log in Cluster Manager to see detailed status information about activity in the cluster. You can also view the command log to see how the activities you perform using the Cluster Manager Java GUI are translated into VCS commands. You can use the command log as a resource for creating batch files to use when performing repetitive configuration or administration tasks. Note: The command log is not saved to disk; you can view commands only for the current session of the GUI.
hastatus
Example: Switching ClusterService The slide shows how you can use the GUI and CLI together to develop an understanding of how VCS responds to events in the cluster environment, and the effects on application services under VCS control.
Getting help:
Command-line syntax:
ha_command -help
man ha_command
Bundled Agents Reference Guide
Displaying Object Information
Displaying Resources Using the CLI
The following examples show how to display resource attributes and status.
Display values of attributes to ensure they are set properly.
hares -display webip
#Resource Attribute System Value
. . .
webip AutoStart global 1
webip Critical global 1
Determine which resources are non-critical.
hares -list Critical=0
VCSweb S1
VCSweb S2
Determine the virtual IP address for the Web GUI.
hares -value webip Address
#Resource Attribute System Value
webip Address global 10.10.27.93
Determine the state of a resource on each cluster system.
hares -state webip
#Resource Attribute System Value
webip State S1 OFFLINE
webip State S2 ONLINE
Displaying Service Group Information Using the CLI
The following examples show some common uses of the hagrp command for displaying service group information and status.
Display values of all attributes to ensure they are set properly.
hagrp -display ClusterService
#Group Attribute System Value
. . .
ClusterService AutoFailOver global 1
ClusterService AutoRestart global 1
. . .
Determine which service groups are frozen, and are therefore not able to be stopped, started, or failed over.
hagrp -list Frozen=1
WebSG S1
WebSG S2
Determine whether a service group is set to automatically start.
hagrp -value WebSG AutoStart
1
List the state of a service group on each system.
hagrp -state WebSG
#Group Attribute System Value
WebSG State S1 Offline
WebSG State S2 Offline
Resources are brought online in order from bottom to top. Persistent resources do not affect service group state.
Online
Bringing Service Groups Online
When a service group is brought online, resources are brought online starting with the lowest (child) resources and progressing up the resource dependency tree to the highest (parent) resources. In order to bring a failover service group online, VCS must verify that all nonpersistent resources in the service group are offline everywhere in the cluster. If any nonpersistent resource is online on another system, the service group is not brought online. A service group is considered online if all of its nonpersistent, autostart, and critical resources are online. An autostart resource is a resource whose AutoStart attribute is set to 1. A critical resource is a resource whose Critical attribute is set to 1. The state of persistent resources is not considered when determining the online or offline state of a service group because persistent resources cannot be taken offline.
Bringing a Service Group Online Using the CLI
To bring a service group online, use either form of the hagrp command:
hagrp -online group -sys system
hagrp -online group -any
The -any option, supported as of VCS 4.0, brings the service group online based on the group's failover policy. Failover policies are described in detail later in the course.
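The bottom-up ordering described above can be modeled as a depth-first traversal that visits children before parents. The resource names and the dependency map below are hypothetical, chosen only to illustrate the ordering.

```python
# Conceptual model of online ordering: resources come online child-first.
# The dependency map and resource names are hypothetical, not from the course.
deps = {                 # parent -> children (a parent requires its children)
    "webip": ["nic"],
    "webapp": ["webip", "mount"],
    "mount": ["diskgroup"],
}

def online_order(deps):
    """Return a child-before-parent ordering (depth-first post-order)."""
    order, seen = [], set()
    def visit(res):
        if res in seen:
            return
        seen.add(res)
        for child in deps.get(res, []):
            visit(child)          # bring children online first
        order.append(res)
    for res in deps:
        visit(res)
    return order

order = online_order(deps)
# Every child appears before its parent:
assert order.index("nic") < order.index("webip")
assert order.index("diskgroup") < order.index("mount") < order.index("webapp")
print(order)
```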
Resources are taken offline in order from top to bottom. A service group is offline when all nonpersistent resources are offline.
Offline
Taking Service Groups Offline
When a service group is taken offline, resources are taken offline starting with the highest (parent) resources in each branch of the resource dependency tree and progressing down the resource dependency tree to the lowest (child) resources. Persistent resources cannot be taken offline. Therefore, the service group is considered offline when all nonpersistent resources are offline.
Taking a Service Group Offline Using the CLI
To take a service group offline, use either form of the hagrp command:
hagrp -offline group -sys system
Provide the service group name and the name of a system where the service group is online.
hagrp -offline group -any
Provide the service group name. The -any switch, supported as of VCS 4.0, takes a failover service group offline on the system where it is online. All instances of a parallel service group are taken offline when the -any switch is used.
VCS:
1. Takes resources offline on system S1. 2. Brings resources online on system S2. Only resources online on system S1 are brought online on system S2.
Before
After
Switch
Switching Service Groups In order to ensure that failover can occur as expected in the event of a fault, test the failover process by switching the service group between systems within the cluster. Switching a Service Group Using the CLI To switch a service group, type:
hagrp -switch group -to system
Provide the service group name and the name of the system where the service group is to be brought online.
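The switch behavior described above (only resources that were online on S1 are brought online on S2) can be sketched as follows. The resource names and states are hypothetical.

```python
# Sketch of switch semantics: take the group offline on the source system,
# then bring online on the target only those resources that were online on
# the source. Names and states here are hypothetical illustrations.
group_state = {                      # resource -> state on the source system
    "nic": "ONLINE",
    "webip": "ONLINE",
    "webapp": "OFFLINE",             # e.g. manually taken offline on S1
}

def resources_to_online_on_target(state_on_source):
    """Return the set of resources to bring online on the target system."""
    return {r for r, s in state_on_source.items() if s == "ONLINE"}

print(sorted(resources_to_online_on_target(group_state)))  # ['nic', 'webip']
```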
Persistent freeze:
Remains in effect through VCS restarts Frozen=1
Temporary freeze:
Only in effect until VCS restarts TFrozen=1
You can cause a concurrency violation and possible data corruption if you bring an application online outside of VCS.
Freeze: hagrp -freeze
Freezing a Service Group
When you freeze a service group, VCS continues to monitor the resources, but it does not allow the service group (or its resources) to be taken offline or brought online. Failover is also disabled, even if a resource faults. You can also specify that the freeze is in effect even if VCS is stopped and restarted throughout the cluster.
Warning: When frozen, VCS does not take action on the service group even if you cause a concurrency violation by bringing the service online on another system outside of VCS.
Freezing and Unfreezing a Service Group Using the CLI
To freeze and unfreeze a service group temporarily, type:
hagrp -freeze group
hagrp -unfreeze group
To freeze a service group persistently, you must first open the configuration:
haconf -makerw
hagrp -freeze group -persistent
hagrp -unfreeze group -persistent
To determine if a service group is frozen, display the Frozen (for persistent) and TFrozen (for temporary) service group attributes for a service group.
hagrp -value group Frozen
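The freeze semantics above can be summarized in a small decision function. The attribute names (Frozen, TFrozen) match the text; the logic itself is an illustration, not VCS source.

```python
# Sketch of freeze semantics: while a group is frozen (Frozen or TFrozen),
# online/offline/failover requests are not acted on, but monitoring continues.
def allowed(operation, attrs):
    """Return True if the operation may proceed for this service group."""
    frozen = attrs.get("Frozen", 0) == 1 or attrs.get("TFrozen", 0) == 1
    if operation in ("online", "offline", "failover"):
        return not frozen
    return True                      # monitoring is never suspended

assert allowed("monitor", {"TFrozen": 1}) is True
assert allowed("failover", {"TFrozen": 1}) is False
assert allowed("online", {"Frozen": 1}) is False
assert allowed("online", {}) is True
```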
Online
Bringing Resources Online In normal day-to-day operations, you perform most management operations at the service group level. However, you may need to perform maintenance tasks that require one or more resources to be offline while others are online. Also, if you make errors during resource configuration, you can cause a resource to fail to be brought online. Bringing Resources Online Using the CLI To bring a resource online, type:
hares -online resource -sys system
Provide the resource name and the name of a system that is configured to run the service group.
Offline
Taking Resources Offline Taking resources offline should not be a normal occurrence. Taking resources offline causes the service group to become partially online, and availability of the application service is affected. If a resource needs to be taken offline, for example, for maintenance of underlying hardware, then consider switching the service group to another system. If multiple resources need to be taken offline manually, then they must be taken offline in resource dependency tree order, that is, from top to bottom. Taking a resource offline and immediately bringing it online may be necessary if, for example, the resource must reread a configuration file due to a change. Or you may need to take a database resource offline in order to perform an update that modifies the database files. Taking Resources Offline Using the CLI To take a resource offline, type:
hares -offline resource -sys system
Simulator Java Console The Simulator Java Console is provided to create and manage multiple Simulator configurations, which can run simultaneously. To start the Simulator Java Console, type /opt/VRTSvcs/bin/hasimgui & When the Simulator Java Console is running, a set of sample Simulator configurations is displayed, showing an offline status. You can start one or more existing cluster configurations and then launch an instance of the Cluster Manager Java Console for each running Simulator configuration. You can use the Cluster Manager Java Console to perform all the same tasks as an actual cluster configuration. Additional options are available for Simulator configurations to enable you to test various failure scenarios, including faulting resources and powering off systems. Note: In VCS versions before 5.0, you can remove a simulated cluster configuration by removing the corresponding entries in the simconfig and config.properties GUI configuration files and removing the directory structure for that cluster from the /opt/VRTScssim directory.
Creating a New Simulator Configuration
When you add a Simulator cluster configuration, a new directory structure is created and populated with sample files based on the criteria you specify. On UNIX systems, Simulator configurations are located in /opt/VRTScssim. On Windows, the Simulator repository is in C:\Program Files\VERITAS\VCS Simulator. Within the Simulator directory, each Simulator configuration has a directory corresponding to the cluster name. When the Simulator is installed, several sample configurations are placed in the sim_dir, such as:
SOL_ORACLE: A two-node Solaris cluster with an Oracle service group
LIN_NFS: A two-node Linux cluster with two NFS service groups
WIN_SQL_VVR_C1: One of two clusters in a global Windows cluster with a SQL service group
When you add a cluster:
The default types.cf file corresponding to the selected platform is copied from sim_dir/types to the sim_dir/cluster_name/conf/config directory.
A main.cf file is created based on the sim_dir/sample_clus/conf/config/main.cf file, using the cluster and system names specified when adding the cluster.
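The on-disk effect of adding a cluster can be sketched as follows, using a temporary directory in place of /opt/VRTScssim. The file contents and the helper name are placeholders, not the Simulator's actual implementation.

```python
# Sketch of what adding a Simulator cluster does on disk, using a temporary
# directory in place of /opt/VRTScssim. File contents are placeholders.
import os, shutil, tempfile

sim_dir = tempfile.mkdtemp()
# Pretend this template ships with the Simulator:
os.makedirs(os.path.join(sim_dir, "types"))
with open(os.path.join(sim_dir, "types", "types.cf"), "w") as f:
    f.write("// platform resource type definitions\n")

def add_cluster(sim_dir, name):
    """Create sim_dir/<name>/conf/config and copy in the types.cf template."""
    conf = os.path.join(sim_dir, name, "conf", "config")
    os.makedirs(conf)
    shutil.copy(os.path.join(sim_dir, "types", "types.cf"), conf)
    with open(os.path.join(conf, "main.cf"), "w") as f:
        f.write('include "types.cf"\ncluster %s (\n)\n' % name)
    return conf

conf = add_cluster(sim_dir, "myclus")
print(sorted(os.listdir(conf)))      # ['main.cf', 'types.cf']
```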
3. Verify the configuration file syntax.
4. Start the simulated cluster.
5. Launch the Java Console.
6. Log in with admin/password.
Using a Customized Configuration
You can copy existing main.cf and types.cf files into the conf directory of a simulated cluster to test or modify that configuration. For example, if you have a cluster implementation and want to test fault and failover behavior, you can create a new simulated cluster and copy the configuration files from the actual cluster. You can also create new resources or service groups within the simulated cluster, and then copy the modified configuration files back into a real cluster. This requires stopping and restarting VCS, as discussed later in the course.
Using Cluster Manager with the Simulator
After the Simulator is started, you can use the Cluster Manager Java GUI to connect to the simulated cluster. You can either launch Cluster Manager from the Simulator GUI or start Cluster Manager from Cluster Monitor.
Note: If you receive a message that the GUI is unable to connect to the Simulator, verify that the Simulator is running and check the port number.
hasim
# hasim -setupclus myclus -simport 16555 -wacport -1
# hasim -start myclus_sys1 -clus myclus
# VCS_SIM_PORT=16555 VCS_SIMULATOR_HOME=/opt/VRTScssim
# export VCS_SIM_PORT VCS_SIMULATOR_HOME
Simulator Command-Line Interface You can use the Simulator command-line interface (CLI) to add and manage simulated cluster configurations. While there are a few commands specific to Simulator activities, such as cluster setup shown in the slide, in general, the hasim command syntax follows the corresponding ha commands used to manage an actual cluster configuration. Use the following procedure to initially set up a Simulator cluster configuration. The corresponding commands are displayed in the slide. Note: This procedure assumes you have already set the PATH and VCS_SIMULATOR_HOME environment variables.
1. Change to the /opt/VRTScssim directory if you want to view the new structure created when adding a cluster.
2. Add the cluster configuration, specifying a unique cluster name and port. For local clusters, specify -1 as the WAC port.
3. Start the cluster on the first system.
4. Set VCS_SIMULATOR_HOME to /opt/VRTScssim.
5. Set VCS_SIM_PORT to the value you specified when adding the cluster.
6. If you are simulating a global cluster, set the VCS_SIM_WACPORT environment variable to the value you specified when adding the cluster.
Now you can use hasim commands to test or modify the configuration.
Lesson Summary
Key Points
Use VCS tools to manage applications under VCS control. The VCS Simulator can be used to practice managing resources and service groups.
Reference Materials
VERITAS Architect Network (VAN): http://www.symantec.com/van
VERITAS Cluster Server Release Notes
VERITAS Cluster Server User's Guide
Labs and solutions for this lesson are located on the following pages. "Lab 5: Using the VCS Simulator," page A-27. "Lab 5 Solutions: Using the VCS Simulator," page B-43.
Labs and solutions for this lesson are located on the following pages. "Lab 5: Using the VCS Simulator," page A-27. "Lab 5 Solutions: Using the VCS Simulator," page B-43.
Lesson Introduction
Lesson 1: High Availability Concepts Lesson 2: VCS Building Blocks Lesson 3: Preparing a Site for VCS Lesson 4: Installing VCS Lesson 5: VCS Operations Lesson 6: VCS Configuration Methods Lesson 7: Preparing Services for VCS Lesson 8: Online Configuration Lesson 9: Offline Configuration Lesson 10: Sharing Network Interfaces Lesson 11: Configuring Notification Lesson 12: Configuring VCS Response to Faults Lesson 13: Cluster Communications Lesson 14: System and Communication Faults Lesson 15: I/O Fencing Lesson 16: Troubleshooting
Overview of Configuration Methods: Compare and contrast VCS configuration methods.
Online Configuration: Describe the online configuration method.
Offline Configuration: Describe the offline configuration method.
Controlling Access to VCS: Set user account privileges to control access to VCS.
Slide diagram: S2 starts with no configuration in memory (the CURRENT_DISCOVER_WAIT state), then builds its cluster configuration from S1 (the REMOTE_BUILD state).
8 HAD on S1 receives the request from S2 and responds.
9 HAD on S1 sends a copy of the cluster configuration over the cluster interconnect to S2. The S1 system is now in the VCS running state, meaning VCS determines that there is a running configuration in memory on system S1. The S2 system is now in the VCS remote build state, meaning VCS is building the cluster configuration in memory on the S2 system from the cluster configuration that is in a running state on S1.
10 When the remote build process completes, HAD on S2 copies the cluster configuration into the local main.cf file. If S2 has valid local configuration files (main.cf and types.cf), these are saved to new files with a name including a date and time stamp, before the active configuration is written to the main.cf file on disk.
Note: If the checksum of the configuration in memory matches the main.cf on disk, no write to disk occurs. The startup process is repeated on each system until all members have identical copies of the cluster configuration in memory and matching main.cf files on local disks. Synchronization is maintained by data transfer through LLT and GAB.
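The checksum note above (no write occurs when the in-memory configuration matches main.cf on disk) can be sketched like this. The use of MD5 here is an assumption for illustration; it is not a claim about VCS's actual mechanism.

```python
# Sketch of "write only if changed": compare a checksum of the in-memory
# configuration against the on-disk copy before rewriting it.
import hashlib, os, tempfile

def dump_if_changed(in_memory_text, path):
    """Write in_memory_text to path only when the contents differ."""
    new_sum = hashlib.md5(in_memory_text.encode()).hexdigest()
    if os.path.exists(path):
        with open(path, "rb") as f:
            if hashlib.md5(f.read()).hexdigest() == new_sum:
                return False          # checksums match: no write occurs
    with open(path, "w") as f:
        f.write(in_memory_text)
    return True

path = os.path.join(tempfile.mkdtemp(), "main.cf")
assert dump_if_changed("cluster vcs (\n)\n", path) is True    # first write
assert dump_if_changed("cluster vcs (\n)\n", path) is False   # unchanged
```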
Stopping VCS
Before stopping: haconf -dump -makero
Stopping VCS
There are several methods of stopping the VCS engine (had and hashadow daemons) on a cluster system. The options you specify to hastop determine where VCS is stopped and how resources under VCS control are affected.
VCS Shutdown Examples
The four examples show the effect of using different options with the hastop command:
1. The -local option causes the service group to be taken offline on S1 and stops VCS services (had) on S1.
2. The -local -evacuate options cause the service group on S1 to be migrated to S2 and then stop VCS services (had) on S1.
3. The -all -force options stop VCS services (had) on both systems and leave the application services running. Although they are no longer protected highly available services and cannot fail over, the services continue to be available to users. Use caution with this option: VCS does not warn you that the configuration is open when you stop VCS with the -force option.
4. The -all option stops VCS services (had) on all systems and takes the service groups offline.
Behavior
- Process all hastop commands.
- Reject all hastop commands.
- Prompt for hastop -all only.
- Ignore hastop -all; process all other hastop commands.
- Prompt for hastop -local only.
- Prompt for all hastop commands.
Note: hastop -force is not subject to EngineShutdown settings.
Modifying VCS Shutdown Behavior Use the EngineShutdown attribute to define VCS behavior when you run the hastop command. Note: VCS does not consider this attribute when hastop is issued with the -force option. Configure one of the values shown in the table in the slide for the EngineShutdown attribute depending on the desired functionality for the hastop command.
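One plausible reading of the behaviors listed above, expressed as a decision function. The EngineShutdown value names (Enable, Disable, DisableClusStop, PromptClusStop, PromptLocal, PromptAlways) are taken from the VCS documentation rather than from this text, so verify them against your release; the slide confirms that -force always bypasses the attribute.

```python
# Illustrative decision table for hastop handling. The value names are an
# assumption from the VCS documentation; the behaviors mirror the list above.
def shutdown_action(engine_shutdown, all_systems=False, force=False):
    """Return how an hastop request is handled: process/reject/ignore/prompt."""
    if force:
        return "process"              # hastop -force bypasses EngineShutdown
    if engine_shutdown == "Enable":
        return "process"              # process all hastop commands
    if engine_shutdown == "Disable":
        return "reject"               # reject all hastop commands
    if engine_shutdown == "DisableClusStop":
        return "ignore" if all_systems else "process"
    if engine_shutdown == "PromptClusStop":
        return "prompt" if all_systems else "process"
    if engine_shutdown == "PromptLocal":
        return "prompt" if not all_systems else "process"
    return "prompt"                   # PromptAlways: prompt for all commands

assert shutdown_action("Disable", force=True) == "process"
assert shutdown_action("DisableClusStop", all_systems=True) == "ignore"
```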
Configuration Methods
Online configuration tools (HAD is running):
- Java and Web graphical user interfaces (GUIs)
- VCS CLI and shell scripts
Online Configuration
How VCS Changes the Online Cluster Configuration
When you use Cluster Manager to modify the configuration, the GUI communicates with had on the cluster system to which Cluster Manager is connected.
Note: Cluster Manager configuration requests are shown conceptually as ha commands in the diagram, but they are implemented as system calls.
The had daemon communicates the configuration change to had on all other nodes in the cluster, and each had daemon changes the in-memory configuration. When the command to save the configuration is received from Cluster Manager, had communicates this command to all cluster systems, and each system's had daemon writes the in-memory configuration to the main.cf file on its local disk. The VCS command-line interface is an alternate online configuration tool. When you run ha commands, had responds in the same fashion.
Note: When two administrators are changing the cluster configuration simultaneously, each administrator sees all changes as they are being made.
Slide: opening the configuration sets ReadOnly=0; commands such as hares -modify change the shared cluster configuration in memory, so the in-memory configuration no longer matches main.cf on disk.
Opening the Cluster Configuration You must open the cluster configuration to add service groups and resources, make modifications, and perform certain operations. The state of the configuration is maintained in an internal attribute (ReadOnly). If you try to stop VCS with the configuration open, a warning is displayed that the configuration is open. This helps ensure that you remember to save the configuration to disk so you do not lose any changes you may have made while the configuration was open. You can override this protection, as described later in this lesson.
ReadOnly is still 0 and the configuration is still open. The shared cluster configuration in memory is copied to the main.cf file on each system.
Saving the Cluster Configuration When you save the cluster configuration, VCS copies the configuration in memory to the main.cf file in the /etc/VRTSvcs/conf/config directory on all running cluster systems. At this point, the configuration is still open. You have only written the in-memory configuration to disk and have not closed the configuration. If you save the cluster configuration after each change, you can view the main.cf file to see how the in-memory modifications are reflected in the main.cf file.
haconf -dump -makero: (1) sets ReadOnly=1, and (2) writes the shared cluster configuration in memory to the main.cf file on each system.
Closing the Cluster Configuration When the administrator saves and closes the configuration, VCS: 1 Changes the state of the configuration to closed (ReadOnly=1) 2 Writes the configuration in memory to the main.cf file
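The open/save/close cycle described above can be modeled with a small state machine. This is a conceptual sketch, not VCS internals; attribute and command names match the text.

```python
# Minimal model of the cycle above: haconf -makerw opens the configuration
# (ReadOnly=0), haconf -dump writes memory to disk, and haconf -dump -makero
# also closes the configuration (ReadOnly=1).
class ClusterConfig:
    def __init__(self):
        self.read_only = 1            # configuration starts closed
        self.in_memory = "cluster vcs (\n)\n"
        self.on_disk = self.in_memory

    def makerw(self):
        self.read_only = 0            # haconf -makerw

    def dump(self, makero=False):
        self.on_disk = self.in_memory # write memory to main.cf on all nodes
        if makero:
            self.read_only = 1        # haconf -dump -makero also closes it

cfg = ClusterConfig()
cfg.makerw()
cfg.in_memory += "// change\n"        # e.g. the effect of hares -modify
assert cfg.in_memory != cfg.on_disk   # memory no longer matches disk
cfg.dump()                            # haconf -dump: saved but still open
assert cfg.read_only == 0
cfg.dump(makero=True)                 # saved and closed
assert cfg.read_only == 1 and cfg.in_memory == cfg.on_disk
```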
How VCS Protects the Cluster Configuration When the cluster configuration is open, you cannot stop VCS without overriding the warning that the configuration is open. If you ignore the warning and stop VCS while the configuration is open, you may lose configuration changes. If you forget to save the configuration and shut down VCS, the configuration in the main.cf file on disk may not be the same as the configuration that was in memory before VCS was stopped. You can configure VCS to automatically back up the in-memory configuration to disk to minimize the risk of losing modifications made to a running cluster. This is covered later in this lesson.
Automatic Configuration Backups
You can set the BackupInterval cluster attribute to automatically save the in-memory configuration to disk periodically. When set to a value greater than or equal to three minutes, VCS automatically saves the configuration in memory to the main.cf.autobackup file. If necessary, you can copy the main.cf.autobackup file to main.cf and restart VCS to build the configuration in memory at the point in time of the last backup. Ensure that you understand the VCS startup sequence described in the Starting and Stopping VCS section before you attempt this type of recovery.
- Requires root to modify main.cf
- Requires restarting VCS
- Enables services to continue running, but they are not highly available while VCS is being restarted
Offline Configuration
Characteristics
In some circumstances, you can simplify cluster implementation or configuration tasks by directly modifying the VCS configuration files. This method requires you to stop and restart VCS in order to build the new configuration in memory. The benefits of using an offline configuration method are that it:
Offers a very quick way of making major changes or getting an initial configuration up and running
Is efficient for creating many similar resources or service groups
Provides a means for deploying a large number of similar clusters
One consideration when choosing to perform offline configuration is that you must be logged in to a cluster system as root. This section describes situations where offline configuration is useful. The next section shows how to stop and restart VCS to propagate the new configuration throughout the cluster. The Offline Configuration of Service Groups lesson provides detailed offline configuration procedures and examples.
Slide diagram: Cluster1 (systems S1 and S2) runs database service groups DB1 and DB2; its main.cf is copied and edited for Cluster2 (systems S3 and S4), which runs DB3 and DB4. Excerpt from main.cf:
group DB1 (
    SystemList = {S1=1,S2=2}
    AutoStartList = {S1}
    )
Example 1: Reusing a Cluster Configuration One example where offline configuration is appropriate is when your high availability environment is expanding and you are adding clusters with similar configurations. In the example displayed in the diagram, the original cluster consists of two systems, each system running a database instance. Another cluster with essentially the same configuration is being added, but it is managing different Oracle databases. You can copy the configuration files from the original cluster, make the necessary changes, and then restart VCS as described later in this lesson. This method may be more efficient than creating each service group and resource using Cluster Manager or the VCS command-line interface.
Example 2: Reusing a Service Group Configuration Another example of using offline configuration is when you want to add a service group with a similar set of resources as another service group in the same cluster. In the example shown in the slide, the portion of the main.cf file that defines the TestSG service group is copied and edited as necessary to define a new AppSG service group.
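The copy-and-edit step described above can be sketched as a simple text operation on a main.cf excerpt. The group contents below are hypothetical; a real edit would also rename the resources inside the block.

```python
# Sketch of the offline edit above: copy one service group's definition out
# of main.cf-style text and rename it. The minimal helper is illustrative
# only; it keys on the "group Name (" header line.
main_cf = """\
group TestSG (
    SystemList = { S1 = 1, S2 = 2 }
    AutoStartList = { S1 }
    )
"""

def clone_group(text, old, new):
    """Return text for a copy of service group `old` renamed to `new`."""
    return text.replace("group %s (" % old, "group %s (" % new, 1)

app_sg = clone_group(main_cf, "TestSG", "AppSG")
print(app_sg.splitlines()[0])        # group AppSG (
```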
Simplifying VCS Administrative Access
VCS 4.1 and 5.0
The halogin command is provided in VCS 4.1 and later to save authentication information so that users do not have to enter credentials every time a VCS command is run. The command stores authentication information in the user's home directory. You must set the VCS_HOST environment variable to the name of the node from which you are running VCS commands to use halogin.
Note: The effect of halogin applies only for that shell session.
VCS 3.5 and 4.0
For releases prior to 4.1, halogin is not supported. When logged in to UNIX as a nonroot account, the user is prompted to enter a VCS account name and password every time a VCS command is entered. To enable nonroot users to more easily administer VCS, you can set the AllowNativeCliUsers cluster attribute to 1.
haclus -modify AllowNativeCliUsers 1
When set, VCS maps the UNIX user name to the same VCS account name to determine whether the user is valid and has the proper privilege level to perform the operation. You must explicitly create each VCS account name to match the UNIX user names and grant the appropriate privilege level.
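For example, to let a hypothetical UNIX account named dbadmin run VCS commands in a 3.5/4.0 cluster, you might enable the mapping and create a matching VCS account (hauser -add prompts for a password):

```
haconf -makerw
haclus -modify AllowNativeCliUsers 1
hauser -add dbadmin
haconf -dump -makero
```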
Cluster Operator: All cluster, service group, and resource-level operations
Cluster Guest: Read-only access; new users are given Cluster Guest authorization by default
Group Administrator: All operations for the specified service group, except deletion
Group Operator: Bring service groups and resources online and take them offline; temporarily freeze or unfreeze service groups
VCS User Account Privileges You can ensure that the different types of administrators in your environment have a VCS authority level to affect only those aspects of the cluster configuration that are appropriate to their level of responsibility. For example, if you have a DBA account that is authorized to take a database service group offline or switch it to another system, you can make a VCS Group Operator account for the service group with the same account name. The DBA can then perform operator tasks for that service group, but cannot affect the cluster configuration or other service groups. If you set AllowNativeCliUsers to 1, then the DBA logged on with that account can also use the VCS command line to manage the corresponding service group. Setting VCS privileges is described in the next section.
To grant privileges on specific service groups:
hauser -addpriv user privilege -group service_group
Creating Cluster User Accounts
VCS users are not the same as UNIX users except when running VCS in secure mode. If you have not configured VxSS security in the cluster, VCS maintains a set of user accounts separate from UNIX accounts. In this case, even if the same user exists in both VCS and UNIX, the VCS account can be given a range of rights that does not necessarily correspond to the user's UNIX system privileges. The slide shows how to use the hauser command to create users and set privileges. You can also add and remove privileges with the -addpriv and -deletepriv options to hauser. In nonsecure mode, VCS user accounts are stored in the main.cf file in encrypted format. If you use a GUI or the CLI to set up a VCS user account, passwords are encrypted automatically. If you edit the main.cf file directly, you must encrypt the password using the vcsencrypt command.
Note: In nonsecure mode, if you change a UNIX account, the change is not automatically reflected in the VCS configuration. You must manually modify accounts in both places if you want them to be synchronized.
Modifying User Accounts
Use the hauser command to make changes to a VCS user account:
Change the password for an account: hauser -update user_name
Delete a user account: hauser -delete user_name
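Putting this together, a sketch of creating an operator account for a service group; the dbadmin account name and the DemoSG group are illustrative, and the exact privilege keywords may vary slightly by VCS version:

```
haconf -makerw
hauser -add dbadmin
hauser -addpriv dbadmin Operator -group DemoSG
haconf -dump -makero
```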
Lesson Summary
Key Points
Online configuration enables you to keep VCS running while making configuration changes. Offline configuration is best suited for large-scale modifications.
Reference Materials
VERITAS Cluster Server User's Guide VERITAS Cluster Server Command Line Quick Reference
train1
# hastop -all -force
train2
Labs and solutions for this lesson are located on the following pages. "Lab 6: Starting and Stopping VCS," page A-33. "Lab 6 Solutions: Starting and Stopping VCS," page B-55.
Course Overview
Lesson 1: High Availability Concepts Lesson 2: VCS Building Blocks Lesson 3: Preparing a Site for VCS Lesson 4: Installing VCS Lesson 5: VCS Operations Lesson 6: VCS Configuration Methods Lesson 7: Preparing Services for VCS Lesson 8: Online Configuration Lesson 9: Offline Configuration Lesson 10: Sharing Network Interfaces Lesson 11: Configuring Notification Lesson 12: Configuring VCS Response to Faults Lesson 13: Cluster Communications Lesson 14: System and Communication Faults Lesson 15: I/O Fencing Lesson 16: Troubleshooting
Topic
Preparing Applications for VCS: Prepare applications for the VCS environment.
Performing One-Time Configuration Tasks: Perform one-time configuration tasks.
Testing the Application Service: Test the application services before placing them under VCS control.
Stopping and Migrating a Service: Stop resources and manually migrate a service.
Validating the Design Worksheet: Validate the design worksheet using configuration information.
Identifying Components
Application: Start, stop, and monitor processes; file locations
Network: IP addresses and network interfaces (IP and NIC resources)
Storage: Disk group and volume devices; file systems; mount point directories
Identifying Components The first step in preparing services to be managed by VCS is to identify the components required to support the services. These components should be itemized in your design worksheet and may include the following, depending on the requirements of your application services: Shared storage resources: Disks or components of a logical volume manager, such as Volume Manager disk groups and volumes File systems to be mounted Directory mount points Network-related resources: IP addresses Network interfaces Application-related resources: Identical installation and configuration procedures Procedures to manage and monitor the application The location of application binary and data files The following sections describe the aspects of these components that are critical to understanding how VCS manages resources.
Start, verify, and stop services on one system at a time.
Documenting Attributes
As you configure components:
Use a design worksheet to document details needed to configure VCS resources. Note differences among systems, such as network interface device names.
Resource Definition (sample values):
Service Group Name: DemoSG
Resource Name: DemoIP
Resource Type: IP
Device: eri0
Address: 10.10.21.198
NetMask: 255.0.0.0
Optional Attributes
Documenting Attributes
In order to configure the operating system resources you have identified as requirements for application services, you need the detailed configuration information used when initially configuring and testing the services. You can use a design diagram and worksheet while performing one-time configuration tasks and testing to:
Show the relationships between the resources, which determine the order in which you configure, start, and stop resources.
Document the values needed to configure VCS resources after testing is complete.
Note: If your systems are not configured identically, you must note those differences in the design worksheet. The Online Configuration of Service Groups lesson shows how you can configure a resource with different attribute values for different systems.
Resource Definition (sample values):
Service Group Name: DemoSG
Resource Name: DemoMount
Resource Type: Mount
MountPoint: /demo
BlockDevice: /dev/vx/dsk. . .
FSType: vxfs
FsckOpt: -y
Required Attributes
Checking Resource Attributes Verify that the resources specified in your design worksheet are appropriate and complete for your platform. Refer to the VERITAS Cluster Server Bundled Agents Reference Guide before you begin configuring resources. The examples displayed in the slides in this lesson are based on the Solaris operating system. If you are using another platform, your resource types and attributes may be different.
mkfs args vxfs /dev/vx/rdsk/DemoDG/DemoVol
mkdir /Demo    (run on each system)
Volume Manager example (Solaris, AIX, HP-UX, Linux)
Configuring Shared Storage
The diagram shows the procedure for configuring shared storage on the initial system. In this example, Volume Manager is used to manage shared storage on a Solaris system.
Note: Although examples used throughout this course are based on VERITAS Volume Manager, VCS also supports other volume managers. VxVM is shown for simplicity; objects and commands are essentially the same on all platforms. The agents for other volume managers are described in the VERITAS Cluster Server Bundled Agents Reference Guide.
Preparing shared storage, such as creating disk groups, volumes, and file systems, is performed once, from one system. Then you must create mount point directories on each system. The options to mkfs may differ depending on platform type, as displayed in the following examples.
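A minimal sketch of the one-time storage preparation on the first system, assuming VxVM and Solaris-style mkfs syntax; the disk name c1t1d0 and the 2g volume size are illustrative:

```
vxdg init DemoDG DemoDG01=c1t1d0        # create the disk group (one system only)
vxassist -g DemoDG make DemoVol 2g      # create the volume
mkfs -F vxfs /dev/vx/rdsk/DemoDG/DemoVol
mkdir /Demo                             # repeat on every system in the cluster
```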
Follow application installation and configuration guidelines. Identify startup, monitor, and shutdown procedures.
Configuring the Application
You must ensure that the application is installed and configured identically on each system that is a startup or failover target, so that you can manually test the application after all dependent resources are configured and running. Depending on the application requirements, you may need to:
Create user accounts.
Configure environment variables.
Apply licenses.
Set up configuration files.
This testing ensures that you have correctly identified the information used by the VCS agent scripts to control the application.
Note: The shutdown procedure should be a graceful stop, which performs any cleanup operations.
S1: Start up all resources in dependency order (shared storage, virtual IP address, application software); test the application; stop resources.
S2 through Sn: Bring up resources; test the application; stop resources.
Do not automount file systems controlled by VCS. Solaris example: verify that /etc/vfstab has no entries for these file systems.
Bringing Up Shared Storage Resources
Verify that shared storage resources are configured properly and accessible. The examples shown in the slide are based on using Volume Manager.
1 Import the disk group.
2 Start the volume.
3 Mount the file system.
Mount the file system manually for the purposes of testing the application service. Do not configure the operating system to automatically mount any file system that will be controlled by VCS. On Solaris systems, ensure that the file system is not added to /etc/vfstab, or controlled by the automounter. VCS must control where the file system is mounted. Examples of mount commands are provided for each platform.
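The three bring-up steps can be sketched as follows, using the disk group and volume names from this lesson; the Solaris/HP-UX mount syntax is shown, and Linux would use mount -t vxfs instead:

```
vxdg import DemoDG
vxvol -g DemoDG start DemoVol
mount -F vxfs /dev/vx/dsk/DemoDG/DemoVol /Demo
```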
Virtual IP Addresses
Diagram: A client requests http://eweb.com. DNS resolves eweb.com to 10.10.21.198. On S1 (Solaris), the virtual IP address is configured by the IP resource on eri0:1 (ifconfig eri0 addif 10.10.21.198 up), while the administrative IP address (10.10.21.8) is configured on eri0 at boot. S2 is the failover target.
Virtual IP Addresses
The example in the slide demonstrates how users access services through a virtual IP address that is specific to an application. In this scenario, VCS is managing a Web server that is accessible to network clients over a public network.
1 A network client requests a Web page from http://eweb.com.
2 The DNS server translates the host name to the virtual IP address of the Web server.
3 The virtual IP address is managed and monitored by a VCS IP resource in the Web service group. The virtual IP address is associated with the next virtual network interface for eri0, which is eri0:1 in this example.
4 The system that has the service group online accepts the incoming request on the virtual IP address.
Note: The administrative IP address is associated with a physical network interface on a specific system and is configured during system startup.
Diagram (after failover): The client request to http://eweb.com is now accepted by S2. The IP resource configures the virtual IP address on eri0:1 of S2 (ifconfig eri0 addif 10.10.21.198 up); S2's administrative IP address (10.10.21.9) is configured on eri0 at boot. S1 has failed.
Virtual IP Address Migration
The diagram in the slide shows what happens if the system running the Web service group (S1) fails.
1 The IP address is no longer available on the network. Network clients may receive errors indicating that Web pages are not accessible.
2 VCS on the running system (S2) detects the failure and starts the service group.
3 The IP resource is brought online, which configures the same virtual IP address on the next available virtual network interface alias, eri0:1 in this example. This virtual IP address floats, or migrates, with the service. It is not tied to a system.
4 The network client Web request is now accepted by the S2 system.
Note: The admin IP address on S2 is also configured during system startup. This address is unique and associated with only this system, unlike the virtual IP address.
Configuring Application IP Addresses Configure the application IP addresses associated with specific application services to ensure that clients can access the application service using the specified address. Application IP addresses are configured as virtual IP addresses. On most platforms, the devices used for virtual IP addresses are defined as interface:number. The lab exercise shows the platform-specific commands and files required to configure a virtual IP address for an application service.
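For example, the Solaris command from the course diagram adds the virtual address as a logical interface on the public NIC; a Linux interface alias (the eth0:1 device name is hypothetical) serves the same purpose:

```
ifconfig eri0 addif 10.10.21.198 up       # Solaris
ifconfig eth0:1 10.10.21.198 up           # Linux (alias device)
```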
Starting the Application
When all dependent resources are available, you can start the application software. VCS must be able to start and stop the application using the same methods you use to control the application manually. Do not configure the operating system to automatically start the application during system boot.
vxdg list DemoDG
dd if=/dev/vx/rdsk/DemoDG/DemoVol of=/dev/null count=1 bs=128
mount | grep /Demo
ping 10.10.21.8
ifconfig -a | grep 10.10.21.198
ps -ef | grep "/opt/orderproc"    (Solaris)
Verifying Resources You can perform some simple steps, such as those shown in the slide, to verify that each component needed for the application service to function is operating at a basic level. This helps you identify any potential configuration problems before you test the service as a whole, as described in the Testing the Integrated Components section.
Testing the Integrated Components When all components of the service are running, test the service in situations that simulate real-world use of the service. For example, if you have an application with a back-end database, you can: 1 Start the database (and listener process). 2 Start the application. 3 Connect to the application from the public network using the client software to verify name resolution to the virtual IP address. 4 Perform user tasks, as applicable; perform queries, make updates, and run reports. Another example that illustrates how you can test your service is NFS. If you are preparing to configure a service group to manage an exported file system, verify that you can mount the exported file system from a client on the network. This is described in more detail later in the course.
Stop the volume: vxvol -g DemoDG stop DemoVol
Deport the disk group: vxdg deport DemoDG
Bring down the virtual IP address (Linux example): ifdown eth0:1
Diagram: the file system on shared storage is migrated from S1 to S2.
Manually Migrating an Application Service
After you have verified that the application service works properly on one system, manually migrate the service between all intended target systems. Performing these operations enables you to:
Ensure that your operating system and application resources are properly configured on all potential target cluster systems.
Validate or complete your design worksheet to document the information required to configure VCS to manage the services.
Perform the same type of testing used to validate the resources on the initial system, including real-world scenarios such as client access from the network.
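A sketch of the manual migration, assuming the resources used in this lesson: stop components in reverse dependency order on S1, then bring them up in dependency order on S2 and repeat the tests. The Solaris removeif/addif forms are shown; other platforms differ.

```
# On S1 (stop the application first):
umount /Demo
ifconfig eri0 removeif 10.10.21.198
vxvol -g DemoDG stop DemoVol
vxdg deport DemoDG
# On S2 (start the application last):
vxdg import DemoDG
vxvol -g DemoDG start DemoVol
mount -F vxfs /dev/vx/dsk/DemoDG/DemoVol /Demo
ifconfig eri0 addif 10.10.21.198 up
```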
Resource Dependency Definition for the DemoSG service group:
DemoVol (Volume) requires DemoDG (DiskGroup)
DemoMount (Mount) requires DemoVol (Volume)
DemoIP (IP) requires DemoNIC (NIC)
DemoProcess (Process) requires DemoMount (Mount)
DemoProcess (Process) requires DemoIP (IP)
Documenting Resource Dependencies Ensure that the steps you perform to bring resources online and take them offline while testing the service are accurately reflected in the design worksheet. Compare the worksheet with service group diagrams you have created or that have been provided to you. The slide shows the resource dependency definition for the application used as an example in this lesson.
Validating Service Group Attributes Check the service group attributes in your design worksheet to ensure that the appropriate startup and failover systems are listed. Other service group attributes may be included in your design worksheet, according to the requirements of each service. Service group definitions consist of the attributes of a particular service group. These attributes are described in more detail later in the course.
Lesson Summary
Key Points
Prepare each component of a service and document attributes. Test services in preparation for configuring VCS service groups.
Reference Materials
VERITAS Cluster Server Bundled Agents Reference Guide VERITAS Cluster Server User's Guide
NIC IP Address: see the next slide for classroom values.
Labs and solutions for this lesson are located on the following pages. "Lab 7: Preparing Application Services," page A-37. "Lab 7 Solutions: Preparing Application Services," page B-61.
Lesson Introduction
Lesson 1: High Availability Concepts Lesson 2: VCS Building Blocks Lesson 3: Preparing a Site for VCS Lesson 4: Installing VCS Lesson 5: VCS Operations Lesson 6: VCS Configuration Methods Lesson 7: Preparing Services for VCS Lesson 8: Online Configuration Lesson 9: Offline Configuration Lesson 10: Sharing Network Interfaces Lesson 11: Configuring Notification Lesson 12: Configuring VCS Response to Faults Lesson 13: Cluster Communications Lesson 14: System and Communication Faults Lesson 15: I/O Fencing Lesson 16: Troubleshooting
Topic
Online Service Group Configuration Procedure Adding Resources Solving Common Configuration Errors
Testing the Service Group Test the service group to ensure that it is correctly configured.
main.cf:
group DemoSG (
    SystemList = { S1 = 0, S2 = 1 }
    AutoStartList = { S1 }
    )
Adding a Service Group Using the GUI The minimum required information to create a service group is: Enter a unique name. Using a consistent naming scheme helps identify the purpose of the service group and all associated resources. Specify the list of systems on which the service group can run. This is defined in the SystemList attribute for the service group, as displayed in the excerpt from the sample main.cf file. A priority number is associated with each system to determine the order systems are selected for failover. The lower-numbered system is selected first. The Startup box specifies that the service group starts automatically when VCS starts on the system, if the service group is not already online elsewhere in the cluster. This is defined by the AutoStartList attribute of the service group. In the example displayed in the slide, the S1 system is selected as the system on which DemoSG is started when VCS starts up. The Service Group Type selection is Failover by default. If you save the configuration after creating the service group, you can view the main.cf file to see the effect of had modifying the configuration and writing the changes to the local disk. Note: You can click the Show Command button to see the commands that are run when you click OK.
Adding a Service Group Using the CLI You can also use the VCS command-line interface to modify a running cluster configuration. The next example shows how to use hagrp commands to add the DemoSG service group and modify its attributes.
haconf -makerw
hagrp -add DemoSG
hagrp -modify DemoSG SystemList S1 0 S2 1
hagrp -modify DemoSG AutoStartList S1
haconf -dump -makero
The corresponding main.cf excerpt for DemoSG is shown in the slide. Notice that the main.cf definition for the DemoSG service group does not include the Parallel attribute. When an attribute is set to its default value, the attribute is not written to the main.cf file. To display all values for all attributes:
In the GUI, select the object (resource, service group, system, or cluster), click the Properties tab, and click Show all attributes.
From the command line, use the -display option of the corresponding ha command. For example:
hagrp -display DemoSG
See the command-line reference card provided with this course for a list of commonly used ha commands.
Add resources in order of dependency, starting at the bottom. Set each resource to non-critical until testing has completed. Configure all required attributes. Enable the resource. Bring each resource online before adding the next resource.
Adding Resources
Online Resource Configuration Procedure Add resources to a service group in the order of resource dependencies starting from the child resource (bottom up). This enables each resource to be tested as it is added to the service group. Adding a resource requires you to specify: The service group name The unique resource name If you prefix the resource name with the service group name, you can more easily identify the service group to which it belongs. When you display a list of resources from the command line using the hares -list command, the resources are sorted alphabetically. The resource type Attribute values Use the procedure shown in the diagram to configure a resource. Notes: It is recommended that you set each resource to be non-critical during initial configuration. This simplifies testing and troubleshooting in the event that you have specified incorrect configuration information. If a resource faults due to a configuration error, the service group does not fail over if resources are noncritical. Enabling a resource signals the agent to start monitoring the resource.
main.cf:
NIC DemoNIC (
    Critical = 0
    Device = eri0
    )
Adding Resources Using the GUI: NIC Example The NIC resource has only one required attribute, Device, for all platforms other than HP-UX, which also requires NetworkHosts unless PingOptimize is set to 0. Optional attributes for NIC vary by platform. Refer to the VERITAS Cluster Server Bundled Agents Reference Guide for a complete definition. These optional attributes are common to all platforms. NetworkType: Type of network, Ethernet (ether) PingOptimize: Number of monitor cycles to detect if the configured interface is inactive A value of 1 optimizes broadcast pings and requires two monitor cycles. A value of 0 performs a broadcast ping during each monitor cycle and detects the inactive interface within the cycle. The default is 1. Note: On the HP-UX platform, if the PingOptimize attribute is set to 1, the monitor entry point does not send broadcast pings. NetworkHosts: The list of hosts on the network that are used to determine if the network connection is alive It is recommended that you enter the IP address of the host rather than the host name to prevent the monitor cycle from timing out due to DNS problems. Example device attribute values: Solaris: eri0; HP-UX: lan0; Linux: eth0; AIX: en0
Persistent Resources
Persistent resources:
Are of type on-only or none Are online when enabled Cannot be taken offline Do not affect service group state; for example: A service group with only a NIC shows the state as offline. Addition of a nonpersistent resource, such as IP, affects the service group state.
Persistent Resources If you add a persistent resource as the first resource of a new service group, as shown in the lab exercise for this lesson, notice that the service group status is offline, even though the resource status is online. Persistent resources are not considered when VCS reports service group status, because they are always online. When a nonpersistent resource is added to the group, such as IP, the service group status reflects the status of that nonpersistent resource.
Adding an IP Resource
Virtual IP addresses: Are configured by the agent using ifconfig Must be different from the administrative IP address
main.cf:
IP DemoIP (
    Critical = 0
    Device = eri0
    Address = "10.10.21.198"
    )
Adding an IP Resource The slide shows the required attribute values for an IP resource in the DemoSG service group. The corresponding entry is made in the main.cf file when the configuration is saved. Notice that the IP resource has two required attributes: Device and Address, which specify the network interface and IP address, respectively. Optional Attributes NetMask: Netmask associated with the application IP address The value may be specified in decimal (base 10) or hexadecimal (base 16). The default is the netmask corresponding to the IP address class. Options: Options to be used with the ifconfig command ArpDelay: Number of seconds to sleep between configuring an interface and sending out a broadcast to inform routers about this IP address The default is 1 second. IfconfigTwice: If set to 1, this attribute causes an IP address to be configured twice, using an ifconfig up-down-up sequence. This behavior increases the probability of gratuitous ARPs (caused by ifconfig up) reaching clients. The default is 0.
You can use the hares command to add a resource and modify resource attributes.
haconf -makerw
hares -add DemoDG DiskGroup DemoSG
hares -modify DemoDG Critical 0
hares -modify DemoDG DiskGroup DemoDG
hares -modify DemoDG Enabled 1
haconf -dump -makero
The DiskGroup agent imports and deports a disk group and monitors the disk group using vxdg.
main.cf:
DiskGroup DemoDG (
    Critical = 0
    DiskGroup = DemoDG
    )
Adding a Resource Using the CLI: DiskGroup Example
You can use the hares command to add a resource and configure the required attributes. This example shows how to add a DiskGroup resource.
The DiskGroup Resource
The DiskGroup resource has only one required attribute, DiskGroup, except on Linux, which also requires StartVolumes and StopVolumes.
Notes:
In versions prior to 4.0, VCS uses the vxdg command with the -t option when importing a disk group to disable autoimport. This ensures that VCS controls the disk group. VCS deports a disk group if it was manually imported without the -t option (outside of VCS control).
In versions 4.1 and 5.0, VCS sets the vxdg autoimport option to no, which disables autoimporting of disk groups.
Example optional attributes:
StartVolumes: Starts all volumes after importing the disk group. This also starts layered volumes by running vxrecover -s. The default is 1, enabled, on all UNIX platforms except Linux.
StopVolumes: Stops all volumes with vxvol before deporting the disk group. The default is 1, enabled, on all UNIX platforms except Linux.
Resource Definition (sample values):
Service Group Name: DemoSG
Resource Name: DemoVol
Resource Type: Volume
Required Attributes: Volume, DiskGroup
The Volume agent: Starts a volume using vxrecover and stops a volume using vxvol Reads a block from the raw device interface using dd to determine volume status
main.cf:
Volume DemoVol (
    Volume = DemoVol
    DiskGroup = DemoDG
    )
Volume resources: Are not required; DiskGroup resources can start volumes Provide additional monitoring
The Volume Resource
The Volume resource can be used to manage a VxVM volume. Although the Volume resource is not strictly required, it provides additional monitoring. You can use a DiskGroup resource to start volumes when the DiskGroup resource is brought online. This has the effect of starting volumes more quickly, but only the disk group is monitored. However, if you have a large number of volumes in a single disk group, the DiskGroup resource can time out when trying to start or stop all the volumes simultaneously. In this case, you can set the StartVolumes and StopVolumes attributes of the DiskGroup resource to 0, and create Volume resources to start the volumes individually. Also, if you are using volumes as raw devices with no file systems, and therefore no Mount resources, consider using Volume resources for the additional level of monitoring. The Volume resource has no optional attributes.
The Mount agent:
Mounts and unmounts a block device on the mount point directory
Runs fsck and retries the mount if the first mount attempt fails
Uses stat and statvfs to monitor the file system
main.cf:
Mount DemoMount (
    BlockDevice = /dev/vx/dsk/DemoDG/DemoVol
    FSType = vxfs
    MountPoint = /Demo
    FsckOpt = -y
    )
The Mount Resource The Mount resource has the required attributes displayed in the main.cf file excerpt in the slide. Example optional attributes: MountOpt: Specifies options for the mount command When setting attributes with arguments starting with a dash (-), use the percent (%) character to escape the arguments. Examples: hares -modify DemoMount FsckOpt %-y The percent character is an escape character for the VCS CLI which prevents VCS from interpreting the string as an argument to hares. SnapUmount: Determines whether VxFS snapshots are unmounted when the file system is taken offline (unmounted) The default is 0, meaning that snapshots are not automatically unmounted when the file system is unmounted. Note: If SnapUmount is set to 0 and a VxFS snapshot of the file system is mounted, the unmount operation fails when the resource is taken offline, and the service group is not able to fail over. This is desired behavior in some situations, such as when a backup is being performed from the snapshot.
The Process agent:
Starts and stops a daemon-type process
Monitors the process by scanning the process table
The Process Resource
The Process resource controls the application and is added last because it requires all other resources to be online in order to start. The Process resource is used to start, stop, and monitor the status of a process.
Online: Starts the process specified in the PathName attribute, with options, if specified in the Arguments attribute
Offline: Sends SIGTERM to the process; SIGKILL is sent if the process does not exit within one second
Monitor: Determines if the process is running by scanning the process table
The optional Arguments attribute specifies any command-line options to use when starting the process.
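The monitor logic can be illustrated with a small standalone sketch using ordinary UNIX tools. Here "sleep 300" is a hypothetical stand-in for the application daemon; the real agent performs an equivalent process-table match against the PathName and Arguments attributes.

```shell
#!/bin/sh
# Start a stand-in "daemon" (hypothetical; replaces the real application).
sleep 300 &
pid=$!

# Monitor step: scan the process table for the command line,
# as the Process agent's monitor does. The [s] bracket trick
# prevents grep from matching its own entry in the ps output.
if ps -ef | grep "[s]leep 300" >/dev/null 2>&1; then
    status=ONLINE
else
    status=OFFLINE
fi

kill $pid 2>/dev/null
echo $status
```

Running the sketch prints ONLINE while the stand-in process exists, mirroring how the agent reports resource state.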
Attribute: Shell script example
PathName: Fully qualified path to the shell: /bin/sh
Arguments (optional): Path to the shell script and parameters: /Demo/orderproc up
Process Attribute Specification If the executable is a shell script, you must specify the script name followed by arguments. You must also specify the full path for the shell in the PathName attribute. The monitor script calls ps and matches the process name. The process name field is limited to 80 characters in the ps output. If you specify a path name to a process that is longer than 80 characters, the monitor entry point fails.
Troubleshooting Resources
Flowchart: Bring the resource online. If it comes online, you are done. If it does not come online, or faults, check the attribute values and the logs, fix the problems, and flush the group before retrying. The primary focus is checking logs and fixing problems.
Flushing a Service Group Occasionally, agents for the resources in a service group can appear to become suspended waiting for resources to be brought online or be taken offline. Generally, this condition occurs during initial configuration and testing because the required attributes for a resource are not defined properly or the underlying operating system resources are not prepared correctly. If it appears that a resource or group has become suspended while being brought online, you can flush the service group to enable corrective action. Flushing a service group stops VCS from attempting to bring resources online or take them offline and clears any internal wait states. You can then check resources for configuration problems or underlying operating system configuration problems, and then attempt to bring resources back online. Note: Before flushing a service group, verify that the physical or software resource is actually stopped.
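For example, to clear the internal wait states for the DemoSG service group on system S1:

```
hagrp -flush DemoSG -sys S1
```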
Upon enabling: VCS calls the agent to monitor the resource on each system.
Nonpersistent resources must be taken offline before disabling.
Disabling and Enabling a Resource
Disable a resource before you start modifying attributes to fix a misconfigured resource. When you disable a resource:
VCS stops monitoring the resource, so it does not fault or wait to come online while you are making changes.
The agent calls the close entry point, if defined. The close entry point is optional.
When the close tasks are completed, or if there is no close entry point, the agent stops monitoring the resource.
When you enable a resource, VCS calls the agent to monitor the resource immediately and then continues to direct the agent to monitor the resource periodically.
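Disabling and enabling are controlled through the resource's Enabled attribute; a sketch using the hares CLI, with DemoProcess as an example resource name:

```shell
# Disable the resource before fixing its attributes
hares -modify DemoProcess Enabled 0

# ...modify the misconfigured attributes here...

# Re-enable; VCS directs the agent to monitor immediately
hares -modify DemoProcess Enabled 1
```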
Clearing Resource Faults
A fault indicates that the monitor entry point is reporting an unexpected offline state for a previously online resource. This indicates a problem with the underlying component being managed by the resource. Before clearing a fault, you must resolve the problem that caused the fault. Use the VCS logs to help you determine which resource has faulted and why. It is important to clear faults for critical resources after fixing the underlying problems so that the system where the fault originally occurred can be a failover target for the service group. In a two-node cluster, a faulted critical resource would prevent the service group from failing back if another fault occurred. You can clear a faulted resource on a particular system, or on all systems where the service group can run.
Note: Persistent resource faults are cleared automatically when the next monitor cycle returns an online status. Alternatively, you can probe the resource to force the agent to monitor the resource immediately.
Clearing and Probing Resources Using the CLI
To clear a faulted resource, type:
hares -clear resource [-sys system]
If the system name is not specified, the resource is cleared on all systems. To probe a resource, type:
hares -probe resource -sys system
VERITAS Cluster Server for UNIX, Fundamentals
Copyright 2006 Symantec Corporation. All rights reserved.
Test Procedure
After all resources are online locally:
1. Link resources.
2. Switch the service group to each system.
3. Set resources to critical, as needed.
4. Test failover.
(Flowchart: Start, then Link Resources, then Test Switching. Success? If no, repeat; if yes, done.)
Linking Resources
main.cf:
DemoIP requires DemoNIC

hares -link DemoIP DemoNIC
hares -dep
hares -unlink DemoIP DemoNIC
Linking Resources When you link a parent resource to a child resource, the dependency becomes a component of the service group configuration. When you save the cluster configuration, each dependency is listed at the end of the service group definition, after the resource specifications, in the format shown in the slide. In addition, VCS creates a dependency tree in the main.cf file at the end of the service group definition to provide a more visual view of resource dependencies. This is not part of the cluster configuration, as denoted by the // comment markers.
// resource dependency tree
//
// group DemoSG
// {
// IP DemoIP
//     {
//     NIC DemoNIC
//     }
// }
Note: You cannot use the // characters as general comment delimiters. VCS strips out all lines with // upon startup and re-creates these lines based on the requires statements in the main.cf file.
Resource Dependencies
Resource Dependency Definition for service group DemoSG:

Parent Resource    Requires    Child Resource
DemoVol                        DemoDG
DemoMount                      DemoVol
DemoIP                         DemoNIC
DemoProcess                    DemoMount
DemoProcess                    DemoIP

Dependency rules:
Parents cannot be persistent.
No resource links across service groups.
Unlimited number of parent/child resources.
No cyclical dependencies.
Resource Dependencies VCS enables you to link resources to specify dependencies. For example, an IP address resource is dependent on the NIC providing the physical link to the network. Ensure that you understand the dependency rules shown in the slide before you start linking resources.
main.cf:
DiskGroup DemoDG (
    DiskGroup = DemoDG
)

hares -modify DemoDG Critical 1
Setting the Critical Attribute The Critical attribute is set to 1, or true, by default. When you initially configure a resource, you set the Critical attribute to 0, or false. This enables you to test the resources as you add them without the resource faulting and causing the service group to fail over as a result of configuration errors you make. Some resources may always be set to non-critical. For example, a resource monitoring an Oracle reporting database may not be critical to the overall service being provided to users. In this case, you can set the resource to non-critical to prevent downtime due to failover in the event that it was the only resource that faulted. Note: When you set an attribute to a default value, the attribute is removed from main.cf. For example, after you set Critical to 1 for a resource, the Critical = 0 line is removed from the resource configuration because it is now set to the default value for that resource type. To see the values of all attributes for a resource, use the hares command. For example:
hares -display DemoNIC
Running a Virtual Fire Drill You can run a virtual fire drill for a service group to check that the underlying infrastructure is properly configured to enable failover to other systems. The service group must be fully online on one system, and can then be checked on all other systems where it is offline. You can select which type of infrastructure components to check, or run all checks. In some cases, you can use the virtual fire drill to correct problems, such as making a mount point directory if it does not exist. However, not all resources have defined actions for virtual fire drills, in which case a message is displayed indicating that no checks were performed. You can also run fire drills using the havfd command, as shown in the slide.
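The slide showing the havfd command is not reproduced here. As a sketch, and assuming the invocation takes the service group name and the target system (DemoSG and S2 are example names), running all virtual fire drill checks from the command line looks like:

```shell
# Check the infrastructure for DemoSG on system S2,
# where the group is currently offline (example names)
havfd DemoSG -sys S2
```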
Lesson Summary
Key Points
Follow a standard procedure for creating and testing service groups. Recognize common configuration problems and apply a methodology for finding solutions.
Reference Materials
VERITAS Cluster Server Bundled Agent Reference Guide VERITAS Cluster Server User's Guide VERITAS Cluster Server Command Line Quick Reference
Labs and solutions for this lesson are located on the following pages. "Lab 8: Online Configuration of a Service Group," page A-43. "Lab 8 Solutions: Online Configuration of a Service Group," page B-71.
Lesson Introduction
Lesson 1: High Availability Concepts Lesson 2: VCS Building Blocks Lesson 3: Preparing a Site for VCS Lesson 4: Installing VCS Lesson 5: VCS Operations Lesson 6: VCS Configuration Methods Lesson 7: Preparing Services for VCS Lesson 8: Online Configuration Lesson 9: Offline Configuration Lesson 10: Sharing Network Interfaces Lesson 11: Configuring Notification Lesson 12: Configuring VCS Response to Faults Lesson 13: Cluster Communications Lesson 14: System and Communication Faults Lesson 15: I/O Fencing Lesson 16: Troubleshooting
Topic
Offline Configuration Procedures Solving Offline Configuration Problems
Testing the Service Group Test the service group to ensure it is correctly configured.
hastart
Verify the Configuration File Syntax Run the hacf command in the /etc/VRTSvcs/conf/config directory to verify the syntax of the main.cf and types.cf files after you have modified them. VCS cannot start if the configuration files have syntax errors. Run the command in the config directory using the dot (.) to indicate the current working directory, or specify the full path.
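For example, run the syntax check from the configuration directory, using the dot to indicate the current working directory:

```shell
cd /etc/VRTSvcs/conf/config
hacf -verify .
```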
Note: The hacf command only identifies syntax errors, not configuration errors.
6  Start VCS on the system with the modified configuration file. Start VCS first on the primary system with the modified main.cf file.
7  Verify that VCS is running. Verify that VCS is running on the primary configuration system before starting VCS on other systems.
8  Start other systems. After VCS is in a running state on the first system, start VCS on all other systems. If you cannot bring VCS to a running state on all systems, see the Solving Offline Configuration Problems section.
First System:
cd /etc/VRTSvcs/conf/config
cp -p main.cf main.cf.orig
mkdir stage
cp *.cf stage
cd stage
vi main.cf
Existing Cluster
The diagram illustrates a process for modifying the cluster configuration when you want to minimize the time that VCS is not running to protect existing services. This procedure includes several built-in protections from common configuration errors and maximizes high availability.
First System
Designate one system as the primary change management node. This makes troubleshooting easier if you encounter problems with the configuration.
1  Save and close the configuration. Save and close the cluster configuration before you start making changes. This ensures that the working copy has the latest in-memory configuration.
2  Back up the main.cf file. Make a copy of the main.cf file with a different name. This ensures that you have a backup of the configuration that was in memory when you saved the configuration to disk.
3  Make a staging directory. Make a subdirectory of /etc/VRTSvcs/conf/config in which you can edit a copy of the main.cf file. This helps ensure that your edits are not overwritten if another administrator changes the configuration simultaneously.
4  Copy the configuration files. Copy the *.cf files from /etc/VRTSvcs/conf/config to the staging directory.
Lesson 9 Offline Configuration
5  Modify the configuration files. Modify the main.cf file in the staging directory on one system. The diagram on the slide refers to this as the first system.
6  Freeze the service groups. If you are modifying existing service groups, freeze those service groups persistently by setting the Frozen attribute to 1. This simplifies fixing resource configuration problems after VCS is started because the service groups will not fail over between systems if faults occur.

group AppSG (
    SystemList = { S1 = 1, S2 = 0 }
    AutoStartList = { S2 }
    Operators = { AppSGoper }
    Frozen = 1
)
hastart
Other Systems
After you verify that all resources come online properly: Unfreeze service groups. Test switching the service group.
8  Verify the configuration file syntax. Run the hacf command in the staging directory to verify the syntax of the main.cf and types.cf files after you have modified them.

Note: The dot (.) argument indicates that the current working directory is used as the path to the configuration files. You can run hacf -verify from any directory by specifying the path to the configuration directory:
hacf -verify /etc/VRTSvcs/conf/config

9  Stop VCS. Stop VCS on all cluster systems after making configuration changes. To leave applications running, use the -force option, as shown in the diagram.
10 Copy the new configuration file. Copy the modified main.cf file and all *types.cf files from the staging directory back into the configuration directory.
11 Start VCS. Start VCS first on the system with the modified main.cf file. Verify that VCS is in a local build or running state on the primary system.
12 Start other systems. After VCS is in a running state on the first system, start VCS on all other systems. You must wait until the first system has built a cluster configuration in memory and is in a running state to ensure that the other systems perform a remote build from the first system's configuration in memory.
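The stop, copy, and restart sequence above can be sketched as a command series run from the first system. Treat this as an illustration; the staging-directory path follows the earlier example in this lesson.

```shell
# Stop VCS on all systems, leaving applications running
hastop -all -force

# Copy the staged configuration files back into place
cd /etc/VRTSvcs/conf/config
cp stage/*.cf .

# Start VCS on this system first; run hastart on the other
# systems only after this system reaches the running state
hastart
```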
(Diagram: S2 has no configuration in memory; VCS is not running.)
VCS Startup Using a Specific main.cf File The diagram illustrates how to start VCS to ensure that the cluster configuration in memory is built from a specific main.cf file. Starting VCS Using a Modified main.cf File Ensure that VCS builds the new configuration in memory on the system where the changes were made to the main.cf file. All other systems must wait for the build to successfully complete and the system to transition to the running state before VCS is started elsewhere. 1 Run hastart on S1 to start the had and hashadow processes. 2 HAD checks for a valid main.cf file. 3 HAD checks for an active cluster configuration on the cluster interconnect. 4 Because there is no active cluster configuration, the had daemon on S1 reads the local main.cf file and loads the cluster configuration into local memory on S1.
(Diagram: S2 runs hastart and performs a remote build of the cluster configuration from S1.)
5  Verify that VCS is in a local build or running state on S1 using hastatus -sum.
6  When VCS is in a running state on S1, run hastart on S2 to start the had and hashadow processes.
7  HAD on S2 checks for a valid main.cf file.
8  HAD on S2 checks for an active cluster configuration on the cluster interconnect.
9  S1 sends a copy of the cluster configuration over the cluster interconnect to S2.
10 S2 performs a remote build to put the new cluster configuration in memory.
11 HAD on S2 copies the cluster configuration into the local main.cf and types.cf files after moving the original files to backup copies with timestamps.
Resource Dependencies
Resource Dependency Definition for service group AppSG:

Parent Resource    Requires    Child Resource
AppProcess                     AppIP
AppProcess                     AppMount
AppMount                       AppVol
AppVol                         AppDG
AppIP                          AppNIC

main.cf:
AppProcess requires AppIP
AppProcess requires AppMount
AppMount requires AppVol
AppVol requires AppDG
AppIP requires AppNIC
Resource Dependencies Ensure that you create the resource dependency definitions at the end of the service group definition. Add the links using the syntax shown in the slide. A complete example service group definition is shown in the lab solution for this lesson.
main.cf

Original group:
group DemoSG (
    SystemList = { S1 = 0, S2 = 1 }
    AutoStartList = { S1 }
    Operators = { DemoSGoper }
)
DiskGroup DemoDG (
    DiskGroup = DemoDG
)
IP DemoIP (
    Device = eri0
    Address = "10.10.21.198"
)
. . .

Copied group:
group AppSG (
    SystemList = { S1 = 1, S2 = 0 }
    AutoStartList = { S2 }
    Operators = { AppSGoper }
    Frozen = 1
)
DiskGroup AppDG (
    DiskGroup = DemoDG
)
IP DemoIP (
    Device = eri0
    Address = "10.10.21.199"
)
. . .

Check these common problems.
A Completed Configuration File A portion of the completed main.cf file with the new service group definition for AppSG is displayed in the slide. This service group was created by copying the DemoSG service group definition and changing the attribute names and values. Two errors are intentionally shown in the example in the slide. The DemoIP resource name was not changed in the AppSG service group. This causes a syntax error when the main.cf file is checked using hacf -verify because you cannot have duplicate resource names within the cluster. The AppDG resource has the value of DemoDG for the DiskGroup attribute. This does not cause a syntax error, but is not a correct attribute value for this resource. The DemoDG disk group is being used by the DemoSG service group and cannot be imported by another failover service group. For a complete service group definition example, see the corresponding lab solution for this lesson. Note: You cannot include comment lines in the main.cf file. The lines you see starting with // are generated by VCS to show resource dependencies. Any lines starting with // are stripped out during VCS startup.
Using the Simulator does not affect real cluster configuration files. Simulator configuration files are created in a separate directory.
Using the VCS Simulator You can use the VCS Simulator to create or modify copies of VCS configuration files that are located in a Simulator-specific directory. You can also test a new or modified configuration using the Simulator and then copy the test configuration files into the /etc/VRTSvcs/conf/config VCS configuration directory. In addition to the advantage of using a familiar interface, using the VCS Simulator ensures that your configuration files do not contain syntax errors, which are more easily introduced when manually editing the files directly. When you have completed the configuration, you can copy the files into the standard configuration directory and restart VCS to build that configuration in memory on cluster systems, as described earlier in the Offline Configuration Procedures section. Note: You cannot use the Simulator to manage a running VCS cluster. Simulated clusters are completely separate, and Simulator configuration files are maintained in separate directory structures.
To recover:
1. Close the configuration, if open.
2. Stop VCS on all systems and keep applications running.
3. On the system with the good main.cf file, copy main.cf.previous to main.cf.
4. Verify the syntax.
5. Start VCS on this system using the hastart command.
6. Verify that VCS is running using hastatus.
7. Start VCS on all other systems only after the first system is running.
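The recovery steps above can be sketched as a command sequence run on the system with the good configuration. Treat this as an illustration and verify paths and states on your cluster before relying on it.

```shell
# Close the cluster configuration, if open
haconf -dump -makero

# Stop VCS on all systems, leaving applications running
hastop -all -force

# Restore the last known good configuration file
cd /etc/VRTSvcs/conf/config
cp main.cf.previous main.cf

# Verify the syntax before starting VCS
hacf -verify .

# Start VCS here first, then confirm it is running before
# starting the other systems
hastart
hastatus -sum
```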
(Diagram: S2 has no configuration in memory and is waiting for a running configuration.)
All Systems in a Wait State This scenario results in all cluster systems entering a wait state:
Your new main.cf file has a syntax problem.
You forget to check the file with hacf -verify.
You start VCS on the first system with hastart. The first system cannot build a configuration and goes into a wait state, such as STALE_ADMIN_WAIT or ADMIN_WAIT.
You forget to verify that had is running on the first system and start all other cluster systems.
Forcing VCS to Start from a Wait State To force VCS to start on the system with the correct main.cf file, use hasys -force to tell had to create the cluster configuration from that system.
1  Visually inspect the main.cf file to ensure that it contains the correct configuration content.
2  Verify the configuration with hacf -verify /etc/VRTSvcs/conf/config.
3  Run hasys -force S1 on S1. This starts the local build process. You must have a valid main.cf file to force VCS to a running state. If the main.cf file has a syntax error, VCS enters the ADMIN_WAIT state.
4  HAD checks for a valid main.cf file.
5  The had daemon on S1 reads the local main.cf file and, if it has no syntax errors, HAD loads the cluster configuration into local memory on S1.
(Diagram: S2 performs a remote build of the cluster configuration from S1.)
6  When had is in a running state on S1, this state change is broadcast on the cluster interconnect by GAB.
7  S2 then performs a remote build to place the new cluster configuration into its memory.
8  The had process on S2 copies the cluster configuration into the local main.cf and types.cf files after moving the original files to backup copies with timestamps.
ls -l /etc/VRTSvcs/conf/config/main*
-rw------- 2 root other 5992 Oct 10 ...
-rw------- 1 root root  5039 Oct  8 ...
-rw------- 2 root other 5051 Oct  9 ...
-rw------- 2 root other 5992 Oct 10 ...
-rw------- 1 root other 6859 Oct 11 ...
-rw------- 2 root other 5051 Oct  9 ...
Configuration File Backups Each time you save the cluster configuration, VCS maintains backup copies of the main.cf and types.cf files. Although it is always recommended that you copy configuration files before modifying them, you can revert to an earlier version of these files if you damage or lose a file.
(Flowchart: Online? If no, troubleshoot. Test Switching, then Success? If no, check the logs and fix the problems; if yes, set critical resources and Test Failover. Success? Done.)
Lesson Summary
Key Points
You can use a text editor or the VCS Simulator to modify VCS configuration files. Apply a methodology for modifying and testing a VCS configuration.
Reference Materials
VERITAS Cluster Server Bundled Agents Reference Guide VERITAS Cluster Server User's Guide VERITAS Cluster Server Command Line Quick Reference
Basic Lab: Together, edit main.cf. Add a resource to each nameSG1:
your_nameFileOnOff1
their_nameFileOnOff1
Labs and solutions for this lesson are located on the following pages. "Lab 9: Offline Configuration," page A-55. "Lab 9 Solutions: Offline Configuration," page B-95.
Topic
Parallel Service Groups Sharing Network Interfaces Using Parallel Network Service Groups
Other Example Parallel Applications VERITAS Volume Manager and Cluster File Systems are the storage management components of Storage Foundation for Oracle RAC. Additional agents are provided to manage the storage objects in a parallel environment, such as: CVMVolDg: Manages shared volumes in shared disk groups CFSMount: Manages cluster file systems These types of resources are contained in parallel service groups and run simultaneously on multiple systems. Parallel service groups can also be used to manage network resources that are shared by multiple service groups, as described in detail later in this lesson.
Parallel groups:
Can be online on more than one system without causing a concurrency violation
Cannot be switched
Properties of Parallel Service Groups Parallel service groups are managed like any other service group in VCS. The group is only started on a system if that system is listed in the AutoStartList and the SystemList attributes. The difference with a parallel service group is that it starts on multiple systems simultaneously if more than one system is listed in AutoStartList. A parallel service group can also fail over if the service group faults on a system and there is an available system (listed in the SystemList attribute) that is not already running the service group.
Service Group Definition
Group: NetSG
Required Attributes: Parallel, SystemList
Optional Attributes: AutoStartList

main.cf:
group NetSG (
    SystemList = { S1 = 0, S2 = 1 }
    AutoStartList = { S1, S2 }
    Parallel = 1
)
Configuring a Parallel Service Group You cannot change an existing failover service group that contains resources to a parallel service group except by using the offline configuration procedure. In this case, you can add the Parallel attribute definition to the service group, as displayed in the diagram. To create a new parallel service group in a running cluster:
1  Create a new service group using either the GUI or CLI.
2  Set the Parallel attribute to 1 (true).
3  Add resources. Set the Critical attributes after you have verified that the service group is online on all systems in SystemList.
Note: If you have a service group that already contains resources, you must set the Parallel attribute by editing the main.cf file and restarting VCS with the modified configuration file.
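As a sketch, creating the NetSG group from the slide at the command line might look like the following; adjust system names and priorities to your cluster.

```shell
# Open the cluster configuration for writing
haconf -makerw

# Create the group and make it parallel
hagrp -add NetSG
hagrp -modify NetSG SystemList S1 0 S2 1
hagrp -modify NetSG AutoStartList S1 S2
hagrp -modify NetSG Parallel 1

# Add resources here, then save and close the configuration
haconf -dump -makero
```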
Configuration View
(Slide: main.cf excerpts from several service groups, including WebSG, DBSG, NFSSG, and Ora1SG, each defining its own IP and NIC resources on the same interface. A representative excerpt from this Solaris example:)

IP WebIP (
    Device = eri0
    Address = "10.10.21.198"
)
NIC WebNIC (
    Device = eri0
)
WebIP requires WebNIC
Configuration View The example shows a configuration with many service groups using the same network interface specified in the NIC resource. Each service group has a unique NIC resource with a unique name, but the Device attribute for all is eri0 in this Solaris example. In addition to the overhead of many monitor cycles for the same resource, a disadvantage of this configuration is the effect of changes in NIC hardware. If you must change the network interface (for example, in the event the interface fails), you must change the Device attribute for each NIC resource monitoring that interface.
Using Proxy Resources You can use a Proxy resource to allow multiple service groups to monitor the same network interfaces. This reduces the network traffic that results from having multiple NIC resources in different service groups monitor the same interface.
Resource Definition: Service Group Name, Resource Name, Resource Type
Required Attribute: TargetResName
The Proxy Resource Type The Proxy resource mirrors the status of another resource in a different service group. The required attribute, TargetResName, is the name of the resource whose status is reflected by the Proxy resource. Optional attribute: TargetSysName specifies the name of the system on which the target resource status is monitored. If no system is specified, the local system is used as the target system.
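A minimal main.cf fragment for a Proxy resource, as a sketch, using the WebProxy and NetNIC names from the diagram that follows (the Proxy resource lives in one group and mirrors the NIC resource in the parallel network group):

```
Proxy WebProxy (
    TargetResName = NetNIC
)
```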
(Diagram: WebProxy and DBProxy resources point to the NetNIC resource in the parallel NetSG group, which runs on S1 and S2.)
How do you determine the status of a service group with only a persistent resource?
Phantom Resources
(Diagram: WebSG and DBSG contain the WebProxy and DBProxy resources; the parallel NetSG group on S1 and S2 contains NetNIC and a Phantom resource.)
A Phantom resource enables VCS to report the online status of a service group containing only persistent resources.
Phantom Resources The Phantom resource is used to report the actual status of a service group that consists of only persistent resources. A service group shows an online status only when all of its nonpersistent resources are online. Therefore, if a service group has only persistent resources, VCS considers the group offline, even if the persistent resources are running properly. When a Phantom resource is added, the status of the service group is shown as online. Note: Use this resource only with parallel service groups.
Resource Definition: Service Group Name, Resource Name, Resource Type
Required Attributes: none
The Phantom Resource Type The Phantom resource enables VCS to determine the status of service groups with no OnOff resources, that is, service groups with only persistent resources. Service groups that do not have any OnOff resources are not brought online unless they include a Phantom resource. The Phantom resource is used only in parallel service groups.
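Putting the pieces of this lesson together, a parallel network service group with a NIC and a Phantom resource might look like the following main.cf sketch. NetSG and NetNIC come from the slides; NetPhantom is a hypothetical resource name.

```
group NetSG (
    SystemList = { S1 = 0, S2 = 1 }
    AutoStartList = { S1, S2 }
    Parallel = 1
)

NIC NetNIC (
    Device = eri0
)

Phantom NetPhantom (
)
```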
You can localize the Device attribute for NIC resources when systems have different network interfaces.
Any attribute can be localized. Network-related resources are common examples for local attributes.
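As a sketch, localizing the Device attribute uses the hares -local command followed by per-system modifications. The interface names here are illustrative; substitute the interfaces on your systems.

```shell
# Make the Device attribute local (per-system) for NetNIC
hares -local NetNIC Device

# Set a different interface value on each system
hares -modify NetNIC Device eri0 -sys S1
hares -modify NetNIC Device qfe0 -sys S2
```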
Lesson Summary
Key Points
Proxy resources reflect the state of other resources without monitoring overhead. Network resources can be contained in a parallel service group for efficiency.
Reference Materials
VERITAS Cluster Server Bundled Agents Reference Guide VERITAS Cluster Server User's Guide
Advanced Lab: Replace NIC with Proxy in ClusterService and nameSG2.
(Diagram: The ClusterService group contains VCSweb, webip, and csgProxy, which points to NetworkNIC.)
Labs and solutions for this lesson are located on the following pages. "Lab 10: Creating a Parallel Service Group," page A-71. "Lab 10 Solutions: Creating a Parallel Service Group," page B-121.
Topic
Notification Overview Configuring Notification Using Triggers for Notification
Notification Overview
1  HAD sends a message to the notifier daemon when an event occurs.
2  The notifier daemon:
   a  Formats the event message
   b  Sends an SNMP trap or e-mail message (or both) to designated recipients
(Diagram: had communicates with the notifier daemon, which sends SNMP traps and SMTP e-mail; notifier is managed by NotifierMngr and NIC resources.)
Notification Overview
When VCS detects certain events, you can configure the notifier to:
Generate an SNMP (V2) trap to specified SNMP consoles.
Send an e-mail message to designated recipients.
Message Queue VCS ensures that no event messages are lost while the VCS engine is running, even if the notifier daemon stops or is not started. The had daemons throughout the cluster communicate to maintain a replicated message queue. If the service group with notifier configured as a resource fails on one of the nodes, notifier fails over to another node in the cluster. Because the message queue is guaranteed to be consistent and replicated across nodes, notifier can resume message delivery from where it left off after it fails over to the new node. Messages are stored in the queue until one of these conditions is met:
The notifier daemon sends an acknowledgement to had that at least one recipient has received the message.
The queue is full. The queue is circular: the last (oldest) message is deleted in order to write the current (newest) message.
Messages that remain in the queue for one hour are deleted if notifier is unable to deliver them to the recipient.
Note: Before the notifier daemon connects to had, messages are stored permanently in the queue until one of the last two conditions is met.
(Diagram: had daemons forward events to notifier; examples include an Error event, and a SevereError event such as a concurrency violation delivered as an SNMP trap.)
See the Job Aids appendix for a complete list of events.
Message Severity Levels Event messages are assigned one of four severity levels by notifier: Information: Normal cluster activity is occurring, such as resources being brought online. Warning: Cluster or resource states are changing unexpectedly, such as a resource in an unknown state. Error: Services are interrupted, such as a service group faulting that cannot be failed over. SevereError: Potential data corruption is occurring, such as a concurrency violation. The administrator can configure notifier to specify which recipients are sent messages based on the severity level. A complete list of events and corresponding severity levels is provided in the Job Aids appendix.
engine_A.log:
2006/07/06 15:19:37 VCS ERROR V-16-1-10205 Group NetSG is faulted on system S1

Notifier e-mail message:
From root@S1.ourco.com Thu Jul 6 15:19:53 2006
Date: Thu, 06 Jul 2006 15:19:37 -0700
From: Notifier
Subject: VCS Error, Service group has faulted
Event Time: Thu Jul 6 15:19:37 2006
Entity Name: NetSG
Entity Type: Service Group
Entity Subtype: Parallel
Entity State: Service group has faulted
Traps Origin: Veritas_Cluster_Server
System Name: S1
Entities Container Name: vcs_web
Entities Container Type: VCS
Notifier E-Mail      Log File
Information          INFO, NOTICE
Warning              WARNING
Error                ERROR
SevereError          CRITICAL

Notifier and Log Events
The table above shows how the notifier levels shown in e-mail messages compare to the log file codes for corresponding events. Notice that notifier SevereError events correlate to CRITICAL entries in the engine log.
Configuring Notification
Note: Add a NotifierMngr resource to only one service group.
- Add a NotifierMngr resource to ClusterService.
- If SMTP notification is required, modify the SmtpServer and SmtpRecipients attributes.
- Optionally, modify ResourceOwner and GroupOwner.
- If using SNMP notification, modify SnmpConsoles.
- Configure the SNMP console to receive VCS traps.
- Modify any other optional attributes, as appropriate.
While you can start and stop the notifier daemon manually outside of VCS, you can make the notifier component highly available by placing the daemon under VCS control. Perform the following steps to configure highly available notification within the cluster:
1. Add a NotifierMngr type of resource to the ClusterService group.
2. If SMTP notification is required:
   a. Modify the SmtpServer and SmtpRecipients attributes of the NotifierMngr type of resource.
   b. Optionally, modify the ResourceOwner attribute of individual resources (described later in the lesson).
   c. Optionally, specify a GroupOwner e-mail address for each service group.
3. If SNMP notification is required:
   a. Modify the SnmpConsoles attribute of the NotifierMngr type of resource.
   b. Verify that the SNMPTrapPort attribute value matches the port configured for the SNMP console. The default is port 162.
   c. Configure the SNMP console to receive VCS traps (described later in the lesson).
4. Modify any other optional attributes of the NotifierMngr type of resource, as desired.
See the manual pages for notifier and hanotify for a complete description of notification configuration options.
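Assuming the ClusterService group already exists, steps 1 through 3 might look like the following command sketch. The server and recipient names are placeholders, and the association-attribute syntax for SmtpRecipients and SnmpConsoles can vary by VCS version, so verify it against the hares manual page:

```
haconf -makerw
hares -add notifier NotifierMngr ClusterService
hares -modify notifier SmtpServer "smtp.example.com"
hares -modify notifier SmtpRecipients "admin@example.com" Error
hares -modify notifier SnmpConsoles "snmpconsole.example.com" SevereError
hares -modify notifier Enabled 1
haconf -dump -makero
```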
116 VERITAS Cluster Server for UNIX, Fundamentals
Copyright 2006 Symantec Corporation. All rights reserved.
*Required attributes: Either SnmpConsoles or SmtpXxx attributes must be specified. Both can be specified. Restart resource if you change attributes.
main.cf:
NotifierMngr notifier (
    SmtpServer = "smtp.yourco.com"
    SmtpRecipients = { "vcsadmin@yourco.com" = Error }
    )
The NotifierMngr Resource Type
The notifier daemon runs on only one system in the cluster, where it processes messages from the local had daemon. If the notifier daemon fails on that system, the NotifierMngr agent detects the failure and migrates the service group containing the NotifierMngr resource to another system. Because the message queue is replicated throughout the cluster, any system that is a target for the service group has an identical queue. When the NotifierMngr resource is brought online, had sends the queued messages to the notifier daemon.

Adding a NotifierMngr Resource
You can add a NotifierMngr resource using one of the usual methods for adding resources to service groups:
- Edit the main.cf file and restart VCS.
- Use the Cluster Manager graphical user interface to add the resource dynamically.
- Use the hares command to add the resource to a running cluster.
Note: Before modifying resource attributes, ensure that you take the resource offline and disable it. The notifier daemon must be stopped and restarted with new parameters in order for changes to take effect.
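The note above might translate into a sequence like this sketch; the resource name notifier and the system name s1 are hypothetical:

```
hares -offline notifier -sys s1      # stop the notifier daemon
hares -modify notifier Enabled 0     # disable the resource before changing attributes
hares -modify notifier SmtpServer "smtp.example.com"
hares -modify notifier Enabled 1
hares -online notifier -sys s1       # restart notifier with the new parameters
```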
Notification Events (ResourceOwner): ResourceStateUnknown, ResourceMonitorTimeout, ResourceNotGoingOffline. An entry is also written to engine_A.log.
The ResourceOwner Attribute
You can set the ResourceOwner attribute to define an owner for a resource. After the attribute is set to a valid e-mail address and notification is configured, an e-mail message is sent to the defined recipient when one of the resource-related events shown in the table in the slide occurs. VCS also creates an entry in the log file in addition to sending an e-mail message. ResourceOwner can be specified as an e-mail ID (daniel@domain.com) or a user account (daniel). If a user account is specified, the e-mail address is constructed as login@smtp_system, where smtp_system is the system that was specified in the SmtpServer attribute of the NotifierMngr resource.
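For example, setting an owner on a hypothetical resource named NetNIC might look like this (the address is a placeholder):

```
hares -modify NetNIC ResourceOwner daniel@domain.com
```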
The GroupOwner Attribute
You can set the GroupOwner attribute to define an owner for a service group. After the attribute is set to a valid e-mail address and notification is configured, an e-mail message is sent to the defined recipient when one of the group-related events occurs, as shown in the table in the slide. GroupOwner can be specified as an e-mail ID (chris@domain.com) or a user account (chris). If a user account is specified, the e-mail address is constructed as login@smtp_system, where smtp_system is the system that was specified in the SmtpServer attribute of the NotifierMngr resource.
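For example, using the NetSG group shown earlier in the lesson (the address is a placeholder):

```
hagrp -modify NetSG GroupOwner chris@domain.com
```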
/etc/VRTSvcs/snmp/vcs.mib
/etc/VRTSvcs/snmp/vcs_trapd
Configuring the SNMP Console To enable an SNMP management console to recognize VCS traps, you must load the VCS MIB into the console. The textual MIB is located in the /etc/VRTSvcs/snmp/vcs.mib file. For HP OpenView Network Node Manager (NNM), you must merge the VCS SNMP trap events contained in the /etc/VRTSvcs/snmp/vcs_trapd file. To merge the VCS events, type:
xnmevents -merge vcs_trapd
SNMP traps sent by VCS are then displayed in the HP OpenView NNM SNMP console.
Example Triggers
Enabled by presence of script file (apply cluster-wide):
- ResNotOff
- SysOffline
- PostOffline
- PostOnline
Configured by service group attributes (apply only to enabled service groups):
- PreOnline
- ResStateChange
ResFault is configured by a resource attribute.
Creating Triggers
Sample scripts are provided for each type of trigger. Scripts can be copied and modified.
more /opt/VRTSvcs/bin/sample_triggers/resfault
. . .
# Usage:
# resfault <system> <resource> <oldstate>
#
# <system>: is the name of the system where resource faulted.
# <resource>: is the name of the resource that faulted.
# <oldstate>: is the previous state of the resource that faulted.
#
# Possible values for oldstate are ONLINE and OFFLINE.
. . .
Creating Triggers
A set of sample trigger scripts is provided in /opt/VRTSvcs/bin/sample_triggers. These scripts can be copied to /opt/VRTSvcs/bin/triggers and modified to your specifications.
Note: Remember that for PreOnline and ResStateChange, you must also configure the corresponding service group attributes.
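Copying and enabling a sample trigger might look like the following sketch; adjust permissions and content to your site's standards:

```
mkdir -p /opt/VRTSvcs/bin/triggers
cp /opt/VRTSvcs/bin/sample_triggers/resfault /opt/VRTSvcs/bin/triggers/resfault
chmod 755 /opt/VRTSvcs/bin/triggers/resfault
# edit the copy to add site-specific actions
```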
Lesson Summary
Key Points
You can choose from a variety of notification methods. Customize the notification facilities to meet your specific requirements.
Reference Materials
VERITAS Cluster Server Bundled Agents Reference Guide VERITAS Cluster Server User's Guide
Labs and solutions for this lesson are located on the following pages. "Lab 11: Configuring Notification," page A-77. "Lab 11 Solutions: Configuring Notification," page B-135.
Lesson Introduction
Lesson 1: High Availability Concepts Lesson 2: VCS Building Blocks Lesson 3: Preparing a Site for VCS Lesson 4: Installing VCS Lesson 5: VCS Operations Lesson 6: VCS Configuration Methods Lesson 7: Preparing Services for VCS Lesson 8: Online Configuration Lesson 9: Offline Configuration Lesson 10: Sharing Network Interfaces Lesson 11: Configuring Notification Lesson 12: Configuring VCS Response to Faults Lesson 13: Cluster Communications Lesson 14: System and Communication Faults Lesson 15: I/O Fencing Lesson 16: Troubleshooting
Topic
- VCS Response to Resource Faults
- Determining Failover Duration
- Controlling Fault Behavior: Control fault behavior using resource type attributes.
- Recovering from Resource Faults: Recover from resource faults.
- Fault Notification and Event Handling: Configure fault notification and triggers.
Failover Decisions and Critical Resources Critical resources define the basis for failover decisions made by VCS. When the monitor entry point for a resource returns with an unexpected offline status, the action taken by the VCS engine depends on whether the resource is critical. By default, if a critical resource in a failover service group faults or is taken offline as a result of another resource fault, VCS determines that the service group is faulted. VCS then fails the service group over to another cluster system, as defined by a set of service group attributes. The rules for selecting a failover target are described in the Service Group Workload Management lesson in the High Availability Using VERITAS Cluster Server for UNIX, Implementing Local Clusters course. The default failover behavior for a service group can be modified using one or more optional service group attributes. Failover determination and behavior are described throughout this lesson.
Fault the service group. Take the entire SG offline.
Failover target available?
- Yes: Bring the service group online elsewhere.
- No: Keep the service group offline.
How VCS Responds to Resource Faults by Default
VCS responds in a specific and predictable manner to faults. When VCS detects a resource failure, it performs the following actions:
1. Instructs the agent to execute the clean entry point for the failed resource to ensure that the resource is completely offline. The resource transitions to a FAULTED state.
2. Takes all resources in the path of the fault offline, starting from the faulted resource up to the top of the dependency tree.
3. If an online critical resource is part of the path that was faulted or taken offline, faults the service group and takes the group offline to prepare for failover. If no online critical resources are affected, no more action occurs.
4. Attempts to start the service group on another system in the SystemList attribute, according to the FailOverPolicy defined for that service group and the relationships between multiple service groups. Failover policies and the impact of service group interactions during failover are discussed in detail in the VERITAS Cluster Server, Implementing Local Clusters course. Note: The state of the group on the new system prior to failover must be offline (not faulted).
5. If no other systems are available, the service group remains offline.
VCS also executes certain triggers and carries out notification while it performs each task in response to resource faults. The role of notification and event triggers in resource faults is explained in detail later in this lesson.
ManageFaults = NONE: Place the resource in an ADMIN_WAIT state.
ManageFaults = ALL: Execute the clean entry point; fault the resource and the service group.
FaultPropagation = 0: Do not take any other resource offline.
FaultPropagation = 1: Take all resources in the path offline.
Several service group attributes can be used to change the default behavior of VCS while responding to resource faults.

Frozen or TFrozen
These service group attributes are used to indicate that the service group is frozen due to an administrative command. When a service group is frozen, all agent actions except for monitor are disabled. If the service group is temporarily frozen using the hagrp -freeze group command, the TFrozen attribute is set to 1. If the service group is persistently frozen using the hagrp -freeze group -persistent command, the Frozen attribute is set to 1. When the service group is unfrozen using the hagrp -unfreeze group [-persistent] command, the corresponding attribute is set back to the default value of 0.

ManageFaults
The ManageFaults attribute can be used to prevent VCS from taking any automatic actions whenever a resource failure is detected. Essentially, ManageFaults determines whether VCS or an administrator handles faults for a service group. If ManageFaults is set to the default value of ALL, VCS manages faults by executing the clean entry point for that resource to ensure that the resource is completely offline, as shown previously.
If this attribute is set to NONE, VCS places the resource in an ADMIN_WAIT state and waits for administrative intervention. This is often used for service groups that manage database instances. You may need to leave the database in its FAULTED state in order to perform problem analysis and recovery operations.
Note: This attribute is set at the service group level. This means that any resource fault within that service group requires administrative intervention if the ManageFaults attribute for the service group is set to NONE.

FaultPropagation
The FaultPropagation attribute determines whether VCS evaluates the effects of a resource fault on parents of the faulted resource. If ManageFaults is set to ALL, VCS runs the clean entry point for the faulted resource, and then checks the FaultPropagation attribute of the service group. If this attribute is set to 0, VCS does not take any further action. In this case, VCS fails over the service group only on system failures and not on resource faults. The default value is 1, which means that VCS continues through the failover process shown in the next section.
Note: ManageFaults and FaultPropagation have essentially the same effect when enabled: service group failover is suppressed. The difference is that when ManageFaults is set to NONE, the clean entry point is not run, and the resource is put in an ADMIN_WAIT state.
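For example, to suppress automatic fault handling for a hypothetical database group named DBSG, either attribute could be set as follows:

```
hagrp -modify DBSG ManageFaults NONE       # faulted resources go to ADMIN_WAIT
hagrp -modify DBSG FaultPropagation 0      # clean runs, but no failover on resource faults
```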
AutoFailOver = 1: Bring the service group online elsewhere.
AutoFailOver = 0: Keep the service group offline.
The AutoFailOver attribute determines whether automatic failover takes place when a resource or system faults. The default value of 1 indicates that the service group should be failed over to other available systems if at all possible. However, if the attribute is set to 0, no automatic failover is attempted for the service group, and the service group is left in an OFFLINE|FAULTED state.
Practice Exercise
[Slide: a service group dependency tree of resources 1 through 7, with a table of cases A through G listing, for each case, which resources are noncritical and which service group attributes (ManageFaults, FaultPropagation, AutoFailOver) are set, when resource 4 faults.]
Practice: How VCS Responds to a Fault The service group illustrated in the slide demonstrates how VCS responds to faults. In each case (A, B, C, and so on), assume that the group is configured as listed and that the service group is not frozen. As an exercise, determine what occurs if the fourth resource in the group fails. For example, in case A in the slide, the clean entry point is executed for resource 4 to ensure that it is offline, and resources 7 and 6 are taken offline because they depend on 4. Because 4 is a critical resource, the rest of the resources are taken offline from top to bottom, and the group is then failed over to another system.
Failover Duration on a Resource Fault
When a resource failure occurs, application services may be disrupted until either the resource is restarted on the same system or the application services migrate to another system in the cluster. The time required to address the failure is a combination of the time required to:
- Detect the failure. A resource failure is only detected when the monitor entry point of that resource returns an offline status unexpectedly. The resource type attributes used to tune the frequency of monitoring a resource are MonitorInterval (default of 60 seconds) and OfflineMonitorInterval (default of 300 seconds).
- Fault the resource. This is related to two factors:
  - How much tolerance you want VCS to have for false failure detections. For example, in an overloaded network environment, the NIC resource can return an occasional failure even though there is nothing wrong with the physical connection. You may want VCS to verify the failure a couple of times before faulting the resource.
  - Whether or not you want to attempt a restart before failing over. For example, it may be much faster to restart a failed process on the same system rather than to migrate the entire service group to another system.
- Take the entire service group offline.
In general, the time required for a resource to be taken offline is dependent on the type of resource and what the offline procedure includes. However, VCS enables you to define the maximum time allowed for a normal offline procedure before attempting to force the resource to be taken offline. The resource type attributes related to this factor are OfflineTimeout and CleanTimeout.
- Select a failover target. The time required for the VCS policy module to determine the target system is negligible, less than one second in all cases, in comparison to the other factors.
- Bring the service group online on another system in the cluster. In most cases, in order to start an application service after a failure, you need to carry out some recovery procedures. For example, a file system's metadata needs to be checked if it is not unmounted properly, or a database needs to carry out recovery procedures, such as applying the redo logs to recover from sudden failures. Take these considerations into account when you determine the amount of time you want VCS to allow for an online process. The resource type attributes related to bringing a service group online are OnlineTimeout, OnlineWaitLimit, and OnlineRetryLimit.
For more information on attributes that affect failover, refer to the VERITAS Cluster Server Bundled Agents Reference Guide.
Adjusting Monitoring
MonitorInterval: Frequency of online monitoring Default is 60 seconds for most resource types Reduce to 10 or 20 seconds for testing Use caution when changing this value:
Lower values increase load on cluster systems. Some false resource faults can occur if resources cannot respond in the interval specified.
OfflineMonitorInterval: Frequency of offline monitoring Default is 300 seconds for most resource types Reduce to 60 seconds for testing
If you change a resource type attribute, you affect all resources of that type.
You can change some resource type attributes to facilitate failover testing. For example, you can change the monitor interval to see the results of faults more quickly. You can also adjust these attributes to affect how quickly an application fails over when a fault occurs.
- MonitorInterval: The duration (in seconds) between two consecutive monitor calls for an online or transitioning resource. The default is 60 seconds for most resource types.
- OfflineMonitorInterval: The duration (in seconds) between two consecutive monitor calls for an offline resource. If set to 0, offline resources are not monitored. The default is 300 seconds for most resource types.
Refer to the VERITAS Cluster Server Bundled Agents Reference Guide for the applicable monitor interval defaults for specific resource types.
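For example, to speed up fault detection during testing, the intervals could be lowered for a resource type; the Process type and these values are chosen only for illustration:

```
hatype -modify Process MonitorInterval 20
hatype -modify Process OfflineMonitorInterval 60
```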
Adjusting Timeouts
Timeout interval values define the maximum time within which the entry points must finish or be terminated. OnlineTimeout and OfflineTimeout: Default is 300 seconds Increase if all resources of a type require more time to be brought online or taken offline in your environment MonitorTimeout: Default is 60 seconds for most resource types
Before modifying defaults:
- Measure the online and offline times outside of VCS.
- Measure the monitor time:
  1. Fault the resource.
  2. Issue a probe.
Adjusting Timeout Values The attributes MonitorTimeout, OnlineTimeout, and OfflineTimeout indicate the maximum time (in seconds) within which the monitor, online, and offline entry points must finish or be terminated. The default for the MonitorTimeout attribute is 60 seconds. The defaults for the OnlineTimeout and OfflineTimeout attributes are 300 seconds. For best results, measure the length of time required to bring a resource online, take it offline, and monitor it before modifying the defaults. Simply issue an online or offline command to measure the time required for each action. To measure how long it takes to monitor a resource, fault the resource, and then issue a probe, or bring the resource online outside of VCS control and issue a probe.
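If your measurements show that resources of a type routinely need more than the default 300 seconds, you might raise the timeouts; the Mount type and the value 600 here are illustrative only:

```
hatype -modify Mount OnlineTimeout 600
hatype -modify Mount OfflineTimeout 600
```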
ConfInterval: Determines the amount of time that must elapse before restart and tolerance counters are reset to zero. Default: 600 seconds.
ToleranceLimit: Enables the monitor entry point to return OFFLINE several times before the resource is declared FAULTED. Default: 0.
Type Attributes Related to Resource Faults
Although the failover capability of VCS helps to minimize the disruption of application services when resources fail, the process of migrating a service to another system can be time-consuming. In some cases, you may want to attempt to restart a resource on the same system before failing it over to another system. Whether a resource can be restarted depends on the application service:
- The resource must be successfully cleared (taken offline) after failure.
- The resource must not be a child resource with dependent parent resources that must be restarted.
If you have determined that a resource can be restarted without impacting the integrity of the application, you can potentially avoid service group failover by configuring the RestartLimit, ConfInterval, and ToleranceLimit resource type attributes. For example, you can set the ToleranceLimit to a value greater than 0 to allow the monitor entry point to run several times before a resource is determined to be faulted. This is useful when the system is very busy and a service, such as a database, is slow to respond.
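For example, to allow one restart and some monitor tolerance for all Process resources (the type and values are illustrative):

```
hatype -modify Process RestartLimit 1
hatype -modify Process ConfInterval 180
hatype -modify Process ToleranceLimit 2
```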
Restart Example
RestartLimit = 1 The resource is restarted one time within the ConfInterval timeframe. ConfInterval = 180 The resource can be restarted once within a three-minute interval. MonitorInterval = 60 seconds (default value) The resource is monitored every 60 seconds.
[Slide: a timeline of monitor cycles at MonitorInterval spacing, showing the resource going from Online to Offline, a Restart within the ConfInterval window, a second Offline, and the resource finally marked Faulted.]
Restart Example
This example illustrates how the RestartLimit and ConfInterval attributes can be configured for modifying the behavior of VCS when a resource is faulted. Setting RestartLimit = 1 and ConfInterval = 180 has this effect when a resource faults:
1. The resource stops after running for 10 minutes.
2. The next monitor returns offline.
3. The ConfInterval counter is set to 0.
4. The agent checks the value of RestartLimit.
5. The resource is restarted because RestartLimit is set to 1, which allows one restart within the ConfInterval counter.
6. The next monitor returns online.
7. The ConfInterval counter is now 60; one monitor cycle has completed.
8. The resource stops again.
9. The next monitor returns offline.
10. The ConfInterval counter is now 120; two monitor cycles have completed.
11. The resource is not restarted because the RestartLimit counter is now 1 and the ConfInterval counter is 120 (seconds). Because the resource has not been online for the ConfInterval time of 180 seconds, it is not restarted.
12. VCS faults the resource.
If the resource had remained online for 180 seconds, the internal RestartLimit counter would have been reset to 0.
- Can be used to optimize agents
- Is applied to all resources of the specified type

hatype -modify NIC ToleranceLimit 2
You can modify the resource type attributes to affect how an agent monitors all resources of a given type. For example, agents usually check their online resources every 60 seconds. You can modify that period so that the resource type is checked more often. This is good for either testing situations or time-critical resources. You can also change the period so that the resource type is checked less often. This reduces the load on VCS overall, as well as on the individual systems, but increases the time it takes to detect resource failures. For example, to change the ToleranceLimit attribute for all NIC resources so that the agent ignores occasional network problems, type:
hatype -modify NIC ToleranceLimit 2
hares -display -ovalues myMount                Display overridden values
hares -undo_override myMount MonitorInterval   Restore default settings
Overriding Resource Type Attributes Resource type attributes apply to all resources of that type. You can override a resource type attribute to change its value for a specific resource. Use the options to hares shown on the slide or the GUI to override resource type attributes. Note: The configuration must be in read-write mode in order to modify and override resource type attributes. The changes are reflected in the main.cf file only after you save the configuration using the haconf -dump command. Some predefined static resource type attributes (those resource type attributes that do not appear in types.cf unless their value is changed, such as MonitorInterval) and all static attributes that are not predefined (static attributes that are defined in the type definition file) can be overridden. For a detailed list of predefined static attributes that can be overridden, refer to the VERITAS Cluster Server User's Guide.
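For example, overriding MonitorInterval for the single resource myMount might look like this sketch:

```
haconf -makerw
hares -override myMount MonitorInterval
hares -modify myMount MonitorInterval 10
haconf -dump -makero
```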
1. Ensure that the fault is fixed outside of VCS.
2. Either wait for the monitor cycle to run, or probe the resource manually:
   hares -probe resource -sys system
When a resource failure is detected, the resource is put into a FAULTED or an ADMIN_WAIT state depending on the cluster configuration. In either case, administrative intervention is required to bring the resource status back to normal.

Recovering a Resource from a FAULTED State
A critical resource in a FAULTED state cannot be brought online on a system. When a critical resource is FAULTED on a system, the service group status also changes to FAULTED on that system, and that system can no longer be considered as an available target during a service group failover. You have to clear the FAULTED status of a nonpersistent resource manually. Before clearing the FAULTED status, ensure that the resource is completely offline and that the fault is fixed outside of VCS.
Note: You can also run hagrp -clear group [-sys system] to clear all FAULTED resources in a service group. However, you have to ensure that all of the FAULTED resources are completely offline and the faults are fixed on all the corresponding systems before running this command.
The FAULTED status of a persistent resource is cleared when the monitor returns an online status for that resource. Note that offline resources are monitored according to the value of OfflineMonitorInterval, which is 300 seconds (five minutes) by default. To avoid waiting for the periodic monitoring, you can initiate the monitoring of the resource manually by probing the resource.
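For example, after fixing the underlying problem, a nonpersistent resource and a persistent resource might be recovered as follows; the resource and system names are hypothetical:

```
hares -clear myAppRes -sys s1     # clear FAULTED status of a nonpersistent resource
hares -probe myNIC -sys s1        # force a monitor cycle for a persistent resource
```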
Remember to set ManageFaults back to ALL after resolving the problem causing the fault.
Recovering a Resource from an ADMIN_WAIT State
If the ManageFaults attribute of a service group is set to NONE, VCS does not take any automatic action when it detects a resource fault. VCS places the resource into the ADMIN_WAIT state and waits for administrative intervention. There are two primary reasons to configure VCS in this way:
- You want to analyze and recover from the failure manually with the aim of continuing operation on the same system. In this case, fix the fault and bring the resource back to the state it was in before the failure (online state) manually outside of VCS. After the resource is back online, you can inform VCS to take the resource out of the ADMIN_WAIT state and put it back into the ONLINE state.
  Notes:
  - If the next monitor cycle does not report an online status, the resource is placed back into the ADMIN_WAIT state. If the next monitor cycle reports an online status, VCS continues normal operation without any failover.
  - If the resource is restarted outside of VCS and a monitor cycle runs before you can probe it, the resource returns to an online state automatically.
  - You cannot clear the ADMIN_WAIT state from the GUI.
- You want to collect debugging information before any action is taken. The intention in this case is to prevent VCS intervention until the failure is analyzed. You can then let VCS continue with the normal failover process. When you clear the ADMIN_WAIT state, the clean entry point runs and the resource changes status to OFFLINE|FAULTED. VCS then continues with the service group failover, depending on the cluster configuration.
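Continuing the hypothetical DBSG example, the two recovery paths map to these commands:

```
# Resource fixed and restarted manually; return it to ONLINE:
hagrp -clearadminwait DBSG -sys s1
# Diagnostics collected; run clean and let failover proceed:
hagrp -clearadminwait -fault DBSG -sys s1
```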
Fault Notification
Event                                                              Notification
A resource becomes offline unexpectedly.                           Send notification (Error). E-mail ResourceOwner (if configured).
A resource cannot be taken offline.                                Send notification (Warning). E-mail ResourceOwner (if configured).
The service group is faulted due to a critical resource fault.     Send notification (SevereError). E-mail GroupOwner (if configured).
The service group is brought online or taken offline successfully. Send notification (Information). E-mail GroupOwner (if configured).
The failover target does not exist.                                Send notification (Error). E-mail GroupOwner (if configured).
As a response to a resource fault, VCS carries out tasks to take resources or service groups offline and to bring them back online elsewhere in the cluster. While carrying out these tasks, VCS generates certain messages with a variety of severity levels, and the VCS engine passes these messages to the notifier daemon. Whether these messages are used for SNMP traps or SMTP notification depends on how the notification component of VCS is configured, as described in the Configuring Notification lesson. The following events are examples that result in a notification message being generated:
- A resource becomes offline unexpectedly; that is, a resource is faulted.
- VCS cannot take the resource offline.
- A service group is faulted, and there is no failover target available.
- The service group is brought online or taken offline successfully.
- The service group has faulted on all nodes where the group could be brought online, and there are no nodes to which the group can fail over.
Extended Event Handling Using Triggers
You can use triggers to customize how VCS responds to events that occur in the cluster. For example, you can use the ResAdminWait trigger to automate the task of collecting diagnostics from the application as part of the failover and recovery process. If you set ManageFaults to NONE for a service group, VCS places faulted resources into the ADMIN_WAIT state. If the ResAdminWait trigger is configured, VCS runs the script when a resource enters ADMIN_WAIT. Within the trigger script, you can run a diagnostic tool, log information about the resource, and then take a desired action, such as clearing the state and faulting the resource:
hagrp -clearadminwait -fault group -sys system
The Role of Triggers in Resource Faults
As a response to a resource fault, VCS carries out tasks to take resources or service groups offline and to bring them back online elsewhere in the cluster. While these tasks are being carried out, certain events take place. If corresponding event triggers are configured, VCS executes the trigger scripts, as shown in the slide. Triggers are placed in the /opt/VRTSvcs/bin/triggers directory. Sample trigger scripts are provided in /opt/VRTSvcs/bin/sample_triggers. Trigger configuration is described in the VERITAS Cluster Server User's Guide and the High Availability Design and Customization Using VERITAS Cluster Server Virtual Academy training course.
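As an illustration, a ResAdminWait trigger could look like the following sketch. The argument order (system, then resource) and the log location are assumptions for illustration only; check the product documentation for the exact trigger interface. The hagrp call is commented out so the script can be exercised outside a cluster.

```shell
#!/bin/sh
# Sketch of a ResAdminWait trigger script (would be placed in
# /opt/VRTSvcs/bin/triggers). Argument order and log path are assumed.

LOG=${RESADMINWAIT_LOG:-./resadminwait.log}

resadminwait_trigger() {
    system=$1
    resource=$2

    # Collect diagnostics before letting VCS continue with failover.
    {
        echo "resource $resource entered ADMIN_WAIT on system $system"
        # Site-specific diagnostics (process lists, application logs) go here.
    } >> "$LOG"

    # In a real cluster, clear the state and fault the resource so that the
    # normal failover process continues (group name is site-specific):
    # hagrp -clearadminwait -fault <group> -sys "$system"
}

# Example invocation with hypothetical system and resource names:
resadminwait_trigger S1 webres
```

On a live cluster the script would be invoked by HAD itself; here the example call simply demonstrates the logging side of the trigger.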
Lesson Summary
Key Points
You can customize how VCS responds to faults by configuring attributes. Failover duration can also be adjusted to meet your specific requirements.
Reference Materials
VERITAS Cluster Server Bundled Agent Reference Guide VERITAS Cluster Server User's Guide High Availability Design and Customization Using VERITAS Cluster Server Virtual Academy course
Labs and solutions for this lesson are located on the following pages. "Lab 12: Configuring Resource Fault Behavior," page A-83. "Lab 12 Solutions: Configuring Resource Fault Behavior," page B-143.
Lesson Introduction
Lesson 1: High Availability Concepts Lesson 2: VCS Building Blocks Lesson 3: Preparing a Site for VCS Lesson 4: Installing VCS Lesson 5: VCS Operations Lesson 6: VCS Configuration Methods Lesson 7: Preparing Services for VCS Lesson 8: Online Configuration Lesson 9: Offline Configuration Lesson 10: Sharing Network Interfaces Lesson 11: Configuring Notification Lesson 12: Configuring VCS Response to Faults Lesson 13: Cluster Communications Lesson 14: System and Communication Faults Lesson 15: I/O Fencing Lesson 16: Troubleshooting
LLT sends a heartbeat on each interface every second.
Each LLT module tracks the status of heartbeats from each peer on each interface.
LLT forwards the heartbeat status of each node to GAB.
VCS Inter-Node Communications In order to replicate the state of the cluster to all cluster systems, VCS must determine which systems are participating in the cluster membership. This is accomplished by the group membership services mechanism of GAB. Cluster membership refers to all systems configured with the same cluster ID and interconnected by a pair of redundant Ethernet LLT links. Under normal operation, all systems configured as part of the cluster during VCS installation actively participate in cluster communications. Systems join a cluster by issuing a cluster join message during GAB startup. Cluster membership is maintained by heartbeats. Heartbeats are signals sent periodically from one system to another to determine system state. Heartbeats are transmitted by the LLT protocol. VCS Communications Stack Summary The hierarchy of VCS mechanisms that participate in maintaining and communicating cluster membership and status information is shown in the slide diagram. Agents communicate with had. The had processes on each system communicate status information by way of GAB. GAB determines cluster membership by monitoring heartbeats transmitted from each system over LLT.
Low-priority:
- Heartbeats every second
- No cluster status sent
- Automatically promoted to high priority if there are no high-priority links functioning
- Can be configured on public network interfaces
Cluster Interconnect Specifications LLT can be configured to designate links as high-priority or low-priority links. High-priority links are used for cluster communications (GAB) as well as heartbeats. Low-priority links carry only heartbeats unless there is a failure of all configured high-priority links. At this time, LLT switches cluster communications to the first available low-priority link. Traffic reverts to high-priority links as soon as they are available. Later lessons provide more detail about how VCS handles link failures in different environments.
Port a gen a36e003 membership 01 ; ;12
Port h gen fd57002 membership 01 ; ;12
Port a indicates that GAB is communicating; port h indicates that HAD is communicating. The semicolons mark the 10s-digit positions (a 0 is displayed in that position if node 10 is a member of the cluster); the digits "12" following the second semicolon indicate nodes 21 and 22.
Cluster with 22 nodes:
Port a gen a45e098 membership 0123456789012345678901
Port h gen fe25061 membership 0123456789012345678901
Cluster Membership
GAB Status and Membership Notation To display the cluster membership status, type gabconfig on each system. For example:
gabconfig -a
The first example in the slide shows:
- Port a, GAB membership, has four nodes: 0, 1, 21, and 22.
- Port h, VCS membership, has four nodes: 0, 1, 21, and 22.
Note: The port a and port h generation numbers change each time the membership changes.
GAB Membership Notation
The gabconfig output uses a positional notation to indicate which systems are members of the cluster. Only the last digit of each node number is displayed, relative to semicolons that indicate the 10s digit. The second example shows gabconfig output for a cluster with 22 nodes.
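The positional notation can be illustrated with a small decoder. This is a sketch based only on the examples above (each digit is a member node's ID modulo 10, digits ascend within each "decade" of 0-9, 10-19, and so on, and each ';' closes out the current decade); it is not the documented grammar of gabconfig output.

```shell
#!/bin/sh
# Sketch: decode the positional membership string printed by 'gabconfig -a'.
# The decoding rules here are inferred from the examples in the text.
decode_membership() {
    printf '%s\n' "$1" | awk '{
        decade = 0; prev = -1; out = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (c == ";") { decade++; prev = -1 }
            else if (c >= "0" && c <= "9") {
                d = c + 0
                if (d <= prev) decade++      # digits wrapped into next decade
                prev = d
                if (out != "") out = out " "
                out = out (decade * 10 + d)
            }
        }
        print out
    }'
}

decode_membership "01 ; ;12"               # prints: 0 1 21 22
decode_membership "0123456789012345678901" # prints node IDs 0 through 21
```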
Node    State   Link   Status   Address
0 S1    OPEN    qfe0   UP       08:00:20:AD:BC:78
                qfe4   UP       08:00:20:AD:BC:79
1 S2    OPEN    qfe0   UP       08:00:20:B4:0C:3B
                qfe4   UP       08:00:20:B4:0C:3C
Viewing LLT Link Status The lltstat Command Use the lltstat command to verify that links are active for LLT. This command returns information about the LLT links for the system on which it is typed. In the example shown in the slide, lltstat -nvv is typed on the S1 system to produce the LLT status in a cluster with two systems. The -nvv options cause lltstat to list systems with very verbose status: Link names from llttab Status MAC address of the Ethernet ports Other lltstat uses: Without options, lltstat reports whether LLT is running. The -c option displays the values of LLT configuration directives. The -l option lists information about each configured LLT link. Note: This level of detailed information about LLT links is only available through the CLI. Basic status is shown in the GUI.
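A quick scripted check of link status can be built on this output. The sample data below is reconstructed from the example in the text (link name, status, MAC address); real lltstat -nvv output has headers and additional columns, so the parsing here is illustrative only.

```shell
#!/bin/sh
# Sketch: flag any LLT link that is not UP, from lltstat-style output.
# Sample data is inlined; a real check would parse 'lltstat -nvv'.
sample='qfe0 UP 08:00:20:AD:BC:78
qfe4 UP 08:00:20:AD:BC:79'

# Count lines whose second field (status) is not UP.
down=$(printf '%s\n' "$sample" | awk '$2 != "UP" { n++ } END { print n + 0 }')

if [ "$down" -eq 0 ]; then
    echo "all LLT links UP"
else
    echo "$down LLT link(s) not UP"
fi
```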
Configuration Overview
The cluster interconnect is automatically configured during installation. You may never need to modify any portion of the interconnect configuration. Interconnect configuration and functional details are provided to give you a complete understanding of the VCS architecture. Knowing how a cluster membership is formed and maintained is necessary for understanding effects of system and communications faults, described in later lessons.
/etc/llttab (Solaris example):
set-cluster 10
set-node S1
link qfe0 /dev/qfe:0 - ether - -
link qfe4 /dev/qfe:4 - ether - -
Corresponding examples for AIX, HP-UX, and Linux use platform-specific device names.
LLT Configuration Files
The LLT configuration files are located in the /etc directory.
The llttab File
The llttab file is the primary LLT configuration file and is used to:
- Set the cluster ID number.
- Set system ID numbers.
- Specify the network device names used for the cluster interconnect.
- Modify LLT behavior, such as heartbeat frequency.
Note: Ensure that there is only one set-node line in the llttab file.
The example shows the minimum recommended set of directives required to configure LLT. The basic format of the file is an LLT configuration directive followed by a value. These directives and their values are described in more detail in the next sections. For example, you can add the exclude directive in llttab to eliminate information about nonexistent systems. For a complete list of directives, see the sample-llttab file in the /opt/VRTS/llt directory and the llttab manual page. See the lab exercise for this lesson for sample llttab files for other platforms.
/etc/llthosts:
0 S1
1 S2
How Node and Cluster Numbers Are Specified A unique number must be assigned to each system in a cluster using the set-node directive. Each system in the cluster must have a unique llttab file, which has a unique value for set-node, which can be one of the following: An integer in the range of 0 through 31 (32 systems per cluster maximum) A system name matching an entry in /etc/llthosts The set-cluster Directive LLT uses the set-cluster directive to assign a unique number to each cluster. Although a cluster ID is optional when only one cluster is configured on a physical network, you should always define a cluster ID. This ensures that each system only joins other systems with the same cluster ID to form a cluster. If LLT detects multiple systems with the same node ID and cluster ID on a private network, the LLT interface is disabled on the node that is starting up. This prevents a possible split-brain condition, where a service group might be brought online on the two systems with the same node ID. Note: You can use the same cluster interconnect network infrastructure for multiple clusters. The llttab file must specify the appropriate cluster ID to ensure that there are no conflicting node IDs.
The llthosts File The llthosts file associates a system name with a VCS cluster node ID number. This file must be present in the /etc directory on every system in the cluster. It must contain a line with the unique name and node ID for each system in the cluster. The format is: The critical requirements for llthosts entries are: Node numbers must be unique. If duplicate node IDs are detected on the Ethernet LLT cluster interconnect, LLT in VCS 4.0 is stopped on the joining node. In VCS versions before 4.0, the joining node panics. The system name must match the name in llttab if a name is configured for the set-node directive (rather than a number). System names must match those in main.cf, or VCS cannot start. Note: The system (node) name does not need to be the UNIX host name found using the hostname command. However, VERITAS recommends that you keep the names the same to simplify administration, as described in the next section. See the llthosts manual page for a complete description of the file.
node_number name
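The name-matching requirement above can be checked mechanically. A minimal sketch, with the file contents inlined as sample data (a real check would read /etc/llttab and /etc/llthosts instead):

```shell
#!/bin/sh
# Sketch: verify that the set-node value in llttab has a matching
# entry (by node number or name) in llthosts. Sample data inlined.
llttab='set-node S1
set-cluster 10
link qfe0 /dev/qfe:0 - ether - -
link qfe4 /dev/qfe:4 - ether - -'

llthosts='0 S1
1 S2'

# Extract the set-node value from llttab.
node=$(printf '%s\n' "$llttab" | awk '$1 == "set-node" { print $2 }')

# Look for that value in either column of llthosts.
if printf '%s\n' "$llthosts" | awk -v n="$node" '$1 == n || $2 == n { found = 1 } END { exit !found }'; then
    echo "set-node $node has a matching llthosts entry"
else
    echo "set-node $node not found in llthosts" >&2
fi
```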
/etc/VRTSvcs/conf/sysname:
S1
The sysname File The sysname file is an optional LLT configuration file that is configured automatically during VCS installation. This file is used to store the short-form of the system (node) name. The purpose of the sysname file is to remove VCS dependence on the UNIX uname utility for determining the local system name. On some versions of UNIX, uname returns a fully qualified domain name (sys.company.com) and VCS cannot match the name to the systems in the main.cf cluster configuration and therefore cannot start on that system. See the sysname manual page for a complete description of the file.
Specifies the number of systems that must be communicating to allow VCS to start.
/etc/gabtab:
/sbin/gabconfig -c -n 4
The GAB Configuration File GAB is configured with the /etc/gabtab file. This file contains one line that is used to start GAB. For example:
/sbin/gabconfig -c -n 4
This example starts GAB and specifies that four systems must be running GAB before the cluster can seed. A sample gabtab file is included in /opt/VRTSgab. Note: Other gabconfig options are discussed later in this lesson. See the gabconfig manual page for a complete description of the file.
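As a sanity check, the -n seed count can be compared against the number of systems listed in llthosts. A sketch with inlined sample file contents (a hypothetical four-node cluster):

```shell
#!/bin/sh
# Sketch: cross-check the -n seed count in gabtab against the number of
# systems in llthosts. Sample file contents are inlined for illustration.
gabtab='/sbin/gabconfig -c -n 4'
llthosts='0 S1
1 S2
2 S3
3 S4'

# Pull the value following -n out of the gabtab line.
seed=$(printf '%s\n' "$gabtab" | awk '{ for (i = 1; i <= NF; i++) if ($i == "-n") print $(i + 1) }')

# Count non-empty lines in llthosts.
count=$(printf '%s\n' "$llthosts" | awk 'NF { n++ } END { print n + 0 }')

if [ "$seed" -eq "$count" ]; then
    echo "gabtab seed count $seed matches $count configured systems"
else
    echo "mismatch: gabtab -n $seed vs $count systems in llthosts"
fi
```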
1. LLT starts on each system.
2. GAB starts on each system with a seed value equal to the number of systems in the cluster: gabconfig -c -n 3
3. When GAB sees three members, the cluster is seeded. HAD starts only after GAB is communicating on all systems.
When the cluster is seeded, each node is listed in the port a membership displayed by gabconfig -a. In the following example, all four systems (nodes 0, 1, 2, and 3) are seeded, as shown by port a membership:
# gabconfig -a GAB Port Memberships ======================================================= Port a gen a356e003 membership 0123
LLT, GAB, and VCS Startup Files These startup files are placed on the system when VCS is installed.
On each platform (Solaris, AIX, HP-UX, and Linux), the startup scripts:
- Check for /etc/llttab and run /sbin/lltconfig -c to start LLT
- Call /etc/gabtab
- Run /opt/VRTSvcs/bin/hastart
Manual Seeding
1. S3 is down for maintenance; S1 and S2 are rebooted.
2. LLT starts on S1 and S2.
3. GAB cannot seed with S3 down.
4. Start GAB on S1 manually and force it to seed: gabconfig -c -x. Then start GAB on S2 with gabconfig -c; S2 seeds because it can detect another seeded system (S1).
5. Start HAD on S1 and S2.
Manual Seeding You can override the seed values in the gabtab file and manually force GAB to seed a system using the gabconfig command. This is useful when one of the systems in the cluster is out of service and you want to start VCS on the remaining systems. To seed the cluster, start GAB on one node with -x to override the -n value set in the gabtab file. For example, type: gabconfig -c -x
Warning: Only manually seed the cluster when you are sure that no other systems have GAB seeded. In clusters that do not use I/O fencing, you can potentially create a split-brain condition by using gabconfig improperly.
After you have started GAB on one system, start GAB on the other systems using gabconfig with only the -c option. You do not need to force GAB to start with the -x option on the other systems. When GAB starts on the other systems, it determines that the cluster is already seeded and starts up.
1. During startup, HAD autodisables service groups.
2. HAD directs agents to probe (monitor) all resources on all systems in the SystemList to determine their status.
3. If agents successfully probe resources, HAD brings service groups online according to the AutoStart and AutoStartList attributes.
Probing Resources During Startup During initial startup, VCS autodisables a service group until all its resources are probed on all systems in the SystemList that have GAB running. When a service group is autodisabled, VCS sets the AutoDisabled attribute to 1 (true), which prevents the service group from starting on any system. This protects against a situation where enough systems are running LLT and GAB to seed the cluster, but not all systems have HAD running. In this case, port a membership is complete, but port h is not. VCS cannot detect whether a service is running on a system where HAD is not running. Rather than allowing a potential concurrency violation to occur, VCS prevents the service group from starting anywhere until all resources are probed on all systems. After all resources are probed on all systems, a service group can come online by bringing offline resources online. If the resources are already online, as in the case where HAD has been stopped with the hastop -all -force option, the resources are marked as online.
Example Scenarios
Cluster interconnect configuration is required for:
- Adding or removing cluster nodes
- Merging clusters
- Changing communication parameters, such as the heartbeat time interval
- Changing recovery behavior
- Changing or adding interfaces used for the cluster interconnect
- Configuring additional network heartbeat links
Stop VCS
Stop GAB: gabconfig -U
Stop LLT: lltconfig -U
Start LLT: lltconfig -c
Start GAB: sh /etc/gabtab
Start VCS: hastart
Modifying the Cluster Interconnect Configuration The process shown in the diagram can be used for any type of change to the VCS communications configuration. The first task refers to the procedure provided in the Offline Configuration lesson.
Although some types of modifications do not require you to stop both GAB and LLT, using this procedure ensures that any type of change you make takes effect. For example, if you added a system to a running cluster, you can change the value of -n in the gabtab file without having to restart GAB. However, if you added the -j option to change the recovery behavior, you must either restart GAB or execute the gabtab command manually for the change to take effect. Similarly, if you add a host entry to llthosts, you do not need to restart LLT. However, if you change llttab, or you change a host name in llthosts, you must stop and restart LLT and, therefore, GAB. You can also use the scripts in the rc*.d directories to stop and start services. Note: On Solaris, you must also unload the LLT and GAB modules if you are removing a system from the cluster or upgrading LLT or GAB binaries. For example:
modinfo | grep gab
modunload -i gab_id
modinfo | grep llt
modunload -i llt_id
Lesson 13 Cluster Communications
/etc/llttab
set-node S1
set-cluster 10
# Solaris example
link qfe0 /dev/qfe:0 - ether - -
link qfe4 /dev/qfe:4 - ether - -
link qfe5 /dev/qfe:5 - ether - -
link-lowpri eri0 /dev/eri:0 - ether - -
Example LLT Link Specification
You can add links to the LLT configuration as additional layers of redundancy for the cluster interconnect. You may want an additional interconnect link for:
- VCS, for heartbeat redundancy
- Storage Foundation for Oracle RAC, for additional bandwidth
To add an Ethernet link to the cluster interconnect:
1. Cable the link on all systems.
2. Use the process on the previous page to modify the llttab file on each system to add the new link directive.
To add a low-priority public network link, add a link-lowpri directive using the same syntax as the link directive, as shown in the llttab file example in the slide. VCS uses the low-priority link only for heartbeats (at half the normal rate), unless it is the only remaining link in the cluster interconnect.
Lesson Summary
Key Points
The cluster interconnect is used for cluster membership and status information. The cluster interconnect configuration may never require modification, but can be altered for site-specific requirements.
Reference Materials
VERITAS Cluster Server Installation Guide VERITAS Cluster Server User's Guide
In this lab, you: Set the exclude directive to simplify lltstat display output. Add a low-priority LLT link on your public network interface.
Labs and solutions for this lesson are located on the following pages. "Lab 13: Configuring LLT," page A-91. "Lab 13 Solutions: Configuring LLT," page B-157.
Lesson Introduction
Lesson 1: High Availability Concepts Lesson 2: VCS Building Blocks Lesson 3: Preparing a Site for VCS Lesson 4: Installing VCS Lesson 5: VCS Operations Lesson 6: VCS Configuration Methods Lesson 7: Preparing Services for VCS Lesson 8: Online Configuration Lesson 9: Offline Configuration Lesson 10: Sharing Network Interfaces Lesson 11: Configuring Notification Lesson 12: Configuring VCS Response to Faults Lesson 13: Cluster Communications Lesson 14: System and Communication Faults Lesson 15: I/O Fencing Lesson 16: Troubleshooting
Topic
Ensuring Data Integrity Cluster Interconnect Failures
No Membership: S3
VCS Response to System Failure The example cluster used throughout most of this section contains three systems, S1, S2, and S3, each of which can run any of the three service groups, A, B, and C. The abbreviated system and service group names are used to simplify the diagrams. In this example, there are two Ethernet LLT links for the cluster interconnect. Prior to any failures, systems S1, S2, and S3 are part of the regular membership of cluster number 1. When system S3 fails, it is no longer part of the cluster membership. Service group C fails over and starts up on either S1 or S2, according to the SystemList values.
Failover Duration on a System Failure When a system faults, application services that were running on that system are disrupted until the services are started up on another system in the cluster. The time required to address a system fault is a combination of the time required to: Detect the system failure. A system is determined to be faulted according to these default timeout periods: LLT timeout: If LLT on a running system does not receive a heartbeat from a system for 16 seconds, LLT notifies GAB of a heartbeat failure. GAB stable timeout: GAB determines that a membership change is occurring, and after five seconds, GAB delivers the membership change to HAD. Select a failover target. The time required for the VCS policy module to determine the target system is negligible, less than one second in all cases, in comparison to the other factors. Bring the service group online on another system in the cluster. As described in an earlier lesson, the time required for the application service to start up is a key factor in determining the total failover time.
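The components above can be added up for a rough estimate. The LLT and GAB values are the defaults stated in the text; the application start time is an assumed placeholder that you would replace with a measured value for your service.

```shell
#!/bin/sh
# Back-of-the-envelope failover-duration estimate using the default
# timeouts described in the text. APP_START is a site-specific assumption.
LLT_TIMEOUT=16      # seconds before LLT declares a peer inactive
GAB_STABLE=5        # seconds GAB waits before delivering the membership change
TARGET_SELECTION=1  # policy decision; negligible (under one second)
APP_START=40        # assumed application startup time (replace with measured value)

TOTAL=$((LLT_TIMEOUT + GAB_STABLE + TARGET_SELECTION + APP_START))
echo "estimated failover duration: ${TOTAL}s"
```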
gabconfig -a
GAB Port Memberships
================================
Port a gen a36e003 membership 012
Port a gen a36e003 jeopardy ; 2
Port h gen fd57002 membership 012
Port h gen fd57002 jeopardy ; 2
Jeopardy Membership: S3
Jeopardy Membership
Jeopardy is:
- A special cluster membership
- Formed when one or more systems have only a single LLT link
- Returned to normal membership when links are fixed
Effects of jeopardy:
- The cluster functions normally.
- Service groups continue to run.
- Failover and switching actions are unaffected.
- Service groups running on a system in jeopardy cannot fail over if the system faults or loses its last link.
Jeopardy Membership
When a system is down to a single LLT link, VCS can no longer reliably discriminate between loss of a system and loss of the last LLT connection. Systems with only a single LLT link are put into a special cluster membership known as jeopardy. Jeopardy is a mechanism for preventing a split-brain condition if the last LLT link fails. If a system is in a jeopardy membership and then loses its final LLT link:
- Service groups in the jeopardy membership are autodisabled in the regular cluster membership.
- Service groups in the regular membership are autodisabled in the jeopardy membership.
Recovering from a Jeopardy Membership
Recovery from a single LLT link failure is simple: fix and reconnect the link. When GAB detects that the link is functioning again and the system in jeopardy again has reliable (redundant) communication with the other cluster systems, the jeopardy membership is removed.
1. Jeopardy membership: S3.
2. Mini-cluster with regular membership: S1, S2. Mini-cluster with regular membership: S3. No jeopardy membership.
3. Service groups autodisabled.
Transition from Jeopardy to Network Partition If the last LLT link fails: A new regular cluster membership is formed that includes only systems S1 and S2. This is referred to as a mini-cluster. A new separate membership is created for system 3, which is a mini-cluster with a single system. Because system S3 was in a jeopardy membership prior to the last link failing: Service group C is autodisabled in the mini-cluster containing systems S1 and S2 to prevent either system from starting it. Service groups A and B are autodisabled in the cluster membership for system S3 to prevent system S3 from starting either one. Service groups A and B can still fail over between systems S1 and S2. In this example, the cluster interconnect has partitioned and two separate cluster memberships have formed as a result, one on each side of the partition. Each of the mini-clusters continues to operate. However, because they cannot communicate, each maintains and updates only its own version of the cluster configuration and the systems on different sides of the network partition have different cluster configurations.
Recovering from a Network Partition
After a cluster is partitioned, reconnecting the LLT links must be undertaken with care because each mini-cluster has its own separate cluster configuration. You must enable the cluster configurations to resynchronize by stopping VCS on the systems on one side of the network partition. When you reconnect the interconnect, GAB rejoins the regular cluster membership, and you can then start VCS using hastart so that VCS rebuilds the cluster configuration from the other running systems in the regular cluster membership.
To recover from a network partition:
1. On the side with the fewest systems (S3, in this example), stop VCS and leave services running:
hastop -all -force
Note: Although this example has only one system (S3) on one side of the network partition, there could be multiple systems on each side of the partition. You must stop VCS on all systems on one side of the partition.
2. Recable or fix the LLT links. Use gabconfig -a to verify that the links are up.
3. Restart VCS. VCS autoenables all service groups so that failover can occur:
hastart
Multisystem cluster:
- The mini-cluster with the most systems running continues to run VCS. VCS is stopped on the systems in the smaller mini-clusters.
- If split into two equal-size mini-clusters, the mini-cluster containing the lowest node number continues to run VCS.
Recovery Behavior
When a cluster partitions because the cluster interconnect has failed, each of the mini-clusters continues to operate. However, because they cannot communicate, each maintains and updates only its own version of the cluster configuration, and the systems on different sides of the network partition have different cluster configurations. If you reconnect the LLT links without first stopping VCS on one side of the partition, GAB automatically stops HAD on selected systems in the cluster to protect against a potential split-brain scenario. GAB protects the cluster as follows:
- In a two-system cluster, the system with the lowest LLT node number continues to run VCS, and VCS is stopped on the higher-numbered system.
- In a multisystem cluster, the mini-cluster with the most systems running continues to run VCS. VCS is stopped on the systems in the smaller mini-clusters.
- If a multisystem cluster is split into two equal-size mini-clusters, the mini-cluster containing the lowest node number continues to run VCS.
To summarize the rule of recovery behavior: if the cluster divides into an even split with the same number of nodes on each side of the partition, the side containing the lowest LLT node ID continues to run; otherwise, the larger side continues to run.
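The summarized rule can be expressed as a small function. This is purely an illustration of the rule as stated above, not product code; the node ID lists are hypothetical.

```shell
#!/bin/sh
# Sketch of the recovery rule: given the LLT node IDs on each side of a
# reconnected partition, decide which side keeps running VCS.
surviving_side() {
    # $1 and $2 are space-separated node ID lists for side A and side B.
    a_count=$(echo $1 | wc -w)
    b_count=$(echo $2 | wc -w)
    if [ "$a_count" -gt "$b_count" ]; then
        echo A                      # larger side survives
    elif [ "$b_count" -gt "$a_count" ]; then
        echo B
    else
        # Equal split: the side containing the lowest LLT node ID survives.
        a_min=$(printf '%s\n' $1 | sort -n | head -1)
        b_min=$(printf '%s\n' $2 | sort -n | head -1)
        [ "$a_min" -lt "$b_min" ] && echo A || echo B
    fi
}

surviving_side "0 1" "2"      # prints: A (larger side)
surviving_side "0 1" "2 3"    # prints: A (equal split; contains node 0)
```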
S3 determines that S1 and S2 are faulted. S1 and S2 determine that S3 is faulted. No jeopardy occurs, so no service groups are autodisabled. If all systems are in the service groups' SystemList, VCS tries to bring the groups online on a failover target.
Potential Split-Brain Condition
When both LLT links fail simultaneously:
- The cluster partitions into two separate clusters.
- Each cluster determines that the other systems are down and tries to start the service groups.
- If an application starts on multiple systems and can gain control of what are normally exclusive resources, such as disks in a shared storage device, a split-brain condition results and data can be corrupted.
Interconnect Failures with a Low-Priority Public Link
LLT can be configured to use a low-priority network link as a backup to normal heartbeat channels. Low-priority links are typically configured on the public network or administrative network. In normal operation, the low-priority link carries only heartbeat traffic for cluster membership and link state maintenance. The frequency of heartbeats is reduced by half to minimize network overhead. When the low-priority link is the only remaining LLT link, LLT switches all cluster status traffic over the link. Upon repair of any configured link, LLT switches cluster status traffic back to the high-priority link.
Notes:
- Nodes must be on the same public network segment in order to configure low-priority links. LLT is a non-routable protocol.
- You can have up to eight LLT links in total, which can be a combination of low- and high-priority links.
- If you have three high-priority links in the scenario shown in the slide, you have the same progression to jeopardy membership. The difference is that all three links are used for regular heartbeats and cluster status information.
1. S3 faults; service group C is started on S1 or S2. Regular membership: S1, S2.
2. LLT links to S3 are disconnected.
3. S3 reboots; S3 cannot start HAD because GAB on S3 can detect only one member. No membership: S3.
Preexisting Network Partition A preexisting network partition occurs if LLT links fail while a system is down. If the system comes back up and starts running services without being able to communicate with the rest of the cluster, a split-brain condition can result. When a preexisting network partition occurs, VCS prevents systems on one side of the partition from starting applications that may already be running by preventing HAD from starting on those systems. In the scenario shown in the diagram, system S3 cannot start HAD when it reboots because the network failure prevents GAB from communicating with any other cluster systems; therefore, system S3 cannot seed.
Lesson Summary
Key Points
Use redundant cluster interconnect links to minimize interruption to services. Use a standard procedure for modifying the interconnect configuration when changes are required.
Reference Materials
VERITAS Cluster Server Installation Guide VERITAS Cluster Server User's Guide
Labs and solutions for this lesson are located on the following pages. "Lab 14: Testing Communication Failures," page A-95. "Lab 14 Solutions: Testing Communication Failures," page B-161.
Course Overview
Lesson 1: High Availability Concepts Lesson 2: VCS Building Blocks Lesson 3: Preparing a Site for VCS Lesson 4: Installing VCS Lesson 5: VCS Operations Lesson 6: VCS Configuration Methods Lesson 7: Preparing Services for VCS Lesson 8: Online Configuration Lesson 9: Offline Configuration Lesson 10: Sharing Network Interfaces Lesson 11: Configuring Notification Lesson 12: Configuring VCS Response to Faults Lesson 13: Cluster Communications Lesson 14: System and Communication Faults Lesson 15: I/O Fencing Lesson 16: Troubleshooting
Heartbeats travel on the cluster interconnect, sending "I am alive" messages. Applications (service groups) run in the cluster, and their current status is known.
System Failure
A system failure is detected when the "I am alive" heartbeats are no longer seen coming from a given node. VCS then takes corrective action to fail over the service group from the failed server.
System Failure
In order to keep services highly available, the cluster software must be capable of taking corrective action when a system fails. Most cluster implementations are lights-out environments; the HA software must automatically respond to faults without administrator intervention. Example corrective actions are:
Starting an application on another node
Reconfiguring parallel applications to no longer include the departed node in locking operations
The animation shows conceptually how VCS handles a system fault. The yellow service group that was running on Server 2 is brought online on Server 1 after GAB on Server 1 stops receiving heartbeats from Server 2 and notifies HAD.
Interconnect Failure
If the interconnect fails between the clustered systems:
The symptoms look the same as a system failure.
However, VCS should not take corrective action and fail over the service groups.
Interconnect Failure
A key function of a high availability solution is to detect and respond to system faults. However, a system may still be running but unable to communicate heartbeats due to a failure of the cluster interconnect. The other systems in the cluster have no way to distinguish between the two situations. This problem is faced by all HA solutions: how can the HA software distinguish a system fault from a failure of the cluster interconnect?
As shown in the example diagram, whether the system on the right side (Server 2) fails or the cluster interconnect fails, the system on the left (Server 1) no longer receives heartbeats from the other system. The HA software must have a method to prevent an uncoordinated view among systems of the cluster membership in any type of failure scenario. In the case where nodes are running but the cluster interconnect has failed, the HA software needs a way to determine how to handle the nodes on each side of the network split, or partition.
Network Partition
A network partition is formed when one or more nodes stop communicating on the cluster interconnect due to a failure of the interconnect.
(Diagram: with the interconnect down, both systems bring the same service groups online and both change block 1024 on shared storage.)
If each system were to take corrective action and bring the other system's service groups online:
Each application would be running on each system.
Data corruption can occur.
Split Brain Condition
A network partition can lead to a split brain condition, an issue faced by all cluster implementations. This problem occurs when the HA software cannot distinguish between a system failure and an interconnect failure; the symptoms look identical. For example, in the diagram, if the system on the right fails, it stops sending heartbeats over the private interconnect. The left node then takes corrective action. Failure of the cluster interconnect presents identical symptoms. In this case, both nodes determine that their peer has departed and attempt to take corrective action. This can result in data corruption if both nodes are able to take control of storage in an uncoordinated manner.
Other scenarios can cause this situation. If a system is so busy that it appears to be hung, it can seem to have failed, and its services can be started on another system. This can also happen on systems where the hardware supports a break and resume function. If the system is dropped to command-prompt level with a break and subsequently resumed, the system can appear to have failed. The cluster is reformed, and then the system recovers and begins writing to shared storage again.
The remainder of this lesson describes how the VERITAS fencing mechanism prevents split brain condition in failure situations.
When the heartbeats stop, VCS needs to take action, but both failures have the same symptoms. Which failure is it? What action should be taken?
Data Protection Requirements
The key to protecting data in a shared storage cluster environment is to guarantee that there is always a single consistent view of cluster membership. In other words, when one or more systems stop sending heartbeats, the HA software must determine which systems can continue to participate in the cluster membership and how to handle the other systems.
I/O Fencing
VCS uses a mechanism called I/O fencing to guarantee data protection. I/O fencing uses SCSI-3 persistent reservations (PR) to fence off data drives to prevent split brain condition.
I/O Fencing Components
VCS uses fencing to allow write access to members of the active cluster and to block access to nonmembers. I/O fencing in VCS consists of several components. The physical components are coordinator disks and data disks. Each has a unique purpose and uses different physical disk devices.
Coordinator Disks
The coordinator disks act as a global lock mechanism, determining which nodes are currently registered in the cluster. This registration is represented by a unique key associated with each node that is written to the coordinator disks. In order for a node to access a data disk, that node must have a key registered on the coordinator disks. When system or interconnect failures occur, the coordinator disks ensure that only one cluster survives, as described in the I/O Fencing Operations section.
Data Disks
Are located on a shared storage device
Store application data for service groups
Must support SCSI-3
Must be in a Volume Manager 4.x or 5.x disk group
The disk group must be managed by a VCS DiskGroup resource
Data Disks
Data disks are standard disk devices used for shared data storage. These can be physical disks or RAID logical units (LUNs). These disks must support SCSI-3 PR. Data disks are incorporated into standard VM disk groups.
In operation, Volume Manager is responsible for fencing data disks on a disk group basis. Disks added to a disk group are automatically fenced, as are new paths to a device when they are discovered.
(Diagram: Node 0 and Node 1 each write their registration keys, A and B, to the three coordinator disks; GAB ports a and b on both nodes show members 0 and 1.)
Keys are based on LLT node number: 0=A, 1=B, and so on.
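The naming convention above can be sketched in shell. This is an illustration of the convention only, not the actual fencing driver code; the helper names are hypothetical.

```shell
# Hypothetical helpers illustrating the key naming convention only --
# not the real vxfen implementation. A node's letter is the character
# at ASCII 65 + LLT node ID (0=A, 1=B, ...), and the data-disk key
# appends "VCS" to that letter (AVCS, BVCS, ...).
node_letter() {
  # Print the character whose octal code is 65 + node ID
  printf "\\$(printf '%03o' $((65 + $1)))"
}
data_disk_key() {
  printf '%sVCS\n' "$(node_letter "$1")"
}
data_disk_key 0   # AVCS
data_disk_key 1   # BVCS
```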
Service Group Startup
After each system has written registration keys to the coordinator disks, the fencing membership is established, and port b shows all systems as members. In the example shown in the diagram, the cluster has two members, Node 0 and Node 1, so port b membership shows 0 and 1.
At this point, HAD is started on each system. When HAD is running, VCS brings service groups online according to their specified startup policies. When a disk group resource associated with a service group is brought online, the Volume Manager disk group agent (DiskGroup) imports the disk group and writes a SCSI-3 registration key to the data disk. This registration is performed in a similar way to coordinator disk registration. The key is different for each node; Node 0 uses AVCS, Node 1 uses BVCS, and so on.
In the example shown in the diagram, Node 0 is registered to write to the data disks in the disk group belonging to the DB service group. Node 1 is registered to write to the data disks in the disk group belonging to the App service group.
After registering with the data disk, Volume Manager sets a Write Exclusive Registrants Only reservation on the data disk. This reservation means that only the registered system can write to the data disk.
System Failure
1. Node 0 detects no more heartbeats from Node 1.
2. Node 0 races for the coordinator disks, ejecting all B keys.
3. Node 0 wins all coordinator disks.
4. Node 0 knows it has a perfect membership.
5. VCS can now fail over the App service group and import the disk group, changing the reservation.
(Diagram: the DB disk group keeps key AVCS and its reservation for Node 0 exclusive access; the App disk group's key BVCS is replaced by AVCS as the reservation moves from the failed Node 1 to Node 0.)
System Failure
The diagram shows the fencing sequence when a system fails.
1. Node 0 detects that Node 1 has failed when the LLT heartbeat times out and informs GAB. At this point, port a on Node 0 (GAB membership) shows only 0.
2. The fencing driver is notified of the change in GAB membership, and Node 0 races to win control of a majority of the coordinator disks. This means Node 0 must eject Node 1 keys (B) from at least two of three coordinator disks. The fencing driver ejects the registration of Node 1 (B keys) using the SCSI-3 Preempt and Abort command. This command allows a registered member on a disk to eject the registration of another. Because I/O fencing uses the same key for all paths from a host, a single preempt and abort ejects a host from all paths to storage.
3. In this example, Node 0 wins the race for each coordinator disk by ejecting Node 1 keys from each coordinator disk.
4. Now port b (fencing membership) shows only Node 0 because Node 1 keys have been ejected. Therefore, fencing has a consistent membership and passes the cluster reconfiguration information to HAD.
5. GAB port h reflects the new cluster membership containing only Node 0, and HAD now performs failover operations defined for the service groups that were running on the departed system.
Fencing takes place when a service group is brought online on a surviving system as part of the disk group importing process. When the DiskGroup resources come online, the agent online entry point instructs Volume Manager to import the disk group with options to remove the Node 1 registration and reservation, and place a SCSI-3 registration and reservation for Node 0.
Lesson 15 I/O Fencing
Interconnect Failure
1. Node 0 detects no more heartbeats from Node 1. Node 1 detects no more heartbeats from Node 0.
2. Nodes 0 and 1 race for the coordinator disks, ejecting each other's keys. Only one node can win each disk.
3. Node 0 wins the majority of coordinator disks.
4. Node 1 panics.
5. Node 0 now has a perfect membership.
6. VCS fails over the App service group, importing the disk group and changing the reservation.
(Diagram: the DB disk group keeps key AVCS and its reservation for Node 0 exclusive access; the App disk group's key BVCS is replaced by AVCS as the reservation moves from Node 1 to Node 0.)
Interconnect Failure
The diagram shows how VCS handles fencing if the cluster interconnect is severed and a network partition is created. In this case, multiple nodes are racing for control of the coordinator disks.
1. LLT on Node 0 informs GAB that it has not received a heartbeat from Node 1 within the timeout period. Likewise, LLT on Node 1 informs GAB that it has not received a heartbeat from Node 0.
2. When the fencing drivers on both nodes receive a cluster membership change from GAB, they begin racing to gain control of the coordinator disks. The node that reaches the first coordinator disk (based on disk serial number) ejects the departed node's key. In this example, Node 0 wins the race for the first coordinator disk and ejects the B------- key. After the B key is ejected by Node 0, Node 1 cannot eject the key for Node 0 because the SCSI-3 PR protocol specifies that only a member can eject a member. This condition means that only one system can win.
3. Node 0 also wins the race for the second coordinator disk. Node 0 is favored to win the race for the second coordinator disk according to the algorithm used by the fencing driver. Because Node 1 lost the race for the first coordinator disk, Node 1 has to sleep for one second (default) before it tries to eject the other node's key. This favors the winner of the first coordinator disk to win the remaining coordinator disks. Therefore, Node 1 does not gain control of the second or third coordinator disks.
4. After Node 0 wins control of the majority of coordinator disks (all three in this example), Node 1 loses the race and calls a kernel panic to shut down immediately and reboot.
5. Now port b (fencing membership) shows only Node 0 because Node 1 keys have been ejected. Therefore, fencing has a consistent membership and passes the cluster reconfiguration information to HAD.
6. GAB port h reflects the new cluster membership containing only Node 0, and HAD now performs the defined failover operations for the service groups that were running on the departed system.
When a service group is brought online on a surviving system, fencing takes place as part of the disk group importing process.
(Diagram: Node 0 continues running with key A registered on the coordinator disks and on both disk groups, each with a reservation for Node 0 exclusive access; Node 1 reboots while the interconnect is still severed.)
Interconnect Failure on Node Restart
A preexisting network partition occurs when the cluster interconnect is severed and a node subsequently reboots and attempts to form a cluster. In this example, the cluster interconnect remains severed. Node 0 is running and has an A key registered with the coordinator disks.
1. Node 1 starts up.
2. GAB cannot seed because it detects only Node 1 and the gabtab file specifies gabconfig -c -n2. GAB can only seed if two systems are communicating. Therefore, HAD cannot start, and service groups do not start.
3. At this point, an administrator mistakenly forces GAB to seed Node 1.
4. During initialization, the fencing driver compares the list of nodes in the GAB membership with the keys present on the coordinator disks. In this example, the fencing driver on Node 1 detects keys from Node 0 (A------) but does not detect Node 0 in the GAB membership because the cluster interconnect has been severed.
gabconfig -a
. . .
Port a gen b7r004 membership 1
5. Because Node 1 detects keys on the coordinator disks for systems not present in the GAB membership, the Node 1 fencing driver determines that a preexisting network partition exists and prints an error message to the console. The fencing driver prevents HAD from starting, which prevents importing of disk groups.
To enable Node 1 to rejoin the cluster, you must repair the interconnect.
I/O Fencing Behavior
As demonstrated in the example failure scenarios, I/O fencing behaves the same regardless of the type of failure:
The fencing drivers on each system race for control of the coordinator disks, and the winner determines cluster membership.
Reservations are placed on the data disks by Volume Manager when disk groups are imported.
I/O Fencing with Multiple Nodes
In a multinode cluster, the lowest numbered (LLT ID) node always races on behalf of the remaining nodes. This means that at any time only one node is the designated racer for any mini-cluster. If a designated racer wins the coordinator disk race, it broadcasts this success on port b to all other nodes in the mini-cluster. If the designated racer loses the race, it panics and reboots. All other nodes immediately detect another membership change in GAB when the racing node panics. This signals all other members that the racer has lost and they must also panic.
Majority Clusters
The I/O fencing algorithm is designed to give priority to larger clusters in any arbitration scenario. For example, if a single node is separated from a 16-node cluster due to an interconnect fault, the 15-node cluster should continue to run. The fencing driver uses the concept of a majority cluster. The algorithm determines if the number of nodes remaining in the cluster is greater than or equal to the number of departed nodes. If so, the larger cluster is considered a majority cluster. The majority cluster begins racing immediately for control of the coordinator disks on any membership change. The fencing drivers on the nodes in the minority cluster delay the start of the race to give an advantage to the larger cluster. This delay is established by the vxfen_min_delay and vxfen_max_delay tunable parameters. The algorithm ensures that the larger cluster wins, but also allows a smaller cluster to win if the departed nodes are not actually running.
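The majority test described above can be sketched as follows. This is a simplified illustration of the decision, not the actual fencing driver algorithm; the function name is hypothetical, and the nonzero delay stands in for the vxfen_min_delay/vxfen_max_delay tunables.

```shell
# Simplified sketch of the majority-cluster decision described above.
# Not the real vxfen code; the name and delay values are illustrative.
race_delay() {
  remaining=$1 departed=$2
  if [ "$remaining" -ge "$departed" ]; then
    echo 0    # majority cluster: race immediately
  else
    echo 1    # minority cluster: delay before racing
  fi
}
race_delay 15 1   # 15 survivors vs. 1 departed node: prints 0
race_delay 1 15   # 1 survivor vs. 15 departed nodes: prints 1
```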
Communication Stack
I/O fencing:
Is implemented by the fencing driver (vxfen)
Uses GAB port b for communication
Determines coordinator disks on vxfen startup
Intercepts RECONFIG messages from GAB destined for the VCS engine
Controls fencing actions by Volume Manager
(Diagram: on each node, the stack is HAD over vxfen over Volume Manager, with the vxfen drivers on each system communicating through GAB port b.)
Fencing Driver
The VERITAS fencing driver (vxfen):
Coordinates membership with the race for coordinator disks
Is called by other modules for authorization to continue
Is installed by VCS and started during system startup
Fencing Driver
Fencing in VCS is implemented in two primary areas:
The vxfen fencing driver, which directs Volume Manager
Volume Manager, which carries out actual fencing operations at the disk group level
The fencing driver is a kernel module that connects to GAB to intercept cluster membership changes (reconfiguration messages). If a membership change occurs, GAB passes the new membership in the form of a reconfiguration message to vxfen on GAB port b. The fencing driver on the node with the lowest node ID in the remaining cluster races for control of the coordinator disks, as described previously. If this node wins, it passes the list of departed nodes to VxVM to have these nodes ejected from all shared disk groups. After carrying out the required fencing actions, vxfen passes the reconfiguration message to HAD.
Fencing Implementation in Volume Manager
Volume Manager handles all fencing of data drives for disk groups that are controlled by the VCS DiskGroup resource type. After a node successfully joins the GAB cluster and the fencing driver determines that a preexisting network partition does not exist, the VCS DiskGroup agent directs VxVM to import disk groups using SCSI-3 registration and a Write Exclusive Registrants Only reservation. This ensures that only the registered node can write to the disk group.
Each path to a drive represents a different I/O path. I/O fencing in VCS places the same key on each path. For example, if node 0 has four paths to the first disk group, all four paths have key AVCS registered. Later, if node 0 must be ejected, VxVM preempts and aborts key AVCS, effectively ejecting all paths.
Because VxVM controls access to the storage, adding or deleting disks is not a problem. VxVM fences any new drive added to a disk group and removes keys when drives are removed. VxVM also determines if new paths are added and fences these as well.
Fencing Implementation in VCS
In VCS 4.x and 5.x, had is modified to enable the use of fencing for data protection in the cluster. When the UseFence cluster attribute is set to SCSI3, had cannot start unless the fencing driver is operational. This ensures that services cannot be brought online by VCS unless fencing is already protecting shared storage disks.
You cannot set the UseFence attribute while VCS is running. You must use the offline configuration method and make the change in the main.cf file manually.
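In main.cf, the attribute appears inside the cluster definition. A minimal sketch, with an illustrative cluster name:

```
cluster vcs1 (
        UseFence = SCSI3
        )
```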
Coordinator Disk Implementation
Coordinator disks are special-purpose disks in a VCS environment. Coordinator disks are three standard disks or LUNs that are set aside for use by I/O fencing during cluster reconfiguration. The coordinator disks can be any three disks that support persistent reservations. VERITAS typically recommends using small LUNs (at least 32 MB) for coordinator use.
You cannot use coordinator disks for any other purpose in the VCS configuration. Do not store data on these disks or include the disks in disk groups used for data. Using the coordinator=on option to vxdg for the coordinator disk group ensures that the coordinator disks cannot inadvertently be used for other purposes. For example:
vxdg -g fendg set coordinator=on
When the coordinator attribute is set, Volume Manager prevents the reassignment of coordinator disks to other disk groups. Note: Discussion of coordinator disks in metropolitan area (campus) clusters is provided in other Symantec high availability courses.
Raw Mode
/dev/rdsk/c3t0d74s2
/dev/rdsk/c2t4d74s2
/dev/rdsk/c3t0d75s2
/dev/rdsk/c2t4d75s2
/dev/rdsk/c3t0d76s2
/dev/rdsk/c2t4d76s2
DMP Support
In 5.0, VCS supports dynamic multipathing for both data and coordinator disks. In 4.x, only data disks were supported, and coordinator disks were specified as raw devices. The /etc/vxfenmode file is used to set the mode for coordinator disks, and these sample files are provided for configuration:
/etc/vxfen.d/vxfenmode_scsi3_dmp
/etc/vxfen.d/vxfenmode_scsi3_raw
/etc/vxfen.d/vxfenmode_scsi3_sanvm
/etc/vxfen.d/vxfenmode_scsi3_disabled
The following example shows the vxfenmode file contents for a DMP configuration:
vxfen_mode=scsi3 scsi3_disk_policy=dmp
Deport the disk group.
Create /etc/vxfendg on all systems.
Create /etc/vxfenmode on all systems.
On each system:
Set UseFence in main.cf: UseFence = SCSI3
Verify syntax: hacf -verify /etc/VRTSvcs/conf/config
Stop VCS on all systems: hastop -all
hastart
You must stop and restart service groups so that the disk groups are imported using SCSI-3 reservations.
Start the fencing driver on each system using the vxfen startup file (/sbin/init.d/vxfen on HP-UX) with the start option. Upon startup, the script creates the /etc/vxfentab file with a list of all paths to each coordinator disk. This is accomplished as follows:
a. Read the vxfendg file to obtain the name of the coordinator disk group.
b. Run vxdisk -o alldgs list and grep to create a list of each device name (path) in the coordinator disk group.
c. For each disk device in this list, run vxdisk list on the disk and create a list of each device that is in the enabled state.
d. Write the list of enabled devices to the vxfentab file.
This ensures that any time a system is rebooted, the fencing driver reinitializes the vxfentab file with the current list of all paths to the coordinator disks.
Note: This is the reason coordinator disks cannot be dynamically replaced. The fencing driver must be stopped and restarted to populate the vxfentab file with the updated paths to the coordinator disks.
Save and close the cluster configuration before modifying main.cf to ensure that the changes you make to main.cf are not overridden.
Set the UseFence attribute to SCSI3 in the main.cf file and verify the syntax.
Stop VCS on all systems. Do not use the -force option. You must have VCS reimport disk groups to place data under fencing control.
Start VCS on the system with the modified main.cf file and propagate that configuration to all cluster systems. Ensure that the system on which you edited the main.cf is in the running state before starting VCS on any other systems. See the Offline Configuration of a Service Group lesson for details.
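The logic of steps a through d can be sketched as a shell fragment. This is a schematic re-creation run against canned sample output, not the actual vxfen startup script; the file locations under /tmp and the sample vxdisk listing are illustrative.

```shell
# Schematic re-creation of steps a-d above, run against canned sample
# output rather than live vxdisk commands; names and paths are illustrative.
echo fendg > /tmp/vxfendg                    # stand-in for /etc/vxfendg
dg=$(cat /tmp/vxfendg)                       # a: read the coordinator disk group name
vxdisk_sample="c1t1d0s2 auto:cdsdisk - ($dg) online
c2t1d0s2 auto:cdsdisk - ($dg) online
c1t2d0s2 auto:cdsdisk - (datadg) online"     # canned vxdisk -o alldgs list output
# b: grep for the device names (paths) that belong to the coordinator disk group
devices=$(printf '%s\n' "$vxdisk_sample" | grep "($dg)" | awk '{print $1}')
# c/d: the real script also checks each device with vxdisk list for the
# enabled state; here the sample devices are written straight to vxfentab
printf '%s\n' $devices > /tmp/vxfentab
cat /tmp/vxfentab                            # c1t1d0s2 and c2t1d0s2
```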
Checking Keys You can check the keys on coordinator and data disks using the vxfenadm command, as shown in the slide. The following example shows the vxfenadm -R command with a specific disk:
vxfenadm -R /dev/rdsk/c1t2d11s2
Device Name: /dev/rdsk/c1t2d11s2
Total Number Of Keys: 1
key[0]:
Reservation Type: SCSI3_RESV_WRITEEXCLUSIVEREGISTRANTSONLY
Reserved by Node: 0: S1
Key Value: VCS
Lesson Summary
Key Points
I/O fencing ensures data is protected in a cluster environment. Disk devices must support SCSI-3 persistent reservations to implement I/O fencing.
Reference Materials
VERITAS Cluster Server Installation Guide VERITAS Cluster Server User's Guide VERITAS Volume Manager User's Guide http://van.veritas.com
Labs and solutions for this lesson are located on the following pages. "Lab 15: Configuring I/O Fencing," page A-101. "Lab 15 Solutions: Configuring I/O Fencing," page B-171.
Lesson 16 Troubleshooting
Topics
Monitoring VCS
Troubleshooting Guide
Archiving VCS-Related Files
Monitoring Facilities
VCS log files
System log files
The hastatus utility
Notification
Event triggers
VCS GUIs
Monitoring VCS
VCS provides numerous resources you can use to gather information about the status and operation of the cluster. These include:
VCS log files:
VCS engine log file, /var/VRTSvcs/log/engine_A.log
Agent log files in /var/VRTSvcs/log
hashadow log file, /var/VRTSvcs/log/hashadow-err_A.log
System log files:
/var/adm/messages (/var/adm/syslog/syslog.log on HP-UX)
/var/log/syslog
The hastatus utility
Notification by way of SNMP traps and e-mail messages
Event triggers
Cluster Manager
The information sources that have not been covered elsewhere in the course are discussed in more detail in the next sections.
VCS Logs
Engine log: /var/VRTSvcs/log/engine_A.log View logs using the GUI or the hamsg command:
hamsg engine_A
2003/05/20 16:00:09 VCS NOTICE V-16-1-10322 System S1 (Node '0') changed state from STALE_DISCOVER_WAIT to STALE_ADMIN_WAIT
2003/05/20 16:01:27 VCS INFO V-16-1-50408 Received connection from client Cluster Manager Java Console (ID:400)
2003/05/20 16:01:31 VCS ERROR V-16-1-10069 All systems have configuration files marked STALE. Unable to form cluster.
VCS Logs
In addition to the engine_A.log primary VCS log file, VCS logs information for had, hashadow, and all agent programs in these locations:
had: /var/VRTSvcs/log/engine_A.log
hashadow: /var/VRTSvcs/log/hashadow-err_A.log
Agent logs: /var/VRTSvcs/log/AgentName_A.log
Messages in VCS logs have a unique message identifier (UMI) built from product, category, and message ID numbers. Each entry includes a text code indicating severity, from CRITICAL entries indicating that immediate attention is required, to INFO entries with status information. The log entries are categorized as follows:
CRITICAL: VCS internal message requiring immediate attention (contact Customer Support immediately)
ERROR: Messages indicating errors and exceptions
WARNING: Messages indicating warnings
NOTICE: Messages indicating normal operations
INFO: Informational messages from agents
Entries with CRITICAL and ERROR severity levels indicate problems that require troubleshooting.
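A quick scan of the engine log for entries needing attention can be done with grep. The log path is the default named above; the sample lines written here are synthetic stand-ins so the fragment is self-contained.

```shell
# Scan a VCS engine log for CRITICAL and ERROR entries. The lines below
# are synthetic samples standing in for /var/VRTSvcs/log/engine_A.log.
log=/tmp/engine_A.log
cat > "$log" <<'EOF'
2003/05/20 16:00:09 VCS NOTICE V-16-1-10322 System S1 changed state
2003/05/20 16:01:31 VCS ERROR V-16-1-10069 All systems have configuration files marked STALE
2003/05/20 16:02:00 VCS INFO V-16-1-50408 Received connection from client
EOF
# Print only the entries whose severity requires troubleshooting
grep -E ' (CRITICAL|ERROR) ' "$log"
```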
Changing the Log Level and File Size
You can change the amount of information logged by agents for resources being monitored. The log level is controlled by the LogDbg resource type attribute. Changing this value affects all resources of that type. Possible values are any combination of the DBG_1 to DBG_21 log tags, and one of the severity levels represented by DBG_AGINFO, DBG_AGDEBUG, and DBG_AGTRACE. Use the hatype command to change the LogDbg value, and then write the in-memory configuration to disk to save the results in the types.cf file.
Note: Increase agent log levels only when you experience problems. The performance impacts and disk space usage can be substantial.
You can also change the size of the log file from the default of 32 MB. The minimum log size is 65 KB and the maximum is 128 MB, specified in bytes. When a log file reaches the size limit defined in the LogFileSize cluster attribute, a new log file is created with B and C appended to the file name. The letter A indicates the first log file, B the second, and C the third.
Using the Technical Support Web Sites
The Symantec/VERITAS Support Web site contains product and patch information, a searchable knowledge base of technical notes, access to product-specific news groups and e-mail notification services, and other information about contacting technical support staff. You can access Technical Support from http://entsupport.symantec.com.
The VERITAS Architect Network (VAN) provides a portal for accessing technical resources, such as product documentation, software, technical articles, and discussion groups. You can access VAN from http://van.veritas.com.
Troubleshooting Guide
A VCS problem is typically one of three types:
Cluster communication
VCS engine startup
Service groups, resources, or agents
Procedure Overview
To start troubleshooting, determine which type of problem is occurring based on the information displayed by hastatus -summary output.
Cluster communication problems are indicated by the message: Cannot connect to server -- Retry Later
VCS engine startup problems are indicated by systems in the STALE_ADMIN_WAIT or ADMIN_WAIT state.
Other problems are indicated when the VCS engine, LLT, and GAB are all running on all systems, but service groups or resources are in an unexpected state.
The hastatus display uses categories to indicate the status of different types of cluster objects. Examples are shown in the following table.
A  Systems
B  Service groups
C  Failed resources
D  Resources not probed
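The triage above can be sketched as a small helper. This is a hypothetical illustration; the function name is made up, and the sample strings are synthetic rather than actual hastatus output.

```shell
# Hypothetical triage helper for the three problem types described above.
# The sample inputs are synthetic; real hastatus -summary output differs.
classify() {
  case $1 in
    *"Cannot connect to server"*) echo "cluster communication" ;;
    *STALE_ADMIN_WAIT*|*ADMIN_WAIT*) echo "VCS engine startup" ;;
    *) echo "service group, resource, or agent" ;;
  esac
}
classify "Cannot connect to server -- Retry Later"   # cluster communication
classify "A  S2  STALE_ADMIN_WAIT"                   # VCS engine startup
classify "B  websg  S1  OFFLINE|FAULTED"             # service group, resource, or agent
```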
Using the Troubleshooting Job Aid
You can use the troubleshooting job aid provided with this course to assist you in solving problems in your VCS environment. This lesson provides the background for understanding the root causes of problems, as well as the effects of applying solutions described in the job aid. Ensure that you understand the consequences of the commands and methods you use for troubleshooting when using the job aid.
Making Backups
Back up key VCS files as part of your regular backup procedure:
types.cf and customized types files
main.cf
main.cmd
sysname
LLT and GAB configuration files in /etc
Customized trigger scripts in /opt/VRTSvcs/bin/triggers
Enterprise agents in /opt/VRTS/agents/ha/bin
Customized agents in /opt/VRTSvcs/bin
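One way to fold these into a regular backup job is to archive whichever of the listed files exist on the host. A sketch, with the function name invented here and the default paths mentioned in this course assumed; adjust for your installation.

```shell
# Sketch of a backup step for the VCS files listed above. The helper name
# is hypothetical, and the paths are the course defaults.
backup_vcs_files() {
  archive=$1; shift
  existing=""
  for f in "$@"; do
    # Skip files not present on this system
    [ -e "$f" ] && existing="$existing $f"
  done
  # Archive only the files actually found (word splitting is intentional)
  [ -n "$existing" ] && tar -cf "$archive" $existing
}
backup_vcs_files /tmp/vcs-config.tar \
  /etc/VRTSvcs/conf/config/main.cf \
  /etc/VRTSvcs/conf/config/types.cf \
  /etc/llttab /etc/llthosts /etc/gabtab \
  /etc/VRTSvcs/conf/sysname || true
```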
The hasnap Utility
The hasnap utility backs up and restores predefined and custom VCS files on each node in a cluster. A snapshot is a collection of predefined VCS configuration files and any files added to a custom file list. A snapshot also contains information such as the snapshot name, description, time, and file permissions.
In the example shown in the slide, hasnap is used to:
Create a single file containing all backed up files (-f vcs.tar).
Specify no prompts for user input (-n).
Create a description for the snapshot (-m Oracle_Cluster).
The following table shows samples of hasnap options:
Option    Purpose
-backup   Copies the files to a local predefined directory
-restore  Copies the files in the specified snapshot to a directory
-display  Lists all snapshots and the details of a specified snapshot
-sdiff    Shows differences between configuration files in a snapshot and the files on a specified system
-fdiff    Shows differences between a specified file in a snapshot and the file on a specified system
-export   Exports a snapshot to a single file
You can use the hagetcf command for versions of VCS earlier than 4.0.
Lesson Summary
Key Points
Use the background provided in this lesson to develop an understanding of the common causes of problems and their solutions. Use the troubleshooting job aid as a guide.
Reference Materials
- Troubleshooting Job Aid
- VERITAS Cluster Server User's Guide
- VERITAS Cluster Server Bundled Agents Reference Guide
- http://entsupport.symantec.com
Labs and solutions for this lesson are located on the following pages. "Lab 16: Troubleshooting," page A-111. "Lab 16 Solutions: Troubleshooting," page B-191.
Index
A
access control 6-21 access, controlling 6-19 add LLT link 13-21 resource 8-6 admin account 3-9 admin wait state ManageFaults attribute 12-6 recovering resources 12-18 ResAdminWait trigger 12-20 troubleshooting 9-14 administration application 5-3 administrator intervention 12-18 agent clean entry point 2-10, 12-4 close entry point 8-17 communication 13-3 definition 2-10 logs 16-4 monitor entry point 2-10 offline entry point 2-10 online entry point 2-10 AIX lslpp command 4-10 startup files 13-15 AllowNativeCliUsers attribute 6-20 application clean 7-9 component definition 7-3 configure 7-9 high availability 1-10 IP address 7-14 management 5-3 manual migration 7-19 preparation procedure 7-10 prepare 7-3 service 7-3 shutdown 7-9 start 7-15 application components stopping 7-18 application service testing 7-10 atomic broadcast mechanism 13-3 attribute AutoDisabled 13-18 AutoFailOver 12-7 AutoStart 5-10 AutoStartList 8-4, 10-5 CleanTimeout 12-10 Critical 5-10, 8-22 display 5-8 FailOverPolicy 12-4 FaultPropagation 12-6 Frozen 5-13, 12-5 local 10-14 ManageFaults 12-5, 12-18 MonitorInterval 12-9, 12-11 MonitorTimeout 12-12 OfflineMonitorInterval 12-9, 12-11 OfflineTimeout 12-10, 12-12 OnlineRetryLimit 12-10 OnlineTimeout 12-10 OnlineWaitLimit 12-10 override 12-16 Parallel 8-5 resource 1-10, 2-8, 8-7 resource type 12-11, 12-13, 12-15 ResourceOwner 11-8 service group failover 12-5 service group validation 7-21 static 12-16 SystemList 8-4, 10-5 TFrozen 5-13, 12-5 verify 7-7 autodisable jeopardy 14-7 service group 13-18 AutoDisabled attribute 13-18 AutoFailover attribute 12-7 automatic failover 12-7 AutoStart attribute 5-10 AutoStartList attribute 8-4, 10-5 availability levels 1-6
B
backup configuration files 16-9 backup configuration files 9-16 best practice application service testing 7-10 managing applications 5-3 boot disk 3-4 Bundled Agents Reference Guide 2-11
C
cable LLT link reconnection 14-9 child resource configuration 8-6 dependency 2-7 linking 8-20 clean entry point 12-4 CleanTimeout attribute 12-10 clear resource fault 8-18 CLI online configuration 6-10 resource configuration 8-10 service group configuration 8-5 close cluster configuration 6-13 entry point 8-17 cluster administration tools 5-4 campus 15-23 communication 13-3 concepts 1-7 configuration 2-17 configuration files 4-9 configure 4-4 definition 2-3 design 3-12 duplicate configuration 6-17 duplicate service group configuration 6-18 ID 3-8 interconnect 2-12 interconnect configuration 13-19 interconnect link failures 14-6 jeopardy membership 14-7 joining membership 13-14
local 1-8 managing applications 5-3 membership 2-12, 13-6 membership seeding 13-14 membership status 13-6 name 3-8 partition 14-11 regular membership 14-7 Running state 6-4 Simulator 5-16 terminology 2-3 wide-area 1-9 cluster communication configuration files 4-8 overview 1-14, 2-12 cluster configuration build from file 9-8 close 6-13 in memory 6-3 in-memory 6-4 offline 6-16, 9-3 online 8-3 open 6-11, 8-5 protection 6-14 save 6-12 save and close 8-5 Simulator 9-12 cluster interconnect configuration files 4-8 definition 2-12 VCS startup 6-3, 6-4 Cluster Manager installation 4-13 online configuration 6-10 service group creation 8-4 Simulator 5-17, 5-19 Cluster Monitor Simulator 5-19 cluster state GAB 9-15, 13-3 remote build 6-4, 9-15 running 9-15 Wait 9-14 ClusterService group main.cf file 4-9 notification 11-6 command-line interface service group configuration 8-5
Simulator 5-20 comments main.cf file 8-20 communication agent 13-3 between cluster systems 13-4 configuration 13-19 fencing 15-19 within a system 13-3 component testing 7-17 concurrency violation prevention 13-18 service group freeze 5-13 configuration application 7-9 application IP address 7-14 application service 7-5 backup files 9-16 cluster 4-4 downtime 6-8 fencing 4-12, 15-25 files 2-18 GAB 13-13 in-memory 6-9 interconnect 13-19 main.cf file 2-18 methods 6-7 notification 11-6 NotifierMngr 11-7 offline method 6-16 online 8-3 overview 2-17 resource 8-6 resource type attribute 12-15 shared storage 7-8 Simulator samples 5-17 troubleshooting 8-15 types.cf file 2-18 configuration file edit 9-3 verify syntax 9-7 configuration files backing up 16-9 GAB 13-13 installation 4-7 llttab 13-9 configure interconnect 13-8
LLT 13-21 ConfInterval attribute 12-14 coordinator disk definition 15-9 disk group 15-25 requirements 15-23 Critical attribute 8-22 in critical resource 5-10 critical resource role of in failover 12-3 crossover cable 3-3 custom triggers 16-9
D
data disk 15-10 protecting 14-11 data disk reservation 15-12 data protection definition 15-3 fencing 2-15, 14-3 HAD 15-22 jeopardy membership 14-7 requirement definition 15-7 database fault management 12-6 default failover behavior 12-3 monitor interval 12-11 timeout values 12-12 dependency offline order 5-11 online order 5-10 resource 2-7, 7-20, 8-20 resource offline order 5-15 resource rules 8-21 resource start order 7-10 resource stop order 7-18 rule 2-7 dependency tree 8-20 design resource dependency 7-20 worksheet 3-12 directive
link 13-21 LLT configuration 13-10 set-cluster 13-10 set-node 13-10 disable resource 8-17 disk coordinator 15-9 data 15-10 fencing 15-9 shared 3-4 disk group fencing 15-10 DiskGroup resource 8-10 display cluster membership status 13-6 LLT status 13-7 service group 8-5 DMP 3-4 downtime causes 1-4 cluster configuration 6-8 system fault 14-5 duplicate node ID 13-10 duration failover 12-9 dynamic multipathing 3-4
E
edit configuration file 9-3 e-mail Support notification service 16-6 e-mail notification configuration 11-3 from GroupOwner attribute 11-9 from ResourceOwner attribute 11-8 enable resource 8-6 engine_A.log file 16-4 entry point clean 12-4 close 8-17 definition 2-10 monitor 12-12 offline 12-12 online 12-12 timeouts 12-12 environment variable VCS_SIM_PORT 5-20 VCS_SIM_WAC_PORT 5-20 error configuration 8-15 Ethernet LLT link 13-4, 13-7 Ethernet interconnect network 3-3, 3-4 event notification 12-19 severity level 11-4 SNMP 11-10 trigger 12-20 event messages 11-4 example restart a resource 12-14
F
failover active/active 1-8 active/passive 1-8 automatic 12-7 critical resource 8-22, 12-3 default behavior 12-3 duration 12-9, 14-5 global 1-12 manual 5-12, 12-7 N + 1 1-8 N + N 1-8 N-to-1 1-8 policy 12-4 selecting target systems 14-5 service group attributes 12-5 service group fault 12-3 service group type 2-5, 8-4 troubleshooting 12-18 FailOverPolicy attribute 12-4 failure communication 15-5 fencing 15-14 interconnect recovery 15-29 LLT link 14-6 system 15-4 fault critical resource 8-22 detection 12-9 effects of resource type attributes 12-13 example 12-8 failover duration 14-5 ManageFaults attribute 12-6 manual management 12-5 notification 12-19 recover 12-17 resource 8-18 system 14-4 trigger 12-20 faulted state resource 12-17 FaultPropagation attribute 12-6 fencing communication 15-19 components 15-9 configure 15-25 coordinator disk requirements 15-23 data protection 15-8 definition 2-15 GAB communication 15-14 I/O 14-3 installation 4-12 interconnect failure 15-14 partition-in-time 15-30 race 15-13 recovering a system 15-28 startup 15-26 system failure 15-13 vxfen driver 15-20 file system 8-11 flush service group 8-16 force VCS stop 6-5 freeze persistent 5-13 service group 5-13 temporary 5-13 frequency heartbeats 14-12 Frozen attribute 5-13, 12-5
G
GAB cluster state change 9-15 communication 13-3 configuration file 13-13 definition 2-14 fencing 15-20 manual seeding 13-16 membership 13-14, 15-16 membership notation 13-6 network partition recovery 14-10 port a 13-6 Port b 15-19 Port h 13-6 preexisting network partition 14-13 seeding 13-14 startup files 13-15 status 13-6 timeout 14-5 gabconfig command 13-6, 13-13 gabtab file 13-13 Group Membership Services/Atomic Broadcast 13-3 definition 1-15, 2-14 GroupOwner attribute 11-9 GUI adding a service group 8-4 resource configuration 8-7
H
hacf command 9-4 verify 9-7 haconf command dump 8-5 makerw 8-5 HAD data protection 15-22 definition 2-16 notifier 11-3 online configuration 6-10 startup 6-3, 15-22 hagetcf command 16-10 hagrp command add 8-5 display 8-5 freeze 5-13, 12-5 modify 8-5 offline 5-11 online 5-10 switch 5-12 unfreeze 5-13
halogin command 6-20 hanotify command 11-6 hardware requirements 3-3 storage 3-4 support 3-3 hardware compatibility list 3-3 hares command clear 8-18 display 8-22 list 8-6 offline 5-15 online 5-14 probe 8-18 hashadow daemon 2-16 hashadow_A.log file 16-4 hasim command 5-20 hasimgui 5-17 hasnap command 16-10 hastart command 6-3 hastatus command 5-5 hastop command 6-5 hatype command 12-15 HBA 3-4 HCL 3-3 heartbeat definition 13-4 frequency reduction 14-12 loss of 15-5, 15-14 low-priority link 13-5, 13-21 network requirement 3-3 public network 14-12 high availability concepts 1-3 notification 11-6 online configuration 6-9 high availability daemon 2-16 high availability, reference 1-16 high-priority link 13-5 hostname command 13-11 HP OpenView Network Node Manager 11-10 HP-UX startup files 13-15 swlist command 4-10 hub 3-3 hybrid service group type 2-5
I
I/O fencing 14-3 ID cluster 3-8 messages 16-4 install Simulator 5-16 installation Cluster Manager 4-13 fencing 4-12 Java GUI 4-13 log 4-3 VCS preparation 3-8 view cluster configuration 4-10 Installing 4-13 installvcs command 4-4 interconnect cable 14-9 cluster communication 13-4 configuration 13-8, 13-19 configuration procedure 13-20 Ethernet 3-3 failure 15-5, 15-14, 15-16 failure recovery 15-29 link failures 14-6 network partition 14-8, 15-16 network partition recovery 14-9 partition 15-5 requirement 3-3 specifications 13-5 intervention failover 12-18 IP application address configuration 7-14 IP resource 8-9 Address attribute 8-9 Device attribute 8-9 NetMask attribute 8-9
J
Java GUI installation 4-13 installation on Windows 4-13 Windows 4-13 jeopardy definition 14-7
K
key SCSI registration 15-11
L
license upgrading 3-5 verification 3-5 link high-priority 13-5 LLT status 13-7 low-priority 13-5 resource 8-20 link-lowpri directive 13-21 Linux rpm command 4-10 startup files 13-15 LLT adding links 13-21 configuration 13-21 configuration files 13-9 definition 2-13 link failure 14-7 link failures 14-11 link status 13-7 links 13-4 llttab file 13-9 low-priority link 13-21, 14-12 node name 13-11 simultaneous link failure 14-8 startup files 13-15 timeout 14-5 LLT link reconnecting 14-7, 14-9 lltconfig command 4-10 llthosts file 4-8, 13-10, 13-11 lltstat command 13-7 llttab file 4-8, 13-9, 13-21 local build 6-3 log agent 16-3, 16-4 display 5-6 file size 16-5 HAD 5-6
hashadow 16-3 installation 4-3 level of detail 16-5 troubleshooting 8-15 VCS engine 16-3 log file engine 16-3 system 16-3 LogDbg attribute 16-5 LogFileSize attribute 16-5 low-latency transport 1-14, 2-13 low-priority link 13-5 low-priority LLT link 13-21, 14-12 LUN 15-10
M
MAC address 13-7 main.cf backing up 9-5 main.cf file backup 9-15 backup files 9-16 Critical attribute 8-22 definition 2-18 fencing 15-26 installation 4-9 offline configuration 9-3 old configuration 9-13 service group example 9-11 syntax problem 9-14 system name 13-11 main.previous.cf file 9-13 maintenance cluster 3-6 ManageFaults attribute 12-5, 12-18 manual application migration 7-19 application start 7-15 fault management 12-5 mount file system 7-11 seeding 13-16 member systems 13-6 membership cluster 13-3 GAB 13-6 jeopardy 14-7
joining 13-14 regular 14-7 message troubleshooting 16-4 message queue notification 11-3 message severity levels 11-4 MIBSNMP 11-10 migration application 7-5 application service 7-19 VCS stop 6-5 mini cluster 14-8 mini-cluster 14-8 mkfs command 7-8 modify cluster interconnect 13-20 service group 8-4 modinfo command 13-20 modunload command 13-20 monitor adjusting 12-11, 12-12 entry point 12-12 interval 12-11 network interface 10-8, 10-9 resource 8-17 VCS 16-3 MonitorInterval attribute 12-9, 12-11 MonitorTimeout attribute 12-12 mount command 7-11, 8-12 Mount resource 8-12 mounting a file system 7-11
N
name cluster 3-8 convention 8-6 resource 8-6 service group 8-4 network interconnect interfaces 13-9 interconnect partition 14-8 interface monitoring 10-9 interface sharing 10-7 LLT link 14-12 partition 14-8, 15-16
preexisting partition 14-13 network partition 14-9 definition 15-5 NIC resource 8-7, 10-14 changing ToleranceLimit attribute 12-15 false failure detections 12-9 parallel service groups 10-11 sharing network interfaces 10-8 node LLT ID number 13-9 nonpersistent resource 10-12 clearing faults 12-17 offline 5-10 notation GAB membership 13-6 notification ClusterService group 11-6 configuration 11-6 configure 3-9, 11-7 e-mail 11-3 event messages 11-4 fault 12-19 GroupOwner attribute 11-9 message queue 11-3 overview 11-3 ResourceOwner attribute 11-8 severity level 11-4 support e-mail 16-6 trigger 11-11 notifier daemon configuration 11-6 message queue 11-3 NotifierMngr resource configuring notification 11-6 definition 11-7
O
offline entry point 12-12 nonpersistent resource 5-11 resource 5-15 troubleshooting 10-12 offline configuration benefits 6-16 cluster 6-7 existing cluster 9-5 new cluster 9-3
Simulator 9-12 stuck in wait state 9-14 OfflineMonitorInterval attribute 12-9, 12-11 OfflineTimeout attribute 12-10, 12-12 online definition 5-10 entry point 12-12 nonpersistent resource 5-10 resource 5-14 online configuration benefits 6-9 cluster 6-7 overview 6-10 procedure 8-3 resource 8-6 OnlineRetryLimit attribute 12-10 OnlineTimeout attribute 12-10, 12-12 OnlineWaitLimit attribute 12-10 operating system patches 3-5 VCS support 3-6 override attributes 12-16 override seeding 13-16
P
package installation 4-4 parallel service group 10-11 service group configuration 10-6 service group type 2-5 Parallel attribute 8-5 parent resource 2-7, 8-20 partial online taking resources offline 5-15 partition cluster 14-11 interconnect failure 14-8 preexisting 14-13 recovery 14-9 partition-in-time 15-30 path coordinator disk 15-26 fencing data disks 15-21 storage 3-4 persistent
resource fault 8-18 service group freeze 5-13 persistent resource 5-11, 10-11 clearing faults 12-17 Phantom resource 10-12, 10-13 port GAB membership 13-6 HAD membership 13-6 preexisting network partition 14-13 severed cluster interconnect 15-16 PreOnline trigger 11-11 prepare applications 7-3 identify application components 7-4 VCS installation 3-8 prevent HAD startup 14-13 private network 2-12 privilege UNIX user account 6-20 probe clearing faults 12-17 persistent resource fault 8-18 resource 8-18, 13-18 procedure offline configuration 9-3 resource configuration 8-6 Process resource 8-13 protect autodisabling 13-18 data 14-11 jeopardy membership 14-7 LLT link reconnection 14-10 seeding 13-14 Proxy resource 10-9, 10-10 public network LLT link 14-12
R
RAID 3-4 raw device 8-11 raw disks 7-8 reconnect LLT links 14-10 recover admin wait state 12-18
after interconnect failures 14-10 jeopardy 14-7 network partition 14-9 resource fault 12-17 recovery fenced system 15-29 partition-in-time 15-30 references for high availability 1-16 registration with coordinator disks 15-12 regular membership 14-7 Remote Build state 9-15 replicated state machine 13-3 requirements hardware 3-3 software 3-5 reservation 15-12 ResFault trigger 11-11 ResNotOff trigger 11-11 resource add 8-6 admin wait state 12-6 attribute 2-8 attribute verification 7-7 child 2-7, 8-20 clear fault 8-18 CLI configuration 8-10 configuration procedure 8-6 critical 12-3 Critical attribute 8-22 definition 2-6 dependencies in offline configuration 9-10 dependency 7-20, 8-20 dependency definition 2-7 dependency rules 2-7, 8-21 disable 8-17 DiskGroup 8-10 event handling 12-20 fault 8-18, 12-4 fault detection 12-9 fault example 12-8 fault recovery 12-17 GUI configuration 8-7 link 8-20 local attribute 10-14 Mount 8-12 name 8-6 non-critical 8-6
nonpersistent 10-12 NotifierMngr 11-7 offline definition 5-15 offline order 5-11 online definition 5-14 online order 5-10 parent 2-7, 8-20 persistent 5-11, 10-11 Phantom 10-12 probe 13-18 Process 8-13 Proxy 10-10 recover 12-18 restart example 12-14 troubleshooting 8-15 type 2-9 type attribute 2-9 verify 7-16 Volume 8-11 resource type configuration 12-15 controlling faults 12-13 None 10-11 OnOff 10-11 OnOnly 10-11 testing failover 12-11 ResourceOwner attribute 11-6, 11-8 ResStateChange trigger 11-11 restart resource example 12-14 RestartLimit attribute 12-14 root user account 6-9, 6-19 rsh 3-7 rules resource dependency 8-21 Running state 9-15
S
sample llttab file 13-9 Simulator configurations 5-17 trigger scripts 12-20 SAN 3-4 script trigger 11-11, 12-20 seeding definition 13-14
manual 13-16 override 13-16 preexisting network partition 14-13 split brain condition 13-16 service group autodisable 13-18 CLI configuration 8-5 data protection 15-13, 15-15 definition 2-4 event handling 12-20 failover attributes 12-5 failover type 2-5 fault 12-3 flush 8-16 freeze 5-13 GUI configuration 8-4 hybrid type 2-5 management 5-5, 5-10 name 8-4 network interface 10-7 offline 5-11 online 5-10 parallel 10-6, 10-11 parallel type 2-5 sharing network interfaces 10-9 status 10-11, 10-12 switching 5-12 testing 9-17 testing procedure 8-19 troubleshooting 8-15, 10-12 types 2-5 unfreeze 5-13 validate attributes 7-21 service groups autodisabled 14-7 severity log entries 16-4 shutdown application 7-9 VCS 6-5 Simulator Cluster Manager 5-19 command-line interface 5-20 definition 5-16 install path 5-17 installation 5-16 Java Console 5-17 sample configurations 5-18 test configuration 9-12
Simulator, configuration files 5-18 single LLT link failure 14-6 single point of failure 3-4 snapshot file system 8-12 SNMP console configuration 11-10 notification configuration 11-6 software configuration 3-6 managing applications 5-3 requirements 3-5 Solaris abort sequence 3-7 configure virtual IP address 7-15 llttab 13-9 pkginfo command 4-10 startup files 13-15 split brain condition 14-11, 14-13 definition 15-6 split-brain condition 14-7 ssh 3-7 staging offline cluster configuration 9-5 stale admin wait state troubleshooting 9-14 start volumes 7-11 with a .stale file 9-8 startup autodisabling service groups 13-18 default 6-3 fencing 15-19, 15-26 files 13-15 GAB 13-13 probing resources 13-18 seeding 13-14 service group configuration 8-4 state cluster 13-4 cluster membership 13-3 static attributes 12-16 status display 5-8 LLT link 13-7 service group 10-11 storage
requirement 3-4 shared bringing up 7-11 configuring 7-8 switch network 3-3 service group 5-12 testing a service group 8-19 switchover 5-12 sysname file 13-12 system cluster member 3-8 failover duration 14-5 failure 15-4, 15-13 failure recovery 15-29 fault 14-4 GAB startup specification 13-13 ID 13-10 join cluster membership 13-14 LLT node name 13-11 local attribute 10-14 seeding 13-14 service group configuration 8-4 state 13-4 SystemList attribute 8-4, 10-5
T
temporary service group freeze 5-13 test service group 8-19 testing application service 7-10 integrated components 7-17 service group 9-17 TFrozen attribute 5-13, 12-5 timeout adjusting 12-12 GAB 14-5 LLT 14-5 tools online configuration 6-10 VCS management 5-4 traffic interconnect 13-5 traps
SNMP 11-10 SNMP notification 11-6 trigger directory 12-20 event handling 12-20 events 12-20 fault handling 12-20 notification 11-11 PreOnline 11-11 ResFault 11-11 resnotoff 11-11, 11-12 ResStateChange 11-11 troubleshooting 14-9 configuration backup files 9-16 flushing a service group 8-16 guide 16-7 log 8-15 log files 16-4 offline configuration 9-13 recovering the cluster configuration 9-13 service group configuration 8-15 VCS 16-7 types.cf file backup 9-15 backup files 9-16 definition 2-18 installation 4-9
U
UMI, unique message identifier 16-4 unfreeze service group 5-13 unique message identifier 16-4 UNIX root user account 6-19 user account 6-20 unmount 8-12 UseFence attribute 15-22 user account creating 6-22 root 6-20 UNIX 6-20 VCS 6-21
V
validation
service group attributes 7-21 VCS access control authority 6-21 administration 5-4 administrator 6-20 architecture 2-17 communication 13-3 communication overview 13-4 fencing configuration 4-12 fencing implementation 15-22 forcing startup 9-14 installation preparation 3-8 installation procedure 4-5 messages 16-4 monitoring 16-3 SNMP MIB 11-10 SNMP traps 11-10 starting 6-3 starting stale 9-8 startup 6-3, 13-18 startup files 13-15 stopping 6-5, 6-6 system fault 14-4 system name 13-12 troubleshooting 16-7 user accounts 6-19 vcs.mib file 11-10 VCS_SIM_PORT environment variable 5-20 VCS_SIM_WAC_PORT environment variable 5-20 VCS_SIMULATOR_HOME environment variable 5-17 verification resources 7-16 verify cluster configuration 9-4 configuration file syntax 9-7 VERITAS Support 16-6 VERITAS Product Installer 4-3 vLicense Web site 3-5 volume management 7-8 volume management software 3-5 Volume resource 8-11 vxexplorer command 16-9 vxfen driver 15-19, 15-20 vxfenadm command 15-27 vxfenclearpre command 15-30 vxfenconfig command 15-19 vxfendg file 15-25 vxfentab file 15-26 vxfentsthdw command 15-25 VxFS 8-12 vxrecover command 8-10 VxVM fencing 15-19 fencing implementation 15-21 resources 8-11 vxvol command 8-10
W
wait internal resource state 8-16 wait state troubleshooting 9-14 Web GUI address 4-10 configuration 3-9 worksheet design 3-12 resource dependency 7-20
X
xnmevents command 11-10