Other company, product, and service names are the properties of their respective owners.
3 Install OCM
3.1 Check OCM server and other OCM nodes
3.2 Check OCM License-keys and setup OCM account environment settings
3.3 Check Omega configuration database entries
3.4 Prepare hardware database
3.5 Install ocmcontroller RPMs on OCI masters and compute nodes
3.6 Install OCM on OCM server
3.7 Extra commandline OCM admin tool (optional)
6 Acceptance tests
Note
• This quick start does not cover every detail in the OCM admin manual and user manual, which are
excellently written and contain a lot of good information. Please read them and refer to them for anything
we skip here. If the quick start and the OCM admin manual disagree, please
follow the admin manual. These guides are on DVD OmegaAndOCM under the /02-OCM/01-Documentation
directory.
• This quick start covers a full OCM installation. For an OCM upgrade, please refer to
OCM_2.5_Release_Notes.pdf.
• If you are migrating from JSS to OCM, please refer to JSS_to_OCM_guide.pdf in the same location
on the DVD.
• The directory paths shown here are Linux paths, so we write
/02-OCM/02-Installation rather than \02-OCM\02-Installation.
A script, ocm_server_checker.sh, can be used to check the server's setup and port availability.
In this guide, we use the sample host xxoc001, discussed in the System Preparation Quick Start Guide, as the OCM host.
The OCM instance puts a very light load on the Oracle server. Most sites can share the Oracle server used by the Omega
OPM/RDM host.
Large sites can use a separate Oracle server for OCM. Please consult Schlumberger on this.
We will use xxoc001 as the OCM server, xxmm00[1-2] as OCI masters, and xxa00[01-50] as compute nodes, as
described in section 1.4 of the System Preparation Quick Start Guide.
As mentioned above, Oracle instance xxocm001 is created on xxos001 in the Oracle Installation section of the
2017.1ext Linux Installation Quick Start Guide.
Please refer to step 3.1 of the System Preparation Quick Start Guide to check directory permissions here.
# if java version is lower than 1.7.0.55, find a newer java openjdk rpm
# install java, as root
yum install java-1.xxxx-openjdk
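The version comparison in the comment above can be sketched in Python. This is a minimal sketch, not part of the OCM tooling: it assumes the banner printed by `java -version` (which goes to stderr) contains a quoted version such as "1.7.0_45", and the function names are hypothetical.

```python
import re

MIN_JAVA = (1, 7, 0, 55)  # minimum version named in the comment above


def parse_java_version(banner):
    """Extract a comparable tuple from a `java -version` banner, e.g. "1.7.0_45"."""
    match = re.search(r'version "(\d+(?:[._]\d+)*)', banner)
    if match is None:
        raise ValueError("no version string found in: %r" % banner)
    return tuple(int(part) for part in re.split(r"[._]", match.group(1)))


def java_is_new_enough(banner, minimum=MIN_JAVA):
    """True if the banner's version is at least `minimum`."""
    version = parse_java_version(banner)
    # Pad the shorter tuple with zeros so (1, 8) compares like (1, 8, 0, 0).
    width = max(len(version), len(minimum))
    padded_version = version + (0,) * (width - len(version))
    padded_minimum = minimum + (0,) * (width - len(minimum))
    return padded_version >= padded_minimum


# Illustrative banners, not captured from a real host.
print(java_is_new_enough('java version "1.7.0_45"'))
print(java_is_new_enough('openjdk version "1.8.0_131"'))
```

If the first call reports False on a host, a newer OpenJDK RPM is needed before continuing.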
In addition to the above preparation, there is a script to check the OCM server. It is on DVD OmegaAndOCM under
/02-OCM/03-Miscellaneous/scripts. Below is sample output:
2. Oracle installation
Oracle Home = /oracle/12.1
# check if /wg/omega and /wgjss are NFS mounted on the masters and compute nodes
ls /wg/omega /wgjss/ocm
# check if local /wglogs, /tmp, /local1/scr exist and have permission of 777
ls -ald /wglogs/ /tmp /local1/scr
# if java version is lower than 1.7.0.55, find a newer java openjdk rpm
# install java, as root
yum install java-1.xxxx-openjdk
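The check that /wglogs, /tmp and /local1/scr exist with permission 777 can also be scripted when there are many nodes to visit. The following is a minimal sketch with a hypothetical helper name; it masks the mode with 0o777 so that the sticky bit usually set on /tmp does not cause a false alarm.

```python
import os
import stat


def has_mode_777(path):
    """True if path is a directory with rwx set for user, group and other."""
    try:
        info = os.stat(path)
    except OSError:
        return False  # missing or unreadable path
    if not stat.S_ISDIR(info.st_mode):
        return False
    # Keep only the nine permission bits; /tmp is normally 1777.
    return (info.st_mode & 0o777) == 0o777


for directory in ("/wglogs", "/tmp", "/local1/scr"):
    print(directory, "OK" if has_mode_777(directory) else "CHECK")
```

Run it on each master and compute node; any directory reported as CHECK needs to be created or have its permissions corrected.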
We should see at least the OCMBase license. Without this license we cannot run OCM, and we will get a license
verification error on the OCM web page.
If you purchased the OCM Dynamic license, you should see both the OCMBase and OCMDynamic licenses.
su - ocm
echo setenv SLBSLS_LICENSE_FILE @xxls001.dnsdomain.com >> .cshrc
echo setenv PATH /wgas/Server/bin:/wgas/ocm/bin:$PATH >> .cshrc
source .cshrc
Note:
• OCM installs without checking licenses. We can even restart ocmadmin without licenses. OCM licenses
are checked only when we bring up the OCM server web page or submit OCI jobs.
• With an OCMDynamic license, we can submit parallel OCI jobs using 'dynamic node allocation'.
Without an OCMDynamic license, we can only submit parallel OCI jobs using 'static node allocation'.
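The licensing rules in this note can be summarized in a few lines of Python. This is only an illustration of the rules as stated here, not OCM code. The function name is hypothetical, and the sketch assumes that static allocation remains available when OCMDynamic is present.

```python
def allowed_oci_allocation(features):
    """Map the license features seen on the license server to OCI allocation modes."""
    features = set(features)
    if "OCMBase" not in features:
        # Without OCMBase, the OCM web page and OCI job submission
        # fail license verification.
        return []
    if "OCMDynamic" in features:
        # Assumption: static allocation is still allowed alongside dynamic.
        return ["static", "dynamic"]
    return ["static"]


print(allowed_oci_allocation(["OCMBase"]))
print(allowed_oci_allocation(["OCMBase", "OCMDynamic"]))
```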
Note 1: Omega.WAN.Sites.Default.JobDirectory is where OCM picks up the job files. The directory should
generally be /wgjss/ocm/workorder.
Note 2: For a small center with a flat network, the Omega hardware database is no longer required in OCM 2.5 and
Omega 2017.1.
# as omadmin on OCM installation server xxoc001, make sure we have display here
omega2 config&
Switch names can be arbitrary, for example core, level1switch1, level1switch2, level2switch1, level2switch2, etc.
OCM only cares about the structure of the network and which nodes connect to the same switch. Use core as the
root switch name.
GPU hosts that connect to different InfiniBand (IB) islands need to be separated into different groups, even if they
are connected to the same Ethernet switch. This is a hard requirement for IB-aware applications such as RTM,
FWI, and FD_MOD. It does not matter whether Ethernet connectivity information is represented for the nodes. It is
sufficient for each of the IB switches in the HWDB to be connected to 'core'.
Scenario 1:
Scenario 2:
deviceID,neighbor-deviceID,bandwidth
core,level1switch1,10
core,level1switch2,10
level1switch1,xxa[0001-0025],10
level1switch2,xxa[0026-0050],10
Scenario 3:
All xxa nodes are GPU nodes and connect to the same Ethernet switch,
but xxa00[01-25] connect to IB island 1 and xxa00[26-50] connect to IB island 2
deviceID,neighbor-deviceID,bandwidth
core,vsre001,10
core,vsre002,10
vsre001,xxa[0001-0025],10
vsre002,xxa[0026-0050],10
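Before loading a hardware database file, it can help to confirm that every device in it leads back to core. The sketch below is not an OCM tool. It assumes the CSV layout shown above (a header row, with the first column naming the upstream device), that bracket ranges such as xxa[0026-0050] expand to zero-padded host names, and that each device is attached only once. All function names are hypothetical.

```python
import csv
import io
import re

RANGE = re.compile(r"^(.*)\[(\d+)-(\d+)\](.*)$")


def expand(name):
    """Expand 'xxa[0001-0003]' into ['xxa0001', 'xxa0002', 'xxa0003']."""
    match = RANGE.match(name)
    if match is None:
        return [name]
    prefix, first, last, suffix = match.groups()
    width = len(first)  # keep the zero padding of the range start
    return ["%s%0*d%s" % (prefix, width, n, suffix)
            for n in range(int(first), int(last) + 1)]


def load_hwdb(text):
    """Return {device: upstream device} for every link, expanding node ranges."""
    parents = {}
    for row in csv.DictReader(io.StringIO(text)):
        for child in expand(row["neighbor-deviceID"].strip()):
            if child in parents:
                raise ValueError("%s is attached twice" % child)
            parents[child] = row["deviceID"].strip()
    return parents


def reaches_core(device, parents, root="core"):
    """Follow upstream links and report whether they end at the root switch."""
    seen = set()
    while device != root:
        if device in seen or device not in parents:
            return False
        seen.add(device)
        device = parents[device]
    return True


SCENARIO_3 = """deviceID,neighbor-deviceID,bandwidth
core,vsre001,10
core,vsre002,10
vsre001,xxa[0001-0025],10
vsre002,xxa[0026-0050],10
"""

parents = load_hwdb(SCENARIO_3)
orphans = [d for d in parents if not reaches_core(d, parents)]
print(len(parents), "devices,", len(orphans), "not connected to core")
```

A switch name that is spelled differently in two rows shows up here as a group of nodes that never reach core, which is easier to spot than in the CSV itself.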
omega2 hwconfig&
df -h /wg/omega
omegainst
cd /wg/omega/installations/02-OCM/02-Installation
rpm -Uvh ocmcontroller-201-1.noarch.rpm
# check if it's in chkconfig
chkconfig --list | grep ocmcontroller
# if not, turn it on
chkconfig ocmcontroller on
Optional: install o2dk RPM and use o2dk to monitor omegalauncher and ocmcontroller service on each host
# on xxoc001, as root
cd /wg/omega/installations/02-OCM/02-Installation
rpm -Uvh ocmrootcontroller-205096-1.noarch.rpm
Optional: install o2dk RPM and use o2dk to monitor ocmrootcontroller service on OCM server
vi /etc/o2dk.conf
# make sure we have this line in the file
service = ocmrootcontroller
# on xxoc001, as ocm
# assume we copied OCM installation files to /wg/omega/installations/02-OCM/02-Installation
su - ocm
cd /wg/omega/installations/02-OCM/02-Installation
./OcmSetupExt.2.5.96.exe
# ./OcmSetupExt.2.5.96.exe --definepasswords
# note: this option will overwrite default Oracle OCM table passwords
# when prompted for Oracle server machine name, type in: xxos001
# when prompted for OCM instance ID: xxocm001
# when prompted for server installer path, enter:
# /wg/omega/installations/02-OCM/02-Installation/ee6u4j7installer.bin
# type password for admin (for http://xxoc001.dnsdomain.com:4848/)
# type password for ocmadministrator (for http://xxoc001.dnsdomain.com:9090/ocm)
# take default action and finish the installation
# specify user name and password, this password is independent of linux NIS password
# in this example, we can add ocm, password ocm!, as one power user
ocmadmin> startocm
ocmadmin> exit
Large sites can use the command-line OCM tools, which are provided in two RPMs.
These commands can restart OCM and change node status. They are very powerful.
Please refer to section 8, Command Line Tools to Manage OCM, of the OCM_Admin_Manual for details.
Below is a sample configuration screen. Please update 'System Configuration', 'Resource-class Load-Unit Map',
'Node-Group Definition' and 'Accounting Factor'. Log in as ocm and click the 'Edit xxxx' link in each section.
Job Archive Directory: this is where the archiver saves old jobs. Usually we set it to /wgjss/ocm/archive. This
directory needs to be writable by the ocm account.
Cluster Sharing Communication Directory: the directory for communication between OCM and a 3rd party
scheduler. We will talk about this in detail at the end of this quick guide.
Cluster Sharing Release Inactive Nodes after: the idle time after which external nodes are released back to the 3rd party
scheduler.
Fast RDM Check: check the RDM pool quota for jobs at job submission time instead of checking the pool quota during
the job run. This avoids wasting compute node time when there is not enough space in the RDM pool.
Precheck License: we recommend setting it to 'Yes', which means OCM checks licenses before it assigns nodes
to a job. If set to 'No', OCM assigns nodes to jobs regardless of whether licenses are available, and the jobs or
compute tasks may fail.
We recommend using the sample load numbers in the screenshot above. If serial job load automation is turned
on, only 'heaviest' is used.
Default/parallel = Name=parallel
Default/light = Name=light
Default/tape = Name=tape
Omega.Batch.RunClass
Priorities(in increasing order) = Default,Host,User,Project
Default/omega2 = Name=omega2
When we batch submit jobs, we need to specify the 'Target Queue' of compute nodes. This 'Target Queue' should
be one of the 'Resource Class' entries we defined here.
This way, OCM knows how many resources the job will take and assigns nodes accordingly.
Compute role nodes can only run OCI jobs as child nodes.
OCI-Master nodes are job servers for OCI jobs and work with the 'compute' role on OCI jobs.
Each node group can have a combination of the 'Omega', 'Compute', and 'OCI-Master' roles.
OffHours nodes are for small sites that want to add workstations to the Omega cluster to be used only during off
hours. Each Off-Hours node (Linux) should have Omega and ocmcontroller installed.
External nodes are nodes that are shared by OCM and a 3rd party scheduler. Uncheck this if you don't have a 3rd party
scheduler. We will talk about node sharing at the end of this guide.
Accounting Factor sets the charge rate for different nodes. We recommend using the number of CPUs in each type
of node as the accounting factor.
If your site does not need this, you can just set it to 1.0.
Please note that the OffHours nodes are enabled automatically by OCM in off hours (working hours are
defined earlier in System Configuration).
The health check scripts are in the /wgas/healthcheck directory on the OCM server host. If you don't have IB
nodes or you don't use Mellanox IB cards, disable the IB check script by renaming it.
[root@triumph01 healthcheck]# ll
total 156
-rwxr-xr-x 1 ocm ocm 12574 Jul 18 16:18 custom_ib_health_check.sh
-r-xr-xr-x 1 ocm ocm 15230 Jul 18 16:18 ocm_healthcheck_generic.py
-r-xr-xr-x 1 ocm ocm 28810 Jul 18 16:18 ocm_healthcheck_mpi_compute1.py
-r-xr-xr-x 1 ocm ocm 42807 Jul 18 16:18 ocm_healthcheck_mpi_compute2.py
-r-xr-xr-x 1 ocm ocm 21953 Jul 18 16:18 ocm_healthcheck_oci_compute.py
-r-xr-xr-x 1 ocm ocm 22558 Jul 18 16:18 ocm_healthcheck_omega.py
[root@triumph01 healthcheck]# less custom_ib_health_check.sh
# disable IB check
mv custom_ib_health_check.sh custom_ib_health_check.sh.notused
You can also customize these healthcheck scripts using python scripts. The details can be found in OCM Admin
Manual section 3.6 Node Health Check Setup.
From OCM -> Node Groups -> right click on each node group, we can 'Change Max Resource Load' and 'Change
Max Allowed Jobs' for each node group.
Below is the recommended resource load table for different types of HPC nodes:
Node          Node Cores   Node Memory (GB)   Max Rload/Node   Max Jobs/Node
IvyBridge     20           128                39               13
SandyBridge   16           64                 33               9
Westmere      12           48                 28               7
Nehalem       8            24                 17               5
Note: from OCM -> Scheduling -> Config -> '_Automation: Default Serial Job Load', we can specify the default
serial job load. If this default value is bigger than the max allowed resource load on the serial nodes, then
serial jobs won't run. We will talk about this entry in the Scheduling part below.
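The condition in this note is easy to check against the recommended table before changing the default. The sketch below hard-codes the Max Rload/Node and Max Jobs/Node values from the table above. The function name is hypothetical, and a real site would substitute the limits it actually configured per node group.

```python
# Recommended limits from the table above: node type -> (max rload, max jobs).
RECOMMENDED = {
    "IvyBridge": (39, 13),
    "SandyBridge": (33, 9),
    "Westmere": (28, 7),
    "Nehalem": (17, 5),
}


def blocked_node_types(default_serial_load, limits=RECOMMENDED):
    """Node types whose max resource load is below the default serial job load."""
    return sorted(name for name, (max_rload, _max_jobs) in limits.items()
                  if default_serial_load > max_rload)


print(blocked_node_types(20))  # ['Nehalem']
print(blocked_node_types(10))  # []
```

With a default serial job load of 20, serial jobs would not run on Nehalem nodes configured with the recommended maximum of 17.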
Detailed instructions and explanations are in OCM_User_Manual_V2.5.pdf, section 4.7 Scheduling
Configuration and table 4-4.
Max Fraction of Nodes in Shareable Node Groups for Serial Jobs: a value between 0 and 1. '0' means serial jobs
cannot run on shareable node groups. '0.5' means serial jobs can use half of the nodes. '1' means serial jobs can use
all nodes. A detailed explanation is in OCM_User_Manual_V2.5.pdf section 3.4.8.
Fraction of Test Nodes in a Selected Node Group: a value between 0 and 1. '0.5' means Test mode jobs can use
half of the nodes of the selected group, defined in Test job rules. Test mode jobs have the highest priority to run. Detailed
information is in section 6.5 of the OCM user guide.
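To see what a fraction setting means in node counts, the arithmetic can be sketched as follows. This is an illustration only. The helper name is hypothetical, and rounding down is an assumption, since this guide does not say how OCM rounds a fractional node count.

```python
import math


def usable_nodes(total_nodes, fraction):
    """Nodes opened up by a fraction setting, rounding down (an assumption)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1, got %r" % fraction)
    return math.floor(total_nodes * fraction)


# A node group of 50 nodes with the three example values from the text.
for fraction in (0, 0.5, 1):
    print(fraction, usable_nodes(50, fraction))
```

For the 50 sample compute nodes, a value of 0.5 gives 25 nodes, 0 gives none and 1 gives all 50.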
Please note we leave the 'Scheduling' -> 'Config' -> 'Require Pre-defined Project' field at 'No', to make things easier.
This means OCM allows jobs without a predefined rule to run. A default TEMPLATEPROJECT rule and a default
project rule will be created the first time a job of any project is submitted.
Allow Users to Manage Their Jobs -> checked. This eases the initial burden on the Omega admin of monitoring the
OCM job queue and allows users to manage their own jobs.
Apply Serial Job Load Automation -> checked. This lets OCM determine the rload of a serial job.
License for Dynamic OCI Jobs is updated automatically. OCM checks whether we have an OCMDynamic license on
the license server. If yes, this field is displayed as checked.
However, there are many cases in which we need to create customized project rules to make better use of the compute resources.
The project rule creation window is attached below. A detailed explanation of project rules can be found in the OCM
user guide, section 4.3 Project rule.
Please note Target Node Quota and Reserve Nodes are optional. These two fields can reserve nodes for the
project in selected node groups.
Job rules allow more specialized control of job scheduling and node allocation. Below is a screenshot. Detailed
information can be found in the OCM user guide, section 4.4 Job Rule.
Please note processing mode 'Test' has higher priority. 'Test' mode jobs allow test users to run their tests
faster on specified nodes in a busy center.
Tape rules are for centers that have multiple tape drives connected to different nodes. In such cases, we can
use a tape rule to specify which tape drive can be used by which node group. A detailed explanation can be found in
the OCM user guide, section 4.6 Tape Job Scheduling.
OCM can share nodes with a 3rd party scheduler such as PBS Torque.
OCM and the 3rd party scheduler communicate through the 'ClusterSharing service' we started, reading and generating
drop files in the 'Cluster Sharing: Communication Directory' we mentioned before.
Shared groups and nodes need to have the same names and be configured on both systems.
The client needs to write a simple Python script to communicate with OCM on node requests.
To troubleshoot OCM-related issues, we can check /wglogs/OCMROOT/server0.log on the OCM server, or
/wglogs/OCM/server0.log on each master or compute node.
OCM event logs can be viewed on the OCM 'Events' web page. They are also very helpful in troubleshooting OCM issues.
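When a server0.log is long, a small filter can pull out the lines worth reading first. The sketch below is not an OCM tool. It assumes the log marks problems with the Java logging level names SEVERE and WARNING, which should be confirmed against a real log, and the function names are hypothetical.

```python
import re

# Assumption: problem entries carry the Java logging levels SEVERE or WARNING.
LEVELS = re.compile(r"\b(SEVERE|WARNING)\b")


def problem_lines(lines, pattern=LEVELS):
    """Yield (line_number, text) for each line that mentions a problem level."""
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            yield number, line.rstrip("\n")


def scan_log(path):
    """Return the problem lines of one log file, tolerating odd bytes."""
    with open(path, errors="replace") as handle:
        return list(problem_lines(handle))
```

For example, `scan_log("/wglogs/OCMROOT/server0.log")` on the OCM server returns the numbered lines to review first.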