
CIS 210 February 2013

Sun/Oracle Grid Engine is:

A quick and easy way to set up a multi-
cluster system using existing hardware
Oracle Grid Engine is the most widely
deployed workload management solution in
the industry and offers unmatched
scalability. On top of a rich set of advanced
scheduling capabilities and the flexibility to
adapt to any computing environment and
application workload, Oracle Grid Engine
offers comprehensive support for the cloud
computing model.

How to Install
Install SGE on master node:
mpiuser@ub0:~$ sudo apt-get install gridengine-client gridengine-common gridengine-master gridengine-qmon gridengine-exec
#remove gridengine-exec from the list if the master node is not supposed to run jobs
#during the installation, we need to set
the cluster CELL name (such as
Install SGE on other nodes:
mpiuser@ub1:~$ sudo apt-get install gridengine-client gridengine-exec

The CELL name is set the same as that
of the master node
Set SGE_ROOT and SGE_CELL
environment variables:
$SGE_ROOT refers to the installation path
of SGE
$SGE_CELL is the cell name, which is default
on our machines
Edit /etc/profile and /etc/bash.bashrc, add
the following two lines
export SGE_ROOT=/var/lib/gridengine #this is the path on our machines
export SGE_CELL=default
Source the script: source /etc/profile
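The edit above can also be scripted so that running it twice does not duplicate the lines. This is only a sketch: the function name add_sge_env is made up for illustration, and the two values are the ones used on these machines.

```shell
# Append the SGE variables to a profile file, skipping lines already present.
# add_sge_env is a hypothetical helper name, not part of SGE.
add_sge_env() {
    profile=$1
    for line in 'export SGE_ROOT=/var/lib/gridengine' 'export SGE_CELL=default'; do
        # -q quiet, -x whole-line match, -F fixed string (no regex)
        grep -qxF "$line" "$profile" 2>/dev/null || printf '%s\n' "$line" >> "$profile"
    done
}

# Example (the real files need root):
#   add_sge_env /etc/profile
#   add_sge_env /etc/bash.bashrc
```

Matching whole lines keeps the check simple; a line with different spacing or a trailing comment would not be detected and would be appended again.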
Configure SGE with qmon (This section is
modified from a note by Junjun Mao)
Invoke qmon as superuser:
mpiuser@ub0:~$ sudo qmon
#On our machine, qmon failed to start due to
missing fonts -adobe-helvetica-
# To solve the fonts problem:
mpiuser@ub0:~$ sudo apt-get install xfs xfstt
mpiuser@ub0:~$ sudo apt-get install t1-xfree86-nonfree ttf-xfree86-nonfree ttf-xfree86-nonfree-syriac xfonts-75dpi xfonts-100dpi
mpiuser@ub0:~$ sudo reboot #after reboot, the problem is gone
Configure hosts
"Host Configuration" -> "Administration
Host" -> Add master node and other
administrative nodes
"Host Configuration" -> "Submit Host" ->
Add master node and other submit nodes
"Host Configuration" -> "Execution Host"
-> Add slave nodes
->Click on "Done" to finish
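If qmon is not available, administrative and submit hosts can probably be registered from the command line with qconf instead. The sketch below only prints the commands so they can be reviewed first; host_cmds is a made-up helper name, and ub0/ub1 are the host names used in these slides.

```shell
# Print (do not run) the qconf commands that register each given host
# as an administrative host and as a submit host.
# host_cmds is a hypothetical helper name, not part of SGE.
host_cmds() {
    for h in "$@"; do
        echo "qconf -ah $h"   # -ah: add an administrative host
        echo "qconf -as $h"   # -as: add a submit host
    done
}

# Example: preview the commands for the master node
#   host_cmds ub0
# Execution hosts are added separately; "qconf -ae" opens an editor
# for the new execution host entry.
```

Printing instead of executing is a deliberate choice here: the commands change the cluster configuration, so it seems safer to look at them before running them on the master node.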
Configure the user
Add or delete users that are allowed to
access SGE here. In this example, a user
is added to an existing group and later this
group will be allowed to submit jobs.
Everything else is left as default values.
"User Configuration" -> "Userset" ->
Highlight userset "arusers" and click on
"Modify" -> Input user name in
"User/Group" field
->Click "Done" to finish
Configure the queue
While Host Configuration deals with which
computing resources are available and
User Configuration defines who has
access to the resources, Queue
Control defines how hosts and users
are connected.
Queue Control
"Queue Control" -> "Hosts" -> Confirm the execution
hosts show up there.

"Queue Control" -> "Cluster Queues" -> Click on
"Add" -> Name the queue and add the execution nodes
to its host list; "Use access" -> allow access to user
group arusers; "General Configuration" -> Field "Slots"
-> Raise the number to the total number of CPU cores on
the slave nodes (a number larger than the actual core
count is also accepted).

"Queue Control" -> "Queue Instances" -> This is the
place to manually assign hosts to queues, and
control the state (active, suspend ...) of hosts.
Configure parallel environment
"Queue Control" -> "Cluster Queues" -> Select a queue that will
run parallel jobs -> Click on "Modify" -> "Parallel Environment"
-> Click on icon "PE" below the right and left arrows -> Click on
"Add" -> Name the PE, slots = 999, start_proc_args =
$SGE_ROOT/mpi/ $pe_hostfile, stop_proc_args =
$SGE_ROOT/mpi/, allocation_rule = $fill_up, and tick the
"Control slaves" checkbox.

Make sure the configured PE is loaded from "Available PE" to
"Referenced PE".

Confirm and close all config windows and open "Queue Control"
-> "Cluster Queues" -> "Parallel Environment" again, the named
PE should show up.

Once created and linked to a queue, PE can be edited from
"Queue Control" -> "PE" too.
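The same PE settings can presumably be written to a file and loaded with qconf rather than entered in qmon. The sketch below makes several assumptions: the PE name mpi_pe and the file name mpi_pe.conf are invented for illustration, /bin/true is only a stand-in for the start and stop scripts under $SGE_ROOT/mpi/ (their names are not given here), and the fields not mentioned in the slides are set to what are believed to be their defaults.

```shell
# Write a PE definition that mirrors the qmon settings described above.
# ASSUMPTIONS: "mpi_pe" and "mpi_pe.conf" are hypothetical names, and
# /bin/true replaces the real start/stop scripts under $SGE_ROOT/mpi/.
# The quoted 'EOF' keeps $fill_up literal instead of expanding it.
cat > mpi_pe.conf <<'EOF'
pe_name            mpi_pe
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
EOF

# On the master node, with qmaster running (not executed here):
#   qconf -Ap mpi_pe.conf                             # add the PE from the file
#   qconf -aattr queue pe_list mpi_pe <queue name>    # reference it from a queue
```

Keeping the definition in a file makes it easy to recreate the PE on another cluster; the qmon route above remains the one actually used in these slides.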
Check whether sge hosts are
running properly
mpiuser@ub0:~$ qhost #it should list the system info from all nodes
mpiuser@ub0:~$ qconf -sel #it should list the hostnames of all execution hosts
mpiuser@ub0:~$ qconf -sql #it should list the queues
mpiuser@ub0:~$ ps aux | grep sge_qmaster | grep -v grep #check master daemon
mpiuser@ub0:~$ ps aux | grep sge_execd | grep -v grep #check execute daemon
mpiuser@ub1:~$ ps aux | grep sge_execd | grep -v grep #check execute daemon

#If the sge_qmaster or sge_execd daemon is not running, try starting it as a service
#mpiuser@ub0:~$ sudo service gridengine-master start
#mpiuser@ub1:~$ sudo service gridengine-exec start

#Reboot node(s) if sge_qmaster or sge_execd fails to start
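The ps/grep checks above can be wrapped in a small function that reports a clear result per daemon. This is a sketch only; check_daemon is a made-up name, and it assumes a ps that accepts -e -o comm= (as on typical Linux systems).

```shell
# Report whether a process with exactly the given command name is running.
# check_daemon is a hypothetical helper name, not part of SGE.
check_daemon() {
    # "ps -e -o comm=" prints one bare command name per process;
    # "grep -qx" succeeds only on an exact whole-line match.
    if ps -e -o comm= 2>/dev/null | grep -qx "$1"; then
        echo "$1: running"
    else
        echo "$1: NOT running"
        return 1
    fi
}

# On the master node:      check_daemon sge_qmaster
# On an execution node:    check_daemon sge_execd
```

Matching the exact command name avoids the extra "grep -v grep" step, since the grep process itself is never named sge_qmaster or sge_execd.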
Run a test script
Make a script named test with content:
### Request bash as the shell for the job
#$ -S /bin/bash
### Use current directory as working directory
#$ -cwd
### Name the job:
#$ -N test
echo Running environment:
echo =============================
###end of script
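A slightly longer test job can also show which node ran it and what the scheduler passed in. The sketch below writes such a script to a file; the file name envtest.sh and the job name envtest are invented for illustration. JOB_ID, JOB_NAME and NSLOTS are variables SGE is expected to set inside a job, so each one falls back to "not set" when the script is run outside the scheduler.

```shell
# Create a job script that reports where and how it is running.
# "envtest.sh" and the job name "envtest" are hypothetical.
# The quoted 'EOF' keeps the variables unexpanded until the job runs.
cat > envtest.sh <<'EOF'
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -N envtest
echo "Host:     $(uname -n)"
echo "Job ID:   ${JOB_ID:-not set}"
echo "Job name: ${JOB_NAME:-not set}"
echo "Slots:    ${NSLOTS:-not set}"
EOF

# Submit on a working cluster (not executed here):
#   qsub envtest.sh
```

Running bash envtest.sh directly is a quick syntax check before submitting, since the #$ lines are ordinary comments to the shell.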
Job Submission
To submit the job: qsub test
#a job ID is returned if successful
Query the job status: qstat
#If the job runs successfully, two output
files are produced in the current working
directory: test.oXXX (the standard output)
and test.eXXX (the standard error), where
test is the job name and XXX is the job ID
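The naming rule for the output files can be sketched as two small helpers. Both function names are made up, and the first one assumes qsub confirms a submission with a line of the form: Your job 42 ("test") has been submitted.

```shell
# Read qsub's confirmation line on stdin and print the numeric job ID.
# ASSUMPTION: the line looks like: Your job 42 ("test") has been submitted
job_id_from_qsub() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Print the default stdout and stderr file names for a job name and job ID.
job_output_files() {
    printf '%s.o%s\n%s.e%s\n' "$1" "$2" "$1" "$2"
}

# Example on a working cluster (not executed here):
#   id=$(qsub test | job_id_from_qsub)
#   job_output_files test "$id"
```

If the confirmation message differs on a given installation, job_id_from_qsub prints nothing, which is easy to detect before looking for the output files.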
Always check your logs
Check the log messages if an error occurs
mpiuser@ub0:~$ less
#master node
mpiuser@ub0:~$ less
es #exec node
Possible Errors
Question: My output file contains "Warning: no
access to tty (Bad file descriptor). Thus no
job control in this shell."
Answer: This warning appears when tcsh or
csh is used as the shell for the submitted
job. It is safe to ignore. Alternatively, run
the job in a different shell with
qsub -S /bin/bash, or add the line
#$ -S /bin/bash to the job script.
Possible Errors
Question: The master host failed to respond properly. The error message is:
error: commlib error: access denied (client IP resolved to host name ub0. This is not identical to clients host name ub0)
error: unable to contact qmaster using port 6444 on host ub0
Answer: Reboot the master node or install SGE from source code on the master node
(solutions not confirmed yet). Another possible cause is that the gethostname utility (full
path /usr/lib/gridengine/gethostname on our machines) returns a different hostname
from the one returned by hostname -f. If this is the case (e.g., the host has
multiple network interfaces), create a file named host_aliases under
$SGE_ROOT/$SGE_CELL/common and populate it as follows,
# cat host_aliases
ub0 ub0-grid
ub1 ub1-grid
ub2 ub2-grid
ub3 ub3-grid
and then restart the gridengine daemon (see man page of sge_host_aliases for
details). Check the aliases:
mpiuser@ub0:~$ /usr/lib/gridengine/gethostname -aname ub0-grid
mpiuser@ub0:~$ /usr/lib/gridengine/gethostname -aname ub0
#both of them should return ub0
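For more than a few nodes, the host_aliases lines can be generated rather than typed. This is a sketch under the naming pattern shown above (primary name followed by the same name with a -grid suffix); make_host_aliases is a made-up helper name.

```shell
# Print one host_aliases line per host: primary name first, alias second.
# make_host_aliases is a hypothetical helper; the "-grid" suffix follows
# the example entries shown above.
make_host_aliases() {
    for h in "$@"; do
        printf '%s %s-grid\n' "$h" "$h"
    done
}

# Example (writing the real file needs the right permissions):
#   make_host_aliases ub0 ub1 ub2 ub3 > "$SGE_ROOT/$SGE_CELL/common/host_aliases"
```

Hosts whose second name does not follow the -grid pattern would still need their lines written by hand.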