
Tutorial about Cluster

By: Showayb A A Zahda

This tutorial will teach you the basics of clusters, how to build one, and how to
monitor your jobs using the Sun Grid Engine.

What is a cluster?

A cluster can be defined as a group of computers running as one computer. There are
several types of clusters. However, in this tutorial we will be talking about the High
Performance Computing (HPC) cluster. A cluster is a replacement for a supercomputer,
which has very high computing performance as well as a very high cost. To make a
cluster you need ordinary Personal Computers (PCs), a network card in each machine,
and a switch to connect the machines. This greatly reduces the cost of
high-performance computing.

The notion of a cluster is having more than two machines connected together by a
network. A master machine, or front-end machine, is basically like the God of the cluster:
the master node controls everything in the cluster. It has at least two network cards;
one of them is connected to the outside world and the other is connected to the network
of the cluster. The other nodes, which are called compute nodes, are connected to the
cluster network as well. Compute nodes receive commands to be executed from the master node.
After the nodes finish executing the tasks, they return the output to the master node. The
master node is responsible for distributing the tasks evenly among the nodes.

The interesting point about running a cluster is that it runs best under the Linux
operating system and Linux tools, which are free.

How to build a cluster?

In this section, we will build a cluster from scratch. Before we build it, we need to
know the architecture of our cluster. We will build it using three 64-bit computers.

We also need a switch and network cables. What else? Of course, we need a Linux-based
operating system to run the cluster.

The following figure depicts the architecture of our cluster

[Figure: the master node is connected to the outside world and to the switch; compute
node 1 and compute node 2 are connected to the switch.]

The master node needs a keyboard, a mouse, and a monitor, and it is strongly recommended
that it have a CD-ROM drive. The compute nodes, however, do not need mice, keyboards, or
monitors, because we can control them from the master node (we will talk about that soon).

The Rocks operating system is made especially for clusters. Rocks is a Red Hat based
operating system. It has a lot of interesting features which make your life easy, and it
is very easy to install. To install Rocks, visit the official website of Rocks Cluster,
http://www.rocksclusters.org/wordpress/, where you can download Rocks and find the
installation handout.

Some handy information about Rocks

Rocks configures everything for you. From the master node (as the root user) you can
control everything on the cluster. The following are some important and handy
commands.

Note: Assume that all the machines in your cluster are one machine i.e. the master
node. This assumption makes it easy for you to handle the cluster.

Because the cluster is meant to help people perform tasks quickly, you need to
create user accounts for those people so as to have good security.

To add a user to the system, use useradd name. This command creates a user
that has some privileges on the system (to know more, read about managing users in
Linux). Because we assumed that all the machines are one, we need to notify the rest
of the machines about changes in the system, in this case the new user. Doing so is
very easy: just type rocks-user-sync on the shell. This command synchronizes the
settings of all the machines in the cluster so that they perform as one machine.
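
The two steps above can be put together in a small helper. This is only a sketch: the
names add_cluster_user, run and DRY_RUN are made up for this example, and it assumes that
useradd and rocks-user-sync are available on the master node as described above. By
default it only prints the commands, so you can check them before running anything as root.

```shell
#!/bin/bash
# Sketch: create a user on the master node, then synchronize the cluster.
# add_cluster_user, run and DRY_RUN are illustrative names, not part of Rocks.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"          # show what would be executed
    else
        "$@"               # execute the command (needs root)
    fi
}

add_cluster_user() {
    # Synchronize only if the user was created successfully.
    run useradd "$1" && run rocks-user-sync
}

add_cluster_user alice
```

Once the printed commands look right, running the script with DRY_RUN=0 as root would
execute them for real.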

Assume you want to execute one command on all the machines simultaneously. It is
ridiculous to go to each machine and execute the command. Moreover, do not forget that
the compute nodes have no mice, keyboards, or monitors. So, how is that
possible?

That is possible in two ways:

1- To execute one command on all the machines at the same time (for instance, to
reboot all the machines), use the cluster-fork command. This command asks every
machine in the cluster except the master node to execute the given command. The
output will be displayed on your shell or can be redirected to a file (if you do
not know what output redirection is, google it).
#cluster-fork reboot : this command will reboot all the compute nodes.
#cluster-fork hostname : this command will ask all the compute nodes for their
hostnames. You can use it to check whether all the machines are on.
2- If you want to do some tasks on one particular machine only, you can use the ssh
command in this format: ssh machine-name command

# ssh c0-0 reboot
# ssh c0-1 mkdir showayb
# ssh c0-0 ls -l

What are ssh, c0-0, c0-1, reboot, etc.?

ssh: the secure shell, which allows you to connect to other machines remotely. Read
about ssh if you want to know more.

c0-0 or c0-1: the names of the compute nodes. These names are assigned by Rocks by
default. So if you have 10 computers, computer number 7 is c0-7.

The command is any Linux command like reboot, ls, mkdir, kill, etc. So ssh c0-0 acts like
your shell on that node, and the command is an ordinary Linux command. You do not need
to go to the machine and execute the commands there. Do it from the master node.
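
If you want something in between (one command on a few chosen nodes), a small loop around
ssh can do it. The following is only a sketch under the naming scheme described above:
node_name, run_on_nodes and DRY_RUN are made-up names for this example. With DRY_RUN=1
(the default here) it only prints the ssh commands instead of running them.

```shell
#!/bin/bash
# Sketch: run one command on a chosen set of compute nodes from the master node.
# node_name, run_on_nodes and DRY_RUN are illustrative names, not Rocks tools.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the ssh commands, 0 = really run them

# Build a default-style node name from its number, e.g. 7 -> c0-7.
node_name() {
    echo "c0-$1"
}

# usage: run_on_nodes "<node numbers>" command [arguments...]
run_on_nodes() {
    numbers="$1"
    shift
    for n in $numbers; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "ssh $(node_name "$n") $*"
        else
            ssh "$(node_name "$n")" "$@"
        fi
    done
}

run_on_nodes "0 1" hostname
```

In dry-run mode this prints one ssh line per node, which is the same form you would type
by hand.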

What comes after Rocks?

In fact, Rocks comes with a lot of software. One package is the Sun Grid Engine (SGE)
(read more at http://gridengine.sunsource.net), which is one of the main components of
the cluster. SGE does a lot of things, such as:

• Policy based allocation of distributed resources (CPU time, software licenses, etc.)
• Batch queuing & scheduling
• Support diverse server hardware, OS and architectures
• Load balancing & remote job execution
• Detailed job accounting statistics
• Fine-grained user specifiable resources
• Suspend/resume/migrate jobs
• Tools for reporting Job/Host/Cluster status
• Job Arrays

• Integration & control of parallel jobs

Source: http://wiki.gridengine.info/wiki/index.php/Main_Page

Okay, now it is time to run jobs on our cluster. Make sure that you have an account
created for you so as to have access to the operating system, i.e. Rocks. Log in to your
account, and from there you can execute the following commands based on your needs.

Before you submit a job: do you know what a job is? A job is basically a shell script
(google shell script to learn how to write one), that is, a file that executes Linux
commands. A shell script can range from a very basic script to a very sophisticated one
which needs a lot of care and effort to write. So, let’s see an example of a shell script
that can be executed on our cluster. The script prints the date, sleeps for 20 seconds,
and prints the date again.

#!/bin/bash
#capture the date
date
#sleep for 20 seconds
sleep 20
#capture the date again
date

Save the script in a file called date.sh

The extension of a shell script is sh. You can run this script on your machine
using the command sh date.sh or ./date.sh. In the second case, make sure that you have
permission to execute the file: use ls -l to check and chmod to change the permissions.
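
The difference between the two ways of running a script can be tried out safely in a
temporary directory. This is only a sketch: the file name hello.sh and the temporary
directory are chosen for this example (a shorter script than date.sh, so you do not have
to wait for the sleep).

```shell
#!/bin/bash
# Sketch: save a small script, make it executable, and run it both ways.
# The temporary directory and the name hello.sh are only for illustration.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

cat > hello.sh <<'EOF'
#!/bin/bash
echo "hello from the script"
EOF

sh hello.sh          # works even without execute permission
ls -l hello.sh       # check the permission column of the file
chmod +x hello.sh    # grant execute permission
ls -l hello.sh       # the permission column now contains x
./hello.sh           # works because the file is executable
```

Running ./hello.sh before the chmod line would normally fail with a "Permission denied"
message, which is exactly the case the ls -l and chmod check is for.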

This way just runs the script on the master machine. But if the script has 1,000 lines,
runs some programmes, and does a lot of computation, it will definitely take a lot
of time. So, we need to send the job to the other machines (compute nodes) in our cluster
so as to get the output quickly. How do we do that?

Sun Grid Engine is the solution. You first need to log in to the system, either remotely
or locally. Only then can you submit a job (do not forget: a job is a shell script). To
submit the file that contains the shell script, do the following.

#qsub filename.sh
qsub stands for queue submit which means submit the job to the queue.
filename.sh is your shell script file.
After the submission of the job, a message will appear to indicate that your job was
submitted successfully. The message looks like “Your job 66 (filename.sh) has been
submitted”. The number 66 is the id of your job and filename.sh is the name of the
script you submitted.
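
You will need the job id later, so it can be handy to pull it out of that message. The
following is a minimal sketch that assumes the message has the form shown above;
job_id_from_message is a made-up helper for this example, not an SGE command.

```shell
#!/bin/bash
# Sketch: extract the job id from the submission message.
# job_id_from_message is an illustrative helper, not part of SGE.
job_id_from_message() {
    # Print the run of digits that follows "Your job ".
    echo "$1" | sed -n 's/^Your job \([0-9][0-9]*\).*/\1/p'
}

# Sample message in the form shown above.
message='Your job 66 (filename.sh) has been submitted'
job_id="$(job_id_from_message "$message")"
echo "job id: $job_id"
```

On the cluster you would capture the real message instead of the sample one, for
example with message="$(qsub filename.sh)".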

After the completion of the job, two files will be created, usually in your home directory.
The files are filename.sh.eX and filename.sh.oX

The names look weird, but in fact they are easy once you know what they mean.

filename.sh.eX: filename.sh is the name of the shell script you submitted, e stands for
error, X is the jobid. Example: date.sh.e66

filename.sh.oX: filename.sh is the name of the shell script you submitted, o stands for
output, X is the jobid. Example: date.sh.o66
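
The naming pattern can be written down as a tiny function. This is only a sketch that
follows the pattern described above; job_files is a made-up helper for this example, not
an SGE command.

```shell
#!/bin/bash
# Sketch: build the two file names from a script name and a job id.
# job_files is an illustrative helper, not part of SGE.
job_files() {
    script="$1"
    jobid="$2"
    echo "$script.o$jobid"   # file with the output of the job
    echo "$script.e$jobid"   # file with the error messages of the job
}

job_files date.sh 66
```

For the example job this prints date.sh.o66 and date.sh.e66, which you can then read
with cat.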

After you submit the job, you will want to see what is happening to it.

#qstat -f
qstat stands for queue status
-f is an argument which requests a full-format display of information

The output of this command looks like a table. It has the id of the job, the name of the
job, the user who submitted the job, the state of the job, the queue and some more.

The states of the jobs are represented by letters, usually their initials.

• r: running
• qw: queued waiting
• s: suspended
• E: Error
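
If you keep forgetting the letters, the list above can be turned into a small lookup.
This is only a sketch: describe_state is a made-up helper for this example, and it knows
only the four states listed above.

```shell
#!/bin/bash
# Sketch: translate the state letters listed above into words.
# describe_state is an illustrative helper, not part of SGE.
describe_state() {
    case "$1" in
        r)  echo "running" ;;
        qw) echo "queued waiting" ;;
        s)  echo "suspended" ;;
        E)  echo "error" ;;
        *)  echo "unknown state: $1" ;;
    esac
}

for state in r qw s E; do
    echo "$state: $(describe_state "$state")"
done
```

Any other value falls through to the "unknown state" branch, so the function never
guesses at a state it was not told about.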

To delete a job from the queue

#qdel jobid

qdel stands for queue delete, and the jobid is numeric. Make sure you are the owner of
that job; otherwise you will not be able to delete it, simply because it is not yours.
Only the root user can delete other people’s jobs and do whatever he/she wants with them.
