
Introduction to Calcul Québec

Daniel Stubbs
June 17, 2014

Summary

1. Brief History of Canadian HPC

2. Calcul Québec: Organization, Machines and Staff

3. Obtaining an Account

4. Access, File Transfer and Storage

5. Modules

6. Job Submission

7. Job Monitoring

8. ANSYS

9. MATLAB

10. Getting Help

History of Canadian HPC



The first resources for scientific computing in Canada were
strictly at the university level, mainly the principal research
universities.

By the late 1990s this model was starting to reach its limits:
supercomputers were becoming more expensive, the software
more complex to use, and the research community more widespread.

The Canada Foundation for Innovation (CFI) decided to start
funding inter-university consortia across Canada, with dedicated
staff specialized in HPC and a range of resources available to
researchers at many institutions.

This led to the creation of seven consortia across the
country.

History of Canadian HPC, cont.



From west to east: WestGrid, SHARCNET, SciNet (U of T),
HPCVL, RQCHP, CLUMEQ and ACEnet.

The two Quebec-based consortia were the RQCHP (UdeM,
UdeS, Concordia, Bishop's) and CLUMEQ (McGill, Laval, the
Université du Québec network).

The federal and provincial governments have continued to
push the consortia to work together more closely, leading in
2011 to the fusion of RQCHP and CLUMEQ to form Calcul
Québec.

CFI has also funded a national umbrella organization,
Compute Canada, to promote cooperation and national
standards among the consortia.


History of Canadian HPC, cont.



The future is likely to include more "consolidation", with the
fusion of Ontario's three consortia into Compute Ontario
and a reduction in the number of machine rooms across the
country.

The density and size of future supercomputers mean it is
more logical to put them in sites where electricity is cheap
and cold water plentiful, e.g. Quebec.

With reliable and high bandwidth networking provided by
CANARIE and its provincial partners, this HPC infrastructure
should be transparent for Canadian researchers.

Calcul Québec

Formed out of the merger of RQCHP and CLUMEQ, Calcul
Québec now covers all of the province's universities.

Hardware is installed at the Université de Montréal, the
Université de Sherbrooke, Université Laval, Concordia and
the École de technologie supérieure (McGill).

Calcul Québec has staff at the UdeM, UdeS, Laval and McGill,
with the UdeM and McGill employees also supporting users
at Concordia and UQÀM.

Any researcher at any Quebec-based university has the right
to a free account with Calcul Québec.

Calcul Québec, cont.



Your account gives you access to Calcul Québec servers and
a default allocation of 30 core-years on machines with 2000+
cores and 15 core-years on the rest.

Researchers who require more CPU time or more storage
can submit an annual allocation request which will be judged
on its scientific merits.

If you do exhaust your allocation you can continue to submit
and run jobs but their priority will be lower.

Calcul Québec resources are of course intended for strictly
academic use.

Accounts should not be shared.

Calcul Québec, cont.



Calcul Québec isn't a legal entity, so all of the hardware,
operating costs and staff salaries are the legal responsibility of
the member universities.

The organization is managed by an executive committee
representing researchers from each member institution.

Funding is approximately 40% federal, 40% provincial and 20%
vendor donations.

Calcul Québec has 33 employees (28 of them technical staff)
and roughly 1100 active users, divided among 350 groups.


Research Fields at Calcul Québec

[Charts: research groups per field and resource usage per field]

Obtaining an Account

Getting an account on a Calcul Québec machine like Briarée
is a little more complicated than on Cirrus.

We don't personally know the faculty at Concordia and the
granting agencies that fund us require us to gather a variety
of statistics on our users.

The first step is for the faculty member or PI to register with
the Compute Canada database.

There is an acceptable use agreement and a small Web-based
form to fill out, which asks for a few personal details along
with some information about your research.


Obtaining an Account, cont.



Within a day or so, the professor should receive an e-mail
confirming their registration with Compute Canada and
containing their CCRI (Compute Canada Role Identifier).

This CCRI, a short combination of letters and numbers, can
then be given to students or post-docs whom the professor
wishes to sponsor.

These sponsored accounts follow much the same process to
register with Compute Canada, except that they use the
CCRI.


Obtaining an Account, cont.



Once you've registered with Compute Canada, you can then
proceed to have a machine account opened.

You first log in to the Compute Canada website with your
username and password and ask for a Calcul Québec account
to be opened.

Once you receive an e-mail informing you that the account
has been created, you can go to the Calcul Québec portal
and have an account opened on a particular machine.


Obtaining an Account, cont.



Within 48 hours you should receive an e-mail with the details
of your username and password for the machine(s) that you
have chosen.

The characteristics of the various machines are listed on the
Calcul Québec website, but for Concordia users the most
attractive systems are probably Guillimin and Briarée.

Both machines are quite large (roughly 8000 CPU cores each)
and recent (purchased in 2011); they are also located in
Montreal, which minimizes latency for Concordia users.

Finally, Guillimin has a MATLAB license, while Briarée has
been configured to let Concordia users run ANSYS.


Cluster Access

The only way to connect to CQ clusters is by ssh (Secure
Shell).

Using ssh, your password is encrypted before being sent over
the network to the server.

To connect, you need to know the host name (the machine,
e.g. briaree.calculquebec.ca), your username and finally your
password.

The first time that you connect to a machine, ssh asks if you
would like to store the server's key; typically you would
answer yes.


Cluster Access, cont.



If you use OS X or Linux, you already have ssh installed; you
just need to open a terminal and type

ssh username@machine_name

Windows doesn't come with a default ssh client, but it's easy
to download a free one.

One of the most commonly used is PuTTY; you can save the
binary on your desktop, and when you double-click the icon
you'll see the PuTTY configuration window.


Cluster Access, cont.



Another possibility for Windows users is Cygwin, a free Unix
emulator.

Finally, you can also install a virtual machine, which allows you
to run Linux inside a Windows or OS X workstation.

If you want to use graphical applications on a CQ server, you
will need to establish an X11 connection and have an X11
server installed on your workstation.

For the ssh connection you need to add the option -X or -Y.
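For example, on OS X or Linux a trusted X11-forwarded connection can be opened like this (host name as given earlier; xclock is simply a quick test of the forwarding):

```shell
# Open an ssh session with trusted X11 forwarding (-Y); the -X
# variant applies stricter security controls but can break some
# applications. A local X11 server (e.g. XQuartz on OS X) must
# be running first.
ssh -Y username@briaree.calculquebec.ca

# Once logged in, a quick test: xclock should pop up a small
# clock window on your local display if forwarding works.
xclock
```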


File Transfer and Storage



When you connect to a CQ machine, you always start in
your home directory, which is typically /home/username in
Linux but on Briarée will be /RQusagers/username.

During your connection you'll also see the machine's Message
of the Day, which contains useful information about the state
of the system, future maintenance work and so on.

The first time that you connect your home directory is of
course almost empty.

Your home directory will typically contain source code, job
submission scripts and parameter or input files, that is small
text-based files for the most part.


File Transfer and Storage, cont.



The home directories on Briarée are backed up once a day.

Larger files and any intense I/O activity should be done using
$SCRATCH, which on Briarée is /RQexec/username.

No backup is performed on this partition but you can use
much more space.

Even in $SCRATCH, however, you should try to restrict your
usage to under a terabyte, compress files using gzip, and
after a certain period of time either delete files or transfer
them to a more secure location for archiving.
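As a concrete illustration of the compression step (the file name and contents here are made up):

```shell
# Create a stand-in for a large text-based results file.
printf 'step 1 energy -12.5\nstep 2 energy -12.7\n' > out.dat

# Compress it in place: gzip replaces out.dat with out.dat.gz,
# and text data typically shrinks by a large factor.
gzip out.dat

# Inspect the compressed and uncompressed sizes without extracting.
gzip -l out.dat.gz

# Restore the original file when it is needed again.
gunzip out.dat.gz
```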


File Transfer and Storage, cont.



Finally, the compute nodes on Briarée have their own local
disks, which are even faster than $SCRATCH.

This space is accessible as $LSCRATCH and is limited to
around 180 GB.

Naturally it isn't backed up and is accessible only during the
lifetime of your job, so you will need to copy important data
back to $SCRATCH at the job's end.

The filesystems on Briarée are designed to favour a few very
large files rather than tens of thousands of small files, so
whenever possible you should avoid the latter usage pattern.


File Transfer and Storage, cont.



To exchange files between Calcul Québec's servers and your
workstation you should use scp and sftp.

These two programs belong to the same family as ssh and
encrypt your password.

The command scp works like the Unix command cp (copy):

scp username@machine:research/out.dat result.dat

As for sftp, you use it like ftp: you can use cd to move
around, and put or get to transfer files.
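A short sftp session might look like this (directory and file names are illustrative); the commands shown after the prompt are typed interactively:

```shell
# Open an sftp session on the cluster.
sftp username@briaree.calculquebec.ca

# At the sftp> prompt:
#   cd research        change directory on the server
#   lcd ~/results      change directory on your workstation
#   get out.dat        download a file from the server
#   put input.dat      upload a file to the server
#   bye                end the session
```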

For Windows there are also programs like WinSCP which are
very similar to Windows Explorer (graphical interface etc.).


File Transfer and Storage, cont.



Since it's unlikely that your workstation has an ssh server
installed and running, you should always start your file
transfers from your workstation.

If you are planning to transfer large amounts of data to one
of the Calcul Québec servers, say 25 GB or more, we would
prefer that you discuss it first with the local support team to
ensure that there's enough space and that you store the data
in the appropriate location.

Remember to use dos2unix to convert your text files to
the right end-of-line characters for Linux.
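If dos2unix is not available on your workstation, stripping the carriage returns with tr achieves the same conversion; a self-contained illustration (file names are made up):

```shell
# Create a file with Windows-style (CRLF) line endings.
printf 'line one\r\nline two\r\n' > params.txt

# Remove the carriage returns -- this is what dos2unix does.
tr -d '\r' < params.txt > params_unix.txt

# Count the carriage returns remaining in the converted file;
# this prints 0 if the conversion succeeded.
tr -cd '\r' < params_unix.txt | wc -c
```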



Modules

In general we would prefer that you ask a CQ analyst to
install the software that you need.

You can then use the command module, which will modify
the necessary environment variables so that you can use the
program in question.

The most common options are:

module list
module avail
module load module_name
module unload module_name
module purge
module swap old_module new_module
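A typical sequence might look like this; the module names and versions below are illustrative (module avail shows what is actually installed on a given machine):

```shell
# Show what is currently loaded in this shell.
module list

# See everything that is available on this machine.
module avail

# Load a compiler and an MPI library (names illustrative).
module load intel-compilers/12.0
module load openmpi/1.6

# Exchange one loaded module for another version.
module swap openmpi/1.6 openmpi/1.8

# Unload everything and start from a clean environment.
module purge
```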

Modules, cont.

With the module command, you can choose a particular
version of a program.

You can automatically load modules by adding the module
load line to the end of the .bashrc file in your $HOME.

There may be dependencies for the modules you load, so
that you will need to first do module load A and then
module load B.
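For example, to have your usual toolchain loaded at every login, append lines like these to the .bashrc in your $HOME (module names are illustrative; load dependencies first):

```shell
# ~/.bashrc additions -- executed for each new shell on the cluster.
# Load module A before module B when B depends on A.
module load intel-compilers/12.0
module load openmpi/1.6
```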


Job Submission

The machine that you connect to when you use ssh is called
the head node.

It functions as the gateway to the cluster and as such is
shared by everyone who connects to this machine.

You should not, therefore, use it for any significant
computations; beyond compiling your software, the real
work should take place on the cluster's compute nodes.

You first use a text editor to create a small "job script" that
specifies what resources the job needs (e.g. number of CPUs,
amount of memory, runtime) as well as what actions need to
be carried out, step by step.


Job Submission, cont.



Once you've created this file, you can submit the job by
means of the command qsub script.pbs.

To see the status of jobs on the cluster you use the
command qstat.
If you want to delete a job you can use

qdel jobid

You can use the command pbs_free to see how many
processors and nodes are free.
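Putting these commands together, a typical submit-and-monitor cycle looks like this (the job ID is illustrative; qsub prints the real one when you submit):

```shell
# Submit the job script; qsub prints the new job's identifier.
qsub script.pbs

# List your own jobs; the state column shows Q (queued) or R (running).
qstat -u $USER

# Cancel a job, whether it is still queued or already running.
qdel 123456

# Check how many processors and nodes are currently free.
pbs_free
```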

There are some constraints for jobs running on the CQ
machines at the UdeM, for example a job's runtime must be
less than 168 hours.


Job Submission, cont.



If your job asks for more than a certain number of CPU
cores (equivalent to four nodes), you'll need to demonstrate
to one of the analysts that it will use them efficiently.

There's also a limit on the number of jobs belonging to a
given user and group which can be run simultaneously.

You can however submit as many jobs as you want; those
which can't be run immediately will simply wait in the queue.

Note that unlike Cirrus, you don't choose a queue: one is
assigned to your job based on the resources it requests.

Briarée is also a very active machine, with dozens of users
and hundreds of jobs.


Job Submission, cont.



#!/bin/bash
#PBS -l walltime=52:00:00
#PBS -l nodes=2:ppn=12
#PBS -l mem=8gb
#PBS -j oe
#PBS -r n
#PBS -o output.txt
module load software/2.3
cd research
mpiexec -n 24 ./my_code parameter1 > output.dat


Job Submission, cont.



All of Briarée's nodes have the same number of CPU cores,
twelve, and the same kind of CPU, the Intel Xeon 5650 with a
frequency of 2.667 GHz.

The amount of memory varies from one node to another,
however: there are nodes with 24 GB, 48 GB and 96 GB.

You can select a node with a particular amount of memory
by adding the switch:

#PBS -l nodes=1:m48G:ppn=12

which in this case will ask for a single node with 48 GB.

There are significantly fewer 48 and 96 GB nodes on Briarée,
so you should only request them when you need them.


Job Monitoring

With qstat you can verify that your job is running (its state
is listed as "R") but that's about it.

If you have redirected its standard output to a file (i.e. you
have written ./my_code > result.txt), you can also look
there to see what progress the job has made.

Another technique is to go directly to the node(s) where the
job is running.

With qstat -n job_id you can learn what node(s) your job
has been assigned and connect to them by ssh.

Once you're on the node, there are several commands for
observing what's happening.
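Put together, the inspection steps look like this (the job ID and node name are illustrative):

```shell
# Find out which node(s) the job has been assigned.
qstat -n 123456

# Connect to one of those nodes (permitted while your job runs there).
ssh compute-node-name

# Interactively watch CPU and memory usage per process.
top

# Show total, used and free memory on the node, in megabytes.
free -m
```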


Job Monitoring, cont.



The most common is top, which shows the processes which
are consuming the most CPU time.

This tool also shows the process's memory consumption.

To monitor the memory usage there is also the command
free.

If you would prefer something graphical, you can also use
Calcul Québec's Ganglia site.

It allows you to see the state of each node of the two UdeM
clusters (Cottos and Briarée).


Job Monitoring, cont.



Ideally, your Unix processes should use less than 95% of the
system memory and around 99% of the CPU.

If the CPU percentage is less than 99%, this may mean that
the process is spending too much time in system calls like I/O
operations.


ANSYS/Fluent

Commercial software widely used in engineering, here and at
other universities.

A couple of small research teams from École Polytechnique
have been using ANSYS on the cluster at the UdeM for two
or three years now.

The ANSYS license is flexible enough that you can use your
license to run ANSYS on machines that are located at other
institutions.

To use ANSYS on Briarée, you first need to ask us to add
your name to the ANSYS/Concordia group by sending us an
e-mail at briaree@calculquebec.ca


ANSYS/Fluent, cont.

Once we've confirmed your status at Concordia, we will add
your name and you can run ANSYS jobs on Briarée using the
Concordia licenses.

Note that Concordia has a shared license pool for ANSYS,
covering both instances of ANSYS run on your workstations
and Briarée jobs.

The job scheduler on Briarée knows nothing about whether
there are enough licenses available; it will run the job when
the CPUs become available, and if the necessary licenses can't
be checked out your job will crash immediately.


ANSYS/Fluent, cont.

#!/bin/bash
#PBS -o output.txt
#PBS -j oe
#PBS -l nodes=2:ppn=12
#PBS -l walltime=12:00:00
module load ANSYS_CONC/v145
cd my/research/directory
fluent 3ddp -t24 -ssh -pib -mpi=pcmpi -g -i journal.jou > output1.dat


ANSYS/Fluent, cont.

In addition to the need to monitor ANSYS license usage,
another difference with Cirrus concerns interactive jobs.

On Cirrus you could use interactive jobs to run ANSYS and
its components using a graphical interface, much like you
would on a Windows workstation.

This practice is very strongly discouraged on Briarée; it is
possible to submit an interactive job using the -I option to
qsub, but this is normally just for short (less than one hour)
tests and debugging.

When Briarée is busy, as it often is, an interactive job that
asks for significant resources may wait for hours or even days
to start.

ANSYS/Fluent, cont.

What this means of course is that your interactive job could
well start at 3:40 AM on a Sunday or 8:30 PM on a Friday, so
that the resources sit there idle waiting for your input using
keyboard and mouse.

For this reason, you will need to become accustomed to
running ANSYS with a text file controlling its behavior, so
that it can run as a regular "batch mode" job on Briarée.

This may require some adjustment to your workflow, but it's
important for ensuring that Briarée's resources are used
efficiently and shared fairly among the many different
individuals on the system.


MATLAB

Various options exist on Calcul Québec machines for using
this software, whose license is much more restrictive than
ANSYS's.

The open-source MATLAB clone Octave is installed on all
Calcul Québec machines, and it duplicates a great deal of the
functionality of the MATLAB kernel.

It costs nothing to try your MATLAB script with Octave,
which also has a very wide variety of packages for
specialized calculations, similar to MATLAB's Toolboxes.

However, it's certainly possible your MATLAB script won't
run in Octave.


MATLAB, cont.

A further option is to use the copy of MATLAB installed on
your workstation here at Concordia to compile your script
into a standalone binary executable.

This binary file can then be copied to another computer
running the same operating system and executed using only
the freely available MATLAB runtime library.

This runtime library is installed on Briarée as a module, but
you will need to remember to do that MATLAB compilation
under Linux, not Windows, on your workstation.
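A minimal sketch of that compilation step, assuming your script is called my_analysis.m and the MATLAB Compiler is licensed on your Linux workstation; the module name and $MCR_ROOT path below are assumptions, so check the actual values on Briarée:

```shell
# On your Linux workstation: compile the script into a standalone
# executable. This produces my_analysis plus a wrapper script
# run_my_analysis.sh.
mcc -m my_analysis.m

# On Briaree: load the MATLAB runtime module (name assumed) and
# run the binary through the wrapper, passing the runtime's
# installation directory as the first argument ($MCR_ROOT assumed).
module load MCR
./run_my_analysis.sh $MCR_ROOT input_file.dat
```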

Once again though, this solution isn't ideal for some users.


MATLAB, cont.

MATLAB is installed on the Guillimin cluster managed by
McGill but with a restricted license.

Only McGill users can run MATLAB in serial mode.

Outside users can run MATLAB on Guillimin using the
Distributed Computing Toolbox (DCT), assuming that they
have MATLAB installed on their workstation.

Using the DCT may require some changes to your MATLAB
script(s) and not every algorithm lends itself to running in
parallel.

A final alternative...


MATLAB, cont.

Rewrite your script, or at least the most compute-intensive
part of it, using open tools like C or Fortran in conjunction
with libraries such as BLAS/LAPACK, FFTW, GSL and so on.

This may seem like a lot of work but much depends on the
particular MATLAB script.

The staff at Calcul Québec are also here to help, first in making
a rough estimate of how much work it might be to convert a
script to C or Fortran and then in aiding the conversion.

After an initial time investment, you'll have software that can
run anywhere, on any number of processors, without paying a
cent to a foreign corporation.


Getting Help

The first place to look is the Calcul Québec wiki, which you
can find at

https://wiki.calculquebec.ca

It has extensive documentation in English and French on
many different topics.

If however you can't find the answer there, you can write an
e-mail to

briaree@calculquebec.ca

assuming the problem is related to using Briarée.

You'll normally get a response within a few hours from one of
the analysts or sysadmins at the UdeM.


Getting Help, cont.



This is also the address to use when asking for software to
be installed or to have your name added to the list of users
with access to ANSYS.

Many problems can be solved just by e-mail but when it's
necessary we are happy to visit you at your office to try and
find a solution.

We can also help with more elaborate projects, like learning
to program, optimizing a code or deciding how to parallelize
it.


