
HPC, Grids, Clouds: A Distributed System from Top to Bottom

Kamlesh Jain, Priyank Shah, Prerna Shraff


B534-Distributed Systems, Indiana University, Bloomington.

Affiliations: Pervasive Technology Institute, SALSA HPC, School of Informatics and Computing

1. Introduction
The project can be summarized in the following four steps:
1. Implementing the sequential PageRank algorithm, followed by the parallel PageRank algorithm using the MPI programming interface.
2. Running the parallel PageRank code in two environments, viz. Bare-Metal and Eucalyptus, and collecting performance information (i.e. timing data).
3. Building a resource monitoring system that monitors and visualizes resource utilization across a distributed set of nodes using message broker middleware.
4. Writing PBS job scripts that set up an automated system for dynamic provisioning among clusters and report CPU and memory utilization while running MPI PageRank.

2. Architecture and implementation


2.1 PageRank Algorithm
PageRank is a link analysis algorithm used by the Google Internet search engine that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references. The numerical weight assigned to a given element E is referred to as the PageRank of E and denoted PR(E).
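
For reference, the standard probabilistic formulation of the PageRank update (the variant in which the values sum to 1, consistent with the "random surfer" probabilities reported in our results) can be written as follows, where d is the damping factor (0.85 in our experiments), N the total number of pages, M(p_i) the set of pages linking to p_i, and L(p_j) the out-degree of p_j:

```latex
PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}
```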

Figure: PageRank

2.1.1 Sequential PageRank


In the World Wide Web, link-based analysis of web graphs has been extensively explored. PageRank computation is one widely known approach and forms the basis of Google search. PageRank assigns a global importance score to a web page based on the importance of the other pages pointing to it. Search results are generally presented as a list and are often called hits; PageRank is a web-graph ranking algorithm that helps the Internet user sort the hits by their importance. PageRank is an iterative algorithm applied to a massively connected graph of several hundred million nodes and hyperlinks. It calculates a numerical value for each element of a hyperlinked set of web pages, reflecting the probability that a random surfer will access that page. The process can be understood as a Markov chain, which requires iterative calculation to converge. One iteration of PageRank calculates the new access probability for each web page based on the values calculated in the previous iteration; the iteration stops once the number of iterations exceeds the predefined maximum.

The main() function in class pagerank accepts the input arguments on the command line and parses them for errors, then opens and reads the input file to obtain the input data. It then calls the function find_pagerank(), which computes the PageRank values and provides them to the function sort_pagerank() to find the top 10 PageRank values.

Technology used: the source code for the sequential PageRank algorithm is written in core Java.
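
As an illustration of the iterative update described above, here is a minimal C sketch on a tiny hypothetical four-page graph (the project's sequential code is in Java; the graph, function names, and fixed-iteration loop here are illustrative only):

```c
#define N 4          /* pages in this tiny, hypothetical example graph */
#define D 0.85       /* damping factor, as used in our experiments */

/* adj[i][j] = 1 when page i links to page j (made-up link structure) */
static const int adj[N][N] = {
    {0, 1, 1, 0},
    {0, 0, 1, 0},
    {1, 0, 0, 1},
    {1, 0, 0, 0},
};

/* One iteration: the new rank of page j is (1-d)/N plus the damped
 * sum of rank(i)/outdegree(i) over all pages i that link to j. */
static void pagerank_iter(const double *pr, double *next) {
    for (int j = 0; j < N; j++)
        next[j] = (1.0 - D) / N;
    for (int i = 0; i < N; i++) {
        int out = 0;
        for (int j = 0; j < N; j++)
            out += adj[i][j];
        for (int j = 0; j < N; j++)
            if (adj[i][j])
                next[j] += D * pr[i] / out;
    }
}

/* Start from a uniform distribution and run max_iter iterations,
 * mirroring the fixed-iteration loop of the sequential version. */
void find_pagerank(double *pr, int max_iter) {
    for (int i = 0; i < N; i++)
        pr[i] = 1.0 / N;
    for (int it = 0; it < max_iter; it++) {
        double next[N];
        pagerank_iter(pr, next);
        for (int j = 0; j < N; j++)
            pr[j] = next[j];
    }
}
```

Because every page in this example graph has at least one outgoing link, each iteration preserves the total probability mass, so the ranks always sum to 1.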

Flowchart: Sequential PageRank

2.1.2 Parallel PageRank


Developing parallel PageRank is an active research area in both industry and academia, and numerous algorithms have been proposed. The key idea is to partition the PageRank problem into N sub-problems so that N processes solve the sub-problems concurrently. One simple partitioning approach is the vertex-centric approach: the graph is divided into groups of vertices, and each group is processed by one process. We take this approach for our MPI PageRank implementation. This part was developed in the MPI environment; the implementation comprises the C source files mpi_main.c, mpi_pagerank.c and mpi_io.c. Execution starts in mpi_main.c, which reads the input file using mpi_io.c and then computes the PageRank values in parallel using mpi_pagerank.c. The computed PageRank values are sorted by the function pagerank_sort() in mpi_main.c and stored in the output file by a function in mpi_io.c.

Technology used: the source code for all three files is written in C.
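
The vertex-centric split can be sketched in plain C (no MPI calls; partition() is a hypothetical helper showing which contiguous block of URLs each rank would own, not the actual code of mpi_pagerank.c):

```c
/* Vertex-centric partitioning: num_urls vertices are split into nproc
 * contiguous chunks, one per MPI rank; the first (num_urls % nproc)
 * ranks each take one extra vertex so the load stays balanced. */
void partition(int num_urls, int nproc, int rank, int *start, int *count) {
    int base = num_urls / nproc;   /* minimum chunk size per rank   */
    int rem  = num_urls % nproc;   /* leftover vertices to spread   */
    *count = base + (rank < rem ? 1 : 0);
    *start = rank * base + (rank < rem ? rank : rem);
}
```

In the real MPI program each rank would update only its own chunk per iteration and then exchange the new rank values with the other processes (e.g. via a collective operation) before the next iteration.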

Flowchart: Parallel PageRank

2.2 PageRank Performance Analysis on Academic Cloud


The project involved executing the parallel PageRank code developed previously in two environments, viz. Bare-Metal and Eucalyptus, and collecting performance information (i.e. timing data). The collected information was then analyzed to understand the distinguishing characteristics of distributed computation. The Eucalyptus environment uses Ethernet for communication between VMs on different nodes, which induces a high amount of delay. So, in order to gain an in-depth understanding of the communication overhead, we divide the execution here into two parts:
- Uniprocessor VMs: all the nodes in use have a single processor.
- Multi-processor VMs: nodes have multiple processors, and the required processors are located within one or two nodes.

2.3 Resource Monitoring system


Distributed resource monitoring is an important part of distributed systems. In large-scale computing environments, it is essential to understand system behavior and resource utilization in order to manage resources efficiently, detect failures, and optimize distributed application performance. We have implemented a system that monitors the CPU and memory utilization of a remote, distributed set of compute nodes. The monitored information is collected at the monitoring node of the system using a message broker; finally it is aggregated, and the average utilization values are presented as dynamic graphs in a GUI running on the monitoring machine.

System Daemon: The System Daemon runs in the background, uses the Sigar API to retrieve system resource utilization information, and implements a publisher client of a message broker system to send this information to the consumer client. The application first uses an object of class Calendar to get the current system time and waits until the start of a new second to send the first message. Once the program has started successfully, a call to the user-defined function daemonize() detaches the program from standard output and standard error. Then, inside a loop that runs at one-second intervals, the getCpuPerc() and getMem() functions of class Sigar capture the current CPU and memory utilization. Inside the same loop, a Narada Brokering publisher client, set up using the classes ClientService, EventProducer and NBEvent provided by the API, packs this data along with the machine IP (provided by the shell script) into a message and sends it.

System Monitor: The System Monitor application can be thought of as divided into two parts. The first part includes a Narada Brokering consumer client along with a linked-list data structure named MessageList that stores the list of IPs being monitored. The consumer client is set up using the classes ClientService, EventConsumer and Profile provided by the broker API. The function onEvent(), called whenever a new message is received, looks for the originator IP address inside the message; if this IP matches an existing entry in the linked list, the utilization values for that node are updated, otherwise a new entry is created for that node and its utilization values are stored. The second part averages the received data and plots graphs using the JFreeChart API.
In a loop, the average CPU and memory utilization is calculated, and these values are pushed into the chart using the updatechart() method of an object of the user-defined class GUI, which extends the class ApplicationFrame. Next comes the logic for dynamic removal of disconnected nodes, which after a timeout detects the absence of messages from a registered IP address and deletes it from the list of IPs being monitored.

Features:
- Synchronization of the daemon process running on multiple compute nodes.
- Simple interface: shell scripts to start, terminate and display the status of the daemon.
- Dynamic detection of the compute nodes being monitored.
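
The monitor's update-or-insert bookkeeping for per-node utilization can be sketched in C (the actual project code is Java using NaradaBrokering; the Node struct, record() and avg_cpu() here are hypothetical names for illustration):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One entry per monitored node, keyed by its IP address. */
typedef struct Node {
    char   ip[16];
    double cpu, mem;        /* latest utilization reported by that node */
    struct Node *next;
} Node;

/* Update-or-insert, mirroring the monitor's onEvent() logic:
 * known IPs get their values refreshed, unknown IPs get a new entry. */
Node *record(Node *head, const char *ip, double cpu, double mem) {
    for (Node *n = head; n; n = n->next)
        if (strcmp(n->ip, ip) == 0) {
            n->cpu = cpu;
            n->mem = mem;
            return head;
        }
    Node *n = malloc(sizeof *n);
    snprintf(n->ip, sizeof n->ip, "%s", ip);
    n->cpu = cpu;
    n->mem = mem;
    n->next = head;
    return n;
}

/* Average CPU utilization over all currently registered nodes,
 * as plotted by the GUI. */
double avg_cpu(const Node *head) {
    double sum = 0.0;
    int    k   = 0;
    for (const Node *n = head; n; n = n->next) {
        sum += n->cpu;
        k++;
    }
    return k ? sum / k : 0.0;
}
```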

2.4 Dynamic switch/provision clusters on Academic cloud


We have used the dynamic provisioning system provided by the FutureGrid system administration team. This system can switch between Bare-Metal and virtual machine compute cluster environments using xCAT, Moab and the Torque job scheduler. On top of it we have implemented a system whose goal is to use the FutureGrid dynamic infrastructure to switch between various clusters on the Academic Cloud. The following diagram shows the interactions among the various components of the system:

Figure: User interactions with the Dynamic Provisioning system

Bare-Metal: As part of dynamic provisioning on the Academic Cloud, we have implemented PBS job scripts for running the MPI PageRank and monitoring system applications developed in the previous projects. Because we use batch jobs to request resources, a requested job waits in the queue and executes when the desired resources become available. Apart from the PBS script start_bare_metal_script, we have implemented other shell scripts for running the monitor application as a daemon and executing the PageRank program. The flow of execution in this environment is as follows:

Flowchart: Bare-Metal Provisioning

Virtual Machine Environment: As in the previous environment, here too we use a PBS job script, named start_vm_script, to set up the environment and accomplish the desired task. The tasks are again accomplished using a number of shell scripts that we have implemented, viz. start_SystemDaemon, status_systemDaemon, stop_SystemDaemon and run_vm_mpi. The flow of execution in the virtual machine environment is as follows:

Flowchart: Virtual Machine Provisioning

3. Experiments
3.1 Input and output files

3.1.1.1 Sequential PageRank Results
INPUT FILE: pagerank.input.1000.4
NUMBER OF ITERATIONS: 100
DAMPING FACTOR: 0.85
TIME TAKEN: 1.123 SECONDS

URL | PAGERANK VALUE
4   | 0.1382042748741752
34  | 0.123027739664415
0   | 0.11257586324323682
20  | 0.0773730840780489
146 | 0.0571400845693035
2   | 0.04792632638502429
12  | 0.020066424192489902
14  | 0.017905732189139028
16  | 0.013028812374389034
6   | 0.012955807657490121

3.1.1.2 Parallel PageRank Results


**** MPI PAGERANK ****
Number of processes            = 10
Input file                     = pagerank.input.1000.4
Total number of URLs           = 1000
Number of iterations provided  = 10
Number of iterations completed = 5
Threshold                      = 0.0010000000

THE 10 HIGHEST RANKED URLS WITH THEIR VALUES
URL | PAGERANK VALUE
4   | 0.1340601806
34  | 0.1250754776
0   | 0.1119975127
20  | 0.0842259538
146 | 0.0668477042
2   | 0.0489776400
12  | 0.0219372778
14  | 0.0173611889
16  | 0.0127714565
66  | 0.0120657004

TIMING INFORMATION
Computation time = 0.01000 secs
I/O time         = 0.58000 secs

3.1.2 PageRank Performance Analysis on Academic Cloud


Bare-Metal:
Attribute                     | Value
Number of Worker Nodes        | 2
Number of Processes           | 8 (on each node)
Size of Dataset (No. of URLs) | 1000, 10000, 20000, 30000, 40000, 50000, 75000, 100000, 1000000, 2500000
Threshold                     | 0.0000000001
Number of Iterations          | 100

Figure: Bare Metal Speed-Up Chart

Eucalyptus: TYPE 1
Attribute                     | Value
Instance Class                | c1.medium
Number of Worker Nodes        | 8
Number of Processes           | 1 (on each node)
Size of Dataset (No. of URLs) | 1000, 10000, 20000, 30000, 40000, 50000, 75000, 100000, 1000000
Threshold                     | 0.0000000001
Number of Iterations          | 100

Figure: Eucalyptus Type-1 Speed-Up Chart

Eucalyptus: TYPE 2
Attribute                     | Value
Instance Class                | x1.large
Number of Worker Nodes        | 1
Number of Processes           | 8 (on each node)
Size of Dataset (No. of URLs) | 1000, 10000, 20000, 30000, 40000, 50000, 75000, 100000, 1000000, 2500000
Threshold                     | 0.0000000001
Number of Iterations          | 100

Figure: Eucalyptus Type-2 Speed-Up Chart

3.1.3 Resource Monitoring system

Figure: Snapshot of Monitoring System

3.2 Analysis of results


Comparison of results between Sequential PageRank and MPI PageRank:

Parameter            | Sequential PageRank | MPI PageRank
Number of Processes  | 1                   | 10
Number of Iterations | 10                  | 10
Damping Factor       | 0.85                | 0.85
Time taken           | 1.123 sec           | 0.01 sec (computation) + 0.58 sec (I/O)

Findings, description and explanation of the results obtained for PageRank performance analysis on the Academic Cloud:

A) For a fixed data size, the speed-up increases with the number of processes, provided the communication overhead component is not very large. This is the basic premise of parallel processing. Moreover, the results show that the PageRank implementation does not have a linear speed-up: as the number of processes increases, the speed-up grows gradually rather than linearly. In the Bare-Metal and Eucalyptus Type-2 environments the communication overhead is small compared to the computation gain, so the speed-up graph increases. But in the Eucalyptus Type-1 environment, which involves VMs with a single processor each, the results are the opposite: a negative speed-up, i.e. performance deteriorates as the number of processes increases. This is possibly due to the virtualization overhead of the large number of VMs involved, and to the communication medium (Ethernet) used by this setup.

B) For a fixed number of processes, the speed-up increases with the dataset size.

Figure: Speed-Up (Bare-Metal)

Figure: Speed-Up (Eucalyptus Type-1)

Figure: Speed-Up (Eucalyptus Type-2)

The figures above plot the speed-up of parallel execution against dataset size in all the environments where our code was executed. With increasing dataset size the speed-up gradually increases; understandably so, because at small sizes the performance gain from parallelism is hampered by the communication overhead. As the data size grows, the gain from the power of parallel computing starts to outweigh the loss incurred by the overhead. This phenomenon is seen in the Bare-Metal and Eucalyptus Type-2 environments but not in Eucalyptus Type-1, where the communication overhead is significantly higher in all the executions; hence the speed-up obtained there is negative (decreasing).

C) For a fixed dataset size, the efficiency decreases with the number of processes while the communication overhead increases.

Figure: Efficiency and Overhead for 2500K Urls (Bare-Metal)

Figure: Efficiency and Overhead for 100K Urls (Eucalyptus Type-1)

Figure: Efficiency and Overhead for 2500K Urls (Eucalyptus Type-2)

The three figures above show the efficiency and overhead measured for a fixed dataset in each execution environment. Efficiency, in the context of distributed systems, estimates how well the processes are utilized in solving the problem, whereas the overhead here is mainly the communication overhead incurred when using multiple processes, i.e. the time required for communication between processes. The results are as expected in a normal scenario, and nothing strange was noticed here: as the number of processes increases, the overhead increases because the amount of communication required increases, while the per-process efficiency decreases because the amount of computation per process decreases. An important observation, however, is that in the Eucalyptus Type-1 execution the overhead is significantly high: the overhead observed on Eucalyptus Type-1 for 100K URLs with 8 processes is nine times that observed on Eucalyptus Type-2 for 2500K URLs with the same number of processes. This supports the reasoning given earlier for the negative speed-up observed in the Eucalyptus Type-1 environment. Our understanding is that in the Eucalyptus Type-2 setting all the processes run on the same node, so the communication time is much lower than when they run on different nodes. Moreover, Eucalyptus VMs use Ethernet for communication, as against fiber in Bare-Metal.
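
The metrics used throughout this analysis can be stated compactly (a sketch; here T(1) is the sequential time and T(p) the parallel time on p processes, with overhead expressed as the extra aggregate process time beyond the sequential cost):

```c
/* Speed-up, efficiency and overhead as used in the analysis:
 *   speed-up   S(p) = T(1) / T(p)
 *   efficiency E(p) = S(p) / p
 *   overhead   O(p) = p * T(p) - T(1)
 */
double speedup(double t1, double tp)           { return t1 / tp; }
double efficiency(double t1, double tp, int p) { return t1 / (tp * p); }
double overhead(double t1, double tp, int p)   { return p * tp - t1; }
```

For example, a job taking 8 s sequentially and 2 s on 8 processes has speed-up 4, efficiency 0.5, and 8 s of aggregate overhead; an efficiency well below 1 with a large overhead is exactly the pattern seen in the Eucalyptus Type-1 runs.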

4. Conclusions
Comparison of Sequential and Parallel PageRank algorithms: On comparing the sequential and parallel PageRank algorithms, we find that the time required by parallel PageRank is very small compared to sequential PageRank. This shows the benefit of distributed systems and parallel programming.

Performance Analysis on Academic Cloud: The results obtained for PageRank performance analysis on the Academic Cloud have led to the conclusion that, except where the overhead is high, the parallel PageRank code executes smoothly and gives the expected performance. Moreover, the choice of communication technology among the nodes doing parallel computation is an important criterion and cannot be ignored.

Synchronization of the daemon process running on multiple compute nodes: Our system performs explicit synchronization of the utilization messages sent from the monitored remote nodes by ensuring that messages are sent at the first millisecond of every new second.

Virtualization cost: One interesting (and expected) finding from the experiments was that starting the virtual machines and making them reachable is very costly, in the sense that it requires a large amount of time compared to the Bare-Metal environment.

5. Acknowledgements:
We would like to acknowledge many people for their support and patience throughout the semester. Special thanks to Professor Judy Qiu and both the Associate Instructors, Ikhyun Park and Pairoj Rattadilok, for giving us the chance to take up such good work and always helping us out. We would also like to thank the SALSA HPC group, the FutureGrid group and, of course, all the classmates for the good discussions and inputs on the Google group.

6. References
[1] Markov chain: http://en.wikipedia.org/wiki/Markov_chain
[2] Adjacency matrix: http://en.wikipedia.org/wiki/Adjacency_matrix
[3] PageRank: http://en.wikipedia.org/wiki/PageRank
[4] NaradaBrokering: http://www.naradabrokering.org/
[5] Sigar resource monitoring API: http://sourceforge.net/projects/sigar/
[6] JFreeChart: http://www.jfree.org/jfreechart/
[7] TORQUE Resource Manager: http://www.clusterresources.com/products/torque-resourcemanager.php
[8] KVM hypervisor: http://www.linux-kvm.org/page/Main_Page
[9] libvirt, the virtualization API: http://www.libvirt.org/
[10] Torque qsub: http://www.clusterresources.com/torquedocs21/commands/qsub.shtml#I
[11] Torque job submission: http://www.clusterresources.com/torquedocs/2.1jobsubmission.shtml
[12] Google group inputs
