
COSC 6374

Parallel Computation
Project I

Team 07
Neelima Paramatmuni
Padmaja Pentela
Shyamali Balasubramaniyan
Shravanthi Denthumdas
Samira Hemrajani
Problem Description:
The objective of this project is to parallelize a sequential code so that it can be executed on N processes. The original source code performs image analysis on an image that is treated as a matrix of characters, where one character corresponds to one pixel of a black-and-white image.
The most computation-intensive part of the code is the set of FFT operations. These were parallelized using the MPI-based FFTW routines.
The major challenges of this task were:
• Understanding the sequential code and investigating parallelization approaches.
• Understanding the data distribution requirements of the FFTW routines.
• Synchronizing the tasks to enforce ordering if and when required.

We focused mainly on making our solution scalable, because we are required to test the parallelized code on larger images that cannot be handled on a single processor.

Parallelization Strategy:
The following sections describe the stages of our parallelization strategy.

Setting up the ‘parallel environment’ and use of MPI FFTW routines:


First, we initialize the parallel environment using MPI_Init(). Next, we create plans for the forward and inverse transforms, of type fftwnd_mpi_plan, using the function:
    fftwnd_mpi_plan fftw2d_mpi_create_plan(MPI_Comm comm,
                                           int nx, int ny,
                                           fftw_direction dir, int flags);
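
A minimal sketch of this setup is shown below (FFTW 2.x MPI interface). The dimensions NX and NY stand for the padded image size and are placeholders, and the plan names fftplan and ifftplan simply follow the rest of this report.

    #include <mpi.h>
    #include <fftw_mpi.h>

    int main(int argc, char **argv)
    {
        fftwnd_mpi_plan fftplan, ifftplan;
        int NX = 1024, NY = 1024;   /* padded image dimensions (illustrative) */

        MPI_Init(&argc, &argv);

        /* one plan for the forward transform, one for the inverse transform */
        fftplan  = fftw2d_mpi_create_plan(MPI_COMM_WORLD, NX, NY,
                                          FFTW_FORWARD,  FFTW_ESTIMATE);
        ifftplan = fftw2d_mpi_create_plan(MPI_COMM_WORLD, NX, NY,
                                          FFTW_BACKWARD, FFTW_ESTIMATE);

        /* ... image analysis steps described in the following sections ... */

        fftwnd_mpi_destroy_plan(fftplan);
        fftwnd_mpi_destroy_plan(ifftplan);
        MPI_Finalize();
        return 0;
    }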

Distributing the problem/data:


In order to obtain a load-balanced data distribution for the MPI FFTW routines, we use the following function:
    fftwnd_mpi_local_sizes(fftplan, &local_height, &local_start,
                           &local_ny_after_transpose,
                           &local_y_start_after_transpose,
                           &total_local_size);
This divides the image horizontally so that each process gets x rows and y columns, where x is smaller than the first dimension of the image and y is always equal to the second dimension of the image.
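
The sketch below illustrates how the values returned by this call could be used. The buffer name padimg and the use of plain malloc are assumptions; fftplan is the forward plan created above.

    int local_height, local_start;
    int local_ny_after_transpose, local_y_start_after_transpose;
    int total_local_size;

    fftwnd_mpi_local_sizes(fftplan, &local_height, &local_start,
                           &local_ny_after_transpose,
                           &local_y_start_after_transpose,
                           &total_local_size);

    /* this process owns rows [local_start, local_start + local_height) of the
       padded image; total_local_size is the local buffer length FFTW needs   */
    fftw_complex *padimg = malloc(total_local_size * sizeof(fftw_complex));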

Figure 1: Slab decomposition for data distribution (the image is divided along the X dimension; the Y dimension is always kept whole).

Parallelizing Reading the Image:


In accordance with our parallelization strategy, every process carries out its task concurrently, and the image read operation is parallelized in the same fashion. The image is read based on local_start and local_height, the parameters returned by the data layout function, so each process reads only its own subset of the data.
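
A hypothetical sketch of such a per-process read is given below. The file name, the raw one-byte-per-pixel layout without a header, and the names img_width and img_height are assumptions; local_start and local_height come from the data layout call above.

    /* rows of the original image owned by this rank; ranks whose slab lies
       entirely in the bottom padding region read nothing                    */
    int first = local_start;
    int last  = local_start + local_height;
    int rows_to_read = 0;
    if (first < img_height)
        rows_to_read = (last < img_height ? last : img_height) - first;

    unsigned char *localimg = malloc((size_t) rows_to_read * img_width);

    FILE *fp = fopen("image.raw", "rb");
    fseek(fp, (long) first * img_width, SEEK_SET);  /* skip lower ranks' rows */
    fread(localimg, 1, (size_t) rows_to_read * img_width, fp);
    fclose(fp);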

Padding the image:


Padding of the image and the filter is a major part of the image analysis task. We need to introduce a 'border' around the original image due to the symmetry requirements of the FFT operation; this is achieved by padding the image with zeros on the right and the bottom.

Figure 2: Image padding (cases 1-3).

Since each process gets a subset of the rows of the data, the image padding has to be carried out in one of the following three ways during parallelization (a sketch of the corresponding padding loop follows the list):
Case 1: A process may have rows that require padding only on the right side (refer to Figure 2).
Case 2: A process may have rows that require padding on the bottom as well as on the right-hand side.
Case 3: A process may have a chunk of data that consists only of padding.
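
The following illustrative loop covers all three cases: pixels inside the original img_height x img_width region are copied, and everything to the right of img_width or below img_height becomes part of the zero border. The names pad_w (padded width, the second FFT dimension), localimg, and padimg are assumptions carried over from the sketches above.

    for (int i = 0; i < local_height; i++) {
        int global_row = local_start + i;
        for (int j = 0; j < pad_w; j++) {
            if (global_row < img_height && j < img_width)
                padimg[i * pad_w + j].re = (double) localimg[i * img_width + j];
            else
                padimg[i * pad_w + j].re = 0.0;   /* right/bottom zero padding */
            padimg[i * pad_w + j].im = 0.0;       /* real-valued input image   */
        }
    }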

Padding the filter:


The filter used for the image analysis task needs to be blown up to the size (same dimensions) of the padded image. The filter is padded in the same manner as the image, i.e. by appending zeros to the right and the bottom.

FFT of image and filter:


The actual computation of the transform is performed by the function fftwnd_mpi. We replaced the fftwnd_one() calls in the original sequential version with the corresponding parallel calls:
    fftwnd_mpi(fftplan, 1, padimg, NULL, FFTW_NORMAL_ORDER);
    fftwnd_mpi(fftplan, 1, padfilter, NULL, FFTW_NORMAL_ORDER);

Convolution:
The FFTs of the padded image and filter are performed first; the two transforms are then multiplied point-wise to obtain the convolved output. Since this task is now divided over multiple processes, we modify the convolution loop to range from 0 to the number of rows on each process (the local image height), as sketched below.
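
A sketch of the point-wise complex multiplication is shown here; convolved and padfilter are illustrative names, and any normalization by the transform size that the original code may apply is omitted.

    for (int i = 0; i < local_height; i++) {
        for (int j = 0; j < pad_w; j++) {
            int k = i * pad_w + j;
            /* (a + ib)(c + id) = (ac - bd) + i(ad + bc) */
            double re = padimg[k].re * padfilter[k].re
                      - padimg[k].im * padfilter[k].im;
            double im = padimg[k].re * padfilter[k].im
                      + padimg[k].im * padfilter[k].re;
            convolved[k].re = re;
            convolved[k].im = im;
        }
    }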

Inverse FFT:
This is the inverse of the FFT operation. For the purpose of parallelization we replace fftwnd_one() with fftwnd_mpi(ifftplan, 1, convolved, NULL, FFTW_NORMAL_ORDER).

Depadding:
The depadding operation is carried out on the output of the inverse FFT. It separates the image from the padding that we introduced before the FFT operation. Figure 3 shows the output of the IFFT that needs to be depadded. In our parallel environment, depadding is defined by three main scenarios.

Figure 3: De-padding operation (cases 1-3).

Case 1: A process may have only padding and no useful image information.
Case 2: A process may contain image information along with padding on the top and the left. In this case we have to extract the image information accordingly.
Case 3: A process may contain image information along with padding on the left.
A sketch of a depadding loop that handles these cases is given below.
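
In this hypothetical sketch, pad_top and pad_left mark where the useful image region begins after the inverse FFT, out_w is the depadded width, and outbuf holds the useful pixels of this rank; all of these names are assumptions.

    int out_w = pad_w - pad_left;
    int out_rows = 0;                                /* useful rows on this rank */
    for (int i = 0; i < local_height; i++) {
        int global_row = local_start + i;
        if (global_row < pad_top)                    /* case 1: padding only     */
            continue;
        for (int j = pad_left; j < pad_w; j++)       /* cases 2 and 3            */
            outbuf[out_rows * out_w + (j - pad_left)] =
                (unsigned char) convolved[i * pad_w + j].re;
        out_rows++;
    }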

Writing the output Image:


While writing the output image we need to address synchronization, i.e. enforce an ordering of the processes. Hence we check the rank of each process: rank 0 writes to the output buffer and passes it on to the next process (rank + 1); the last process only receives the buffer from process rank - 1; the remaining processes receive from rank - 1 and send to rank + 1. A sketch of this scheme is given below.
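
In the sketch below, rank and nprocs come from MPI_Comm_rank/MPI_Comm_size, out_h x out_w is the depadded image size, out_row_start is the first output row owned by this rank, and write_image() is a hypothetical helper that dumps the assembled buffer to disk; these names are assumptions.

    unsigned char *outimg = malloc((size_t) out_h * out_w);

    if (rank != 0)                              /* wait for the predecessor    */
        MPI_Recv(outimg, out_h * out_w, MPI_UNSIGNED_CHAR, rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* add this rank's depadded rows to the accumulated output buffer */
    memcpy(outimg + (size_t) out_row_start * out_w, outbuf,
           (size_t) out_rows * out_w);

    if (rank != nprocs - 1)                     /* pass the buffer onward      */
        MPI_Send(outimg, out_h * out_w, MPI_UNSIGNED_CHAR, rank + 1, 0,
                 MPI_COMM_WORLD);
    else
        write_image("output.img", outimg, out_h, out_w);  /* last rank writes  */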

Testing and Verification:


We tested the parallelized code as follows: we executed the sequential code and the parallel code for different numbers of processes, visualized the outputs in MATLAB, and compared them to ensure that our code was executing correctly.
Additionally, we used the timing functions provided by the MPI library to calculate the speedup and efficiency of our parallel code, as sketched below.
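
The following sketch shows how the per-run time could be gathered with MPI_Wtime: each rank measures its own elapsed time, and the maximum over all ranks (the slowest process) is reported, matching the note under Table 1. The placement of the timers around the whole analysis is an assumption.

    double t_start = MPI_Wtime();

    /* ... read, pad, FFT, convolve, inverse FFT, depad, write ... */

    double t_local = MPI_Wtime() - t_start;
    double t_max;
    MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Execution time: %f\n", t_max);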

Results:
We obtained readings on the cluster using the image file provided by Dr. Gabriel under the folder /pvfs2.
We took 3 readings each for 1, 2, 4, and 8 processes; these are tabulated below.

# processes    1          2          4          8
Iter 1         981.964    660.945    365.442    203.530
Iter 2         987.536    682.134    366.828    204.916
Iter 3         981.474    671.897    374.905    205.372
Average        983.658    671.659    369.0586   204.606

Table 1: Number of processes vs. execution time over 3 iterations, with the average.


Please note: we obtained the execution time for the parallel code by taking the maximum execution time, i.e. the execution time of the slowest process in the group.

Speedup and Efficiency:
We calculated the speedup and efficiency as follows:
Speedup = Total time (1 process) / Total time (n processes)
Efficiency = Speedup / Number of processes
For example, with 2 processes the speedup is 983.658 / 671.659 ≈ 1.4645 and the efficiency is 1.4645 / 2 ≈ 0.7322.

No. of processes    2         4         8
Speedup             1.4645    2.6653    4.8075
Efficiency          0.7322    0.666     0.6009

Table 2: Speedup and Efficiency

Figure 4: Number of processes vs. speedup

Figure 5: Number of processes vs. efficiency


Conclusion:
Implementing parallel computation with the MPI libraries in this project allowed us to scale our solution greatly. We can perform image analysis for images with sizes in the range of hundreds to thousands of gigabytes, which would not have been possible sequentially. We also observed that the performance of our task improves in that the time taken to perform the analysis is greatly reduced, as can be inferred from the graphs plotted above.

Individual Contribution of team members:

Neelima Paramatmuni:
• Analyzed the sequential code and identified the parts that needed to be parallelized
• Studied the FFTW MPI routines
• Wrote the code to parallelize the sequential version (reading, padding, and writing the image)
• Debugged the code
• Tested the code
• Calculated speedup and efficiency
• Documentation

Padmaja Pentela:
• Analyzed the code and the FFTW MPI routines
• Helped with the file reading, padding, and file writing functions
• Documentation

Shyamali Balasubramaniyan:
• Analyzed the code
• Studied the FFTW MPI routines
• Tested the code
• Recorded measurements, produced graphs, and contributed to documentation

Shravanthi Denthumdas:
• Analyzed the sequential code and studied the FFTW MPI routines
• Tested the code
• Documentation

Samira Hemrajani:
• Contributed to the analysis of the sequential code
• Addressed questions such as: What is image analysis? How is the FFT carried out? What are filters, Gabor filters, etc.?
• Investigated different parallelization approaches and reviewed fftwnd_mpi_local_sizes for data distribution in the MPI FFTW routines
• Incorporated the timing functions, namely MPI_Wtime and gettimeofday, in the parallel and sequential codes respectively
• Carried out intensive testing
• Helped in documenting the results
