
Task Dependency Identification and Dependency-Aware Scheduling for Lazy-Parallel Calls

Soumya S Chatterjee (1, a), Jatin Jani (1, b) and Dr. R. Gururaj (1, c)

(1) Department of Computer Science and Information Systems, BITS-Pilani, Hyderabad Campus, Hyderabad, Andhra Pradesh, India 500 078

(a) soum@live.in, (b) jatin.s.jani@gmail.com, (c) gururaj@bits-hyderabad.ac.in

Keywords: Lazy-Parallel calls, dependency-aware scheduling, parallel execution, dependency identification, automatic parallelization

Abstract. Parallel computations work best when the parallel tasks are totally independent. In most practical applications, however, there are dependences, which schedulers generally do not take into account when scheduling the tasks. We propose a mechanism to identify the dependences between the tasks created when Lazy-Parallel function calls are executed, and to schedule them so that their dependences are satisfied and caller stalls are minimized.

Introduction

When a program is parallelized, it is split into multiple, independent threads of execution. In many cases, however, concurrency exists alongside certain precedence constraints. Allowing simple dependence constraints, such as task Tn+1 executing only after task Tn, exposes concurrency in many more places. Most schedulers in use today do not care about such dependences: they assume parallel tasks are all independent and schedule them for the best possible performance. We aim to demonstrate the effect of dependency-aware scheduling on the performance of a program.

Previously, we proposed a new function call mechanism [1], called Lazy-Parallel Function Calls (L-P calls), that allows a function to be executed either serially or in parallel without any change in code while guaranteeing the same result. L-P calls generate background tasks which are then scheduled using a user-mode scheduler onto OS threads [1]. We also presented compiler analyses [1, 2] for using L-P calls to automatically parallelize code. Our implementation of L-P calls suffered from the same problem: the parallel tasks created were assumed to be independent. If the result of a parallel computation is needed elsewhere, the consuming task has to stall and wait for the producing task to complete. These pauses result in unnecessary context switches, which could have been avoided if the scheduler had been aware of the dependences between the tasks and had scheduled the consuming task only after the producing task had finished.

We propose new compiler analysis techniques that allow the dependences between the tasks to be identified. We also propose a new multi-level scheduler that uses this dependency information to schedule the tasks in a way that minimizes stalls. We then verify, using a prototype implementation in C#, whether the scheduler delivers the expected benefits.

The paper summarizes our previous work and explains the problem in the Background section, followed by the Related Work section. Our proposal is presented in the Proposal section, and the prototype implementation and the results obtained are discussed in the Results and Discussion section. The paper concludes, with directions for future work, in the Conclusion and Future Work section, followed by the References.

Background

We proposed a new function call mechanism, called Lazy-Parallel Function Calls (L-P calls) [1], to abstract parallelism using lazy evaluation semantics [3] for languages like C# or Java. Code parallelized with L-P calls exhibits the same behavior under both serial and parallel execution.

The runtime system can switch between the two modes of execution depending on system load. When a function is invoked as an L-P call, its parameters are cached and bound with the function body to create a parameter-less closure. The closure is immediately returned to the caller, and the caller resumes execution. Simultaneously, the closure starts executing as a background task. On completion, the result is stored in the closure, to be read off when needed. If the result is not ready when the closure is accessed, the caller is blocked temporarily. From the programmer's point of view, it is just like lazy evaluation: whether the result is evaluated just-in-time or simultaneously in the background makes no difference to the caller. The caching of parameters ensures that the function executes against the same parameter values regardless of when it is actually evaluated. This allows the runtime to evaluate the function serially (just-in-time, in the same thread as the caller) or in parallel (simultaneously with the caller, on a background thread) while guaranteeing the same result across both execution modes.

L-P calls give the most benefit when used on the costliest methods in a program. To identify these, we proposed two compiler analysis techniques, Call Graph Analysis (CG Analysis) and the Weighted Control Flow Graph (WCFG) [1], which automatically determine which methods would give the most benefit when invoked as L-P calls. The techniques estimate the cost of methods; those whose costs exceed a pre-determined threshold are invoked as L-P calls. At runtime, at an L-P call site, a stub is invoked instead of the target method. The stub checks the dispatch criteria [2], a performance metric such as the task queue length, to estimate the current overhead of managing the tasks. If the overhead is within an acceptable range, the stub invokes the actual method as a parallel call; otherwise it is invoked serially. This keeps the overhead of managing the parallel tasks from negating the performance gained through parallelism.

This mechanism exposes parallelism across independent tasks, but if there is a dependency across tasks, it incurs extra overhead. If task T2 depends on results calculated by T1, T2 is blocked until the results of T1 are available. In such a scenario, T2 starts executing, accesses the closure of T1, and because the results are not yet available, T2 is blocked temporarily. This involves an unnecessary context switch that could have been avoided if the scheduler had identified the dependency between T1 and T2 and had scheduled T2 only after T1 was complete.

The problem is demonstrated by the code snippets in Fig. 1, taken from a bitonic sort program that sorts 2^k numbers and a matrix inverse calculator [4]. The marked lines indicate the L-P call sites. In Fig. 1a, (3) depends on (1) and (2). By the time (3) is invoked, (1) and (2) will most likely not have finished execution, causing (3) to stall. The same goes for (5) in Fig. 1b, which depends on (4).

    ImArray BitonicSort(ImArray items, int lo, int n, bool incr) {
        var partOne = BitonicSort(items, lo, mid, true);          // (1)
        var partTwo = BitonicSort(items, lo + mid, mid, false);   // (2)
        return BitonicMerge(partOne, partTwo, lo, n, incr);       // (3)
    }
    (a)

    Matrix MatrixInverse(Matrix m) {
        Matrix lu = DecomposeMatrix(m, indices);                  // (4)
        for (j = 0; j < n; ++j) {
            var x = MatrixBackSub(lu, indices, col);              // (5)
            for (i = 0; i < n; ++i)
                result.SetAt(i, j, x.At(i));
        }
    }
    (b)

Fig. 1: Code snippets for (a) Bitonic Sort and (b) Matrix Inversion. The marked lines (1)-(5) indicate the L-P call sites
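To make the closure mechanism described above concrete, the following is a minimal sketch of how an L-P call wrapper could look in C#. The type LPCall<T>, its members and the runInParallel flag are illustrative names of ours, not the actual runtime API of [1]: the wrapper captures the already-evaluated arguments in a parameter-less closure, optionally runs it on a background task, and blocks the caller only if the result is read before it is ready.

    using System;
    using System.Threading.Tasks;

    // Illustrative sketch of a Lazy-Parallel call wrapper (names are ours, not the paper's API).
    public sealed class LPCall<T>
    {
        private readonly Func<T> _closure;   // parameter-less closure with cached arguments
        private Task<T> _task;               // set only when dispatched in parallel
        private T _result;
        private bool _evaluated;

        private LPCall(Func<T> closure) { _closure = closure; }

        // The stub decides at the call site whether to run in parallel or serially.
        public static LPCall<T> Invoke(Func<T> closure, bool runInParallel)
        {
            var call = new LPCall<T>(closure);
            if (runInParallel)
                call._task = Task.Run(closure);   // background task; caller resumes immediately
            return call;
        }

        // Reading the value forces evaluation: just-in-time if serial,
        // or a (possibly blocking) wait if the background task is still running.
        public T Value
        {
            get
            {
                if (_task != null) return _task.Result;   // blocks the caller until ready
                if (!_evaluated) { _result = _closure(); _evaluated = true; }
                return _result;
            }
        }
    }

    // Usage, mirroring Fig. 1a: the arguments are evaluated once, when the closure is built.
    // var partOne = LPCall<ImArray>.Invoke(() => BitonicSort(items, lo, mid, true), runInParallel: true);
    // ...
    // var sorted = partOne.Value;   // may stall here if the task has not finished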

The bitonic sort and matrix inverter programs, along with modified versions of a Gaussian Blur [5] and a Histogram calculator, were run with large data sets: 2^30 numbers for the bitonic sort, a 1000 x 1000 matrix for inversion, a 24-megapixel image for the Gaussian blur, and 500,000 numbers for histogram generation. The execution times were averaged over 1000 runs of the code, executed on a dual-core 1.83 GHz Intel Core 2 Duo system with 3 GB of RAM. The results are depicted in Fig. 2, with Fig. 2a comparing the execution times (in ms) under serial and parallel execution. Fig. 2b depicts the waiting time: the time spent by at least one of the parallel tasks in a stalled state while waiting for a result to be computed by another parallel task. The waiting time is presented as a percentage of the total execution time of the program. We can see that a significant amount of time is spent waiting for other tasks to complete execution. We propose using a dependency-aware scheduler to bring down the time tasks spend waiting for results from other parallel tasks.


Fig. 2: (a) Comparing execution times (in milliseconds) of the four programs and (b) The waiting times
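The waiting-time metric above could be instrumented roughly as follows. This is only a sketch of one possible measurement, not the paper's actual profiling code; the StallMeter helper and its names are our own, and it simply accumulates the time a caller spends blocked on a result that is not yet available.

    using System.Diagnostics;
    using System.Threading;
    using System.Threading.Tasks;

    public static class StallMeter
    {
        private static long _stalledTicks;   // total time spent blocked on unfinished tasks

        // Wrap the blocking read of a parallel task's result and account for the stall.
        public static T AwaitResult<T>(Task<T> task)
        {
            if (task.IsCompleted) return task.Result;   // result ready: no stall, no context switch

            var sw = Stopwatch.StartNew();
            var result = task.Result;                   // caller blocks here
            sw.Stop();
            Interlocked.Add(ref _stalledTicks, sw.ElapsedTicks);
            return result;
        }

        public static double StalledMilliseconds =>
            _stalledTicks * 1000.0 / Stopwatch.Frequency;
    }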

Related Work

Scheduling techniques have been proposed by emerging programming systems such as Cilk, the .NET Task Parallel Library (TPL) and Charm++, as well as in the literature. Cilk, TPL and Charm++ use a scheduler [6] where tasks are scheduled onto a user-mode scheduler that maintains a global task queue and multiple processor-local queues. Tasks are taken off in batches from the global queue and put into the local queues. Work stealing is used to balance the load across the processors. Tasks from the local queues are scheduled one by one onto OS threads, which are then scheduled by the OS for execution. Our initial implementation of L-P calls scheduled the tasks using a similar scheduler [1], provided as part of the .NET Framework Task Parallel Library (TPL).

A multi-level scheduler that makes use of dynamic thread grouping has been proposed in [7]. The scheduler is designed to schedule DAG-structured computations [7] onto manycore systems. At the top level, a super-manager separates threads into groups, each with one group manager and several worker threads. The groups are dynamically partitioned and merged to balance the load. At the second level, group managers collaboratively schedule the worker threads in the group. At the third level, the worker threads self-schedule the tasks they generate. Such a scheduler, which groups the tasks according to their dependences, can be adapted for scheduling dependent L-P calls.

Proposal

We propose compiler analysis techniques to identify the dependences between the tasks generated by L-P calls. We also present a scheduler design that takes the dependences of the tasks into account while scheduling them, so as to minimize the total waiting time for the program. This, in turn, means that the tasks execute for longer stretches without stalling, which decreases the overall execution time of the program.

Task Dependency Identification. The Task Dependency Identification (TDI) method presented here identifies the dependences across the parallel tasks generated at L-P call sites. The CG Analysis and WCFG methods [1] are used to mark the L-P call sites. In addition, the compiler maintains a Task-Result Generation Table (TRG table) that tracks the id of the task generated, the variables used by the task, the result generated by the task, and its dependences. At each L-P call site, the compiler adds the corresponding entry for the task generated at that site into the TRG table. While doing so, it checks whether any of the variables used by the task are generated as the result of another task. If so, it marks the current task as being dependent on the other task. This analysis is done on the Static Single Assignment (SSA) [8] form of the code. Consider the example in Table 1. In line (2), task T2 takes s1 as a parameter, which is generated by task T1. So T2 is dependent on T1, which is represented as T2 -> T1. Similarly, in line (3), T3 -> (T1, T2). This information is given to the scheduler in the form of a Directed Acyclic Graph (DAG), as shown in Fig. 3.
Table 1: The TRG table demonstrating the dependences at various L-P call sites in a code sample

Original Code:
    void Foo() {
        var x = Bar();
        var y = Par(x);
        var z = Gen(x, y);
    }

SSA Form:
    void Foo() {
        var s1 = Bar();          // (1)
        var s2 = Par(s1);        // (2)
        var s3 = Gen(s1, s2);    // (3)
    }

TRG Table:
    id    Params    Result    Dep.
    T1    ---       s1        ---
    T2    s1        s2        T1
    T3    s1, s2    s3        T1, T2
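To illustrate how TRG-table entries might be derived from the SSA form, the following sketch records, for each L-P call site, which task produces each SSA variable and marks a dependence whenever a parameter was produced by an earlier task. The TrgEntry and TrgTable types and the way call sites are fed in are our illustrative choices; the real analysis runs as a compiler pass, as described later in the Results and Discussion section.

    using System.Collections.Generic;
    using System.Linq;

    // One row of the TRG table: the task id, the SSA variables it reads,
    // the SSA variable it defines, and the tasks it therefore depends on.
    public sealed class TrgEntry
    {
        public string TaskId;
        public List<string> Params = new List<string>();
        public string Result;
        public List<string> DependsOn = new List<string>();
    }

    public sealed class TrgTable
    {
        private readonly List<TrgEntry> _entries = new List<TrgEntry>();
        // Maps an SSA variable to the task that produces it.
        private readonly Dictionary<string, string> _producer = new Dictionary<string, string>();

        // Called once per L-P call site, in program order of the SSA form.
        public void AddCallSite(string taskId, string resultVar, params string[] paramVars)
        {
            var entry = new TrgEntry { TaskId = taskId, Result = resultVar };
            foreach (var v in paramVars)
            {
                entry.Params.Add(v);
                if (_producer.TryGetValue(v, out var producerTask))   // v is another task's result
                    entry.DependsOn.Add(producerTask);
            }
            _producer[resultVar] = taskId;
            _entries.Add(entry);
        }

        // Adjacency list of the dependency DAG: task id -> ids of the tasks it depends on.
        public Dictionary<string, List<string>> ToDag() =>
            _entries.ToDictionary(e => e.TaskId, e => e.DependsOn.ToList());
    }

    // Reproducing Table 1:
    //   table.AddCallSite("T1", "s1");                // no dependences
    //   table.AddCallSite("T2", "s2", "s1");          // T2 -> T1
    //   table.AddCallSite("T3", "s3", "s1", "s2");    // T3 -> (T1, T2)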

Dependency-Aware Scheduler. The scheduler has to schedule the tasks generated at the L-P call sites so as to minimize the time the dependent tasks spend waiting for their dependences to finish execution. The scheduler is passed the dependency information as a DAG, as shown in Fig. 3a. To schedule the tasks according to the dependences specified in the DAG, it uses a multi-level scheduling mechanism. The tasks are arranged into scheduling groups as in Fig. 3b, with the tasks within one group being totally independent. The dependences between the tasks are then translated into dependences between the groups, where group Gi+1 depends on Gi (represented as Gi+1 -> Gi).
[Fig. 3 shows tasks T1-T6: panel (a) the dependency DAG and panel (b) the tasks arranged into scheduling groups G0, G1 and G2.]
Fig. 3: A sample DAG given to the scheduler, with (a) the dependences between the tasks and (b) the scheduling groups generated from the DAG

The scheduler tries to assign the tasks to groups as early as possible. The groups are identified in increasing order of group id, where Gi+1 depends on Gi. To start with, all independent tasks go into group G0, and their state in the DAG is marked as scheduled. As other tasks are created, they are put into the earliest available group that satisfies all their dependences. In general, this can be any group that satisfies the condition group(Ti) > max(group(Tdep)), where Tdep represents the dependences of task Ti. In other words, a task can be put in a group whose id is greater than the group ids of all the tasks it depends on. The scheduler tries to put the task in the first group that satisfies this requirement, which generally is the next group. But in case that group has already been executed, the task is put in the first group still awaiting execution.

The working of the scheduler can be explained with the following example. Consider tasks T3 -> (T2, T1) and T2 -> T1. T1 has to go into group G0 as it is independent. For T2, its dependences are scheduled in G0, so it is put in G1. The highest group containing a dependence of T3 is G1, so T3 is scheduled in G2. The algorithm is summarized in Table 2.
Table 2: Algorithm for the group-generation phase of the scheduler, detailing (a) the initialization phase, (b) adding one task to the group of tasks and (c) statically grouping a list of parallel tasks into scheduling groups

    Input: a suitable representation of the DAG
    Output: list of scheduling groups

    (a) Initialization:
        groups = empty list
        count  = number of tasks in tasks

    (b) Scheduling a single task ti:
        deps  = list of tasks ti depends on
        grpid = max(groupid(all tasks in deps)) + 1
        if groups[grpid] is not yet executed
            add ti to groups[grpid]
        else
            grpid = id of the first group in groups not yet executed
            add ti to groups[grpid]
        end if

    (c) Statically grouping a list of parallel tasks:
        while exists(unprocessed task in tasks)
            groupi = new group()
            foreach task in tasks
                if task is not scheduled and predecessors(task) is nil
                    mark task as scheduled
                    add task to groupi
                else if task is not scheduled and all predecessors of task are scheduled
                    mark task as scheduled
                    add task to groupi
                end if
            end foreach
            add groupi to groups
        end while
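The following is a compact sketch of the static grouping of Table 2 together with the group-by-group execution described in the next paragraph. GroupScheduler and its members are illustrative names of ours, not the prototype's actual API; the essential property is that tasks within a group are mutually independent and are handed to the TPL together, while the next group is released only after the whole previous group has finished.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;

    public sealed class GroupScheduler
    {
        // dag:  task id -> ids of the tasks it depends on (from the TRG table).
        // work: task id -> the actual work to run for that task.
        public static void Run(Dictionary<string, List<string>> dag,
                               Dictionary<string, Action> work)
        {
            var groups = BuildGroups(dag);

            // Groups execute strictly in order; within a group the TPL is free to
            // schedule the (mutually independent) tasks however it likes.
            foreach (var grp in groups)
            {
                var running = grp.Select(id => Task.Run(work[id])).ToArray();
                Task.WaitAll(running);   // next group starts only when this one is done
            }
        }

        // Static grouping of Table 2(c): a task goes into the first group whose id is
        // greater than the group ids of everything it depends on.
        public static List<List<string>> BuildGroups(Dictionary<string, List<string>> dag)
        {
            var groups = new List<List<string>>();
            var groupOf = new Dictionary<string, int>();   // task id -> group id

            while (groupOf.Count < dag.Count)
            {
                var current = new List<string>();
                foreach (var kv in dag)
                {
                    if (groupOf.ContainsKey(kv.Key)) continue;            // already scheduled
                    if (kv.Value.All(dep => groupOf.ContainsKey(dep)))    // all deps in earlier groups
                        current.Add(kv.Key);
                }
                if (current.Count == 0)
                    throw new InvalidOperationException("Cycle in dependency graph");
                foreach (var id in current) groupOf[id] = groups.Count;
                groups.Add(current);
            }
            return groups;
        }
    }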

Once the groups are formed, they are scheduled one after the other in increasing order of group id. Within a group, the tasks are scheduled by the TPL scheduler. The group scheduler schedules one group by moving all the tasks in the group to the TPL task queue. Only when a group has fully executed does the group scheduler schedule the execution of the next group. This guarantees that once a group starts executing, its dependences have all finished executing, and that there will be no stalls during its execution.

Results and Discussion

We verified our proposal using a prototype. The prototype implements the TDI analysis as a separate pass of the Mono compiler we built in [1]. After the L-P call site identification analysis is done, the compiler makes another pass over the identified sites, creating the TRG table entries, which are stored in an XML file. The scheduler is implemented in C#, with the task creation and management API modeled after the .NET Framework Task API [6]. On first invocation, the scheduler reads the XML file and creates the DAG and the scheduling groups. The states of the tasks are tracked by updating their state in the DAG. Recursively generated tasks are handled by duplicating portions of the DAG to represent multiple invocations of a task.

The four programs profiled earlier were executed again with the same data sets, but this time using the scheduler we designed. The results are given in Fig. 4. We can see performance gains ranging from 9% for the Bitonic Sort program to 16% for the Gaussian Blur program. There is also a corresponding reduction of the task stall times, by as much as 40%. Repeating the experiments across a wider range of programs, we see that a 5%-10% increase in performance and an average 35% reduction in stall times can be obtained. The results are in line with expectations, but they have only been verified on a single system configuration. While the initial results are promising, there are opportunities for improvement. In addition, more testing is needed with varying hardware and software combinations, including systems with more processing cores and varying memory availability. With further testing, the results can be generalized across varying hardware profiles.


Fig. 4: Comparing (a) the execution times (in ms) of the four programs and (b) the waiting times (as a percentage of total execution time). The results show a significant decrease in both execution times and stall times.

Conclusion and Future Work

We proposed a method to identify dependences across the tasks generated at L-P call sites, and designed a multi-level scheduler that schedules these tasks while taking the dependences into account. We have demonstrated that dependency-aware scheduling can help avoid unnecessary stalls and can improve performance, sometimes significantly. While the results are promising, more testing is needed to validate them across a much wider range of hardware configurations and to implement the approach in a full-scale compiler.

References

[1] Soumya S Chatterjee and R. Gururaj, Lazy-Parallel Function Calls for Automatic Parallelization, Proceedings of the International Conference on Computational Intelligence and Information Technology, Springer LNCS, November 2011

[2] Soumya S Chatterjee and R. Gururaj, Overhead-Aware Composition of Parallel Components Using Lazy-Parallel Function Calls, International Conference on Advances in Mobile Network Communication and its Applications, IEEE CPS, August 2012, in press

[3] John Launchbury, A Natural Semantics for Lazy Evaluation, Proceedings of the 20th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 144-154

[4] James D. McCaffrey, Matrix Inversion in C# Using Decomposition, Information on http://jamesmccaffrey.wordpress.com/2011/08/15/matrix-inversion-in-c-using-decomposition/

[5] Bharath M, Gaussian Blur Using C++ AMP, Information on http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/14/gaussian-blur-using-c-amp.aspx

[6] Jeffrey Juday, Understanding the .NET Task Parallel Library TaskScheduler, Information on http://www.codeguru.com/csharp/article.php/c18931/Understanding-the-NET-Task-Parallel-Library-TaskScheduler.htm

[7] Yinglong Xia, Viktor K. Prasanna, James Li, Hierarchical Scheduling of DAG Structured Computations on Manycore Processors with Dynamic Thread Grouping, University of Southern California

[8] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman and F. Kenneth Zadeck, An Efficient Method of Computing Static Single Assignment Form, Conference Record of the 16th ACM Symposium on Principles of Programming Languages
