Algorithmic skeletonIn computing, algorithmic skeletons, or parallelism patterns, are a high-level parallel programming model for parallel and distributed computing. Algorithmic skeletons take advantage of common programming patterns to hide the complexity of parallel and distributed applications. Starting from a basic set of patterns (skeletons), more complex patterns can be built by combining the basic ones. OverviewThe most outstanding feature of algorithmic skeletons, which differentiates them from other high-level parallel programming models, is that orchestration and synchronization of the parallel activities is implicitly defined by the skeleton patterns. Programmers do not have to specify the synchronizations between the application's sequential parts. This yields two implications. First, as the communication/data access patterns are known in advance, cost models can be applied to schedule skeletons programs.[1] Second, that algorithmic skeleton programming reduces the number of errors when compared to traditional lower-level parallel programming models (Threads, MPI). Example programThe following example is based on the Java Skandium library for parallel programming. The objective is to implement an Algorithmic Skeleton-based parallel version of the QuickSort algorithm using the Divide and Conquer pattern. Notice that the high-level approach hides Thread management from the programmer. // 1. Define the skeleton program
Skeleton<Range, Range> sort = new DaC<Range, Range>(
new ShouldSplit(threshold, maxTimes),
new SplitList(),
new Sort(),
new MergeList());
// 2. Input parameters
Future<Range> future = sort.input(new Range(generate(...)));
// 3. Do something else here.
// ...
// 4. Block for the results
Range result = future.get();
The functional codes in this example correspond to four types Condition, Split, Execute, and Merge. public class ShouldSplit implements Condition<Range>{
int threshold, maxTimes, times;
public ShouldSplit(int threshold, int maxTimes){
this.threshold = threshold;
this.maxTimes = maxTimes;
this.times = 0;
}
@Override
public synchronized boolean condition(Range r){
return r.right - r.left > threshold &&
times++ < this.maxTimes;
}
}
The ShouldSplit class implements the Condition interface. The function receives an input, Range r in this case, and returning true or false. In the context of the Divide and Conquer where this function will be used, this will decide whether a sub-array should be subdivided again or not. The SplitList class implements the split interface, which in this case divides an (sub-)array into smaller sub-arrays. The class uses a helper function public class SplitList implements Split<Range, Range>{
@Override
public Range[] split(Range r){
int i = partition(r.array, r.left, r.right);
Range[] intervals = {new Range(r.array, r.left, i-1),
new Range(r.array, i+1, r.right)};
return intervals;
}
}
The Sort class implements and Execute interface, and is in charge of sorting the sub-array specified by public class Sort implements Execute<Range, Range> {
@Override
public Range execute(Range r){
if (r.right <= r.left) return r;
Arrays.sort(r.array, r.left, r.right+1);
return r;
}
}
Finally, once a set of sub-arrays are sorted we merge the sub-array parts into a bigger array with the MergeList class which implements a Merge interface. public class MergeList implements Merge<Range, Range>{
@Override
public Range merge(Range[] r){
Range result = new Range( r[0].array, r[0].left, r[1].right);
return result;
}
}
Frameworks and librariesASSISTASSIST[2][3] is a programming environment which provides programmers with a structured coordination language. The coordination language can express parallel programs as an arbitrary graph of software modules. The module graph describes how a set of modules interact with each other using a set of typed data streams. The modules can be sequential or parallel. Sequential modules can be written in C, C++, or Fortran; and parallel modules are programmed with a special ASSIST parallel module (parmod). AdHoc,[4][5] a hierarchical and fault-tolerant Distributed Shared Memory (DSM) system is used to interconnect streams of data between processing elements by providing a repository with: get/put/remove/execute operations. Research around AdHoc has focused on transparency, scalability, and fault-tolerance of the data repository. While not a classical skeleton framework, in the sense that no skeletons are provided, ASSIST's generic parmod can be specialized into classical skeletons such as: farm, map, etc. ASSIST also supports autonomic control of parmods, and can be subject to a performance contract by dynamically adapting the number of resources used. CO2P3SCO2P3S (Correct Object-Oriented Pattern-based Parallel Programming System), is a pattern oriented development environment,[6] which achieves parallelism using threads in Java. CO2P3S is concerned with the complete development process of a parallel application. Programmers interact through a programming GUI to choose a pattern and its configuration options. Then, programmers fill the hooks required for the pattern, and new code is generated as a framework in Java for the parallel execution of the application. The generated framework uses three levels, in descending order of abstraction: patterns layer, intermediate code layer, and native code layer. Thus, advanced programmers may intervene the generated code at multiple levels to tune the performance of their applications. The generated code is mostly type safe, using the types provided by the programmer which do not require extension of superclass, but fails to be completely type safe such as in the reduce(..., Object reducer) method in the mesh pattern. The set of patterns supported in CO2P3S corresponds to method-sequence, distributor, mesh, and wavefront. Complex applications can be built by composing frameworks with their object references. Nevertheless, if no pattern is suitable, the MetaCO2P3S graphical tool addresses extensibility by allowing programmers to modify the pattern designs and introduce new patterns into CO2P3S. Support for distributed memory architectures in CO2P3S was introduced in later.[7] To use a distributed memory pattern, programmers must change the pattern's memory option from shared to distributed, and generate the new code. From the usage perspective, the distributed memory version of the code requires the management of remote exceptions. Calcium & SkandiumCalcium is greatly inspired by Lithium and Muskel. As such, it provides algorithmic skeleton programming as a Java library. Both task and data parallel skeletons are fully nestable; and are instantiated via parametric skeleton objects, not inheritance. Calcium supports the execution of skeleton applications on top of the ProActive environment for distributed cluster like infrastructure. Additionally, Calcium has three distinctive features for algorithmic skeleton programming. First, a performance tuning model which helps programmers identify code responsible for performance bugs.[8] Second, a type system for nestable skeletons which is proven to guarantee subject reduction properties and is implemented using Java Generics.[9] Third, a transparent algorithmic skeleton file access model, which enables skeletons for data intensive applications.[10] Skandium is a complete re-implementation of Calcium for multi-core computing. Programs written on Skandium may take advantage of shared memory to simplify parallel programming.[11] EdenEden[12] is a parallel programming language for distributed memory environments, which extends Haskell. Processes are defined explicitly to achieve parallel programming, while their communications remain implicit. Processes communicate through unidirectional channels, which connect one writer to exactly one reader. Programmers only need to specify which data a processes depends on. Eden's process model provides direct control over process granularity, data distribution and communication topology. Eden is not a skeleton language in the sense that skeletons are not provided as language constructs. Instead, skeletons are defined on top of Eden's lower-level process abstraction, supporting both task and data parallelism. So, contrary to most other approaches, Eden lets the skeletons be defined in the same language and at the same level, the skeleton instantiation is written: Eden itself. Because Eden is an extension of a functional language, Eden skeletons are higher order functions. Eden introduces the concept of implementation skeleton, which is an architecture independent scheme that describes a parallel implementation of an algorithmic skeleton. eSkelThe Edinburgh Skeleton Library (eSkel) is provided in C and runs on top of MPI. The first version of eSkel was described in,[13] while a later version is presented in.[14] In,[15] nesting-mode and interaction-mode for skeletons are defined. The nesting-mode can be either transient or persistent, while the interaction-mode can be either implicit or explicit. Transient nesting means that the nested skeleton is instantiated for each invocation and destroyed Afterwards, while persistent means that the skeleton is instantiated once and the same skeleton instance will be invoked throughout the application. Implicit interaction means that the flow of data between skeletons is completely defined by the skeleton composition, while explicit means that data can be generated or removed from the flow in a way not specified by the skeleton composition. For example, a skeleton that produces an output without ever receiving an input has explicit interaction. Performance prediction for scheduling and resource mapping, mainly for pipe-lines, has been explored by Benoit et al.[16][17][18][19] They provided a performance model for each mapping, based on process algebra, and determine the best scheduling strategy based on the results of the model. More recent works have addressed the problem of adaptation on structured parallel programming,[20] in particular for the pipe skeleton.[21][22] FastFlowFastFlow is a skeletal parallel programming framework specifically targeted to the development of streaming and data-parallel applications. Being initially developed to target multi-core platforms, it has been successively extended to target heterogeneous platforms composed of clusters of shared-memory platforms,[23][24] possibly equipped with computing accelerators such as NVidia GPGPUs, Xeon Phi, Tilera TILE64. The main design philosophy of FastFlow is to provide application designers with key features for parallel programming (e.g. time-to-market, portability, efficiency and performance portability) via suitable parallel programming abstractions and a carefully designed run-time support.[25] FastFlow is a general-purpose C++ programming framework for heterogeneous parallel platforms. Like other high-level programming frameworks, such as Intel TBB and OpenMP, it simplifies the design and engineering of portable parallel applications. However, it has a clear edge in terms of expressiveness and performance with respect to other parallel programming frameworks in specific application scenarios, including, inter alia: fine-grain parallelism on cache-coherent shared-memory platforms; streaming applications; coupled usage of multi-core and accelerators. In other cases FastFlow is typically comparable to (and is some cases slightly faster than) state-of-the-art parallel programming frameworks such as Intel TBB, OpenMP, Cilk, etc.[26] HDCHigher-order Divide and Conquer (HDC)[27] is a subset of the functional language Haskell. Functional programs are presented as polymorphic higher-order functions, which can be compiled into C/MPI, and linked with skeleton implementations. The language focus on divide and conquer paradigm, and starting from a general kind of divide and conquer skeleton, more specific cases with efficient implementations are derived. The specific cases correspond to: fixed recursion depth, constant recursion degree, multiple block recursion, elementwise operations, and correspondent communications[28] HDC pays special attention to the subproblem's granularity and its relation with the number of Available processors. The total number of processors is a key parameter for the performance of the skeleton program as HDC strives to estimate an adequate assignment of processors for each part of the program. Thus, the performance of the application is strongly related with the estimated number of processors leading to either exceeding number of subproblems, or not enough parallelism to exploit available processors. HOC-SAHOC-SA is an Globus Incubator project. JaSkelJaSkel[30] is a Java-based skeleton framework providing skeletons such as farm, pipe and heartbeat. Skeletons are specialized using inheritance. Programmers implement the abstract methods for each skeleton to provide their application specific code. Skeletons in JaSkel are provided in both sequential, concurrent and dynamic versions. For example, the concurrent farm can be used in shared memory environments (threads), but not in distributed environments (clusters) where the distributed farm should be used. To change from one version to the other, programmers must change their classes' signature to inherit from a different skeleton. The nesting of skeletons uses the basic Java Object class, and therefore no type system is enforced during the skeleton composition. The distribution aspects of the computation are handled in JaSkel using AOP, more specifically the AspectJ implementation. Thus, JaSkel can be deployed on both cluster and Grid like infrastructures.[31] Nevertheless, a drawback of the JaSkel approach is that the nesting of the skeleton strictly relates to the deployment infrastructure. Thus, a double nesting of farm yields a better performance than a single farm on hierarchical infrastructures. This defeats the purpose of using AOP to separate the distribution and functional concerns of the skeleton program. Lithium & MuskelLithium[32][33][34] and its successor Muskel are skeleton frameworks developed at University of Pisa, Italy. Both of them provide nestable skeletons to the programmer as Java libraries. The evaluation of a skeleton application follows a formal definition of operational semantics introduced by Aldinucci and Danelutto,[35][36] which can handle both task and data parallelism. The semantics describe both functional and parallel behavior of the skeleton language using a labeled transition system. Additionally, several performance optimization are applied such as: skeleton rewriting techniques [18, 10], task lookahead, and server-to-server lazy binding.[37] At the implementation level, Lithium exploits macro-data flow[38][39] to achieve parallelism. When the input stream receives a new parameter, the skeleton program is processed to obtain a macro-data flow graph. The nodes of the graph are macro-data flow instructions (MDFi) which represent the sequential pieces of code provided by the programmer. Tasks are used to group together several MDFi, and are consumed by idle processing elements from a task pool. When the computation of the graph is concluded, the result is placed into the output stream and thus delivered back to the user. Muskel also provides non-functional features such as Quality of Service (QoS);[40] security between task pool and interpreters;[41][42] and resource discovery, load balancing, and fault tolerance when interfaced with Java / Jini Parallel Framework (JJPF),[43] a distributed execution framework. Muskel also provides support for combining structured with unstructured programming[44] and recent research has addressed extensibility.[45] MallbaMallba[46] is a library for combinatorial optimizations supporting exact, heuristic and hybrid search strategies.[47] Each strategy is implemented in Mallba as a generic skeleton which can be used by providing the required code. On the exact search algorithms Mallba provides branch-and-bound and dynamic-optimization skeletons. For local search heuristics Mallba supports: hill climbing, metropolis, simulated annealing, and tabu search; and also population based heuristics derived from evolutionary algorithms such as genetic algorithms, evolution strategy, and others (CHC). The hybrid skeletons combine strategies, such as: GASA, a mixture of genetic algorithm and simulated annealing, and CHCCES which combines CHC and ES. The skeletons are provided as a C++ library and are not nestable but type safe. A custom MPI abstraction layer is used, NetStream, which takes care of primitive data type marshalling, synchronization, etc. A skeleton may have multiple lower-level parallel implementations depending on the target architectures: sequential, LAN, and WAN. For example: centralized master-slave, distributed master-slave, etc. Mallba also provides state variables which hold the state of the search skeleton. The state links the search with the environment, and can be accessed to inspect the evolution of the search and decide on future actions. For example, the state can be used to store the best solution found so far, or α, β values for branch and bound pruning.[48] Compared with other frameworks, Mallba's usage of skeletons concepts is unique. Skeletons are provided as parametric search strategies rather than parametric parallelization patterns. MarrowMarrow[49][50] is a C++ algorithmic skeleton framework for the orchestration of OpenCL computations in, possibly heterogeneous, multi-GPU environments. It provides a set of both task and data-parallel skeletons that can be composed, through nesting, to build compound computations. The leaf nodes of the resulting composition trees represent the GPU computational kernels, while the remainder nodes denote the skeleton applied to the nested sub-tree. The framework takes upon itself the entire host-side orchestration required to correctly execute these trees in heterogeneous multi-GPU environments, including the proper ordering of the data-transfer and of the execution requests, and the communication required between the tree's nodes. Among Marrow's most distinguishable features are a set of skeletons previously unavailable in the GPU context, such as Pipeline and Loop, and the skeleton nesting ability – a feature also new in this context. Moreover, the framework introduces optimizations that overlap communication and computation, hence masking the latency imposed by the PCIe bus. The parallel execution of a Marrow composition tree by multiple GPUs follows a data-parallel decomposition strategy, that concurrently applies the entire computational tree to different partitions of the input dataset. Other than expressing which kernel parameters may be decomposed and, when required, defining how the partial results should be merged, the programmer is completely abstracted from the underlying multi-GPU architecture. More information, as well as the source code, can be found at the Marrow website MuesliThe Muenster Skeleton Library Muesli[51][52] is a C++ template library which re-implements many of the ideas and concepts introduced in Skil, e.g. higher order functions, currying, and polymorphic types [1]. It is built on top of MPI 1.2 and OpenMP 2.5 and supports, unlike many other skeleton libraries, both task and data parallel skeletons. Skeleton nesting (composition) is similar to the two tier approach of P3L, i.e. task parallel skeletons can be nested arbitrarily while data parallel skeletons cannot, but may be used at the leaves of a task parallel nesting tree.[53] C++ templates are used to render skeletons polymorphic, but no type system is enforced. However, the library implements an automated serialization mechanism inspired by[54] such that, in addition to the standard MPI data types, arbitrary user-defined data types can be used within the skeletons. The supported task parallel skeletons[55] are Branch & Bound,[56] Divide & Conquer,[57][58] Farm,[59][60] and Pipe, auxiliary skeletons are Filter, Final, and Initial. Data parallel skeletons, such as fold (reduce), map, permute, zip, and their variants are implemented as higher order member functions of a distributed data structure. Currently, Muesli supports distributed data structures for arrays, matrices, and sparse matrices.[61] As a unique feature, Muesli's data parallel skeletons automatically scale both on single- as well as on multi-core, multi-node cluster architectures.[62][63] Here, scalability across nodes and cores is ensured by simultaneously using MPI and OpenMP, respectively. However, this feature is optional in the sense that a program written with Muesli still compiles and runs on a single-core, multi-node cluster computer without changes to the source code, i.e. backward compatibility is guaranteed. This is ensured by providing a very thin OpenMP abstraction layer such that the support of multi-core architectures can be switched on/off by simply providing/omitting the OpenMP compiler flag when compiling the program. By doing so, virtually no overhead is introduced at runtime. P3L, SkIE, SKElibP3L[64] (Pisa Parallel Programming Language) is a skeleton based coordination language. P3L provides skeleton constructs which are used to coordinate the parallel or sequential execution of C code. A compiler named Anacleto[65] is provided for the language. Anacleto uses implementation templates to compile P3 L code into a target architecture. Thus, a skeleton can have several templates each optimized for a different architecture. A template implements a skeleton on a specific architecture and provides a parametric process graph with a performance model. The performance model can then be used to decide program transformations which can lead to performance optimizations.[66] A P3L module corresponds to a properly defined skeleton construct with input and output streams, and other sub-modules or sequential C code. Modules can be nested using the two tier model, where the outer level is composed of task parallel skeletons, while data parallel skeletons may be used in the inner level [64]. Type verification is performed at the data flow level, when the programmer explicitly specifies the type of the input and output streams, and by specifying the flow of data between sub-modules. SkIE[67] (Skeleton-based Integrated Environment) is quite similar to P3L, as it is also based on a coordination language, but provides advanced features such as debugging tools, performance analysis, visualization and graphical user interface. Instead of directly using the coordination language, programmers interact with a graphical tool, where parallel modules based on skeletons can be composed. SKELib[68] builds upon the contributions of P3L and SkIE by inheriting, among others, the template system. It differs from them because a coordination language is no longer used, but instead skeletons are provided as a library in C, with performance similar as the one achieved in P3L. Contrary to Skil, another C like skeleton framework, type safety is not addressed in SKELib. PAS and EPASPAS (Parallel Architectural Skeletons) is a framework for skeleton programming developed in C++ and MPI.[69][70] Programmers use an extension of C++ to write their skeleton applications1 . The code is then passed through a Perl script which expands the code to pure C++ where skeletons are specialized through inheritance. In PAS, every skeleton has a Representative (Rep) object which must be provided by the programmer and is in charge of coordinating the skeleton's execution. Skeletons can be nested in a hierarchical fashion via the Rep objects. Besides the skeleton's execution, the Rep also explicitly manages the reception of data from the higher level skeleton, and the sending of data to the sub-skeletons. A parametrized communication/synchronization protocol is used to send and receive data between parent and sub-skeletons. An extension of PAS labeled as SuperPas[71] and later as EPAS[72] addresses skeleton extensibility concerns. With the EPAS tool, new skeletons can be added to PAS. A Skeleton Description Language (SDL) is used to describe the skeleton pattern by specifying the topology with respect to a virtual processor grid. The SDL can then be compiled into native C++ code, which can be used as any other skeleton. SBASCOSBASCO (Skeleton-BAsed Scientific COmponents) is a programming environment oriented towards efficient development of parallel and distributed numerical applications.[73] SBASCO aims at integrating two programming models: skeletons and components with a custom composition language. An application view of a component provides a description of its interfaces (input and output type); while a configuration view provides, in addition, a description of the component's internal structure and processor layout. A component's internal structure can be defined using three skeletons: farm, pipe and multi-block. SBASCO's addresses domain decomposable applications through its multi-block skeleton. Domains are specified through arrays (mainly two dimensional), which are decomposed into sub-arrays with possible overlapping boundaries. The computation then takes place in an iterative BSP like fashion. The first stage consists of local computations, while the second stage performs boundary exchanges. A use case is presented for a reaction-diffusion problem in.[74] Two type of components are presented in.[75] Scientific Components (SC) which provide the functional code; and Communication Aspect Components (CAC) which encapsulate non-functional behavior such as communication, distribution processor layout and replication. For example, SC components are connected to a CAC component which can act as a manager at runtime by dynamically re-mapping processors assigned to a SC. A use case showing improved performance when using CAC components is shown in.[76] SCLThe Structured Coordination Language (SCL)[77] was one of the earliest skeleton programming languages. It provides a co-ordination language approach for skeleton programming over software components. SCL is considered a base language, and was designed to be integrated with a host language, for example Fortran or C, used for developing sequential software components. In SCL, skeletons are classified into three types: configuration, elementary and computation. Configuration skeletons abstract patterns for commonly used data structures such as distributed arrays (ParArray). Elementary skeletons correspond to data parallel skeletons such as map, scan, and fold. Computation skeletons which abstract the control flow and correspond mainly to task parallel skeletons such as farm, SPMD, and iterateUntil. The coordination language approach was used in conjunction with performance models for programming traditional parallel machines as well as parallel heterogeneous machines that have different multiple cores on each processing node.[78] SkePUSkePU[79] SkePU is a skeleton programming framework for multicore CPUs and multi-GPU systems. It is a C++ template library with six data-parallel and one task-parallel skeletons, two container types, and support for execution on multi-GPU systems both with CUDA and OpenCL. Recently, support for hybrid execution, performance-aware dynamic scheduling and load balancing is developed in SkePU by implementing a backend for the StarPU runtime system. SkePU is being extended for GPU clusters. SKiPPER & QUAFFSKiPPER is a domain specific skeleton library for vision applications[80] which provides skeletons in CAML, and thus relies on CAML for type safety. Skeletons are presented in two ways: declarative and operational. Declarative skeletons are directly used by programmers, while their operational versions provide an architecture specific target implementation. From the runtime environment, CAML skeleton specifications, and application specific functions (provided in C by the programmer), new C code is generated and compiled to run the application on the target architecture. One of the interesting things about SKiPPER is that the skeleton program can be executed sequentially for debugging. Different approaches have been explored in SKiPPER for writing operational skeletons: static data-flow graphs, parametric process networks, hierarchical task graphs, and tagged-token data-flow graphs.[81] QUAFF[82] is a more recent skeleton library written in C++ and MPI. QUAFF relies on template-based meta-programming techniques to reduce runtime overheads and perform skeleton expansions and optimizations at compilation time. Skeletons can be nested and sequential functions are stateful. Besides type checking, QUAFF takes advantage of C++ templates to generate, at compilation time, new C/MPI code. QUAFF is based on the CSP-model, where the skeleton program is described as a process network and production rules (single, serial, par, join).[83] SkeToThe SkeTo[84] project is a C++ library which achieves parallelization using MPI. SkeTo is different from other skeleton libraries because instead of providing nestable parallelism patterns, SkeTo provides parallel skeletons for parallel data structures such as: lists, trees,[85][86] and matrices.[87] The data structures are typed using templates, and several parallel operations can be invoked on them. For example, the list structure provides parallel operations such as: map, reduce, scan, zip, shift, etc... Additional research around SkeTo has also focused on optimizations strategies by transformation, and more recently domain specific optimizations.[88] For example, SkeTo provides a fusion transformation[89] which merges two successive function invocations into a single one, thus decreasing the function call overheads and avoiding the creation of intermediate data structures passed between functions. SkilSkil[90] is an imperative language for skeleton programming. Skeletons are not directly part of the language but are implemented with it. Skil uses a subset of C language which provides functional language like features such as higher order functions, curring and polymorphic types. When Skil is compiled, such features are eliminated and a regular C code is produced. Thus, Skil transforms polymorphic high order functions into monomorphic first order C functions. Skil does not support nestable composition of skeletons. Data parallelism is achieved using specific data parallel structures, for example to spread arrays among available processors. Filter skeletons can be used. STAPL Skeleton FrameworkIn STAPL Skeleton Framework [91][92] skeletons are defined as parametric data flow graphs, letting them scale beyond 100,000 cores. In addition, this framework addresses composition of skeletons as point-to-point composition of their corresponding data flow graphs through the notion of ports, allowing new skeletons to be easily added to the framework. As a result, this framework eliminate the need for reimplementation and global synchronizations in composed skeletons. STAPL Skeleton Framework supports nested composition and can switch between parallel and sequential execution in each level of nesting. This framework benefits from scalable implementation of STAPL parallel containers[93] and can run skeletons on various containers including vectors, multidimensional arrays, and lists. T4PT4P was one of the first systems introduced for skeleton programming.[94] The system relied heavily on functional programming properties, and five skeletons were defined as higher order functions: Divide-and-Conquer, Farm, Map, Pipe and RaMP. A program could have more than one implementation, each using a combination of different skeletons. Furthermore, each skeleton could have different parallel implementations. A methodology based on functional program transformations guided by performance models of the skeletons was used to select the most appropriate skeleton to be used for the program as well as the most appropriate implementation of the skeleton.[95] Frameworks comparison
See alsoReferences
|
Portal di Ensiklopedia Dunia