Fortran DVM — a Language for Portable Parallel Program Development
N.A.Konovalov, V.A.Krukov, S.N.Michailov, A.A.Pogrebtsov
Keldysh Institute of Applied Mathematics
Russian Academy of Sciences
This report introduces a new language model that allows one to express functional and data parallelism in scientific and engineering applications on massively parallel computers.
The model combines the major advantages of the PCF Fortran and HPF models and is designed for the development of portable applications that can be efficiently executed on distributed- and shared-memory computers. The model supports the development of libraries of standard parallel programs and the construction of parallel applications from parallel modules. It also provides means for dynamic load balancing.
Key Words: Portable Parallel Programs, Massively Parallel Computers, Distributed Memory, Shared Memory, Load Balancing.
Table of contents
3.1. The Abstract Parallel Machine (APM) Design
3.2. Development of a Program for APM
3.3. Declaration of Data Mapping
3.4. Managing Access to Non-Local Data
3.5. The Specification of Mapping of the APM onto Virtual Parallel Machine (VPM)
The Keldysh Institute of Applied Mathematics is developing the programming language Fortran DVM (Distributed Virtual Memory) and related tools as part of the RAMPA project [Kr93]. The main goal of this project is to automate the development of portable scientific and engineering applications for massively parallel computers with distributed memory.
The development of application programs for distributed systems encounters several obstacles, of which portability is the major one. We would like to emphasize two aspects of this problem:
- Concurrent programs developed with conventional language tools for distributed systems require serious effort to port to computers with different architectures (sequential machines, SIMD processor arrays, MIMD shared-memory machines) or different configurations.
- Two further problems that require solutions are building concurrent programs from simple modules and accumulating a library of common concurrent programs. Until the portability problem is solved, distributed systems are not likely to be widely used.
The main goal of the Fortran DVM (FDVM) language is to provide portability of parallel programs. Besides that, FDVM must support the development of efficient programs for massively parallel computers, where performance is determined mainly by load balancing and by the overheads of interprocessor communication and synchronization.
The concept of FDVM language is based on the following main ideas:
* The execution of a parallel program must correspond to its potential parallelism, explicitly declared by the user by means of special directives or parallel constructs. Even when enhanced with data-distribution specifications, automatic parallelization has a principal defect: there is no explicit and clear model of parallel execution. This makes it impossible to provide the user with convenient tools for analyzing the execution and improving the performance of a program.
* We believe that distributed shared-memory computers are going to become the most popular among massively parallel systems. Therefore, to describe the potential parallelism of a program it is natural to use the means suggested in the PCF Fortran standard [PCF90], which represent the extensive experience of using multiprocessor computers with shared memory. Certainly, it is necessary to adapt these means to distributed-memory computers.
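As an illustration, a loop with independent iterations could be marked with a PCF-style parallel construct. The sketch below follows the spirit of the PCF Fortran draft; the exact keywords are indicative only and are not a fixed part of FDVM.

```fortran
C     A parallel loop in the spirit of PCF Fortran (indicative
C     syntax): the iterations are declared independent and may
C     be executed by different processors.
      PARALLEL DO I = 1, N
        A(I) = B(I) + C(I)
      END PARALLEL DO
```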
* In the case of non-uniform-memory computers (among them distributed systems with logically shared but physically distributed memory) it is necessary to provide the user with means to determine the data mapping and to distribute the computations among the processors. The means suggested in the HPF language [HPF93] are proper for this goal.
* In the case of computers with non-uniform memory it is also imperative to provide the user with means to control access to virtual memory. Such means were suggested and successfully used to control virtual memory in Fortran programs for the BESM-6 computer [Kon77].
* Finally, it is necessary to provide the user with means for dynamic load balancing.
For simplicity, let us suppose that programming in Fortran DVM proceeds in several stages, as follows.
3.1. The Abstract Parallel Machine (APM) Design

The description of the parallel machine is necessary for the user to control the mapping of computations onto processors. In the case of a distributed system the APM is also needed to control data location.
Since PCF Fortran is geared to uniform multiprocessor computers with shared memory, the user can only specify the number of processors that execute specific parts of a program.
In the case of distributed systems, mapping computations onto processors demands more accurate planning and cannot be performed as dynamically as on shared-memory computers.
The abstract parallel machine is introduced to give the user an opportunity to specify thoroughly the internal potential parallelism of a program. The APM is a hierarchy of abstract parallel subsystems, each of which is a multidimensional array of subsystems of the next level of the hierarchy. Several alternative descriptions of each subsystem may exist simultaneously.
The APM may be described statically or constructed dynamically during program execution. Dynamic construction is necessary for working with dynamic arrays, as well as for developing libraries of standard parallel programs that are stored as object modules and do not require special preliminary configuration.
3.2. Development of a Program for APM

To develop a program, the following model of its execution is used.
At the beginning of program execution there is only one branch (control flow), which starts on the APM from the first program statement. On entering a parallel construct (a parallel loop or a group of parallel sections) the branch is divided into several parallel branches, each executed on a specified APM subsystem. The branching is carried out according to the specified variant of representing the current subsystem as an array of subsystems of the next level of the hierarchy.
On exit from the parallel construct all branches join into the single branch that existed before entering the construct. At this moment all shared variables updated by the parallel branches become available to all processors executing this branch.
To avoid formal data dependences between parallel branches, which are an obstacle to independent execution, some variables (and specially declared common blocks) may be declared private. In this case an uninitialized copy of these variables is created in each parallel branch and used independently of the others. Declaring data private also reduces the scope of data availability and eliminates needless work to keep the data consistent.
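A private temporary in a parallel loop might look as follows (indicative PCF-style syntax; the placement of the PRIVATE declaration in FDVM may differ). Each branch gets its own uninitialized copy of T, so no consistency work is needed for it.

```fortran
C     T is private: every parallel branch works with its own
C     uninitialized copy, so the branches stay independent.
      PARALLEL DO I = 1, N
      PRIVATE T
        T = B(I) * B(I)
        A(I) = T + C(I)
      END PARALLEL DO
```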
A mode of access to shared data in parallel branches must be specified. The following modes are available:
- read only access;
- distributed access (for example, different loop iterations use different elements of the array). In this case only implicit barrier synchronization on exit from the parallel construct is required;
- random access.
In the last case, explicit synchronization of the parallel branches should be used. Synchronization is available in several forms: events, arithmetic sequence synchronizers, and critical sections (as in PCF Fortran).
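For random access to shared data, a critical section serializes the conflicting updates. The sketch below uses the PCF-style critical section construct; the exact spelling is indicative only.

```fortran
C     Random access to the shared variable S: the update is
C     guarded by a critical section, so only one branch at a
C     time executes it.
      PARALLEL DO I = 1, N
        CRITICAL SECTION
          S = S + A(I)
        END CRITICAL SECTION
      END PARALLEL DO
```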
Now let us determine the meaning of the phrase “a branch is executed on a subsystem”. These words can be interpreted in two ways, depending on which architecture of the APM the user needs.
In the case of a shared-memory machine, the sequential parts of the program (before entering a parallel construct and after exiting it) are executed on the base processor of the subsystem. In the case of a distributed-memory machine, each processor of the subsystem executes the same sequence of statements, so that each has its own copy of the variables.
3.3. Declaration of Data Mapping

In the case of a distributed system the user must determine the data location in the local memories of the APM processors. This is done by aligning arrays with one another and mapping them onto APM subsystems (cf. ALIGN in HPF). Collapsed mapping (when many elements of an array are mapped to the same subsystem) and replicated mapping (when one element of an array is mapped to many subsystems) are also allowed.
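In HPF notation, to which the FDVM alignment means are analogous, distribution, alignment, and replicated mapping might be declared as follows (the processor arrangement P and the array names are hypothetical examples):

```fortran
      REAL A(100,100), B(100), C(100)
!HPF$ PROCESSORS P(4)
C     Distribute the rows of A in blocks over four processors.
!HPF$ DISTRIBUTE A(BLOCK,*) ONTO P
C     Align B(I) with A(I,1): B(I) is mapped wherever row I
C     of A lives.
!HPF$ ALIGN B(I) WITH A(I,1)
C     Replicated mapping: C(I) is mapped onto every subsystem
C     holding any element of row I of A.
!HPF$ ALIGN C(I) WITH A(I,*)
```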
Data replication is used when it is more efficient to repeat a calculation on different processors than to transfer the data between them. By default, all data not explicitly mapped by the user are duplicated on all processors of the subsystem executing the branch that owns these data (the data are private to this branch).
3.4. Managing Access to Non-Local Data

The user can indicate that, when certain arrays are accessed, access to virtual memory should be performed by preliminary bulk transfers (after the required values have been computed by the corresponding processors) rather than by element-wise transfers (on demand from individual processors).
In the case of pipelined computations, the programmer must find a balance between the number of transfers and synchronization events on the one hand and the throughput of the pipeline on the other.
The user can also specify the buffering mode for non-local variables in processor memory, which reduces interprocessor data exchange during repeated accesses.
3.5. The Specification of Mapping of the APM onto Virtual Parallel Machine (VPM)

A Virtual Parallel Machine (VPM) is the machine that the architecture and underlying system software provide to the task. The structure and number of processors of this machine should be close to those of the real parallel computer or a part of it. For a distributed computer, an MPI machine [MPI93] could be such an example.
The VPM, like the APM, may be represented as a hierarchy of subsystems. In this case each subsystem must be precisely defined (by the numbers of all its virtual processors) and must include only processors that belong to its parent subsystem.
Mapping is a correspondence between the APM and the VPM in which each subsystem of the APM is mapped onto a subsystem or a processor of the VPM. We suggest providing three ways of specifying such a mapping:
- Mapping may be specified by language constructs processed during compilation of the program (cf. DISTRIBUTE in HPF). This way is the least flexible, but the most efficient in terms of overhead.
- Mapping may be specified through the programming environment (for example, as a UNIX command-line argument).
- Mapping may be performed dynamically during program execution by special statements (cf. REDISTRIBUTE in HPF).

This last way of mapping, together with allocatable and assumed-shape arrays and means of dynamic creation of the APM and VPM, is necessary for dynamic load balancing.
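In HPF terms, the static and dynamic variants might look as follows (hypothetical array and processor names; FDVM's own statements are analogous rather than identical):

```fortran
      REAL A(1000)
!HPF$ PROCESSORS P(8)
C     An array to be remapped at run time must be declared
C     DYNAMIC in HPF.
!HPF$ DYNAMIC A
C     Initial mapping, fixed at compile time.
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
C     Later, remap dynamically, e.g. when the load becomes
C     unbalanced:
!HPF$ REDISTRIBUTE A(CYCLIC) ONTO P
```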
Let us compare the FDVM model with other models that can be used on distributed systems.
Message-passing models are currently the most widely used. In these models the user specifies the precise mapping of data and computations onto the processors and arranges processor cooperation by message passing. Although this approach seems to grant maximum execution efficiency, this is not always the case:

- There are no built-in reduction functions in most of these models, including PVM [Gei93], though such functions exist in MPI and EXPRESS [EXP88]; in general, their optimal implementation at the application level is impossible.
- It is difficult to organize efficient file access in an application if the distributed system has many channels to disks.
- The efficiency of a program on massively parallel computers depends on load balancing, which is extremely difficult to provide in these models.
Although these models are quite convenient for expressing functional parallelism, data parallelism poses serious problems because of the lack of a global address and name space. The main defect of these models is the rigid binding of a program to the hardware architecture, including the number of processors and their interconnection. Developing libraries of standard programs is practically impossible.
The unquestionable advantage of these models is that no special compilers need to be developed, and the models can be used in conventional languages extended by library functions.
Shared-memory models use a global address space. PCF Fortran and Fortran S [Bod93] are the most characteristic among them.
The PCF Fortran standard notes that this model is suitable not only for shared-memory computers but also for distributed-memory computers. However, this is mostly true for systems such as the CRAY T3D and CONVEX SPP1000 (distributed shared-memory computers), which provide access to non-local data nearly as fast as to local data. Using the model on the majority of distributed systems is practically impossible owing to the lack of means to control the mapping of data and computations.
The Fortran S model provides the user with means of mapping computations onto processors. Unfortunately, the use of software-simulated virtual memory without means to control it inevitably reduces the efficiency of serious applications.
The advantage of these models, as of the FDVM model, is relatively simple compilation.
The HPF model uses a global name space and user control over the mapping of data onto the local memories of processors. The model supports data parallelism but does not support functional parallelism.
There are also two principal differences in the concepts of Fortran DVM and High Performance Fortran languages.
First, a user of FDVM can fully control the distribution of computations among processors, while a user of HPF is not even allowed to know how a particular compiler will perform this distribution.
Second, a user of FDVM can fully control the transfer of data between processors and data buffering. A user of HPF has to rely on the “smartness” of the compiler in these matters.
The main effect of the differences mentioned above is that, at the cost of certain additional effort, the user of FDVM can always make a program as efficient as it would be with conventional means of parallelization and message passing, and this does not involve any decrease in portability.
The following means of tuning the performance of FDVM programs, absent in HPF, are provided.
Means of processor load balancing:
- possibility to define groups of processors, varying in shape and composition;
- possibility to map loop iterations and parallel sections onto specified processors.
Means of optimizing interprocessor communication:
- definition of data portion sizes;
- definition of buffer sizes and buffering modes.
The differences in approach between FDVM and HPF have led to different views on the role of compile-time optimizations.
The lack of an optimizing FDVM compiler does not prevent the user from achieving the required efficiency, although such a compiler might in certain cases free the user from providing the above annotations.
We believe that the success of the FDVM language depends on the successful development of the FDVM model. On the one hand, this model must meet the portability and efficiency requirements; on the other hand, it must be clear to an application programmer and relatively easy to implement. The developers therefore focused on designing the model, which was significantly modified three times during 1993-1994.
At present, the design of the run-time system (LIB-DVM) for this model is close to completion. This run-time system is the backbone of the FDVM language implementation.
The next stage of FDVM development will be the manual programming of several applications (probably using a simple experimental compiler-preprocessor) in Fortran 77 or C extended with LIB-DVM means.
Only after the completion of this stage will the formal FDVM language be described. If by that time means for functional parallelism and dynamic load balancing have been introduced into the HPF language, this will be reflected in the FDVM language.
This work is supported in part by the Russian Foundation for Fundamental Research, grant No. 93-012-628.
- [Kr93] V.A.Krukov, L.A.Pozdnjakov, I.B.Zadykhailo. RAMPA — CASE for portable parallel programs development. Proceedings of the International Conference “Parallel Computing Technologies”, Obninsk, Russia, Aug 30-Sept 4, 1993.
- [PCF90] PCF Fortran. Version 3.1, August 1, 1990.
- [HPF93] High Performance Fortran Language Specification. Version 1.0, January 25, 1993.
- [Kon77] N.A. Konovalov, V.A. Krukov, E.Z. Ljubimskij. Controlled virtual memory. Programming 1, 1977 (In Russian).
- [MPI93] MPI: A Message Passing Interface. The MPI Forum, August 1993.
- [Gei93] A.Geist, A.Beguelin, J.Dongarra, W.Jiang, R.Manchek, V.Sunderam. PVM 3 User’s Guide and Reference Manual. Oak Ridge National Laboratory ORNL/TM-12187, May 1993.
- [EXP88] EXPRESS: A Communication Environment for Parallel Computers. ParaSoft Corporation, Release 2.4, 1988.
- [Bod93] F.Bodin, L.Kervella, T.Priol. Fortran S: a Fortran Interface for Shared Virtual Memory Architectures. Int. Conf. on Supercomputing, 274-283, July 1993.