Several areas in linear algebra must be addressed to allow wide use of parallel computing. These areas include the development of LAPACK for distributed-memory machines, dense nonsymmetric eigenvalue problems, parallel algorithms for large-scale eigenvalue problems, sparse linear least squares, multigrid algorithms, sparse linear systems, and linear algebra for signal processing.

*
Jack Dongarra holds a joint appointment with the University of Tennessee
(UT) and Oak Ridge National Laboratory (ORNL) under the UT/ORNL Science
Alliance Program. He specializes in numerical algorithms in linear algebra,
parallel computing, use of advanced computers, programming methodology, and
tools for parallel computers. Current research also involves the
development, testing, and documentation of high-quality mathematical
software. He was involved in the design and implementation of the software
packages EISPACK, LINPACK, the BLAS, LAPACK, and PVM and is currently
involved in the design of algorithms and techniques for high-performance
computer architectures. He is also co-editor-in-chief of the International
Journal of Supercomputer Applications and an editor for Parallel Computing,
the Journal of Distributed and Parallel Computing, the Journal of
Supercomputing, IMPACT of Computing in Scientific Applications,
Communications of the ACM, Research Monographs on Parallel and Distributed
Computing, and the Journal of Numerical Linear Algebra with Applications.
*

Previous efforts have focused on the development of a set of linear algebra communication routines for the specification and manipulation of distributed data structures in algorithms. The result has been a proposed standard for the implementation of a scalable communication library, which is the first step in seeking a community consensus on such standards.

The project provides parallel hardware vendors with a clearly defined base set of routines that can be efficiently implemented. As a result, the vendors can build upon this collection of standard low-level routines to create higher-level routines for the distributed-memory communication environments supplied with their parallel machines. Researchers are developing a simple-to-use interface that is powerful enough to allow programmers to use the high-performance message-passing operations available on advanced machines.

In an effort to create a "true" standard for message passing, researchers incorporated into the message-passing interface the most useful features of several systems. Rather than choosing one system to adopt as a standard, features were used from systems by IBM, Intel, nCUBE, Express, p4, and PARMACS. The message paradigm will be attractive because it is widely portable and can be used in communications for distributed-memory and shared-memory multiprocessors, networks of workstations, and any combination of these elements. The paradigm will not be made obsolete by increases in network speeds or by architectures combining shared and distributed-memory components.

The Arnoldi process is a well-known technique for approximating a few eigenvalues and corresponding eigenvectors of a general square matrix. Researchers have developed ARPACK, a software package based upon an implicit restarting scheme that may be viewed as a truncation of the standard implicitly shifted QR-iteration for dense problems. Numerical difficulties and storage problems normally associated with Arnoldi and Lanczos processes are avoided. The algorithm is capable of computing a few (k) eigenvalues with user specified features such as largest real part or largest magnitude using n*2k + O(k^2) storage. No auxiliary storage is required. A set of Schur basis vectors for the k-dimensional eigenspace is computed that is numerically orthogonal to working precision.

ARPACK, which is based upon an implementation of this algorithm, has been designed to be efficient on a variety of high-performance computers. Parallelism within the scheme is obtained primarily through the matrix-vector operations that comprise the majority of the work in the algorithm. The software is capable of solving large-scale symmetric, nonsymmetric, and generalized eigenproblems from significant application areas. ARPACK has been used to solve problems from quantum chemistry, reactive scattering, stability of fluid flow, design of dielectric wave guides, and structural analysis with some problems as large as 250K degrees of freedom.

Continuing research activities involve adapting these techniques to the solution of large (sparse) nonsymmetric and symmetric-indefinite linear systems. Researchers are also developing preconditioning techniques for eigenvalue calculations. The use of preconditioners has proven highly effective in accelerating iterative methods for sparse linear systems; however, it has not received the attention it deserves in the context of large scale nonsymmetric eigenvalue problems. Polynomial preconditioning has been one popular alternative for eigenproblems, but recent work has indicated that other kinds of preconditioners like relaxation, incomplete factorization, or multigrid might also be very useful in solving some nonsymmetric eigenvalue problems.

Among the important design goals of ScaLAPACK++ are scalability, portability, flexibility, and ease-of-use. Such goals present serious challenges, as the layout of an application's data within the hierarchical memory of a concurrent computer is critical in determining the performance and scalability of the parallel code. To enhance the programmability of the library, details of the parallel implementation will be hidden as much as possible, while providing the user with the capability to control the data distribution. The ScaLAPACK++ release will include a general two-dimensional matrix decomposition that supports the most common block/scattered schemes currently used. ScaLAPACK++ can be extended to support arbitrary matrix decompositions by providing the specific parallel BLAS library to operate on such matrices.

Decoupling the matrix operations from the details of the decomposition not only simplifies the encoding of an algorithm but also allows the possibility of postponing the decomposition until runtime. Often, the optimal matrix decomposition is dependent on how the matrix is utilized in other parts of the driving application code. Furthermore, it may be necessary to dynamically alter the matrix decomposition at runtime to accommodate special routines.

The currently supported decomposition scheme defines global matrix objects that are distributed across a logical grid of processors. Matrices are mapped to processors using a block scattered class of decompositions that allows a wide variety of matrix mappings while enhancing scalability and maintaining good load balance for various dense factorization algorithms. At each node of the multi-computer, there is a sequential LAPACK++ library that provides the object-oriented framework to describe block algorithms on conventional matrices in each individual processor.

Parallelism is exploited through the use of distributed-memory versions of the Basic Linear Algebra Subprogram (BLAS) that perform the basic computational units of the block algorithms. Thus, at a higher level, the block algorithms used look the same for the parallel or sequential versions, and only one version of each needs to be maintained.