The ability to compute at the terascale, with trillions of operations per second and memory capacities of trillions of bytes, has been demonstrated on selected problems. However, this achievement has required experimental hardware and software systems, great expertise, and much labor. These challenges have restricted terascale computing to a small fraction of its potential impact on the scientific enterprise. A primary TSTT goal, therefore, is to address the grid-related challenges in terascale computing and to provide a coherent, usable, scalable, and efficient set of tools that enables terascale computing for mainstream scientific applications.
Comprehensive hierarchical design of both data structures and algorithms is the essential challenge in achieving efficient use of computing resources at the terascale. Interoperability alone is not sufficient, as tools with this property may nevertheless have computational complexities and memory footprints that render them useless at the terascale; a single failure in this regard anywhere in the tool chain will incapacitate the overall system. Hence the hierarchical design concepts introduced in the interoperable meshing section must be applied throughout the entire software system in use. This has been recognized by the scientific simulation community, and the best-in-class tools in most of the relevant sub-disciplines rely on hierarchical design. Examples include CAD systems that organize data in different layers of resolution, adaptivity in meshing, multi-level graph partitioning schemes for static problem partitioning, multigrid solvers, and visualization systems that nimbly present data at many levels of resolution. It is our belief, however, that although these tools are nearly optimal for their particular applications, their union is not optimal for the overall problem of terascale simulation. Our intention is to actively consider design trade-offs across the whole system. We will not try to design better solver or visualization routines; rather, we will consider what hierarchical information, known or easily generated within our scope of geometry and meshing technology, can be exploited by downstream tools. For example, information about the geometric domain model and its decomposition may be useful in computing the optimal decompositions shown in Figure 1. Similarly, hierarchical information generated earlier in the process may be preserved and exploited in preconditioning iterative solvers.
A critical step in parallel computation is the assignment of application data to processors. The ideal distribution balances per-processor workloads across all processors (eliminating idle time) while minimizing inter-processor communication. For both structured and unstructured grids, this distribution is often computed in a serial pre-processing step using a static partitioning tool such as Chaco or METIS.
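The geometric flavor of static partitioning can be illustrated with a minimal recursive coordinate bisection (RCB) sketch. This is a simplified toy, not the Chaco or METIS implementation: real partitioners also handle weights, arbitrary part counts, and communication-volume objectives.

```python
# Minimal recursive coordinate bisection (RCB) sketch: repeatedly split the
# point set at the median of its widest coordinate until each part maps to
# one "processor".  Illustrative only -- production tools (Chaco, METIS,
# Zoltan's RCB) balance weights and minimize communication as well.

def rcb(points, num_parts):
    """Return a list of point lists, one per part (num_parts a power of 2)."""
    if num_parts == 1:
        return [points]
    # Choose the coordinate direction with the largest extent.
    dims = len(points[0])
    widths = [max(p[d] for p in points) - min(p[d] for p in points)
              for d in range(dims)]
    cut_dim = widths.index(max(widths))
    # Split at the median so both halves carry equal work.
    ordered = sorted(points, key=lambda p: p[cut_dim])
    half = len(ordered) // 2
    return (rcb(ordered[:half], num_parts // 2) +
            rcb(ordered[half:], num_parts // 2))

points = [(x, y) for x in range(4) for y in range(4)]   # 16 grid points
parts = rcb(points, 4)                                  # 4 equal parts
```

Because each cut is a median split, the part sizes are perfectly balanced here; real meshes require weighted cuts to achieve the same effect.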
For adaptive terascale computing, serial static partitioning is insufficient; dynamic partitioning techniques must be employed from the start of the simulation process. The specific dynamic load-balancing algorithms used often differ for structured and unstructured grids: graph-based algorithms are typically used for unstructured grids, while either graph-based or geometric methods may be appropriate for structured grids. Our proposal includes parallel mesh generation, construction of the discrete contributions to the global system, global iterations, and adaptation of the discretization. The procedures performing these operations are responsible for the construction and dynamic control of the parallel decomposition and its interactions with the solution processes. Although this decomposition may be independent of the other three levels, information on how operations and communications are performed within and between those levels is central to the control of the dynamic load-balancing procedures.
For the proposed work, we will use Zoltan to redistribute data after parallel meshing and adaptive mesh refinement. The Zoltan Dynamic Load-Balancing Library was developed to provide parallel, dynamic repartitioning to such applications. Zoltan's design is data-structure neutral, so it can be used in many different applications without imposing restrictions on the data structures those applications use or requiring application developers to understand Zoltan's internal data structures. This makes it easy to control the entities at the various levels of the hierarchical domain decomposition and easy to integrate with target applications. Moreover, Zoltan includes a suite of algorithms, allowing developers to experiment easily to see which is best for their applications. Zoltan currently includes geometric algorithms (Recursive Coordinate Bisection, Recursive Inertial Bisection), tree-based algorithms (Octree Partitioning/SFC, Refinement Tree Partitioning), and graph-based algorithms (through interfaces to the popular ParMETIS and Jostle packages). We will provide interfaces from the TSTT software to Zoltan so that applications using the TSTT software can use Zoltan seamlessly. We will add support to Zoltan for hierarchical machine models of heterogeneous parallel computers. These models will account for varying processor speeds, memory capacities, cache structures, and network speeds and topologies. This capability will be incorporated using RPM [TerBea00, FlaTer00] as a prototype for unstructured grids and PADRE as a prototype for structured overlapping grids. We will also add to Zoltan the specific load-balancing methods MLB developed for parallel adaptive overlapping structured grids in Overture. Our collective research will provide crucial support for adaptive methods on advanced architectures employing both distributed memory (using MPI) and shared memory (using threads).
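Zoltan achieves its data-structure neutrality through application-registered callbacks: the library queries the application for object counts, identifiers, and coordinates instead of reading the application's mesh structures directly. The sketch below shows that inversion of control in Python; the class and method names are illustrative stand-ins, not Zoltan's actual C API.

```python
# Sketch of a callback-based, data-structure-neutral balancer in the spirit
# of Zoltan: the "library" never sees the application's mesh types; it only
# invokes functions the application registers.  Names are illustrative.

class CallbackBalancer:
    def __init__(self):
        self.callbacks = {}

    def set_fn(self, name, fn):
        """Register an application callback, e.g. 'num_obj' or 'obj_coords'."""
        self.callbacks[name] = fn

    def partition(self, num_parts):
        """Assign each object to a part using only registered callbacks.
        Here: a trivial 1-D geometric split by x-coordinate order."""
        n = self.callbacks["num_obj"]()
        ids = self.callbacks["obj_ids"]()
        coords = {i: self.callbacks["obj_coords"](i) for i in ids}
        ordered = sorted(ids, key=lambda i: coords[i][0])
        size = -(-n // num_parts)            # ceiling division
        return {i: k // size for k, i in enumerate(ordered)}

# The application keeps its own mesh representation...
mesh = {"a": (0.1, 0.0), "b": (0.9, 0.0), "c": (0.5, 0.0), "d": (0.2, 0.0)}

# ...and exposes it to the balancer only through callbacks.
zz = CallbackBalancer()
zz.set_fn("num_obj", lambda: len(mesh))
zz.set_fn("obj_ids", lambda: list(mesh))
zz.set_fn("obj_coords", lambda i: mesh[i])

assignment = zz.partition(2)
```

The point of the design is visible in the last block: the balancer would work unchanged whether the application stored its entities in a dictionary, an array of structs, or a database, since only the callbacks change.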
This will facilitate investigation of optimal partitioning strategies for the linear solvers, eigensolvers, and preconditioners of interest in the target application domains. An aspect of parallel computing important to the effectiveness of dynamic load-balancing methods is the set of languages and mechanisms supporting inter-processor communication. As part of our efforts, we plan to examine the modifications our methods need in order to take advantage of advances in this area.
The RPI Partition Model
An intermediary model located between the geometric model and the mesh has been developed for FMDB to simplify parallel operations. This model, called a partition model, is independent of the Zoltan mesh-partitioning algorithm used. The partition objects are the mesh entities that are not part of the boundary of any other mesh entity. After the partition objects have been load balanced, the partition model is constructed. The partition model represents the mesh partitioning in topological form, which is used to support mesh-level inter-partition communication. Mesh entities are classified on partition model entities, which in turn are classified on model entities. An operator is provided that returns the set of processors on which a given mesh entity or partition model entity resides (many mesh and partition model entities reside on more than one processor, but each is uniquely assigned to one owning processor).
(Figure: geometric model, partition model, and partitioned mesh)
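The residence-versus-ownership distinction described above can be sketched with a small data structure: a shared entity records the set of processors on which copies reside, while exactly one processor is designated its owner. The class and the minimum-rank ownership rule here are hypothetical illustrations; FMDB's actual partition-model classes are richer.

```python
# Sketch of a partition-model entity: a mesh entity on a partition boundary
# resides on several processors but is uniquely assigned to one owner.
# Hypothetical names; FMDB's real classes differ.

class PartitionEntity:
    def __init__(self, entity_id, residence):
        self.entity_id = entity_id
        self.residence = set(residence)    # processors holding a copy
        self.owner = min(self.residence)   # one simple unique-owner rule

    def is_shared(self):
        """True if copies of this entity exist on more than one processor."""
        return len(self.residence) > 1

# A face lying on the boundary between the partitions on processors 1 and 3:
face = PartitionEntity("face_17", residence=[3, 1])

# An interior region residing on a single processor:
region = PartitionEntity("region_42", residence=[2])
```

The residence-set operator mirrors the operator described in the text: given an entity, it answers "which processors hold a copy?" while the owner field keeps updates unambiguous.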
Generic data communicators are used to share information associated with mesh entities between remote copies of the same mesh entity. They use mesh-level and entity-level containers for storage and mesh-level and entity-level iterators for access. Mesh and data migration after load balancing is performed automatically by the generic data communicators in FMDB. The amount of data communicated during entity migration can be minimized by using the partition model generated for the new mesh partitioning; the partition model also determines the processors to which a mesh entity must be migrated. The migrated data must, however, be collected in the proper form, which may require restoring downward adjacencies needed by the end-user's desired mesh representation. More information on the partition model is available in E. Seol's dissertation.
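The data-collection step before migration can be summarized in a sketch: migrating a region also means shipping whatever downward adjacencies (faces, edges, vertices) the receiver's mesh representation requires. The dictionaries and entity names below are purely illustrative, not FMDB's interface.

```python
# Sketch of collecting a migration payload: moving regions to another
# processor also requires the downward adjacencies the target mesh
# representation needs, deduplicated across the migrated regions.
# Illustrative only -- FMDB drives this from its partition model and
# generic data communicators.

# Toy one-level downward-adjacency table (region -> bounding faces);
# faces -> edges -> vertices would follow the same pattern.
downward = {
    "region_1": ["face_a", "face_b", "face_c", "face_d"],
    "region_2": ["face_c", "face_d", "face_e", "face_f"],
}

def collect_migration_set(regions_to_move):
    """Return the full entity set to send: the regions themselves plus
    their downward adjacencies, with shared faces sent only once."""
    payload = set(regions_to_move)
    for r in regions_to_move:
        payload.update(downward[r])
    return payload

payload = collect_migration_set(["region_1", "region_2"])
```

Note that the shared faces `face_c` and `face_d` appear once in the payload; minimizing exactly this kind of duplication is what the partition model enables during migration.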