As a member of the Parallel Programming Lab (PPL), I work on the Charm++ programming model and runtime system. Charm++ is an object-based parallel programming model, mainly used in the construction of high-performance computing (HPC) applications (usually involving scientific simulations). I'm involved with a variety of efforts within PPL. However, my particular area of focus is enabling the execution of applications on heterogeneous systems through various enhancements and modifications to the programming model, runtime system, and build process.

In addition to working on Charm++ itself, I also work on one of the applications that Charm++ has been used to create, called NAMD. NAMD is used throughout the world to simulate biomolecular systems at the atomic scale using classical mechanics. Scientists have used NAMD for genome sequencing, understanding anesthetics, combating bird flu, understanding the brain, pinpointing the causes of Parkinson's and Alzheimer's, and much more.


My research interests focus mainly on programming model and runtime system support for heterogeneous systems. In other words, what role can (and should) programming languages and their associated runtime systems play in helping programmers create applications that target heterogeneous systems? Interest in heterogeneous systems, especially for computation-intensive codes, has grown in recent years since power limitations effectively ended the clock speed race. We keep the term "heterogeneous" as open as possible. It may range from systems with simple differences, such as the amount of RAM per node in a cluster, to more complex systems that include multiple host core architectures along with multiple accelerator technologies. Currently, the work focuses on supporting Cell and MIC; however, we are also looking into methods for extending support to GPGPU hardware.

Please note that, because this research is in the context of the Charm++ programming model, the remainder of this discussion assumes that the reader is somewhat familiar with that model. If this is not the case, please see the publications section below. The paper "Towards a Framework for Abstracting Accelerators in Parallel Applications: Experience with Cell" has a good introduction to the research along with a brief high-level introduction to the Charm++ programming model.

We are extending the Charm++ programming model and modifying the Charm++ runtime system to support accelerator technologies and heterogeneous clusters in general. In short, we have introduced accelerated entry methods into the Charm++ programming model. Entry methods, in general, can be thought of as tasks. Accelerated entry methods are entry methods that may or may not execute on an accelerator. The underlying runtime system then takes care of automatically moving data as required to the core, whether host or accelerator, that is tasked with executing the entry method. Entry methods, including accelerated entry methods, and data movement all occur asynchronously under the direction of the runtime system. Given the clear boundaries between the entry methods, we have further modified the runtime system to handle some of the mundane details of executing an application on a heterogeneous system. For example, with knowledge of the data types, array lengths, and so on that make up the application's data, the runtime system can modify the data to correct for architecture differences, such as endianness, as data passes between cores. In addition to accelerated entry methods, we have also introduced accelerated blocks and a SIMD Instruction Abstraction. For more details on our research, please see the publications section for relevant papers.
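As a rough illustration of the endianness correction described above, the following C++ sketch reverses the bytes of each element in a buffer, given a description of the fields it contains. It is not the Charm++ implementation or API: the names FieldDesc and swapEndianness are hypothetical, and the sketch assumes the data layout (offset, element size, element count) is already known for every field.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one field in a data buffer: where it starts,
// how wide each element is, and how many elements it holds.
struct FieldDesc {
    std::size_t offset;    // byte offset of the field within the buffer
    std::size_t elemSize;  // size of one element in bytes (e.g. 4 for float)
    std::size_t count;     // number of elements in the field
};

// Reverse the byte order of every element of every described field, in place.
// Applying the function twice restores the original buffer.
inline void swapEndianness(std::vector<std::uint8_t>& buf,
                           const std::vector<FieldDesc>& fields) {
    for (const FieldDesc& f : fields) {
        // The field must lie entirely inside the buffer.
        assert(f.offset + f.elemSize * f.count <= buf.size());
        for (std::size_t i = 0; i < f.count; ++i) {
            std::uint8_t* elem = buf.data() + f.offset + i * f.elemSize;
            std::reverse(elem, elem + f.elemSize);
        }
    }
}
```

Because the routine only needs element sizes and counts, a runtime system that knows this layout could apply such a pass as data crosses between cores of different byte order, without any involvement from the application code.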

When programming for accelerator technologies, it is quite common for programmers to have to include architecture-specific code within their application code. This not only increases the burden placed on programmers, who have to structure their application towards a specific type of core, but also decreases the portability of the code itself. Our modifications to the Charm++ programming model and runtime system help to divorce the application code from the architecture-specific details. However, it is clear that these architecture-specific details are important, especially when it comes to the performance of an application running on the given architecture. Thus, a balance must be struck between achieving good performance and relieving the programmer of these details.

Perhaps more importantly, given a unified programming model and portable code, the runtime system can start doing some more interesting things on the programmer's behalf. One such activity is automatic dynamic load balancing. Given a heterogeneous application (that is, an application with multiple different calculations going on, with task variations within a given calculation), spreading the application's workload across the available cores, host and accelerator alike, may not be straightforward for the programmer to do (especially at compile time). The Charm++ load balancing framework already makes runtime measurements to load balance applications executing on homogeneous clusters. This research intends to extend this work to load balancing on heterogeneous systems by having the runtime system dynamically migrate work between the host cores and any available accelerators.
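To make the idea of measurement-based balancing across unequal cores more concrete, here is a minimal C++ sketch of a greedy assignment. It is not the Charm++ load balancing framework: the function greedyAssign and its parameters are hypothetical, and it makes the simplifying assumption that each core can be summarized by a single relative speed, whereas a real task may run disproportionately faster or slower on an accelerator. It also computes a single placement, while a dynamic balancer would repeat the measurement and placement as the application runs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Greedy placement: consider the heaviest task first and give each task to
// the core that would finish it earliest, given that core's speed and the
// work already assigned to it.
//   taskWork[i]  : measured work of task i (arbitrary units)
//   coreSpeed[c] : assumed relative speed of core c (work units per second)
//   returns      : assignment[i] = index of the core chosen for task i
inline std::vector<std::size_t> greedyAssign(const std::vector<double>& taskWork,
                                             const std::vector<double>& coreSpeed) {
    assert(!coreSpeed.empty());

    // Visit tasks in order of decreasing measured work.
    std::vector<std::size_t> order(taskWork.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return taskWork[a] > taskWork[b];
                     });

    std::vector<double> finish(coreSpeed.size(), 0.0);  // projected busy time per core
    std::vector<std::size_t> assignment(taskWork.size(), 0);

    for (std::size_t t : order) {
        std::size_t best = 0;
        double bestTime = finish[0] + taskWork[t] / coreSpeed[0];
        for (std::size_t c = 1; c < coreSpeed.size(); ++c) {
            double time = finish[c] + taskWork[t] / coreSpeed[c];
            if (time < bestTime) {
                best = c;
                bestTime = time;
            }
        }
        finish[best] = bestTime;
        assignment[t] = best;
    }
    return assignment;
}
```

For example, with measured work {8, 4, 2, 2} and one host core of speed 1 alongside one accelerator of speed 4, the sketch places three tasks on the accelerator and one of the small tasks on the host core, rather than splitting the tasks evenly by count.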