Research Directions

We have started or continued work in a number of areas, including topology-aware communication in MPI, communication/computation overlap and message progression, high-performance networking for data centers, collective offloading, MPI message queues, hybrid CPU/GPU computing, the RMA interface, QoS-aware message passing, RDMAoE and converged networking, congestion-control-aware communication, and the interaction between network stacks and multi-core systems, among others. We will report on these in the near future.

Other areas of research include, but are not limited to, the following (in no particular order):

Evaluation of High-Performance Interconnects and Their Impact on System Performance

Several high-performance interconnects have recently been introduced. It is important for the research community to have a systematic assessment of the new features and performance of such interconnects. Such an assessment allows one to design efficient communication layers on top of these interconnects and to devise better parallel algorithms and applications. We have evaluated the Myrinet-2000 networks (HSLN-2004, NCA-2004), the Sun Fire Link interconnect (PDCS-2003, HPCS-2004, IJHPCN-2006), the Quadrics QsNetII networks (CAC-2006), and Mellanox InfiniBand, Myrinet-10G and NetEffect 10-Gigabit iWARP Ethernet (CAC-2007, HotI-2007) at the user, Message Passing Interface (MPI), and application levels. Our work on converged VPI fabrics and 10GigE for data centers is currently under review.

Efficient Point-to-Point and Collective Communications on Single-Rail/Multi-Rail Clusters

The efficient design and implementation of point-to-point and collective communication algorithms is one of the keys to the performance of clusters. We have designed efficient collective communication algorithms for multi-rail Quadrics QsNetII clusters (CAC-2006, HPCS-2007). We have devised gather, scatter, all-to-all broadcast, and all-to-all personalized exchange on such systems using the advanced features of QsNetII such as Remote Direct Memory Access (RDMA). We have extended our work to SMP-aware collectives (ICPP-2007, Cluster Computing-2008). Recently, we have proposed multi-connection and multi-core aware collectives on top of InfiniBand (PDCS-2008), as well as process-arrival-pattern-aware collectives (EuroPVMMPI 2009). In the past, we devised collectives for reconfigurable optical interconnects (PPL-2002, OPODIS-97).

Design and Evaluation of Speculative Communication Subsystems and Network-Aware Architectures

One approach to dealing with communication latency is to tolerate it; that is, to hide the latency from the processor’s critical path by overlapping it with other high-latency events or computations. Although other researchers have proposed prediction techniques for future sharing patterns and coherence activities in shared-memory systems, we were the first to propose a number of novel prediction techniques for the sender side (IPDPS-99) and receiver side (HPCS-2002, CCPE-2002, CANPC-00) in message-passing systems. Different pattern-based predictors dynamically learn the application behaviour and predict future communications at both ends. Thus, communication latency can be hidden or reduced by accurately predicting, and performing or preparing, the communication operations in advance. We have recently proposed a speculative Rendezvous protocol for MPI (HPCS-2008, IJPP-2009) that maximizes communication/computation overlap and communication progress. Our work on bypassing message copies in RDMA-based interconnects has been published in CAC-2009. An extended version of this work, covering both one-sided and send-recv based communication, is being prepared for journal submission. Work on asymmetric communication is under way.
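As a rough illustration of the general idea of a pattern-based predictor (a minimal sketch only, not the predictors from the cited papers), the following Python snippet remembers which message source followed each recent history of sources and predicts the most frequent successor. The class name `PatternPredictor`, the helper `hit_rate`, and the `history_len` parameter are hypothetical names introduced for this example.

```python
from collections import Counter, defaultdict, deque


class PatternPredictor:
    """Hypothetical history-based predictor of the next message source."""

    def __init__(self, history_len=2):
        # Sliding window of the most recently observed sources.
        self.history = deque(maxlen=history_len)
        # Maps a history tuple to counts of the sources that followed it.
        self.table = defaultdict(Counter)

    def predict(self):
        """Return the most frequent successor of the current history, or None."""
        counts = self.table.get(tuple(self.history))
        if not counts:
            return None
        return counts.most_common(1)[0][0]

    def update(self, actual):
        """Record the source that actually arrived, then slide the window."""
        self.table[tuple(self.history)][actual] += 1
        self.history.append(actual)


def hit_rate(stream, history_len=2):
    """Fraction of entries in `stream` that were predicted correctly."""
    predictor = PatternPredictor(history_len)
    hits = 0
    for src in stream:
        if predictor.predict() == src:
            hits += 1
        predictor.update(src)
    return hits / len(stream)
```

On a strictly periodic stream such as `[1, 2, 3] * 10`, the predictor misses only during warm-up, which is the kind of regularity a receiver could exploit to prepare communication operations in advance. Real applications are less regular, so the hit rate depends on the history length and on the application itself.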

Performance, QoS, and Network Virtualization for Data Centers

We have started looking into different mechanisms to boost the performance of modern data centers on top of InfiniBand, iWARP, and 10-Gigabit Ethernet. Our first work in this area is on QoS provisioning in InfiniBand socket protocols (SDP and IPoIB) used in data center applications (P2S2-2008). We have also shown the effectiveness of converged fabrics for data centers. Work is under way on enhancing the iWARP protocol for data centers.

Power and Performance Efficiency for Multi-Threaded Workloads on SMT/CMP-based SMP Servers

Shared-memory multiprocessors (SMPs) are the backbone of SMP clusters. Simultaneous multithreading (SMT), chip-multiprocessing (CMP) and chip multithreading (CMT) SMPs are the emerging architectures. Meanwhile, OpenMP has emerged as the standard for parallel programming on SMPs. We are interested in how multi-threaded OpenMP workloads behave on hybrid SMT-based/CMP-based/CMT-based SMPs in terms of power and performance. We have recently proposed that asymmetric multiprocessors coupled with efficient scheduling algorithms could effectively reduce the power consumption of commercial multiprocessors while sustaining performance (HPPAC-2006, CCPE-2009). Another study analyzes the architectural bottlenecks and identifies the impact of the operating system on the performance of OpenMP applications, including the NAS OpenMP benchmarks and SPEComp2001 applications, on Hyper-Threaded Intel SMPs (IOSCA-2005). We have extended our work to the emerging multi-way dual-core SMP platforms (MTAAP-2007). Our earlier work looked at the performance of OpenMP applications on a large symmetric multiprocessor (ICS-2003).

Power-Aware, High-Performance Computing

Because the storage, networking, computing, and cooling components of a large-scale cluster consume a significant amount of power, power/energy conservation has become an increasingly critical issue. Recent studies have shown that power consumption represents a significant fraction of costs in high-performance computing (HPC) and data centers. Energy consumption is critical because it also affects the cost of cooling, uninterruptible power supplies and backup power generation. One should also take into account the energy for air circulation and power delivery. Therefore, it is vital for researchers to come up with novel ideas to manage power and energy consumption in such clusters. We have been working on devising power-aware schemes and run-time libraries to deliver significant power and energy savings while sustaining high performance when running MPI applications over clusters (Cluster-2007).

Workload Characterization of MPI Scientific Applications

Communication performance is an important factor that affects the performance of parallel applications.  A proper understanding of the communication patterns of message-passing applications will help application developers to maximize their application performance in a given environment.  It will also help system designers to come up with better communication architectures as well as optimized MPI libraries in the future.  We have quantified the characteristics of applications in the NAS-MZ and the SPEChpc2002 suites (PDCS-2005, IPDPS-99, CANPC-00).

Impact of System Noise on the Performance of HPC Applications

System noise, including operating system noise, has been shown to be one of the impediments to the performance of HPC applications on clusters. We have been quantifying the sources of such noise and trying to minimize its impact. The work on efficient OS noise-free scheduling for asymmetric multiprocessor systems is our initial work in this area (HPPAC-2006).
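One common way to get a first impression of system noise is a fixed-work-quantum measurement: the same small piece of work is timed many times, the fastest run is taken as an approximation of the undisturbed time, and anything above it is attributed to interference. The Python sketch below illustrates only that idea and is not the methodology of the cited work; the function names and default parameters are assumptions made for this example, and in Python the interpreter itself contributes to the measured variation.

```python
import time


def fixed_work(iterations):
    """Perform a fixed amount of arithmetic so every sample does equal work."""
    acc = 0
    for i in range(iterations):
        acc += i * i
    return acc


def sample_durations(samples=200, iterations=20000):
    """Time the same work quantum repeatedly; returns durations in nanoseconds."""
    durations = []
    for _ in range(samples):
        start = time.perf_counter_ns()
        fixed_work(iterations)
        durations.append(time.perf_counter_ns() - start)
    return durations


def noise_summary(durations):
    """Treat the fastest sample as the baseline and report the excess over it."""
    baseline = min(durations)
    excess = [d - baseline for d in durations]
    return {
        "baseline_ns": baseline,
        "mean_excess_ns": sum(excess) / len(excess),
        "max_excess_ns": max(excess),
    }
```

Calling `noise_summary(sample_durations())` gives a coarse picture of how far individual runs stray from the best case on the machine at hand. A serious study would use a compiled language, pinned processes, and a low-overhead cycle counter.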

Design and Evaluation of Low-Level and High-Level Communication Protocols/Layers

The objective is to develop or optimize middleware and communication libraries on top of high-performance networks. Contemporary networks have some excellent features that enable one to design optimal low-level communication protocols and layers such as OpenFabrics, GM, MX, Tports, ELAN, GASNet, OpenIB, iWARP and IP, as well as efficient middleware such as MPI, SDP, kDAPL, uDAPL, ARMCI, sockets, OpenMPI, put/get, Global Arrays, VIA, MPI-IO, DSM, OpenMP and PVFS, among others.