Technical Program



Saturday, 7 June 2008
Full day 1st International Workshop on Super Visualization
Meeting Room: Acesso
Full day International Workshop on Scalable Tools for High-End Computing
Meeting Room: Melambus


Sunday, 8 June 2008
Full day Tutorial: Programming emerging architectures: GPUs and multicores
Meeting Room: Melambus


Monday, 9 June 2008
08:30-08:45 Opening
08:45-09:45 Keynote 1: Many-core GPU Computing with NVIDIA CUDA
Mark Harris, NVIDIA
09:45-10:00 Break
10:00-11:30 Session 1: Algorithms and Applications I

Implementing Wilson-Dirac Operator on the Cell Broadband Engine
Khaled Z. Ibrahim (IRISA/INRIA, Rennes, France)
F. Bodin (IRISA/INRIA, Rennes, France)

Biomedical Image Analysis on a Cooperative Cluster of GPUs and Multicores
Timothy D.R. Hartley (Department of Biomedical Informatics and Electrical and Computer Engineering, Ohio State University, USA)
Umit Catalyurek (Department of Biomedical Informatics and Electrical and Computer Engineering, Ohio State University, USA)
Antonio Ruiz (Computer Architecture Department, University of Malaga, Spain)
Manuel Ujaldon (Computer Architecture Department, University of Malaga, Spain)
Francisco Igual (Department of Computer Engineering and Computer Science, University Jaume I, Spain)
Rafael Mayo (Department of Computer Engineering and Computer Science, University Jaume I, Spain)

Data Mining on the Cell Broadband Engine
Gregory Buehrer (Ohio State University)
Srinivasan Parthasarathy (Ohio State University)
Matthew Goyder (Ohio State University)
11:30-12:30 Session 2: Performance Evaluation I

Accurate Memory Signatures and Synthetic Address Traces for HPC Applications
Jonathan Weinberg (University of California, San Diego)
Allan Snavely (San Diego Supercomputer Center)

Preserving Time in Large-Scale Communication Traces
Prasun Ratn (North Carolina State University)
Frank Mueller (North Carolina State University)
Bronis R. de Supinski (Lawrence Livermore National Laboratory)
Martin Schulz (Lawrence Livermore National Laboratory)
12:30-14:00 Break
14:00-15:30 Session 3: Architecture I

A Vacuum Microelectronic Freespace Crossbar
Michel Victor (Exaconnect Inc)
Aris Silzars (Exaconnect Inc)
Edward Davidson (University of Michigan)

Adaptive DRAM Temperature and Power Management
Song Liu (Northwestern University)
Seda Memik (Northwestern University)
Yu Zhang (Northwestern University)
Gokhan Memik (Northwestern University)

The Shared-Thread Multiprocessor
Jeffery A. Brown (UC San Diego)
Dean M. Tullsen (UC San Diego)
15:30-15:45 Break
15:45-17:15 Session 4: Communication and Synchronization I

Advanced Collective Communication in Aspen
Qasim Ali (Purdue University, West Lafayette)
Sam Midkiff (Purdue University, West Lafayette)
Vijay Pai (Purdue University, West Lafayette)

The Deep Computing Messaging Framework: Generalized Scalable Message Passing on the Blue Gene/P Supercomputer
Sameer Kumar (IBM Research)
Gabor Dozsa (IBM Research)
Gheorghe Almasi (IBM Research)
Dong Chen (IBM Research)
Mark E. Giampapa (IBM Research)
Philip Heidelberger (IBM Research)
Michael Blocksome (IBM Rochester)
Ahmad Faraj (IBM Rochester)
Jeff Parker (IBM Rochester)
Joseph Ratterman (IBM Rochester)
Brian Smith (IBM Rochester)
Charles Archer (IBM Rochester)

A Projection-Based Optimization Framework for Abstractions with Application to the Unstructured Mesh Domain
Brian S. White (Cornell University)
Sally A. McKee (Cornell University)
Daniel Quinlan (Lawrence Livermore National Laboratory)


Tuesday, 10 June 2008
08:30-09:30 Keynote 2: Challenges on the Road to Exascale Computing
Tilak Agerwala, IBM Research
09:30-09:45 Break
09:45-11:15 Session 5: File Systems

CprFS: A User-level File System to Support Consistent File State for Checkpoint and Restart
Ruini Xue (Department of Computer Science and Technology, Tsinghua University)
Wenguang Chen (Department of Computer Science and Technology, Tsinghua University)
Weimin Zheng (Department of Computer Science and Technology, Tsinghua University)

Timely Offloading of Result-Data in HPC Centers
Henry Monti (Virginia Tech.)
Ali R. Butt (Virginia Tech.)
Sudharshan S. Vazhkudai (Oak Ridge National Laboratory)

Shifted Declustering: An Ideal-placement Layout Scheme for Multi-way Replication Storage Architecture
Huijun Zhu (University of Central Florida)
Peng Gu (University of Central Florida)
Jun Wang (University of Central Florida)
11:15-11:30 Break
11:30-12:30 Session 6: Fault Tolerance

Can Software Reliability Outperform Hardware Reliability on High Performance Interconnects? A Case Study with MPI over InfiniBand
Matthew J. Koop (The Ohio State University)
Rahul Kumar (The Ohio State University)
Dhabaleswar K. Panda (The Ohio State University)

Soft Error Vulnerability of Iterative Linear Algebra Methods
Greg Bronevetsky (Lawrence Livermore National Laboratory)
Bronis de Supinski (Lawrence Livermore National Laboratory)
12:30-14:00 Break
14:00-15:30 Session 7: Operating Systems

Evaluating the Effect of Replacing CNK with Linux on the Compute-Nodes of Blue Gene/L
Edi Shmueli (IBM)
George Almasi (IBM)
Jose Brunheroto (IBM)
Jose Castanos (IBM)
Gabor Dozsa (IBM)
Sameer Kumar (IBM)
Derek Lieber (IBM)

Power-aware Dynamic Placement of HPC Applications
Puneet Ahuja (IIT Delhi)
Anindya Neogi (IBM India Research Lab)
Akshat Verma (IBM India Research Lab)

Autonomous Learning for Efficient Resource Utilization of Dynamic VM Migration
Hyung Won Choi (New Jersey Institute of Technology)
Hukeun Kwak (Soongsil University)
Andrew Sohn (New Jersey Institute of Technology)
Kyusik Chung (Soongsil University)
15:30-15:45 Break
15:45-17:15 Session 8: Algorithms and Applications II

Adaptive Runtime Tuning of Parallel Sparse Matrix-Vector Multiplication on Distributed Memory Systems
Seyong Lee (School of ECE, Purdue University, West Lafayette, IN 47907, USA)
Rudolf Eigenmann (School of ECE, Purdue University, West Lafayette, IN 47907, USA)

Fast Scan Algorithms on Graphics Processors
Yuri Dotsenko (Microsoft Corporation)
Naga Govindaraju (Microsoft Corporation)
Peter-Pike Sloan (Microsoft Corporation)
Charles Boyd (Microsoft Corporation)
John Manferdelli (Microsoft Corporation)

Three-Dimensional Delaunay Refinement for Multi-Core Processors
Andrey N. Chernikov (College of William and Mary)
Nikos P. Chrisochoides (College of William and Mary)


Wednesday, 11 June 2008
08:30-09:30 Keynote 3: Petaflop/s, Seriously
David E. Keyes, Columbia University
09:30-09:45 Break
09:45-11:15 Session 9: Code Performance Tuning

A Compiler Framework for Optimization of Affine Loop Nests for GPGPUs
Muthu Manikandan Baskaran (Department of CSE, The Ohio State University)
Uday Bondhugula (Department of CSE, The Ohio State University)
J Ramanujam (Department of ECE, Louisiana State University)
Atanas Rountev (Department of CSE, The Ohio State University)
P Sadayappan (Department of CSE, The Ohio State University)

Rotating Register Allocation with Multiple Rotating Branches
Suhyun Kim (KIST)
Soo-Mook Moon (SNU)

Automatic SIMD Vectorization of Chains of Recurrences
Yixin Shou (Department of Computer Science, Florida State University, Tallahassee, FL 32306-4530)
Robert A. van Engelen (School of Computational Science, Florida State University, Tallahassee, FL 32306-4530)
11:15-11:30 Break
11:30-13:00 Session 10: Communication and Synchronization II

Optimizing Irregular Shared-Memory Applications for Clusters
Seung-Jai Min (Purdue University, USA)
Rudolf Eigenmann (Purdue University, USA)

Performance Portable Optimizations for Loops Containing Communication Operations
Costin Iancu (Lawrence Berkeley National Laboratory)
Wei Chen (Intel)
Kathy Yelick (Lawrence Berkeley National Laboratory)

Phasers: a Unified Deadlock-Free Construct for Collective and Point-to-point Synchronization
Jun Shirako (Waseda University)
David Peixotto (Rice University)
Vivek Sarkar (Rice University)
William Scherer (Rice University)
13:00-21:00 Excursion


Thursday, 12 June 2008
08:30-10:00 Session 11: Memory Management

Orchestrating Data Transfer for the Cell/B.E. Processor
Tong Chen (IBM T. J. Watson Research Lab)
Haibo Lin (IBM China Research Lab)
Tao Zhang (IBM T. J. Watson Research Lab)
Kathryn O'Brien (IBM T. J. Watson Research Lab)
Kevin O'Brien (IBM T. J. Watson Research Lab)

CUBA: An Architecture for Efficient CPU/Co-processor Data Communication
Isaac Gelado (Universitat Politecnica de Catalunya)
John Kelm (University of Illinois at Urbana-Champaign)
Shane Ryoo (University of Illinois at Urbana-Champaign)
Steve Lumetta (University of Illinois at Urbana-Champaign)
Nacho Navarro (Universitat Politecnica de Catalunya)
Wen-mei Hwu (University of Illinois at Urbana-Champaign)

Efficient Computation of Sum-Products on GPUs through Software-Managed Cache
Mark Silberstein (Technion - Israel Institute of Technology, Israel)
Dan Geiger (Technion - Israel Institute of Technology, Israel)
Assaf Schuster (Technion - Israel Institute of Technology, Israel)
Anjul Patney (University of California in Davis, USA)
John D. Owens (University of California in Davis, USA)
10:00-10:15 Break
10:15-11:45 Session 12: Architecture II

Exploiting Idle Register Classes for Fast Spill Destination
Fang Lu (Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China)
Lei Wang (Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China)
Xiaobing Feng (Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China)
Zhiyuan Li (Department of Computer Sciences, Purdue University, U.S.A.)
Zhaoqing Zhang (Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China)

Analysis of Dynamic Power Management on Multi-Core Processors
W. Lloyd Bircher (Advanced Micro Devices)
Lizy John (University of Texas at Austin)

Focused Prefetching: Performance Oriented Prefetching Based On Commit Stalls
R. Manikantan (Dept of Computer Science and Automation, Indian Institute of Science)
R. Govindarajan (Dept of Computer Science and Automation, Indian Institute of Science)
11:45-13:00 Break
13:00-14:30 Session 13: Performance Evaluation II

Automatic Analysis of Speedup of MPI Applications
Marc Casas (Universitat Politècnica de Catalunya)
Rosa M. Badia (Universitat Politècnica de Catalunya)
Jesús Labarta (Universitat Politècnica de Catalunya)

Analyzing Memory Access Intensity in Parallel Programs on Multicore
Lixia Liu (Department of Computer Science, Purdue University, USA)
Zhiyuan Li (Department of Computer Science, Purdue University, USA)
Ahmed Sameh (Department of Computer Science, Purdue University, USA)

A Regression-Based Approach to Scalability Prediction
Brad Barnes (University of Georgia)
Barry Rountree (University of Georgia)
David Lowenthal (University of Georgia)
Jaxk Reeves (University of Georgia)
Bronis de Supinski (Lawrence Livermore National Laboratory)
Martin Schulz (Lawrence Livermore National Laboratory)
14:30-14:45 Closing





Keynote Talks





Many-core GPU Computing with NVIDIA CUDA

Mark Harris - NVIDIA

In the past, graphics processors were special-purpose hardwired application accelerators, suitable only for conventional graphics applications. Modern GPUs are fully programmable, massively parallel floating point processors. In this talk I will describe NVIDIA's scalable, highly parallel many-core GPU architecture and how CUDA software for GPU computing delivers high throughput for data-intensive processing. I will discuss how CUDA is reinvigorating research on data-parallel algorithms, reducing time to scientific discovery, and enabling a variety of compute-intensive industrial applications of GPUs beyond computer graphics.





Challenges on the Road to Exascale Computing

Tilak Agerwala - Vice President of Systems, IBM Research

Supercomputing systems have made great strides in recent years as the extensive computing needs of cutting-edge engineering work and scientific discovery have driven the development of more powerful systems. The first teraflop computer, ASCI Red, came on the scene in late 1996, and now a machine must achieve 5.9 teraflops to gain entry to the very bottom of the Top500 supercomputing list.

With the emergence of petascale supercomputers expected in 2008 or 2009, we have set our sights on the increased capacity and expected muscle of exascale computing. This has also been the focus of organizations such as the Institute for Advanced Architectures jointly launched at Sandia and Oak Ridge National Laboratories. However, the challenges of exascale computing will not be solved by the technologies of today.

If today's most power- and energy-efficient supercomputer were linearly scaled to the exascale level, it would consume around 200 MW of power, contain over 60 million cores, and require over 400 million dollars of memory. Such a system will present significant problems in management and programmability as current bounds of parallelism are tested. Data accessibility will also be a significant issue, as our ability to sense, generate and calculate on data is growing faster than our ability to access, manage and even "store" that data. These problems only get worse as the system's computational power scales up. In this talk, I will discuss exascale computing challenges to be overcome in the areas of power, architecture, programmability, management, and data accessibility.
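
As a rough illustration of the linear-scaling figures quoted in this abstract, the short Python sketch below derives the per-watt and per-core rates they imply. It assumes that "exascale" means 10^18 floating-point operations per second and treats the abstract's round numbers (about 200 MW, over 60 million cores) as exact; the function and variable names are hypothetical and are not taken from the talk.

```python
# Back-of-the-envelope check of the linear-scaling figures in the abstract.
# Assumption: "exascale" is taken to mean 1e18 floating-point operations/s.
EXAFLOP = 1e18


def implied_rates(total_flops, power_watts, cores):
    """Return the per-watt and per-core rates (in Gflop/s) implied by the totals."""
    return {
        "gflops_per_watt": total_flops / power_watts / 1e9,
        "gflops_per_core": total_flops / cores / 1e9,
    }


# Round numbers from the abstract: about 200 MW and (at least) 60 million cores.
rates = implied_rates(EXAFLOP, power_watts=200e6, cores=60e6)

print(f"Implied efficiency: {rates['gflops_per_watt']:.1f} Gflop/s per watt")
# The abstract says "over 60 million" cores, so this per-core value is an upper bound.
print(f"Implied per-core rate: at most {rates['gflops_per_core']:.1f} Gflop/s")
```

The point of the sketch is only that the quoted totals are mutually consistent orders of magnitude; it says nothing about which machine the abstract takes as its baseline.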





Petaflop/s, Seriously

David E. Keyes - Fu Foundation Professor of Applied Mathematics, Columbia University, and Acting Director of Institute for Scientific Computing Research (ISCR) at Lawrence Livermore National Laboratory.

Sustained floating-point rates on real applications, as tracked by the Gordon Bell Prize, have increased by over five orders of magnitude from 1988, when 1 Gigaflop/s was reported on a structural simulation, to 2006, when 200 Teraflop/s were reported on a molecular dynamics simulation. Various versions of Moore's Law over the same interval provide only two to three orders of magnitude of improvement for an individual processor; the remaining factor comes from concurrency, which is of order 100,000 for the BlueGene/L computer, the platform of choice for the majority of recent Bell Prize finalists. As the semiconductor industry begins to slip relative to its own roadmap for silicon-based logic and memory, concurrency will play an increasing role in attaining the next order of magnitude, to arrive at the long-awaited milepost of 1 Petaflop/s sustained on a practical application, which should occur around 2009. Simulations based on Eulerian formulations of partial differential equations can be among the first applications to take advantage of petascale capabilities, but not the way most are presently being pursued. Only weak scaling can get around the fundamental limitation expressed in Amdahl's Law and only optimal implicit formulations can get around another limitation on scaling that is an immediate consequence of Courant-Friedrichs-Lewy stability theory under weak scaling of a PDE. Many PDE-based applications and other lattice-based applications with petascale roadmaps, such as quantum chromodynamics, will likely be forced to adopt optimal implicit solvers. However, even this narrow path to petascale simulation is made treacherous by the imperative of dynamic adaptivity, which drives us to consider algorithms and queueing policies that are less synchronous than those in common use today. 
Drawing on the SCaLeS report (2003-04), the latest ITRS roadmap, some back-of-the-envelope estimates, and numerical experiences with PDE-based codes on recently available platforms, we will attempt to project the pathway to Petaflop/s for representative applications.
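
The abstract's remark that only weak scaling gets around Amdahl's Law can be illustrated with the two standard textbook speedup formulas. The Python sketch below is a minimal illustration, not material from the talk: the 1% serial fraction is a hypothetical value chosen for the example, and the processor count of 100,000 simply echoes the order of concurrency the abstract cites for BlueGene/L.

```python
def amdahl_speedup(serial_fraction, processors):
    """Strong scaling (fixed problem size): speedup bounded by 1 / serial_fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def gustafson_speedup(serial_fraction, processors):
    """Weak scaling (problem size grows with the machine): scaled speedup."""
    return serial_fraction + (1.0 - serial_fraction) * processors


# Hypothetical serial fraction of 1%; 100,000 processors as in the abstract.
s = 0.01
p = 100_000

strong = amdahl_speedup(s, p)
weak = gustafson_speedup(s, p)

print(f"Fixed-size (Amdahl) speedup:     {strong:,.1f}")
print(f"Scaled-size (Gustafson) speedup: {weak:,.1f}")
```

Under these assumptions the fixed-size speedup saturates just below 100 regardless of how many processors are added, while the scaled speedup keeps growing almost linearly with the processor count. The sketch does not model the second limitation the abstract mentions, the time-step restriction from Courant-Friedrichs-Lewy stability theory.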