Technical Program
| Saturday, 7 June 2008 | |
| Full day | 1st International Workshop on Super Visualization. Meeting Room: Acesso |
| Full day | International Workshop on Scalable Tools for High-End Computing. Meeting Room: Melambus |
| Sunday, 8 June 2008 | |
| Full day | Tutorial: Programming emerging architectures: GPUs and multicores. Meeting Room: Melambus |
| Monday, 9 June 2008 | |
| 08:30-08:45 | Opening |
| 08:45-09:45 |
Keynote 1: Many-core GPU Computing with NVIDIA CUDA Mark Harris, NVIDIA |
| 09:45-10:00 | Break |
| 10:00-11:30 | Session 1: Algorithms and Applications I |
|
Implementing Wilson-Dirac Operator on the Cell Broadband Engine. Khaled Z. Ibrahim (IRISA/INRIA, Rennes, France) F. Bodin (IRISA/INRIA, Rennes, France) |
|
|
Biomedical Image Analysis on a Cooperative Cluster of GPUs and Multicores. Timothy D.R. Hartley (Department of Biomedical Informatics and Electrical and Computer Engineering, Ohio State University, USA) Umit Catalyurek (Department of Biomedical Informatics and Electrical and Computer Engineering, Ohio State University, USA) Antonio Ruiz (Computer Architecture Department, University of Malaga, Spain) Manuel Ujaldon (Computer Architecture Department, University of Malaga, Spain) Francisco Igual (Department of Computer Engineering and Computer Science, University Jaume I, Spain) Rafael Mayo (Department of Computer Engineering and Computer Science, University Jaume I, Spain) |
|
|
Data Mining on the Cell Broadband Engine Gregory Buehrer (Ohio State University) Srinivasan Parthasarathy (Ohio State University) Matthew Goyder (Ohio State University) |
|
| 11:30-12:30 | Session 2: Performance Evaluation I |
|
Accurate Memory Signatures and Synthetic Address Traces for HPC Applications Jonathan Weinberg (University of California, San Diego) Allan Snavely (San Diego Supercomputer Center) |
|
|
Preserving Time in Large-Scale Communication Traces Prasun Ratn (North Carolina State University) Frank Mueller (North Carolina State University) Bronis R. de Supinski (Lawrence Livermore National Laboratory) Martin Schulz (Lawrence Livermore National Laboratory) |
|
| 12:30-14:00 | Break |
| 14:00-15:30 | Session 3: Architecture I |
|
A Vacuum Microelectronic Freespace Crossbar Michel Victor (Exaconnect Inc) Aris Silzars (Exaconnect Inc) Edward Davidson (University of Michigan) |
|
|
Adaptive DRAM Temperature and Power Management
Song Liu (Northwestern University) Seda Memik (Northwestern University) Yu Zhang (Northwestern University) Gokhan Memik (Northwestern University) |
|
|
The Shared-Thread Multiprocessor Jeffery A. Brown (UC San Diego) Dean M. Tullsen (UC San Diego) |
|
| 15:30-15:45 | Break |
| 15:45-17:15 | Session 4: Communication and Synchronization I |
|
Advanced Collective Communication in Aspen Qasim Ali (Purdue University, West Lafayette) Sam Midkiff (Purdue University, West Lafayette) Vijay Pai (Purdue University, West Lafayette) |
|
|
The Deep Computing Messaging Framework: Generalized Scalable Message Passing on the Blue Gene/P Supercomputer Sameer Kumar (IBM Research) Gabor Dozsa (IBM Research) Gheorghe Almasi (IBM Research) Dong Chen (IBM Research) Mark E. Giampapa (IBM Research) Philip Heidelberger (IBM Research) Michael Blocksome (IBM Rochester) Ahmad Faraj (IBM Rochester) Jeff Parker (IBM Rochester) Joseph Ratterman (IBM Rochester) Brian Smith (IBM Rochester) Charles Archer (IBM Rochester) |
|
|
A Projection-Based Optimization Framework for Abstractions with Application to the Unstructured Mesh Domain Brian S. White (Cornell University) Sally A. McKee (Cornell University) Daniel Quinlan (Lawrence Livermore National Laboratory) |
|
| Tuesday, 10 June 2008 | |
| 08:30-09:30 |
Keynote 2: Challenges on the Road to Exascale Computing Tilak Agerwala, IBM Research |
| 09:30-09:45 | Break |
| 09:45-11:15 | Session 5: File Systems |
|
CprFS: A User-level File System to Support Consistent File State for Checkpoint and Restart Ruini Xue (Department of Computer Science and Technology, Tsinghua University) Wenguang Chen (Department of Computer Science and Technology, Tsinghua University) Weimin Zheng (Department of Computer Science and Technology, Tsinghua University) |
|
|
Timely Offloading of Result-Data in HPC Centers Henry Monti (Virginia Tech.) Ali R. Butt (Virginia Tech.) Sudharshan S. Vazhkudai (Oak Ridge National Laboratory) |
|
|
Shifted Declustering: An Ideal-placement Layout Scheme for Multi-way Replication Storage Architecture Huijun Zhu (University of Central Florida) Peng Gu (University of Central Florida) Jun Wang (University of Central Florida) |
|
| 11:15-11:30 | Break |
| 11:30-12:30 | Session 6: Fault Tolerance |
|
Can Software Reliability Outperform Hardware Reliability on High Performance Interconnects? : A Case Study with MPI over InfiniBand Matthew J. Koop (The Ohio State University) Rahul Kumar (The Ohio State University) Dhabaleswar K. Panda (The Ohio State University) |
|
|
Soft Error Vulnerability of Iterative Linear Algebra Methods Greg Bronevetsky (Lawrence Livermore National Laboratory) Bronis de Supinski (Lawrence Livermore National Laboratory) |
|
| 12:30-14:00 | Break |
| 14:00-15:30 | Session 7: Operating Systems |
|
Evaluating the Effect of Replacing CNK with Linux on the Compute-Nodes of Blue Gene/L Edi Shmueli (IBM) George Almasi (IBM) Jose Brunheroto (IBM) Jose Castanos (IBM) Gabor Dozsa (IBM) Sameer Kumar (IBM) Derek Lieber (IBM) |
|
|
Power-aware Dynamic Placement of HPC Applications Puneet Ahuja (IIT Delhi) Anindya Neogi (IBM India Research Lab) Akshat Verma (IBM India Research Lab) |
|
|
Autonomous Learning for Efficient Resource Utilization of Dynamic VM Migration Hyung Won Choi (New Jersey Institute of Technology) Hukeun Kwak (Soongsil University) Andrew Sohn (New Jersey Institute of Technology) Kyusik Chung (Soongsil University) |
|
| 15:30-15:45 | Break |
| 15:45-17:15 | Session 8: Algorithms and Applications II |
|
Adaptive Runtime Tuning of Parallel Sparse-Vector Multiplication on Distributed Memory Systems Seyong Lee (School of ECE, Purdue University, West Lafayette, IN 47907, USA) Rudolf Eigenmann (School of ECE, Purdue University, West Lafayette, IN 47907, USA) |
|
|
Fast Scan Algorithms on Graphics Processors Yuri Dotsenko (Microsoft Corporation) Naga Govindaraju (Microsoft Corporation) Peter-Pike Sloan (Microsoft Corporation) Charles Boyd (Microsoft Corporation) John Manferdelli (Microsoft Corporation) |
|
|
Three-Dimensional Delaunay Refinement for Multi-Core Processors Andrey N. Chernikov (College of William and Mary) Nikos P. Chrisochoides (College of William and Mary) |
|
| Wednesday, 11 June 2008 | |
| 08:30-09:30 |
Keynote 3: Petaflop/s, Seriously David E. Keyes, Columbia University |
| 09:30-09:45 | Break |
| 09:45-11:15 | Session 9: Code Performance Tuning |
|
A Compiler Framework for Optimization of Affine Loop Nests for GPGPUs Muthu Manikandan Baskaran (Department of CSE, The Ohio State University) Uday Bondhugula (Department of CSE, The Ohio State University) J Ramanujam (Department of ECE, Louisiana State University) Atanas Rountev (Department of CSE, The Ohio State University) P Sadayappan (Department of CSE, The Ohio State University) |
|
|
Rotating Register Allocation with Multiple Rotating Branches Suhyun Kim (KIST) Soo-Mook Moon (SNU) |
|
|
Automatic SIMD Vectorization of Chains of Recurrences Yixin Shou (Department of Computer Science, Florida State University, Tallahassee, FL 32306-4530) Robert A. van Engelen (School of Computational Science, Florida State University, Tallahassee, FL 32306-4530) |
|
| 11:15-11:30 | Break |
| 11:30-13:00 | Session 10: Communication and Synchronization II |
|
Optimizing Irregular Shared-Memory Applications for Clusters Seung-Jai Min (Purdue University, USA) Rudolf Eigenmann (Purdue University, USA) |
|
|
Performance Portable Optimizations for Loops Containing Communication Operations Costin Iancu (Lawrence Berkeley National Laboratory) Wei Chen (Intel) Kathy Yelick (Lawrence Berkeley National Laboratory) |
|
|
Phasers: a Unified Deadlock-Free Construct for Collective and Point-to-point Synchronization Jun Shirako (Waseda University) David Peixotto (Rice University) Vivek Sarkar (Rice University) William Scherer (Rice University) |
|
| 13:00-21:00 | Excursion |
| Thursday, 12 June 2008 | |
| 08:30-10:00 | Session 11: Memory Management |
|
Orchestrating Data Transfer for the Cell/B.E. Processor
Tong Chen (IBM T.J. Watson Research Lab) Haibo Lin (IBM China Research Lab) Tao Zhang (IBM T.J. Watson Research Lab) Kathryn O'Brien (IBM T.J. Watson Research Lab) Kevin O'Brien (IBM T.J. Watson Research Lab) |
|
|
CUBA: An Architecture for Efficient CPU/Co-processor Data Communication
Isaac Gelado (Universitat Politecnica de Catalunya) John Kelm (University of Illinois at Urbana-Champaign) Shane Ryoo (University of Illinois at Urbana-Champaign) Steve Lumetta (University of Illinois at Urbana-Champaign) Nacho Navarro (Universitat Politecnica de Catalunya) Wen-mei Hwu (University of Illinois at Urbana-Champaign) |
|
|
Efficient computation of sum-products on GPUs through software-managed cache
Mark Silberstein (Technion - Israel Institute of Technology, Israel) Dan Geiger (Technion - Israel Institute of Technology, Israel) Assaf Schuster (Technion - Israel Institute of Technology, Israel) Anjul Patney (University of California in Davis, USA) John D. Owens (University of California in Davis, USA) |
|
| 10:00-10:15 | Break |
| 10:15-11:45 | Session 12: Architecture II |
|
Exploiting Idle Register Classes for Fast Spill Destination
Fang Lu (Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China) Lei Wang (Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China) Xiaobing Feng (Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China) Zhiyuan Li (Department of Computer Sciences, Purdue University, USA) Zhaoqing Zhang (Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China) |
|
|
Analysis of Dynamic Power Management on Multi-Core Processors
W. Lloyd Bircher (Advanced Micro Devices) Lizy John (University of Texas at Austin) |
|
|
Focused Prefetching: Performance Oriented Prefetching Based On Commit Stalls R. Manikantan (Dept of Computer Science and Automation, Indian Institute of Science) R. Govindarajan (Dept of Computer Science and Automation, Indian Institute of Science) |
|
| 11:45-13:00 | Break |
| 13:00-14:30 | Session 13: Performance Evaluation II |
|
Automatic Analysis of Speedup of MPI Applications
Marc Casas (Universitat Politècnica de Catalunya) Rosa M. Badia (Universitat Politècnica de Catalunya) Jesús Labarta (Universitat Politècnica de Catalunya) |
|
|
Analyzing Memory Access Intensity in Parallel Programs on Multicore
Lixia Liu (Department of Computer Science, Purdue University, USA) Zhiyuan Li (Department of Computer Science, Purdue University, USA) Ahmed Sameh (Department of Computer Science, Purdue University, USA) |
|
|
A Regression-Based Approach to Scalability Prediction
Brad Barnes (University of Georgia) Barry Rountree (University of Georgia) David Lowenthal (University of Georgia) Jaxk Reeves (University of Georgia) Bronis de Supinski (Lawrence Livermore National Laboratory) Martin Schulz (Lawrence Livermore National Laboratory) |
|
| 14:30-14:45 | Closing |
Keynote Talks
Many-core GPU Computing with NVIDIA CUDA
Mark Harris - NVIDIA
In the past, graphics processors were special-purpose hardwired application accelerators, suitable only for conventional graphics applications. Modern GPUs are fully programmable, massively parallel floating point processors. In this talk I will describe NVIDIA's scalable, highly parallel many-core GPU architecture and how CUDA software for GPU computing delivers high throughput for data-intensive processing. I will discuss how CUDA is reinvigorating research on data-parallel algorithms, reducing time to scientific discovery, and enabling a variety of compute-intensive industrial applications of GPUs beyond computer graphics.
Challenges on the Road to Exascale Computing
Tilak Agerwala - Vice President of Systems, IBM Research
Supercomputing systems have made great strides in recent years as the extensive computing needs of cutting-edge engineering work and scientific discovery have driven the development of more powerful systems. The first teraflop computer, ASCI Red, came on the scene in late 1996, and now a machine must achieve 5.9 teraflops to gain entry to the very bottom of the Top500 supercomputing list.
With the emergence of petascale supercomputers expected in 2008 or 2009, we have set our sights on the increased capacity and expected muscle of exascale computing. This has also been the focus of organizations such as the Institute for Advanced Architectures jointly launched at Sandia and Oak Ridge National Laboratories. However, the challenges of exascale computing will not be solved by the technologies of today.
If today's most power- and energy-efficient supercomputer were linearly scaled to the exascale level, it would consume around 200 MW of power, contain over 60 million cores, and require over 400 million dollars of memory. Such a system will present significant problems in management and programmability as current bounds of parallelism are tested. Data accessibility will also be a significant issue, as our ability to sense, generate and calculate on data is growing faster than our ability to access, manage and even "store" that data. These problems only get worse as the system's computational power scales up. In this talk, I will discuss exascale computing challenges to be overcome in the areas of power, architecture, programmability, management, and data accessibility.
Petaflop/s, Seriously
David E. Keyes - Fu Foundation Professor of Applied Mathematics, Columbia University, and Acting Director of Institute for Scientific Computing Research (ISCR) at Lawrence Livermore National Laboratory.
Sustained floating-point rates on real applications, as tracked by the Gordon Bell Prize, have increased by over five orders of magnitude from 1988, when 1 Gigaflop/s was reported on a structural simulation, to 2006, when 200 Teraflop/s were reported on a molecular dynamics simulation. Various versions of Moore's Law over the same interval provide only two to three orders of magnitude of improvement for an individual processor; the remaining factor comes from concurrency, which is of order 100,000 for the BlueGene/L computer, the platform of choice for the majority of recent Bell Prize finalists. As the semiconductor industry begins to slip relative to its own roadmap for silicon-based logic and memory, concurrency will play an increasing role in attaining the next order of magnitude, to arrive at the long-awaited milepost of 1 Petaflop/s sustained on a practical application, which should occur around 2009. Simulations based on Eulerian formulations of partial differential equations can be among the first applications to take advantage of petascale capabilities, but not the way most are presently being pursued. Only weak scaling can get around the fundamental limitation expressed in Amdahl's Law and only optimal implicit formulations can get around another limitation on scaling that is an immediate consequence of Courant-Friedrichs-Lewy stability theory under weak scaling of a PDE. Many PDE-based applications and other lattice-based applications with petascale roadmaps, such as quantum chromodynamics, will likely be forced to adopt optimal implicit solvers. However, even this narrow path to petascale simulation is made treacherous by the imperative of dynamic adaptivity, which drives us to consider algorithms and queueing policies that are less synchronous than those in common use today. Drawing on the SCaLeS report (2003-04), the latest ITRS roadmap, some back-of-the-envelope estimates, and numerical experiences with PDE-based codes on recently available platforms, we will attempt to project the pathway to Petaflop/s for representative applications.
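The abstract's remark that only weak scaling gets around Amdahl's Law can be illustrated with a small sketch. It is not taken from the talk: the serial fraction below is a hypothetical value chosen for illustration, and the weak-scaling formula used is the standard Gustafson scaled-speedup expression, which the abstract does not name. Note that the "serial fraction" is measured against the one-processor runtime in Amdahl's formula and against the parallel runtime in Gustafson's, so the two numbers are indicative rather than directly comparable.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Strong scaling (Amdahl's Law): total problem size is held fixed."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def gustafson_speedup(serial_fraction: float, processors: int) -> float:
    """Weak scaling (Gustafson's Law): problem size grows with processor count."""
    return serial_fraction + (1.0 - serial_fraction) * processors


if __name__ == "__main__":
    s = 0.001      # hypothetical serial fraction, for illustration only
    p = 100_000    # concurrency of the order the abstract quotes for BlueGene/L
    print(f"Strong scaling (Amdahl):  {amdahl_speedup(s, p):,.1f}x")
    print(f"Weak scaling (Gustafson): {gustafson_speedup(s, p):,.1f}x")
```

Under these assumptions the fixed-size speedup can never exceed 1/s no matter how many processors are added, while the scaled speedup keeps growing roughly in proportion to the processor count.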


