[ieeetcsc-discuss] CFP: Resilience 2008 @ CCGRID

Christian Engelmann engelmannc at ornl.gov
Fri Nov 9 15:30:43 PST 2007


The 2008 International Workshop on Resiliency in High Performance 
Computing (Resilience 2008)
http://xcr.cenit.latech.edu/resilience2008/
In conjunction with the 8th IEEE International Symposium on Cluster 
Computing and the Grid (CCGRID 2008), May 18-22, 2008, Lyon, France.

Important Dates:
• Paper Submission Deadline : December 1, 2007
• Notification Deadline : January 15, 2008
• Camera Ready Deadline : January 30, 2008
Authors of selected papers will be invited to submit their papers, by 
September 15, 2008, for publication in a special issue of the 
International Journal of Grid and High Performance Computing (IJGHPC), 
published by IGI Publishing.

Overview:
Recent trends in high-performance computing (HPC) systems have clearly 
indicated that future increases in performance, in excess of those 
resulting from improvements in single-processor performance, will be 
achieved through corresponding increases in system scale, i.e., using a 
significantly larger component count. As the raw computational 
performance of the world's fastest HPC systems increases from today’s 
tera-scale to next-generation peta-scale capability and beyond, 
their number of computational, networking, and storage components will 
grow from the ten-to-one-hundred thousand compute nodes of today’s 
systems to several hundreds of thousands of compute nodes and more in 
the foreseeable future. This substantial growth in system scale, and the 
resulting component count, poses a challenge for HPC system and 
application software with respect to reliability, availability and 
serviceability (RAS). Serviceability concerns the effective means by 
which corrective and preventive maintenance can be performed on a 
system. Higher serviceability improves availability and helps retain 
quality, performance, and continuity of services at expected levels. 
Together, high availability (HA), serviceability, and HPC will clearly 
bring even more benefits to critical, shared, major high-end computing 
(HEC) resource environments.
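
As a rough illustration of why component count matters, the short Python 
sketch below estimates how system mean time to failure shrinks as nodes 
are added. It assumes identical nodes that fail independently at a 
constant rate, and that any single node failure interrupts the whole 
system. The function name and the 5-year per-node figure are hypothetical 
and are not taken from this announcement or from any measured system.

```python
def system_mttf_hours(node_mttf_hours: float, node_count: int) -> float:
    """Estimate system MTTF for identical, independently failing nodes.

    Assumes constant (exponential) failure rates, so the node failure
    rates add up and the system MTTF is the node MTTF divided by the
    number of nodes.
    """
    if node_mttf_hours <= 0:
        raise ValueError("node_mttf_hours must be positive")
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return node_mttf_hours / node_count


# Hypothetical per-node MTTF of 5 years, chosen only for illustration.
HYPOTHETICAL_NODE_MTTF_HOURS = 5 * 365 * 24

for nodes in (10_000, 100_000, 500_000):
    estimate = system_mttf_hours(HYPOTHETICAL_NODE_MTTF_HOURS, nodes)
    print(f"{nodes:>7} nodes: about {estimate:.2f} hours between failures")
```

Under these simplifying assumptions the estimate falls in direct 
proportion to the node count, which is the scaling pressure described 
above.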

A recent study performed at Los Alamos National Laboratory estimates the 
System Mean Time To Failure (SMTTF) for a next-generation peta-scale HPC 
system. Extrapolating from current HPC system performance, scale, and 
SMTTF, this study suggests that the system mean time between failures 
(SMTBF), i.e., the actual time spent for useful computation between full 
system recovery and the next failure, will fall to only 1.25 hours on a 
petaflop machine. The same study also estimates the overhead of the 
current state-of-the-art fault tolerance strategy, checkpoint/restart, 
for such a system. The results of this analysis show that a 
computational job that could normally complete in 100 hours on a 
failure-free peta-scale HPC system will actually take 251 hours to 
complete, once the cost of failure recovery is included. What this analysis 
implies is startling: more than 60% of the cycles (and investment) on 
next-generation peta-scale HPC systems may be lost due to the overhead 
of dealing with reliability issues, unless something happens to 
drastically change the current course.
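
The "more than 60%" figure follows directly from the two job durations 
quoted above. The Python sketch below reproduces that arithmetic; the 
function and variable names are illustrative only and do not come from 
the study.

```python
def lost_fraction(failure_free_hours: float, actual_hours: float) -> float:
    """Fraction of wall-clock time not spent on useful computation."""
    if failure_free_hours <= 0 or actual_hours <= 0:
        raise ValueError("durations must be positive")
    if failure_free_hours > actual_hours:
        raise ValueError("actual time cannot be shorter than failure-free time")
    return 1.0 - failure_free_hours / actual_hours


# Figures quoted in the announcement: a 100-hour job takes 251 hours.
failure_free = 100.0
actual = 251.0

lost = lost_fraction(failure_free, actual)
print(f"Lost to failure handling: {lost:.1%}")  # prints 60.2%
```

Only about 40% of the 251 hours is useful computation; the remaining 151 
hours are spent on checkpointing, recovery, and recomputation.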

To address the question of computing resiliency, fault tolerance and 
high availability become critical research topics. The goal of this 
workshop is to bring together the community in an effort to increase the 
resiliency of modern computing platforms such that the application mean 
time to interrupt (MTTI) is significantly greater than the 
hardware/software mean time between failures (MTBF). More simply put, 
MTTI >> MTBF, so that applications will have an opportunity to run to 
completion without experiencing a significant impact as a result of a 
computer failure.

Submission Guidelines:
Original, unpublished work is required. Manuscripts are limited to a 
maximum of 6 IEEE-style pages (two columns, single-spaced, 10-point 
font), including tables and illustrations. Accepted contributions will 
be published on the proceedings website and CD, which will be available 
at the workshop. Please send all submissions by email, in PostScript or 
PDF format, to Dr. Box Leangsuksun, box at latech.edu.

Resilience 2008 topics of interest include, but are not limited to:
• Hardware for fault detection and resiliency
• System-level resiliency for HPC
• Statistical methods to improve system resiliency
• Experiments with fault tolerance mechanisms
• Resource management for system resiliency and availability
• Resilient systems based on hardware probes
• Reliability and robustness in Grid computing
• Failure recovery strategies in Grid and HPC
• Reliable communication in Grid and HPC

Workshop General Co-Chairs:
• Stephen L. Scott
   Computer Science & Mathematics Division
   Oak Ridge National Laboratory
   scottsl at ornl.gov

• Chokchai (Box) Leangsuksun
   SWEPCO Endowed Associate Professor of Computer Science,
   Louisiana Tech University, USA
   box at latech.edu

Program Committee:
• Box Leangsuksun, Louisiana Tech University
• Christian Engelmann, Oak Ridge National Laboratory
• Dan Katz, Louisiana State University
• Daniel Stanzione, Jr., Arizona State University
• Frank Mueller, North Carolina State University
• Geoffroy Vallee, Oak Ridge National Laboratory
• George Ostrouchov, Oak Ridge National Laboratory
• Hong Ong, Oak Ridge National Laboratory
• John West, ERDC Major Shared Resource Center
• Mihaela Paun, Louisiana Tech University
• Stephen Scott, Oak Ridge National Laboratory
• Thomas Naughton, Oak Ridge National Laboratory
• Xian-He Sun, Illinois Institute of Technology
• Xubin (Ben) He, Tennessee Tech University
• Yung-chin Fang, Dell
• Zhiling Lan, Illinois Institute of Technology

-- 
-----------------------------------------------------------------------
Christian Engelmann                            Phone: +1 (865) 574-3132
Research Staff Member                            Fax: +1 (865) 576-5491
Oak Ridge National Laboratory                    One Bethel Valley Road
mailto:engelmannc at ornl.gov                       P.O. Box 2008, MS-6173
http://www.csm.ornl.gov/~engelman              Oak Ridge, TN 37831, USA
-----------------------------------------------------------------------