[ieeetcsc-discuss] CFP: Resilience 2008 @ CCGRID
Christian Engelmann
engelmannc at ornl.gov
Fri Nov 9 15:30:43 PST 2007
The 2008 International Workshop on Resiliency in High Performance
Computing (Resilience 2008)
http://xcr.cenit.latech.edu/resilience2008/
In conjunction with the 8th IEEE International Symposium on Cluster
Computing and the Grid (CCGRID 2008), May 18-22, 2008, Lyon, France.
Important Dates:
• Paper Submission Deadline : December 1, 2007
• Notification Deadline : January 15, 2008
• Camera Ready Deadline : January 30, 2008
Authors of selected papers will be invited to submit their papers for
publication in a special issue of the International Journal of Grid
and High Performance Computing (IJGHPC), published by IGI Publishing, by
September 15, 2008.
Overview:
Recent trends in high-performance computing (HPC) systems have clearly
indicated that future increases in performance, in excess of those
resulting from improvements in single-processor performance, will be
achieved through corresponding increases in system scale, i.e., using a
significantly larger component count. As the raw computational
performance of the world's fastest HPC systems increases from today’s
tera-scale to next-generation peta-scale capability and beyond,
their number of computational, networking, and storage components will
grow from the ten-to-one-hundred thousand compute nodes of today’s
systems to several hundreds of thousands of compute nodes and more in
the foreseeable future. This substantial growth in system scale, and the
resulting component count, poses a challenge for HPC system and
application software with respect to reliability, availability and
serviceability (RAS). Serviceability aims toward effective means by
which corrective and preventive maintenance can be performed on a
system. Higher serviceability improves availability and helps retain
quality, performance and continuity of services at expected levels.
Together, the combination of high availability (HA), serviceability, and
HPC will clearly bring further benefits to critical, shared, major
high-end computing (HEC) resource environments.
A recent study performed at Los Alamos National Laboratory estimates the
System Mean Time To Failure (SMTTF) for a next-generation peta-scale HPC
system. Extrapolating from current HPC system performance, scale, and
SMTTF, this study suggests that the system mean time between failures
(SMTBF), i.e., the actual time spent for useful computation between full
system recovery and the next failure, will fall to only 1.25 hours on a
petaflop machine. The same study also estimates the overhead of the
current state-of-the-art fault tolerance strategy, checkpoint/restart,
for such a system. The results of this analysis show that a
computational job that could normally complete in 100 hours on a
failure-free peta-scale HPC system will actually take 251 hours to
complete, once the cost of failure recovery is included. What this analysis
implies is startling: more than 60% of the cycles (and investment) on
next-generation peta-scale HPC systems may be lost due to the overhead
of dealing with reliability issues, unless something happens to
drastically change the current course.
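
The "more than 60%" figure follows directly from the two job durations
quoted above. A minimal sketch of that arithmetic, assuming the lost
fraction is measured against total wall-clock time; the function name
lost_fraction is hypothetical and is not taken from the study:

```python
def lost_fraction(failure_free_hours: float, actual_hours: float) -> float:
    """Fraction of wall-clock time not spent on useful computation.

    Assumes the useful work equals the failure-free run time and that
    everything beyond it is fault tolerance overhead.
    """
    if failure_free_hours <= 0 or actual_hours < failure_free_hours:
        raise ValueError("expected 0 < failure_free_hours <= actual_hours")
    return (actual_hours - failure_free_hours) / actual_hours


# Figures quoted in the announcement.
ideal_hours = 100.0   # run time on a failure-free system
actual_hours = 251.0  # run time once failure recovery is included

overhead_hours = actual_hours - ideal_hours
fraction = lost_fraction(ideal_hours, actual_hours)

print(f"overhead: {overhead_hours:.0f} hours")  # overhead: 151 hours
print(f"lost fraction: {fraction:.1%}")         # lost fraction: 60.2%
```

Under this reading, 151 of the 251 hours go to checkpointing, recovery
and recomputation, which is where the 60% estimate comes from.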
To address the question of computing resiliency, fault tolerance and
high availability have become critical research topics. The goal of this
workshop is to bring together the community in an effort to increase the
resiliency of modern computing platforms such that the application mean
time to interrupt (MTTI) is significantly greater than the
hardware/software mean time between failures (MTBF). More simply put,
MTTI >> MTBF, so that applications will have an opportunity to run to
completion without experiencing a significant impact as a result of a
computer failure.
Submission Guidelines:
Original, unpublished work is required. The manuscript shall be a
maximum of 6 IEEE-style pages (two columns, single-spaced, 10-point
font), including tables and illustrations. Accepted contributions will
be published on the proceedings website and CD, which will be available
at the workshop. Please send all submissions by email, in
PostScript or PDF format, to Dr. Box Leangsuksun, box at latech.edu.
Resilience 2008 topics of interest include, but are not limited to:
• Hardware for fault detection and resiliency.
• System-level resiliency for HPC.
• Statistical methods to improve system resiliency.
• Fault tolerance mechanisms and experiments.
• Resource management for system resiliency and availability.
• Resilient systems based on hardware probes.
• Reliability and robustness in grid computing.
• Failure recovery strategies in grid and HPC.
• Reliable communication in grid and HPC.
Workshop General Co-Chairs:
• Stephen L. Scott
Computer Science & Mathematics Division
Oak Ridge National Laboratory
scottsl at ornl.gov
• Chokchai (Box) Leangsuksun
SWEPCO Endowed Associate Professor of Computer Science,
Louisiana Tech University, USA
box at latech.edu
Program Committee:
• Box Leangsuksun, Louisiana Tech University
• Christian Engelmann, Oak Ridge National Laboratory
• Dan Katz, Louisiana State University
• Daniel Stanzione, Jr., Arizona State University
• Frank Mueller, North Carolina State University
• Geoffroy Vallee, Oak Ridge National Laboratory
• George Ostrouchov, Oak Ridge National Laboratory
• Hong Ong, Oak Ridge National Laboratory
• John West, ERDC Major Shared Resource Center
• Mihaela Paun, Louisiana Tech University
• Stephen Scott, Oak Ridge National Laboratory
• Thomas Naughton, Oak Ridge National Laboratory
• Xian-He Sun, Illinois Institute of Technology
• Xubin (Ben) He, Tennessee Tech University
• Yung-chin Fang, Dell
• Zhiling Lan, Illinois Institute of Technology
--
-----------------------------------------------------------------------
Christian Engelmann Phone: +1 (865) 574-3132
Research Staff Member Fax: +1 (865) 576-5491
Oak Ridge National Laboratory One Bethel Valley Road
mailto:engelmannc at ornl.gov P.O. Box 2008, MS-6173
http://www.csm.ornl.gov/~engelman Oak Ridge, TN 37831, USA
-----------------------------------------------------------------------