HPC Systems of the ZID - Statement of Service Levels for Systems

To maximize the benefit from its investments in HPC systems, the ZID will try to continue operation of existing systems for some time even after warranties or maintenance contracts have expired. If you use our systems, you should understand the implications and risks of this strategy, so you will be able to plan your research projects accordingly.

In the following paragraphs, we inform you about the current and planned maintenance status of each of our systems, and what it means to use systems that are no longer under maintenance.

Risks of using systems not under maintenance

Modern hardware is relatively stable. However, even small individual failure probabilities, multiplied by the number of machines in a cluster, add up to a significant failure rate, at least for certain components.
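To illustrate how small per-machine probabilities add up, here is a minimal sketch. It assumes that machines fail independently of each other, which is a simplification, and the per-machine probability and cluster sizes are hypothetical values chosen for illustration only, not measurements from any ZID system.

```python
def cluster_failure_probability(p_single: float, n_machines: int) -> float:
    """Probability that at least one of n_machines fails in a given period.

    Assumes independent failures with identical probability p_single.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_machines < 0:
        raise ValueError("n_machines must not be negative")
    # P(at least one failure) = 1 - P(no machine fails)
    return 1.0 - (1.0 - p_single) ** n_machines


def expected_failures(p_single: float, n_machines: int) -> float:
    """Expected number of failed machines in the same period."""
    return p_single * n_machines


if __name__ == "__main__":
    p = 0.02  # hypothetical per-machine failure probability per year
    for n in (1, 50, 200):  # hypothetical cluster sizes
        print(n, cluster_failure_probability(p, n), expected_failures(p, n))
```

The probability of at least one failure grows quickly with the number of machines, even though each individual machine remains unlikely to fail.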

An individual failure may, depending on its nature, result in one or more of the following consequences:

  • Termination of individual jobs,
  • Reduction of processing capacity (loss of individual nodes),
  • Downgraded communication bandwidth,
  • Temporary or permanent loss of access to data or system functionality,
  • Complete termination of system operation, possibly including the permanent loss of certain data.

Depending on the failed component and its replacement cost, the ZID may or may not decide to repair a system. If we do repair it, the time to repair may be significantly longer than for a system under maintenance, resulting in outages that may last for days or even several weeks. We estimate this risk to be relatively low, but we cannot rule it out.

Recommendations and precautions

  • Regularly back up important data in SCRATCH areas. Data in HOME directories are backed up by the ZID and thus are at significantly lower risk. As with all systems, you may still decide to keep a backup of your own.
  • Try not to depend on continuous system availability to meet important deadlines or research goals.
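As one possible way to follow the backup recommendation, the sketch below packs a directory into a timestamped, compressed archive using only the Python standard library. The function name, the directory layout, and the paths in the usage comment are hypothetical; the ZID does not prescribe a particular backup tool, and the destination should be storage of your own choosing.

```python
import tarfile
import time
from pathlib import Path


def backup_directory(source: Path, dest_dir: Path) -> Path:
    """Pack `source` into a timestamped .tar.gz archive inside `dest_dir`.

    Returns the path of the archive that was written.
    """
    source = Path(source)
    dest_dir = Path(dest_dir)
    if not source.is_dir():
        raise NotADirectoryError(str(source))
    dest_dir.mkdir(parents=True, exist_ok=True)

    # The timestamp keeps earlier backups from being overwritten.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"{source.name}-{stamp}.tar.gz"

    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname stores paths relative to the source directory's own name.
        tar.add(str(source), arcname=source.name)
    return archive


# Example call with hypothetical paths:
# backup_directory(Path("/scratch/myuser/project"), Path("/path/to/backup/storage"))
```

For large data sets, an incremental tool may be more practical than full archives; this sketch only shows the basic idea.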

Maintenance status of individual systems

Leo 1

Was turned off on August 14, 2015. HOME and SCRATCH data can still be accessed via Leo3 and Leo3e.

Leo 2

Was turned off on February 20, 2015. HOME data can still be accessed via Leo3.

Leo 3

Currently under maintenance contract through October 2016. An extension for another year is likely. End of Life: TBD.

Leo 3e

In production since October 2015. All Leo3 user accounts are automatically activated for Leo3e.

Mach

Maintenance contract ended on March 31, 2014. SCRATCH storage is still under maintenance. Single points of failure that could lead to complete loss of operation and data: main memory interconnect, SCRATCH storage, power supplies. Planned End of Life: estimated 3Q2018. Please note that, due to its centralized architecture, the risk of complete inoperability is higher for Mach than for other systems.

Mach2

Successor system to Mach. Under test operation as of 1Q2018.

VSC3

In regular operation including maintenance since March 2015.