Leo3e Message Of The Day

Announcement - June 2015 - Expansion of Leo3 in Test Operation

Dear Colleagues,

As announced in December 2014, a substantial expansion of the existing Leo3 compute cluster has been procured. The acquisition was financed by the Research Area Scientific Computing, by contributions from several Institutes of the LFU, and by a special grant from the Rektor of the LFU. The system was delivered in March 2015 and is going into preliminary

"Friendly User Test Operation"

as of June 2015.

Update: In October 2015, Leo3e went into regular production operation.

The following information should help you make optimal use of the new system:

  • We are using current Intel Haswell processors. These offer substantially higher performance than the Leo3 Gulftown CPUs and feature an extended instruction set. To avoid problems arising from architecture incompatibilities, the expansion system has been set up as a separate standalone cluster, named Leo3e. This includes separate $HOME and $SCRATCH directories and a separate SGE job scheduler instance. As far as possible, the setup and policies of Leo3e are similar to those of Leo3; please note the changes given below.

  • All existing Leo3 accounts have been activated for Leo3e. To access Leo3e, connect with
    ssh leo3e.uibk.ac.at
    and use your Initial Password (Anfangspaßwort) printed on your Usage Authorization (Benutzungsbewilligung). Please do not forget to change your password.
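    For example (the user name below is only a placeholder, and the standard passwd command is assumed for changing the password):
      ssh username@leo3e.uibk.ac.at
      passwd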

  • "Friendly User Test Operation" means:

    • You may use the machine as desired for any non-critical work.
    • Operation of the system or individual components may be interrupted or modified by the ZID HPC team at any time without prior notice.
    • The software installation is still incomplete.
    • Please contact the HPC team to report apparent errors or unusual observations, and to request software installations.

  • Leo3e consists of a login node and 45 distributed-memory worker nodes with 20 CPU cores each (two ten-core sockets of the Intel Xeon E5-2650 v3 Haswell-EP microarchitecture), totaling 900 cores. The cores are set to run at a constant frequency of 2.6 GHz to enable optimal performance for both statically and dynamically balanced workloads.

  • Nodes n001..n043 have 64 GB of memory each; n044 and n045 have 512 GB each, yielding a total of approximately 3.7 TB of memory. In addition, the login node has 188 GB of memory.

  • The SGE parallel environments on Leo3e support up to 20 processes per node. Please use qconf -spl to get a list of defined PEs and update your job definitions to reflect the new architecture.
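    For illustration, a minimal MPI job script adapted to the 20-core nodes might look as follows; the parallel environment and module names are assumptions only, please verify them with qconf -spl and module avail:
      #!/bin/bash
      #$ -N mpi_test
      #$ -cwd
      #$ -pe openmpi-20perhost 40      # assumed PE name; check qconf -spl for the actual names
      #$ -l h_vmem=2g                  # memory per job slot (see the note on h_vmem below)
      module load openmpi              # assumed module name; check module avail
      mpirun -np $NSLOTS ./my_program  # $NSLOTS is set by SGE to the number of granted slots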

  • Important: lower default memory allocation
    To better accommodate jobs with above-average memory requirements, the default value of the -l h_vmem resource parameter has been lowered from 2 GB to 1 GB per job slot, reducing the memory reserved for jobs that do not specify their memory requirements. If your jobs needed more than 1 GB per job slot on Leo3 but did not specify a memory requirement, you will need to add the -l h_vmem=n{m|g} parameter to your job options (e.g. -l h_vmem=2g to mimic the old behaviour). Jobs that exceed their memory allocation may fail silently, depending on how memory is allocated.
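    For example, to restore the old 2 GB per slot default for a job, add the parameter on the qsub command line (the job script name is a placeholder):
      qsub -l h_vmem=2g myjob.sge
    or put it into the job script itself:
      #$ -l h_vmem=2g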

  • The 56 Gbit/s InfiniBand network (used for MPI and GPFS) consists of two units with an inter-unit blocking factor of 2:1. It has a measured latency of 1.34 µs and a point-to-point bandwidth of 6300 MB/s per direction (including MPI overhead).

  • Scratch storage is 54 TB. Default quotas have been substantially increased (2.5 GB on HOME, 1 TB on SCRATCH).

  • Leo3e has a new generation of Intel Xeon processors, and many software packages have been installed at their current versions. Consequently, your compiled Leo3 programs will in general not run on Leo3e, or if they do, they will run at low efficiency. So, please re-compile all your CPU-intensive programs.
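    A minimal sketch of recompiling for the new architecture; the module name is an assumption and may differ on Leo3e, please check with module avail:
      module load intel                                # assumed name of the Intel compiler module
      icc -O2 -xHost -o my_program my_program.c        # -xHost targets the host CPU (AVX2 on Haswell)
    or, with GCC:
      gcc -O2 -march=native -o my_program my_program.c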

  • For the HPL benchmark running on the entire 900-core machine (n = 531840), we have measured a performance of 28.5 TFlop/s at 2.6 GHz.
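    For orientation, assuming the nominal 2.6 GHz clock and 16 double-precision floating-point operations per cycle per Haswell core, this corresponds to:
      900 cores x 2.6 GHz x 16 Flop/cycle ≈ 37.4 TFlop/s theoretical peak
      28.5 TFlop/s / 37.4 TFlop/s ≈ 76 % HPL efficiency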

  • Transition to normal operation will be announced when we are confident that the system performs well under full mixed user load.

Additional information

Please consult the Leo3e documentation page for detailed usage information.

See the Comparison of HPC Systems page to get an overview of the most important differences between the HPC systems currently available to University of Innsbruck users.

Known Problems with Leo3e

  • Intermittently, highly parallel, communication-intensive MPI test jobs crash in the setup phase or during phases of massive data interchange. Since these failures are difficult to pinpoint, we would like to know whether they also affect your applications. Reports are welcome.

Known Changes With Respect To Leo3

  • qstat with no command line arguments now defaults to qstat -u $USER
    To get the previous behaviour (as on Leo3), use qstat -u '*' (see the alias sketch after this list).
  • As noted above, SGE's default memory allocation (-l h_vmem=n) has been lowered from 2 GB to 1 GB per job slot. If a job's effective memory use exceeds the allocated h_vmem, it may fail silently.
  • Programs compiled for other systems such as Leo3 must be recompiled due to changes in the hardware and the software library infrastructure.
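
A small convenience sketch for the qstat change noted above, assuming a bash login shell:

    alias qstatall="qstat -u '*'"     # e.g. in ~/.bashrc: list the jobs of all users, as qstat did on Leo3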

Reminder

In addition to Leo3e, the Vienna Scientific Cluster, which has been co-financed by the University of Innsbruck, is available for medium- to large-scale computing projects. Please make use of this important resource.