What is Anaconda? Why Anaconda?

Anaconda is a comprehensive, curated, high-quality, and high-performance distribution and package manager for open source software such as Python, R, and many associated packages, aimed at scientific users. It is available for Linux, Windows, and macOS. With the help of dedicated Conda environments, extensive collections of R and Python tools can be installed easily. Conda's version and dependency management ensures that the individual components within an environment are compatible with each other. Moreover, users can easily install their own specialized toolsets into private Conda environments.

Anaconda is also the name of one of these environments: a collection of more than 400 packages suitable for many data analysis and science tasks. See below for details.

Starting with the 2020-03 release, we have based our Anaconda installation on the Miniconda installer; with 2022-01 and later, we use the Mamba installer from Conda-Forge.

To avoid version conflicts and in keeping with Conda's environment concept, we install individual toolsets into separate environments.

Pre-Installed (Ana)conda modules and environments

Discover all our pre-installed versions of (Ana)conda modules by issuing

module avail Anaconda3

or enter

module avail Anaconda3/2023.03

to get a list of the newest components:

miniconda-base-2023.03
Conda's base environment - your starting point for individual user extensions.

python-3.10.9-anaconda-2023.03
The Python 3.10.9 Anaconda metapackage with an extensive collection of more than 400 packages from the Anaconda default channel. Among these are NumPy, SciPy, pandas, and Matplotlib; frontends such as IPython, JupyterLab, and Spyder; compilers/optimizers (e.g. Cython and Numba); and many more packages for science and data analysis. The included Intel MKL numerical library ensures optimal performance of numerical methods. This environment might be all you need if you plan to use Python for general-purpose scientific data processing and simulations. Our tests have shown that Anaconda's Python outperforms locally compiled versions as well as versions installed from Conda-Forge.

python-3.10.10-anaconda+mpi-conda-2023.03 obsolete (*)
python-3.10.10-anaconda-mpi-conda-2023.03
Essentially the same as above, but with Python 3.10.10 and with MPI functionality added via the mpi4py package.
Note: The MPI functionality supplied with this package can run distributed MPI jobs only on clusters with the Slurm load manager.
To use MPI on Leo5, you must use the srun --mpi=pmi2 command to start your parallel tasks across multiple nodes.
(*) Note: Module names containing a "+" character no longer work with newer versions of the Environment Modules software (e.g. as installed on LEO5). All affected module names have been duplicated, with "+" replaced by "-".

python-3.11.0-numpy-conda-2023.03
A collection of 320+ packages similar to Anaconda, but from the Conda-Forge channel.

python-3.11.0-numpy+mpi-conda-2023.03 obsolete
python-3.11.0-numpy-mpi-conda-2023.03
Same as above, but with MPI functionality added. Note: This MPI installation works only on clusters with the Slurm load manager.
To use MPI on Leo5, you must use the srun --mpi=pmi2 command to start your parallel tasks across multiple nodes.

r-3.6.0-conda-2023.03
The R statistics software with 318 R libraries and the RStudio IDE from the Anaconda default channel. This version appears a bit dated, but it is the most recent R implementation available in that channel as of March 2023.

r-4.2-conda-2023.03
The 4.2 version of the R statistics software with 455 packages, including the newest version of RStudio, from the Conda-Forge channel.

tensorflow-*
Various versions of the TensorFlow machine learning framework in CPU and GPU variants from Anaconda and Conda-Forge. We recommend testing these for functionality and performance before doing production work.

pytorch-1.13.1-conda-2023.03
PyTorch (a machine learning library and GPU-capable replacement for NumPy), downloaded from the pytorch Conda channel - this version runs on GPUs as well as CPUs.

...
More environments can easily be added to your own account (see below), or centrally upon request if they are of sufficient general interest.

Using Anaconda On The Leo Systems

The preinstalled Anaconda collections for Python and R are quite comprehensive and may satisfy many of your needs. This is covered in the following section on Using Pre-Installed Environments.

If you need to install your own packages, proceed to the section describing Conda extensions below.

Using Pre-Installed Environments

To discover which Anaconda environments are installed, issue the command

module avail Anaconda3

For new projects, usually the most recent version is appropriate. The 5.x versions are now considered legacy. Python2 is no longer supported.

To use any of the Anaconda environments, upon login or at the beginning of a batch job, first issue

module load Anaconda3/yyyy.mm/module-name-yyyy.mm

This will let you use the software provided with the respective module, and a restricted version of the conda command.

After loading your module, issue

conda list

to see which packages are available in the current environment.
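For example, to load the 2023.03 Python environment listed above and see its contents (adapt the module name to the version you need):

module load Anaconda3/2023.03/python-3.10.9-anaconda-2023.03
conda list

The same pattern applies to any of the other pre-installed environments.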

Extending Anaconda For Your Own Needs: Create and Install Your Own Environments

If the pre-installed environments do not meet your needs, you should create one or more Conda environments and install your desired packages into these environments.

Conda environments are similar to Python virtual environments, but they extend and completely replace their functionality. Attempting to mix the two types of environment can lead to conflicts and inconsistencies, so if you use our Anaconda installation, it is best to stick consistently with Conda's environment management.

Important: One-time preparation before you start

Environments are downloaded and built in your $HOME/.conda directory. This directory can easily grow to many gigabytes, overflowing your $HOME quota. We therefore recommend that you create .conda in your $SCRATCH directory and set up a symbolic link $HOME/.conda -> $SCRATCH/.conda.

First make sure that $HOME/.conda does not exist yet:

 $ ls -ld $HOME/.conda
ls: cannot access /home/cxxx/ccxxxyyyy/.conda: No such file or directory

If the symlink exists, you are done. Else, if $HOME/.conda is a directory, decide if you want to rename it (mv -i $HOME/.conda $HOME/.conda-backup), remove it (rm -rf $HOME/.conda), or move it to $SCRATCH (mv -i $HOME/.conda $SCRATCH).

Now, create the new Conda directory in $SCRATCH if it does not exist yet

mkdir -p $SCRATCH/.conda

Finally, create the symbolic link

ln -s $SCRATCH/.conda $HOME/.conda

At this point, you may also want to redirect your .cache directory to $SCRATCH (assuming it currently contains no valuable data):

cd $HOME
rm -r .cache
mkdir -p $SCRATCH/.cache
ln -s $SCRATCH/.cache $HOME/.cache

Note that you are responsible for backup of your own data in $SCRATCH. Since it is easy to recreate environments, it is usually sufficient to record the necessary steps for successfully creating your environment(s).

At the beginning of each shell session or batch job

All of the following requires that you load and activate the Anaconda3/miniconda-base module. Note that the other Anaconda3 modules supply a conda command with reduced functionality and cannot be used to build reliable environments, even if at first glance everything seems to work fine.

module load Anaconda3/yyyy.mm/miniconda-base-yyyy.mm
eval "$($UIBK_CONDA_DIR/bin/conda shell.bash hook)"

This loads and activates the Miniconda base environment and its shell integration. Now conda is a shell function which allows manipulation of your session's environment variables, creation of new environments, and activation of existing environments with conda mechanisms.
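For the 2023.03 installation listed above, these two lines become, for example:

module load Anaconda3/2023.03/miniconda-base-2023.03
eval "$($UIBK_CONDA_DIR/bin/conda shell.bash hook)"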

Note:

  • Never use conda init because this will modify your $HOME/.bashrc in a way which is incompatible with loading and unloading modules from our modules environment.
  • To obtain a list of all environments (preinstalled and your own) issue the command
    conda env list
  • To obtain a list of installed packages in the currently active environment, issue
    conda list
  • To obtain a list of packages in any given environment, issue
    conda list -n environment
  • If you regularly use the Anaconda3 installation, you might want to add the following definition to your $HOME/.bashrc file:
    py(){
    module load Anaconda3/yyyy.mm/miniconda-base-yyyy.mm
    eval "$($UIBK_CONDA_DIR/bin/conda shell.bash hook)"
    }
    This allows you to automate these steps by just entering py as a command in your shell session or batch scripts.

Steps To Create And Install Your Own Environments

To keep installations apart and minimize possible version conflicts, we recommend creating separate environments for different projects requiring disparate packages.

For the following, make sure that you have performed the steps described above. The steps described below will yield reliable results only if you load one of our provided miniconda-base environment modules.

As a basic rule, try to install as many packages as possible using conda, to get optimized versions. The remaining packages can then be installed into your environment using the Conda version of Python's pip command or R's install.packages() function (see below).

For all of the following, first, load and activate the Miniconda base environment as described above.

Create Your New Conda Environment

You typically should have a good notion of the packages necessary to run your software, either from the prerequisites section of the software documentation, or from the import statements of your programs.

Given a tentative list of required packages, create your new environment:

conda create -n myenvironment package1 [package2 ...] 

Substitute a good name for myenvironment. Your environment will be created in $HOME/.conda/envs/myenvironment, and all packages requested on the command line are installed into your new environment.
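For example, a hypothetical environment for data analysis work (the name datasci and the package selection are placeholders):

conda create -n datasci numpy scipy pandas matplotlib jupyterlab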

Then activate your environment

conda activate myenvironment

and try to run your software. You may also want to explicitly look for missing components by trying to invoke them:

For all needed shell commands name (e.g. ipython or spyder), issue
which name
and note which commands are not found.

For Python packages, write small test programs containing
import name
statements and note which packages cannot be found.

Likewise proceed for R packages with
library('name')
statements.
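The Python check can be automated with a small shell loop; this is a sketch, with placeholder package names:

for name in numpy scipy pandas; do
  python -c "import $name" >/dev/null 2>&1 && echo "$name OK" || echo "$name MISSING"
done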

Identify Missing Packages Which Can Be Installed By Conda

For each required/missing package name, issue

conda search name

Take note of whether name was found. When your list is complete, recreate your installation using the new set of packages:

conda activate base
conda env remove -n myenvironment
conda create -n myenvironment package1 [package2 ...]
conda activate myenvironment

While it is possible to incrementally install more packages after creating and activating an environment, installing all packages at creation does a much better job at avoiding version conflicts.

Repeat this process until no new packages can be installed.

If any Python modules or R libraries are still missing after this process (i.e. they could not be installed using Conda), decide if you want to switch to the conda-forge channel (by adding -c conda-forge to your conda create command; this channel has far more packages than Anaconda's default but is not curated for quality) or if you want to install them using the conda-specific version of pip (Python) or install.packages() (R) into your activated Conda environment.
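For example, to recreate the environment from the conda-forge channel (names are placeholders as before):

conda create -n myenvironment -c conda-forge package1 [package2 ...]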

Add Missing Components Using Pip

If, after installing required packages using Conda (and possibly switching to the conda-forge channel), any packages are still missing, these can be installed into your Conda environment using Conda's version of pip.

First, activate your environment if you have not done so

conda activate myenvironment

and use Conda to install pip and Conda's compiler environment:

conda install pip gcc_linux-64 gxx_linux-64 gfortran_linux-64

The compilers are necessary because the OS compiler versions may be inconsistent with Conda's expectations, possibly leading to compile or runtime errors.

Then, do not create a Python virtual environment (as you would normally do outside Conda), but simply use Conda's pip to install the remaining packages into your active Conda environment:

pip install package1 [package2 ...]

This may result in a large number of dependencies being installed automatically, many of which could have been installed by Conda instead. To check, add the packages that pip would install to your list of candidate Conda packages and destroy your existing environment:

conda deactivate
conda env remove -n myenvironment

... and repeat above Conda installation with your enhanced list. Often, this leaves very few packages to be installed with pip, making optimal use of Conda's performance optimizations.
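Put together, a hypothetical final installation might look like this (all package names are placeholders):

conda create -n myenvironment numpy pandas pip gcc_linux-64 gxx_linux-64 gfortran_linux-64
conda activate myenvironment
pip install some-pypi-only-package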

Note

After using pip in a given Conda environment for the first time, you should no longer use Conda to install further packages into that environment. Should this become necessary, simply start over by creating a new Conda environment and proceeding as described above.

Final Cleanup

When you are done installing Conda packages, you may, with your environment still active, use

conda clean --all --yes

to remove unneeded installation material from your environments.

Various Hints

Mamba

If you install your own packages, you may want to try mamba [create|install] instead of conda. Mamba is much faster than conda, but it uses a different resolver and is at an early stage of development, so your mileage may vary. We have also managed to clobber existing installations with it, so be prepared to start from scratch after a failed install.
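For example, using the placeholder names from above, the creation step becomes:

mamba create -n myenvironment package1 [package2 ...]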

Alternate Channels

If your software cannot be installed from the default Anaconda repository, you may install it either using pip as described above, or you may want to try a different channel using the -c channel argument for conda create.

A very popular channel is conda-forge, which contains a huge number of additions over Anaconda's default channel as well as more recent versions of packages also found in default. Note, however, that conda-forge appears to have no quality monitoring, so you may end up with unreliable or poorly performing packages.

Another channel which contains many Bioinformatics packages is bioconda.
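For instance, a hypothetical bioinformatics environment combining both channels might be created with:

conda create -n bio -c bioconda -c conda-forge samtools bwa

where bio is a placeholder name and samtools/bwa stand in for your actual packages.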

Sample Job Fragment

If you have created your own environment(s), you may use the following commands as a template

module load Anaconda3/yyyy.mm/miniconda-base-yyyy.mm
eval "$($UIBK_CONDA_DIR/bin/conda shell.bash hook)"
conda activate myenvironment

Using MPI With Python

Slurm clusters only. Anaconda's OpenMPI implementation does not integrate with Slurm by default. We have installed, and recommend, the MPICH implementation instead.

Using MPI With Pre-Installed Environments

Several pre-installed environments contain a functional MPI integration. These have the string -mpi (formerly +mpi, now obsolete) in their names. Please see the above list of pre-installed environments, then proceed to the section on starting MPI programs below.

Using MPI With Your Own Environments

If you need to create your own environments, please include the following into your conda create command:

If you use the Conda-Forge channel:

conda create -n myenv -c conda-forge [your packages ...] mpi4py mpi=*=mpich

If you use the Anaconda default channel, the MPICH package installed by default is a non-functional dummy, and you need to explicitly specify the build. As of March 2023, the following appears to work fine:

conda create -n myenv [your packages ...] mpi4py mpi=*=mpich mpich=3.3.2=hc856adb_0

If this is incompatible with versions of other packages that you install, you may need to conda search mpich and pick a compatible version.

Starting MPI Programs on Slurm Clusters

In both cases, Anaconda's mpirun command is able to start processes only on your local node. To start MPI processes across multiple nodes, use the following Slurm command in your batch job:

srun --mpi=pmi2 your-script.py [...]

Omitting the --mpi=pmi2 option will cause all MPI tasks to be incorrectly started with rank 0, so please do not forget to include this option in your batch scripts.
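Putting the pieces together, a minimal Slurm batch fragment for an MPI-enabled Python program might look like this (module, environment, and script names are placeholders):

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --time=01:00:00

module load Anaconda3/yyyy.mm/miniconda-base-yyyy.mm
eval "$($UIBK_CONDA_DIR/bin/conda shell.bash hook)"
conda activate myenv
srun --mpi=pmi2 python3 ./my-mpi-program.py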

Using Mpi4Py Futures on UIBK Slurm Clusters

mpi4py.futures is an MPI-enabled implementation of Python's concurrent.futures module. It is capable of dynamic process management; however, due to Slurm's static resource allocation, this capability is of no practical use on our clusters. The following fragments show how to correctly start mpi4py.futures code using resources allocated by Slurm.

Python program myprogram.py

#!/usr/bin/env python3
from mpi4py.futures import MPICommExecutor

def myworker(arg):
  # do work on an individual arg
  # note that this function must be defined in both the __main__ and
  # __worker__ namespaces
  result = arg   # placeholder - replace with your actual computation
  return result

if __name__ == '__worker__':
  # optionally put worker-specific code here
  pass

if __name__ == '__main__':
  args = range(100)  # replace with your iterable of input arguments

  with MPICommExecutor() as executor:
    for arg, result in zip(args,
                           executor.map(myworker, args)):
      print(arg, result)  # replace with your result processing

Inside a Slurm allocation or batch job, start this code using the construct

srun --mpi=pmi2 python3 -m mpi4py.futures ./myprogram.py

Note that mpi4py.futures has to be specified both in the program and on the command line. Direct invocation of the script without python3 -m mpi4py.futures would cause all tasks to be started in the __main__ namespace.

The worker function runs concurrently in as many processes as srun starts, and the executor loop body processes the results as they become available. Results are returned in input order unless the keyword argument unordered=True is passed to map.

More On Environments

  • You may also create a Conda environment in a non-standard location using conda create -p path/to/env (see the example below this list). Such environments are not listed by conda env list, so you need to keep track of them yourself.
  • Environments do not nest. While conda deactivate takes you back to a previously activated environment, conda activate newenv replaces the currently active environment with newenv.
  • For details, see the Conda Managing environments documentation.
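For example, to create and use an environment in your scratch area (path and names are placeholders):

conda create -p $SCRATCH/envs/myenv python numpy
conda activate $SCRATCH/envs/myenv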

Using Your PC To Display a Jupyter Notebook Running On A Server

You can start a JupyterLab process on a Leo login node or in a batch job and display its GUI in a browser window on your PC.

Interactive Session on a LEO Login Node

The following instructions work for both Windows and Linux workstations.

  1. On your PC start an SSH session to your selected Leo login node.
    Linux: start a terminal window and enter
    ssh [user@]leon
    Windows: connect to a LEO login node using PuTTY.
  2. In the Leo SSH session:
    1. load Anaconda (module load Anaconda3/xxxxxx) and activate your environment as needed.
    2. Start JupyterLab:
      jupyter-lab --no-browser
  3. JupyterLab will display several URLs. Identify the following URL:
    http://localhost:88xy/lab?token=zzzzzzzzzzzzzzzzzzzzz
  4. On your PC, start another terminal window (Linux), a WSL window (if WSL is installed), or a CMD window (Windows).
  5. In this window, set up an SSH tunnel to your JupyterLab session:
    ssh -N -L localhost:9001:localhost:88xy user@leon
    The port number 9001 is arbitrary. Pick any port which is not in use. If you want to have several JupyterLab sessions, we suggest using sequential numbers 9001, 9002, ...
  6. Start a new browser window on your PC.
  7. Copy and paste the above URL into your browser's address field, replacing the port number 88xy with your local port number (e.g. 9001). Verify that your URL looks like
    http://localhost:9001/lab?token=zzzzzzzzzzzzzzzzzzzzz
    and hit Enter. Your Jupyter session should be displayed.

Jupyter Proxy Script For Linux Workstation

You can automate the creation of the tunnel (including automatic selection of a free port) by creating the following Python script jupyter-proxy in $HOME/bin on your Linux workstation:

cat > $HOME/bin/jupyter-proxy <<'EOF'
#!/usr/bin/env python3

import re, sys, socket, subprocess

def usage(msg = 'error'):
  print(f"""jupyter-proxy: {msg}
usage: jupyter-proxy [ -u user ] host [ node ] url""")
  exit(2)

if len(sys.argv) < 2:
  usage()

command = [ 'ssh', '-N' ]

args = sys.argv[1:]

while len(args) > 0 and args[0][0] == '-':
  if args[0] == '-u':
    command.extend(['-l', args[1]])
    args = args[2:]
  else:
    usage(f'invalid option: {args[0]}')

if len(args) == 3:
  command.extend(['-J', args[0]])
  args = args[1:]

if len(args) == 2:
  jupyterhost = args[0]
  url = args[1]
else:
  usage('missing or surplus arguments')

match = re.fullmatch(r'http://localhost:(\d+)/lab\?token=(\w+)', url)
if match is None:
  usage(f'invalid JupyterLab URL {url}')

jupyterport = match.group(1)
token = match.group(2)

with socket.socket() as s:
  s.bind(("", 0))
  localport = s.getsockname()[1]

newurl = f'http://localhost:{localport}/lab?token={token}'

command.extend(['-L', f'localhost:{localport}:localhost:{jupyterport}', jupyterhost])

print(f"""Paste the following URL into your browser

{newurl}

When finished:
  shut down kernel
  close browser window
  hit <ctrl-c> to interrupt this program
""")
subprocess.call(command)
EOF
chmod +x $HOME/bin/jupyter-proxy

Usage:

jupyter-proxy [ -u user] leon http://localhost:88xy/lab?token=zzzzzzzzzzzzzzzzzzzzz
                ^        ^    ^
                |        |    |
                |        |    paste this from the jupyter-lab output
                |        cluster on which batch job was submitted
                needed when user name on cluster differs from local user

This also works on Windows PCs with WSL or Cygwin installed.

Using Your PC To Display a Jupyter Notebook Running in a Slurm or SGE job

You can also start a Jupyter kernel in a Leo batch job and display its dialog in a browser window on your PC.

Sample Job SGE (LEO3e, LEO4)

#!/bin/bash
#$ -q std.q
#$ -N jupyter
#$ -pe openmp 2
#$ -l h_vmem=4G
#$ -j yes
#$ -l h_rt=4:00:00

cat $0

module purge
module load Anaconda3/2023.03/python-3.10.9-anaconda-2023.03

echo "START: $(date)"
echo ${USER}@$(hostname)
jupyter-lab --no-browser
echo "END: $(date)"

Sample Job Slurm (LEO5, LCCn)

#!/bin/bash

#SBATCH -J jupyter
#SBATCH --export=NONE
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --threads-per-core=1
#SBATCH --hint=nomultithread
#SBATCH --mem=8G
#SBATCH --time=04:00:00

cat $0

module purge
module load Anaconda3/2023.03/python-3.10.10-anaconda-mpi-conda-2023.03

echo "START: $(date)"
echo ${USER}@$(hostname)
srun --export=ALL --cpus-per-task=$SLURM_CPUS_PER_TASK --unbuffered jupyter-lab --no-browser
echo "END: $(date)"

Adapt either of the sample job scripts according to your needs. Using srun to start JupyterLab ensures that CPU affinity is limited to the CPUs you actually requested; otherwise, your job will use multiple hardware threads even if you requested one thread per core. The --unbuffered option of the jupyter-lab command in the Slurm example is optional and speeds up the display of job output.

Then proceed as follows:

  1. On your PC start an SSH session to your selected Leo login node.
    Linux: start a terminal window and enter
    ssh [user@]leon
    Windows: connect to a LEO login node using PuTTY.
  2. Submit your job script.
  3. Wait until your job has started, then look at its output. It should contain the user@host close to the top, and a URL of the form
    http://localhost:88xy/lab?token=zzzzzzzzzzzzzzzzzzzzz
    close to the bottom.
  4. On your PC, start another terminal window (Linux), a WSL window (if WSL is installed), or a CMD window (Windows).
  5. In this window, set up an SSH tunnel to your JupyterLab session on the worker node nxxx, using the login node leon as a jump host:
    ssh -N -L localhost:9001:localhost:88xy -J user@leon user@nxxx
    The port number 9001 is arbitrary. Pick any port which is not in use (if it is, you will get an error message). The user@ part is optional if your local user name is the same as on the remote machine. If you want to have several JupyterLab sessions, we suggest using sequential numbers 9001, 9002, ...
  6. Start a new browser window on your PC.
  7. Copy and paste the above URL into your browser's address field, replacing the port number 88xy with your local port number (e.g. 9001). Verify that your URL looks like
    http://localhost:9001/lab?token=zzzzzzzzzzzzzzzzzzzzz
    and hit Enter. Your Jupyter session should be displayed.

Instead of constructing the ssh command manually, you may also use the above proxy script with the extended command line

jupyter-proxy [ -u user] leon nxxx http://localhost:88xy/lab?token=zzzzzzzzzzzzzzzzzzzzz
                ^        ^    ^    ^
                |        |    |    |
                |        |    |    paste this from the jupyter-lab output
                |        |    node on which batch job was actually started
                |        cluster on which batch job was submitted
                needed when user name on cluster differs from local user

Recommendations and Caveats

  • Should you run your JupyterLab session on the login node or in a batch job?
    This depends on what you are planning to do in your session: If you are doing development work with long idle periods and only brief calculations, use the login node. If you do production work with substantial CPU and memory usage, start JupyterLab in a batch job.
  • Please do not forget to terminate your Jupyter session (Menu "File/Shutdown") and the SSH tunnel (CTRL-C) after use. Remember that an idling job blocks resources from productive use.
  • GPUs are a particularly scarce resource and cannot be easily shared. If your code uses GPUs (e.g. tensorflow or pytorch), make sure that it does not grab both GPUs on the login node, and also make sure to terminate your JupyterLab session when you are done.
  • TBD: ensure resource usage enforcement in batch job.

Documentation and Notes

License Information

Use of Anaconda is free under certain conditions. Please read the Terms Of Service.

Anaconda And Conda Web Sites

Python 2 Legacy Code

We no longer support Python2 because it is obsolete as of January 2020. See the "Sunsetting Python 2" article for background information.

If you still have legacy code written in Python2, it will likely be possible to automatically convert large portions of your code using tools such as 2to3. Since Python2 and Python3 have a few semantically undecidable incompatibilities (e.g. string handling, generator functions vs. functions returning lists), you may need to apply a few manual corrections after automatic conversion to get your code to run and perform well. In our experience with a few projects, the effort for a successful conversion is not very high.

Links To Other Noteworthy Anaconda Installations

Notes

This is a generic binary installation that should work on all of our microarchitectures. Should any of your Anaconda based jobs or processes abort with Illegal instruction, please let us know, and we will try to fix this.

If a conda clean command fails because it tries to remove material from our shared installation, please let us know so we can correct this situation.
