
GPU basics

The Helios cluster has 3 GPU nodes. Each of these nodes contains 4 NVIDIA H100 GPUs. Note that many of the GPUs are MIG-partitioned, which means they are split into smaller GPU instances.

Getting started

To see which GPU (gres) resources are available per partition on the cluster:

scontrol show partition
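
If you are only interested in a single partition, you can pass its name to scontrol (here the gpu partition from the salloc example further down is assumed):

scontrol show partition gpu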

To see which GPU (gres) resources are available per node:

sinfo -o "%20N %10c %10m %100G"

MIG devices use a naming convention of the form <N>g.<M>gb. The first number (N) shows how many compute slices are available, and the second number (M) shows how much memory (in gigabytes) is assigned. For example, 1g.5gb means 1 compute slice with 5 GB of memory.

If you are on a GPU node, you can get information about the GPUs on that node and an overview of GPU processes with the following command:

nvidia-smi

Requesting GPU resources and billing

To request a GPU in your Slurm sbatch or salloc command, add the --gres parameter with one of the GPU types listed by the commands above. For example:

salloc --partition=gpu --cpus-per-task 8 --time 12:00:00 --mem 52G --gres=gpu:2g.24gb:1
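
The same request can be made in a batch script. Below is a minimal sketch of an sbatch script; the job name, output file and the final python command are placeholders that you would replace with your own workload:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --cpus-per-task=8
#SBATCH --time=12:00:00
#SBATCH --mem=52G
#SBATCH --gres=gpu:2g.24gb:1
#SBATCH --job-name=my_gpu_job        # placeholder job name
#SBATCH --output=my_gpu_job_%j.out   # placeholder output file

# Show which GPU (or MIG slice) was assigned to this job
nvidia-smi

# Placeholder for your actual workload
python my_training_script.py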

Resources are billed based on the largest percentage of any single resource that you use on a node. For example, if you use one H100 GPU (a quarter of a node) and half of the node's memory, you are billed based on memory, because that is the largest fraction. It is therefore important to know the maximum number of CPUs and amount of memory that are included with each GPU type; these are listed in the table below. If you request more CPUs or RAM than listed in the table, you will be billed for the extra, but requesting less does not lower your bill.

GPU type    CPUs    Memory (GB)
1g.12gb     4       26
2g.24gb     8       52
4g.47gb     18      100
H100        32      188
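
As an illustration of the billing rule above, the request below pairs a 2g.24gb device with 100 GB of memory instead of the included 52 GB; the memory then becomes the largest fraction of the node, so the job is billed based on memory rather than on the small GPU slice (the time limit is just an example):

salloc --partition=gpu --cpus-per-task 8 --time 4:00:00 --mem 100G --gres=gpu:2g.24gb:1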

Loading CUDA libraries

When using TensorFlow, it is recommended to install it together with the correct CUDA libraries through pip:

pip install tensorflow[and-cuda]
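
After installation, you can quickly check from the command line whether TensorFlow detects a GPU (run this inside a job that has a GPU allocated):

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"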

When using other Python packages, conda is a popular tool to install CUDA libraries.
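
A possible sketch, assuming you use the conda-forge channel; the exact channel, package names and versions depend on what your Python packages require, so check their documentation first:

conda install -c conda-forge cudatoolkit=11.8 cudnn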

It is also possible to load CUDA libraries through the module environment. It is strongly recommended to specify the version as well, since the default changes over time. For example:

module load cuda/12.9.1
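
To see which CUDA versions are installed on the cluster, you can list the available modules first:

module avail cuda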

Containers

When running containers that require access to GPU resources, extra parameters need to be specified, depending on the container technology.

Apptainer (formerly Singularity) requires the --nv flag to access the GPU. For example:

apptainer run --nv <container_name>
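
To verify that the GPU is visible inside the container, you can for example run nvidia-smi in it (<container_name> is your own image, as above):

apptainer exec --nv <container_name> nvidia-smi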

Podman requires a device specification to access the GPU. For example, to run nvidia-smi inside an Ubuntu Podman container, run:

podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu nvidia-smi

Monitoring GPU usage

Note

Monitoring GPU usage is only possible on the full H100 cards. Unfortunately, MIG devices do not provide profiling information.

It is strongly recommended to check the GPU usage of your pipeline. GPUs are expensive, and you want them to be fully used when you allocate them. Two important benefits of profiling are:

  • Identifying bottlenecks in your pipeline: for example, if the GPU stalls for a significant amount of time while loading data from a slow disk. Finding and resolving these bottlenecks will speed up your pipeline.
  • Finding the best GPU type for your pipeline so you can reduce cost. Because you can only profile H100 GPUs, not the smaller MIG devices, this functionality is somewhat limited: essentially, you can test your code on an H100 and check whether a smaller GPU is likely to suffice.

Interactive profiling of the GPU

You can use nvtop to see a graph of GPU core usage and GPU RAM usage while your code is running.
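
If your job is already running, one way to start nvtop on that job's node is to open an extra interactive step inside the existing allocation. A sketch, assuming your Slurm version supports --overlap; replace <jobid> with your job ID:

srun --jobid=<jobid> --overlap --pty nvtop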

Logging of resource usage

With the nvidia-smi command you can get GPU and GPU-RAM utilization numbers over time. The following code starts tracking these resources in the background and writes them to a file. Note that you can change the update interval; it is set to 10 seconds below.

# Function to cleanup all background processes
cleanup() {
    echo "Cleaning up background processes..."
    jobs -p | xargs -r kill
}

# Call the cleanup function when the script exits
trap cleanup EXIT INT TERM

# Start GPU monitoring in background
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used,temperature.gpu --format=csv,nounits -l 10 > gpu_stats_${SLURM_JOB_ID}.log &
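
A sketch of how this snippet could be combined with your actual workload in a job script; the trap ensures the background nvidia-smi process is stopped when the job finishes (the python command is a placeholder):

# (the cleanup function, trap and nvidia-smi line from above go here)

# Run your actual workload; the monitoring keeps logging in the background
python my_training_script.py   # placeholder for your own command

# When the script exits, the trap calls cleanup and stops the background nvidia-smi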