Setup MFF UK GPULab
Prerequisites
- Get a CAS login.
- Log in to gpulab with your CAS login; follow this guide.
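Logging in is a standard SSH session with your CAS credentials; the hostname below is only a placeholder, the actual address is in the linked guide:
ssh <cas-username>@<gpulab-login-node>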
Charliecloud image
- Log into a GPU node
srun -p gpu-ffa --gpus=1 --time=5:00:00 --pty bash
- Pull the GPU-enabled TensorFlow Docker image with Charliecloud
ch-image pull tensorflow/tensorflow:latest-gpu
- Convert the Docker image to a Charliecloud image stored as the directory ./my-tf
ch-convert -i ch-image -o dir tensorflow/tensorflow:latest-gpu ./my-tf
- Import the CUDA libraries from the host
ch-fromhost --nvidia ./my-tf
- Launch the container
ch-run -w -c /home/jankovys --bind=/home/jankovys -u 0 -g 0 ./my-tf -- bash
The command above launches the container with the working directory /home/jankovys and with write access to the container (-w). The --bind=/home/jankovys option binds the /home/jankovys directory on the host to the /home/jankovys directory in the container. The -u 0 -g 0 options run the container as the root user. The -- at the end of the command tells ch-run that the command to run in the container follows.
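The paths above belong to one particular user; if your home directory differs, the same command generalizes as in this sketch (substitute your own username and the path to your image):
# Sketch: launch the container with your own home directory bound into it
ch-run -w -c /home/<your-username> --bind=/home/<your-username> -u 0 -g 0 ./my-tf -- bash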
- Verify the GPU support
nvidia-smi
TensorFlow
If you want to install TensorFlow yourself, it must be the same version as the TensorFlow already present in the Charliecloud image (see the version-check sketch after this list).
- Upgrade pip
pip install --upgrade pip
- Create a virtual environment
python -m venv venv
source venv/bin/activate
- Install tensorflow
pip install tensorflow
- Verify the installation
python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
You should see something like this:
tf.Tensor(0.0, shape=(), dtype=float32)
- Verify GPU support
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
You should see something like this:
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
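As noted above, the TensorFlow installed into the virtual environment should match the version shipped in the image. A quick way to check is this sketch, run inside the container before activating the virtual environment:
# Print the TensorFlow version bundled with the image
python -c "import tensorflow as tf; print(tf.__version__)"
# Pin the same version when installing into the venv (replace X.Y.Z with the printed version)
pip install tensorflow==X.Y.Z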
Running scripts
To run a Python script runMe.py, create a file runMe.sh with the following content:
#!/bin/bash
#SBATCH --partition=gpu-ffa # partition you want to run job in
#SBATCH --gpus=1 # number of GPUs
#SBATCH --mem=16G # CPU memory resource
#SBATCH --time=12:00:00 # time limit
#SBATCH --cpus-per-task=8 # CPUs per task
#SBATCH --job-name="run_conda" # change to your job name
#SBATCH --output=/home/jankovys/JIDENN/out/%x.%A.%a.log # output file
ch-run -w --bind=/home/jankovys -c /home/jankovys/JIDENN /home/jankovys/my-tf -- bash_scripts/runner_inside.sh
and a file runner_inside.sh with the following content:
#!/bin/bash
venv/bin/python runMe.py
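Note that ch-run executes runner_inside.sh directly, so the file must be executable; if the job fails with a permission error, marking the script executable on the host (using the paths from the example above) should fix it:
chmod +x /home/jankovys/JIDENN/bash_scripts/runner_inside.sh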
To import the CUDA libraries, you need to start the sbatch job from a working node:
srun -p gpu-ffa --gpus=1 --pty bash
sbatch runMe.sh
The runMe.sh script will automatically allocate a working node with the selected resources, launch the container, and run the runner_inside.sh script. The runner_inside.sh script runs the runMe.py script with the virtual environment's Python interpreter and with CUDA support.
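Once the job is submitted, the usual Slurm commands can be used to keep an eye on it (a quick sketch, assuming standard Slurm tooling on the cluster):
# List your queued and running jobs
squeue -u $USER
# Follow the job's log file (the path set by #SBATCH --output in runMe.sh)
tail -f /home/jankovys/JIDENN/out/<job-name>.<job-id>.<array-index>.log
# Cancel a job if needed
scancel <job-id>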
To run the script in an interactive session, run the following commands:
srun -p gpu-ffa --gpus=1 --time=5:00:00 --pty bash
ch-run -w -c /home/jankovys --bind=/home/jankovys -u 0 -g 0 ./my-tf -- bash
source venv/bin/activate
python runMe.py
Troubleshooting
Could not load library libcudnn_cnn_infer.so.8 error
When using a CNN in TensorFlow you might get the following error:
Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory
To fix this, create a symbolic link inside your Charliecloud image:
ln -s /usr/local/cuda/targets/x86_64-linux/lib/libcuda.so.1 /usr/local/cuda/targets/x86_64-linux/lib/libcuda.so
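The link has to end up inside the image directory, so one option is to create it through ch-run with the image writable, a sketch reusing the image path and root mapping from above:
# Create the missing libcuda.so symlink inside the writable image
ch-run -w -u 0 -g 0 ./my-tf -- ln -s /usr/local/cuda/targets/x86_64-linux/lib/libcuda.so.1 /usr/local/cuda/targets/x86_64-linux/lib/libcuda.so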