Building Docker image for the RCP
To build a Docker image for the RCP, we will use the LiGHT cluster template.
Prerequisites: - You need docker on your computer - A Linux environment (as we are building docker images that target Linux) - Good specs, building docker images require lots of resources in RAM and can take lots of space in your disk. Make sure that your computer is "good enough"
Setup
On your machine:
The repository contains tutorials and reproducibility scripts for many clusters, but we will only focus on the RCP in thus tutorial.
We will build the docker images in the following way:
- Build a generic docker image that contains all the dependencies (NVidia drivers, pip, uv, git, etc.)
- Build many user docker images extending this generic image with the right permisions
This folder contains a template.sh
file which contains all the available functions
Creating a project on the EPFL registry (Optional)
EPFL has a registry of docker images which works in a similar way to the Docker hub. This registry is used to store docker images and is only accessible through the EPFL VPN.
Connect to registry.rcp.epfl.ch with the VPN. To push new docker images, you need to create a Project. At the time of this tutorial, some projects have already been created like multimeditron and you can ask Michael to add you to the project so that you can also push Docker images to this project.
You also have the possibility of creating your own project by clicking on "New project".
IMPORTANT: Beware that if you choose to create your own project, you will have to change the link of the docker images in the scripts accordingly (mainly replacing multimeditron/basic by the right path)
Generating the .env file
Run the following command:
This command create a (hidden) .env
file which looks like this:
# All user-specific configurations are here.
## For building:
# Which docker and compose binary to use
# docker and docker compose in general or podman and podman-compose for CSCS Clariden
DOCKER=docker
COMPOSE="docker compose"
# Use the same USRID and GRPID as on the storage you will be mounting.
# USR is used in the image name and must be lowercase.
# It's fine if your username is not lowercase, jut make it lowercase.
USR=john
USRID=1000
GRPID=984
GRP=users
# PASSWD is not secret,
# it is only there to avoid running password-less sudo commands accidentally.
PASSWD=john
# LAB_NAME will be the first component in the image path.
# It must be lowercase.
LAB_NAME=users
#### For running locally
# You can find the acceleration options in the compose.yaml file
# by looking at the services with names dev-local-ACCELERATION.
PROJECT_ROOT_AT=/path/to/LiGHT-cluster-template
ACCELERATION=cuda
WANDB_API_KEY=
# PyCharm-related. Fill after installing the IDE manually the first time.
PYCHARM_IDE_AT=
####################
# Project-specific environment variables.
## Used to avoid writing paths multiple times and creating inconsistencies.
## You should not need to change anything below this line.
PROJECT_NAME=template-project-name
PACKAGE_NAME=template_package_name
IMAGE_NAME=${LAB_NAME}/${USR}/${PROJECT_NAME}
IMAGE_PLATFORM=amd64-cuda
# The image name includes the USR to separate the images in an image registry.
# Its tag includes the platform for registries that don't hand multi-platform images for the same tag.
# You can also add a suffix to the platform e.g. -jax or -pytorch if you use different images for different environments/models etc.
Edit this file by setting LAB_NAME
to the name of the project you have given in the previous step (multimeditron if you choose to push your docker images there).
Set the PROJECT_NAME
to a meaningful name to group the docker images that belong to the same project. Here we will choose to set it up to basic
. Docker images with the same PROJECT_NAME
will appear together in the registry.rcp.epfl.ch
interface.
We also change the IMAGE_NAME
to another template for organization purpose.
Edit the corresponding lines to the following lines:
LAB_NAME=multimeditron # Or the name you have chosen
PROJECT_NAME=basic # Optional: Change this to a more meaningful name (e.g. vllm or axolotl for instance)
IMAGE_NAME=${LAB_NAME}/${PROJECT_NAME}
Building the generic docker image
Beware that this docker image uses the Dockerfile
to build. You may need to change the base Docker image in compose-base.yaml
:
BASE_IMAGE: nvcr.io/nvidia/pytorch:pytorch:24-07-py3 # (Optional: Change this tag to a newer version)
You can change this line to another newer version that you can find here. This will change the version of the NVidia driver that many libraries rely on. Some of the latest features may only be compatible with the newer version of the drivers (for instance the flash-attn library).
To build the generic docker image run the following command:
Useful commands:
docker ps
: To list all the docker images on your machine. Search for the one that you just built, it should have the name that you set up in IMAGE_NAME
. So if we are using multimeditron, it will be tagged as multimeditron/basic:amd64-cuda-root-latest
and multimeditron/basic:amd64-cuda-root-<some git commit hash>
docker run --rm -it --entrypoint bash multimeditron/basic:amd64-cuda-root-latest
: To bash into your new docker image and test if you have everything installed correctly.
Building the user docker images
Connect to people.epfl.ch and search for the user you want to build an image. You need the Username
and the UID
fields to build the user docker image
Connect to groups.epfl.ch and search for light-scratch in "All groups". You need the GID
field.
To build user docker images, you need to edit the .env
file. You need to change the following attributes:
GRPID=<GID of the light-scratch>
USRID=<UID of the user>
USR=<Username of the user>
PASSWD=<Username of the user>
To build the user docker image, run:
Push the docker images
Login to the registry:
The login informations are your EPFL login