Connecting to CSCS
DO NOT RUN ON LOGIN NODE
When you establish a direct connection using ssh, you connect to the login node. Everyone is on that node, and as such YOU SHOULD NEVER RUN ANY JOBS DIRECTLY ON THE LOGIN NODE. If you want to run a process, such as a training, run it inside a dedicated allocated job.
Pre-setup (access to the CSCS)
Please ask Michael or Peter to add you to the CSCS project (send a message on Slack to get a faster answer). Once you have been added, check your mail for the invitation link. You will have to create an account.
Connect to the login node
To connect to the login node, you will need to refresh your SSH key every 24 hours. To refresh your keys, you need to execute the following script. Store it in a .sh file (e.g. cscs_connect.sh). Make sure to replace $CSCS_USERNAME with your CSCS username and $CSCS_PASSWORD with your CSCS password.
#!/bin/bash
# This script sets the environment properly so that a user can access CSCS
# login nodes via ssh.
# Copyright (C) 2023, ETH Zuerich, Switzerland
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, version 3 of the License.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#
# AUTHORS Massimo Benini
USERNAME=$CSCS_USERNAME
PASSWORD=$CSCS_PASSWORD
#read -p "Username : " USERNAME
#read -s -p "Password: " PASSWORD
function ProgressBar {
# Process data
let _progress=(${1}*100/${2}*100)/100
let _done=(${_progress}*4)/10
let _left=40-$_done
# Build progressbar string lengths
_fill=$(printf "%${_done}s")
_empty=$(printf "%${_left}s")
# 1.2 Build progressbar strings and print the ProgressBar line
# 1.2.1 Output example:
# 1.2.1.1 Progress : [########################################] 100%
printf "\rSetting the environment : [${_fill// /#}${_empty// /-}] ${_progress}%%"
}
#Variables
_start=1
#This accounts as the "totalState" variable for the ProgressBar function
_end=100
#Params
MFA_KEYS_URL="https://sshservice.cscs.ch/api/v1/auth/ssh-keys/signed-key"
#Detect OS
OS="$(uname)"
case "${OS}" in
'Linux')
OS='Linux'
;;
'FreeBSD')
OS='FreeBSD'
;;
'WindowsNT')
OS='Windows'
;;
'Darwin')
OS='Mac'
;;
*) ;;
esac
#OS validation
if [ "${OS}" != "Mac" ] && [ "${OS}" != "Linux" ]; then
echo "This script works only on Mac-OS or Linux. Abording."
exit 1
fi
#Read Inputs
echo
read -s -p "Enter OTP (6-digit code): " OTP
echo
if [ -z "${PASSWORD}" ]; then
echo "Password is empty."
exit 1
fi
if ! [[ "${OTP}" =~ ^[[:digit:]]{6} ]]; then
echo "OTP is not valid, OTP must contains only six digits."
exit 1
fi
ProgressBar 25 "${_end}"
echo " Authenticating to the SSH key service..."
HEADERS=(-H "Content-Type: application/json" -H "accept: application/json")
KEYS=$(curl -s -S --ssl-reqd \
"${HEADERS[@]}" \
-d "{\"username\": \"$USERNAME\", \"password\": \"$PASSWORD\", \"otp\": \"$OTP\"}" \
"$MFA_KEYS_URL")
if [ $? != 0 ]; then
exit 1
fi
ProgressBar 50 "${_end}"
echo " Retrieving the SSH keys..."
DICT_KEY=$(echo ${KEYS} | cut -d \" -f 2)
if [ "${DICT_KEY}" == "payload" ]; then
MESSAGE=$(echo ${KEYS} | cut -d \" -f 6)
! [ -z "${MESSAGE}" ] && echo "${MESSAGE}"
echo "Error fetching the SSH keys. Aborting."
exit 1
fi
PUBLIC=$(echo ${KEYS} | cut -d \" -f 4)
PRIVATE=$(echo ${KEYS} | cut -d \" -f 8)
#Check if keys are empty:
if [ -z "${PUBLIC}" ] || [ -z "${PRIVATE}" ]; then
echo "Error fetching the SSH keys. Aborting."
exit 1
fi
ProgressBar 75 "${_end}"
echo " Setting up the SSH keys into your home folder..."
#Check that the ~/.ssh folder exists and store the keys
mkdir -p ~/.ssh || exit 1
echo ${PUBLIC} | awk '{gsub(/\\n/,"\n")}1' > ~/.ssh/cscs-key-cert.pub || exit 1
echo ${PRIVATE} | awk '{gsub(/\\n/,"\n")}1' > ~/.ssh/cscs-key || exit 1
#Setting permissions:
chmod 644 ~/.ssh/cscs-key-cert.pub || exit 1
chmod 600 ~/.ssh/cscs-key || exit 1
#Format the keys:
if [ "${OS}" = "Mac" ]
then
sed -i '' -e '$ d' ~/.ssh/cscs-key-cert.pub || exit 1
sed -i '' -e '$ d' ~/.ssh/cscs-key || exit 1
else [ "${OS}" = "Linux" ]
sed '$d' ~/.ssh/cscs-key-cert.pub || exit 1
sed '$d' ~/.ssh/cscs-key || exit 1
fi
ProgressBar 100 "${_end}"
echo " Completed."
exit_code_passphrase=1
read -n 1 -p "Do you want to add a passphrase to your key? [y/n] (Default y) " reply;
if [ "$reply" != "" ];
then echo;
fi
if [ "$reply" = "${reply#[Nn]}" ]; then
while [ $exit_code_passphrase != 0 ]; do
ssh-keygen -f ~/.ssh/cscs-key -p
exit_code_passphrase=$?
done
fi
if (( $exit_code_passphrase == 0 ));
then
SUBSTRING=", using the passphrase you have set:";
else
SUBSTRING=":";
fi
eval `ssh-agent -s`
ssh-add -t 1d ~/.ssh/cscs-key
You will have to execute this bash script every day. If you don't want your credentials stored in the script, remove the USERNAME=$CSCS_USERNAME and PASSWORD=$CSCS_PASSWORD lines and uncomment the two read lines below them, so the script prompts for your username and password instead.
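A minimal daily invocation could look like this (assuming you saved the script as cscs_connect.sh in your home directory):
# Refresh the CSCS SSH keys; the script prompts for your 6-digit OTP
bash ~/cscs_connect.sh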
Setup your ssh config
Add the following lines to the ~/.ssh/config file:
Host ela
HostName ela.cscs.ch
User $CSCS_USERNAME
ForwardAgent yes
ForwardX11 yes
forwardX11Trusted yes
IdentityFile ~/.ssh/cscs-key
Host todi
HostName todi.cscs.ch
User $CSCS_USERNAME
ProxyJump ela
ForwardAgent yes
ForwardX11 yes
forwardX11Trusted yes
IdentityFile ~/.ssh/cscs-key
Host clariden
HostName clariden.cscs.ch
User $CSCS_USERNAME
ProxyJump ela
ForwardAgent yes
ForwardX11 yes
forwardX11Trusted yes
IdentityFile ~/.ssh/cscs-key
To connect to the cluster, run the following:
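For example, using one of the host aliases defined above:
ssh clariden
# or: ssh todi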
DO NOT RUN ON LOGIN NODE
This opens a terminal on the CSCS login node. You should see a prompt of the form [clariden][$CSCS_USERNAME@clariden-lnxxx].
DO NOT RUN ANYTHING HEAVY WHEN YOU SEE THIS PROMPT (training, inference, heavy downloads, etc.). You should always ask for a job first. Here ln stands for login node; this is how you know that you are on the login node. To launch a job, see the sections below on how to use an environment and how to launch a job.
Setup Github
To operate on private repositories on GitHub, you can either generate an SSH key pair or use a GitHub personal access token (GitHub PAT). We recommend the second option, but both options are viable.
To generate a GitHub PAT, follow these instructions. Make sure that this PAT is stored somewhere safe.
For this tutorial, we are going to use the MultiMeditron training pipeline setup. Clone the MultiMeditron repository in your user directory:
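# (the original command is not reproduced here; an HTTPS clone would look like this,
#  where <org> is a placeholder for the GitHub organization hosting MultiMeditron)
cd /users/$CSCS_USERNAME
git clone https://github.com/<org>/MultiMeditron.git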
When GitHub asks for your password, input the PAT that you have generated in this step.
Create a personal folder in the capstor partition
This personal folder on capstor will mainly be used to store your huggingface home and the big files that don't fit in your /users personal folder.
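The original command is not reproduced here; based on the paths used in the sbatch script later on this page, it presumably looks like this:
mkdir -p /capstor/store/cscs/swissai/a127/homes/$CSCS_USERNAME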
In your ~/.bashrc, append the following line:
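# (the exact line is not preserved here; presumably the huggingface home override
#  pointing at your capstor folder, matching the sbatch script below)
export HF_HOME=/capstor/store/cscs/swissai/a127/homes/$CSCS_USERNAME/hf_home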
Setup the environment on the cluster
The terminal will spawn you into the /users/$CSCS_USERNAME directory.
When running a job, you will need to execute it inside a Docker image. This is done by using .toml files that specify which Docker image and environment variables are going to be set when running the job. Create a folder .edf in /users/$CSCS_USERNAME:
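# (the original command block is not preserved; creating the folder with mkdir)
mkdir -p /users/$CSCS_USERNAME/.edf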
Create a /users/$CSCS_USERNAME/.edf/multimodal.toml file:
image = "/capstor/store/cscs/swissai/a127/meditron/docker/multimeditron_latest.sqsh"
mounts = ["/capstor", "/iopsstor", "/users"]
writable = true
workdir = "/users/$CSCS_USERNAME/meditron/MultiMeditron"
[annotations]
com.hooks.aws_ofi_nccl.enabled = "true"
com.hooks.aws_ofi_nccl.variant = "cuda12"
[env]
CUDA_CACHE_DISABLE = "1"
NCCL_NET = "AWS Libfabric"
NCCL_CROSS_NIC = "1"
NCCL_NET_GDR_LEVEL = "PHB"
FI_CXI_DISABLE_HOST_REGISTER = "1"
FI_MR_CACHE_MONITOR = "userfaultfd"
FI_CXI_DEFAULT_CQ_SIZE = "131072"
FI_CXI_DEFAULT_TX_SIZE = "32768"
FI_CXI_RX_MATCH_MODE = "software"
FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD = "16777216"
FI_CXI_COMPAT = "0"
Notice 2 things:
- We specify the path to the .sqsh file in the image attribute. This is the image used by the job that stores all of the dependencies.
- We specify the path to the MultiMeditron repo in the workdir attribute. This is the directory where we spawn when the job is launched.
Note that for other types of job, you will probably require a different image and a different working directory.
Launching jobs
There are 2 types of jobs that you can launch:
- Interactive using srun (which gives you a terminal)
- Non-interactive using sbatch (which schedules a job)
Interactive job
On the login node, you can launch an interactive job by executing the following command:
srun --time=1:29:59 --partition debug -A a127 --environment=/users/$CSCS_USERNAME/.edf/multimodal.toml --pty bash
Here is a breakdown of the command:
- --time is the maximum running time of the job (here, the job runs for 1h30 before it gets killed)
- --partition debug is the node partition in which the job is executed. As of 14/08/2025, there are 3 partitions:
  - normal: maximum running time of 12 hours and no limit on the number of distributed nodes. This is the partition used for non-interactive jobs and long interactive jobs
  - debug: maximum running time of 1h30 and only one node. This partition is meant for interactive jobs
  - xfer: this partition is meant for data transfer and doesn't claim any GPU
To check if you have been allocated a node, run the following command in another terminal:
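# (the original command is not preserved; presumably Slurm's start-time estimate for your pending jobs)
squeue --me --start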
This command will give you a dynamic estimation of the scheduled time (may change as people pass you in the priority queue). Note that this command doesn't output anything if your job has been allocated.
Once you have been allocated a job, you will have a terminal inside the allocated node. Make sure that your bash prompt is of the form $CSCS_USERNAME@nidxxxxxx (and not [clariden][$CSCS_USERNAME@clariden-lnxxx]). Run the following command to make sure you have 4 GPUs and that the driver is installed:
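# (the original command is not preserved; nvidia-smi reports the visible GPUs and the driver version)
nvidia-smi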
You can try to launch a training with MultiMeditron by running the following commands:
cd MultiMeditron
pip install -e .
torchrun --nproc-per-node 4 train.py --config config/config_alignment.yaml
Once you are done with the job, type exit to leave the terminal. This will cancel your job.
Non-interactive job
To launch a non-interactive job, you need to create an sbatch script. Create a file called sbatch_train.sh:
#!/bin/bash
#SBATCH --job-name demo-job
#SBATCH --output /users/$CSCS_USERNAME/meditron/reports/R-%x.%j.out
#SBATCH --error /users/$CSCS_USERNAME/meditron/reports/R-%x.%j.err
#SBATCH --nodes 1 # number of Nodes
#SBATCH --ntasks-per-node 1 # number of MP tasks. IMPORTANT: torchrun represents just 1 Slurm task
#SBATCH --gres gpu:4 # Number of GPUs
#SBATCH --cpus-per-task 288 # number of CPUs per task.
#SBATCH --time 0:59:59 # maximum execution time (HH:MM:SS)
#SBATCH --environment /users/$CSCS_USERNAME/.edf/multimodal.toml
#SBATCH -A a127
export WANDB_DIR=/capstor/store/cscs/swissai/a127/homes/$CSCS_USERNAME/wandb
export WANDB_MODE="offline"
export HF_TOKEN=$HF_TOKEN
export SETUP="cd /users/$CSCS_USERNAME/meditron/multimodal/MultiMeditron && pip install -e ."
export CUDA_LAUNCH_BLOCKING=1
echo "START TIME: $(date)"
# auto-fail on any errors in this script
set -eo pipefail
# logging script's variables/commands for future debug needs
set -x
######################
### Set environment ###
######################
GPUS_PER_NODE=4
echo "NODES: $SLURM_NNODES"
######## Args ########
export HF_HOME=/capstor/store/cscs/swissai/a127/homes/$CSCS_USERNAME/hf_home
######################
######################
#### Set network #####
######################
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6200
######################
# note that we don't want to interpolate `\$SLURM_PROCID` till `srun` since otherwise all nodes will get
# 0 and the launcher will hang
#
# same goes for `\$(hostname -s|tr -dc '0-9')` - we want it to interpolate at `srun` time
LAUNCHER="
torchrun \
--nproc_per_node $GPUS_PER_NODE \
--nnodes $SLURM_NNODES \
--node_rank \$SLURM_PROCID \
--rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
--rdzv_backend c10d \
--max_restarts 0 \
--tee 3 \
"
export CMD="$LAUNCHER train.py --config config/config_alignment.yaml"
echo $CMD
# srun error handling:
# --wait=60: wait 60 sec after the first task terminates before terminating all remaining tasks
SRUN_ARGS=" \
--cpus-per-task $SLURM_CPUS_PER_TASK \
--jobid $SLURM_JOB_ID \
--wait 60 \
-A a127 \
--reservation=sai-a127
"
# bash -c is needed for the delayed interpolation of env vars to work
srun $SRUN_ARGS bash -c "$SETUP && $CMD"
echo "END TIME: $(date)"
Make sure to replace all occurrences of $CSCS_USERNAME with your username and $HF_TOKEN with your huggingface token. Pay attention to the following parameters:
- #SBATCH --job-name demo-job sets the job name to demo-job
- #SBATCH --nodes 1 means that we are claiming one node (of 4 GPUs). You should increase this if you are launching bigger jobs
- #SBATCH --output /users/$CSCS_USERNAME/meditron/reports/R-%x.%j.out and #SBATCH --error /users/$CSCS_USERNAME/meditron/reports/R-%x.%j.err mean that the job logs are written into the folder /users/$CSCS_USERNAME/meditron/reports (make sure this folder exists before submitting)
- Note that here, we execute a training of MultiMeditron with config/config_alignment.yaml, thus you need to make sure that the paths of the dataset are correct
- Note that the part which follows the #SBATCH commands will be executed on every node
To queue your job, run sbatch sbatch_train.sh.
You can check if your job has been allocated GPUs by running squeue --me. This command also gives you the JOBID of the job you have launched.
Once the job enters the R state (R stands for running), the job is running. You can check the logs of your job in the reports directory:
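# (the original command is not preserved; e.g. follow the error log of the job)
tail -f /users/$CSCS_USERNAME/meditron/reports/R-%x.%j.err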
where you need to replace R-%x.%j.err with the actual report name.
You can either let the job finish or cancel it by running scancel $JOBID, where $JOBID is the JOBID that you get when running squeue --me.
VSCode Connection
If you want to join the modern era of computing and have something more involved than a terminal to code with (unlike some people), you may want to "connect" your Visual Studio Code instance directly to the cluster. This allows you to modify the code directly, using the correct environment (so that it doesn't show you half the packages as non-existent).
Procedure
You will need the vscode CLI installed on the job you launched.
You can use the image that I personally use: update your environment file to point to the image at /capstor/store/cscs/swissai/a127/meditron/docker/multimeditron_code.sqsh. With this solution, however, you'll inherit all of my Python dependencies. If you want to use your own image, you can check the manual installation below.
If you want to use custom dependencies, you'll need to manually install the vscode CLI into your image. As an example, here's a sample of my Dockerfile responsible for installing the CLI.
FROM michelducartier24/multimeditron-apertus
RUN pip install -U transformers
RUN echo "" > /etc/pip/constraint.txt
RUN mkdir -p /workspace/code
WORKDIR /workspace/code
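# Download the VS Code CLI (the alpine-arm64 build matches the cluster's aarch64 nodes) and install it as /usr/bin/code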
RUN curl -Lk 'https://code.visualstudio.com/sha/download?build=stable&os=cli-alpine-arm64' --output vscode_cli.tar.gz
RUN tar -xf vscode_cli.tar.gz
RUN mv ./code /usr/bin
RUN rm -rf /workspace/code
- Once your job has been launched with the vscode CLI installed, it's time to run the code tunnel. Go to the folder of your project and run the tunnel command (shown after this list). If it asks you to log in with your github account, do so.
- Finally, open vscode locally on your computer, then in the remote extension select the appropriate tunnel, and that's it, you are in!
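The tunnel command referred to above, assuming the CLI was installed as /usr/bin/code as in the Dockerfile sample, is presumably:
code tunnel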