How to set up Ensembler-mpi-multiple GPUs? #71
OK, I did a trial MPI run and it's not quite right :/ I used 2 CPU threads and 2 GPUs (shared). From the log it appears that the two processes are running everything in parallel rather than dividing the work in half - at one point one process even deletes a file, and the other complains when it tries to do the same next. Log: https://gist.github.com/rafwiewiora/b7f663ee76059ea8aca22b42347cafce (I killed it at the start of explicit refinement - it hung). I did: Ensembler script:
MPI script:
This uses the Any pointers?
Actually had
Yep, same thing.
You can't use
# walltime : maximum wall clock time (hh:mm:ss)
#PBS -l walltime=03:00:00
#
# join stdout and stderr
#PBS -j oe
#
# spool output immediately
#PBS -k oe
#
# specify GPU queue
#PBS -q gpu
#
# nodes: number of nodes
# ppn: number of processes per node
# gpus: number of gpus per node
# GPUs are in 'exclusive' mode by default, but 'shared' keyword sets them to shared mode.
#PBS -l nodes=1:ppn=2:gpus=2:shared
#
# export all my environment variables to the job
#PBS -V
#
# job name (default = name of script file)
#PBS -N myjob
#
# mail settings (one or more characters)
# email is sent to local user, unless another email address is specified with PBS -M option
# n: do not send mail
# a: send mail if job is aborted
# b: send mail when job begins execution
# e: send mail when job terminates
#PBS -m n
#
# filename for standard output (default = <job_name>.o<job_id>)
# at end of job, it is in directory from which qsub was executed
# remove extra ## from the line below if you want to name your own file
##PBS -o myoutput
# Change to working directory used for job submission
cd $PBS_O_WORKDIR
# add anaconda to PATH
export PATH=/cbio/jclab/home/rafal.wiewiora/anaconda/bin:$PATH
ensembler init
cp ../manual-overrides.yaml .
ensembler gather_targets --gather_from uniprot --query SETD8_HUMAN --uniprot_domain_regex SET
ensembler gather_templates --gather_from uniprot --query SETD8_HUMAN
# no loopmodel
ensembler align
ensembler build_models
ensembler cluster --cutoff 0
# Parallelize refinement
python build-mpirun-configfile.py ensembler refine_implicit --gpupn 4
mpirun -configfile configfile
ensembler solvate
# Parallelize refinement
python build-mpirun-configfile.py ensembler refine_explicit --gpupn 4
mpirun -configfile configfile
This parallelizes the two refinement steps. Hopefully @danielparton can jump in if I've made a mistake.
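For intuition, the reason correctly configured MPI ranks don't duplicate work is that each rank refines only its own share of the models, typically a round-robin slice keyed on the rank. A minimal sketch of that pattern (a hypothetical helper illustrating the general technique, not Ensembler's actual code):

```python
# Sketch of MPI-style work division (hypothetical helper, not Ensembler's
# actual code): rank i of `size` ranks takes every size-th model starting
# at index i, so N ranks split the model list instead of all doing everything.

def partition_models(models, rank, size):
    """Return the slice of `models` that MPI rank `rank` of `size` should refine."""
    return models[rank::size]

# With mpi4py, rank and size would come from MPI.COMM_WORLD.Get_rank()
# and MPI.COMM_WORLD.Get_size(); here we just show the split for 2 ranks.
models = ["model_A", "model_B", "model_C", "model_D", "model_E"]
print(partition_models(models, 0, 2))  # rank 0's share
print(partition_models(models, 1, 2))  # rank 1's share
```

If both processes log identical work (as in the gist above), the ranks are not seeing distinct rank numbers, which points at the MPI setup rather than Ensembler itself.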
Also, you might as well grab 4 GPUs by changing #PBS -l nodes=1:ppn=2:gpus=2:shared to #PBS -l nodes=1:ppn=4:gpus=4:shared
Oh I see, thanks @jchodera! There was an hour wait time for 4 GPUs last time I checked, so I'm testing with 2 for now. (Still need to work out the manual PDB before this is all ready to run.)
Same thing - still running duplicates of the same thing at
etc. And it seems like they're both on the same GPU?
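For context on the "same GPU" symptom: per-rank GPU assignment is usually done by giving each rank its own local GPU index (for example via CUDA_VISIBLE_DEVICES, which is what a generated mpirun configfile typically encodes). If every rank resolves to the same index, they all pile onto one device. A hypothetical sketch of the mapping, not the actual build-mpirun-configfile logic:

```python
# Hypothetical sketch of rank-to-GPU assignment (not the actual
# build-mpirun-configfile logic): each MPI rank on a node is mapped to a
# distinct local GPU, wrapping around if there are more ranks than GPUs.

def gpu_for_rank(rank, gpus_per_node):
    """Map an MPI rank to a local GPU index, e.g. for CUDA_VISIBLE_DEVICES."""
    return rank % gpus_per_node

# Two ranks on a 2-GPU node should land on different devices.
for rank in range(2):
    print(f"rank {rank} -> GPU {gpu_for_rank(rank, 2)}")
```

If both processes end up on GPU 0, the per-rank environment is not being set, which again suggests the MPI launch rather than Ensembler is at fault.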
Can you post the complete queue submission script from this last attempt?
Sure thing:
and I'm doing
I should also mention that I've created
# walltime : maximum wall clock time (hh:mm:ss)
#PBS -l walltime=03:00:00
#
# join stdout and stderr
#PBS -j oe
#
# spool output immediately
#PBS -k oe
#
# specify GPU queue
#PBS -q gpu
#
# nodes: number of nodes
# ppn: number of processes per node
# gpus: number of gpus per node
# GPUs are in 'exclusive' mode by default, but 'shared' keyword sets them to shared mode.
#PBS -l nodes=1:ppn=2:gpus=2:shared
#
# export all my environment variables to the job
#PBS -V
#
# job name (default = name of script file)
#PBS -N myjob
#
# mail settings (one or more characters)
# email is sent to local user, unless another email address is specified with PBS -M option
# n: do not send mail
# a: send mail if job is aborted
# b: send mail when job begins execution
# e: send mail when job terminates
#PBS -m n
#
# filename for standard output (default = <job_name>.o<job_id>)
# at end of job, it is in directory from which qsub was executed
# remove extra ## from the line below if you want to name your own file
##PBS -o myoutput
# Change to working directory used for job submission
cd $PBS_O_WORKDIR
# add anaconda to PATH
export PATH=/cbio/jclab/home/rafal.wiewiora/anaconda/bin:$PATH
ensembler init
cp ../manual-overrides.yaml .
ensembler gather_targets --gather_from uniprot --query SETD8_HUMAN --uniprot_domain_regex SET
ensembler gather_templates --gather_from uniprot --query SETD8_HUMAN
# no loopmodel
ensembler align
ensembler build_models
ensembler cluster --cutoff 0
# Parallelize refinement
build_mpirun_configfile ensembler refine_implicit --gpupn 2
mpirun -configfile configfile
ensembler solvate
# Parallelize refinement
build_mpirun_configfile ensembler refine_explicit --gpupn 2
mpirun -configfile configfile
That worked for me:
Problem resolved - mpi4py wasn't installed! So the how-to will be:
Really neat once you know how to do it! TODO: put this in docs. |
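Since the failure mode seen earlier (every process duplicating the full workload) matches what happens when mpi4py is missing from the environment, a quick preflight check before launching the job can save a wasted allocation. A minimal sketch, assuming only the standard library:

```python
# Preflight check: confirm mpi4py is importable before running an MPI job.
# If it is missing, each mpirun process may silently fall back to serial
# behavior and duplicate the full workload instead of dividing it.
import importlib.util

def mpi4py_available():
    """Return True if mpi4py can be imported in this environment."""
    return importlib.util.find_spec("mpi4py") is not None

if __name__ == "__main__":
    if not mpi4py_available():
        raise SystemExit("mpi4py not found - install it before running mpirun jobs")
```

This could be dropped into the submission script just before the mpirun lines; the function name and the exit message are illustrative, not part of Ensembler.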
Ensembler has been running on 4 GPUs since last night - I wasn't able to make it work on >1 node: either Ensembler was throwing some exception (and I don't know which, because it just throws the list of your commands back at you) or CUDA was throwing initialization errors. That was on a
I know this would take me ages to work out on my own, so hopefully someone could spare a few mins for a tutorial please.
How does one set up Ensembler to run with multiple GPUs? I have 23 models and need to equilibrate them all for 5ns each.
More detailed questions:
--gpupn
to refine_explicit
and everything else will happen automagically as far as Ensembler is concerned? Thanks!