AlphaFold¶
From the AlphaFold repository https://github.com/deepmind/alphafold:
This package provides an implementation of the inference pipeline of AlphaFold v2.0. This is a completely new model that was entered in CASP14 and published in Nature. For simplicity, we refer to this model as AlphaFold throughout the rest of this document.
We also provide an implementation of AlphaFold-Multimer. This represents a work in progress and AlphaFold-Multimer isn’t expected to be as stable as our monomer AlphaFold system. Read the guide for how to upgrade and update code.
AlphaFold data on CSD3¶
The 2.8TB dataset is stored in:
/datasets/public/AlphaFold/data
Note that you may need to ls the directory in order for it to be mounted. There are example sequences stored in the input directory.
The dataset was updated in November 2023, so older scripts may not work without pointing to the new versions of the files. In particular, the newer UniRef30 release can be found at /datasets/public/AlphaFold/data/uniref30/UniRef30_2023_02*.
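For example, to trigger the mount and confirm the updated files are visible:
# listing the directory triggers the automount
ls /datasets/public/AlphaFold/data
# check the updated UniRef30 release
ls /datasets/public/AlphaFold/data/uniref30/UniRef30_2023_02*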
Running AlphaFold2 on CSD3¶
There are various ways to run AlphaFold2 on CSD3; for best performance we encourage the use of ParaFold, described below.
To get up and running quickly on CSD3 it is possible to run the Singularity container provided as a module:
module load alphafold/2.3.2-singularity
See Singularity for more information. This is not performant: the slow CPU (MSA) step and the GPU step run in sequence, so the GPUs sit idle for most of the run time. Instead, see the ParaFold section below for instructions on how to obtain better performance.
If you would like us to support other implementations of AlphaFold2 or if anything here is unclear or incorrect please contact support.
Separating CPU and GPU steps using ParallelFold and Conda¶
ParaFold (https://parafold.sjtu.edu.cn) is a fork of AlphaFold2 that separates the CPU MSA step from the GPU prediction step so that they can be executed as a two-step process. This is preferable because, with DeepMind's Singularity build, the GPU remains idle for most of the running time (see the timings below).
To install ParaFold, create a Conda environment using CSD3's Conda module, or download and install Conda yourself:
module load miniconda/3
conda create -n parafold python=3.8
If downloading it yourself, use Miniforge, which ships with mamba, an optimised implementation of the conda package manager.
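For example, a minimal sketch of a user-level Miniforge install (the install prefix $HOME/miniforge3 is just an example):
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh -b -p $HOME/miniforge3   # -b: non-interactive, -p: install prefix
source $HOME/miniforge3/etc/profile.d/conda.sh
conda activate base
mamba create -n parafold python=3.8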
Then follow the instructions at https://github.com/RSE-Cambridge/ParallelFold; usage information is provided in that repository. Note that this fork is an optimised version of the original intended to run on CSD3.
First we run the CPU MSA step on an Icelake node. The -f flag means that we only run the featurisation step:
#!/bin/bash
#SBATCH -A MY_CPU_ACCOUNT
#SBATCH -p icelake
#SBATCH -N 1
#SBATCH --exclusive
#SBATCH -t 04:00:00
# source conda environment
module load miniconda/3
conda activate parafold
DATA=/datasets/public/AlphaFold/data
./run_alphafold.sh \
-d $DATA \
-o output \
-p monomer_ptm \
-i input/mono_set1/GB98.fasta \
-m model_1 \
-f
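Submit the featurisation job and note the job ID that sbatch prints, so that the GPU step can be chained to it later (the script name cpu_msa.slurm is hypothetical):
sbatch cpu_msa.slurm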
The featurisation step will output features.pkl and an MSA directory inside the output directory. To run a monomer prediction, execute the following command on the GPU:
#!/bin/bash
#SBATCH -A MY_GPU_ACCOUNT
#SBATCH -p ampere
#SBATCH -N 1
#SBATCH --gres=gpu:1
#SBATCH -t 02:00:00
# source conda environment
module load miniconda/3
conda activate parafold
DATA=/datasets/public/AlphaFold/data
./run_alphafold.sh \
-d $DATA \
-o output \
-m model_1,model_2,model_3,model_4,model_5 \
-p monomer_ptm \
-i input/mono_set1/GB98.fasta \
-t 1800-01-01
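Again, the script name gpu_predict.slurm is hypothetical. The GPU job can be submitted with a Slurm dependency so that it only starts once the featurisation job has completed successfully:
sbatch --dependency=afterok:<CPU job ID> gpu_predict.slurm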
Optimising the hhblits step¶
The hhblits step on CPU can be sped up as follows: copy the two cs219 files from the bfd directory to the node-local SSD and create symbolic links there to the remaining four files, for example:
ln -s /datasets/public/AlphaFold/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex /local/
Then point to the correct path by modifying the ParaFold script (a similar approach should work for other implementations of AlphaFold).
If running as part of a Slurm script, be sure to add the copy and symlink commands to the beginning of the script, as sketched below.
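A sketch of the full staging step, assuming the standard six BFD files (the two cs219 files copied, the remaining four symlinked). File names other than the one shown above are taken from the usual BFD distribution and should be checked against the dataset directory:
BFD=/datasets/public/AlphaFold/data/bfd
PREFIX=bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt
# copy the small cs219 files to the node-local SSD
cp $BFD/${PREFIX}_cs219.ffdata $BFD/${PREFIX}_cs219.ffindex /local/
# symlink the remaining (large) files
ln -s $BFD/${PREFIX}_a3m.ffdata /local/
ln -s $BFD/${PREFIX}_a3m.ffindex /local/
ln -s $BFD/${PREFIX}_hhm.ffdata /local/
ln -s $BFD/${PREFIX}_hhm.ffindex /local/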
Running AlphaFold2 using Singularity on CSD3¶
Load the Singularity module, which exposes a run_alphafold script into the environment. The script sets some default paths to the dataset:
module load alphafold/2.3.2-singularity
Create a Slurm script with the following contents to predict the structure of the T1050 sequence (779 residues). The script assumes that an input subdirectory exists containing the T1050.fasta file:
#!/bin/bash
#SBATCH -A MYGPUACCOUNT
#SBATCH -p ampere
#SBATCH -N 1
#SBATCH --gres=gpu:1
#SBATCH -t 04:00:00
# load appropriate modules
module load rhel8/default-amp
module load alphafold/2.3.2-singularity
run_alphafold \
--pdb70_database_path=/data/pdb70/pdb70 \
--bfd_database_path /data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--output_dir $PWD/output/T1050 \
--fasta_paths $PWD/input/T1050.fasta \
--max_template_date=2020-05-14 \
--db_preset=full_dbs \
--use_gpu_relax=True
Alternatively, execute the full singularity command directly:
SIMAGE=/usr/local/Cluster-Apps/singularity/images/alphafold-2.3.2.sif
# point to location of AlphaFold data
DATA=/datasets/public/AlphaFold/data
singularity run --env \
TF_FORCE_UNIFIED_MEMORY=1,XLA_PYTHON_CLIENT_MEM_FRACTION=4.0,OPENMM_CPU_THREADS=32 \
-B $DATA:/data \
-B .:/etc \
--pwd /app/alphafold \
--nv ${SIMAGE} \
--data_dir /data/ \
--fasta_paths $PWD/input/T1050.fasta \
--output_dir $PWD/output/T1050/ \
--use_gpu_relax=True \
--max_template_date=2020-05-14 \
--uniref90_database_path=/data/uniref90/uniref90.fasta \
--mgnify_database_path /data/mgnify/mgy_clusters_2022_05.fa \
--template_mmcif_dir=/data/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat \
--bfd_database_path /data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path /data/uniref30/UniRef30_2021_03 \
--pdb70_database_path=/data/pdb70/pdb70
Timings for running all 5 models on the T1050 sequence (779 residues) are reported below:
real 149m52.862s
user 1111m23.014s
sys 22m55.353s
{
"features": 5646.51958489418,
"process_features_model_1": 95.72981929779053,
"predict_and_compile_model_1": 233.02064847946167,
"predict_benchmark_model_1": 130.08757734298706,
"relax_model_1": 334.7365086078644,
"process_features_model_2": 4.438706398010254,
"predict_and_compile_model_2": 184.557687997818,
"predict_benchmark_model_2": 116.91508865356445,
"relax_model_2": 307.3584554195404,
"process_features_model_3": 3.6764779090881348,
"predict_and_compile_model_3": 163.3666865825653,
"predict_benchmark_model_3": 121.80361533164978,
"relax_model_3": 420.58361291885376,
"process_features_model_4": 4.023890972137451,
"predict_and_compile_model_4": 169.06972408294678
"predict_benchmark_model_4": 121.70339488983154,
"relax_model_4": 300.7459502220154,
"process_features_model_5": 4.179120063781738,
"predict_and_compile_model_5": 154.17626547813416,
"predict_benchmark_model_5": 108.35132598876953,
"relax_model_5": 329.9167058467865
}
The Singularity image was built from DeepMind's Docker script and has been tested on the A100 nodes. The MSA construction and model inference run on the same node type; it is not easy to decouple the two steps without using a separate implementation (see the ParaFold section above). Users can choose to run on CPU only, but the inference step then takes considerably longer than on a GPU. Conversely, when running on a GPU the CPU preprocessing (MSA) step can dominate the running time, depending on the particular sequence whose structure is being predicted.
Current Issues¶
We are aware of the slow preprocessing time of hhblits on CSD3 and are working to improve this. For the small database it is possible to pre-stage the data on the node-local SSD (with rsync, as sketched below), but this is not possible for the full database as it exceeds the capacity of the local SSD.
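A minimal sketch of pre-staging the reduced database, assuming the small-BFD preset is used; the small_bfd directory name is an assumption and should be checked against the dataset tree:
# stage the small BFD database onto the node-local SSD
rsync -a /datasets/public/AlphaFold/data/small_bfd/ /local/small_bfd/
# then run with --db_preset=reduced_dbs and point --small_bfd_database_path at /local/small_bfd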
Appendices¶
Building the singularity image on CSD3:
git clone https://github.com/deepmind/alphafold.git
cd alphafold
docker build -f docker/Dockerfile -t alphafold .
docker tag alphafold:latest ma595/alphafold:latest
docker push ma595/alphafold:latest
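The pushed Docker image can then be converted to a Singularity image; a sketch, assuming the image name and tag used above:
# convert the Docker Hub image to a .sif on a machine with Singularity available
singularity build alphafold-2.3.2.sif docker://ma595/alphafold:latest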