Reproducing Experiments

In this tutorial you will learn how to seamlessly reproduce experiments run with ChemTorch. This tutorial assumes that you have already set up logging with W&B.

  1. Run your experiments with log=true.

    chemtorch +experiment=my_experiment log=true
    

    When logging is active, ChemTorch also logs the fully resolved config to W&B, and a local copy is saved in the wandb/ folder.

  2. The W&B config format differs slightly from the one used by Hydra, so you first need to convert the config to Hydra format using the wandb_to_hydra.py script located in the scripts/ folder of ChemTorch.

    wandb_to_hydra.py usage
    usage: wandb_to_hydra.py [-h] --run-id RUN_ID --output-path OUTPUT_PATH
                             [--wandb-dir WANDB_DIR]
    
    This script parses the config of a WandB run to Hydra format, removing the
    `value` wrappers and saving the result to a specified output path.
    
    options:
      -h, --help            show this help message and exit
      --run-id RUN_ID, -i RUN_ID
                            Run ID of the WandB run.
      --output-path OUTPUT_PATH, -o OUTPUT_PATH
                            Path to save the resulting hydra config file. If not
                            absolute, it is considered relative to the project
                            directory (parent of the script directory).
      --wandb-dir WANDB_DIR, -d WANDB_DIR
                            Path to the `wandb/` directory. If not provided, the
                            script will search in the project directory (parent of
                            the script directory).
    
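Conceptually, the conversion is simple: W&B wraps each config entry in a `value` key, which Hydra does not expect. The following is a minimal sketch of that unwrapping, as a simplified stand-in for what `wandb_to_hydra.py` does (not its actual implementation; the function name is illustrative):

```python
def unwrap_wandb_config(wandb_cfg: dict) -> dict:
    """Strip W&B's {"value": ...} wrappers and drop W&B-internal keys."""
    return {
        key: entry["value"] if isinstance(entry, dict) and "value" in entry else entry
        for key, entry in wandb_cfg.items()
        if not key.startswith("_")  # skip internal keys such as "_wandb"
    }
```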
  3. If you save the config to conf/saved_configs/my_experiment/my_run.yaml, you can run the following command to reproduce the experiment:

    chemtorch -cd=conf/saved_configs/my_experiment -cn=my_run
    

    This will run the exact same experiment as before, using the same config parameters.

Reproducing The ChemTorch Benchmarks

The ChemTorch white paper includes a set of benchmark experiments that can be easily reproduced using the steps outlined above. The configs for these experiments are available in the conf/saved_configs/chemtorch_benchmarks/ folder:

conf/saved_configs/chemtorch_benchmarks/
├── optimal_model_configs
│   ├── atom_han.yaml
│   ├── cgr_dmpnn.yaml
│   ├── dimreaction.yaml
│   ├── drfp_mlp.yaml
│   └── ...
├── dmpnn_data_split_benchmark
│   ├── random_split.yaml
│   ├── reactant_scaffold_split.yaml
│   ├── reaction_core_split.yaml
│   └── ...
└── ...

The optimal model configs contain the best hyper-parameters found for each model/reaction representation on the benchmark datasets. The data split benchmark configs use the same hyper-parameters as the corresponding optimal model config, but vary the data splitting strategy to demonstrate the effect of data splits on model performance.

For example, run the following command to reproduce the CGR/D-MPNN experiment with seeds 0 to 9:

chemtorch -m -cd=conf/saved_configs/chemtorch_benchmarks/optimal_model_configs -cn=cgr_dmpnn +experiment=chemtorch_benchmarks seed=0,1,2,3,4,5,6,7,8,9

The multirun (-m) flag runs the experiment with each of the specified seeds in sequence (see CLI Usage).