Runtime Properties#
Model configurations often depend on values that are external to the configuration itself—properties of the dataset or the data representation being used. For example:
Standardization parameters: Standardization rescales targets to zero mean and unit variance using statistics computed on the training split to improve optimization stability. During training, targets are standardized; at inference, predictions are de-standardized (inverse-transformed) back to the original scale. In ChemTorch this is handled by the regression routine’s
Standardizerbut it depends on the mean and standard deviation of the training targets.routine: _target_: chemtorch.core.routine.RegressionRoutine target_mean: ??? # Depends on training data target_std: ??? # Depends on training data
When you change the split you need to recompute the mean and standard deviation of the training targets.
GNN with interchangeable featurizers: Encoders consume per-node (or per-edge) feature vectors produced by your featurizer; the
input_dimmust match the length of these vectors.# conf/model/gnn.yaml _target_: chemtorch.components.model.GNNModel encoder: _target_: chemtorch.components.model.NodeEncoder input_dim: ??? # This depends on the featurizer! hidden_dim: 128
Because featurizers are interchangeable, hardcoding input_dim couples configs to one featurizer and breaks portability.
To reduce friction and avoid hardcoding values, ChemTorch uses a property system that lets you declare properties to be inferred at runtime and specify where they should be injected.
Configuring Runtime Properties#
Let’s return to our GNN encoder example.
Instead of hardcoding the input dimension we can declare the NumNodeFeatures property to infer the size of the feature vector at runtime:
# conf/experiment/my_experiment.yaml
props:
num_node_features:
_target_: chemtorch.core.property_system.NumNodeFeatures
name: num_node_features
source: any
add_to_cfg: true
label_mean:
_target_: chemtorch.core.property_system.LabelMean
name: label_mean
source: train
add_to_cfg: true
label_std:
_target_: chemtorch.core.property_system.LabelStd
name: label_std
source: train
add_to_cfg: true
precompute_time:
_target_: chemtorch.core.property_system.PrecomputeTime
name: precompute_time
source: all
log: true
Here, the top-level config key num_node_features is just an index within props (useful for Hydra defaults list inclusion shown below).
Key |
Allowed values |
Purpose |
|---|---|---|
|
Any valid import path |
Python path to the property class |
|
|
Dataset partition used to compute the property. For some properties this does not matter (e.g., feature vector size is typically the same across partitions, so |
|
|
Inject the computed value back into the config at the very top level for interpolation from anywhere in the config tree. |
|
|
Log the computed value to W&B |
|
String |
Top-level config key under which the value is injected into the config or the name under which the property is logged to W&B. |
Interpolating Runtime Properties#
Once you configure your runtime properties, you can use the name to interpolate their runtime values in your configs:
# conf/model/gnn.yaml
_target_: chemtorch.components.model.GNNModel
encoder:
_target_: chemtorch.components.model.NodeEncoder
input_dim: ${num_node_features} # Automatically computed!
hidden_dim: 128
# conf/routine/regression.yaml
_target_: chemtorch.core.routine.RegressionRoutine
target_mean: ${train_label_mean}
target_std: ${train_label_std}
Now the same config works with any featurizer or dataset, the standardization parameters are computed automatically from the training dataset, and the time it takes to precompute the representations for each property are logged to W&B.
Reusing Existing Properties#
ChemTorch already includes a series of useful properties.
For a complete list of available built-in properties, see chemtorch.core.property_system.
Each built-in property comes with a default config at conf/props so that you can just include them in the defaults list of your experiment config:
# conf/experiment/my_experiment.yaml
defaults:
- /props/num_node_features@props.num_node_features
- /props/num_edge_features@props.num_edge_features
- /props/label_mean@props.label_mean
- /props/label_std@props.label_std
This declares four properties to be computed.
Each points to a config file in conf/props/.
Note
The defaults list syntax is a bit special here because we include configs that do not sit directly under the base config. Use the absolute path from the config root to the file, followed by the target node where it should be placed: /path/to/config@place.to.go.
See the Hydra documentation for full details.
Implementing Custom Properties#
To create a custom property, subclass DatasetProperty and implement the compute().
The method takes in an instance of DatasetBase corresponding to the partition selected as source.
Tip: For properties that depend on a single sample’s representation (e.g., feature vector length), access the first item: first_item = dataset[0] and extract the data object from tuples as needed.
Once you’ve implemented your property, create a config file for it in conf/props/ and use it in your configs as described above.
Implementation Guidelines#
Handle empty datasets: Always check if the dataset is empty and return a sensible default
Extract data correctly: The dataset may return tuples of
(data, label)if labeled, so use:first_item = dataset[0] data = first_item[0] if isinstance(first_item, tuple) else first_item
Validate requirements: Check for required attributes (e.g., labels, precomputation status) and raise informative errors
Type hints: Use proper type hints for the return value to enable IDE support
Documentation: Document what data the property requires and any assumptions