codes.utils package

Submodules#

codes.utils.data_utils module#

exception codes.utils.data_utils.DatasetError#

Bases: Exception

Error for missing data or dataset or if the data shape is incorrect.

class codes.utils.data_utils.DownloadProgressBar(*_, **__)#

Bases: tqdm

update_to(b=1, bsize=1, tsize=None)#

Update the progress bar.

Parameters:

b (int, optional) – Number of blocks transferred so far. Default is 1.
bsize (int, optional) – Size of each block (in tqdm units). Default is 1.
tsize (int, optional) – Total size (in tqdm units). Default is None.
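The arguments of update_to match the (block number, block size, total size) triple that urllib.request.urlretrieve passes to its reporthook. A minimal sketch of the arithmetic follows, assuming the bar converts the cumulative count into an increment. ProgressCounter is a hypothetical stand-in for a tqdm bar and is not part of codes.utils.

```python
class ProgressCounter:
    """Hypothetical stand-in for a tqdm bar: tracks `n` and `total` only."""

    def __init__(self):
        self.n = 0
        self.total = None

    def update(self, delta):
        # Advance the counter by an increment, as tqdm's update() does.
        self.n += delta

    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        # b * bsize is the cumulative amount transferred so far;
        # update() expects the increment since the last call.
        self.update(b * bsize - self.n)


bar = ProgressCounter()
for block in range(1, 4):
    bar.update_to(b=block, bsize=1024, tsize=3072)
```

Subtracting the current count matters because the hook reports totals, not deltas; adding b * bsize directly would overcount.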

codes.utils.data_utils.check_and_load_data(dataset_name, verbose=True, log=True, log_params=True, normalisation_mode='standardise', tolerance=None, per_species=False)#

Check that the specified dataset is available and load its training, test, and validation data.

Parameters:

dataset_name (str) – The name of the dataset.
verbose (bool) – Whether to print information about the loaded data.
log (bool) – Whether to log-transform the data (log10).
log_params (bool) – Whether to log-transform the parameters.
normalisation_mode (str) – The normalization mode, either “disable”, “minmax”, or “standardise”.
tolerance (float, optional) – The tolerance value for log-transformation. Values below this will be set to the tolerance value. Pass None to disable.
per_species (bool) – If True, normalize for each species separately.

Returns:

A tuple containing:

(train_data, test_data, val_data)
(train_params, test_params, val_params) or (None, None, None) if parameters are absent
timesteps
n_train_samples
data_info (including transformation parameters for data and for parameters)
labels

Return type:

tuple

Raises:

DatasetError – If the dataset or required data is missing or if the data shape is incorrect.
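To illustrate how log and tolerance interact, here is a minimal sketch, assuming values below the tolerance are clipped before log10 is taken. The helper log_transform is hypothetical and works on a flat list for brevity; it is not the library's implementation.

```python
import math


def log_transform(values, tolerance=None):
    """Hypothetical helper: log10-transform values, clipping at `tolerance` first."""
    if tolerance is not None:
        # Values below the tolerance are set to the tolerance value.
        values = [max(v, tolerance) for v in values]
    return [math.log10(v) for v in values]


transformed = log_transform([1e-30, 1e-3, 1.0, 100.0], tolerance=1e-20)
```

Clipping keeps very small or zero values from producing extreme or undefined logarithms. With tolerance=None the values are passed to log10 unchanged.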

codes.utils.data_utils.create_dataset(name, data, params=None, split=(0.7, 0.1, 0.2), timesteps=None, labels=None)#

Creates a new dataset in the data directory.

Parameters:

name (str) – The name of the dataset.
data (np.ndarray or tuple of np.ndarray) – Either a single 3D array of shape (n_samples, n_timesteps, n_quantities) or a tuple of three 3D arrays representing (train, test, val).
params (np.ndarray or tuple of np.ndarray, optional) – Either a single 2D array of shape (n_samples, n_parameters) corresponding to all samples, or a tuple of three 2D arrays representing (train, test, val) parameters. Must be provided in the same structure as data.
split (tuple of three floats, optional) – If data is provided as a single array, it is split into train, test, and validation sets according to these ratios (which must sum to 1).
timesteps (np.ndarray, optional) – A 1D array of timesteps. Its length must equal the number of timesteps in the data.
labels (list[str], optional) – Labels for the quantities. The number of labels must match the last dimension of the data.

Raises:

FileExistsError – If the dataset directory already exists.
TypeError – If data (or params) are not of the expected type.
ValueError – If the shapes of data or params are inconsistent.
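The effect of the split argument can be sketched as follows. split_samples is a hypothetical helper that slices a sequence in order; whether create_dataset shuffles before splitting is not stated here, so this sketch does not shuffle.

```python
def split_samples(samples, split=(0.7, 0.1, 0.2)):
    """Hypothetical helper: split a sequence into (train, test, val) by ratio."""
    if abs(sum(split) - 1.0) > 1e-9:
        raise ValueError("split ratios must sum to 1")
    n = len(samples)
    n_train = round(n * split[0])
    n_test = round(n * split[1])
    train = samples[:n_train]
    test = samples[n_train:n_train + n_test]
    val = samples[n_train + n_test:]  # the remainder goes to validation
    return train, test, val


train, test, val = split_samples(list(range(100)))
```

Note the order of the ratios: with the default (0.7, 0.1, 0.2), the second entry is the test fraction and the third is the validation fraction.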

codes.utils.data_utils.create_hdf5_dataset(train_data, test_data, val_data, dataset_name, data_dir='datasets', timesteps=None, labels=None, train_params=None, test_params=None, val_params=None)#

Create an HDF5 file for a dataset with train, test, and validation data, along with optional timesteps and parameters.

Parameters:

train_data (np.ndarray) – The training data array of shape (n_samples, n_timesteps, n_quantities).
test_data (np.ndarray) – The test data array of shape (n_samples, n_timesteps, n_quantities).
val_data (np.ndarray) – The validation data array of shape (n_samples, n_timesteps, n_quantities).
dataset_name (str) – The name of the dataset.
data_dir (str) – The directory in which to save the dataset.
timesteps (np.ndarray, optional) – A 1D array of timesteps.
labels (list[str], optional) – Labels for the quantities.
train_params (np.ndarray, optional) – Training parameters of shape (n_samples, n_parameters).
test_params (np.ndarray, optional) – Testing parameters of shape (n_samples, n_parameters).
val_params (np.ndarray, optional) – Validation parameters of shape (n_samples, n_parameters).

codes.utils.data_utils.download_data(dataset_name, path=None, verbose=True)#

Download the specified dataset if it is not present, with a progress bar.

Parameters:

dataset_name (str) – The name of the dataset.
path (str, optional) – The path to save the dataset. If None, the default data directory is used.
verbose (bool) – Whether to print information about the download progress.

codes.utils.data_utils.get_data_subset(data, timesteps, mode, metric, params=None, subset_factor=1)#

Get the appropriate data subset based on the mode and metric.

Parameters:

data (tuple[np.ndarray, ...]) – A tuple of data arrays of shape (n_samples, n_timesteps, n_quantities).
timesteps (np.ndarray) – The timesteps.
mode (str) – The benchmark mode (must be one of “interpolation”, “extrapolation”, “sparse”, “batch_size”). For “batch_size”, we thin the dataset by a factor of 4 for faster processing.
metric (int) – The specific metric for the mode (e.g., interval, cutoff, factor, batch size).
params (tuple[np.ndarray, ...] | None | tuple[None, ...]) – Optional parameters (or tuple of parameters) with shape (n_samples, n_parameters). If None, params_subset will be None. If it is a tuple of Nones, then params_subset will be that tuple of Nones.
subset_factor (int) – The factor to subset the data by. Default is 1 (use full train and test data).

Returns:

(data_subset, params_subset, timesteps_subset)

Return type:

tuple
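A rough sketch of mode-dependent subsetting is given below for a single array stored as nested lists, indexed as data[sample][timestep][quantity]. Apart from the factor-of-4 thinning in "batch_size" mode, the meaning assigned to each mode here is an assumption made for illustration, and subset_one is a hypothetical name.

```python
def subset_one(data, timesteps, mode, metric):
    """Hypothetical sketch; the per-mode semantics are assumptions."""
    if mode == "interpolation":
        # Assumed: keep every `metric`-th timestep.
        return [sample[::metric] for sample in data], timesteps[::metric]
    if mode == "extrapolation":
        # Assumed: keep only the first `metric` timesteps.
        return [sample[:metric] for sample in data], timesteps[:metric]
    if mode == "sparse":
        # Assumed: keep every `metric`-th sample.
        return data[::metric], timesteps
    if mode == "batch_size":
        # Documented above: thin the dataset by a factor of 4.
        return data[::4], timesteps
    raise ValueError(f"unknown mode: {mode}")


# 8 samples, 6 timesteps, 2 quantities
data = [[[s, t] for t in range(6)] for s in range(8)]
timesteps = list(range(6))

interp, t_interp = subset_one(data, timesteps, "interpolation", 2)
extrap, t_extrap = subset_one(data, timesteps, "extrapolation", 4)
sparse, _ = subset_one(data, timesteps, "sparse", 2)
thinned, _ = subset_one(data, timesteps, "batch_size", 64)
```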

codes.utils.data_utils.normalize_data(train_data, test_data=None, val_data=None, mode='standardise', per_species=False)#

Normalize the data based on the training data statistics.

Parameters:

train_data (np.ndarray) – Training data array.
test_data (np.ndarray, optional) – Test data array.
val_data (np.ndarray, optional) – Validation data array.
mode (str) – Normalization mode, either “minmax” or “standardise”.
per_species (bool) – If True, normalize for each species separately.

Returns:

Normalized training data, test data, and validation data.

Return type:

tuple
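The key point is that the statistics come from the training data only and are then applied to the other splits. A minimal sketch of the "standardise" mode follows, simplified to 2D rows of quantity values rather than 3D arrays. The function standardise is hypothetical, and it does not guard against a zero standard deviation.

```python
import statistics


def standardise(train, others=(), per_species=False):
    """Hypothetical sketch: standardise rows using training statistics only."""
    n_quantities = len(train[0])
    if per_species:
        # One mean and standard deviation per quantity (column).
        columns = [[row[q] for row in train] for q in range(n_quantities)]
    else:
        # A single global mean and standard deviation shared by all quantities.
        flat = [v for row in train for v in row]
        columns = [flat] * n_quantities
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]

    def apply(rows):
        return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

    return apply(train), [apply(rows) for rows in others]


train = [[1.0, 10.0], [3.0, 30.0]]
test = [[2.0, 20.0]]
train_n, (test_n,) = standardise(train, others=[test], per_species=True)
```

With per_species=True each quantity is centred and scaled on its own, which is useful when quantities differ by orders of magnitude.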

codes.utils.data_utils.print_data_info(data_path)#

codes.utils.utils module#

codes.utils.utils.batch_factor_to_float(batch_factor)#

Convert a batch factor to a float value.

Parameters:

batch_factor (str | int | float) – The batch factor to convert.

Returns:

The converted batch factor as a float.

Return type:

float
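As a rough illustration of such a conversion, the sketch below accepts all three input types. Whether the library accepts fraction strings such as "1/4" is an assumption of this sketch, and to_float is a hypothetical name.

```python
from fractions import Fraction


def to_float(batch_factor):
    """Hypothetical stand-in: accept str, int, or float and return a float."""
    if isinstance(batch_factor, str):
        # Fraction parses both "0.25" and "1/4"; support for fraction
        # strings is an assumption, not documented behaviour.
        return float(Fraction(batch_factor))
    return float(batch_factor)


quarter = to_float("1/4")
half = to_float("0.5")
double = to_float(2)
```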

codes.utils.utils.check_training_status(config)#

Check if the training is already completed by looking for a completion marker file. If the training is not complete, compare the configurations and ask for a confirmation if there are differences.

Parameters:

config (dict) – The configuration dictionary.

Returns:

A tuple containing:

The path to the task list file (str).
Whether to copy the configuration file (bool).

Return type:

tuple[str, bool]

codes.utils.utils.create_model_dir(base_dir='.', subfolder='trained', unique_id='')#

Create a directory based on a unique identifier inside a specified subfolder of the base directory.

Parameters:

base_dir (str) – The base directory where the subfolder and unique directory will be created.
subfolder (str) – The subfolder inside the base directory to include before the unique directory.
unique_id (str) – A unique identifier to be included in the directory name.

Returns:

The path of the created unique directory within the specified subfolder.

Return type:

str
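The directory layout described above can be sketched with the standard library. This assumes the unique identifier is used directly as the directory name, which may differ from the actual naming scheme; make_model_dir is a hypothetical stand-in.

```python
import os
import tempfile


def make_model_dir(base_dir=".", subfolder="trained", unique_id=""):
    """Hypothetical stand-in: create <base_dir>/<subfolder>/<unique_id>."""
    path = os.path.join(base_dir, subfolder, unique_id)
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return path


with tempfile.TemporaryDirectory() as tmp:
    model_dir = make_model_dir(tmp, "trained", "run_001")
    created = os.path.isdir(model_dir)
```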

codes.utils.utils.determine_batch_size(config, surr_idx, mode, metric)#

Determine the appropriate batch size based on the config, surrogate index, mode, and metric.

Parameters:

config (dict) – The configuration dictionary.
surr_idx (int) – Index of the surrogate model in the config.
mode (str) – The benchmark mode (e.g., “main”, “batch_size”).
metric (int) – Metric used for determining the batch size in “batch_size” mode.

Returns:

The determined batch size.

Return type:

int

Raises:

ValueError – If the number of batch sizes does not match the number of surrogates.
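One way such a lookup could work is sketched below. The config keys "batch_size" and "surrogates", and the rule that the metric is used directly as the batch size in "batch_size" mode, are assumptions made for illustration; pick_batch_size is a hypothetical name.

```python
def pick_batch_size(config, surr_idx, mode, metric):
    """Hypothetical sketch; config keys and mode handling are assumptions."""
    if mode == "batch_size":
        # Assumed: in this mode the metric is the batch size under test.
        return int(metric)
    batch_size = config["batch_size"]
    if isinstance(batch_size, list):
        # A list must provide exactly one batch size per surrogate.
        if len(batch_size) != len(config["surrogates"]):
            raise ValueError(
                "number of batch sizes must match number of surrogates"
            )
        return int(batch_size[surr_idx])
    return int(batch_size)  # a single value shared by all surrogates


config = {"surrogates": ["ModelA", "ModelB"], "batch_size": [64, 128]}
main_size = pick_batch_size(config, 1, "main", 0)
sweep_size = pick_batch_size(config, 0, "batch_size", 32)
```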

codes.utils.utils.get_progress_bar(tasks)#

Create a progress bar with a specific description.

Parameters:

tasks (list) – The list of tasks to be executed.

Returns:

The created progress bar.

Return type:

tqdm

codes.utils.utils.load_and_save_config(config_path='config.yaml', save=True)#

Load configuration from a YAML file and save a copy to the specified directory.

Parameters:

config_path (str) – The path to the configuration YAML file.
save (bool) – Whether to save a copy of the configuration file. Default is True.

Returns:

The loaded configuration dictionary.

Return type:

dict

codes.utils.utils.load_task_list(filepath)#

Load a list of tasks from a JSON file.

Parameters:

filepath (str) – The path to the JSON file.

Returns:

The loaded list of tasks.

Return type:

list

codes.utils.utils.make_description(mode, device, metric, surrogate_name)#

Create a formatted description for the progress bar that ensures consistent alignment.

Parameters:

mode (str) – The benchmark mode (e.g., “accuracy”, “interpolation”, “extrapolation”, “sparse”, “UQ”).
device (str) – The device to use for training (e.g., ‘cuda:0’).
metric (str) – The specific metric for the mode (e.g., interval, cutoff, factor, batch size).
surrogate_name (str) – The name of the surrogate model.

Returns:

A formatted description string for the progress bar.

Return type:

str

codes.utils.utils.nice_print(message, width=80)#

Print a message in a nicely formatted way with a fixed width.

Parameters:

message (str) – The message to print.
width (int) – The width of the printed box. Default is 80.

Return type:

None

codes.utils.utils.parse_for_none(dictionary)#

Parse a dictionary and replace every string “None” with None. If a value is itself a dictionary, it is parsed recursively.

Parameters:

dictionary (dict) – The dictionary to parse.

Returns:

The parsed dictionary.

Return type:

dict
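The recursion can be sketched in a few lines. This hypothetical parse_none returns a new dictionary; whether the library function modifies its argument in place is not stated here.

```python
def parse_none(dictionary):
    """Hypothetical re-implementation of the behaviour described above."""
    parsed = {}
    for key, value in dictionary.items():
        if isinstance(value, dict):
            # Nested dictionaries are parsed recursively.
            parsed[key] = parse_none(value)
        elif isinstance(value, str) and value == "None":
            parsed[key] = None
        else:
            parsed[key] = value
    return parsed


raw = {"scheduler": "None", "optimizer": {"name": "adam", "weight_decay": "None"}}
clean = parse_none(raw)
```

This is useful for YAML configurations in which a missing value was written as the quoted string "None" rather than as null.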

codes.utils.utils.read_yaml_config(config_path)#

codes.utils.utils.save_task_list(tasks, filepath)#

Save a list of tasks to a JSON file.

Parameters:

tasks (list) – The list of tasks to save.
filepath (str) – The path to the JSON file.

Return type:

None
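Saving and loading a task list amounts to a JSON round trip. The sketch below uses hypothetical helpers save_tasks and load_tasks, and the task entries are invented for illustration; the actual task format is not specified here.

```python
import json
import os
import tempfile


def save_tasks(tasks, filepath):
    """Hypothetical stand-in: write the task list as JSON."""
    with open(filepath, "w") as f:
        json.dump(tasks, f)


def load_tasks(filepath):
    """Hypothetical stand-in: read the task list back from JSON."""
    with open(filepath) as f:
        return json.load(f)


# Illustrative task entries only.
tasks = [["MySurrogate", "interpolation", 2], ["MySurrogate", "sparse", 4]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tasks.json")
    save_tasks(tasks, path)
    restored = load_tasks(path)
```

JSON has no tuple type, so tuples in a task list come back as lists after loading.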

codes.utils.utils.set_random_seeds(seed, device)#

Set random seeds for reproducibility.

Parameters:

seed (int) – The random seed to set.
device (str) – The device to use for training, e.g., ‘cuda:0’.

Return type:

None

codes.utils.utils.time_execution(func)#

Decorator to time the execution of a function and store the duration as an attribute of the function.

Parameters:

func (callable) – The function to be timed.
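A decorator of this kind can be sketched as follows. The attribute name duration is an assumption, since the name used by the library is not given here, and timed is a hypothetical stand-in.

```python
import functools
import time


def timed(func):
    """Hypothetical stand-in: time each call and store the result on the wrapper."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        # Attribute name `duration` is assumed for this sketch.
        wrapper.duration = time.perf_counter() - start
        return result

    wrapper.duration = None  # no call has been timed yet
    return wrapper


@timed
def slow_sum(n):
    return sum(range(n))


total = slow_sum(1000)
elapsed = slow_sum.duration
```

Storing the duration on the function keeps the decorated function's return value unchanged, so existing callers do not need to be modified.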

codes.utils.utils.worker_init_fn(worker_id)#

Initialize the random seed for each worker in PyTorch DataLoader.

Parameters:

worker_id (int) – The worker ID.

Module contents#

codes.utils.batch_factor_to_float(batch_factor)#

Convert a batch factor to a float value.

Parameters:

batch_factor (str | int | float) – The batch factor to convert.

Returns:

The converted batch factor as a float.

Return type:

float

codes.utils.check_and_load_data(dataset_name, verbose=True, log=True, log_params=True, normalisation_mode='standardise', tolerance=None, per_species=False)#

Check that the specified dataset is available and load its training, test, and validation data.

Parameters:

dataset_name (str) – The name of the dataset.
verbose (bool) – Whether to print information about the loaded data.
log (bool) – Whether to log-transform the data (log10).
log_params (bool) – Whether to log-transform the parameters.
normalisation_mode (str) – The normalization mode, either “disable”, “minmax”, or “standardise”.
tolerance (float, optional) – The tolerance value for log-transformation. Values below this will be set to the tolerance value. Pass None to disable.
per_species (bool) – If True, normalize for each species separately.

Returns:

A tuple containing:

(train_data, test_data, val_data)
(train_params, test_params, val_params) or (None, None, None) if parameters are absent
timesteps
n_train_samples
data_info (including transformation parameters for data and for parameters)
labels

Return type:

tuple

Raises:

DatasetError – If the dataset or required data is missing or if the data shape is incorrect.

codes.utils.check_training_status(config)#

Check if the training is already completed by looking for a completion marker file. If the training is not complete, compare the configurations and ask for a confirmation if there are differences.

Parameters:

config (dict) – The configuration dictionary.

Returns:

A tuple containing:

The path to the task list file (str).
Whether to copy the configuration file (bool).

Return type:

tuple[str, bool]

codes.utils.create_dataset(name, data, params=None, split=(0.7, 0.1, 0.2), timesteps=None, labels=None)#

Creates a new dataset in the data directory.

Parameters:

name (str) – The name of the dataset.
data (np.ndarray or tuple of np.ndarray) – Either a single 3D array of shape (n_samples, n_timesteps, n_quantities) or a tuple of three 3D arrays representing (train, test, val).
params (np.ndarray or tuple of np.ndarray, optional) – Either a single 2D array of shape (n_samples, n_parameters) corresponding to all samples, or a tuple of three 2D arrays representing (train, test, val) parameters. Must be provided in the same structure as data.
split (tuple of three floats, optional) – If data is provided as a single array, it is split into train, test, and validation sets according to these ratios (which must sum to 1).
timesteps (np.ndarray, optional) – A 1D array of timesteps. Its length must equal the number of timesteps in the data.
labels (list[str], optional) – Labels for the quantities. The number of labels must match the last dimension of the data.

Raises:

FileExistsError – If the dataset directory already exists.
TypeError – If data (or params) are not of the expected type.
ValueError – If the shapes of data or params are inconsistent.

codes.utils.create_hdf5_dataset(train_data, test_data, val_data, dataset_name, data_dir='datasets', timesteps=None, labels=None, train_params=None, test_params=None, val_params=None)#

Create an HDF5 file for a dataset with train, test, and validation data, along with optional timesteps and parameters.

Parameters:

train_data (np.ndarray) – The training data array of shape (n_samples, n_timesteps, n_quantities).
test_data (np.ndarray) – The test data array of shape (n_samples, n_timesteps, n_quantities).
val_data (np.ndarray) – The validation data array of shape (n_samples, n_timesteps, n_quantities).
dataset_name (str) – The name of the dataset.
data_dir (str) – The directory in which to save the dataset.
timesteps (np.ndarray, optional) – A 1D array of timesteps.
labels (list[str], optional) – Labels for the quantities.
train_params (np.ndarray, optional) – Training parameters of shape (n_samples, n_parameters).
test_params (np.ndarray, optional) – Testing parameters of shape (n_samples, n_parameters).
val_params (np.ndarray, optional) – Validation parameters of shape (n_samples, n_parameters).

codes.utils.create_model_dir(base_dir='.', subfolder='trained', unique_id='')#

Create a directory based on a unique identifier inside a specified subfolder of the base directory.

Parameters:

base_dir (str) – The base directory where the subfolder and unique directory will be created.
subfolder (str) – The subfolder inside the base directory to include before the unique directory.
unique_id (str) – A unique identifier to be included in the directory name.

Returns:

The path of the created unique directory within the specified subfolder.

Return type:

str

codes.utils.determine_batch_size(config, surr_idx, mode, metric)#

Determine the appropriate batch size based on the config, surrogate index, mode, and metric.

Parameters:

config (dict) – The configuration dictionary.
surr_idx (int) – Index of the surrogate model in the config.
mode (str) – The benchmark mode (e.g., “main”, “batch_size”).
metric (int) – Metric used for determining the batch size in “batch_size” mode.

Returns:

The determined batch size.

Return type:

int

Raises:

ValueError – If the number of batch sizes does not match the number of surrogates.

codes.utils.download_data(dataset_name, path=None, verbose=True)#

Download the specified dataset if it is not present, with a progress bar.

Parameters:

dataset_name (str) – The name of the dataset.
path (str, optional) – The path to save the dataset. If None, the default data directory is used.
verbose (bool) – Whether to print information about the download progress.

codes.utils.get_data_subset(data, timesteps, mode, metric, params=None, subset_factor=1)#

Get the appropriate data subset based on the mode and metric.

Parameters:

data (tuple[np.ndarray, ...]) – A tuple of data arrays of shape (n_samples, n_timesteps, n_quantities).
timesteps (np.ndarray) – The timesteps.
mode (str) – The benchmark mode (must be one of “interpolation”, “extrapolation”, “sparse”, “batch_size”). For “batch_size”, we thin the dataset by a factor of 4 for faster processing.
metric (int) – The specific metric for the mode (e.g., interval, cutoff, factor, batch size).
params (tuple[np.ndarray, ...] | None | tuple[None, ...]) – Optional parameters (or tuple of parameters) with shape (n_samples, n_parameters). If None, params_subset will be None. If it is a tuple of Nones, then params_subset will be that tuple of Nones.
subset_factor (int) – The factor to subset the data by. Default is 1 (use full train and test data).

Returns:

(data_subset, params_subset, timesteps_subset)

Return type:

tuple

codes.utils.get_progress_bar(tasks)#

Create a progress bar with a specific description.

Parameters:

tasks (list) – The list of tasks to be executed.

Returns:

The created progress bar.

Return type:

tqdm

codes.utils.load_and_save_config(config_path='config.yaml', save=True)#

Load configuration from a YAML file and save a copy to the specified directory.

Parameters:

config_path (str) – The path to the configuration YAML file.
save (bool) – Whether to save a copy of the configuration file. Default is True.

Returns:

The loaded configuration dictionary.

Return type:

dict

codes.utils.load_task_list(filepath)#

Load a list of tasks from a JSON file.

Parameters:

filepath (str) – The path to the JSON file.

Returns:

The loaded list of tasks.

Return type:

list

codes.utils.make_description(mode, device, metric, surrogate_name)#

Create a formatted description for the progress bar that ensures consistent alignment.

Parameters:

mode (str) – The benchmark mode (e.g., “accuracy”, “interpolation”, “extrapolation”, “sparse”, “UQ”).
device (str) – The device to use for training (e.g., ‘cuda:0’).
metric (str) – The specific metric for the mode (e.g., interval, cutoff, factor, batch size).
surrogate_name (str) – The name of the surrogate model.

Returns:

A formatted description string for the progress bar.

Return type:

str

codes.utils.nice_print(message, width=80)#

Print a message in a nicely formatted way with a fixed width.

Parameters:

message (str) – The message to print.
width (int) – The width of the printed box. Default is 80.

Return type:

None

codes.utils.normalize_data(train_data, test_data=None, val_data=None, mode='standardise', per_species=False)#

Normalize the data based on the training data statistics.

Parameters:

train_data (np.ndarray) – Training data array.
test_data (np.ndarray, optional) – Test data array.
val_data (np.ndarray, optional) – Validation data array.
mode (str) – Normalization mode, either “minmax” or “standardise”.
per_species (bool) – If True, normalize for each species separately.

Returns:

Normalized training data, test data, and validation data.

Return type:

tuple

codes.utils.read_yaml_config(config_path)#

codes.utils.save_task_list(tasks, filepath)#

Save a list of tasks to a JSON file.

Parameters:

tasks (list) – The list of tasks to save.
filepath (str) – The path to the JSON file.

Return type:

None

codes.utils.set_random_seeds(seed, device)#

Set random seeds for reproducibility.

Parameters:

seed (int) – The random seed to set.
device (str) – The device to use for training, e.g., ‘cuda:0’.

Return type:

None

codes.utils.time_execution(func)#

Decorator to time the execution of a function and store the duration as an attribute of the function.

Parameters:

func (callable) – The function to be timed.

codes.utils.worker_init_fn(worker_id)#

Initialize the random seed for each worker in PyTorch DataLoader.

Parameters:

worker_id (int) – The worker ID.
