codes.utils package#

Submodules#

codes.utils.data_utils module#

exception codes.utils.data_utils.DatasetError#

Bases: Exception

Error for missing data or dataset or if the data shape is incorrect.

codes.utils.data_utils.check_and_load_data(dataset_name, verbose=True, log=True, normalisation_mode='standardise', tolerance=1e-20)#

Check the specified dataset and load the data based on the mode (train or test).

Parameters:
  • dataset_name (str) – The name of the dataset.

  • verbose (bool) – Whether to print information about the loaded data.

  • log (bool) – Whether to log-transform the data (log10).

  • normalisation_mode (str) – The normalization mode, either “disable”, “minmax”, or “standardise”.

  • tolerance (float, optional) – The tolerance value for log-transformation. Values below this will be set to the tolerance value. Pass None to disable.

Returns:

Loaded data and timesteps.

Return type:

tuple

Raises:

DatasetError – If the dataset or required data is missing or if the data shape is incorrect.
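The interplay of log and tolerance can be illustrated with a minimal sketch. It assumes that values are clipped to the tolerance before log10 is applied; log_transform is a hypothetical helper written for illustration and is not part of codes.utils.

```python
import numpy as np


def log_transform(data, tolerance=1e-20):
    """Hypothetical helper: clip values below `tolerance`, then take log10."""
    data = np.asarray(data, dtype=float)
    if tolerance is not None:
        # Values below the tolerance are raised to the tolerance itself,
        # which keeps zeros from turning into -inf under log10.
        data = np.maximum(data, tolerance)
    return np.log10(data)


abundances = np.array([1.0, 1e-5, 0.0])
transformed = log_transform(abundances)
```

With the default tolerance, a zero entry maps to log10(1e-20) rather than to negative infinity.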

codes.utils.data_utils.create_dataset(name, train_data, test_data=None, val_data=None, split=None, timesteps=None, labels=None)#

Creates a new dataset in the data directory.

Parameters:
  • name (str) – The name of the dataset.

  • train_data (np.ndarray | torch.Tensor) – The training data.

  • test_data (np.ndarray | torch.Tensor, optional) – The test data.

  • val_data (np.ndarray | torch.Tensor, optional) – The validation data.

  • split (tuple, optional) – If test_data and val_data are not provided, train_data can be split into training, test and validation data.

  • timesteps (np.ndarray | torch.Tensor, optional) – The timesteps array.

  • labels (list[str], optional) – The labels for the chemicals.

Raises:
  • FileExistsError – If the dataset already exists.

  • TypeError – If the train_data is not a numpy array or torch tensor.

  • ValueError – If the train_data, test_data, and val_data do not have the correct shape.
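How split is interpreted is not spelled out above. The following sketch assumes it holds three fractions in the order (train, test, val) and that samples are sliced along the first axis without shuffling; split_by_fractions is a hypothetical helper, not the library's implementation.

```python
def split_by_fractions(samples, split):
    """Hypothetical helper: slice `samples` along the first axis.

    `split` is assumed to hold three fractions (train, test, val) summing to 1.
    """
    if len(split) != 3 or abs(sum(split) - 1.0) > 1e-9:
        raise ValueError("split must contain three fractions that sum to 1")
    n = len(samples)
    n_train = int(n * split[0])
    n_test = int(n * split[1])
    train = samples[:n_train]
    test = samples[n_train:n_train + n_test]
    val = samples[n_train + n_test:]  # the remainder goes to validation
    return train, test, val


train, test, val = split_by_fractions(list(range(8)), (0.5, 0.25, 0.25))
```

Because the function only uses len() and slicing, the same sketch would work on a NumPy array or a torch tensor.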

codes.utils.data_utils.create_hdf5_dataset(train_data, test_data, val_data, dataset_name, data_dir='datasets', timesteps=None, labels=None)#

Create an HDF5 file for a dataset with train, test, and validation data, and optionally timesteps. Additionally, store metadata about the dataset.

Parameters:
  • train_data (np.ndarray) – The training data array of shape (n_samples, n_timesteps, n_chemicals).

  • test_data (np.ndarray) – The test data array of shape (n_samples, n_timesteps, n_chemicals).

  • val_data (np.ndarray) – The validation data array of shape (n_samples, n_timesteps, n_chemicals).

  • dataset_name (str) – The name of the dataset.

  • data_dir (str) – The directory to save the dataset in.

  • timesteps (np.ndarray, optional) – The timesteps array. If None, integer timesteps will be generated.

  • labels (list[str], optional) – The labels for the chemicals.

codes.utils.data_utils.download_data(dataset_name, path=None)#

Download the specified dataset if it is not present.

Parameters:
  • dataset_name (str) – The name of the dataset.

  • path (str, optional) – The path to save the dataset. If None, the default data directory is used.

codes.utils.data_utils.get_data_subset(full_train_data, full_test_data, timesteps, mode, metric)#

Get the appropriate data subset based on the mode and metric.

Parameters:
  • full_train_data (np.ndarray) – The full training data.

  • full_test_data (np.ndarray) – The full test data.

  • timesteps (np.ndarray) – The timesteps.

  • mode (str) – The benchmark mode (e.g., “accuracy”, “interpolation”, “extrapolation”, “sparse”, “UQ”).

  • metric (int) – The specific metric for the mode (e.g., interval, cutoff, factor, batch size).

Returns:

The training data, test data, and timesteps subset.

Return type:

tuple

codes.utils.data_utils.normalize_data(train_data, test_data=None, val_data=None, mode='standardise')#

Normalize the data based on the training data statistics.

Parameters:
  • train_data (np.ndarray) – Training data array.

  • test_data (np.ndarray, optional) – Test data array.

  • val_data (np.ndarray, optional) – Validation data array.

  • mode (str) – Normalization mode, either “minmax” or “standardise”.

Returns:

Normalized training data, test data, and validation data.

Return type:

tuple
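The key point of normalising "based on the training data statistics" is that test and validation data are transformed with the offset and scale computed from train_data only. A minimal sketch follows; normalize_sketch is a hypothetical re-implementation, and the use of global (rather than per-feature) statistics and of the [0, 1] range for "minmax" are assumptions.

```python
import numpy as np


def normalize_sketch(train_data, test_data=None, mode="standardise"):
    """Hypothetical sketch: all statistics come from train_data only."""
    if mode == "standardise":
        offset, scale = train_data.mean(), train_data.std()
    elif mode == "minmax":
        # Mapping to [0, 1] is an assumption; other target ranges are common.
        offset = train_data.min()
        scale = train_data.max() - offset
    else:
        raise ValueError(f"Unknown mode: {mode}")

    def transform(x):
        return (x - offset) / scale

    test_out = None if test_data is None else transform(test_data)
    return transform(train_data), test_out


train = np.array([0.0, 5.0, 10.0])
test = np.array([20.0])
mm_train, mm_test = normalize_sketch(train, test, mode="minmax")
st_train, _ = normalize_sketch(train, mode="standardise")
```

Note that mm_test lies outside the training range here: test values are not re-scaled with their own minimum and maximum.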

codes.utils.utils module#

codes.utils.utils.check_training_status(config)#

Check if the training is already completed by looking for a completion marker file. If the training is not complete, compare the configurations and ask for a confirmation if there are differences.

Parameters:

config (dict) – The configuration dictionary.

Returns:

The path to the task list file, and whether to copy the configuration file.

Return type:

tuple[str, bool]

codes.utils.utils.create_model_dir(base_dir='.', subfolder='trained', unique_id='')#

Create a directory based on a unique identifier inside a specified subfolder of the base directory.

Parameters:
  • base_dir (str) – The base directory where the subfolder and unique directory will be created.

  • subfolder (str) – The subfolder inside the base directory to include before the unique directory.

  • unique_id (str) – A unique identifier to be included in the directory name.

Returns:

The path of the created unique directory within the specified subfolder.

Return type:

str
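One plausible reading of this description is a directory at base_dir/subfolder/unique_id. The sketch below assumes exactly that layout and that an existing directory is not an error; create_model_dir_sketch is a hypothetical stand-in, not the library function.

```python
import os
import tempfile


def create_model_dir_sketch(base_dir=".", subfolder="trained", unique_id=""):
    """Hypothetical sketch: create <base_dir>/<subfolder>/<unique_id>."""
    full_path = os.path.join(base_dir, subfolder, unique_id)
    os.makedirs(full_path, exist_ok=True)  # no error if it already exists
    return full_path


# A temporary directory keeps the example from touching the working tree.
with tempfile.TemporaryDirectory() as tmp:
    path = create_model_dir_sketch(tmp, "trained", "run_001")
    created = os.path.isdir(path)
```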

codes.utils.utils.get_progress_bar(tasks)#

Create a progress bar with a specific description.

Parameters:

tasks (list) – The list of tasks to be executed.

Returns:

The created progress bar.

Return type:

tqdm

codes.utils.utils.load_and_save_config(config_path='config.yaml', save=True)#

Load configuration from a YAML file and save a copy to the specified directory.

Parameters:
  • config_path (str) – The path to the configuration YAML file.

  • save (bool) – Whether to save a copy of the configuration file. Default is True.

Returns:

The loaded configuration dictionary.

Return type:

dict

codes.utils.utils.load_task_list(filepath)#

Load a list of tasks from a JSON file.

Parameters:

filepath (str) – The path to the JSON file.

Returns:

The loaded list of tasks.

Return type:

list

codes.utils.utils.make_description(mode, device, metric, surrogate_name)#

Create a formatted description for the progress bar that ensures consistent alignment.

Parameters:
  • mode (str) – The benchmark mode (e.g., “accuracy”, “interpolation”, “extrapolation”, “sparse”, “UQ”).

  • device (str) – The device to use for training (e.g., ‘cuda:0’).

  • metric (str) – The specific metric for the mode (e.g., interval, cutoff, factor, batch size).

  • surrogate_name (str) – The name of the surrogate model.

Returns:

A formatted description string for the progress bar.

Return type:

str

codes.utils.utils.nice_print(message, width=80)#

Print a message in a nicely formatted way with a fixed width.

Parameters:
  • message (str) – The message to print.

  • width (int) – The width of the printed box. Default is 80.

Return type:

None
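The exact box style is not documented here. As an illustration of fixed-width output, the sketch below centres the message between two border lines; format_box and nice_print_sketch are hypothetical names, and the border characters are an assumption.

```python
def format_box(message, width=80):
    """Hypothetical helper: build a three-line box of exactly `width` columns."""
    border = "-" * width
    # Two columns are reserved for the vertical bars at either end.
    body = "|" + message.center(width - 2) + "|"
    return [border, body, border]


def nice_print_sketch(message, width=80):
    """Print the box; like the documented function, this returns None."""
    print("\n".join(format_box(message, width)))


box = format_box("Training complete", width=30)
nice_print_sketch("Training complete", width=30)
```

A message longer than width - 2 characters would overflow the box in this sketch, since str.center never truncates.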

codes.utils.utils.read_yaml_config(config_path)#

codes.utils.utils.save_task_list(tasks, filepath)#

Save a list of tasks to a JSON file.

Parameters:
  • tasks (list) – The list of tasks to save.

  • filepath (str) – The path to the JSON file.

Return type:

None
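Saving and loading a task list amounts to a JSON round trip. The sketch below uses only the standard json module; save_tasks, load_tasks, and the structure of each task entry are hypothetical and chosen for illustration.

```python
import json
import os
import tempfile


def save_tasks(tasks, filepath):
    """Hypothetical sketch: write the task list as JSON."""
    with open(filepath, "w") as f:
        json.dump(tasks, f)


def load_tasks(filepath):
    """Hypothetical sketch: read the task list back from JSON."""
    with open(filepath) as f:
        return json.load(f)


# The task layout (surrogate, mode, metric) is made up for this example.
tasks = [["model_a", "accuracy", None], ["model_b", "sparse", 4]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "task_list.json")
    save_tasks(tasks, path)
    restored = load_tasks(path)
```

JSON has no tuple type, so tasks stored as tuples would come back as lists after such a round trip.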

codes.utils.utils.set_random_seeds(seed)#

Set random seeds for reproducibility.

Parameters:

seed (int) – The random seed to set.
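The effect of seeding can be demonstrated with the standard library alone. The sketch below seeds only Python's random module; a full version would presumably also seed NumPy and PyTorch, which is an assumption and is left out here to keep the example dependency-free.

```python
import random


def set_random_seeds_sketch(seed):
    """Hypothetical sketch: seed Python's built-in RNG."""
    random.seed(seed)
    # A complete version would likely also call np.random.seed(seed) and
    # torch.manual_seed(seed); both are omitted here on purpose.


set_random_seeds_sketch(42)
first = [random.random() for _ in range(3)]
set_random_seeds_sketch(42)
second = [random.random() for _ in range(3)]
```

Re-seeding with the same value reproduces the same sequence, which is what makes repeated runs comparable.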

codes.utils.utils.time_execution(func)#

Decorator to time the execution of a function and store the duration as an attribute of the function.

Parameters:

func (callable) – The function to be timed.
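A decorator of this kind can be sketched as follows. The attribute name duration and the choice to store it on the wrapper are assumptions; time_execution_sketch is a hypothetical stand-in for the library decorator.

```python
import functools
import time


def time_execution_sketch(func):
    """Hypothetical sketch: record the last call's run time on the wrapper."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        # The attribute name `duration` is an assumption for this sketch.
        wrapper.duration = time.perf_counter() - start
        return result

    wrapper.duration = None  # no call has been timed yet
    return wrapper


@time_execution_sketch
def slow_add(a, b):
    time.sleep(0.01)
    return a + b


total = slow_add(2, 3)
```

After the call, slow_add.duration holds the elapsed time in seconds, while the return value of the wrapped function is passed through unchanged.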

codes.utils.utils.worker_init_fn(worker_id)#

Initialize the random seed for each worker in PyTorch DataLoader.

Parameters:

worker_id (int) – The worker ID.

Module contents#

codes.utils.check_and_load_data(dataset_name, verbose=True, log=True, normalisation_mode='standardise', tolerance=1e-20)#

Check the specified dataset and load the data based on the mode (train or test).

Parameters:
  • dataset_name (str) – The name of the dataset.

  • verbose (bool) – Whether to print information about the loaded data.

  • log (bool) – Whether to log-transform the data (log10).

  • normalisation_mode (str) – The normalization mode, either “disable”, “minmax”, or “standardise”.

  • tolerance (float, optional) – The tolerance value for log-transformation. Values below this will be set to the tolerance value. Pass None to disable.

Returns:

Loaded data and timesteps.

Return type:

tuple

Raises:

DatasetError – If the dataset or required data is missing or if the data shape is incorrect.

codes.utils.check_training_status(config)#

Check if the training is already completed by looking for a completion marker file. If the training is not complete, compare the configurations and ask for a confirmation if there are differences.

Parameters:

config (dict) – The configuration dictionary.

Returns:

The path to the task list file, and whether to copy the configuration file.

Return type:

tuple[str, bool]

codes.utils.create_dataset(name, train_data, test_data=None, val_data=None, split=None, timesteps=None, labels=None)#

Creates a new dataset in the data directory.

Parameters:
  • name (str) – The name of the dataset.

  • train_data (np.ndarray | torch.Tensor) – The training data.

  • test_data (np.ndarray | torch.Tensor, optional) – The test data.

  • val_data (np.ndarray | torch.Tensor, optional) – The validation data.

  • split (tuple, optional) – If test_data and val_data are not provided, train_data can be split into training, test and validation data.

  • timesteps (np.ndarray | torch.Tensor, optional) – The timesteps array.

  • labels (list[str], optional) – The labels for the chemicals.

Raises:
  • FileExistsError – If the dataset already exists.

  • TypeError – If the train_data is not a numpy array or torch tensor.

  • ValueError – If the train_data, test_data, and val_data do not have the correct shape.

codes.utils.create_hdf5_dataset(train_data, test_data, val_data, dataset_name, data_dir='datasets', timesteps=None, labels=None)#

Create an HDF5 file for a dataset with train, test, and validation data, and optionally timesteps. Additionally, store metadata about the dataset.

Parameters:
  • train_data (np.ndarray) – The training data array of shape (n_samples, n_timesteps, n_chemicals).

  • test_data (np.ndarray) – The test data array of shape (n_samples, n_timesteps, n_chemicals).

  • val_data (np.ndarray) – The validation data array of shape (n_samples, n_timesteps, n_chemicals).

  • dataset_name (str) – The name of the dataset.

  • data_dir (str) – The directory to save the dataset in.

  • timesteps (np.ndarray, optional) – The timesteps array. If None, integer timesteps will be generated.

  • labels (list[str], optional) – The labels for the chemicals.

codes.utils.create_model_dir(base_dir='.', subfolder='trained', unique_id='')#

Create a directory based on a unique identifier inside a specified subfolder of the base directory.

Parameters:
  • base_dir (str) – The base directory where the subfolder and unique directory will be created.

  • subfolder (str) – The subfolder inside the base directory to include before the unique directory.

  • unique_id (str) – A unique identifier to be included in the directory name.

Returns:

The path of the created unique directory within the specified subfolder.

Return type:

str

codes.utils.download_data(dataset_name, path=None)#

Download the specified dataset if it is not present.

Parameters:
  • dataset_name (str) – The name of the dataset.

  • path (str, optional) – The path to save the dataset. If None, the default data directory is used.

codes.utils.get_data_subset(full_train_data, full_test_data, timesteps, mode, metric)#

Get the appropriate data subset based on the mode and metric.

Parameters:
  • full_train_data (np.ndarray) – The full training data.

  • full_test_data (np.ndarray) – The full test data.

  • timesteps (np.ndarray) – The timesteps.

  • mode (str) – The benchmark mode (e.g., “accuracy”, “interpolation”, “extrapolation”, “sparse”, “UQ”).

  • metric (int) – The specific metric for the mode (e.g., interval, cutoff, factor, batch size).

Returns:

The training data, test data, and timesteps subset.

Return type:

tuple

codes.utils.get_progress_bar(tasks)#

Create a progress bar with a specific description.

Parameters:

tasks (list) – The list of tasks to be executed.

Returns:

The created progress bar.

Return type:

tqdm

codes.utils.load_and_save_config(config_path='config.yaml', save=True)#

Load configuration from a YAML file and save a copy to the specified directory.

Parameters:
  • config_path (str) – The path to the configuration YAML file.

  • save (bool) – Whether to save a copy of the configuration file. Default is True.

Returns:

The loaded configuration dictionary.

Return type:

dict

codes.utils.load_task_list(filepath)#

Load a list of tasks from a JSON file.

Parameters:

filepath (str) – The path to the JSON file.

Returns:

The loaded list of tasks.

Return type:

list

codes.utils.make_description(mode, device, metric, surrogate_name)#

Create a formatted description for the progress bar that ensures consistent alignment.

Parameters:
  • mode (str) – The benchmark mode (e.g., “accuracy”, “interpolation”, “extrapolation”, “sparse”, “UQ”).

  • device (str) – The device to use for training (e.g., ‘cuda:0’).

  • metric (str) – The specific metric for the mode (e.g., interval, cutoff, factor, batch size).

  • surrogate_name (str) – The name of the surrogate model.

Returns:

A formatted description string for the progress bar.

Return type:

str

codes.utils.nice_print(message, width=80)#

Print a message in a nicely formatted way with a fixed width.

Parameters:
  • message (str) – The message to print.

  • width (int) – The width of the printed box. Default is 80.

Return type:

None

codes.utils.normalize_data(train_data, test_data=None, val_data=None, mode='standardise')#

Normalize the data based on the training data statistics.

Parameters:
  • train_data (np.ndarray) – Training data array.

  • test_data (np.ndarray, optional) – Test data array.

  • val_data (np.ndarray, optional) – Validation data array.

  • mode (str) – Normalization mode, either “minmax” or “standardise”.

Returns:

Normalized training data, test data, and validation data.

Return type:

tuple

codes.utils.read_yaml_config(config_path)#

codes.utils.save_task_list(tasks, filepath)#

Save a list of tasks to a JSON file.

Parameters:
  • tasks (list) – The list of tasks to save.

  • filepath (str) – The path to the JSON file.

Return type:

None

codes.utils.set_random_seeds(seed)#

Set random seeds for reproducibility.

Parameters:

seed (int) – The random seed to set.

codes.utils.time_execution(func)#

Decorator to time the execution of a function and store the duration as an attribute of the function.

Parameters:

func (callable) – The function to be timed.

codes.utils.worker_init_fn(worker_id)#

Initialize the random seed for each worker in PyTorch DataLoader.

Parameters:

worker_id (int) – The worker ID.