codes.utils package#
Submodules#
codes.utils.data_utils module#
- exception codes.utils.data_utils.DatasetError#
Bases: Exception
Raised when a dataset or required data is missing, or when the data shape is incorrect.
- class codes.utils.data_utils.DownloadProgressBar(*_, **__)#
Bases: tqdm
- update_to(b=1, bsize=1, tsize=None)#
Update the progress bar.
- Parameters:
b (int, optional) – Number of blocks transferred so far. Default is 1.
bsize (int, optional) – Size of each block (in tqdm units). Default is 1.
tsize (int, optional) – Total size (in tqdm units). Default is None.
- codes.utils.data_utils.check_and_load_data(dataset_name, verbose=True, log=True, log_params=True, normalisation_mode='standardise', tolerance=None)#
Check the specified dataset and load the data based on the mode (train or test).
- Parameters:
dataset_name (str) – The name of the dataset.
verbose (bool) – Whether to print information about the loaded data.
log (bool) – Whether to log-transform the data (log10).
log_params (bool) – Whether to log-transform the parameters.
normalisation_mode (str) – The normalization mode, either “disable”, “minmax”, or “standardise”.
tolerance (float, optional) – The tolerance value for log-transformation. Values below this will be set to the tolerance value. Pass None to disable.
- Returns:
- A tuple containing:
(train_data, test_data, val_data)
(train_params, test_params, val_params) or (None, None, None) if parameters are absent
timesteps
n_train_samples
data_info (including transformation parameters for data and for parameters)
labels
- Return type:
tuple
- Raises:
DatasetError – If the dataset or required data is missing or if the data shape is incorrect.
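Example (illustrative): a minimal load, assuming the placeholder dataset name "my_dataset" exists in the data directory; the unpacking mirrors the tuple structure documented above.

```python
from codes.utils import check_and_load_data

(
    (train_data, test_data, val_data),
    (train_params, test_params, val_params),
    timesteps,
    n_train_samples,
    data_info,
    labels,
) = check_and_load_data(
    "my_dataset",                      # placeholder dataset name
    log=True,                          # log10-transform the data
    normalisation_mode="standardise",
)
print(train_data.shape)                # (n_samples, n_timesteps, n_quantities)
```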
- codes.utils.data_utils.create_dataset(name, data, params=None, split=(0.7, 0.1, 0.2), timesteps=None, labels=None)#
Create a new dataset in the data directory.
- Parameters:
name (str) – The name of the dataset.
data (np.ndarray or tuple of np.ndarray) – Either a single 3D array of shape (n_samples, n_timesteps, n_quantities) or a tuple of three 3D arrays representing (train, test, val).
params (np.ndarray or tuple of np.ndarray, optional) – Either a single 2D array of shape (n_samples, n_parameters) corresponding to all samples, or a tuple of three 2D arrays representing (train, test, val) parameters. Must be provided in the same structure as data.
split (tuple of three floats, optional) – If data is provided as a single array, it is split into train, test, and validation sets according to these ratios (which must sum to 1).
timesteps (np.ndarray, optional) – A 1D array of timesteps. Its length must equal the number of timesteps in the data.
labels (list[str], optional) – Labels for the quantities. The number of labels must match the last dimension of the data.
- Raises:
FileExistsError – If the dataset directory already exists.
TypeError – If data (or params) are not of the expected type.
ValueError – If the shapes of data or params are inconsistent.
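Example (illustrative): creating a dataset from a single synthetic array, which is then split by the given ratios; all names and sizes are placeholders.

```python
import numpy as np
from codes.utils import create_dataset

data = np.random.rand(100, 50, 3)        # (n_samples, n_timesteps, n_quantities)
params = np.random.rand(100, 2)          # (n_samples, n_parameters)

create_dataset(
    "toy_dataset",                       # must not already exist, else FileExistsError
    data,
    params=params,
    split=(0.7, 0.1, 0.2),               # ratios must sum to 1
    timesteps=np.linspace(0.0, 1.0, 50),
    labels=["A", "B", "C"],              # one label per quantity
)
```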
- codes.utils.data_utils.create_hdf5_dataset(train_data, test_data, val_data, dataset_name, data_dir='datasets', timesteps=None, labels=None, train_params=None, test_params=None, val_params=None)#
Create an HDF5 file for a dataset with train, test, and validation data, along with optional timesteps and parameters.
- Parameters:
train_data (np.ndarray) – The training data array of shape (n_samples, n_timesteps, n_quantities).
test_data (np.ndarray) – The test data array of shape (n_samples, n_timesteps, n_quantities).
val_data (np.ndarray) – The validation data array of shape (n_samples, n_timesteps, n_quantities).
dataset_name (str) – The name of the dataset.
data_dir (str) – The directory in which to save the dataset.
timesteps (np.ndarray, optional) – A 1D array of timesteps.
labels (list[str], optional) – Labels for the quantities.
train_params (np.ndarray, optional) – Training parameters of shape (n_samples, n_parameters).
test_params (np.ndarray, optional) – Testing parameters of shape (n_samples, n_parameters).
val_params (np.ndarray, optional) – Validation parameters of shape (n_samples, n_parameters).
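Example (illustrative): writing pre-split synthetic arrays directly to an HDF5 file; sizes and names are placeholders.

```python
import numpy as np
from codes.utils import create_hdf5_dataset

train = np.random.rand(70, 50, 3)
test = np.random.rand(10, 50, 3)
val = np.random.rand(20, 50, 3)

create_hdf5_dataset(
    train, test, val,
    dataset_name="toy_dataset",
    data_dir="datasets",
    timesteps=np.linspace(0.0, 1.0, 50),
    labels=["A", "B", "C"],
)
```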
- codes.utils.data_utils.download_data(dataset_name, path=None, verbose=True)#
Download the specified dataset if it is not present, with a progress bar.
- Parameters:
dataset_name (str) – The name of the dataset.
path (str, optional) – The path to save the dataset. If None, the default data directory is used.
verbose (bool) – Whether to print information about the download progress.
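Example (illustrative): fetching a dataset into the default data directory; the dataset name is a placeholder.

```python
from codes.utils import download_data

download_data("my_dataset", verbose=True)   # saved to the default data directory
```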
- codes.utils.data_utils.get_data_subset(data, timesteps, mode, metric, params=None, subset_factor=1)#
Get the appropriate data subset based on the mode and metric.
- Parameters:
data (tuple[np.ndarray, ...]) – A tuple of data arrays of shape (n_samples, n_timesteps, n_quantities).
timesteps (np.ndarray) – The timesteps.
mode (str) – The benchmark mode (must be one of “interpolation”, “extrapolation”, “sparse”, “batch_size”). For “batch_size”, we thin the dataset by a factor of 4 for faster processing.
metric (int) – The specific metric for the mode (e.g., interval, cutoff, factor, batch size).
params (tuple[np.ndarray, ...] | None | tuple[None, ...]) – Optional parameters (or tuple of parameters) with shape (n_samples, n_parameters). If None, params_subset will be None. If it is a tuple of Nones, then params_subset will be that tuple of Nones.
subset_factor (int) – The factor to subset the data by. Default is 1 (use full train and test data).
- Returns:
(data_subset, params_subset, timesteps_subset)
- Return type:
tuple
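Example (illustrative): thinning synthetic splits for a sparse benchmark run. The layout of the data tuple (here (train, test)) and the per-mode meaning of metric are assumptions based on the parameter descriptions above.

```python
import numpy as np
from codes.utils import get_data_subset

train = np.random.rand(70, 50, 3)
test = np.random.rand(10, 50, 3)
timesteps = np.linspace(0.0, 1.0, 50)

data_subset, params_subset, timesteps_subset = get_data_subset(
    (train, test),     # assumed tuple layout; pass your data splits
    timesteps,
    mode="sparse",
    metric=4,          # mode-specific: interval, cutoff, factor, or batch size
    params=None,       # no parameters -> params_subset is None
)
```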
- codes.utils.data_utils.normalize_data(train_data, test_data=None, val_data=None, mode='standardise')#
Normalize the data based on the training data statistics.
- Parameters:
train_data (np.ndarray) – Training data array.
test_data (np.ndarray, optional) – Test data array.
val_data (np.ndarray, optional) – Validation data array.
mode (str) – Normalization mode, either “minmax” or “standardise”.
- Returns:
Normalized training data, test data, and validation data.
- Return type:
tuple
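Example (illustrative): standardising synthetic splits. Statistics come from the training data only; val_n is presumably None when no validation array is passed.

```python
import numpy as np
from codes.utils import normalize_data

train = np.random.rand(70, 50, 3)
test = np.random.rand(10, 50, 3)

train_n, test_n, val_n = normalize_data(train, test_data=test, mode="standardise")
```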
- codes.utils.data_utils.print_data_info(data_path)#
codes.utils.utils module#
- codes.utils.utils.check_training_status(config)#
Check whether training has already completed by looking for a completion marker file. If training is not complete, compare the configurations and ask for confirmation if there are differences.
- Parameters:
config (dict) – The configuration dictionary.
- Returns:
- A tuple containing:
the path to the task list file (str)
whether to copy the configuration file (bool)
- Return type:
tuple
- codes.utils.utils.create_model_dir(base_dir='.', subfolder='trained', unique_id='')#
Create a directory based on a unique identifier inside a specified subfolder of the base directory.
- Parameters:
base_dir (str) – The base directory where the subfolder and unique directory will be created.
subfolder (str) – The subfolder inside the base directory to include before the unique directory.
unique_id (str) – A unique identifier to be included in the directory name.
- Returns:
The path of the created unique directory within the specified subfolder.
- Return type:
str
- codes.utils.utils.determine_batch_size(config, surr_idx, mode, metric)#
Determine the appropriate batch size based on the config, surrogate index, mode, and metric.
- Parameters:
config (dict) – The configuration dictionary.
surr_idx (int) – Index of the surrogate model in the config.
mode (str) – The benchmark mode (e.g., “main”, “batch_size”).
metric (int) – Metric used for determining the batch size in “batch_size” mode.
- Returns:
The determined batch size.
- Return type:
int
- Raises:
ValueError – If the number of batch sizes does not match the number of surrogates.
- codes.utils.utils.get_progress_bar(tasks)#
Create a progress bar with a specific description.
- Parameters:
tasks (list) – The list of tasks to be executed.
- Returns:
The created progress bar.
- Return type:
tqdm
- codes.utils.utils.load_and_save_config(config_path='config.yaml', save=True)#
Load configuration from a YAML file and save a copy to the specified directory.
- Parameters:
config_path (str) – The path to the configuration YAML file.
save (bool) – Whether to save a copy of the configuration file. Default is True.
- Returns:
The loaded configuration dictionary.
- Return type:
dict
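Example (illustrative): loading a config without writing the copy; the "surrogates" key is a hypothetical example of accessing the returned dictionary.

```python
from codes.utils import load_and_save_config

config = load_and_save_config("config.yaml", save=False)
surrogates = config.get("surrogates", [])   # hypothetical key, for illustration
```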
- codes.utils.utils.load_task_list(filepath)#
Load a list of tasks from a JSON file.
- Parameters:
filepath (str) – The path to the JSON file.
- Returns:
The loaded list of tasks.
- Return type:
list
- codes.utils.utils.make_description(mode, device, metric, surrogate_name)#
Create a formatted description for the progress bar that ensures consistent alignment.
- Parameters:
mode (str) – The benchmark mode (e.g., “accuracy”, “interpolation”, “extrapolation”, “sparse”, “UQ”).
device (str) – The device to use for training (e.g., ‘cuda:0’).
metric (str) – The specific metric for the mode (e.g., interval, cutoff, factor, batch size).
surrogate_name (str) – The name of the surrogate model.
- Returns:
A formatted description string for the progress bar.
- Return type:
str
- codes.utils.utils.nice_print(message, width=80)#
Print a message in a nicely formatted way with a fixed width.
- Parameters:
message (str) – The message to print.
width (int) – The width of the printed box. Default is 80.
- Return type:
None
- codes.utils.utils.parse_for_none(dictionary)#
Parse a dictionary and replace every string “None” with None. If a value is itself a dictionary, parse it recursively.
- Parameters:
dictionary (dict) – The dictionary to parse.
- Returns:
The parsed dictionary.
- Return type:
dict
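Example (illustrative): the replacement is recursive, so nested dictionaries are parsed as well.

```python
from codes.utils import parse_for_none

raw = {"tolerance": "None", "nested": {"path": "None", "seed": 42}}
parsed = parse_for_none(raw)
# parsed == {"tolerance": None, "nested": {"path": None, "seed": 42}}
```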
- codes.utils.utils.read_yaml_config(config_path)#
- codes.utils.utils.save_task_list(tasks, filepath)#
Save a list of tasks to a JSON file.
- Parameters:
tasks (list) – The list of tasks to save.
filepath (str) – The path to the JSON file.
- Return type:
None
- codes.utils.utils.set_random_seeds(seed, device)#
Set random seeds for reproducibility.
- Parameters:
seed (int) – The random seed to set.
device (str) – The device to use for training, e.g., ‘cuda:0’.
- Return type:
None
- codes.utils.utils.time_execution(func)#
Decorator to time the execution of a function and store the duration as an attribute of the function.
- Parameters:
func (callable) – The function to be timed.
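Example (illustrative): timing a function via the decorator. The docstring states the duration is stored as an attribute of the function; the attribute name used in the comment below is an assumption.

```python
from codes.utils import time_execution

@time_execution
def train_step():
    return sum(i * i for i in range(10_000))   # stand-in for real work

train_step()
# The elapsed time is now stored on the function object, e.g. as
# train_step.duration (attribute name assumed, not documented here).
```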
- codes.utils.utils.worker_init_fn(worker_id)#
Initialize the random seed for each worker in PyTorch DataLoader.
- Parameters:
worker_id (int) – The worker ID.
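Example (illustrative): deterministic data loading with PyTorch, combining set_random_seeds for the main process with worker_init_fn for the DataLoader workers; the toy TensorDataset is a placeholder.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from codes.utils import set_random_seeds, worker_init_fn

if __name__ == "__main__":
    set_random_seeds(42, device="cpu")        # seed the main process first

    dataset = TensorDataset(torch.rand(100, 3))
    loader = DataLoader(
        dataset,
        batch_size=16,
        num_workers=2,
        worker_init_fn=worker_init_fn,        # re-seed each worker deterministically
    )
    for (batch,) in loader:
        pass
```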
Module contents#
- codes.utils.check_and_load_data(dataset_name, verbose=True, log=True, log_params=True, normalisation_mode='standardise', tolerance=None)#
Check the specified dataset and load the data based on the mode (train or test).
- Parameters:
dataset_name (str) – The name of the dataset.
verbose (bool) – Whether to print information about the loaded data.
log (bool) – Whether to log-transform the data (log10).
log_params (bool) – Whether to log-transform the parameters.
normalisation_mode (str) – The normalization mode, either “disable”, “minmax”, or “standardise”.
tolerance (float, optional) – The tolerance value for log-transformation. Values below this will be set to the tolerance value. Pass None to disable.
- Returns:
- A tuple containing:
(train_data, test_data, val_data)
(train_params, test_params, val_params) or (None, None, None) if parameters are absent
timesteps
n_train_samples
data_info (including transformation parameters for data and for parameters)
labels
- Return type:
tuple
- Raises:
DatasetError – If the dataset or required data is missing or if the data shape is incorrect.
- codes.utils.check_training_status(config)#
Check whether training has already completed by looking for a completion marker file. If training is not complete, compare the configurations and ask for confirmation if there are differences.
- Parameters:
config (dict) – The configuration dictionary.
- Returns:
- A tuple containing:
the path to the task list file (str)
whether to copy the configuration file (bool)
- Return type:
tuple
- codes.utils.create_dataset(name, data, params=None, split=(0.7, 0.1, 0.2), timesteps=None, labels=None)#
Create a new dataset in the data directory.
- Parameters:
name (str) – The name of the dataset.
data (np.ndarray or tuple of np.ndarray) – Either a single 3D array of shape (n_samples, n_timesteps, n_quantities) or a tuple of three 3D arrays representing (train, test, val).
params (np.ndarray or tuple of np.ndarray, optional) – Either a single 2D array of shape (n_samples, n_parameters) corresponding to all samples, or a tuple of three 2D arrays representing (train, test, val) parameters. Must be provided in the same structure as data.
split (tuple of three floats, optional) – If data is provided as a single array, it is split into train, test, and validation sets according to these ratios (which must sum to 1).
timesteps (np.ndarray, optional) – A 1D array of timesteps. Its length must equal the number of timesteps in the data.
labels (list[str], optional) – Labels for the quantities. The number of labels must match the last dimension of the data.
- Raises:
FileExistsError – If the dataset directory already exists.
TypeError – If data (or params) are not of the expected type.
ValueError – If the shapes of data or params are inconsistent.
- codes.utils.create_hdf5_dataset(train_data, test_data, val_data, dataset_name, data_dir='datasets', timesteps=None, labels=None, train_params=None, test_params=None, val_params=None)#
Create an HDF5 file for a dataset with train, test, and validation data, along with optional timesteps and parameters.
- Parameters:
train_data (np.ndarray) – The training data array of shape (n_samples, n_timesteps, n_quantities).
test_data (np.ndarray) – The test data array of shape (n_samples, n_timesteps, n_quantities).
val_data (np.ndarray) – The validation data array of shape (n_samples, n_timesteps, n_quantities).
dataset_name (str) – The name of the dataset.
data_dir (str) – The directory in which to save the dataset.
timesteps (np.ndarray, optional) – A 1D array of timesteps.
labels (list[str], optional) – Labels for the quantities.
train_params (np.ndarray, optional) – Training parameters of shape (n_samples, n_parameters).
test_params (np.ndarray, optional) – Testing parameters of shape (n_samples, n_parameters).
val_params (np.ndarray, optional) – Validation parameters of shape (n_samples, n_parameters).
- codes.utils.create_model_dir(base_dir='.', subfolder='trained', unique_id='')#
Create a directory based on a unique identifier inside a specified subfolder of the base directory.
- Parameters:
base_dir (str) – The base directory where the subfolder and unique directory will be created.
subfolder (str) – The subfolder inside the base directory to include before the unique directory.
unique_id (str) – A unique identifier to be included in the directory name.
- Returns:
The path of the created unique directory within the specified subfolder.
- Return type:
str
- codes.utils.determine_batch_size(config, surr_idx, mode, metric)#
Determine the appropriate batch size based on the config, surrogate index, mode, and metric.
- Parameters:
config (dict) – The configuration dictionary.
surr_idx (int) – Index of the surrogate model in the config.
mode (str) – The benchmark mode (e.g., “main”, “batch_size”).
metric (int) – Metric used for determining the batch size in “batch_size” mode.
- Returns:
The determined batch size.
- Return type:
int
- Raises:
ValueError – If the number of batch sizes does not match the number of surrogates.
- codes.utils.download_data(dataset_name, path=None, verbose=True)#
Download the specified dataset if it is not present, with a progress bar.
- Parameters:
dataset_name (str) – The name of the dataset.
path (str, optional) – The path to save the dataset. If None, the default data directory is used.
verbose (bool) – Whether to print information about the download progress.
- codes.utils.get_data_subset(data, timesteps, mode, metric, params=None, subset_factor=1)#
Get the appropriate data subset based on the mode and metric.
- Parameters:
data (tuple[np.ndarray, ...]) – A tuple of data arrays of shape (n_samples, n_timesteps, n_quantities).
timesteps (np.ndarray) – The timesteps.
mode (str) – The benchmark mode (must be one of “interpolation”, “extrapolation”, “sparse”, “batch_size”). For “batch_size”, we thin the dataset by a factor of 4 for faster processing.
metric (int) – The specific metric for the mode (e.g., interval, cutoff, factor, batch size).
params (tuple[np.ndarray, ...] | None | tuple[None, ...]) – Optional parameters (or tuple of parameters) with shape (n_samples, n_parameters). If None, params_subset will be None. If it is a tuple of Nones, then params_subset will be that tuple of Nones.
subset_factor (int) – The factor to subset the data by. Default is 1 (use full train and test data).
- Returns:
(data_subset, params_subset, timesteps_subset)
- Return type:
tuple
- codes.utils.get_progress_bar(tasks)#
Create a progress bar with a specific description.
- Parameters:
tasks (list) – The list of tasks to be executed.
- Returns:
The created progress bar.
- Return type:
tqdm
- codes.utils.load_and_save_config(config_path='config.yaml', save=True)#
Load configuration from a YAML file and save a copy to the specified directory.
- Parameters:
config_path (str) – The path to the configuration YAML file.
save (bool) – Whether to save a copy of the configuration file. Default is True.
- Returns:
The loaded configuration dictionary.
- Return type:
dict
- codes.utils.load_task_list(filepath)#
Load a list of tasks from a JSON file.
- Parameters:
filepath (str) – The path to the JSON file.
- Returns:
The loaded list of tasks.
- Return type:
list
- codes.utils.make_description(mode, device, metric, surrogate_name)#
Create a formatted description for the progress bar that ensures consistent alignment.
- Parameters:
mode (str) – The benchmark mode (e.g., “accuracy”, “interpolation”, “extrapolation”, “sparse”, “UQ”).
device (str) – The device to use for training (e.g., ‘cuda:0’).
metric (str) – The specific metric for the mode (e.g., interval, cutoff, factor, batch size).
surrogate_name (str) – The name of the surrogate model.
- Returns:
A formatted description string for the progress bar.
- Return type:
str
- codes.utils.nice_print(message, width=80)#
Print a message in a nicely formatted way with a fixed width.
- Parameters:
message (str) – The message to print.
width (int) – The width of the printed box. Default is 80.
- Return type:
None
- codes.utils.normalize_data(train_data, test_data=None, val_data=None, mode='standardise')#
Normalize the data based on the training data statistics.
- Parameters:
train_data (np.ndarray) – Training data array.
test_data (np.ndarray, optional) – Test data array.
val_data (np.ndarray, optional) – Validation data array.
mode (str) – Normalization mode, either “minmax” or “standardise”.
- Returns:
Normalized training data, test data, and validation data.
- Return type:
tuple
- codes.utils.read_yaml_config(config_path)#
- codes.utils.save_task_list(tasks, filepath)#
Save a list of tasks to a JSON file.
- Parameters:
tasks (list) – The list of tasks to save.
filepath (str) – The path to the JSON file.
- Return type:
None
- codes.utils.set_random_seeds(seed, device)#
Set random seeds for reproducibility.
- Parameters:
seed (int) – The random seed to set.
device (str) – The device to use for training, e.g., ‘cuda:0’.
- Return type:
None
- codes.utils.time_execution(func)#
Decorator to time the execution of a function and store the duration as an attribute of the function.
- Parameters:
func (callable) – The function to be timed.
- codes.utils.worker_init_fn(worker_id)#
Initialize the random seed for each worker in PyTorch DataLoader.
- Parameters:
worker_id (int) – The worker ID.