codes.utils package
Submodules
codes.utils.data_utils module
- exception codes.utils.data_utils.DatasetError
Bases: Exception
Error for missing data or dataset or if the data shape is incorrect.
- codes.utils.data_utils.check_and_load_data(dataset_name, verbose=True, log=True, normalisation_mode='standardise', tolerance=1e-20)
Check the specified dataset and load the data based on the mode (train or test).
- Parameters:
dataset_name (str) – The name of the dataset.
verbose (bool) – Whether to print information about the loaded data.
log (bool) – Whether to log-transform the data (log10).
normalisation_mode (str) – The normalization mode, either “disable”, “minmax”, or “standardise”.
tolerance (float, optional) – The tolerance value for log-transformation. Values below this will be set to the tolerance value. Pass None to disable.
- Returns:
Loaded data and timesteps.
- Return type:
tuple
- Raises:
DatasetError – If the dataset or required data is missing or if the data shape is incorrect.
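The `log` and `tolerance` parameters describe a clip-then-log10 step. A minimal sketch of that idea on a plain list, assuming the clipping is applied before the logarithm; the helper name `clip_and_log10` is hypothetical and not part of `codes.utils`:

```python
import math


def clip_and_log10(values, tolerance=1e-20):
    """Hypothetical helper: raise small values to `tolerance`, then take log10."""
    if tolerance is not None:
        # Values below the tolerance are set to the tolerance itself,
        # which keeps log10 defined for zeros.
        values = [max(v, tolerance) for v in values]
    return [math.log10(v) for v in values]


result = clip_and_log10([1.0, 100.0, 0.0])
```

With `tolerance=None` the clipping is skipped, so a zero in the input would make `math.log10` raise a `ValueError` in this sketch.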
- codes.utils.data_utils.create_dataset(name, train_data, test_data=None, val_data=None, split=None, timesteps=None, labels=None)
Creates a new dataset in the data directory.
- Parameters:
name (str) – The name of the dataset.
train_data (np.ndarray | torch.Tensor) – The training data.
test_data (np.ndarray | torch.Tensor, optional) – The test data.
val_data (np.ndarray | torch.Tensor, optional) – The validation data.
split (tuple, optional) – If test_data and val_data are not provided, train_data can be split into training, test, and validation data according to this tuple.
timesteps (np.ndarray | torch.Tensor, optional) – The timesteps array.
labels (list[str], optional) – The labels for the chemicals.
- Raises:
FileExistsError – If the dataset already exists.
TypeError – If the train_data is not a numpy array or torch tensor.
ValueError – If the train_data, test_data, and val_data do not have the correct shape.
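The `split` argument can be pictured as a tuple of fractions. The sketch below assumes the fractions are ordered (train, test, validation) and sum to 1; both the ordering and the helper name `split_samples` are assumptions, not the documented behaviour of `create_dataset`:

```python
def split_samples(samples, split):
    """Hypothetical helper: cut `samples` into train/test/val by fractions."""
    if abs(sum(split) - 1.0) > 1e-9:
        raise ValueError("split fractions must sum to 1")
    n_train = int(len(samples) * split[0])
    n_test = int(len(samples) * split[1])
    train = samples[:n_train]
    test = samples[n_train:n_train + n_test]
    val = samples[n_train + n_test:]  # the remainder goes to validation
    return train, test, val


train, test, val = split_samples(list(range(8)), split=(0.5, 0.25, 0.25))
```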
- codes.utils.data_utils.create_hdf5_dataset(train_data, test_data, val_data, dataset_name, data_dir='datasets', timesteps=None, labels=None)
Create an HDF5 file for a dataset with train, test, and validation data, and optionally timesteps. Additionally, store metadata about the dataset.
- Parameters:
train_data (np.ndarray) – The training data array of shape (n_samples, n_timesteps, n_chemicals).
test_data (np.ndarray) – The test data array of shape (n_samples, n_timesteps, n_chemicals).
val_data (np.ndarray) – The validation data array of shape (n_samples, n_timesteps, n_chemicals).
dataset_name (str) – The name of the dataset.
data_dir (str) – The directory to save the dataset in.
timesteps (np.ndarray, optional) – The timesteps array. If None, integer timesteps will be generated.
labels (list[str], optional) – The labels for the chemicals.
- codes.utils.data_utils.download_data(dataset_name, path=None)
Download the specified dataset if it is not present.
- Parameters:
dataset_name (str) – The name of the dataset.
path (str, optional) – The path to save the dataset. If None, the default data directory is used.
- codes.utils.data_utils.get_data_subset(full_train_data, full_test_data, timesteps, mode, metric)
Get the appropriate data subset based on the mode and metric.
- Parameters:
full_train_data (np.ndarray) – The full training data.
full_test_data (np.ndarray) – The full test data.
timesteps (np.ndarray) – The timesteps.
mode (str) – The benchmark mode (e.g., “accuracy”, “interpolation”, “extrapolation”, “sparse”, “UQ”).
metric (int) – The specific metric for the mode (e.g., interval, cutoff, factor, batch size).
- Returns:
The training data, test data, and timesteps subset.
- Return type:
tuple
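One way to read the `mode` and `metric` pair is as a slicing rule. The sketch below is only an illustration under stated assumptions: "interpolation" keeps every `metric`-th timestep, "extrapolation" keeps the first `metric` timesteps, and "sparse" keeps every `metric`-th sample. The real selection logic may differ, and `subset_by_mode` is a hypothetical name:

```python
def subset_by_mode(data, timesteps, mode, metric):
    """Hypothetical sketch; `data` is a list of trajectories (lists over time)."""
    if mode == "interpolation":  # assumed: keep every `metric`-th timestep
        return [traj[::metric] for traj in data], timesteps[::metric]
    if mode == "extrapolation":  # assumed: keep the first `metric` timesteps
        return [traj[:metric] for traj in data], timesteps[:metric]
    if mode == "sparse":  # assumed: keep every `metric`-th sample
        return data[::metric], timesteps
    return data, timesteps  # other modes: data left unchanged in this sketch


data = [[0, 1, 2, 3, 4, 5], [10, 11, 12, 13, 14, 15]]
timesteps = [0, 1, 2, 3, 4, 5]
interp_data, interp_t = subset_by_mode(data, timesteps, "interpolation", 2)
extra_data, extra_t = subset_by_mode(data, timesteps, "extrapolation", 3)
```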
- codes.utils.data_utils.normalize_data(train_data, test_data=None, val_data=None, mode='standardise')
Normalize the data based on the training data statistics.
- Parameters:
train_data (np.ndarray) – Training data array.
test_data (np.ndarray, optional) – Test data array.
val_data (np.ndarray, optional) – Validation data array.
mode (str) – Normalization mode, either “minmax” or “standardise”.
- Returns:
Normalized training data, test data, and validation data.
- Return type:
tuple
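The key point of normalising "based on the training data statistics" is that test and validation data are scaled with numbers computed from the training set only. A dependency-free sketch on flat lists, assuming "minmax" maps the training range to [0, 1] and "standardise" uses the population standard deviation; `normalise_with_train_stats` is a hypothetical name and the real function operates on arrays:

```python
import statistics


def normalise_with_train_stats(train, test=None, mode="standardise"):
    """Hypothetical sketch: scale `test` using statistics of `train` only."""
    if mode == "minmax":
        offset, span = min(train), max(train) - min(train)
    elif mode == "standardise":
        offset, span = statistics.fmean(train), statistics.pstdev(train)
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Note: constant training data gives span == 0 and would divide by zero.
    scaled_train = [(x - offset) / span for x in train]
    scaled_test = None if test is None else [(x - offset) / span for x in test]
    return scaled_train, scaled_test


train_n, test_n = normalise_with_train_stats([0.0, 5.0, 10.0], [20.0], mode="minmax")
std_n, _ = normalise_with_train_stats([1.0, 3.0])
```

Because the statistics come from the training data alone, the scaled test value here lands outside [0, 1].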
codes.utils.utils module
- codes.utils.utils.check_training_status(config)
Check if the training is already completed by looking for a completion marker file. If the training is not complete, compare the configurations and ask for a confirmation if there are differences.
- Parameters:
config (dict) – The configuration dictionary.
- Returns:
The path to the task list file, and a bool indicating whether to copy the configuration file.
- Return type:
tuple[str, bool]
- codes.utils.utils.create_model_dir(base_dir='.', subfolder='trained', unique_id='')
Create a directory based on a unique identifier inside a specified subfolder of the base directory.
- Parameters:
base_dir (str) – The base directory where the subfolder and unique directory will be created.
subfolder (str) – The subfolder inside the base directory to include before the unique directory.
unique_id (str) – A unique identifier to be included in the directory name.
- Returns:
The path of the created unique directory within the specified subfolder.
- Return type:
str
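The directory layout described above can be sketched with the standard library. This assumes the unique directory is simply `base_dir/subfolder/unique_id`; the actual naming scheme is not stated here, and `make_model_dir` is a hypothetical stand-in:

```python
import os
import tempfile


def make_model_dir(base_dir=".", subfolder="trained", unique_id=""):
    """Hypothetical sketch: create base_dir/subfolder/unique_id and return it."""
    path = os.path.join(base_dir, subfolder, unique_id)
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return path


with tempfile.TemporaryDirectory() as tmp:
    path = make_model_dir(tmp, "trained", "run_001")
    created = os.path.isdir(path)
```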
- codes.utils.utils.get_progress_bar(tasks)
Create a progress bar with a specific description.
- Parameters:
tasks (list) – The list of tasks to be executed.
- Returns:
The created progress bar.
- Return type:
tqdm
- codes.utils.utils.load_and_save_config(config_path='config.yaml', save=True)
Load configuration from a YAML file and save a copy to the specified directory.
- Parameters:
config_path (str) – The path to the configuration YAML file.
save (bool) – Whether to save a copy of the configuration file. Default is True.
- Returns:
The loaded configuration dictionary.
- Return type:
dict
- codes.utils.utils.load_task_list(filepath)
Load a list of tasks from a JSON file.
- Parameters:
filepath (str) – The path to the JSON file.
- Returns:
The loaded list of tasks.
- Return type:
list
- codes.utils.utils.make_description(mode, device, metric, surrogate_name)
Create a formatted description for the progress bar that ensures consistent alignment.
- Parameters:
mode (str) – The benchmark mode (e.g., “accuracy”, “interpolation”, “extrapolation”, “sparse”, “UQ”).
device (str) – The device to use for training (e.g., ‘cuda:0’).
metric (str) – The specific metric for the mode (e.g., interval, cutoff, factor, batch size).
surrogate_name (str) – The name of the surrogate model.
- Returns:
A formatted description string for the progress bar.
- Return type:
str
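Consistent alignment is typically achieved by padding each field to a fixed width. The sketch below shows that idea only; the field order, the width of 14, and the name `padded_description` are assumptions, and the surrogate names are placeholders:

```python
def padded_description(mode, device, metric, surrogate_name, width=14):
    """Hypothetical sketch: pad every field so descriptions line up."""
    fields = (surrogate_name, mode, str(metric), device)
    return " ".join(field.ljust(width) for field in fields)


first = padded_description("accuracy", "cuda:0", "", "ModelA")
second = padded_description("interpolation", "cpu", "4", "ModelB")
```

Both strings have the same length as long as no field exceeds `width` characters, so stacked progress bars start at the same column.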
- codes.utils.utils.nice_print(message, width=80)
Print a message in a nicely formatted way with a fixed width.
- Parameters:
message (str) – The message to print.
width (int) – The width of the printed box. Default is 80.
- Return type:
None
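A fixed-width box can be built by centring the message between two border lines. The exact characters that `nice_print` uses are not documented here, so the layout below and the name `boxed` are assumptions:

```python
def boxed(message, width=80):
    """Hypothetical sketch: return `message` centred in a box of `width` columns."""
    border = "-" * width
    # Two columns are reserved for the side bars; longer messages overflow.
    body = "|" + message.center(width - 2) + "|"
    return "\n".join((border, body, border))


box = boxed("Training complete", width=40)
print(box)
```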
- codes.utils.utils.read_yaml_config(config_path)
- codes.utils.utils.save_task_list(tasks, filepath)
Save a list of tasks to a JSON file.
- Parameters:
tasks (list) – The list of tasks to save.
filepath (str) – The path to the JSON file.
- Return type:
None
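Saving and loading a task list amounts to a JSON round trip. A minimal sketch with the standard `json` module; the task structure shown and the names `save_tasks` and `load_tasks` are illustrative assumptions:

```python
import json
import os
import tempfile


def save_tasks(tasks, filepath):
    """Hypothetical sketch: write the task list to a JSON file."""
    with open(filepath, "w") as f:
        json.dump(tasks, f)


def load_tasks(filepath):
    """Hypothetical sketch: read the task list back from a JSON file."""
    with open(filepath) as f:
        return json.load(f)


tasks = [["ModelA", "interpolation", 2], ["ModelA", "sparse", 4]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "task_list.json")
    save_tasks(tasks, path)
    restored = load_tasks(path)
```

JSON has no tuple type, so tasks stored as tuples would come back as lists after such a round trip.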
- codes.utils.utils.set_random_seeds(seed)
Set random seeds for reproducibility.
- Parameters:
seed (int) – The random seed to set.
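The effect of seeding can be demonstrated with Python's built-in generator alone. This sketch seeds only the `random` module; which other libraries `set_random_seeds` covers is not stated here, and `seed_python_rng` is a hypothetical name:

```python
import random


def seed_python_rng(seed):
    """Hypothetical sketch: seed only Python's built-in random generator."""
    random.seed(seed)


seed_python_rng(42)
first_draw = [random.random() for _ in range(3)]
seed_python_rng(42)
second_draw = [random.random() for _ in range(3)]
```

Re-seeding with the same value reproduces the same sequence, which is the property a reproducible benchmark run relies on.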
- codes.utils.utils.time_execution(func)
Decorator to time the execution of a function and store the duration as an attribute of the function.
- Parameters:
func (callable) – The function to be timed.
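A timing decorator of this kind can be sketched as follows. The attribute name `duration` and the decorator name `timed` are assumptions; the source only says that the duration is stored as an attribute of the function:

```python
import functools
import time


def timed(func):
    """Hypothetical sketch: record the last call's wall-clock time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        # Store the elapsed seconds on the wrapper so callers can read it later.
        wrapper.duration = time.perf_counter() - start
        return result

    wrapper.duration = None  # no call has been timed yet
    return wrapper


@timed
def slow_square(x):
    time.sleep(0.01)
    return x * x


value = slow_square(3)
elapsed = slow_square.duration
```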
- codes.utils.utils.worker_init_fn(worker_id)
Initialize the random seed for each worker in PyTorch DataLoader.
- Parameters:
worker_id (int) – The worker ID.
Module contents
- codes.utils.check_and_load_data(dataset_name, verbose=True, log=True, normalisation_mode='standardise', tolerance=1e-20)
Check the specified dataset and load the data based on the mode (train or test).
- Parameters:
dataset_name (str) – The name of the dataset.
verbose (bool) – Whether to print information about the loaded data.
log (bool) – Whether to log-transform the data (log10).
normalisation_mode (str) – The normalization mode, either “disable”, “minmax”, or “standardise”.
tolerance (float, optional) – The tolerance value for log-transformation. Values below this will be set to the tolerance value. Pass None to disable.
- Returns:
Loaded data and timesteps.
- Return type:
tuple
- Raises:
DatasetError – If the dataset or required data is missing or if the data shape is incorrect.
- codes.utils.check_training_status(config)
Check if the training is already completed by looking for a completion marker file. If the training is not complete, compare the configurations and ask for a confirmation if there are differences.
- Parameters:
config (dict) – The configuration dictionary.
- Returns:
The path to the task list file, and a bool indicating whether to copy the configuration file.
- Return type:
tuple[str, bool]
- codes.utils.create_dataset(name, train_data, test_data=None, val_data=None, split=None, timesteps=None, labels=None)
Creates a new dataset in the data directory.
- Parameters:
name (str) – The name of the dataset.
train_data (np.ndarray | torch.Tensor) – The training data.
test_data (np.ndarray | torch.Tensor, optional) – The test data.
val_data (np.ndarray | torch.Tensor, optional) – The validation data.
split (tuple, optional) – If test_data and val_data are not provided, train_data can be split into training, test, and validation data according to this tuple.
timesteps (np.ndarray | torch.Tensor, optional) – The timesteps array.
labels (list[str], optional) – The labels for the chemicals.
- Raises:
FileExistsError – If the dataset already exists.
TypeError – If the train_data is not a numpy array or torch tensor.
ValueError – If the train_data, test_data, and val_data do not have the correct shape.
- codes.utils.create_hdf5_dataset(train_data, test_data, val_data, dataset_name, data_dir='datasets', timesteps=None, labels=None)
Create an HDF5 file for a dataset with train, test, and validation data, and optionally timesteps. Additionally, store metadata about the dataset.
- Parameters:
train_data (np.ndarray) – The training data array of shape (n_samples, n_timesteps, n_chemicals).
test_data (np.ndarray) – The test data array of shape (n_samples, n_timesteps, n_chemicals).
val_data (np.ndarray) – The validation data array of shape (n_samples, n_timesteps, n_chemicals).
dataset_name (str) – The name of the dataset.
data_dir (str) – The directory to save the dataset in.
timesteps (np.ndarray, optional) – The timesteps array. If None, integer timesteps will be generated.
labels (list[str], optional) – The labels for the chemicals.
- codes.utils.create_model_dir(base_dir='.', subfolder='trained', unique_id='')
Create a directory based on a unique identifier inside a specified subfolder of the base directory.
- Parameters:
base_dir (str) – The base directory where the subfolder and unique directory will be created.
subfolder (str) – The subfolder inside the base directory to include before the unique directory.
unique_id (str) – A unique identifier to be included in the directory name.
- Returns:
The path of the created unique directory within the specified subfolder.
- Return type:
str
- codes.utils.download_data(dataset_name, path=None)
Download the specified dataset if it is not present.
- Parameters:
dataset_name (str) – The name of the dataset.
path (str, optional) – The path to save the dataset. If None, the default data directory is used.
- codes.utils.get_data_subset(full_train_data, full_test_data, timesteps, mode, metric)
Get the appropriate data subset based on the mode and metric.
- Parameters:
full_train_data (np.ndarray) – The full training data.
full_test_data (np.ndarray) – The full test data.
timesteps (np.ndarray) – The timesteps.
mode (str) – The benchmark mode (e.g., “accuracy”, “interpolation”, “extrapolation”, “sparse”, “UQ”).
metric (int) – The specific metric for the mode (e.g., interval, cutoff, factor, batch size).
- Returns:
The training data, test data, and timesteps subset.
- Return type:
tuple
- codes.utils.get_progress_bar(tasks)
Create a progress bar with a specific description.
- Parameters:
tasks (list) – The list of tasks to be executed.
- Returns:
The created progress bar.
- Return type:
tqdm
- codes.utils.load_and_save_config(config_path='config.yaml', save=True)
Load configuration from a YAML file and save a copy to the specified directory.
- Parameters:
config_path (str) – The path to the configuration YAML file.
save (bool) – Whether to save a copy of the configuration file. Default is True.
- Returns:
The loaded configuration dictionary.
- Return type:
dict
- codes.utils.load_task_list(filepath)
Load a list of tasks from a JSON file.
- Parameters:
filepath (str) – The path to the JSON file.
- Returns:
The loaded list of tasks.
- Return type:
list
- codes.utils.make_description(mode, device, metric, surrogate_name)
Create a formatted description for the progress bar that ensures consistent alignment.
- Parameters:
mode (str) – The benchmark mode (e.g., “accuracy”, “interpolation”, “extrapolation”, “sparse”, “UQ”).
device (str) – The device to use for training (e.g., ‘cuda:0’).
metric (str) – The specific metric for the mode (e.g., interval, cutoff, factor, batch size).
surrogate_name (str) – The name of the surrogate model.
- Returns:
A formatted description string for the progress bar.
- Return type:
str
- codes.utils.nice_print(message, width=80)
Print a message in a nicely formatted way with a fixed width.
- Parameters:
message (str) – The message to print.
width (int) – The width of the printed box. Default is 80.
- Return type:
None
- codes.utils.normalize_data(train_data, test_data=None, val_data=None, mode='standardise')
Normalize the data based on the training data statistics.
- Parameters:
train_data (np.ndarray) – Training data array.
test_data (np.ndarray, optional) – Test data array.
val_data (np.ndarray, optional) – Validation data array.
mode (str) – Normalization mode, either “minmax” or “standardise”.
- Returns:
Normalized training data, test data, and validation data.
- Return type:
tuple
- codes.utils.read_yaml_config(config_path)
- codes.utils.save_task_list(tasks, filepath)
Save a list of tasks to a JSON file.
- Parameters:
tasks (list) – The list of tasks to save.
filepath (str) – The path to the JSON file.
- Return type:
None
- codes.utils.set_random_seeds(seed)
Set random seeds for reproducibility.
- Parameters:
seed (int) – The random seed to set.
- codes.utils.time_execution(func)
Decorator to time the execution of a function and store the duration as an attribute of the function.
- Parameters:
func (callable) – The function to be timed.
- codes.utils.worker_init_fn(worker_id)
Initialize the random seed for each worker in PyTorch DataLoader.
- Parameters:
worker_id (int) – The worker ID.