codes.benchmark package#

Submodules#

codes.benchmark.bench_fcts module#

codes.benchmark.bench_fcts.compare_UQ(all_metrics, config)#
Compare log-space UQ across surrogates, focusing on:
  • Ensemble vs Main errors (Δdex) over time

  • Correlation between log-space uncertainty and errors

  • Catastrophic-error detection from uncertainty thresholds

Parameters:
  • all_metrics (dict) – Benchmark metrics per surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.bench_fcts.compare_batchsize(all_metrics, config)#

Compare the batch size training errors of different surrogate models.

Parameters:
  • all_metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.bench_fcts.compare_errors(metrics, config)#

Compare relative errors and Δdex errors over time for different surrogate models.

Parameters:
  • metrics (dict) – Benchmark metrics for each surrogate.

  • config (dict) – Configuration dictionary.

Return type:

None

codes.benchmark.bench_fcts.compare_extrapolation(all_metrics, config)#

Compare the extrapolation errors of different surrogate models.

Parameters:
  • all_metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.bench_fcts.compare_gradients(metrics, config)#

Compare the gradients of different surrogate models.

Parameters:
  • metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.bench_fcts.compare_inference_time(metrics, config, save=True)#

Compare the mean inference time of different surrogate models.

Parameters:
  • metrics (dict[str, dict]) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool, optional) – Whether to save the plot. Defaults to True.

Return type:

None

Returns:

None

codes.benchmark.bench_fcts.compare_interpolation(all_metrics, config)#

Compare the interpolation errors of different surrogate models.

Parameters:
  • all_metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.bench_fcts.compare_main_losses(metrics, config)#

Compare the training and test losses of the main models for different surrogate models.

Parameters:
  • metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.bench_fcts.compare_models(metrics, config)#
codes.benchmark.bench_fcts.compare_sparse(all_metrics, config)#

Compare the sparse training errors of different surrogate models.

Parameters:
  • all_metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.bench_fcts.evaluate_UQ(model, surr_name, test_loader, timesteps, conf, labels=None)#

Evaluate the uncertainty quantification (UQ) performance of the surrogate model.

Predictions and targets are kept in log10 space (leave_log=True). All UQ metrics are computed in log space (Δdex).

Parameters:
  • model – The surrogate model instance.

  • surr_name (str) – The name of the surrogate model.

  • test_loader (DataLoader) – DataLoader with the test data.

  • timesteps (np.ndarray) – Array of timesteps.

  • conf (dict) – Configuration dictionary.

  • labels (list, optional) – Labels for the predicted quantities.

Returns:

Dictionary containing log-space UQ metrics and arrays.

Return type:

dict
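For orientation, a minimal numpy sketch of the kind of log-space quantities this evaluation works with (ensemble mean, 1σ spread, and Δdex errors); the shapes and data below are illustrative and not the package's internal implementation:

   import numpy as np

   # Illustrative sketch of log-space (delta dex) UQ quantities, not the package's code.
   rng = np.random.default_rng(0)
   preds_log = rng.normal(size=(5, 8, 20, 4))     # ensemble predictions [n_models, N, T, Q]
   targets_log = rng.normal(size=(8, 20, 4))      # targets in log10 space [N, T, Q]

   ens_mean = preds_log.mean(axis=0)              # ensemble mean prediction
   ens_std = preds_log.std(axis=0)                # log-space uncertainty (1 sigma)
   delta_dex = np.abs(ens_mean - targets_log)     # absolute log10 error per point

   # correlation between uncertainty and error, later compared across surrogates by compare_UQ
   corr = np.corrcoef(ens_std.ravel(), delta_dex.ravel())[0, 1]
   print(f"mean delta dex: {delta_dex.mean():.3f}, corr(std, error): {corr:.3f}")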

codes.benchmark.bench_fcts.evaluate_accuracy(model, surr_name, timesteps, test_loader, conf, labels=None)#

Evaluate the accuracy of the surrogate model.

Parameters:
  • model – Instance of the surrogate model class.

  • surr_name (str) – The name of the surrogate model.

  • timesteps (np.ndarray) – The timesteps array.

  • test_loader (DataLoader) – The DataLoader object containing the test data.

  • conf (dict) – The configuration dictionary.

  • labels (list, optional) – The labels for the quantities.

  • percentile (int, optional) – The percentile for error metrics.

Returns:

A dictionary containing accuracy metrics.

Return type:

dict

codes.benchmark.bench_fcts.evaluate_batchsize(model, surr_name, test_loader, timesteps, conf)#

Evaluate the batch-size scaling performance of the surrogate model in log-space (Δdex).

Predictions and targets are kept in log10 space (leave_log=True). Errors are computed as absolute log differences (Δdex).

Parameters:
  • model – Surrogate model instance.

  • surr_name (str) – Name of the surrogate.

  • test_loader (DataLoader) – DataLoader with test data.

  • timesteps (np.ndarray) – Timesteps array.

  • conf (dict) – Configuration dictionary.

Returns:

Batch-size scaling metrics in log-space.

Return type:

dict

codes.benchmark.bench_fcts.evaluate_compute(model, surr_name, test_loader, conf)#

Evaluate the computational resource requirements of the surrogate model.

Parameters:
  • model – Instance of the surrogate model class.

  • surr_name (str) – The name of the surrogate model.

  • test_loader (DataLoader) – The DataLoader object containing the test data.

  • conf (dict) – The configuration dictionary.

Returns:

A dictionary containing model complexity metrics.

Return type:

dict

codes.benchmark.bench_fcts.evaluate_extrapolation(model, surr_name, test_loader, timesteps, conf, labels=None)#

Evaluate the extrapolation performance of the surrogate model in log-space (Δdex).

Predictions and targets are kept in log10 space (leave_log=True). Errors are computed as absolute log differences (Δdex).

Parameters:
  • model – Surrogate model instance.

  • surr_name (str) – Name of the surrogate.

  • test_loader (DataLoader) – DataLoader with test data.

  • timesteps (np.ndarray) – Timesteps array.

  • conf (dict) – Configuration dictionary.

  • labels (list, optional) – Labels for the predicted quantities.

Returns:

Extrapolation metrics in log-space.

Return type:

dict

codes.benchmark.bench_fcts.evaluate_gradients(model, surr_name, test_loader, conf, species_names=None)#

Evaluate the gradients of the surrogate model in log-space (Δdex).

Predictions and targets are kept in log10 space (leave_log=True). Errors are computed as absolute log differences (Δdex).

Parameters:
  • model – Surrogate model instance.

  • surr_name (str) – Surrogate name.

  • test_loader (DataLoader) – Test data.

  • conf (dict) – Configuration dictionary.

  • species_names (list, optional) – Names of the species/quantities.

Returns:

Gradient–error correlation metrics in log-space.

Return type:

dict

codes.benchmark.bench_fcts.evaluate_interpolation(model, surr_name, test_loader, timesteps, conf, labels=None)#

Evaluate the interpolation performance of the surrogate model in log-space (Δdex).

Predictions and targets are kept in log10 space (leave_log=True). Errors are computed as absolute log differences (Δdex).

Parameters:
  • model – Surrogate model instance.

  • surr_name (str) – Name of the surrogate.

  • test_loader (DataLoader) – DataLoader with test data.

  • timesteps (np.ndarray) – Timesteps array.

  • conf (dict) – Configuration dictionary.

  • labels (list, optional) – Labels for the predicted quantities.

Returns:

Interpolation metrics in log-space.

Return type:

dict

codes.benchmark.bench_fcts.evaluate_iterative_predictions(model, surr_name, timesteps, val_loader, val_params, conf, labels=None)#

Evaluate the iterative predictions of the surrogate model.

Returns the same set of error metrics as evaluate_accuracy, but over the full trajectory built by re-feeding the last prediction as the next initial state.

Return type:

dict[str, Any]
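A toy sketch of the chaining described above, where the last prediction of each window seeds the next one (the step function is an illustrative stand-in for the surrogate):

   import numpy as np

   # Toy sketch of chained prediction: the last prediction becomes the next initial state.
   def step(y0, n_steps):
       # stand-in for one surrogate prediction window starting from state y0
       return y0 + np.cumsum(np.full((n_steps, y0.size), 0.01), axis=0)

   y0 = np.zeros(4)                     # initial state
   trajectory = [y0]
   for _ in range(5):                   # chain five windows of ten steps each
       window = step(trajectory[-1], 10)
       trajectory.extend(window)        # trajectory[-1] now seeds the next window
   print(np.asarray(trajectory).shape)  # (51, 4)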

codes.benchmark.bench_fcts.evaluate_sparse(model, surr_name, test_loader, timesteps, n_train_samples, conf)#

Evaluate the sparse-data training performance of the surrogate model in log-space (Δdex).

Predictions and targets are kept in log10 space (leave_log=True). Errors are computed as absolute log differences (Δdex).

Parameters:
  • model – Surrogate model instance.

  • surr_name (str) – Name of the surrogate.

  • test_loader (DataLoader) – DataLoader with test data.

  • timesteps (np.ndarray) – Timesteps array.

  • n_train_samples (int) – Number of training samples in the full dataset.

  • conf (dict) – Configuration dictionary.

Returns:

Sparse training metrics in log-space.

Return type:

dict

codes.benchmark.bench_fcts.run_benchmark(surr_name, surrogate_class, conf)#

Run benchmarks for a given surrogate model.

Parameters:
  • surr_name (str) – The name of the surrogate model to benchmark.

  • surrogate_class – The class of the surrogate model.

  • conf (dict) – The configuration dictionary.

Returns:

A dictionary containing all relevant metrics for the given model.

Return type:

dict
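A hedged usage sketch of the benchmark entry point; the file name “config.yaml” and the “surrogates” config key are assumptions, while check_benchmark, get_surrogate, and run_benchmark are called with the signatures documented on this page:

   import yaml

   from codes.benchmark.bench_fcts import run_benchmark
   from codes.benchmark.bench_utils import check_benchmark, get_surrogate

   # "config.yaml" and the "surrogates" key are assumptions about the benchmark
   # configuration; the function signatures follow the documentation on this page.
   with open("config.yaml") as f:
       conf = yaml.safe_load(f)

   check_benchmark(conf)                         # raises if the config is inconsistent
   all_metrics = {}
   for name in conf["surrogates"]:
       surrogate_class = get_surrogate(name)     # class if it exists, otherwise None
       if surrogate_class is not None:
           all_metrics[name] = run_benchmark(name, surrogate_class, conf)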

codes.benchmark.bench_fcts.tabular_comparison(all_metrics, config)#

Compare the metrics of different surrogate models in a tabular format. Prints a table to the CLI, saves the table into a text file, and saves a CSV file with all metrics. Also saves a CSV file with only the metrics that appear in the CLI table.

Return type:

None

codes.benchmark.bench_fcts.time_inference(model, surr_name, test_loader, conf, n_test_samples, n_runs=5)#

Time the inference of the surrogate model (full version with metrics).

Parameters:
  • model – Instance of the surrogate model class.

  • surr_name (str) – The name of the surrogate model.

  • test_loader (DataLoader) – The DataLoader object containing the test data.

  • conf (dict) – The configuration dictionary.

  • n_test_samples (int) – The number of test samples.

  • n_runs (int, optional) – Number of times to run the inference for timing.

Returns:

A dictionary containing timing metrics.

Return type:

dict

codes.benchmark.bench_plots module#

codes.benchmark.bench_plots.get_custom_palette(num_colors)#

Returns a list of colors sampled from a custom color palette.

Parameters:

num_colors (int) – The number of colors needed.

Returns:

A list of RGBA color tuples.

Return type:

list

codes.benchmark.bench_plots.inference_time_bar_plot(surrogates, means, stds, config, save=True, show_title=True)#

Plot the mean inference time with standard deviation for different surrogate models.

Parameters:
  • surrogates (List[str]) – List of surrogate model names.

  • means (List[float]) – List of mean inference times for each surrogate model.

  • stds (List[float]) – List of standard deviation of inference times for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool, optional) – Whether to save the plot. Defaults to True.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.bench_plots.plot_MAE_comparison(MAEs, labels, config, save=True, show_title=True)#

Plot the MAE for different surrogate models.

Parameters:
  • MAEs (tuple) – Tuple of MAE arrays for each surrogate model.

  • labels (tuple) – Tuple of labels for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.bench_plots.plot_all_generalization_errors(all_metrics, config, show_title=True)#

Make one comparative plot of the interpolation, extrapolation, sparse, and batch size errors. Only the modalities present in all_metrics are plotted.

Parameters:
  • all_metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.bench_plots.plot_average_errors_over_time(surr_name, conf, errors, metrics, timesteps, mode, save=False, show_title=True)#

Plot Δdex errors over time for different evaluation modes.

Parameters:
  • surr_name (str) – Surrogate name.

  • conf (dict) – Config dictionary.

  • errors (np.ndarray) – Errors [N_metrics, n_timesteps].

  • metrics (np.ndarray) – Metrics [N_metrics].

  • timesteps (np.ndarray) – Timesteps.

  • mode (str) – One of ‘interpolation’, ‘extrapolation’, ‘sparse’, ‘batchsize’.

Return type:

None

codes.benchmark.bench_plots.plot_average_uncertainty_over_time(surr_name, conf, errors_time, preds_std, timesteps, save=False, show_title=True)#

Plot average predictive uncertainty and errors over time in log-space (dex).

Parameters:
  • surr_name (str) – Name of the surrogate model.

  • conf (dict) – Configuration dictionary.

  • errors_time (np.ndarray) – Log-space prediction errors over time.

  • preds_std (np.ndarray) – Log-space ensemble standard deviation over time.

  • timesteps (np.ndarray) – Array of timesteps.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show a title.

Return type:

None

codes.benchmark.bench_plots.plot_catastrophic_detection_curves(errors_log, std_log, conf, percentiles=(99.0, 90.0), flag_fractions=(0.0, 0.01, 0.05, 0.1, 0.2, 0.3), save=True, show_title=True)#

Plot catastrophic error recall (Δdex) vs fraction flagged by uncertainty, across multiple catastrophic percentiles, plus performance improvement curves.

Catastrophic errors are those with Δdex >= P_cat (e.g. 99th percentile). For each flag fraction f, the uncertainty threshold is the (1 - f) quantile of std. We then flag samples with std >= threshold and measure recall among catastrophic samples.

Additionally, computes the effective log-space MAE if flagged samples are replaced by solver outputs (0 error). This measures how much UQ-guided deferral improves performance.

Parameters:
  • errors_log (dict) – Δdex arrays [N, T, Q] per surrogate.

  • std_log (dict) – Log-space std arrays [N, T, Q] per surrogate.

  • conf (dict) – Configuration dictionary.

  • percentiles (tuple) – Catastrophic percentiles to evaluate (default: (99.0, 90.0)).

  • flag_fractions (tuple) – Fractions of predictions to flag (includes 0 for baseline).

  • save (bool) – Save figure.

  • show_title (bool) – Add title.

Returns:

Nested dict of results:

summary[surrogate][percentile] = {‘flag_fraction’: f, ‘recall’: r, ‘cat_threshold’: thr, ‘mae_curve’: [(f, mae), …]}

Return type:

dict
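The thresholding logic described above can be sketched with plain numpy on synthetic data (illustrative only, not the package code):

   import numpy as np

   # Sketch of the catastrophic-detection thresholding described above.
   rng = np.random.default_rng(1)
   delta_dex = rng.exponential(0.1, size=10_000)             # log-space errors
   std_log = delta_dex * 0.8 + rng.normal(0, 0.02, 10_000)   # correlated uncertainty

   p_cat = 99.0
   cat_threshold = np.percentile(delta_dex, p_cat)
   is_catastrophic = delta_dex >= cat_threshold

   for f in (0.01, 0.05, 0.1):
       flag_threshold = np.quantile(std_log, 1.0 - f)        # flag the f most uncertain samples
       flagged = std_log >= flag_threshold
       recall = (flagged & is_catastrophic).sum() / is_catastrophic.sum()
       print(f"flag fraction {f:.2f}: recall of catastrophic errors = {recall:.2f}")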

codes.benchmark.bench_plots.plot_comparative_error_correlation_heatmaps(preds_std, errors, avg_correlations, axis_max, max_count, config, save=True, show_title=True)#

Comparative heatmaps of log-space uncertainty vs Δdex.

Parameters:
  • preds_std (dict) – Log-space std arrays per surrogate.

  • errors (dict) – Δdex arrays per surrogate.

  • avg_correlations (dict) – Pearson r per surrogate (log-space).

  • axis_max (dict) – Axis maxima from per-surrogate plots.

  • max_count (dict) – Peak counts for normalization per surrogate.

  • config (dict) – Configuration dictionary.

  • save (bool) – Save figure.

  • show_title (bool) – Add title.

Return type:

None

codes.benchmark.bench_plots.plot_comparative_gradient_heatmaps(gradients, errors, avg_correlations, max_grad, max_err, max_count, config, save=True, show_title=True)#

Plot comparative heatmaps of correlation between gradient and prediction errors for multiple surrogate models.

Parameters:
  • gradients (dict[str, np.ndarray]) – Dictionary of gradients from the ensemble of models.

  • errors (dict[str, np.ndarray]) – Dictionary of prediction errors.

  • avg_correlations (dict[str, float]) – Dictionary of average correlations between gradients and prediction errors.

  • max_grad (dict[str, float]) – Dictionary of maximum gradient values for axis scaling across models.

  • max_err (dict[str, float]) – Dictionary of maximum error values for axis scaling across models.

  • max_count (dict[str, float]) – Dictionary of maximum count values for heatmap normalization across models.

  • config (dict) – Configuration dictionary.

  • save (bool, optional) – Whether to save the plot. Defaults to True.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.bench_plots.plot_dynamic_correlation(surr_name, conf, gradients, errors, save=False, show_title=True)#

Plot the correlation between the gradients of the data and the prediction errors.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • gradients (np.ndarray) – The gradients of the data.

  • errors (np.ndarray) – The prediction errors.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

codes.benchmark.bench_plots.plot_error_distribution_comparative(errors, conf, save=True, show_title=True, mode='relative')#

Plot comparative error distributions for each surrogate model.

Parameters:
  • errors (dict) – Model → array of errors [num_samples, num_timesteps, num_quantities].

  • conf (dict) – Configuration dictionary.

  • save (bool) – Whether to save the figure.

  • show_title (bool) – Whether to add a title.

  • mode (str) – “relative” (unitless %) or “deltadex” (log-space abs. errors).

Return type:

None

codes.benchmark.bench_plots.plot_error_distribution_per_quantity(surr_name, conf, errors, quantity_names=None, num_quantities=10, mode='relative', save=True, show_title=True)#

Plot the distribution of errors for each quantity as a smoothed histogram plot.

  • mode=”relative”:

    Errors are relative (0..∞). Histogram is plotted in log-space (x-axis log-scaled).

  • mode=”deltadex”:

    Errors are absolute log-space errors (Δdex ≥ 0). Histogram is plotted on linear scale.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • errors (np.ndarray) – Errors array of shape [num_samples, num_timesteps, num_quantities].

  • quantity_names (list, optional) – List of quantity names for labeling the lines.

  • num_quantities (int, optional) – Number of quantities to plot. Default is 10.

  • mode (str, optional) – “relative” or “deltadex”. Default is “relative”.

  • save (bool, optional) – Whether to save the plot as a file.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.bench_plots.plot_error_percentiles_over_time(surr_name, conf, errors, timesteps, title, mode='relative', save=False, show_title=True)#

Plot mean, median, and percentile error envelopes over time.

  • mode=”relative”:

    Treats errors as relative errors (0..∞). Plots bidirectional percentile bands (25-75, 5-95, 0.5-99.5). Y-axis is log-scaled.

  • mode=”deltadex”:

    Treats errors as log-space absolute errors (Δdex ≥ 0). Plots one-sided percentile bands (0-50, 0-90, 0-99). Y-axis is linear, starting at 0.

Parameters:
  • surr_name (str) – Name of the surrogate model (used for saving).

  • conf (dict) – Configuration dictionary containing dataset and output settings.

  • errors (np.ndarray) – Error array of shape [N_samples, N_timesteps, N_quantities]. Values are either relative errors or Δdex depending on mode.

  • timesteps (np.ndarray) – Array of timesteps corresponding to the second axis of errors.

  • title (str) – Title for the plot.

  • mode (str, optional) – “relative” for relative errors, “deltadex” for log-space absolute errors. Defaults to “relative”.

  • save (bool, optional) – Whether to save the plot to disk. Defaults to False.

  • show_title (bool, optional) – Whether to show the plot title. Defaults to True.

Return type:

None

Returns:

None
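A small numpy sketch of the one-sided Δdex percentile envelopes described above, following the [N_samples, N_timesteps, N_quantities] convention (synthetic data, illustrative only):

   import numpy as np

   # Illustrative delta-dex percentile envelopes over time.
   rng = np.random.default_rng(2)
   errors = np.abs(rng.normal(0.0, 0.1, size=(500, 100, 10)))         # delta dex >= 0
   flat = errors.transpose(1, 0, 2).reshape(errors.shape[1], -1)      # [T, N * Q]

   mean_curve = flat.mean(axis=1)
   median_curve = np.median(flat, axis=1)
   bands = {p: np.percentile(flat, p, axis=1) for p in (50, 90, 99)}  # one-sided upper envelopes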

codes.benchmark.bench_plots.plot_errors_over_time(mean_errors, median_errors, timesteps, config, save=True, show_title=True, mode='relative')#

Plot errors over time for different surrogate models (relative or Δdex).

Parameters:
  • mean_errors (dict) – Mean errors for each surrogate.

  • median_errors (dict) – Median errors for each surrogate.

  • timesteps (np.ndarray) – Array of timesteps.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the figure.

  • show_title (bool) – Whether to add a title.

  • mode (str) – “relative” (percentage errors) or “deltadex” (log-space abs. errors).

Return type:

None

codes.benchmark.bench_plots.plot_example_iterative_predictions(surr_name, conf, iterative_preds, full_preds, targets, timesteps, iter_interval, example_idx=None, num_quantities=100, labels=None, save=False, show_title=True)#

Plot one sample’s full iterative trajectory: ground truth vs. chained predictions, with retrigger lines.

Return type:

None

codes.benchmark.bench_plots.plot_example_mode_predictions(surr_name, conf, preds_log, preds_main_log, targets_log, timesteps, metric, mode='interpolation', example_idx=0, num_quantities=100, labels=None, save=False, show_title=True)#

Plot example predictions in log-space (Δdex) alongside ground truth targets for either interpolation or extrapolation mode.

Predictions and targets are assumed to be in log10 space (leave_log=True). Axis labels and plotted values are consistent with this log representation.

Parameters:
  • surr_name (str) – Name of the surrogate model.

  • conf (dict) – Configuration dictionary.

  • preds_log (np.ndarray) – Predictions in log-space of shape [N_samples, T, Q].

  • preds_main_log (np.ndarray) – Main model (reference) predictions in log-space of shape [N_samples, T, Q].

  • targets_log (np.ndarray) – Targets in log-space of shape [N_samples, T, Q].

  • timesteps (np.ndarray) – Array of timesteps.

  • metric (int) –

    • In interpolation mode: the training interval (e.g., 10 means every 10th timestep was used).

    • In extrapolation mode: the cutoff timestep index.

  • mode (str, optional) – Either “interpolation” or “extrapolation”. Default is “interpolation”.

  • example_idx (int, optional) – Index of the example to plot. Default is 0.

  • num_quantities (int, optional) – Maximum number of quantities to plot. Default is 100.

  • labels (list[str], optional) – Names of the quantities to display in legends.

  • save (bool, optional) – Whether to save the figure. Default is False.

  • show_title (bool, optional) – Whether to add a title to the figure. Default is True.

Return type:

None

Returns:

None

codes.benchmark.bench_plots.plot_example_predictions_with_uncertainty(surr_name, conf, log_mean, log_std, log_targets, timesteps, example_idx=0, num_quantities=100, labels=None, save=False, show_title=True)#

Plot example predictions with uncertainty in log10 space (dex).

Parameters:
  • surr_name (str) – Name of the surrogate model.

  • conf (dict) – Configuration dictionary.

  • log_mean (np.ndarray) – Ensemble mean predictions in log10 space.

  • log_std (np.ndarray) – Ensemble standard deviation in log10 space.

  • log_targets (np.ndarray) – Ground truth targets in log10 space.

  • timesteps (np.ndarray) – Array of timesteps.

  • example_idx (int) – Index of the example to plot.

  • num_quantities (int) – Number of species/quantities to plot.

  • labels (list, optional) – Quantity labels.

  • save (bool) – Whether to save the figure.

  • show_title (bool) – Whether to display a title.

Return type:

None

codes.benchmark.bench_plots.plot_generalization_error_comparison(surrogates, metrics_list, model_errors_list, xlabel, filename, config, save=True, xlog=False, show_title=True)#

Plot the generalization errors of different surrogate models.

Parameters:
  • surrogates (list) – List of surrogate model names.

  • metrics_list (list[np.array]) – List of numpy arrays containing the metrics for each surrogate model.

  • model_errors_list (list[np.array]) – List of numpy arrays containing the errors for each surrogate model.

  • xlabel (str) – Label for the x-axis.

  • filename (str) – Filename to save the plot.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • xlog (bool) – Whether to use a log scale for the x-axis.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.bench_plots.plot_generalization_errors(surr_name, conf, metrics, model_errors, mode, save=False, show_title=True)#

Plot the generalization errors of a model for various metrics.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • metrics (np.ndarray) – The metrics (e.g., intervals, cutoffs, batch sizes, number of training samples).

  • model_errors (np.ndarray) – The model errors.

  • mode (str) – The mode of generalization (“interpolation”, “extrapolation”, “sparse”, “batchsize”).

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.bench_plots.plot_gradients_heatmap(surr_name, conf, gradients, errors_log, average_correlation, save=False, cutoff_mass=0.98, show_title=True)#

Plot correlation between gradients (normalized) and Δdex errors using a heatmap.

Both gradients and errors are in log-space. Gradients are normalized, errors are absolute log differences (Δdex).

Parameters:
  • surr_name (str) – Surrogate name.

  • conf (dict) – Config dictionary.

  • gradients (np.ndarray) – Normalized log-space gradients.

  • errors_log (np.ndarray) – Δdex errors.

  • average_correlation (float) – Mean correlation value.

  • save (bool) – Save plot.

  • cutoff_mass (float) – Fraction of mass to retain in axes.

  • show_title (bool) – Show title.

Returns:

Histogram stats for reuse.

Return type:

(max_value, x_max, y_max)

codes.benchmark.bench_plots.plot_loss_comparison(train_losses, test_losses, labels, config, save=True, show_title=True)#

Plot the training and test losses for different surrogate models.

Parameters:
  • train_losses (tuple) – Tuple of training loss arrays for each surrogate model.

  • test_losses (tuple) – Tuple of test loss arrays for each surrogate model.

  • labels (tuple) – Tuple of labels for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.bench_plots.plot_loss_comparison_equal(train_losses, test_losses, labels, config, save=True, show_title=True)#

Plot the test loss trajectories for different surrogate models on a single plot, after log-transforming and normalizing each trajectory. This makes it easier to see convergence behavior even when the losses span several orders of magnitude. Numeric y-axis labels are removed.

Each loss trajectory is processed as follows:

  1. Log-transform the loss values.

  2. Normalize the log-transformed values to the range [0, 1].

  3. Plot the normalized trajectory on a normalized x-axis.

Parameters:
  • train_losses (tuple) – Tuple of training loss arrays for each surrogate model.

  • test_losses (tuple) – Tuple of test loss arrays for each surrogate model.

  • labels (tuple) – Tuple of labels for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None
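The per-trajectory normalization can be sketched as follows (illustrative, assuming strictly positive loss values):

   import numpy as np

   # Sketch of the per-trajectory normalization described above.
   def normalize_loss(loss):
       log_loss = np.log10(loss)                       # 1. log-transform
       lo, hi = log_loss.min(), log_loss.max()
       return (log_loss - lo) / (hi - lo)              # 2. scale to [0, 1]

   loss = np.logspace(0, -4, 200)                      # toy decaying loss curve
   x = np.linspace(0.0, 1.0, loss.size)                # 3. normalized x-axis
   y = normalize_loss(loss)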

codes.benchmark.bench_plots.plot_loss_comparison_train_duration(test_losses, labels, train_durations, config, save=True, show_title=True)#

Plot the test loss trajectories for different surrogate models over training duration.

Parameters:
  • test_losses (tuple) – Tuple of test loss arrays for each surrogate model.

  • labels (tuple) – Tuple of labels for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.bench_plots.plot_losses(loss_histories, epochs, labels, title='Losses', save=False, conf=None, surr_name=None, mode='main', percentage=2.0, show_title=True)#

Plot the loss trajectories for the training of multiple models.

Parameters:
  • loss_histories (tuple[array, ...]) – Tuple of loss history arrays.

  • epochs (int) – Number of epochs.

  • labels (tuple[str, ...]) – Tuple of labels for each loss history.

  • title (str) – Title of the plot.

  • save (bool) – Whether to save the plot as an image file.

  • conf (Optional[dict]) – The configuration dictionary.

  • surr_name (Optional[str]) – The name of the surrogate model.

  • mode (str) – The mode of the training.

  • percentage (float) – Percentage of initial values to exclude from min-max calculation.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.bench_plots.plot_losses_dual_axis(train_loss, test_loss, labels=('Train Loss', 'Test Loss'), title='Losses', save=False, conf=None, surr_name=None, show_title=True)#

Plot the training and test loss trajectories with dual y-axes.

Parameters:
  • train_loss (array) – Training loss history array.

  • test_loss (array) – Test loss history array.

  • labels (tuple[str, str]) – Labels for the losses (train and test).

  • title (str) – Title of the plot.

  • save (bool) – Whether to save the plot as an image file.

  • conf (Optional[dict]) – The configuration dictionary.

  • surr_name (Optional[str]) – The name of the surrogate model.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.bench_plots.plot_mean_deltadex_over_time_main_vs_ensemble(main_errors, ensemble_errors, timesteps, config, save=True, show_title=True)#

Plot mean Δdex over time for each surrogate: main vs ensemble.

Parameters:
  • main_errors (dict) – Main model Δdex arrays [N, T, Q] per surrogate.

  • ensemble_errors (dict) – Ensemble Δdex arrays [N, T, Q] per surrogate.

  • timesteps (np.ndarray) – Timesteps array.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save figure.

  • show_title (bool) – Whether to add a title.

Return type:

None

codes.benchmark.bench_plots.plot_relative_errors(mean_errors, median_errors, timesteps, config, save=True, show_title=True)#

Plot the relative errors over time for different surrogate models.

Parameters:
  • mean_errors (dict) – Dictionary containing the mean relative errors for each surrogate model.

  • median_errors (dict) – Dictionary containing the median relative errors for each surrogate model.

  • timesteps (np.ndarray) – Array of timesteps.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.bench_plots.plot_surr_losses(model, surr_name, conf, timesteps, show_title=True)#

Plot the training and test losses for the surrogate model.

Parameters:
  • model – Instance of the surrogate model class.

  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • timesteps (np.ndarray) – The timesteps array.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.bench_plots.plot_uncertainty_confidence(weighted_diffs, conf, save=True, percentile=2, summary_stat='mean', show_title=True)#

Plot a comparative grouped bar chart of catastrophic confidence measures and return a metric quantifying the net skew of over- versus underconfidence.

For each surrogate model, the target-weighted difference is computed as:

weighted_diff = (predicted uncertainty - |prediction - target|) / target.

Negative values indicate overconfidence (i.e. the model’s uncertainty is too low relative to its error), while positive values indicate underconfidence.

Catastrophic events are defined as those samples in the lowest percentile (e.g. 2nd percentile) for overconfidence and in the highest percentile (i.e. 100 - percentile) for underconfidence.

For each surrogate, this function computes the mean and standard deviation of the weighted differences in both tails, then plots them as grouped bars (overconfidence bars on the left, underconfidence bars on the right) with standard error bars (thin, with capsize=3). The bar heights are expressed in percentages.

The text labels for the bars are placed on the opposite side of the x-axis: for negative (overconfident) values the annotation is shown a few pixels above the x-axis, and for positive (underconfident) values it is shown a few pixels below the x-axis.

The plot title includes the metric (mean ± std) and the number of samples (per tail).

Additionally, if the range between the smallest and largest bar is more than two orders of magnitude, the y-axis is set to a symmetric log scale.

Parameters:
  • weighted_diffs (dict[str, np.ndarray]) – Dictionary of weighted_diff arrays for each surrogate model.

  • conf (dict) – The configuration dictionary.

  • save (bool, optional) – Whether to save the plot.

  • percentile (float, optional) – Percentile threshold for defining catastrophic events (default is 2).

  • summary_stat (str, optional) – Currently only “mean” is implemented.

  • show_title (bool) – Whether to show the title on the plot.

Returns:

A dictionary mapping surrogate names to the net difference (over_summary + under_summary).

Return type:

dict[str, float]
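A numpy sketch of the weighted difference and its tail statistics as described above (synthetic data, illustrative only):

   import numpy as np

   # Sketch of the weighted-difference and tail statistics described above.
   rng = np.random.default_rng(3)
   pred = rng.normal(1.0, 0.1, 5000)
   target = np.ones(5000)
   uncertainty = np.abs(rng.normal(0.08, 0.02, 5000))

   weighted_diff = (uncertainty - np.abs(pred - target)) / target    # < 0: overconfident

   percentile = 2
   over_tail = weighted_diff[weighted_diff <= np.percentile(weighted_diff, percentile)]
   under_tail = weighted_diff[weighted_diff >= np.percentile(weighted_diff, 100 - percentile)]
   net_skew = over_tail.mean() + under_tail.mean()                   # net over-/underconfidence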

codes.benchmark.bench_plots.plot_uncertainty_heatmap(surr_name, conf, preds_std, errors, average_correlation, save=True, cutoff_mass=0.98, show_title=True)#

Plot correlation between predictive log-space uncertainty and log-space errors (delta dex).

Parameters:
  • surr_name (str) – Name of the surrogate model.

  • conf (dict) – Configuration dictionary.

  • preds_std (np.ndarray) – Log-space ensemble standard deviation.

  • errors (np.ndarray) – Log-space prediction errors.

  • average_correlation (float) – Correlation between log uncertainty and log error.

  • save (bool) – Whether to save the figure.

  • cutoff_mass (float) – Fraction of mass to keep in histogram.

  • show_title (bool) – Whether to show a title.

Returns:

(max histogram count, axis_max used for plotting).

Return type:

tuple

codes.benchmark.bench_plots.plot_uncertainty_over_time_comparison(uncertainties, absolute_errors, timesteps, config, save=True, show_title=True)#

Plot log-space uncertainty and Δdex over time for multiple surrogates.

Parameters:
  • uncertainties (dict) – Mean log-space std over time per surrogate (1σ time series).

  • absolute_errors (dict) – Δdex arrays [N, T, Q] per surrogate.

  • timesteps (np.ndarray) – Timesteps array.

  • config (dict) – Configuration dictionary.

  • save (bool) – Save figure.

  • show_title (bool) – Add title.

Return type:

None

codes.benchmark.bench_plots.plot_uncertainty_vs_errors(surr_name, conf, preds_std, errors, save=False, show_title=True)#

Plot the correlation between predictive uncertainty and prediction errors.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • preds_std (np.ndarray) – Standard deviation of predictions from the ensemble of models.

  • errors (np.ndarray) – Prediction errors.

  • save (bool, optional) – Whether to save the plot as a file.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.bench_plots.rel_errors_and_uq(metrics, config, save=True, show_title=True)#

Create a figure with two subplots: relative errors over time and uncertainty over time for different surrogate models.

Parameters:
  • metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.bench_plots.save_plot(plt, filename, conf, surr_name='', dpi=300, base_dir='plots', increase_count=False, format='jpg')#

Save the plot to a file, creating necessary directories if they don’t exist.

Parameters:
  • plt (matplotlib.pyplot) – The plot object to save.

  • filename (str) – The desired filename for the plot.

  • conf (dict) – The configuration dictionary.

  • surr_name (str) – The name of the surrogate model.

  • dpi (int) – The resolution of the saved plot.

  • base_dir (str, optional) – The base directory where plots will be saved. Default is “plots”.

  • increase_count (bool, optional) – Whether to increment the filename count if a file already exists. Default is False.

  • format (str, optional) – The format for saving the plot. Default is “jpg”. Can be “jpg”, “png”, “pdf”, “svg”, etc.

Raises:

ValueError – If the configuration dictionary does not contain the required keys.

Return type:

None

codes.benchmark.bench_plots.save_plot_counter(filename, directory, increase_count=True)#

Save a plot with an incremented filename if a file with the same name already exists.

Parameters:
  • filename (str) – The desired filename for the plot.

  • directory (str) – The directory to save the plot in.

  • increase_count (bool, optional) – Whether to increment the filename count if a file already exists. Default is True.

Returns:

The full path to the saved plot.

Return type:

str
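A minimal sketch of this counter behaviour; the exact suffix scheme (here _1, _2, …) is an assumption:

   import os

   # Illustrative sketch of incremental filename resolution, not the package code.
   def next_free_path(filename, directory):
       base, ext = os.path.splitext(filename)
       candidate = os.path.join(directory, filename)
       count = 1
       while os.path.exists(candidate):
           candidate = os.path.join(directory, f"{base}_{count}{ext}")
           count += 1
       return candidate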

codes.benchmark.bench_utils module#

codes.benchmark.bench_utils.check_benchmark(conf)#

Check whether there are any configuration issues with the benchmark.

Parameters:

conf (dict) – The configuration dictionary.

Raises:
  • FileNotFoundError – If the training ID directory is missing or if the .yaml file is missing.

  • ValueError – If the configuration is missing required keys or the values do not match the training configuration.

Return type:

None

codes.benchmark.bench_utils.check_surrogate(surrogate, conf)#

Check whether the required models for the benchmark are present in the expected directories.

Parameters:
  • surrogate (str) – The name of the surrogate model to check.

  • conf (dict) – The configuration dictionary.

Raises:

FileNotFoundError – If any required models are missing.

Return type:

None

codes.benchmark.bench_utils.clean_metrics(metrics, conf)#

Clean the metrics dictionary to remove problematic entries.

Parameters:
  • metrics (dict) – The benchmark metrics.

  • conf (dict) – The configuration dictionary.

Returns:

The cleaned metrics dictionary.

Return type:

dict

codes.benchmark.bench_utils.convert_dict_to_scientific_notation(d, precision=8)#

Convert all numerical values in a dictionary to scientific notation.

Parameters:

d (dict) – The input dictionary.

Returns:

The dictionary with numerical values in scientific notation.

Return type:

dict

codes.benchmark.bench_utils.convert_to_standard_types(data)#

Recursively convert data to standard types that can be serialized to YAML.

Parameters:

data – The data to convert.

Returns:

The converted data.

codes.benchmark.bench_utils.count_trainable_parameters(model)#

Count the number of trainable parameters in the model.

Parameters:

model (torch.nn.Module) – The PyTorch model.

Returns:

The number of trainable parameters.

Return type:

int
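This corresponds to the standard PyTorch idiom for counting trainable parameters (a sketch; the package implementation may differ in detail):

   import torch

   # Count only parameters that require gradients.
   def count_trainable(model: torch.nn.Module) -> int:
       return sum(p.numel() for p in model.parameters() if p.requires_grad)

   print(count_trainable(torch.nn.Linear(10, 5)))   # 55 = 10*5 weights + 5 biases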

codes.benchmark.bench_utils.discard_numpy_entries(d)#

Recursively remove dictionary entries that contain NumPy arrays.

Parameters:

d (dict) – The input dictionary.

Returns:

A new dictionary without entries containing NumPy arrays.

Return type:

dict

codes.benchmark.bench_utils.flatten_dict(d, parent_key='', sep=' - ')#

Flatten a nested dictionary.

Parameters:
  • d (dict) – The dictionary to flatten.

  • parent_key (str) – The base key string.

  • sep (str) – The separator between keys.

Returns:

Flattened dictionary with composite keys.

Return type:

dict
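An illustrative sketch of the flattening with the default “ - ” separator:

   # Recursive flattening with composite keys (illustrative sketch).
   def flatten(d, parent_key="", sep=" - "):
       items = {}
       for key, value in d.items():
           new_key = f"{parent_key}{sep}{key}" if parent_key else key
           if isinstance(value, dict):
               items.update(flatten(value, new_key, sep))
           else:
               items[new_key] = value
       return items

   print(flatten({"accuracy": {"mean": 0.01, "median": 0.008}, "runtime": 1.2}))
   # {'accuracy - mean': 0.01, 'accuracy - median': 0.008, 'runtime': 1.2}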

codes.benchmark.bench_utils.format_seconds(seconds)#

Format a duration given in seconds as hh:mm:ss.

Parameters:

seconds (int) – The duration in seconds.

Returns:

The formatted duration string.

Return type:

str

codes.benchmark.bench_utils.format_time(mean_time, std_time)#

Format mean and std time consistently in ns, µs, ms, or s.

Parameters:
  • mean_time – The mean time.

  • std_time – The standard deviation of the time.

Returns:

The formatted time string.

Return type:

str

codes.benchmark.bench_utils.format_value(v, suffix='')#

Format a float with ~3 significant digits, in fixed or scientific notation. Optionally append a suffix (e.g. ‘dex’, ‘MB’).

Parameters:
  • v (float) – Value to format.

  • suffix (str) – Unit/suffix to append, e.g. ‘dex’. Ignored if as_percent=True.

Returns:

Formatted string with optional suffix or percent.

Return type:

str

codes.benchmark.bench_utils.get_model_config(surr_name, config)#

Get the model configuration for a specific surrogate model from the dataset folder. Returns an empty dictionary if config[“dataset”][“use_optimal_params”] is False, or if no configuration file is found in the dataset folder.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • config (dict) – The configuration dictionary.

Returns:

The model configuration dictionary.

Return type:

dict

codes.benchmark.bench_utils.get_required_models_list(surrogate, conf)#

Generate a list of required models based on the configuration settings.

Parameters:
  • surrogate (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

Returns:

A list of required model names.

Return type:

list

codes.benchmark.bench_utils.get_surrogate(surrogate_name)#

Check if the surrogate model exists.

Parameters:

surrogate_name (str) – The name of the surrogate model.

Returns:

The surrogate model class if it exists, otherwise None.

Return type:

SurrogateModel | None

codes.benchmark.bench_utils.load_model(model, training_id, surr_name, model_identifier)#

Load a trained surrogate model.

Parameters:
  • model – Instance of the surrogate model class.

  • training_id (str) – The training identifier.

  • surr_name (str) – The name of the surrogate model.

  • model_identifier (str) – The identifier of the model (e.g., ‘main’).

Return type:

Module

Returns:

The loaded surrogate model.

codes.benchmark.bench_utils.make_comparison_csv(metrics, config)#

Generate a CSV file comparing metrics for different surrogate models.

Parameters:
  • metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.bench_utils.measure_inference_time(model, test_loader, n_runs=5)#

Measure total inference time over a DataLoader across multiple runs.

Parameters:
  • model – Model instance with a .forward() method.

  • test_loader (DataLoader) – Loader with test data.

  • n_runs (int) – Number of repeated runs for averaging.

Returns:

List of total inference times per run (in seconds).

Return type:

list[float]
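A sketch of the repeated-run timing loop; the (inputs, targets) batch structure and the toy model and loader are illustrative assumptions:

   import time

   import torch

   # Time the full pass over the loader, repeated n_runs times (illustrative sketch).
   def time_runs(model, loader, n_runs=5):
       times = []
       model.eval()
       with torch.no_grad():
           for _ in range(n_runs):
               start = time.perf_counter()
               for inputs, _targets in loader:
                   model(inputs)
               times.append(time.perf_counter() - start)   # total seconds for this run
       return times

   loader = torch.utils.data.DataLoader(
       torch.utils.data.TensorDataset(torch.randn(64, 10), torch.randn(64, 1)),
       batch_size=16,
   )
   print(time_runs(torch.nn.Linear(10, 1), loader))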

codes.benchmark.bench_utils.measure_memory_footprint(model, inputs)#

Measure the memory footprint of a model during forward and backward passes using peak memory tracking and explicit synchronization.

Parameters:
  • model (torch.nn.Module) – The PyTorch model.

  • inputs (tuple) – The input data for the model.

Returns:

A dictionary containing measured memory usages for:
  • model_memory: Additional memory used when moving the model to GPU.

  • forward_memory: Peak additional memory during the forward pass with gradients.

  • backward_memory: Peak additional memory during the backward pass.

  • forward_memory_nograd: Peak additional memory during the forward pass without gradients.

  • model: The model (possibly moved back to the original device).

Return type:

dict
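A sketch of peak-memory tracking for a forward pass with torch.cuda utilities (requires a CUDA device; illustrative only, not the package's exact implementation):

   import torch

   # Peak-memory tracking around a forward pass, with explicit synchronization.
   def forward_peak_memory(model, inputs):
       device = torch.device("cuda")
       model = model.to(device)
       inputs = tuple(x.to(device) for x in inputs)
       torch.cuda.synchronize()
       torch.cuda.reset_peak_memory_stats(device)
       before = torch.cuda.memory_allocated(device)
       model(*inputs)                                   # forward pass with gradients enabled
       torch.cuda.synchronize()
       return torch.cuda.max_memory_allocated(device) - before   # peak extra bytes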

codes.benchmark.bench_utils.save_table_csv(headers, rows, config)#

Save the CLI table (headers and rows) to a CSV file. This version strips out any formatting (like asterisks) from the table cells.

Parameters:
  • headers (list) – The list of header names.

  • rows (list) – The list of rows, where each row is a list of string values.

  • config (dict) – Configuration dictionary that contains ‘training_id’.

Return type:

None

Returns:

None

codes.benchmark.bench_utils.write_metrics_to_yaml(surr_name, conf, metrics)#

Write the benchmark metrics to a YAML file.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • metrics (dict) – The benchmark metrics.

Return type:

None

Module contents#

codes.benchmark.check_benchmark(conf)#

Check whether there are any configuration issues with the benchmark.

Parameters:

conf (dict) – The configuration dictionary.

Raises:
  • FileNotFoundError – If the training ID directory is missing or if the .yaml file is missing.

  • ValueError – If the configuration is missing required keys or the values do not match the training configuration.

Return type:

None

codes.benchmark.check_surrogate(surrogate, conf)#

Check whether the required models for the benchmark are present in the expected directories.

Parameters:
  • surrogate (str) – The name of the surrogate model to check.

  • conf (dict) – The configuration dictionary.

Raises:

FileNotFoundError – If any required models are missing.

Return type:

None

codes.benchmark.clean_metrics(metrics, conf)#

Clean the metrics dictionary to remove problematic entries.

Parameters:
  • metrics (dict) – The benchmark metrics.

  • conf (dict) – The configuration dictionary.

Returns:

The cleaned metrics dictionary.

Return type:

dict

codes.benchmark.compare_UQ(all_metrics, config)#
Compare log-space UQ across surrogates, focusing on:
  • Ensemble vs Main errors (Δdex) over time

  • Correlation between log-space uncertainty and errors

  • Catastrophic-error detection from uncertainty thresholds

Parameters:
  • all_metrics (dict) – Benchmark metrics per surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.compare_batchsize(all_metrics, config)#

Compare the batch size training errors of different surrogate models.

Parameters:
  • all_metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.compare_errors(metrics, config)#

Compare relative errors and Δdex errors over time for different surrogate models.

Parameters:
  • metrics (dict) – Benchmark metrics for each surrogate.

  • config (dict) – Configuration dictionary.

Return type:

None

codes.benchmark.compare_extrapolation(all_metrics, config)#

Compare the extrapolation errors of different surrogate models.

Parameters:
  • all_metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.compare_gradients(metrics, config)#

Compare the gradients of different surrogate models.

Parameters:
  • metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.compare_inference_time(metrics, config, save=True)#

Compare the mean inference time of different surrogate models.

Parameters:
  • metrics (dict[str, dict]) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool, optional) – Whether to save the plot. Defaults to True.

Return type:

None

Returns:

None

codes.benchmark.compare_interpolation(all_metrics, config)#

Compare the interpolation errors of different surrogate models.

Parameters:
  • all_metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.compare_main_losses(metrics, config)#

Compare the training and test losses of the main models for different surrogate models.

Parameters:
  • metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.compare_models(metrics, config)#
codes.benchmark.compare_sparse(all_metrics, config)#

Compare the sparse training errors of different surrogate models.

Parameters:
  • all_metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.convert_dict_to_scientific_notation(d, precision=8)#

Convert all numerical values in a dictionary to scientific notation.

Parameters:

d (dict) – The input dictionary.

Returns:

The dictionary with numerical values in scientific notation.

Return type:

dict

codes.benchmark.convert_to_standard_types(data)#

Recursively convert data to standard types that can be serialized to YAML.

Parameters:

data – The data to convert.

Returns:

The converted data.

codes.benchmark.count_trainable_parameters(model)#

Count the number of trainable parameters in the model.

Parameters:

model (torch.nn.Module) – The PyTorch model.

Returns:

The number of trainable parameters.

Return type:

int

codes.benchmark.discard_numpy_entries(d)#

Recursively remove dictionary entries that contain NumPy arrays.

Parameters:

d (dict) – The input dictionary.

Returns:

A new dictionary without entries containing NumPy arrays.

Return type:

dict

codes.benchmark.evaluate_UQ(model, surr_name, test_loader, timesteps, conf, labels=None)#

Evaluate the uncertainty quantification (UQ) performance of the surrogate model.

Predictions and targets are kept in log10 space (leave_log=True). All UQ metrics are computed in log space (Δdex).

Parameters:
  • model – The surrogate model instance.

  • surr_name (str) – The name of the surrogate model.

  • test_loader (DataLoader) – DataLoader with the test data.

  • timesteps (np.ndarray) – Array of timesteps.

  • conf (dict) – Configuration dictionary.

  • labels (list, optional) – Labels for the predicted quantities.

Returns:

Dictionary containing log-space UQ metrics and arrays.

Return type:

dict

codes.benchmark.evaluate_accuracy(model, surr_name, timesteps, test_loader, conf, labels=None)#

Evaluate the accuracy of the surrogate model.

Parameters:
  • model – Instance of the surrogate model class.

  • surr_name (str) – The name of the surrogate model.

  • timesteps (np.ndarray) – The timesteps array.

  • test_loader (DataLoader) – The DataLoader object containing the test data.

  • conf (dict) – The configuration dictionary.

  • labels (list, optional) – The labels for the quantities.

  • percentile (int, optional) – The percentile for error metrics.

Returns:

A dictionary containing accuracy metrics.

Return type:

dict

codes.benchmark.evaluate_batchsize(model, surr_name, test_loader, timesteps, conf)#

Evaluate the batch-size scaling performance of the surrogate model in log-space (Δdex).

Predictions and targets are kept in log10 space (leave_log=True). Errors are computed as absolute log differences (Δdex).

Parameters:
  • model – Surrogate model instance.

  • surr_name (str) – Name of the surrogate.

  • test_loader (DataLoader) – DataLoader with test data.

  • timesteps (np.ndarray) – Timesteps array.

  • conf (dict) – Configuration dictionary.

Returns:

Batch-size scaling metrics in log-space.

Return type:

dict

codes.benchmark.evaluate_compute(model, surr_name, test_loader, conf)#

Evaluate the computational resource requirements of the surrogate model.

Parameters:
  • model – Instance of the surrogate model class.

  • surr_name (str) – The name of the surrogate model.

  • test_loader (DataLoader) – The DataLoader object containing the test data.

  • conf (dict) – The configuration dictionary.

Returns:

A dictionary containing model complexity metrics.

Return type:

dict

codes.benchmark.evaluate_extrapolation(model, surr_name, test_loader, timesteps, conf, labels=None)#

Evaluate the extrapolation performance of the surrogate model in log-space (Δdex).

Predictions and targets are kept in log10 space (leave_log=True). Errors are computed as absolute log differences (Δdex).

Parameters:
  • model – Surrogate model instance.

  • surr_name (str) – Name of the surrogate.

  • test_loader (DataLoader) – DataLoader with test data.

  • timesteps (np.ndarray) – Timesteps array.

  • conf (dict) – Configuration dictionary.

  • labels (list, optional) – Labels for the predicted quantities.

Returns:

Extrapolation metrics in log-space.

Return type:

dict

codes.benchmark.evaluate_gradients(model, surr_name, test_loader, conf, species_names=None)#

Evaluate the gradients of the surrogate model in log-space (Δdex).

Predictions and targets are kept in log10 space (leave_log=True). Errors are computed as absolute log differences (Δdex).

Parameters:
  • model – Surrogate model instance.

  • surr_name (str) – Surrogate name.

  • test_loader (DataLoader) – Test data.

  • conf (dict) – Configuration dictionary.

  • species_names (list, optional) – Names of the species/quantities.

Returns:

Gradient–error correlation metrics in log-space.

Return type:

dict

codes.benchmark.evaluate_interpolation(model, surr_name, test_loader, timesteps, conf, labels=None)#

Evaluate the interpolation performance of the surrogate model in log-space (Δdex).

Predictions and targets are kept in log10 space (leave_log=True). Errors are computed as absolute log differences (Δdex).

Parameters:
  • model – Surrogate model instance.

  • surr_name (str) – Name of the surrogate.

  • test_loader (DataLoader) – DataLoader with test data.

  • timesteps (np.ndarray) – Timesteps array.

  • conf (dict) – Configuration dictionary.

  • labels (list, optional) – Labels for the predicted quantities.

Returns:

Interpolation metrics in log-space.

Return type:

dict

codes.benchmark.evaluate_sparse(model, surr_name, test_loader, timesteps, n_train_samples, conf)#

Evaluate the sparse-data training performance of the surrogate model in log-space (Δdex).

Predictions and targets are kept in log10 space (leave_log=True). Errors are computed as absolute log differences (Δdex).

Parameters:
  • model – Surrogate model instance.

  • surr_name (str) – Name of the surrogate.

  • test_loader (DataLoader) – DataLoader with test data.

  • timesteps (np.ndarray) – Timesteps array.

  • n_train_samples (int) – Number of training samples in the full dataset.

  • conf (dict) – Configuration dictionary.

Returns:

Sparse training metrics in log-space.

Return type:

dict

codes.benchmark.flatten_dict(d, parent_key='', sep=' - ')#

Flatten a nested dictionary.

Parameters:
  • d (dict) – The dictionary to flatten.

  • parent_key (str) – The base key string.

  • sep (str) – The separator between keys.

Returns:

Flattened dictionary with composite keys.

Return type:

dict

codes.benchmark.format_seconds(seconds)#

Format a duration given in seconds as hh:mm:ss.

Parameters:

seconds (int) – The duration in seconds.

Returns:

The formatted duration string.

Return type:

str

codes.benchmark.format_time(mean_time, std_time)#

Format mean and std time consistently in ns, µs, ms, or s.

Parameters:
  • mean_time – The mean time.

  • std_time – The standard deviation of the time.

Returns:

The formatted time string.

Return type:

str

codes.benchmark.format_value(v, suffix='')#

Format a float with ~3 significant digits, in fixed or scientific notation. Optionally append a suffix (e.g. ‘dex’, ‘MB’).

Parameters:
  • v (float) – Value to format.

  • suffix (str) – Unit/suffix to append, e.g. ‘dex’. Ignored if as_percent=True.

Returns:

Formatted string with optional suffix or percent.

Return type:

str

codes.benchmark.get_custom_palette(num_colors)#

Returns a list of colors sampled from a custom color palette.

Parameters:

num_colors (int) – The number of colors needed.

Returns:

A list of RGBA color tuples.

Return type:

list

codes.benchmark.get_model_config(surr_name, config)#

Get the model configuration for a specific surrogate model from the dataset folder. Returns an empty dictionary if config[“dataset”][“use_optimal_params”] is False, or if no configuration file is found in the dataset folder.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • config (dict) – The configuration dictionary.

Returns:

The model configuration dictionary.

Return type:

dict

codes.benchmark.get_required_models_list(surrogate, conf)#

Generate a list of required models based on the configuration settings.

Parameters:
  • surrogate (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

Returns:

A list of required model names.

Return type:

list

codes.benchmark.get_surrogate(surrogate_name)#

Check if the surrogate model exists.

Parameters:

surrogate_name (str) – The name of the surrogate model.

Returns:

The surrogate model class if it exists, otherwise None.

Return type:

SurrogateModel | None

codes.benchmark.inference_time_bar_plot(surrogates, means, stds, config, save=True, show_title=True)#

Plot the mean inference time with standard deviation for different surrogate models.

Parameters:
  • surrogates (List[str]) – List of surrogate model names.

  • means (List[float]) – List of mean inference times for each surrogate model.

  • stds (List[float]) – List of standard deviation of inference times for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool, optional) – Whether to save the plot. Defaults to True.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.load_model(model, training_id, surr_name, model_identifier)#

Load a trained surrogate model.

Parameters:
  • model – Instance of the surrogate model class.

  • training_id (str) – The training identifier.

  • surr_name (str) – The name of the surrogate model.

  • model_identifier (str) – The identifier of the model (e.g., ‘main’).

Return type:

Module

Returns:

The loaded surrogate model.

codes.benchmark.make_comparison_csv(metrics, config)#

Generate a CSV file comparing metrics for different surrogate models.

Parameters:
  • metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.measure_inference_time(model, test_loader, n_runs=5)#

Measure total inference time over a DataLoader across multiple runs.

Parameters:
  • model – Model instance with a .forward() method.

  • test_loader (DataLoader) – Loader with test data.

  • n_runs (int) – Number of repeated runs for averaging.

Returns:

List of total inference times per run (in seconds).

Return type:

list[float]
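
A hedged sketch of combining measure_inference_time with format_time; model and test_loader are assumed to have been prepared elsewhere (e.g. via load_model), and the loader must yield batches the model’s forward pass accepts:

    import numpy as np
    from codes.benchmark import measure_inference_time, format_time

    # model and test_loader are assumed to exist; their construction is surrogate-specific.
    times = measure_inference_time(model, test_loader, n_runs=5)  # seconds per run
    print(format_time(np.mean(times), np.std(times)))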

codes.benchmark.measure_memory_footprint(model, inputs)#

Measure the memory footprint of a model during forward and backward passes using peak memory tracking and explicit synchronization.

Parameters:
  • model (torch.nn.Module) – The PyTorch model.

  • inputs (tuple) – The input data for the model.

Returns:

A dictionary containing measured memory usages for:
  • model_memory: Additional memory used when moving the model to GPU.

  • forward_memory: Peak additional memory during the forward pass with gradients.

  • backward_memory: Peak additional memory during the backward pass.

  • forward_memory_nograd: Peak additional memory during the forward pass without gradients.

model: The model (possibly moved back to the original device).

Return type:

dict
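
A hedged sketch; the exact structure of inputs depends on the surrogate’s forward signature, and peak-memory tracking is most meaningful on a CUDA device:

    from codes.benchmark import measure_memory_footprint

    # model is assumed to exist; inputs is assumed to be one batch of tensors in the
    # form the model's forward pass expects, e.g. drawn from the test DataLoader.
    inputs = tuple(next(iter(test_loader)))
    result = measure_memory_footprint(model, inputs)
    # Per the documented return type, result contains the memory keys listed above.
    print(result)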

codes.benchmark.plot_MAE_comparison(MAEs, labels, config, save=True, show_title=True)#

Plot the MAE for different surrogate models.

Parameters:
  • MAEs (tuple) – Tuple of MAE arrays for each surrogate model.

  • labels (tuple) – Tuple of labels for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.plot_all_generalization_errors(all_metrics, config, show_title=True)#

Function to make one comparative plot of the interpolation, extrapolation, sparse, and batch size errors. Only the modalities present in all_metrics will be plotted.

Parameters:
  • all_metrics (dict) – dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.plot_average_errors_over_time(surr_name, conf, errors, metrics, timesteps, mode, save=False, show_title=True)#

Plot Δdex errors over time for different evaluation modes.

Parameters:
  • surr_name (str) – Surrogate name.

  • conf (dict) – Config dictionary.

  • errors (np.ndarray) – Errors [N_metrics, n_timesteps].

  • metrics (np.ndarray) – Metrics [N_metrics].

  • timesteps (np.ndarray) – Timesteps.

  • mode (str) – One of ‘interpolation’, ‘extrapolation’, ‘sparse’, ‘batchsize’.

Return type:

None

codes.benchmark.plot_average_uncertainty_over_time(surr_name, conf, errors_time, preds_std, timesteps, save=False, show_title=True)#

Plot average predictive uncertainty and errors over time in log-space (dex).

Parameters:
  • surr_name (str) – Name of the surrogate model.

  • conf (dict) – Configuration dictionary.

  • errors_time (np.ndarray) – Log-space prediction errors over time.

  • preds_std (np.ndarray) – Log-space ensemble standard deviation over time.

  • timesteps (np.ndarray) – Array of timesteps.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show a title.

Return type:

None

codes.benchmark.plot_comparative_error_correlation_heatmaps(preds_std, errors, avg_correlations, axis_max, max_count, config, save=True, show_title=True)#

Comparative heatmaps of log-space uncertainty vs Δdex.

Parameters:
  • preds_std (dict) – Log-space std arrays per surrogate.

  • errors (dict) – Δdex arrays per surrogate.

  • avg_correlations (dict) – Pearson r per surrogate (log-space).

  • axis_max (dict) – Axis maxima from per-surrogate plots.

  • max_count (dict) – Peak counts for normalization per surrogate.

  • config (dict) – Configuration dictionary.

  • save (bool) – Save figure.

  • show_title (bool) – Add title.

Return type:

None

codes.benchmark.plot_comparative_gradient_heatmaps(gradients, errors, avg_correlations, max_grad, max_err, max_count, config, save=True, show_title=True)#

Plot comparative heatmaps of correlation between gradient and prediction errors for multiple surrogate models.

Parameters:
  • gradients (dict[str, np.ndarray]) – Dictionary of gradients from the ensemble of models.

  • errors (dict[str, np.ndarray]) – Dictionary of prediction errors.

  • avg_correlations (dict[str, float]) – Dictionary of average correlations between gradients and prediction errors.

  • max_grad (dict[str, float]) – Dictionary of maximum gradient values for axis scaling across models.

  • max_err (dict[str, float]) – Dictionary of maximum error values for axis scaling across models.

  • max_count (dict[str, float]) – Dictionary of maximum count values for heatmap normalization across models.

  • config (dict) – Configuration dictionary.

  • save (bool, optional) – Whether to save the plot. Defaults to True.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.plot_dynamic_correlation(surr_name, conf, gradients, errors, save=False, show_title=True)#

Plot the correlation between the gradients of the data and the prediction errors.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • gradients (np.ndarray) – The gradients of the data.

  • errors (np.ndarray) – The prediction errors.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

codes.benchmark.plot_error_distribution_comparative(errors, conf, save=True, show_title=True, mode='relative')#

Plot comparative error distributions for each surrogate model.

Parameters:
  • errors (dict) – Model → array of errors [num_samples, num_timesteps, num_quantities].

  • conf (dict) – Configuration dictionary.

  • save (bool) – Whether to save the figure.

  • show_title (bool) – Whether to add a title.

  • mode (str) – “relative” (unitless %) or “deltadex” (log-space abs. errors).

Return type:

None

codes.benchmark.plot_error_distribution_per_quantity(surr_name, conf, errors, quantity_names=None, num_quantities=10, mode='relative', save=True, show_title=True)#

Plot the distribution of errors for each quantity as a smoothed histogram plot.

  • mode=”relative”:

    Errors are relative (0..∞). Histogram is plotted in log-space (x-axis log-scaled).

  • mode=”deltadex”:

    Errors are absolute log-space errors (Δdex ≥ 0). Histogram is plotted on linear scale.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • errors (np.ndarray) – Errors array of shape [num_samples, num_timesteps, num_quantities].

  • quantity_names (list, optional) – List of quantity names for labeling the lines.

  • num_quantities (int, optional) – Number of quantities to plot. Default is 10.

  • mode (str, optional) – “relative” or “deltadex”. Default is “relative”.

  • save (bool, optional) – Whether to save the plot as a file.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.plot_error_percentiles_over_time(surr_name, conf, errors, timesteps, title, mode='relative', save=False, show_title=True)#

Plot mean, median, and percentile error envelopes over time.

  • mode=”relative”:

    Treats errors as relative errors (0..∞). Plots bidirectional percentile bands (25-75, 5-95, 0.5-99.5). Y-axis is log-scaled.

  • mode=”deltadex”:

    Treats errors as log-space absolute errors (Δdex ≥ 0). Plots one-sided percentile bands (0-50, 0-90, 0-99). Y-axis is linear, starting at 0.

Parameters:
  • surr_name (str) – Name of the surrogate model (used for saving).

  • conf (dict) – Configuration dictionary containing dataset and output settings.

  • errors (np.ndarray) – Error array of shape [N_samples, N_timesteps, N_quantities]. Values are either relative errors or Δdex depending on mode.

  • timesteps (np.ndarray) – Array of timesteps corresponding to the second axis of errors.

  • title (str) – Title for the plot.

  • mode (str, optional) – “relative” for relative errors, “deltadex” for log-space absolute errors. Defaults to “relative”.

  • save (bool, optional) – Whether to save the plot to disk. Defaults to False.

  • show_title (bool, optional) – Whether to show the plot title. Defaults to True.

Return type:

None

Returns:

None
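
A sketch with a synthetic Δdex error array; conf is assumed to be a benchmark configuration dictionary loaded elsewhere (e.g. via read_yaml_config):

    import numpy as np
    from codes.benchmark import plot_error_percentiles_over_time

    # Synthetic Δdex errors of shape [N_samples, N_timesteps, N_quantities].
    errors = np.abs(np.random.default_rng(1).normal(0.0, 0.3, size=(100, 50, 4)))
    timesteps = np.linspace(0.0, 1.0, 50)
    plot_error_percentiles_over_time(
        "ExampleSurrogate", conf, errors, timesteps,
        title="Delta dex percentiles over time", mode="deltadex", save=False,
    )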

codes.benchmark.plot_example_iterative_predictions(surr_name, conf, iterative_preds, full_preds, targets, timesteps, iter_interval, example_idx=None, num_quantities=100, labels=None, save=False, show_title=True)#

Plot one sample’s full iterative trajectory: ground truth vs. chained predictions, with retrigger lines.

Return type:

None

codes.benchmark.plot_example_mode_predictions(surr_name, conf, preds_log, preds_main_log, targets_log, timesteps, metric, mode='interpolation', example_idx=0, num_quantities=100, labels=None, save=False, show_title=True)#

Plot example predictions in log-space (Δdex) alongside ground truth targets for either interpolation or extrapolation mode.

Predictions and targets are assumed to be in log10 space (leave_log=True). Axis labels and plotted values are consistent with this log representation.

Parameters:
  • surr_name (str) – Name of the surrogate model.

  • conf (dict) – Configuration dictionary.

  • preds_log (np.ndarray) – Predictions in log-space of shape [N_samples, T, Q].

  • preds_main_log (np.ndarray) – Main model (reference) predictions in log-space of shape [N_samples, T, Q].

  • targets_log (np.ndarray) – Targets in log-space of shape [N_samples, T, Q].

  • timesteps (np.ndarray) – Array of timesteps.

  • metric (int) –

    • In interpolation mode: the training interval (e.g., 10 means every 10th timestep was used).

    • In extrapolation mode: the cutoff timestep index.

  • mode (str, optional) – Either “interpolation” or “extrapolation”. Default is “interpolation”.

  • example_idx (int, optional) – Index of the example to plot. Default is 0.

  • num_quantities (int, optional) – Maximum number of quantities to plot. Default is 100.

  • labels (list[str], optional) – Names of the quantities to display in legends.

  • save (bool, optional) – Whether to save the figure. Default is False.

  • show_title (bool, optional) – Whether to add a title to the figure. Default is True.

Return type:

None

Returns:

None

codes.benchmark.plot_example_predictions_with_uncertainty(surr_name, conf, log_mean, log_std, log_targets, timesteps, example_idx=0, num_quantities=100, labels=None, save=False, show_title=True)#

Plot example predictions with uncertainty in log10 space (dex).

Parameters:
  • surr_name (str) – Name of the surrogate model.

  • conf (dict) – Configuration dictionary.

  • log_mean (np.ndarray) – Ensemble mean predictions in log10 space.

  • log_std (np.ndarray) – Ensemble standard deviation in log10 space.

  • log_targets (np.ndarray) – Ground truth targets in log10 space.

  • timesteps (np.ndarray) – Array of timesteps.

  • example_idx (int) – Index of the example to plot.

  • num_quantities (int) – Number of species/quantities to plot.

  • labels (list, optional) – Quantity labels.

  • save (bool) – Whether to save the figure.

  • show_title (bool) – Whether to display a title.

Return type:

None

codes.benchmark.plot_generalization_error_comparison(surrogates, metrics_list, model_errors_list, xlabel, filename, config, save=True, xlog=False, show_title=True)#

Plot the generalization errors of different surrogate models.

Parameters:
  • surrogates (list) – List of surrogate model names.

  • metrics_list (list[np.array]) – List of numpy arrays containing the metrics for each surrogate model.

  • model_errors_list (list[np.array]) – List of numpy arrays containing the errors for each surrogate model.

  • xlabel (str) – Label for the x-axis.

  • filename (str) – Filename to save the plot.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • xlog (bool) – Whether to use a log scale for the x-axis.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.plot_generalization_errors(surr_name, conf, metrics, model_errors, mode, save=False, show_title=True)#

Plot the generalization errors of a model for various metrics.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • metrics (np.ndarray) – The metrics (e.g., intervals, cutoffs, batch sizes, number of training samples).

  • model_errors (np.ndarray) – The model errors.

  • mode (str) – The mode of generalization (“interpolation”, “extrapolation”, “sparse”, “batchsize”).

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.plot_gradients_heatmap(surr_name, conf, gradients, errors_log, average_correlation, save=False, cutoff_mass=0.98, show_title=True)#

Plot correlation between gradients (normalized) and Δdex errors using a heatmap.

Both gradients and errors are in log-space. Gradients are normalized, errors are absolute log differences (Δdex).

Parameters:
  • surr_name (str) – Surrogate name.

  • conf (dict) – Config dictionary.

  • gradients (np.ndarray) – Normalized log-space gradients.

  • errors_log (np.ndarray) – Δdex errors.

  • average_correlation (float) – Mean correlation value.

  • save (bool) – Save plot.

  • cutoff_mass (float) – Fraction of mass to retain in axes.

  • show_title (bool) – Show title.

Returns:

Histogram statistics (max_value, x_max, y_max) for reuse.

Return type:

tuple

codes.benchmark.plot_loss_comparison(train_losses, test_losses, labels, config, save=True, show_title=True)#

Plot the training and test losses for different surrogate models.

Parameters:
  • train_losses (tuple) – Tuple of training loss arrays for each surrogate model.

  • test_losses (tuple) – Tuple of test loss arrays for each surrogate model.

  • labels (tuple) – Tuple of labels for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.plot_loss_comparison_train_duration(test_losses, labels, train_durations, config, save=True, show_title=True)#

Plot the test loss trajectories for different surrogate models over training duration.

Parameters:
  • test_losses (tuple) – Tuple of test loss arrays for each surrogate model.

  • labels (tuple) – Tuple of labels for each surrogate model.

  • train_durations (tuple) – Tuple of training durations for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.plot_losses(loss_histories, epochs, labels, title='Losses', save=False, conf=None, surr_name=None, mode='main', percentage=2.0, show_title=True)#

Plot the loss trajectories for the training of multiple models.

Parameters:
  • loss_histories (tuple[array, ...]) – List of loss history arrays.

  • epochs (int) – Number of epochs.

  • labels (tuple[str, ...]) – List of labels for each loss history.

  • title (str) – Title of the plot.

  • save (bool) – Whether to save the plot as an image file.

  • conf (Optional[dict]) – The configuration dictionary.

  • surr_name (Optional[str]) – The name of the surrogate model.

  • mode (str) – The mode of the training.

  • percentage (float) – Percentage of initial values to exclude from min-max calculation.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.plot_relative_errors(mean_errors, median_errors, timesteps, config, save=True, show_title=True)#

Plot the relative errors over time for different surrogate models.

Parameters:
  • mean_errors (dict) – Dictionary containing the mean relative errors for each surrogate model.

  • median_errors (dict) – Dictionary containing the median relative errors for each surrogate model.

  • timesteps (np.ndarray) – Array of timesteps.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.plot_surr_losses(model, surr_name, conf, timesteps, show_title=True)#

Plot the training and test losses for the surrogate model.

Parameters:
  • model – Instance of the surrogate model class.

  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • timesteps (np.ndarray) – The timesteps array.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.plot_uncertainty_confidence(weighted_diffs, conf, save=True, percentile=2, summary_stat='mean', show_title=True)#

Plot a comparative grouped bar chart of catastrophic confidence measures and return a metric quantifying the net skew of over- versus underconfidence.

For each surrogate model, the target-weighted difference is computed as:

weighted_diff = (predicted uncertainty - |prediction - target|) / target.

Negative values indicate overconfidence (i.e. the model’s uncertainty is too low relative to its error), while positive values indicate underconfidence.

Catastrophic events are defined as those samples in the lowest percentile (e.g. 2nd percentile) for overconfidence and in the highest percentile (i.e. 100 - percentile) for underconfidence.

For each surrogate, this function computes the mean and standard deviation of the weighted differences in both tails, then plots them as grouped bars (overconfidence bars on the left, underconfidence bars on the right) with standard error bars (thin, with capsize=3). The bar heights are expressed in percentages.

The text labels for the bars are placed on the opposite side of the x-axis: for negative (overconfident) values the annotation is shown a few pixels above the x-axis, and for positive (underconfident) values it is shown a few pixels below the x-axis.

The plot title includes the metric (mean ± std) and the number of samples (per tail).

Additionally, if the range between the smallest and largest bar is more than two orders of magnitude, the y-axis is set to a symmetric log scale.

Parameters:
  • weighted_diffs (dict[str, np.ndarray]) – Dictionary of weighted_diff arrays for each surrogate model.

  • conf (dict) – The configuration dictionary.

  • save (bool, optional) – Whether to save the plot.

  • percentile (float, optional) – Percentile threshold for defining catastrophic events (default is 2).

  • summary_stat (str, optional) – Currently only “mean” is implemented.

  • show_title (bool) – Whether to show the title on the plot.

Returns:

A dictionary mapping surrogate names to the net difference (over_summary + under_summary).

Return type:

dict[str, float]
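
A sketch that builds the weighted_diffs input from the formula documented above, using synthetic arrays; conf is again assumed to be the benchmark configuration dictionary:

    import numpy as np
    from codes.benchmark import plot_uncertainty_confidence

    rng = np.random.default_rng(0)
    targets = rng.uniform(1.0, 10.0, size=1000)
    preds = targets + rng.normal(0.0, 0.5, size=1000)
    uncertainty = np.abs(rng.normal(0.0, 0.5, size=1000))

    # Target-weighted difference as defined in the docstring.
    weighted_diff = (uncertainty - np.abs(preds - targets)) / targets
    weighted_diffs = {"ExampleSurrogate": weighted_diff}

    net_skew = plot_uncertainty_confidence(weighted_diffs, conf, percentile=2)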

codes.benchmark.plot_uncertainty_heatmap(surr_name, conf, preds_std, errors, average_correlation, save=True, cutoff_mass=0.98, show_title=True)#

Plot correlation between predictive log-space uncertainty and log-space errors (delta dex).

Parameters:
  • surr_name (str) – Name of the surrogate model.

  • conf (dict) – Configuration dictionary.

  • preds_std (np.ndarray) – Log-space ensemble standard deviation.

  • errors (np.ndarray) – Log-space prediction errors.

  • average_correlation (float) – Correlation between log uncertainty and log error.

  • save (bool) – Whether to save the figure.

  • cutoff_mass (float) – Fraction of mass to keep in histogram.

  • show_title (bool) – Whether to show a title.

Returns:

(max histogram count, axis_max used for plotting).

Return type:

tuple

codes.benchmark.plot_uncertainty_over_time_comparison(uncertainties, absolute_errors, timesteps, config, save=True, show_title=True)#

Plot log-space uncertainty and Δdex over time for multiple surrogates.

Parameters:
  • uncertainties (dict) – Mean log-space std over time per surrogate (1σ time series).

  • absolute_errors (dict) – Δdex arrays [N, T, Q] per surrogate.

  • timesteps (np.ndarray) – Timesteps array.

  • config (dict) – Configuration dictionary.

  • save (bool) – Save figure.

  • show_title (bool) – Add title.

Return type:

None

codes.benchmark.plot_uncertainty_vs_errors(surr_name, conf, preds_std, errors, save=False, show_title=True)#

Plot the correlation between predictive uncertainty and prediction errors.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • preds_std (np.ndarray) – Standard deviation of predictions from the ensemble of models.

  • errors (np.ndarray) – Prediction errors.

  • save (bool, optional) – Whether to save the plot as a file.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.read_yaml_config(config_path)#
codes.benchmark.rel_errors_and_uq(metrics, config, save=True, show_title=True)#

Create a figure with two subplots: relative errors over time and uncertainty over time for different surrogate models.

Parameters:
  • metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.run_benchmark(surr_name, surrogate_class, conf)#

Run benchmarks for a given surrogate model.

Parameters:
  • surr_name (str) – The name of the surrogate model to benchmark.

  • surrogate_class – The class of the surrogate model.

  • conf (dict) – The configuration dictionary.

Returns:

A dictionary containing all relevant metrics for the given model.

Return type:

dict
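
A hedged end-to-end sketch; the configuration path and surrogate name are placeholders, and the keys required inside the config are defined by the benchmark setup rather than shown here:

    from codes.benchmark import (
        get_surrogate, read_yaml_config, run_benchmark, write_metrics_to_yaml,
    )

    conf = read_yaml_config("config.yaml")   # hypothetical path
    surr_name = "ExampleSurrogate"           # hypothetical surrogate name
    surrogate_class = get_surrogate(surr_name)
    if surrogate_class is not None:
        metrics = run_benchmark(surr_name, surrogate_class, conf)
        write_metrics_to_yaml(surr_name, conf, metrics)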

codes.benchmark.save_plot(plt, filename, conf, surr_name='', dpi=300, base_dir='plots', increase_count=False, format='jpg')#

Save the plot to a file, creating necessary directories if they don’t exist.

Parameters:
  • plt (matplotlib.pyplot) – The plot object to save.

  • filename (str) – The desired filename for the plot.

  • conf (dict) – The configuration dictionary.

  • surr_name (str) – The name of the surrogate model.

  • dpi (int) – The resolution of the saved plot.

  • base_dir (str, optional) – The base directory where plots will be saved. Default is “plots”.

  • increase_count (bool, optional) – Whether to increment the filename count if a file already exists. Default is False.

  • format (str, optional) – The format for saving the plot. Default is “jpg”. Can be “png”, “pdf”, “svg”, etc.

Raises:

ValueError – If the configuration dictionary does not contain the required keys.

Return type:

None
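
A sketch of saving a figure through save_plot; which configuration keys are required is checked by the function itself (a ValueError is raised if they are missing), so conf is assumed to come from the same benchmark configuration used elsewhere:

    import matplotlib.pyplot as plt
    from codes.benchmark import save_plot

    plt.plot([0, 1], [1, 0])
    # conf: benchmark configuration dict; the surrogate name is a placeholder.
    save_plot(plt, "example_plot.jpg", conf, surr_name="ExampleSurrogate", format="jpg")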

codes.benchmark.save_plot_counter(filename, directory, increase_count=True)#

Save a plot with an incremented filename if a file with the same name already exists.

Parameters:
  • filename (str) – The desired filename for the plot.

  • directory (str) – The directory to save the plot in.

  • increase_count (bool, optional) – Whether to increment the filename count if a file already exists. Default is True.

Returns:

The full path to the saved plot.

Return type:

str

codes.benchmark.save_table_csv(headers, rows, config)#

Save the CLI table (headers and rows) to a CSV file. This version strips out any formatting (like asterisks) from the table cells.

Parameters:
  • headers (list) – The list of header names.

  • rows (list) – The list of rows, where each row is a list of string values.

  • config (dict) – Configuration dictionary that contains ‘training_id’.

Return type:

None

Returns:

None

codes.benchmark.tabular_comparison(all_metrics, config)#

Compare the metrics of different surrogate models in a tabular format. Prints a table to the CLI, saves the table into a text file, and saves a CSV file with all metrics. Also saves a CSV file with only the metrics that appear in the CLI table.

Return type:

None

codes.benchmark.time_inference(model, surr_name, test_loader, conf, n_test_samples, n_runs=5)#

Time the inference of the surrogate model (full version with metrics).

Parameters:
  • model – Instance of the surrogate model class.

  • surr_name (str) – The name of the surrogate model.

  • test_loader (DataLoader) – The DataLoader object containing the test data.

  • conf (dict) – The configuration dictionary.

  • n_test_samples (int) – The number of test samples.

  • n_runs (int, optional) – Number of times to run the inference for timing.

Returns:

A dictionary containing timing metrics.

Return type:

dict

codes.benchmark.write_metrics_to_yaml(surr_name, conf, metrics)#

Write the benchmark metrics to a YAML file.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • metrics (dict) – The benchmark metrics.

Return type:

None