codes.benchmark package#

Submodules#

codes.benchmark.bench_fcts module#

codes.benchmark.bench_fcts.compare_UQ(all_metrics, config)#
Compare log-space UQ across surrogates, focusing on:
  • Ensemble vs Main errors (Δdex) over time

  • Correlation between log-space uncertainty and errors

  • Catastrophic-error detection from uncertainty thresholds

Parameters:
  • all_metrics (dict) – Benchmark metrics per surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.bench_fcts.compare_batchsize(all_metrics, config)#

Compare the batch size training errors of different surrogate models.

Parameters:
  • all_metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.bench_fcts.compare_errors(metrics, config)#

Compare relative errors and Δdex errors over time for different surrogate models.

Parameters:
  • metrics (dict) – Benchmark metrics for each surrogate.

  • config (dict) – Configuration dictionary.

Return type:

None

codes.benchmark.bench_fcts.compare_extrapolation(all_metrics, config)#

Compare the extrapolation errors of different surrogate models.

Parameters:
  • all_metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.bench_fcts.compare_gradients(metrics, config)#

Compare the gradients of different surrogate models.

Parameters:
  • metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.bench_fcts.compare_inference_time(metrics, config, save=True)#

Compare the mean inference time of different surrogate models.

Parameters:
  • metrics (dict[str, dict]) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool, optional) – Whether to save the plot. Defaults to True.

Return type:

None

Returns:

None

codes.benchmark.bench_fcts.compare_interpolation(all_metrics, config)#

Compare the interpolation errors of different surrogate models.

Parameters:
  • all_metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.bench_fcts.compare_main_losses(metrics, config)#

Compare the training and test losses of the main models for different surrogate models.

Parameters:
  • metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.bench_fcts.compare_models(metrics, config)#
codes.benchmark.bench_fcts.compare_sparse(all_metrics, config)#

Compare the sparse training errors of different surrogate models.

Parameters:
  • all_metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.bench_fcts.evaluate_UQ(model, surr_name, test_loader, timesteps, conf, labels=None)#

Evaluate the uncertainty quantification (UQ) performance of the surrogate model.

Predictions and targets are kept in log10 space (leave_log=True). All UQ metrics are computed in log space (Δdex).

Parameters:
  • model – The surrogate model instance.

  • surr_name (str) – The name of the surrogate model.

  • test_loader (DataLoader) – DataLoader with the test data.

  • timesteps (np.ndarray) – Array of timesteps.

  • conf (dict) – Configuration dictionary.

  • labels (list, optional) – Labels for the predicted quantities.

Returns:

Dictionary containing log-space UQ metrics and arrays.

Return type:

dict
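For orientation, a minimal numpy sketch of the kind of log-space quantities this evaluation works with (ensemble mean, 1σ spread, and Δdex errors); the shapes and data below are illustrative and not the package's internal implementation:

   import numpy as np

   # Illustrative sketch of log-space (delta dex) UQ quantities, not the package's code.
   rng = np.random.default_rng(0)
   preds_log = rng.normal(size=(5, 8, 20, 4))     # ensemble predictions [n_models, N, T, Q]
   targets_log = rng.normal(size=(8, 20, 4))      # targets in log10 space [N, T, Q]

   ens_mean = preds_log.mean(axis=0)              # ensemble mean prediction
   ens_std = preds_log.std(axis=0)                # log-space uncertainty (1 sigma)
   delta_dex = np.abs(ens_mean - targets_log)     # absolute log10 error per point

   # correlation between uncertainty and error, later compared across surrogates by compare_UQ
   corr = np.corrcoef(ens_std.ravel(), delta_dex.ravel())[0, 1]
   print(f"mean delta dex: {delta_dex.mean():.3f}, corr(std, error): {corr:.3f}")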

codes.benchmark.bench_fcts.evaluate_accuracy(model, surr_name, timesteps, test_loader, conf, labels=None)#

Evaluate the accuracy of the surrogate model.

Parameters:
  • model – Instance of the surrogate model class.

  • surr_name (str) – The name of the surrogate model.

  • timesteps (np.ndarray) – The timesteps array.

  • test_loader (DataLoader) – The DataLoader object containing the test data.

  • conf (dict) – The configuration dictionary.

  • labels (list, optional) – The labels for the quantities.

  • percentile (int, optional) – The percentile for error metrics.

Returns:

A dictionary containing accuracy metrics.

Return type:

dict

codes.benchmark.bench_fcts.evaluate_batchsize(model, surr_name, test_loader, timesteps, conf)#

Evaluate the batch-size scaling performance of the surrogate model in log-space (Δdex).

Predictions and targets are kept in log10 space (leave_log=True). Errors are computed as absolute log differences (Δdex).

Parameters:
  • model – Surrogate model instance.

  • surr_name (str) – Name of the surrogate.

  • test_loader (DataLoader) – DataLoader with test data.

  • timesteps (np.ndarray) – Timesteps array.

  • conf (dict) – Configuration dictionary.

Returns:

Batch-size scaling metrics in log-space.

Return type:

dict

codes.benchmark.bench_fcts.evaluate_compute(model, surr_name, test_loader, conf)#

Evaluate the computational resource requirements of the surrogate model.

Parameters:
  • model – Instance of the surrogate model class.

  • surr_name (str) – The name of the surrogate model.

  • test_loader (DataLoader) – The DataLoader object containing the test data.

  • conf (dict) – The configuration dictionary.

Returns:

A dictionary containing model complexity metrics.

Return type:

dict

codes.benchmark.bench_fcts.evaluate_extrapolation(model, surr_name, test_loader, timesteps, conf, labels=None)#

Evaluate the extrapolation performance of the surrogate model in log-space (Δdex).

Predictions and targets are kept in log10 space (leave_log=True). Errors are computed as absolute log differences (Δdex).

Parameters:
  • model – Surrogate model instance.

  • surr_name (str) – Name of the surrogate.

  • test_loader (DataLoader) – DataLoader with test data.

  • timesteps (np.ndarray) – Timesteps array.

  • conf (dict) – Configuration dictionary.

  • labels (list, optional) – Labels for the predicted quantities.

Returns:

Extrapolation metrics in log-space.

Return type:

dict

codes.benchmark.bench_fcts.evaluate_gradients(model, surr_name, test_loader, conf, species_names=None)#

Evaluate the gradients of the surrogate model in log-space (Δdex).

Predictions and targets are kept in log10 space (leave_log=True). Errors are computed as absolute log differences (Δdex).

Parameters:
  • model – Surrogate model instance.

  • surr_name (str) – Surrogate name.

  • test_loader (DataLoader) – Test data.

  • conf (dict) – Configuration dictionary.

  • species_names (list, optional) – Names of the species/quantities.

Returns:

Gradient–error correlation metrics in log-space.

Return type:

dict

codes.benchmark.bench_fcts.evaluate_interpolation(model, surr_name, test_loader, timesteps, conf, labels=None)#

Evaluate the interpolation performance of the surrogate model in log-space (Δdex).

Predictions and targets are kept in log10 space (leave_log=True). Errors are computed as absolute log differences (Δdex).

Parameters:
  • model – Surrogate model instance.

  • surr_name (str) – Name of the surrogate.

  • test_loader (DataLoader) – DataLoader with test data.

  • timesteps (np.ndarray) – Timesteps array.

  • conf (dict) – Configuration dictionary.

  • labels (list, optional) – Labels for the predicted quantities.

Returns:

Interpolation metrics in log-space.

Return type:

dict

codes.benchmark.bench_fcts.evaluate_iterative_predictions(model, surr_name, timesteps, val_loader, val_params, conf, labels=None)#

Evaluate the iterative predictions of the surrogate model.

Returns the same set of error metrics as evaluate_accuracy, but over the full trajectory built by re-feeding the last prediction as the next initial state.

Return type:

dict[str, Any]
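A toy sketch of the chaining described above, where the last prediction of each window seeds the next one (the step function is an illustrative stand-in for the surrogate):

   import numpy as np

   # Toy sketch of chained prediction: the last prediction becomes the next initial state.
   def step(y0, n_steps):
       # stand-in for one surrogate prediction window starting from state y0
       return y0 + np.cumsum(np.full((n_steps, y0.size), 0.01), axis=0)

   y0 = np.zeros(4)                     # initial state
   trajectory = [y0]
   for _ in range(5):                   # chain five windows of ten steps each
       window = step(trajectory[-1], 10)
       trajectory.extend(window)        # trajectory[-1] now seeds the next window
   print(np.asarray(trajectory).shape)  # (51, 4)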

codes.benchmark.bench_fcts.evaluate_sparse(model, surr_name, test_loader, timesteps, n_train_samples, conf)#

Evaluate the sparse-data training performance of the surrogate model in log-space (Δdex).

Predictions and targets are kept in log10 space (leave_log=True). Errors are computed as absolute log differences (Δdex).

Parameters:
  • model – Surrogate model instance.

  • surr_name (str) – Name of the surrogate.

  • test_loader (DataLoader) – DataLoader with test data.

  • timesteps (np.ndarray) – Timesteps array.

  • n_train_samples (int) – Number of training samples in the full dataset.

  • conf (dict) – Configuration dictionary.

Returns:

Sparse training metrics in log-space.

Return type:

dict

codes.benchmark.bench_fcts.run_benchmark(surr_name, surrogate_class, conf)#

Run benchmarks for a given surrogate model.

Parameters:
  • surr_name (str) – The name of the surrogate model to benchmark.

  • surrogate_class – The class of the surrogate model.

  • conf (dict) – The configuration dictionary.

Returns:

A dictionary containing all relevant metrics for the given model.

Return type:

dict
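A hedged usage sketch of the benchmark entry point; the file name “config.yaml” and the “surrogates” config key are assumptions, while check_benchmark, get_surrogate, and run_benchmark are called with the signatures documented on this page:

   import yaml

   from codes.benchmark.bench_fcts import run_benchmark
   from codes.benchmark.bench_utils import check_benchmark, get_surrogate

   # "config.yaml" and the "surrogates" key are assumptions about the benchmark
   # configuration; the function signatures follow the documentation on this page.
   with open("config.yaml") as f:
       conf = yaml.safe_load(f)

   check_benchmark(conf)                         # raises if the config is inconsistent
   all_metrics = {}
   for name in conf["surrogates"]:
       surrogate_class = get_surrogate(name)     # class if it exists, otherwise None
       if surrogate_class is not None:
           all_metrics[name] = run_benchmark(name, surrogate_class, conf)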

codes.benchmark.bench_fcts.tabular_comparison(all_metrics, config)#

Compare the metrics of different surrogate models in a tabular format. Prints a table to the CLI, saves the table into a text file, and saves a CSV file with all metrics. Also saves a CSV file with only the metrics that appear in the CLI table.

Return type:

None

codes.benchmark.bench_fcts.time_inference(model, surr_name, test_loader, conf, n_test_samples, n_runs=5)#

Time the inference of the surrogate model (full version with metrics).

Parameters:
  • model – Instance of the surrogate model class.

  • surr_name (str) – The name of the surrogate model.

  • test_loader (DataLoader) – The DataLoader object containing the test data.

  • conf (dict) – The configuration dictionary.

  • n_test_samples (int) – The number of test samples.

  • n_runs (int, optional) – Number of times to run the inference for timing.

Returns:

A dictionary containing timing metrics.

Return type:

dict

codes.benchmark.bench_plots module#

codes.benchmark.bench_plots.get_custom_palette(num_colors)#

Returns a list of colors sampled from a custom color palette.

Parameters:

num_colors (int) – The number of colors needed.

Returns:

A list of RGBA color tuples.

Return type:

list

codes.benchmark.bench_plots.inference_time_bar_plot(surrogates, means, stds, config, save=True, show_title=True)#

Plot the mean inference time with standard deviation for different surrogate models.

Parameters:
  • surrogates (List[str]) – List of surrogate model names.

  • means (List[float]) – List of mean inference times for each surrogate model.

  • stds (List[float]) – List of standard deviation of inference times for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool, optional) – Whether to save the plot. Defaults to True.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.bench_plots.plot_MAE_comparison(MAEs, labels, config, save=True, show_title=True)#

Plot the MAE for different surrogate models.

Parameters:
  • MAEs (tuple) – Tuple of MAE arrays for each surrogate model.

  • labels (tuple) – Tuple of labels for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.bench_plots.plot_all_generalization_errors(all_metrics, config, show_title=True)#

Make one comparative plot of the interpolation, extrapolation, sparse, and batch size errors. Only the modalities present in all_metrics are plotted.

Parameters:
  • all_metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.bench_plots.plot_average_errors_over_time(surr_name, conf, errors, metrics, timesteps, mode, save=False, show_title=True)#

Plot Δdex errors over time for different evaluation modes.

Parameters:
  • surr_name (str) – Surrogate name.

  • conf (dict) – Config dictionary.

  • errors (np.ndarray) – Errors [N_metrics, n_timesteps].

  • metrics (np.ndarray) – Metrics [N_metrics].

  • timesteps (np.ndarray) – Timesteps.

  • mode (str) – One of ‘interpolation’, ‘extrapolation’, ‘sparse’, ‘batchsize’.

Return type:

None

codes.benchmark.bench_plots.plot_average_uncertainty_over_time(surr_name, conf, errors_time, preds_std, timesteps, save=False, show_title=True)#

Plot average predictive uncertainty and errors over time in log-space (dex).

Parameters:
  • surr_name (str) – Name of the surrogate model.

  • conf (dict) – Configuration dictionary.

  • errors_time (np.ndarray) – Log-space prediction errors over time.

  • preds_std (np.ndarray) – Log-space ensemble standard deviation over time.

  • timesteps (np.ndarray) – Array of timesteps.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show a title.

Return type:

None

codes.benchmark.bench_plots.plot_catastrophic_detection_curves(errors_log, std_log, conf, percentiles=(99.0, 90.0), flag_fractions=(0.0, 0.01, 0.05, 0.1, 0.2, 0.3), save=True, show_title=True)#

Plot catastrophic error recall (Δdex) vs fraction flagged by uncertainty, across multiple catastrophic percentiles, plus performance improvement curves.

Catastrophic errors are those with Δdex >= P_cat (e.g. 99th percentile). For each flag fraction f, the uncertainty threshold is the (1 - f) quantile of std. We then flag samples with std >= threshold and measure recall among catastrophic samples.

Additionally, computes the effective log-space MAE if flagged samples are replaced by solver outputs (0 error). This measures how much UQ-guided deferral improves performance.

Parameters:
  • errors_log (dict) – Δdex arrays [N, T, Q] per surrogate.

  • std_log (dict) – Log-space std arrays [N, T, Q] per surrogate.

  • conf (dict) – Configuration dictionary.

  • percentiles (tuple) – Catastrophic percentiles to evaluate (default: (99.0, 90.0)).

  • flag_fractions (tuple) – Fractions of predictions to flag (includes 0 for baseline).

  • save (bool) – Save figure.

  • show_title (bool) – Add title.

Returns:

Nested dict of results:

summary[surrogate][percentile] = {‘flag_fraction’: f, ‘recall’: r, ‘cat_threshold’: thr, ‘mae_curve’: [(f, mae), …]}

Return type:

dict
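The thresholding logic described above can be sketched with plain numpy on synthetic data (illustrative only, not the package code):

   import numpy as np

   # Sketch of the catastrophic-detection thresholding described above.
   rng = np.random.default_rng(1)
   delta_dex = rng.exponential(0.1, size=10_000)             # log-space errors
   std_log = delta_dex * 0.8 + rng.normal(0, 0.02, 10_000)   # correlated uncertainty

   p_cat = 99.0
   cat_threshold = np.percentile(delta_dex, p_cat)
   is_catastrophic = delta_dex >= cat_threshold

   for f in (0.01, 0.05, 0.1):
       flag_threshold = np.quantile(std_log, 1.0 - f)        # flag the f most uncertain samples
       flagged = std_log >= flag_threshold
       recall = (flagged & is_catastrophic).sum() / is_catastrophic.sum()
       print(f"flag fraction {f:.2f}: recall of catastrophic errors = {recall:.2f}")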

codes.benchmark.bench_plots.plot_comparative_error_correlation_heatmaps(preds_std, errors, avg_correlations, axis_max, max_count, config, save=True, show_title=True)#

Comparative heatmaps of log-space uncertainty vs Δdex.

Parameters:
  • preds_std (dict) – Log-space std arrays per surrogate.

  • errors (dict) – Δdex arrays per surrogate.

  • avg_correlations (dict) – Pearson r per surrogate (log-space).

  • axis_max (dict) – Axis maxima from per-surrogate plots.

  • max_count (dict) – Peak counts for normalization per surrogate.

  • config (dict) – Configuration dictionary.

  • save (bool) – Save figure.

  • show_title (bool) – Add title.

Return type:

None

codes.benchmark.bench_plots.plot_comparative_gradient_heatmaps(gradients, errors, avg_correlations, max_grad, max_err, max_count, config, save=True, show_title=True)#

Plot comparative heatmaps of correlation between gradient and prediction errors for multiple surrogate models.

Parameters:
  • gradients (dict[str, np.ndarray]) – Dictionary of gradients from the ensemble of models.

  • errors (dict[str, np.ndarray]) – Dictionary of prediction errors.

  • avg_correlations (dict[str, float]) – Dictionary of average correlations between gradients and prediction errors.

  • max_grad (dict[str, float]) – Dictionary of maximum gradient values for axis scaling across models.

  • max_err (dict[str, float]) – Dictionary of maximum error values for axis scaling across models.

  • max_count (dict[str, float]) – Dictionary of maximum count values for heatmap normalization across models.

  • config (dict) – Configuration dictionary.

  • save (bool, optional) – Whether to save the plot. Defaults to True.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.bench_plots.plot_dynamic_correlation(surr_name, conf, gradients, errors, save=False, show_title=True)#

Plot the correlation between the gradients of the data and the prediction errors.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • gradients (np.ndarray) – The gradients of the data.

  • errors (np.ndarray) – The prediction errors.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

codes.benchmark.bench_plots.plot_error_distribution_comparative(errors, conf, save=True, show_title=True, mode='relative')#

Plot comparative error distributions for each surrogate model.

Parameters:
  • errors (dict) – Model → array of errors [num_samples, num_timesteps, num_quantities].

  • conf (dict) – Configuration dictionary.

  • save (bool) – Whether to save the figure.

  • show_title (bool) – Whether to add a title.

  • mode (str) – “relative” (unitless %) or “deltadex” (log-space abs. errors).

Return type:

None

codes.benchmark.bench_plots.plot_error_distribution_per_quantity(surr_name, conf, errors, quantity_names=None, num_quantities=10, mode='relative', save=True, show_title=True)#

Plot the distribution of errors for each quantity as a smoothed histogram plot.

  • mode=”relative”:

    Errors are relative (0..∞). Histogram is plotted in log-space (x-axis log-scaled).

  • mode=”deltadex”:

    Errors are absolute log-space errors (Δdex ≥ 0). Histogram is plotted on linear scale.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • errors (np.ndarray) – Errors array of shape [num_samples, num_timesteps, num_quantities].

  • quantity_names (list, optional) – List of quantity names for labeling the lines.

  • num_quantities (int, optional) – Number of quantities to plot. Default is 10.

  • mode (str, optional) – “relative” or “deltadex”. Default is “relative”.

  • save (bool, optional) – Whether to save the plot as a file.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.bench_plots.plot_error_percentiles_over_time(surr_name, conf, errors, timesteps, title, mode='relative', save=False, show_title=True)#

Plot mean, median, and percentile error envelopes over time.

  • mode=”relative”:

    Treats errors as relative errors (0..∞). Plots bidirectional percentile bands (25-75, 5-95, 0.5-99.5). Y-axis is log-scaled.

  • mode=”deltadex”:

    Treats errors as log-space absolute errors (Δdex ≥ 0). Plots one-sided percentile bands (0-50, 0-90, 0-99). Y-axis is linear, starting at 0.

Parameters:
  • surr_name (str) – Name of the surrogate model (used for saving).

  • conf (dict) – Configuration dictionary containing dataset and output settings.

  • errors (np.ndarray) – Error array of shape [N_samples, N_timesteps, N_quantities]. Values are either relative errors or Δdex depending on mode.

  • timesteps (np.ndarray) – Array of timesteps corresponding to the second axis of errors.

  • title (str) – Title for the plot.

  • mode (str, optional) – “relative” for relative errors, “deltadex” for log-space absolute errors. Defaults to “relative”.

  • save (bool, optional) – Whether to save the plot to disk. Defaults to False.

  • show_title (bool, optional) – Whether to show the plot title. Defaults to True.

Return type:

None

Returns:

None
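A small numpy sketch of the one-sided Δdex percentile envelopes described above, following the [N_samples, N_timesteps, N_quantities] convention (synthetic data, illustrative only):

   import numpy as np

   # Illustrative delta-dex percentile envelopes over time.
   rng = np.random.default_rng(2)
   errors = np.abs(rng.normal(0.0, 0.1, size=(500, 100, 10)))         # delta dex >= 0
   flat = errors.transpose(1, 0, 2).reshape(errors.shape[1], -1)      # [T, N * Q]

   mean_curve = flat.mean(axis=1)
   median_curve = np.median(flat, axis=1)
   bands = {p: np.percentile(flat, p, axis=1) for p in (50, 90, 99)}  # one-sided upper envelopes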

codes.benchmark.bench_plots.plot_errors_over_time(mean_errors, median_errors, timesteps, config, save=True, show_title=True, mode='relative')#

Plot errors over time for different surrogate models (relative or Δdex).

Parameters:
  • mean_errors (dict) – Mean errors for each surrogate.

  • median_errors (dict) – Median errors for each surrogate.

  • timesteps (np.ndarray) – Array of timesteps.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the figure.

  • show_title (bool) – Whether to add a title.

  • mode (str) – “relative” (percentage errors) or “deltadex” (log-space abs. errors).

Return type:

None

codes.benchmark.bench_plots.plot_example_iterative_predictions(surr_name, conf, iterative_preds, full_preds, targets, timesteps, iter_interval, example_idx=None, num_quantities=100, labels=None, save=False, show_title=True)#

Plot one sample’s full iterative trajectory: ground truth vs. chained predictions, with retrigger lines.

Return type:

None

codes.benchmark.bench_plots.plot_example_mode_predictions(surr_name, conf, preds_log, preds_main_log, targets_log, timesteps, metric, mode='interpolation', example_idx=0, num_quantities=100, labels=None, save=False, show_title=True)#

Plot example predictions in log-space (Δdex) alongside ground truth targets for either interpolation or extrapolation mode.

Predictions and targets are assumed to be in log10 space (leave_log=True). Axis labels and plotted values are consistent with this log representation.

Parameters:
  • surr_name (str) – Name of the surrogate model.

  • conf (dict) – Configuration dictionary.

  • preds_log (np.ndarray) – Predictions in log-space of shape [N_samples, T, Q].

  • preds_main_log (np.ndarray) – Main model (reference) predictions in log-space of shape [N_samples, T, Q].

  • targets_log (np.ndarray) – Targets in log-space of shape [N_samples, T, Q].

  • timesteps (np.ndarray) – Array of timesteps.

  • metric (int) –

    • In interpolation mode: the training interval (e.g., 10 means every 10th timestep was used).

    • In extrapolation mode: the cutoff timestep index.

  • mode (str, optional) – Either “interpolation” or “extrapolation”. Default is “interpolation”.

  • example_idx (int, optional) – Index of the example to plot. Default is 0.

  • num_quantities (int, optional) – Maximum number of quantities to plot. Default is 100.

  • labels (list[str], optional) – Names of the quantities to display in legends.

  • save (bool, optional) – Whether to save the figure. Default is False.

  • show_title (bool, optional) – Whether to add a title to the figure. Default is True.

Return type:

None

Returns:

None

codes.benchmark.bench_plots.plot_example_predictions_with_uncertainty(surr_name, conf, log_mean, log_std, log_targets, timesteps, example_idx=0, num_quantities=100, labels=None, save=False, show_title=True)#

Plot example predictions with uncertainty in log10 space (dex).

Parameters:
  • surr_name (str) – Name of the surrogate model.

  • conf (dict) – Configuration dictionary.

  • log_mean (np.ndarray) – Ensemble mean predictions in log10 space.

  • log_std (np.ndarray) – Ensemble standard deviation in log10 space.

  • log_targets (np.ndarray) – Ground truth targets in log10 space.

  • timesteps (np.ndarray) – Array of timesteps.

  • example_idx (int) – Index of the example to plot.

  • num_quantities (int) – Number of species/quantities to plot.

  • labels (list, optional) – Quantity labels.

  • save (bool) – Whether to save the figure.

  • show_title (bool) – Whether to display a title.

Return type:

None

codes.benchmark.bench_plots.plot_generalization_error_comparison(surrogates, metrics_list, model_errors_list, xlabel, filename, config, save=True, xlog=False, show_title=True)#

Plot the generalization errors of different surrogate models.

Parameters:
  • surrogates (list) – List of surrogate model names.

  • metrics_list (list[np.array]) – List of numpy arrays containing the metrics for each surrogate model.

  • model_errors_list (list[np.array]) – List of numpy arrays containing the errors for each surrogate model.

  • xlabel (str) – Label for the x-axis.

  • filename (str) – Filename to save the plot.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • xlog (bool) – Whether to use a log scale for the x-axis.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.bench_plots.plot_generalization_errors(surr_name, conf, metrics, model_errors, mode, save=False, show_title=True)#

Plot the generalization errors of a model for various metrics.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • metrics (np.ndarray) – The metrics (e.g., intervals, cutoffs, batch sizes, number of training samples).

  • model_errors (np.ndarray) – The model errors.

  • mode (str) – The mode of generalization (“interpolation”, “extrapolation”, “sparse”, “batchsize”).

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.bench_plots.plot_gradients_heatmap(surr_name, conf, gradients, errors_log, average_correlation, save=False, cutoff_mass=0.98, show_title=True)#

Plot correlation between gradients (normalized) and Δdex errors using a heatmap.

Both gradients and errors are in log-space. Gradients are normalized, errors are absolute log differences (Δdex).

Parameters:
  • surr_name (str) – Surrogate name.

  • conf (dict) – Config dictionary.

  • gradients (np.ndarray) – Normalized log-space gradients.

  • errors_log (np.ndarray) – Δdex errors.

  • average_correlation (float) – Mean correlation value.

  • save (bool) – Save plot.

  • cutoff_mass (float) – Fraction of mass to retain in axes.

  • show_title (bool) – Show title.

Returns:

Histogram stats for reuse.

Return type:

(max_value, x_max, y_max)

codes.benchmark.bench_plots.plot_loss_comparison(train_losses, test_losses, labels, config, save=True, show_title=True)#

Plot the training and test losses for different surrogate models.

Parameters:
  • train_losses (tuple) – Tuple of training loss arrays for each surrogate model.

  • test_losses (tuple) – Tuple of test loss arrays for each surrogate model.

  • labels (tuple) – Tuple of labels for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.bench_plots.plot_loss_comparison_equal(train_losses, test_losses, labels, config, save=True, show_title=True)#

Plot the test loss trajectories for different surrogate models on a single plot, after log-transforming and normalizing each trajectory. This makes it easier to see convergence behavior even when the losses span several orders of magnitude. Numeric y-axis labels are removed.

Each loss trajectory is processed as follows:

  1. Log-transform the loss values.

  2. Normalize the log-transformed values to the range [0, 1].

  3. Plot the normalized trajectory on a normalized x-axis.

Parameters:
  • train_losses (tuple) – Tuple of training loss arrays for each surrogate model.

  • test_losses (tuple) – Tuple of test loss arrays for each surrogate model.

  • labels (tuple) – Tuple of labels for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None
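The per-trajectory normalization can be sketched as follows (illustrative, assuming strictly positive loss values):

   import numpy as np

   # Sketch of the per-trajectory normalization described above.
   def normalize_loss(loss):
       log_loss = np.log10(loss)                       # 1. log-transform
       lo, hi = log_loss.min(), log_loss.max()
       return (log_loss - lo) / (hi - lo)              # 2. scale to [0, 1]

   loss = np.logspace(0, -4, 200)                      # toy decaying loss curve
   x = np.linspace(0.0, 1.0, loss.size)                # 3. normalized x-axis
   y = normalize_loss(loss)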

codes.benchmark.bench_plots.plot_loss_comparison_train_duration(test_losses, labels, train_durations, config, save=True, show_title=True)#

Plot the test loss trajectories for different surrogate models over training duration.

Parameters:
  • test_losses (tuple) – Tuple of test loss arrays for each surrogate model.

  • labels (tuple) – Tuple of labels for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.bench_plots.plot_losses(loss_histories, epochs, labels, title='Losses', save=False, conf=None, surr_name=None, mode='main', percentage=2.0, show_title=True)#

Plot the loss trajectories for the training of multiple models.

Parameters:
  • loss_histories (tuple[array, ...]) – Tuple of loss history arrays.

  • epochs (int) – Number of epochs.

  • labels (tuple[str, ...]) – Tuple of labels for each loss history.

  • title (str) – Title of the plot.

  • save (bool) – Whether to save the plot as an image file.

  • conf (Optional[dict]) – The configuration dictionary.

  • surr_name (Optional[str]) – The name of the surrogate model.

  • mode (str) – The mode of the training.

  • percentage (float) – Percentage of initial values to exclude from min-max calculation.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.bench_plots.plot_losses_dual_axis(train_loss, test_loss, labels=('Train Loss', 'Test Loss'), title='Losses', save=False, conf=None, surr_name=None, show_title=True)#

Plot the training and test loss trajectories with dual y-axes.

Parameters:
  • train_loss (array) – Training loss history array.

  • test_loss (array) – Test loss history array.

  • labels (tuple[str, str]) – Labels for the losses (train and test).

  • title (str) – Title of the plot.

  • save (bool) – Whether to save the plot as an image file.

  • conf (Optional[dict]) – The configuration dictionary.

  • surr_name (Optional[str]) – The name of the surrogate model.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.bench_plots.plot_mean_deltadex_over_time_main_vs_ensemble(main_errors, ensemble_errors, timesteps, config, save=True, show_title=True)#

Plot mean Δdex over time for each surrogate: main vs ensemble.

Parameters:
  • main_errors (dict) – Main model Δdex arrays [N, T, Q] per surrogate.

  • ensemble_errors (dict) – Ensemble Δdex arrays [N, T, Q] per surrogate.

  • timesteps (np.ndarray) – Timesteps array.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save figure.

  • show_title (bool) – Whether to add a title.

Return type:

None

codes.benchmark.bench_plots.plot_relative_errors(mean_errors, median_errors, timesteps, config, save=True, show_title=True)#

Plot the relative errors over time for different surrogate models.

Parameters:
  • mean_errors (dict) – Dictionary containing the mean relative errors for each surrogate model.

  • median_errors (dict) – Dictionary containing the median relative errors for each surrogate model.

  • timesteps (np.ndarray) – Array of timesteps.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.bench_plots.plot_surr_losses(model, surr_name, conf, timesteps, show_title=True)#

Plot the training and test losses for the surrogate model.

Parameters:
  • model – Instance of the surrogate model class.

  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • timesteps (np.ndarray) – The timesteps array.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.bench_plots.plot_uncertainty_confidence(weighted_diffs, conf, save=True, percentile=2, summary_stat='mean', show_title=True)#

Plot a comparative grouped bar chart of catastrophic confidence measures and return a metric quantifying the net skew of over- versus underconfidence.

For each surrogate model, the target-weighted difference is computed as:

weighted_diff = (predicted uncertainty - |prediction - target|) / target.

Negative values indicate overconfidence (i.e. the model’s uncertainty is too low relative to its error), while positive values indicate underconfidence.

Catastrophic events are defined as those samples in the lowest percentile (e.g. 2nd percentile) for overconfidence and in the highest percentile (i.e. 100 - percentile) for underconfidence.

For each surrogate, this function computes the mean and standard deviation of the weighted differences in both tails, then plots them as grouped bars (overconfidence bars on the left, underconfidence bars on the right) with standard error bars (thin, with capsize=3). The bar heights are expressed in percentages.

The text labels for the bars are placed on the opposite side of the x-axis: for negative (overconfident) values the annotation is shown a few pixels above the x-axis, and for positive (underconfident) values it is shown a few pixels below the x-axis.

The plot title includes the metric (mean ± std) and the number of samples (per tail).

Additionally, if the range between the smallest and largest bar is more than two orders of magnitude, the y-axis is set to a symmetric log scale.

Parameters:
  • weighted_diffs (dict[str, np.ndarray]) – Dictionary of weighted_diff arrays for each surrogate model.

  • conf (dict) – The configuration dictionary.

  • save (bool, optional) – Whether to save the plot.

  • percentile (float, optional) – Percentile threshold for defining catastrophic events (default is 2).

  • summary_stat (str, optional) – Currently only “mean” is implemented.

  • show_title (bool) – Whether to show the title on the plot.

Returns:

A dictionary mapping surrogate names to the net difference (over_summary + under_summary).

Return type:

dict[str, float]
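A numpy sketch of the weighted difference and its tail statistics as described above (synthetic data, illustrative only):

   import numpy as np

   # Sketch of the weighted-difference and tail statistics described above.
   rng = np.random.default_rng(3)
   pred = rng.normal(1.0, 0.1, 5000)
   target = np.ones(5000)
   uncertainty = np.abs(rng.normal(0.08, 0.02, 5000))

   weighted_diff = (uncertainty - np.abs(pred - target)) / target    # < 0: overconfident

   percentile = 2
   over_tail = weighted_diff[weighted_diff <= np.percentile(weighted_diff, percentile)]
   under_tail = weighted_diff[weighted_diff >= np.percentile(weighted_diff, 100 - percentile)]
   net_skew = over_tail.mean() + under_tail.mean()                   # net over-/underconfidence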

codes.benchmark.bench_plots.plot_uncertainty_heatmap(surr_name, conf, preds_std, errors, average_correlation, save=True, cutoff_mass=0.98, show_title=True)#

Plot correlation between predictive log-space uncertainty and log-space errors (delta dex).

Parameters:
  • surr_name (str) – Name of the surrogate model.

  • conf (dict) – Configuration dictionary.

  • preds_std (np.ndarray) – Log-space ensemble standard deviation.

  • errors (np.ndarray) – Log-space prediction errors.

  • average_correlation (float) – Correlation between log uncertainty and log error.

  • save (bool) – Whether to save the figure.

  • cutoff_mass (float) – Fraction of mass to keep in histogram.

  • show_title (bool) – Whether to show a title.

Returns:

(max histogram count, axis_max used for plotting).

Return type:

tuple

codes.benchmark.bench_plots.plot_uncertainty_over_time_comparison(uncertainties, absolute_errors, timesteps, config, save=True, show_title=True)#

Plot log-space uncertainty and Δdex over time for multiple surrogates.

Parameters:
  • uncertainties (dict) – Mean log-space std over time per surrogate (1σ time series).

  • absolute_errors (dict) – Δdex arrays [N, T, Q] per surrogate.

  • timesteps (np.ndarray) – Timesteps array.

  • config (dict) – Configuration dictionary.

  • save (bool) – Save figure.

  • show_title (bool) – Add title.

Return type:

None

codes.benchmark.bench_plots.plot_uncertainty_vs_errors(surr_name, conf, preds_std, errors, save=False, show_title=True)#

Plot the correlation between predictive uncertainty and prediction errors.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • preds_std (np.ndarray) – Standard deviation of predictions from the ensemble of models.

  • errors (np.ndarray) – Prediction errors.

  • save (bool, optional) – Whether to save the plot as a file.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.bench_plots.rel_errors_and_uq(metrics, config, save=True, show_title=True)#

Create a figure with two subplots: relative errors over time and uncertainty over time for different surrogate models.

Parameters:
  • metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.bench_plots.save_plot(plt, filename, conf, surr_name='', dpi=300, base_dir='plots', increase_count=False, format='jpg')#

Save the plot to a file, creating necessary directories if they don’t exist.

Parameters:
  • plt (matplotlib.pyplot) – The plot object to save.

  • filename (str) – The desired filename for the plot.

  • conf (dict) – The configuration dictionary.

  • surr_name (str) – The name of the surrogate model.

  • dpi (int) – The resolution of the saved plot.

  • base_dir (str, optional) – The base directory where plots will be saved. Default is “plots”.

  • increase_count (bool, optional) – Whether to increment the filename count if a file already exists. Default is False.

  • format (str, optional) – The format for saving the plot. Default is “jpg”. Can be “jpg”, “png”, “pdf”, “svg”, etc.

Raises:

ValueError – If the configuration dictionary does not contain the required keys.

Return type:

None

codes.benchmark.bench_plots.save_plot_counter(filename, directory, increase_count=True)#

Save a plot with an incremented filename if a file with the same name already exists.

Parameters:
  • filename (str) – The desired filename for the plot.

  • directory (str) – The directory to save the plot in.

  • increase_count (bool, optional) – Whether to increment the filename count if a file already exists. Default is True.

Returns:

The full path to the saved plot.

Return type:

str
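A minimal sketch of this counter behaviour; the exact suffix scheme (here _1, _2, …) is an assumption:

   import os

   # Illustrative sketch of incremental filename resolution, not the package code.
   def next_free_path(filename, directory):
       base, ext = os.path.splitext(filename)
       candidate = os.path.join(directory, filename)
       count = 1
       while os.path.exists(candidate):
           candidate = os.path.join(directory, f"{base}_{count}{ext}")
           count += 1
       return candidate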

codes.benchmark.bench_utils module#

codes.benchmark.bench_utils.check_benchmark(conf)#

Check whether there are any configuration issues with the benchmark.

Parameters:

conf (dict) – The configuration dictionary.

Raises:
  • FileNotFoundError – If the training ID directory is missing or if the .yaml file is missing.

  • ValueError – If the configuration is missing required keys or the values do not match the training configuration.

Return type:

None

codes.benchmark.bench_utils.check_surrogate(surrogate, conf)#

Check whether the required models for the benchmark are present in the expected directories.

Parameters:
  • surrogate (str) – The name of the surrogate model to check.

  • conf (dict) – The configuration dictionary.

Raises:

FileNotFoundError – If any required models are missing.

Return type:

None

codes.benchmark.bench_utils.clean_metrics(metrics, conf)#

Clean the metrics dictionary to remove problematic entries.

Parameters:
  • metrics (dict) – The benchmark metrics.

  • conf (dict) – The configuration dictionary.

Returns:

The cleaned metrics dictionary.

Return type:

dict

codes.benchmark.bench_utils.convert_dict_to_scientific_notation(d, precision=8)#

Convert all numerical values in a dictionary to scientific notation.

Parameters:

d (dict) – The input dictionary.

Returns:

The dictionary with numerical values in scientific notation.

Return type:

dict

codes.benchmark.bench_utils.convert_to_standard_types(data)#

Recursively convert data to standard types that can be serialized to YAML.

Parameters:

data – The data to convert.

Returns:

The converted data.

codes.benchmark.bench_utils.count_trainable_parameters(model)#

Count the number of trainable parameters in the model.

Parameters:

model (torch.nn.Module) – The PyTorch model.

Returns:

The number of trainable parameters.

Return type:

int
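This corresponds to the standard PyTorch idiom for counting trainable parameters (a sketch; the package implementation may differ in detail):

   import torch

   # Count only parameters that require gradients.
   def count_trainable(model: torch.nn.Module) -> int:
       return sum(p.numel() for p in model.parameters() if p.requires_grad)

   print(count_trainable(torch.nn.Linear(10, 5)))   # 55 = 10*5 weights + 5 biases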

codes.benchmark.bench_utils.discard_numpy_entries(d)#

Recursively remove dictionary entries that contain NumPy arrays.

Parameters:

d (dict) – The input dictionary.

Returns:

A new dictionary without entries containing NumPy arrays.

Return type:

dict

codes.benchmark.bench_utils.flatten_dict(d, parent_key='', sep=' - ')#

Flatten a nested dictionary.

Parameters:
  • d (dict) – The dictionary to flatten.

  • parent_key (str) – The base key string.

  • sep (str) – The separator between keys.

Returns:

Flattened dictionary with composite keys.

Return type:

dict
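An illustrative sketch of the flattening with the default “ - ” separator:

   # Recursive flattening with composite keys (illustrative sketch).
   def flatten(d, parent_key="", sep=" - "):
       items = {}
       for key, value in d.items():
           new_key = f"{parent_key}{sep}{key}" if parent_key else key
           if isinstance(value, dict):
               items.update(flatten(value, new_key, sep))
           else:
               items[new_key] = value
       return items

   print(flatten({"accuracy": {"mean": 0.01, "median": 0.008}, "runtime": 1.2}))
   # {'accuracy - mean': 0.01, 'accuracy - median': 0.008, 'runtime': 1.2}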

codes.benchmark.bench_utils.format_seconds(seconds)#

Format a duration given in seconds as hh:mm:ss.

Parameters:

seconds (int) – The duration in seconds.

Returns:

The formatted duration string.

Return type:

str

codes.benchmark.bench_utils.format_time(mean_time, std_time)#

Format mean and std time consistently in ns, µs, ms, or s.

Parameters:
  • mean_time – The mean time.

  • std_time – The standard deviation of the time.

Returns:

The formatted time string.

Return type:

str

codes.benchmark.bench_utils.format_value(v, suffix='')#

Format a float with ~3 significant digits, in fixed or scientific notation. Optionally append a suffix (e.g. ‘dex’, ‘MB’).

Parameters:
  • v (float) – Value to format.

  • suffix (str) – Unit/suffix to append, e.g. ‘dex’. Ignored if as_percent=True.

Returns:

Formatted string with optional suffix or percent.

Return type:

str

codes.benchmark.bench_utils.get_model_config(surr_name, config)#

Get the model configuration for a specific surrogate model from the dataset folder. Returns an empty dictionary if config[“dataset”][“use_optimal_params”] is False, or if no configuration file is found in the dataset folder.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • config (dict) – The configuration dictionary.

Returns:

The model configuration dictionary.

Return type:

dict

codes.benchmark.bench_utils.get_required_models_list(surrogate, conf)#

Generate a list of required models based on the configuration settings.

Parameters:
  • surrogate (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

Returns:

A list of required model names.

Return type:

list

codes.benchmark.bench_utils.get_surrogate(surrogate_name)#

Check if the surrogate model exists.

Parameters:

surrogate_name (str) – The name of the surrogate model.

Returns:

The surrogate model class if it exists, otherwise None.

Return type:

SurrogateModel | None

codes.benchmark.bench_utils.load_model(model, training_id, surr_name, model_identifier)#

Load a trained surrogate model.

Parameters:
  • model – Instance of the surrogate model class.

  • training_id (str) – The training identifier.

  • surr_name (str) – The name of the surrogate model.

  • model_identifier (str) – The identifier of the model (e.g., ‘main’).

Return type:

Module

Returns:

The loaded surrogate model.

codes.benchmark.bench_utils.make_comparison_csv(metrics, config)#

Generate a CSV file comparing metrics for different surrogate models.

Parameters:
  • metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.bench_utils.measure_inference_time(model, test_loader, n_runs=5)#

Measure total inference time over a DataLoader across multiple runs.

Parameters:
  • model – Model instance with a .forward() method.

  • test_loader (DataLoader) – Loader with test data.

  • n_runs (int) – Number of repeated runs for averaging.

Returns:

List of total inference times per run (in seconds).

Return type:

list[float]
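A sketch of the repeated-run timing loop; the (inputs, targets) batch structure and the toy model and loader are illustrative assumptions:

   import time

   import torch

   # Time the full pass over the loader, repeated n_runs times (illustrative sketch).
   def time_runs(model, loader, n_runs=5):
       times = []
       model.eval()
       with torch.no_grad():
           for _ in range(n_runs):
               start = time.perf_counter()
               for inputs, _targets in loader:
                   model(inputs)
               times.append(time.perf_counter() - start)   # total seconds for this run
       return times

   loader = torch.utils.data.DataLoader(
       torch.utils.data.TensorDataset(torch.randn(64, 10), torch.randn(64, 1)),
       batch_size=16,
   )
   print(time_runs(torch.nn.Linear(10, 1), loader))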

codes.benchmark.bench_utils.measure_memory_footprint(model, inputs)#

Measure the memory footprint of a model during forward and backward passes using peak memory tracking and explicit synchronization.

Parameters:
  • model (torch.nn.Module) – The PyTorch model.

  • inputs (tuple) – The input data for the model.

Returns:

A dictionary containing measured memory usages for:
  • model_memory: Additional memory used when moving the model to GPU.

  • forward_memory: Peak additional memory during the forward pass with gradients.

  • backward_memory: Peak additional memory during the backward pass.

  • forward_memory_nograd: Peak additional memory during the forward pass without gradients.

  • model: The model (possibly moved back to the original device).

Return type:

dict
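A sketch of peak-memory tracking for a forward pass with torch.cuda utilities (requires a CUDA device; illustrative only, not the package's exact implementation):

   import torch

   # Peak-memory tracking around a forward pass, with explicit synchronization.
   def forward_peak_memory(model, inputs):
       device = torch.device("cuda")
       model = model.to(device)
       inputs = tuple(x.to(device) for x in inputs)
       torch.cuda.synchronize()
       torch.cuda.reset_peak_memory_stats(device)
       before = torch.cuda.memory_allocated(device)
       model(*inputs)                                   # forward pass with gradients enabled
       torch.cuda.synchronize()
       return torch.cuda.max_memory_allocated(device) - before   # peak extra bytes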

codes.benchmark.bench_utils.save_table_csv(headers, rows, config)#

Save the CLI table (headers and rows) to a CSV file. This version strips out any formatting (like asterisks) from the table cells.

Parameters:
  • headers (list) – The list of header names.

  • rows (list) – The list of rows, where each row is a list of string values.

  • config (dict) – Configuration dictionary that contains ‘training_id’.

Return type:

None

Returns:

None

codes.benchmark.bench_utils.write_metrics_to_yaml(surr_name, conf, metrics)#

Write the benchmark metrics to a YAML file.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • metrics (dict) – The benchmark metrics.

Return type:

None

Module contents#

codes.benchmark.check_benchmark(conf)#

Check whether there are any configuration issues with the benchmark.

Parameters:

conf (dict) – The configuration dictionary.

Raises:
  • FileNotFoundError – If the training ID directory is missing or if the .yaml file is missing.

  • ValueError – If the configuration is missing required keys or the values do not match the training configuration.

Return type:

None

codes.benchmark.check_surrogate(surrogate, conf)#

Check whether the required models for the benchmark are present in the expected directories.

Parameters:
  • surrogate (str) – The name of the surrogate model to check.

  • conf (dict) – The configuration dictionary.

Raises:

FileNotFoundError – If any required models are missing.

Return type:

None

codes.benchmark.clean_metrics(metrics, conf)#

Clean the metrics dictionary to remove problematic entries.

Parameters:
  • metrics (dict) – The benchmark metrics.

  • conf (dict) – The configuration dictionary.

Returns:

The cleaned metrics dictionary.

Return type:

dict

codes.benchmark.compare_UQ(all_metrics, config)#
Compare log-space UQ across surrogates, focusing on:
  • Ensemble vs Main errors (Δdex) over time

  • Correlation between log-space uncertainty and errors

  • Catastrophic-error detection from uncertainty thresholds

Parameters:
  • all_metrics (dict) – Benchmark metrics per surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.compare_batchsize(all_metrics, config)#

Compare the batch size training errors of different surrogate models.

Parameters:
  • all_metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.compare_errors(metrics, config)#

Compare relative errors and Δdex errors over time for different surrogate models.

Parameters:
  • metrics (dict) – Benchmark metrics for each surrogate.

  • config (dict) – Configuration dictionary.

Return type:

None

codes.benchmark.compare_extrapolation(all_metrics, config)#

Compare the extrapolation errors of different surrogate models.

Parameters:
  • all_metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.compare_gradients(metrics, config)#

Compare the gradients of different surrogate models.

Parameters:
  • metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.compare_inference_time(metrics, config, save=True)#

Compare the mean inference time of different surrogate models.

Parameters:
  • metrics (dict[str, dict]) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool, optional) – Whether to save the plot. Defaults to True.

Return type:

None

Returns:

None

codes.benchmark.compare_interpolation(all_metrics, config)#

Compare the interpolation errors of different surrogate models.

Parameters:
  • all_metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.compare_main_losses(metrics, config)#

Compare the training and test losses of the main models for different surrogate models.

Parameters:
  • metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.compare_models(metrics, config)#
codes.benchmark.compare_sparse(all_metrics, config)#

Compare the sparse training errors of different surrogate models.

Parameters:
  • all_metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.convert_dict_to_scientific_notation(d, precision=8)#

Convert all numerical values in a dictionary to scientific notation.

Parameters:

d (dict) – The input dictionary.

Returns:

The dictionary with numerical values in scientific notation.

Return type:

dict

codes.benchmark.convert_to_standard_types(data)#

Recursively convert data to standard types that can be serialized to YAML.

Parameters:

data – The data to convert.

Returns:

The converted data.

codes.benchmark.count_trainable_parameters(model)#

Count the number of trainable parameters in the model.

Parameters:

model (torch.nn.Module) – The PyTorch model.

Returns:

The number of trainable parameters.

Return type:

int

codes.benchmark.discard_numpy_entries(d)#

Recursively remove dictionary entries that contain NumPy arrays.

Parameters:

d (dict) – The input dictionary.

Returns:

A new dictionary without entries containing NumPy arrays.

Return type:

dict

codes.benchmark.evaluate_UQ(model, surr_name, test_loader, timesteps, conf, labels=None)#

Evaluate the uncertainty quantification (UQ) performance of the surrogate model.

Predictions and targets are kept in log10 space (leave_log=True). All UQ metrics are computed in log space (Δdex).

Parameters:
  • model – The surrogate model instance.

  • surr_name (str) – The name of the surrogate model.

  • test_loader (DataLoader) – DataLoader with the test data.

  • timesteps (np.ndarray) – Array of timesteps.

  • conf (dict) – Configuration dictionary.

  • labels (list, optional) – Labels for the predicted quantities.

Returns:

Dictionary containing log-space UQ metrics and arrays.

Return type:

dict

codes.benchmark.evaluate_accuracy(model, surr_name, timesteps, test_loader, conf, labels=None)#

Evaluate the accuracy of the surrogate model.

Parameters:
  • model – Instance of the surrogate model class.

  • surr_name (str) – The name of the surrogate model.

  • timesteps (np.ndarray) – The timesteps array.

  • test_loader (DataLoader) – The DataLoader object containing the test data.

  • conf (dict) – The configuration dictionary.

  • labels (list, optional) – The labels for the quantities.

  • percentile (int, optional) – The percentile for error metrics.

Returns:

A dictionary containing accuracy metrics.

Return type:

dict

codes.benchmark.evaluate_batchsize(model, surr_name, test_loader, timesteps, conf)#

Evaluate the batch-size scaling performance of the surrogate model in log-space (Δdex).

Predictions and targets are kept in log10 space (leave_log=True). Errors are computed as absolute log differences (Δdex).

Parameters:
  • model – Surrogate model instance.

  • surr_name (str) – Name of the surrogate.

  • test_loader (DataLoader) – DataLoader with test data.

  • timesteps (np.ndarray) – Timesteps array.

  • conf (dict) – Configuration dictionary.

Returns:

Batch-size scaling metrics in log-space.

Return type:

dict

codes.benchmark.evaluate_compute(model, surr_name, test_loader, conf)#

Evaluate the computational resource requirements of the surrogate model.

Parameters:
  • model – Instance of the surrogate model class.

  • surr_name (str) – The name of the surrogate model.

  • test_loader (DataLoader) – The DataLoader object containing the test data.

  • conf (dict) – The configuration dictionary.

Returns:

A dictionary containing model complexity metrics.

Return type:

dict

codes.benchmark.evaluate_extrapolation(model, surr_name, test_loader, timesteps, conf, labels=None)#

Evaluate the extrapolation performance of the surrogate model in log-space (Δdex).

Predictions and targets are kept in log10 space (leave_log=True). Errors are computed as absolute log differences (Δdex).

Parameters:
  • model – Surrogate model instance.

  • surr_name (str) – Name of the surrogate.

  • test_loader (DataLoader) – DataLoader with test data.

  • timesteps (np.ndarray) – Timesteps array.

  • conf (dict) – Configuration dictionary.

  • labels (list, optional) – Labels for the predicted quantities.

Returns:

Extrapolation metrics in log-space.

Return type:

dict

codes.benchmark.evaluate_gradients(model, surr_name, test_loader, conf, species_names=None)#

Evaluate the gradients of the surrogate model in log-space (Δdex).

Predictions and targets are kept in log10 space (leave_log=True). Errors are computed as absolute log differences (Δdex).

Parameters:
  • model – Surrogate model instance.

  • surr_name (str) – Surrogate name.

  • test_loader (DataLoader) – Test data.

  • conf (dict) – Configuration dictionary.

  • species_names (list, optional) – Names of the species/quantities.

Returns:

Gradient–error correlation metrics in log-space.

Return type:

dict

codes.benchmark.evaluate_interpolation(model, surr_name, test_loader, timesteps, conf, labels=None)#

Evaluate the interpolation performance of the surrogate model in log-space (Δdex).

Predictions and targets are kept in log10 space (leave_log=True). Errors are computed as absolute log differences (Δdex).

Parameters:
  • model – Surrogate model instance.

  • surr_name (str) – Name of the surrogate.

  • test_loader (DataLoader) – DataLoader with test data.

  • timesteps (np.ndarray) – Timesteps array.

  • conf (dict) – Configuration dictionary.

  • labels (list, optional) – Labels for the predicted quantities.

Returns:

Interpolation metrics in log-space.

Return type:

dict

codes.benchmark.evaluate_sparse(model, surr_name, test_loader, timesteps, n_train_samples, conf)#

Evaluate the sparse-data training performance of the surrogate model in log-space (Δdex).

Predictions and targets are kept in log10 space (leave_log=True). Errors are computed as absolute log differences (Δdex).

Parameters:
  • model – Surrogate model instance.

  • surr_name (str) – Name of the surrogate.

  • test_loader (DataLoader) – DataLoader with test data.

  • timesteps (np.ndarray) – Timesteps array.

  • n_train_samples (int) – Number of training samples in the full dataset.

  • conf (dict) – Configuration dictionary.

Returns:

Sparse training metrics in log-space.

Return type:

dict

codes.benchmark.flatten_dict(d, parent_key='', sep=' - ')#

Flatten a nested dictionary.

Parameters:
  • d (dict) – The dictionary to flatten.

  • parent_key (str) – The base key string.

  • sep (str) – The separator between keys.

Returns:

Flattened dictionary with composite keys.

Return type:

dict

codes.benchmark.format_seconds(seconds)#

Format a duration given in seconds as hh:mm:ss.

Parameters:

seconds (int) – The duration in seconds.

Returns:

The formatted duration string.

Return type:

str

codes.benchmark.format_time(mean_time, std_time)#

Format mean and std time consistently in ns, µs, ms, or s.

Parameters:
  • mean_time – The mean time.

  • std_time – The standard deviation of the time.

Returns:

The formatted time string.

Return type:

str

codes.benchmark.format_value(v, suffix='')#

Format a float with ~3 significant digits, in fixed or scientific notation. Optionally append a suffix (e.g. ‘dex’, ‘MB’).

Parameters:
  • v (float) – Value to format.

  • suffix (str) – Unit/suffix to append, e.g. ‘dex’. Ignored if as_percent=True.

Returns:

Formatted string with optional suffix or percent.

Return type:

str

codes.benchmark.get_custom_palette(num_colors)#

Returns a list of colors sampled from a custom color palette.

Parameters:

num_colors (int) – The number of colors needed.

Returns:

A list of RGBA color tuples.

Return type:

list

codes.benchmark.get_model_config(surr_name, config)#

Get the model configuration for a specific surrogate model from the dataset folder. Returns an empty dictionary if config[“dataset”][“use_optimal_params”] is False, or if no configuration file is found in the dataset folder.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • config (dict) – The configuration dictionary.

Returns:

The model configuration dictionary.

Return type:

dict

codes.benchmark.get_required_models_list(surrogate, conf)#

Generate a list of required models based on the configuration settings.

Parameters:
  • surrogate (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

Returns:

A list of required model names.

Return type:

list

codes.benchmark.get_surrogate(surrogate_name)#

Check if the surrogate model exists.

Parameters:

surrogate_name (str) – The name of the surrogate model.

Returns:

The surrogate model class if it exists, otherwise None.

Return type:

SurrogateModel | None

codes.benchmark.inference_time_bar_plot(surrogates, means, stds, config, save=True, show_title=True)#

Plot the mean inference time with standard deviation for different surrogate models.

Parameters:
  • surrogates (List[str]) – List of surrogate model names.

  • means (List[float]) – List of mean inference times for each surrogate model.

  • stds (List[float]) – List of standard deviation of inference times for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool, optional) – Whether to save the plot. Defaults to True.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.load_model(model, training_id, surr_name, model_identifier)#

Load a trained surrogate model.

Parameters:
  • model – Instance of the surrogate model class.

  • training_id (str) – The training identifier.

  • surr_name (str) – The name of the surrogate model.

  • model_identifier (str) – The identifier of the model (e.g., ‘main’).

Return type:

Module

Returns:

The loaded surrogate model.

codes.benchmark.make_comparison_csv(metrics, config)#

Generate a CSV file comparing metrics for different surrogate models.

Parameters:
  • metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

Return type:

None

Returns:

None

codes.benchmark.measure_inference_time(model, test_loader, n_runs=5)#

Measure total inference time over a DataLoader across multiple runs.

Parameters:
  • model – Model instance with a .forward() method.

  • test_loader (DataLoader) – Loader with test data.

  • n_runs (int) – Number of repeated runs for averaging.

Returns:

List of total inference times per run (in seconds).

Return type:

list[float]
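
A hedged sketch of combining measure_inference_time with format_time; model and test_loader are assumed to have been prepared elsewhere (e.g. via load_model), and the loader must yield batches the model’s forward pass accepts:

    import numpy as np
    from codes.benchmark import measure_inference_time, format_time

    # model and test_loader are assumed to exist; their construction is surrogate-specific.
    times = measure_inference_time(model, test_loader, n_runs=5)  # seconds per run
    print(format_time(np.mean(times), np.std(times)))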

codes.benchmark.measure_memory_footprint(model, inputs)#

Measure the memory footprint of a model during forward and backward passes using peak memory tracking and explicit synchronization.

Parameters:
  • model (torch.nn.Module) – The PyTorch model.

  • inputs (tuple) – The input data for the model.

Returns:

A dictionary containing measured memory usages for:
  • model_memory: Additional memory used when moving the model to GPU.

  • forward_memory: Peak additional memory during the forward pass with gradients.

  • backward_memory: Peak additional memory during the backward pass.

  • forward_memory_nograd: Peak additional memory during the forward pass without gradients.

model: The model (possibly moved back to the original device).

Return type:

dict
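
A hedged sketch; the exact structure of inputs depends on the surrogate’s forward signature, and peak-memory tracking is most meaningful on a CUDA device:

    from codes.benchmark import measure_memory_footprint

    # model is assumed to exist; inputs is assumed to be one batch of tensors in the
    # form the model's forward pass expects, e.g. drawn from the test DataLoader.
    inputs = tuple(next(iter(test_loader)))
    result = measure_memory_footprint(model, inputs)
    # Per the documented return type, result contains the memory keys listed above.
    print(result)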

codes.benchmark.plot_MAE_comparison(MAEs, labels, config, save=True, show_title=True)#

Plot the MAE for different surrogate models.

Parameters:
  • MAEs (tuple) – Tuple of MAE arrays for each surrogate model.

  • labels (tuple) – Tuple of labels for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.plot_all_generalization_errors(all_metrics, config, show_title=True)#

Function to make one comparative plot of the interpolation, extrapolation, sparse, and batch size errors. Only the modalities present in all_metrics will be plotted.

Parameters:
  • all_metrics (dict) – dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.plot_average_errors_over_time(surr_name, conf, errors, metrics, timesteps, mode, save=False, show_title=True)#

Plot Δdex errors over time for different evaluation modes.

Parameters:
  • surr_name (str) – Surrogate name.

  • conf (dict) – Config dictionary.

  • errors (np.ndarray) – Errors [N_metrics, n_timesteps].

  • metrics (np.ndarray) – Metrics [N_metrics].

  • timesteps (np.ndarray) – Timesteps.

  • mode (str) – One of ‘interpolation’, ‘extrapolation’, ‘sparse’, ‘batchsize’.

Return type:

None

codes.benchmark.plot_average_uncertainty_over_time(surr_name, conf, errors_time, preds_std, timesteps, save=False, show_title=True)#

Plot average predictive uncertainty and errors over time in log-space (dex).

Parameters:
  • surr_name (str) – Name of the surrogate model.

  • conf (dict) – Configuration dictionary.

  • errors_time (np.ndarray) – Log-space prediction errors over time.

  • preds_std (np.ndarray) – Log-space ensemble standard deviation over time.

  • timesteps (np.ndarray) – Array of timesteps.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show a title.

Return type:

None

codes.benchmark.plot_comparative_error_correlation_heatmaps(preds_std, errors, avg_correlations, axis_max, max_count, config, save=True, show_title=True)#

Comparative heatmaps of log-space uncertainty vs Δdex.

Parameters:
  • preds_std (dict) – Log-space std arrays per surrogate.

  • errors (dict) – Δdex arrays per surrogate.

  • avg_correlations (dict) – Pearson r per surrogate (log-space).

  • axis_max (dict) – Axis maxima from per-surrogate plots.

  • max_count (dict) – Peak counts for normalization per surrogate.

  • config (dict) – Configuration dictionary.

  • save (bool) – Save figure.

  • show_title (bool) – Add title.

Return type:

None

codes.benchmark.plot_comparative_gradient_heatmaps(gradients, errors, avg_correlations, max_grad, max_err, max_count, config, save=True, show_title=True)#

Plot comparative heatmaps of correlation between gradient and prediction errors for multiple surrogate models.

Parameters:
  • gradients (dict[str, np.ndarray]) – Dictionary of gradients from the ensemble of models.

  • errors (dict[str, np.ndarray]) – Dictionary of prediction errors.

  • avg_correlations (dict[str, float]) – Dictionary of average correlations between gradients and prediction errors.

  • max_grad (dict[str, float]) – Dictionary of maximum gradient values for axis scaling across models.

  • max_err (dict[str, float]) – Dictionary of maximum error values for axis scaling across models.

  • max_count (dict[str, float]) – Dictionary of maximum count values for heatmap normalization across models.

  • config (dict) – Configuration dictionary.

  • save (bool, optional) – Whether to save the plot. Defaults to True.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.plot_dynamic_correlation(surr_name, conf, gradients, errors, save=False, show_title=True)#

Plot the correlation between the gradients of the data and the prediction errors.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • gradients (np.ndarray) – The gradients of the data.

  • errors (np.ndarray) – The prediction errors.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

codes.benchmark.plot_error_distribution_comparative(errors, conf, save=True, show_title=True, mode='relative')#

Plot comparative error distributions for each surrogate model.

Parameters:
  • errors (dict) – Model → array of errors [num_samples, num_timesteps, num_quantities].

  • conf (dict) – Configuration dictionary.

  • save (bool) – Whether to save the figure.

  • show_title (bool) – Whether to add a title.

  • mode (str) – “relative” (unitless %) or “deltadex” (log-space abs. errors).

Return type:

None

codes.benchmark.plot_error_distribution_per_quantity(surr_name, conf, errors, quantity_names=None, num_quantities=10, mode='relative', save=True, show_title=True)#

Plot the distribution of errors for each quantity as a smoothed histogram plot.

  • mode=”relative”:

    Errors are relative (0..∞). Histogram is plotted in log-space (x-axis log-scaled).

  • mode=”deltadex”:

    Errors are absolute log-space errors (Δdex ≥ 0). Histogram is plotted on linear scale.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • errors (np.ndarray) – Errors array of shape [num_samples, num_timesteps, num_quantities].

  • quantity_names (list, optional) – List of quantity names for labeling the lines.

  • num_quantities (int, optional) – Number of quantities to plot. Default is 10.

  • mode (str, optional) – “relative” or “deltadex”. Default is “relative”.

  • save (bool, optional) – Whether to save the plot as a file.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.plot_error_percentiles_over_time(surr_name, conf, errors, timesteps, title, mode='relative', save=False, show_title=True)#

Plot mean, median, and percentile error envelopes over time.

  • mode=”relative”:

    Treats errors as relative errors (0..∞). Plots bidirectional percentile bands (25-75, 5-95, 0.5-99.5). Y-axis is log-scaled.

  • mode=”deltadex”:

    Treats errors as log-space absolute errors (Δdex ≥ 0). Plots one-sided percentile bands (0-50, 0-90, 0-99). Y-axis is linear, starting at 0.

Parameters:
  • surr_name (str) – Name of the surrogate model (used for saving).

  • conf (dict) – Configuration dictionary containing dataset and output settings.

  • errors (np.ndarray) – Error array of shape [N_samples, N_timesteps, N_quantities]. Values are either relative errors or Δdex depending on mode.

  • timesteps (np.ndarray) – Array of timesteps corresponding to the second axis of errors.

  • title (str) – Title for the plot.

  • mode (str, optional) – “relative” for relative errors, “deltadex” for log-space absolute errors. Defaults to “relative”.

  • save (bool, optional) – Whether to save the plot to disk. Defaults to False.

  • show_title (bool, optional) – Whether to show the plot title. Defaults to True.

Return type:

None

Returns:

None
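
A sketch with a synthetic Δdex error array; conf is assumed to be a benchmark configuration dictionary loaded elsewhere (e.g. via read_yaml_config):

    import numpy as np
    from codes.benchmark import plot_error_percentiles_over_time

    # Synthetic Δdex errors of shape [N_samples, N_timesteps, N_quantities].
    errors = np.abs(np.random.default_rng(1).normal(0.0, 0.3, size=(100, 50, 4)))
    timesteps = np.linspace(0.0, 1.0, 50)
    plot_error_percentiles_over_time(
        "ExampleSurrogate", conf, errors, timesteps,
        title="Delta dex percentiles over time", mode="deltadex", save=False,
    )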

codes.benchmark.plot_example_iterative_predictions(surr_name, conf, iterative_preds, full_preds, targets, timesteps, iter_interval, example_idx=None, num_quantities=100, labels=None, save=False, show_title=True)#

Plot one sample’s full iterative trajectory: ground truth vs. chained predictions, with retrigger lines.

Return type:

None

codes.benchmark.plot_example_mode_predictions(surr_name, conf, preds_log, preds_main_log, targets_log, timesteps, metric, mode='interpolation', example_idx=0, num_quantities=100, labels=None, save=False, show_title=True)#

Plot example predictions in log-space (Δdex) alongside ground truth targets for either interpolation or extrapolation mode.

Predictions and targets are assumed to be in log10 space (leave_log=True). Axis labels and plotted values are consistent with this log representation.

Parameters:
  • surr_name (str) – Name of the surrogate model.

  • conf (dict) – Configuration dictionary.

  • preds_log (np.ndarray) – Predictions in log-space of shape [N_samples, T, Q].

  • preds_main_log (np.ndarray) – Main model (reference) predictions in log-space of shape [N_samples, T, Q].

  • targets_log (np.ndarray) – Targets in log-space of shape [N_samples, T, Q].

  • timesteps (np.ndarray) – Array of timesteps.

  • metric (int) –

    • In interpolation mode: the training interval (e.g., 10 means every 10th timestep was used).

    • In extrapolation mode: the cutoff timestep index.

  • mode (str, optional) – Either “interpolation” or “extrapolation”. Default is “interpolation”.

  • example_idx (int, optional) – Index of the example to plot. Default is 0.

  • num_quantities (int, optional) – Maximum number of quantities to plot. Default is 100.

  • labels (list[str], optional) – Names of the quantities to display in legends.

  • save (bool, optional) – Whether to save the figure. Default is False.

  • show_title (bool, optional) – Whether to add a title to the figure. Default is True.

Return type:

None

Returns:

None

codes.benchmark.plot_example_predictions_with_uncertainty(surr_name, conf, log_mean, log_std, log_targets, timesteps, example_idx=0, num_quantities=100, labels=None, save=False, show_title=True)#

Plot example predictions with uncertainty in log10 space (dex).

Parameters:
  • surr_name (str) – Name of the surrogate model.

  • conf (dict) – Configuration dictionary.

  • log_mean (np.ndarray) – Ensemble mean predictions in log10 space.

  • log_std (np.ndarray) – Ensemble standard deviation in log10 space.

  • log_targets (np.ndarray) – Ground truth targets in log10 space.

  • timesteps (np.ndarray) – Array of timesteps.

  • example_idx (int) – Index of the example to plot.

  • num_quantities (int) – Number of species/quantities to plot.

  • labels (list, optional) – Quantity labels.

  • save (bool) – Whether to save the figure.

  • show_title (bool) – Whether to display a title.

Return type:

None

codes.benchmark.plot_generalization_error_comparison(surrogates, metrics_list, model_errors_list, xlabel, filename, config, save=True, xlog=False, show_title=True)#

Plot the generalization errors of different surrogate models.

Parameters:
  • surrogates (list) – List of surrogate model names.

  • metrics_list (list[np.array]) – List of numpy arrays containing the metrics for each surrogate model.

  • model_errors_list (list[np.array]) – List of numpy arrays containing the errors for each surrogate model.

  • xlabel (str) – Label for the x-axis.

  • filename (str) – Filename to save the plot.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • xlog (bool) – Whether to use a log scale for the x-axis.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.plot_generalization_errors(surr_name, conf, metrics, model_errors, mode, save=False, show_title=True)#

Plot the generalization errors of a model for various metrics.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • metrics (np.ndarray) – The metrics (e.g., intervals, cutoffs, batch sizes, number of training samples).

  • model_errors (np.ndarray) – The model errors.

  • mode (str) – The mode of generalization (“interpolation”, “extrapolation”, “sparse”, “batchsize”).

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.plot_gradients_heatmap(surr_name, conf, gradients, errors_log, average_correlation, save=False, cutoff_mass=0.98, show_title=True)#

Plot correlation between gradients (normalized) and Δdex errors using a heatmap.

Both gradients and errors are in log-space. Gradients are normalized, errors are absolute log differences (Δdex).

Parameters:
  • surr_name (str) – Surrogate name.

  • conf (dict) – Config dictionary.

  • gradients (np.ndarray) – Normalized log-space gradients.

  • errors_log (np.ndarray) – Δdex errors.

  • average_correlation (float) – Mean correlation value.

  • save (bool) – Save plot.

  • cutoff_mass (float) – Fraction of mass to retain in axes.

  • show_title (bool) – Show title.

Returns:

Histogram statistics (max_value, x_max, y_max) for reuse.

Return type:

tuple

codes.benchmark.plot_loss_comparison(train_losses, test_losses, labels, config, save=True, show_title=True)#

Plot the training and test losses for different surrogate models.

Parameters:
  • train_losses (tuple) – Tuple of training loss arrays for each surrogate model.

  • test_losses (tuple) – Tuple of test loss arrays for each surrogate model.

  • labels (tuple) – Tuple of labels for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.plot_loss_comparison_train_duration(test_losses, labels, train_durations, config, save=True, show_title=True)#

Plot the test loss trajectories for different surrogate models over training duration.

Parameters:
  • test_losses (tuple) – Tuple of test loss arrays for each surrogate model.

  • labels (tuple) – Tuple of labels for each surrogate model.

  • train_durations (tuple) – Tuple of training durations for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.plot_losses(loss_histories, epochs, labels, title='Losses', save=False, conf=None, surr_name=None, mode='main', percentage=2.0, show_title=True)#

Plot the loss trajectories for the training of multiple models.

Parameters:
  • loss_histories (tuple[array, ...]) – List of loss history arrays.

  • epochs (int) – Number of epochs.

  • labels (tuple[str, ...]) – List of labels for each loss history.

  • title (str) – Title of the plot.

  • save (bool) – Whether to save the plot as an image file.

  • conf (Optional[dict]) – The configuration dictionary.

  • surr_name (Optional[str]) – The name of the surrogate model.

  • mode (str) – The mode of the training.

  • percentage (float) – Percentage of initial values to exclude from min-max calculation.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.plot_relative_errors(mean_errors, median_errors, timesteps, config, save=True, show_title=True)#

Plot the relative errors over time for different surrogate models.

Parameters:
  • mean_errors (dict) – Dictionary containing the mean relative errors for each surrogate model.

  • median_errors (dict) – Dictionary containing the median relative errors for each surrogate model.

  • timesteps (np.ndarray) – Array of timesteps.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.plot_surr_losses(model, surr_name, conf, timesteps, show_title=True)#

Plot the training and test losses for the surrogate model.

Parameters:
  • model – Instance of the surrogate model class.

  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • timesteps (np.ndarray) – The timesteps array.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.plot_uncertainty_confidence(weighted_diffs, conf, save=True, percentile=2, summary_stat='mean', show_title=True)#

Plot a comparative grouped bar chart of catastrophic confidence measures and return a metric quantifying the net skew of over- versus underconfidence.

For each surrogate model, the target-weighted difference is computed as:

weighted_diff = (predicted uncertainty - |prediction - target|) / target.

Negative values indicate overconfidence (i.e. the model’s uncertainty is too low relative to its error), while positive values indicate underconfidence.

Catastrophic events are defined as those samples in the lowest percentile (e.g. 2nd percentile) for overconfidence and in the highest percentile (i.e. 100 - percentile) for underconfidence.

For each surrogate, this function computes the mean and standard deviation of the weighted differences in both tails, then plots them as grouped bars (overconfidence bars on the left, underconfidence bars on the right) with standard error bars (thin, with capsize=3). The bar heights are expressed in percentages.

The text labels for the bars are placed on the opposite side of the x-axis: for negative (overconfident) values the annotation is shown a few pixels above the x-axis, and for positive (underconfident) values it is shown a few pixels below the x-axis.

The plot title includes the metric (mean ± std) and the number of samples (per tail).

Additionally, if the range between the smallest and largest bar is more than two orders of magnitude, the y-axis is set to a symmetric log scale.

Parameters:
  • weighted_diffs (dict[str, np.ndarray]) – Dictionary of weighted_diff arrays for each surrogate model.

  • conf (dict) – The configuration dictionary.

  • save (bool, optional) – Whether to save the plot.

  • percentile (float, optional) – Percentile threshold for defining catastrophic events (default is 2).

  • summary_stat (str, optional) – Currently only “mean” is implemented.

  • show_title (bool) – Whether to show the title on the plot.

Returns:

A dictionary mapping surrogate names to the net difference (over_summary + under_summary).

Return type:

dict[str, float]
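
A sketch that builds the weighted_diffs input from the formula documented above, using synthetic arrays; conf is again assumed to be the benchmark configuration dictionary:

    import numpy as np
    from codes.benchmark import plot_uncertainty_confidence

    rng = np.random.default_rng(0)
    targets = rng.uniform(1.0, 10.0, size=1000)
    preds = targets + rng.normal(0.0, 0.5, size=1000)
    uncertainty = np.abs(rng.normal(0.0, 0.5, size=1000))

    # Target-weighted difference as defined in the docstring.
    weighted_diff = (uncertainty - np.abs(preds - targets)) / targets
    weighted_diffs = {"ExampleSurrogate": weighted_diff}

    net_skew = plot_uncertainty_confidence(weighted_diffs, conf, percentile=2)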

codes.benchmark.plot_uncertainty_heatmap(surr_name, conf, preds_std, errors, average_correlation, save=True, cutoff_mass=0.98, show_title=True)#

Plot correlation between predictive log-space uncertainty and log-space errors (delta dex).

Parameters:
  • surr_name (str) – Name of the surrogate model.

  • conf (dict) – Configuration dictionary.

  • preds_std (np.ndarray) – Log-space ensemble standard deviation.

  • errors (np.ndarray) – Log-space prediction errors.

  • average_correlation (float) – Correlation between log uncertainty and log error.

  • save (bool) – Whether to save the figure.

  • cutoff_mass (float) – Fraction of mass to keep in histogram.

  • show_title (bool) – Whether to show a title.

Returns:

(max histogram count, axis_max used for plotting).

Return type:

tuple

codes.benchmark.plot_uncertainty_over_time_comparison(uncertainties, absolute_errors, timesteps, config, save=True, show_title=True)#

Plot log-space uncertainty and Δdex over time for multiple surrogates.

Parameters:
  • uncertainties (dict) – Mean log-space std over time per surrogate (1σ time series).

  • absolute_errors (dict) – Δdex arrays [N, T, Q] per surrogate.

  • timesteps (np.ndarray) – Timesteps array.

  • config (dict) – Configuration dictionary.

  • save (bool) – Save figure.

  • show_title (bool) – Add title.

Return type:

None

codes.benchmark.plot_uncertainty_vs_errors(surr_name, conf, preds_std, errors, save=False, show_title=True)#

Plot the correlation between predictive uncertainty and prediction errors.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • preds_std (np.ndarray) – Standard deviation of predictions from the ensemble of models.

  • errors (np.ndarray) – Prediction errors.

  • save (bool, optional) – Whether to save the plot as a file.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

codes.benchmark.read_yaml_config(config_path)#
codes.benchmark.rel_errors_and_uq(metrics, config, save=True, show_title=True)#

Create a figure with two subplots: relative errors over time and uncertainty over time for different surrogate models.

Parameters:
  • metrics (dict) – Dictionary containing the benchmark metrics for each surrogate model.

  • config (dict) – Configuration dictionary.

  • save (bool) – Whether to save the plot.

  • show_title (bool) – Whether to show the title on the plot.

Return type:

None

Returns:

None

codes.benchmark.run_benchmark(surr_name, surrogate_class, conf)#

Run benchmarks for a given surrogate model.

Parameters:
  • surr_name (str) – The name of the surrogate model to benchmark.

  • surrogate_class – The class of the surrogate model.

  • conf (dict) – The configuration dictionary.

Returns:

A dictionary containing all relevant metrics for the given model.

Return type:

dict
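
A hedged end-to-end sketch; the configuration path and surrogate name are placeholders, and the keys required inside the config are defined by the benchmark setup rather than shown here:

    from codes.benchmark import (
        get_surrogate, read_yaml_config, run_benchmark, write_metrics_to_yaml,
    )

    conf = read_yaml_config("config.yaml")   # hypothetical path
    surr_name = "ExampleSurrogate"           # hypothetical surrogate name
    surrogate_class = get_surrogate(surr_name)
    if surrogate_class is not None:
        metrics = run_benchmark(surr_name, surrogate_class, conf)
        write_metrics_to_yaml(surr_name, conf, metrics)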

codes.benchmark.save_plot(plt, filename, conf, surr_name='', dpi=300, base_dir='plots', increase_count=False, format='jpg')#

Save the plot to a file, creating necessary directories if they don’t exist.

Parameters:
  • plt (matplotlib.pyplot) – The plot object to save.

  • filename (str) – The desired filename for the plot.

  • conf (dict) – The configuration dictionary.

  • surr_name (str) – The name of the surrogate model.

  • dpi (int) – The resolution of the saved plot.

  • base_dir (str, optional) – The base directory where plots will be saved. Default is “plots”.

  • increase_count (bool, optional) – Whether to increment the filename count if a file already exists. Default is False.

  • format (str, optional) – The format for saving the plot. Default is “jpg”. Can be “png”, “pdf”, “svg”, etc.

Raises:

ValueError – If the configuration dictionary does not contain the required keys.

Return type:

None
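
A sketch of saving a figure through save_plot; which configuration keys are required is checked by the function itself (a ValueError is raised if they are missing), so conf is assumed to come from the same benchmark configuration used elsewhere:

    import matplotlib.pyplot as plt
    from codes.benchmark import save_plot

    plt.plot([0, 1], [1, 0])
    # conf: benchmark configuration dict; the surrogate name is a placeholder.
    save_plot(plt, "example_plot.jpg", conf, surr_name="ExampleSurrogate", format="jpg")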

codes.benchmark.save_plot_counter(filename, directory, increase_count=True)#

Save a plot with an incremented filename if a file with the same name already exists.

Parameters:
  • filename (str) – The desired filename for the plot.

  • directory (str) – The directory to save the plot in.

  • increase_count (bool, optional) – Whether to increment the filename count if a file already exists. Default is True.

Returns:

The full path to the saved plot.

Return type:

str

codes.benchmark.save_table_csv(headers, rows, config)#

Save the CLI table (headers and rows) to a CSV file. This version strips out any formatting (like asterisks) from the table cells.

Parameters:
  • headers (list) – The list of header names.

  • rows (list) – The list of rows, where each row is a list of string values.

  • config (dict) – Configuration dictionary that contains ‘training_id’.

Return type:

None

Returns:

None

codes.benchmark.tabular_comparison(all_metrics, config)#

Compare the metrics of different surrogate models in a tabular format. Prints a table to the CLI, saves the table into a text file, and saves a CSV file with all metrics. Also saves a CSV file with only the metrics that appear in the CLI table.

Return type:

None

codes.benchmark.time_inference(model, surr_name, test_loader, conf, n_test_samples, n_runs=5)#

Time the inference of the surrogate model (full version with metrics).

Parameters:
  • model – Instance of the surrogate model class.

  • surr_name (str) – The name of the surrogate model.

  • test_loader (DataLoader) – The DataLoader object containing the test data.

  • conf (dict) – The configuration dictionary.

  • n_test_samples (int) – The number of test samples.

  • n_runs (int, optional) – Number of times to run the inference for timing.

Returns:

A dictionary containing timing metrics.

Return type:

dict

codes.benchmark.write_metrics_to_yaml(surr_name, conf, metrics)#

Write the benchmark metrics to a YAML file.

Parameters:
  • surr_name (str) – The name of the surrogate model.

  • conf (dict) – The configuration dictionary.

  • metrics (dict) – The benchmark metrics.

Return type:

None