CODES Benchmark

We introduce CODES, a benchmark for comprehensive evaluation of surrogate architectures for coupled ODE systems. Besides standard metrics like mean squared error (MSE) and inference time, CODES provides insights into surrogate behaviour across multiple dimensions like interpolation, extrapolation, sparse data, uncertainty quantification and gradient correlation. The benchmark emphasizes usability through features such as integrated parallel training, a web-based configuration generator, and pre-implemented baseline models and datasets. Extensive documentation ensures sustainability and provides the foundation for collaborative improvement. By offering a fair and multi-faceted comparison, CODES helps researchers select the most suitable surrogate for their specific dataset and application while deepening our understanding of surrogate learning behaviour.

Motivation

There are many efforts to use machine learning models ("surrogates") to replace the costly numerics involved in solving coupled ODEs. But for the end user, it is not obvious how to choose the right surrogate for a given task. Usually, the best choice depends on both the dataset and the target application.

Dataset specifics - how "complex" is the dataset?

  • How many samples are there?
  • Are the trajectories highly dynamic, or do they evolve rather slowly?
  • How dense is the distribution of initial conditions?
  • Is the data domain of interest well-covered by the domain of the training set?

Task requirements:

  • What is the required accuracy?
  • How important is inference time? Is the training time limited?
  • Are there computational constraints (memory or processing power)?
  • Is uncertainty estimation required (e.g. to replace uncertain predictions by numerics)?
  • How much predictive flexibility is required? Do we need to interpolate or extrapolate across time?

Besides these practical considerations, one overarching question is always: Does the model only learn the data, or does it "understand" something about the underlying dynamics?

Goals

This benchmark aims to aid in choosing the best surrogate model for the task at hand and additionally to shed some light on the above questions.

To achieve this, a selection of surrogate models is implemented in this repository. They can be trained on one of the included datasets or on a custom dataset and then benchmarked on the corresponding test dataset.
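
The exact workflow (configuration files, entry points) is described in the documentation. As a purely conceptual, self-contained illustration of the train-then-benchmark loop - not the CODES API - the toy example below uses a Lotka-Volterra system as the "dataset" and an off-the-shelf MLP as a stand-in surrogate:

    import time
    import numpy as np
    from scipy.integrate import solve_ivp
    from sklearn.neural_network import MLPRegressor

    # Toy "dataset": trajectories of a Lotka-Volterra system for random initial conditions.
    def lotka_volterra(t, y, a=1.0, b=0.4, c=0.4, d=1.0):
        x, z = y
        return [a * x - b * x * z, c * x * z - d * z]

    rng = np.random.default_rng(0)
    t_eval = np.linspace(0.0, 10.0, 50)

    def make_samples(n_trajectories):
        X, Y = [], []
        for _ in range(n_trajectories):
            y0 = rng.uniform(0.5, 2.0, size=2)                 # random initial condition
            sol = solve_ivp(lotka_volterra, (0.0, 10.0), y0, t_eval=t_eval)
            for ti, yi in zip(t_eval, sol.y.T):
                X.append([y0[0], y0[1], ti])                   # input: initial state + time
                Y.append(yi)                                   # target: state at time ti
        return np.array(X), np.array(Y)

    X_train, Y_train = make_samples(40)
    X_test, Y_test = make_samples(10)

    # Stand-in surrogate: a small fully-connected network.
    surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
    surrogate.fit(X_train, Y_train)                            # "training" stage

    start = time.perf_counter()                                # "benchmark" stage
    Y_pred = surrogate.predict(X_test)
    inference_time = time.perf_counter() - start
    mse = np.mean((Y_pred - Y_test) ** 2)
    print(f"MSE: {mse:.4f}, inference time: {inference_time * 1e3:.2f} ms")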

Some of the metrics included in the benchmark (there are many more):

  • Absolute and relative error of the models (see the sketch after this list).
  • Inference time.
  • Number of trainable parameters.
  • Memory requirements (WIP).
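
As a point of reference, the sketch below shows a common way to define the absolute and relative error; the exact conventions used in the benchmark (e.g. how near-zero ground-truth values are guarded against) may differ:

    import numpy as np

    def absolute_error(y_pred, y_true):
        return np.abs(y_pred - y_true)

    def relative_error(y_pred, y_true, eps=1e-12):
        # eps guards against division by (near-)zero ground-truth values;
        # the benchmark may handle this differently.
        return np.abs(y_pred - y_true) / (np.abs(y_true) + eps)

    y_true = np.array([1.0, 0.5, 2.0])
    y_pred = np.array([1.1, 0.4, 2.2])
    print(absolute_error(y_pred, y_true).mean(), relative_error(y_pred, y_true).mean())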

Besides this, there are plenty of plots and visualisations providing insights into the models' behaviour:

  • Error distributions - per model, across time or per quantity.
  • Insights into interpolation and extrapolation across time.
  • Behaviour when training with sparse data or varying batch size.
  • Predictions with uncertainty and predictive uncertainty across time.
  • Correlations between the prediction error and either the predictive uncertainty or the dynamics (gradients) of the data (see the sketch after this list).
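
To illustrate the last point, the sketch below computes a Pearson correlation between the magnitude of the data's time gradients and the absolute prediction error on a toy trajectory; the benchmark's actual implementation may differ in detail:

    import numpy as np

    def gradient_error_correlation(y_true, y_pred, t):
        """Pearson correlation between |dy/dt| of the data and the absolute error."""
        grads = np.abs(np.gradient(y_true, t, axis=0))         # finite-difference dynamics
        errors = np.abs(y_pred - y_true)
        return float(np.corrcoef(grads.ravel(), errors.ravel())[0, 1])

    # Toy two-quantity trajectory with small, gradient-independent noise as "prediction".
    t = np.linspace(0.0, 1.0, 100)
    y_true = np.stack([np.sin(6 * t), np.cos(6 * t)], axis=1)
    y_pred = y_true + 0.01 * np.random.default_rng(0).normal(size=y_true.shape)
    print(gradient_error_correlation(y_true, y_pred, t))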

Some prime use-cases of the benchmark are:

  • Finding the best-performing surrogate on a dataset. Here, best-performing could mean highest accuracy, lowest inference time, or any other metric of interest (e.g. most accurate uncertainty estimates, ...).
  • Comparing the performance of a novel surrogate architecture against the implemented baseline models (see the sketch after this list).
  • Gaining insights into a dataset or comparing datasets using the built-in dataset insights.
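
For the second use-case, a novel architecture would typically be added by implementing the same interface the baseline models use and then running it through the same benchmark. The snippet below is a purely conceptual stand-in: the class and method names are illustrative assumptions rather than the repository's actual base class, and it is meant only to show the general shape of such an extension; see the documentation for the real mechanism.

    from abc import ABC, abstractmethod
    import numpy as np

    class SurrogateInterface(ABC):
        # Local stand-in for the repository's base class, NOT the actual API.
        @abstractmethod
        def fit(self, X: np.ndarray, Y: np.ndarray) -> None: ...

        @abstractmethod
        def predict(self, X: np.ndarray) -> np.ndarray: ...

    class MyNovelSurrogate(SurrogateInterface):
        """Trivial nearest-training-point 'model', only to show the shape of the interface."""
        def fit(self, X, Y):
            self.X, self.Y = X, Y

        def predict(self, X):
            idx = np.argmin(np.linalg.norm(self.X[None, :, :] - X[:, None, :], axis=-1), axis=1)
            return self.Y[idx]

    model = MyNovelSurrogate()
    model.fit(np.random.rand(100, 3), np.random.rand(100, 2))
    print(model.predict(np.random.rand(5, 3)).shape)           # (5, 2)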