pyabc.sumstat

Summary statistics

Summary statistics generally yield a lower-dimensional informative representation of the model output. Distance comparisons are then performed in summary statistics space.

The pyabc.sumstat.Sumstat base class supports chaining statistics as well as self-learned and adaptive statistics. It is directly integrated into distance functions such as pyabc.distance.PNormDistance and derived classes.

Note

Besides the summary statistics class of this module, which is integrated in the distance calculation, the main ABCSMC class accepts models and summary_statistics. It is the output of summary_statistics(model(…)) that is saved in the database. However, this output does not have to be the final summary statistics, which are given by this module. We acknowledge that the naming (due to legacy) may be confusing. The layers are nevertheless useful, as they make it possible to specify separately what the model does, what information is saved, and on what representation distances are calculated.
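The three layers from the note can be illustrated in plain Python. This is only a sketch of the data flow, not pyABC code: the toy model and the function names model, summary_statistics and distance_sumstat are hypothetical stand-ins chosen for this example.

```python
import random


def model(parameters):
    """Layer 1: simulate raw model output for the given parameters."""
    rng = random.Random(0)  # fixed seed so that the sketch is reproducible
    noisy = [parameters["rate"] * t + rng.gauss(0, 0.1) for t in range(5)]
    return {"trajectory": noisy}


def summary_statistics(raw_output):
    """Layer 2: decide what is stored, here only the mean and the last value."""
    values = raw_output["trajectory"]
    return {"mean": sum(values) / len(values), "last": values[-1]}


def distance_sumstat(stored):
    """Layer 3: representation on which distances would be computed."""
    return [stored["mean"], stored["last"] ** 2]


# What would be saved in the database in this sketch.
stored = summary_statistics(model({"rate": 2.0}))
# What the distance would be evaluated on.
final = distance_sumstat(stored)
```

In this sketch, stored corresponds to what is kept per simulation, while final is a derived representation that can change without re-running the model.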

class pyabc.sumstat.GMMSubsetter(n_components_min: int = 1, n_components_max: int = 5, min_fraction: float = 0.3, normalize_labels: bool = True, gmm_args: dict = None)[source]

Bases: Subsetter

Uses a Gaussian mixture model for subset identification.

Performs model selection over Gaussian mixture models with up to n_components_max components and returns all samples belonging to the same cluster as the posterior mean. Optionally, this set is augmented by the nearest neighbors to reach a fraction min_fraction of the original sample size.

Parameters:

n_components_min – Minimum candidate number of clusters.
n_components_max – Maximum candidate number of clusters.
min_fraction – Minimum fraction of samples in the result. If the obtained cluster has fewer samples, it is augmented by nearby samples.
normalize_labels – Whether to z-score normalize labels internally prior to cluster analysis.
gmm_args – Keyword arguments that are passed on to the sklearn GaussianMixture, see https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html.

Properties:

gmm – The best fitted Gaussian mixture model.
n_components – The corresponding number of components.
bics – All BIC values used in model selection.

__init__(n_components_min: int = 1, n_components_max: int = 5, min_fraction: float = 0.3, normalize_labels: bool = True, gmm_args: dict = None)[source]

select(x: ndarray, y: ndarray, w: ndarray) → tuple[ndarray, ndarray, ndarray][source]: Select based on GMM clusters.
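
The augmentation step controlled by min_fraction can be sketched with NumPy alone. The helper augment_to_min_fraction below is hypothetical and not part of pyABC. It assumes that cluster membership and a reference point are already known, and that "nearest" means Euclidean distance to that point, which may differ from the library's actual criterion.

```python
import numpy as np


def augment_to_min_fraction(x, in_cluster, center, min_fraction):
    """Return a boolean mask covering at least ceil(min_fraction * n) samples.

    Starts from the cluster mask and adds the samples outside the cluster
    that are closest (Euclidean distance) to `center`.
    """
    n = x.shape[0]
    n_required = int(np.ceil(min_fraction * n))
    mask = in_cluster.copy()
    n_missing = n_required - int(mask.sum())
    if n_missing <= 0:
        # The cluster is already large enough.
        return mask
    outside = np.flatnonzero(~mask)
    dists = np.linalg.norm(x[outside] - center, axis=1)
    nearest = outside[np.argsort(dists)[:n_missing]]
    mask[nearest] = True
    return mask


# Ten one-dimensional samples, of which only the first two form the cluster.
x = np.array([[0.0], [0.1], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
in_cluster = np.array([True, True] + [False] * 8)
mask = augment_to_min_fraction(
    x, in_cluster, center=np.array([0.05]), min_fraction=0.5
)
```

With min_fraction=0.5, the two cluster members are joined by their three nearest neighbors, so that five of the ten samples are selected.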

class pyabc.sumstat.IdSubsetter[source]

Bases: Subsetter

Identity subset mapping.

select(x: ndarray, y: ndarray, w: ndarray) → tuple[ndarray, ndarray, ndarray][source]: Just return x, y, w unchanged.

class pyabc.sumstat.IdentitySumstat(trafos: list[Callable[[ndarray], ndarray]] = None, pre: Sumstat = None, shape_out: tuple[int, ...] = (-1,))[source]

Bases: Sumstat

Identity mapping with optional transformations.

__call__(data: dict | ndarray, *args, **kwargs)

Calculate summary statistics.

Parameters:

data – Model output or observed data.

Returns:

sumstat – Summary statistics of the data, a np.ndarray.

__init__(trafos: list[Callable[[ndarray], ndarray]] = None, pre: Sumstat = None, shape_out: tuple[int, ...] = (-1,))[source]

Parameters:

pre – Previously applied summary statistics, enables chaining.
trafos – Optional transformations to apply, should be vectorized. Note that to retain the original data in the output, a transformation such as lambda x: x must be included.
shape_out – Shape the (otherwise flat) output is converted to, via numpy.reshape(). Defaults to (-1,) and thus a flat array. Sometimes a row vector (1, -1) may be preferable, e.g. to treat simulations as replicates. For more complex shapes, a tailored mapping derived from Sumstat or IdentitySumstat may be preferable.
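
The interplay of trafos and shape_out can be sketched as follows. The function apply_trafos is a hypothetical stand-in written for this example. It assumes that each transformation is applied to the flattened data and that the results are concatenated before reshaping. The actual class may handle dictionaries and ordering differently.

```python
import numpy as np


def apply_trafos(data, trafos=None, shape_out=(-1,)):
    """Flatten `data`, apply each transformation, concatenate and reshape."""
    flat = np.asarray(data, dtype=float).flatten()
    if trafos is None:
        # Without transformations, this is a pure identity mapping.
        return flat.reshape(shape_out)
    parts = [np.asarray(trafo(flat)).flatten() for trafo in trafos]
    return np.concatenate(parts).reshape(shape_out)


data = np.array([1.0, 2.0, 3.0])

# Identity plus squares: the original data is kept only via `lambda x: x`.
out = apply_trafos(data, trafos=[lambda x: x, lambda x: x**2])

# Squares only, returned as a row vector.
row = apply_trafos(data, trafos=[lambda x: x**2], shape_out=(1, -1))
```

Here out holds the three original values followed by their squares, while row has shape (1, 3) and no longer contains the original values.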

get_ids()[source]

Get ids/labels for the summary statistics.

Uses the more meaningful data labels if the transformation is the identity.

class pyabc.sumstat.PredictorSumstat(predictor: Predictor | Callable, fit_ixs: EventIxs | Collection[int] | int = None, all_particles: bool = True, normalize_labels: bool = True, fitted: bool = False, subsetter: Subsetter = None, pre: Sumstat = None, pre_before_fit: bool = False, par_trafo: ParTrafoBase = None)[source]

Bases: Sumstat

Summary statistics based on a model predicting parameters from data, y -> theta. For some predictor models, there exist dedicated subclasses.

The predictor should define:

fit(X, y) to fit the model on a sample of data X and outputs y, where X has shape (n_sample, n_feature), and y has shape (n_sample, n_out), with n_out either the parameter dimension or 1, depending on joint. Further, fit(X, y, weights) gets as a third argument the sample weights if weight_samples is set. Not all predictors support this.
predict(X) to predict outputs of shape (n_out,), where X has shape (n_sample, n_feature).
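
A minimal object satisfying this fit/predict contract could look as follows. LeastSquaresPredictor is a hypothetical example class, not one of pyABC's predictors. It assumes a linear model with intercept, passes optional sample weights as a third argument to fit, and returns one prediction row per input sample from predict.

```python
import numpy as np


class LeastSquaresPredictor:
    """Hypothetical linear predictor with intercept, fitted by least squares."""

    def __init__(self):
        self.coef_ = None

    def fit(self, X, y, weights=None):
        # Append a column of ones for the intercept.
        X1 = np.hstack([X, np.ones((X.shape[0], 1))])
        if weights is not None:
            # Weighted least squares: scale rows by the square root of weights.
            sw = np.sqrt(weights)[:, None]
            X1, y = X1 * sw, y * sw
        self.coef_, *_ = np.linalg.lstsq(X1, y, rcond=None)
        return self

    def predict(self, X):
        X1 = np.hstack([X, np.ones((X.shape[0], 1))])
        return X1 @ self.coef_


# Noise-free toy data: 50 samples, 3 features, 2 outputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_coef = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, -1.0]])
y = X @ true_coef + 0.3

predictor = LeastSquaresPredictor().fit(X, y)
y_hat = predictor.predict(X[:1])
```

Since the toy data are noise-free and linear, the fitted coefficients reproduce true_coef and the intercept 0.3 up to numerical precision.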

__call__(data: dict | ndarray, *args, **kwargs)

Calculate summary statistics.

Parameters:

data – Model output or observed data.

Returns:

sumstat – Summary statistics of the data, a np.ndarray.

__init__(predictor: Predictor | Callable, fit_ixs: EventIxs | Collection[int] | int = None, all_particles: bool = True, normalize_labels: bool = True, fitted: bool = False, subsetter: Subsetter = None, pre: Sumstat = None, pre_before_fit: bool = False, par_trafo: ParTrafoBase = None)[source]

Parameters:

predictor – The predictor mapping data (inputs) to parameters (targets). See Predictor for the functionality contract.
fit_ixs – Generation indices at which to (re)fit the model, e.g. {9, 15}. See pyabc.EventIxs for possible values. In generations before the first fit, the output of pre is returned as-is.
all_particles – Whether to base the predictors on all samples, or only accepted ones. Basing it on all samples reflects the sampling process, while only considering accepted particles (and additionally weighting them) reflects the posterior approximation.
normalize_labels – Whether the outputs in __call__ are normalized according to potentially applied internal normalization of the predictor. This makes it possible to level the influence of labels.
fitted – Set to True if the predictor model passed has already been fitted externally. If False, the __call__ function will return the output of pre until the first time index in fit_ixs.
subsetter – Sample subset/cluster selection method. Defaults to just taking all samples. In the presence of e.g. multi-modalities it may make sense to reduce.
pre – Previously applied summary statistics, enables chaining. Should usually not be adaptive.
pre_before_fit – Apply previous summary statistics also before any fit is performed, or just return the input then and only apply pre when regression-based summary statistics are calculated.
par_trafo – Parameter transformations to use as targets. Defaults to identity.
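
How fit_ixs and fitted interact can be sketched with a toy class. SketchPredictorSumstat and MeanPredictor are hypothetical and heavily simplified. In particular, update receives the training data directly instead of a get_sample callback, and chaining via pre, subsetting and normalization are left out.

```python
import numpy as np


class MeanPredictor:
    """Hypothetical predictor that always predicts the mean training target."""

    def fit(self, X, y):
        self.mean_ = y.mean(axis=0)

    def predict(self, X):
        return np.tile(self.mean_, (X.shape[0], 1))


class SketchPredictorSumstat:
    """Toy stand-in showing when the predictor output replaces the input."""

    def __init__(self, predictor, fit_ixs, fitted=False):
        self.predictor = predictor
        self.fit_ixs = set(fit_ixs)
        self.fitted = fitted

    def update(self, t, X, y):
        """(Re)fit in generations listed in fit_ixs, report whether it happened."""
        if t not in self.fit_ixs:
            return False
        self.predictor.fit(X, y)
        self.fitted = True
        return True

    def __call__(self, data):
        data = np.asarray(data, dtype=float).flatten()
        if not self.fitted:
            # Before the first fit, pass the input through unchanged.
            return data
        return self.predictor.predict(data.reshape(1, -1)).flatten()


sumstat = SketchPredictorSumstat(MeanPredictor(), fit_ixs={2})
before = sumstat([1.0, 2.0, 3.0])

X = np.zeros((4, 3))
y = np.array([[1.0], [2.0], [3.0], [4.0]])
updated_t1 = sumstat.update(1, X, y)  # generation 1 is not in fit_ixs
updated_t2 = sumstat.update(2, X, y)  # generation 2 triggers the fit
after = sumstat([1.0, 2.0, 3.0])
```

Before generation 2 the statistic returns its input, afterwards it returns the predicted parameter, here the mean training target.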

configure_sampler(sampler) → None[source]

Configure the sampler.

This method is called by the inference routine at the beginning. A possible configuration would be to also request the storing of rejected particles. The default is to do nothing.

Parameters:

sampler (Sampler) – The sampler used.

get_ids() → list[str][source]

Get ids/labels for the summary statistics.

Defaults to indexing the statistics as S_{ix}.

initialize(t: int, get_sample: Callable[[], Sample], x_0: dict, total_sims: int) → None[source]

Initialize before the first generation.

Called at the beginning by the inference routine, can be used for calibration to the problem.

Parameters:

t – Time point for which to initialize the summary statistics.
get_sample – Returns on demand the initial sample.
x_0 – The observed summary statistics.
total_sims – The total number of simulations so far.

is_adaptive() → bool[source]: Whether the class is dynamically updated after each generation, based on the last generation’s available data. Default: False.

requires_calibration() → bool[source]: Whether the class requires an initial calibration, based on samples from the prior. Default: False.

update(t: int, get_sample: Callable[[], Sample], total_sims: int) → bool[source]

Update for the upcoming generation t.

Similar to initialize, but called for every subsequent iteration.

Parameters:

t – Time point for which to update the summary statistics.
get_sample – Returns on demand the last generation’s complete sample.
total_sims – The total number of simulations so far.

Returns:

is_updated – Whether something has changed compared to beforehand. Depending on the result, the population needs to be updated before preparing the next generation. Defaults to False.

Return type:

bool

class pyabc.sumstat.Subsetter[source]

Bases: ABC

Select a localized sample subset for model training.

For example, the pyabc.PredictorSumstat class employs predictor models y -> p from data to parameters. These models should be local, e.g. trained on samples from a high-density region. This is because the inverse y -> p of the mapping p -> y does in general not exist globally, e.g. due to parameter non-identifiability, multiple modes, and model stochasticity. Therefore, it is important to train the models on a sample set on which an approximately functional relationship holds. This class subsets a given sample to generate a localized sample.

abstractmethod select(x: ndarray, y: ndarray, w: ndarray) → tuple[ndarray, ndarray, ndarray][source]

Select samples for model training. This is the main method.

Parameters:

x – Samples, shape (n_sample, n_feature).
y – Targets, shape (n_sample, n_out).
w – Weights, shape (n_sample,).

Returns:

A tuple x_, y_, w_ of the subsetted samples, targets and weights with n_sample -> n_sample_used <= n_sample.
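
A custom subsetter following this select contract might look like the sketch below. NearMeanSubsetter is a hypothetical example, not a pyABC class. It keeps the fraction of samples whose targets lie closest to the weighted mean target and does not renormalize the weights. To stay self-contained, it does not derive from the real base class.

```python
import numpy as np


class NearMeanSubsetter:
    """Hypothetical subsetter keeping samples with targets near the weighted mean."""

    def __init__(self, fraction=0.5):
        self.fraction = fraction

    def select(self, x, y, w):
        # Weighted mean of the targets serves as the reference point.
        mean = np.average(y, axis=0, weights=w)
        dists = np.linalg.norm(y - mean, axis=1)
        n_keep = max(1, int(np.ceil(self.fraction * y.shape[0])))
        # Indices of the closest samples, restored to their original order.
        keep = np.sort(np.argsort(dists)[:n_keep])
        return x[keep], y[keep], w[keep]


x = np.arange(8.0).reshape(4, 2)
y = np.array([[0.0], [1.0], [2.0], [10.0]])
w = np.full(4, 0.25)

x_, y_, w_ = NearMeanSubsetter(fraction=0.5).select(x, y, w)
```

In this toy example the weighted mean target is 3.25, so the two samples with targets 1 and 2 are retained and the outlier with target 10 is dropped.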

class pyabc.sumstat.Sumstat(pre: Sumstat = None)[source]

Bases: ABC

Summary statistics.

Summary statistics operate on and transform the model output. They can e.g. rotate, augment, or extract features. Via the pre argument, summary statistics operations can be concatenated/chained.
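
The chaining idea behind the pre argument can be sketched with a small toy class. ChainableSketch is hypothetical and only mimics the concept. It is not the pyABC base class and omits all adaptive functionality.

```python
import numpy as np


class ChainableSketch:
    """Toy class mimicking the `pre` chaining idea."""

    def __init__(self, func, pre=None):
        self.func = func
        self.pre = pre

    def __call__(self, data):
        # Apply the previous statistic first, if one was given.
        if self.pre is not None:
            data = self.pre(data)
        return self.func(np.asarray(data, dtype=float))


# First take logarithms, then reduce to the mean of the log values.
log_stat = ChainableSketch(np.log)
mean_of_log = ChainableSketch(lambda x: np.atleast_1d(x.mean()), pre=log_stat)

result = mean_of_log([1.0, np.e, np.e**2])
```

Calling mean_of_log first evaluates log_stat on the data and then averages the result, which gives a value of approximately 1 for this input.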

abstractmethod __call__(data: dict | ndarray) → ndarray | dict[str, float][source]

Calculate summary statistics.

Parameters:

data – Model output or observed data.

Returns:

sumstat – Summary statistics of the data, a np.ndarray.

__init__(pre: Sumstat = None)[source]

Parameters:

pre – Previously applied summary statistics, enables chaining.

configure_sampler(sampler) → None[source]

Configure the sampler.

This method is called by the inference routine at the beginning. A possible configuration would be to also request the storing of rejected particles. The default is to do nothing.

Parameters:

sampler (Sampler) – The sampler used.

get_ids() → list[str][source]

Get ids/labels for the summary statistics.

Defaults to indexing the statistics as S_{ix}.

initialize(t: int, get_sample: Callable[[], Sample], x_0: dict, total_sims: int) → None[source]

Initialize before the first generation.

Called at the beginning by the inference routine, can be used for calibration to the problem.

Parameters:

t – Time point for which to initialize the summary statistics.
get_sample – Returns on demand the initial sample.
x_0 – The observed summary statistics.
total_sims – The total number of simulations so far.

is_adaptive() → bool[source]: Whether the class is dynamically updated after each generation, based on the last generation’s available data. Default: False.

requires_calibration() → bool[source]: Whether the class requires an initial calibration, based on samples from the prior. Default: False.

update(t: int, get_sample: Callable[[], Sample], total_sims: int) → bool[source]

Update for the upcoming generation t.

Similar to initialize, but called for every subsequent iteration.

Parameters:

t – Time point for which to update the summary statistics.
get_sample – Returns on demand the last generation’s complete sample.
total_sims – The total number of simulations so far.

Returns:

is_updated – Whether something has changed compared to beforehand. Depending on the result, the population needs to be updated before preparing the next generation. Defaults to False.

Return type:

bool
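
The lifecycle of initialize, update and __call__ can be illustrated with a toy adaptive statistic. ScalingSketch is hypothetical and does not derive from the real base class. As a simplification, get_sample returns a plain array of simulated data here rather than a Sample object, and the calls below only imitate what an inference routine might do.

```python
import numpy as np


class ScalingSketch:
    """Toy adaptive statistic dividing data by the last sample's standard deviation."""

    def __init__(self):
        self.scale = None

    def is_adaptive(self):
        return True

    def requires_calibration(self):
        return True

    def initialize(self, t, get_sample, x_0, total_sims):
        # Calibrate on the initial sample before the first generation.
        self.scale = np.std(get_sample(), axis=0)

    def update(self, t, get_sample, total_sims):
        # Recompute the scale and report whether it has changed.
        new_scale = np.std(get_sample(), axis=0)
        changed = not np.allclose(new_scale, self.scale)
        self.scale = new_scale
        return changed

    def __call__(self, data):
        return np.asarray(data, dtype=float) / self.scale


# Hypothetical per-generation samples, two statistics each.
samples = {
    0: np.array([[0.0, 0.0], [2.0, 4.0]]),
    1: np.array([[0.0, 0.0], [4.0, 4.0]]),
}

stat = ScalingSketch()
stat.initialize(t=0, get_sample=lambda: samples[0], x_0={"y": 1.0}, total_sims=2)
scale_after_init = stat.scale.copy()

changed = stat.update(t=1, get_sample=lambda: samples[1], total_sims=4)
unchanged = stat.update(t=2, get_sample=lambda: samples[1], total_sims=6)
scaled = stat([4.0, 4.0])
```

The first update reports a change because the scale differs from the calibrated one, whereas the second update sees the same sample again and reports no change.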