Parallel Approximate Bayesian Computation - Sequential Monte Carlo

The ABCSMC class is the most central class of the pyABC package. Most of the other classes serve to configure it. (I.e. the other classes implement a Strategy pattern.)

class pyabc.smc.ABCSMC(models: Union[typing.List[pyabc.model.Model], pyabc.model.Model], parameter_priors: Union[typing.List[pyabc.random_variables.Distribution], pyabc.random_variables.Distribution, typing.Callable], distance_function, population_size: Union[pyabc.populationstrategy.PopulationStrategy, int] = 100, summary_statistics: Callable[model_output, dict] = <function identity>, model_prior: pyabc.random_variables.RV = None, model_perturbation_kernel: pyabc.random_variables.ModelPerturbationKernel = None, transitions: List[pyabc.transition.base.Transition] = None, eps: pyabc.epsilon.Epsilon = None, sampler=None)

Bases: object

Approximate Bayesian Computation - Sequential Monte Carlo (ABCSMC).

This is an implementation of an ABCSMC algorithm similar to [1]

Parameters:
  • models (list of models, single model, single function or list of functions) –
    • If models is a function, then the function should have a single parameter, which is of dictionary type, and should return a single dictionary, which contains the simulated data.
    • If models is a list of functions, then the first point applies to each function.
    • Models can also be a list of Model instances or a single Model instance.

    This model’s output is passed to the summary statistics calculation. Per default, the model is assumed to already return the calculated summary statistics. Accordingly, the default summary_statistics function is just the identity.

  • parameter_priors (List[Distribution]) – A list of prior distributions for the models’ parameters. Each list entry is the prior distribution for the corresponding model.
  • distance_function (DistanceFunction, optional) – Measures the distance of the tentatively sampled particle to the measured data.
  • population_size (int, PopulationStrategy, optional) – Specify the size of the population. If population_specification is an int, then the size is constant. Adaptive population sizes are also possible by passing a pyabc.populationstrategy.PopulationStrategy object. The default is 100 particles per population.
  • summary_statistics (Callable[[model_output], dict]) – A function which takes the raw model output as returned by any ot the models and calculates the corresponding summary statistics. Note that the default value is just the identity function. I.e. the model is assumed to already calculate the summary statistics. However, in a model selection setting it can make sense to have the model produce some kind or raw output and then take the same summary statistics function for all the models.
  • model_prior (RV, optional) – A random variable giving the prior weights of the model classes. The default is a uniform prior over the model classes, RV("randint", 0, len(models)).
  • model_perturbation_kernel (ModelPerturbationKernel) – Kernel which governs with which probability to switch from one model to another model for a given sample while generating proposals for the subsequent population from the current population.
  • transitions (List[Transition], Transition, optional) – A list of pyabc.transition.Transition objects or a single pyabc.transition.Transition in case of a single model. Defaults to multivariate normal transitions for every model.
  • eps (Epsilon, optional) – Accepts any pyabc.epsilon.Epsilon subclass. The default is the pyabc.epsilon.MedianEpsilon which adapts automatically. The object passed here determines how the acceptance threshold scheduling is performed.
  • sampler (Sampler, optional) – In some cases, a mapper implementation will require initialization to run properly, e.g. database connection, grid setup, etc.. The sampler is an object that encapsulates this information. The default sampler pyabc.sampler.MulticoreEvalParallelSampler will parallelize across the cores of a single machine only.
stop_if_only_single_model_alive

bool – Defaults to False. Set this to true if you want to stop ABCSMC automatically as soon as only a single model has survived.

[1]Toni, Tina, and Michael P. H. Stumpf. “Simulation-Based Model Selection for Dynamical Systems in Systems and Population Biology.” Bioinformatics 26, no. 1 (2010): 104–10. doi:10.1093/bioinformatics/btp619.
load(db: str, abc_id: int = 1) → int

Load an ABC-SMC run for continuation.

Parameters:
  • db (str) – A SQLAlchemy database identifier pointing to the database from which to continue a run.
  • abc_id (int, optional) – The id of the ABC-SMC run in the database which is to be continued. The default is 1. If more than one ABC-SMC run is stored, use the abc_id parameter to indicate which one to continue.

Note

The Epsilon’s and distance function’s initialize methods are not called when an ABCSMC run is loaded.

new(db: str, observed_sum_stat: dict = None, *, gt_model: int = None, gt_par: dict = None, meta_info=None) → int

Make a new ABCSMC run.

Parameters:
  • db (str) –

    Has to be a valid SQLAlchemy database identifier. This indicates the database to be used (and created if necessary and possible) for the ABC-SMC run.

    To use an in-memory database pass “sqlite://”. Note that in-memory databases are only available on the master mode. If workers are started on different nodes they won’t be able to access the database. This should not be a problem in most scenarios. The in-memory option is mainly useful for benchmarking (and maybe) for testing.

  • observed_sum_stat (dict, optional) –

    This is the really important parameter here. It is of the form {'statistic_1': val_1, 'statistic_2': val_2, ... }.

    The dictionary provided here represents the measured data. Particle during ABCSMC sampling are compared against the summary statistics provided here.

    The summary statistics’ values can be integers, floats, strings and everything which is a numpy array or can be converted to one (e.g. lists). In addition, pandas.DataFrames can also be used as summary statistics. Note that storage of pandas DataFrames in pyABC’s database is still considered experimental.

    This parameter is optional, as the distance function might implement comparison to the observed data on its own. Not giving this parameter is equivalent to passing an empty dictionary {}.

  • gt_model (int, optional) – This is only meta data stored to the database, but not actually used for the ABCSMC algorithm If you want to predict your ABCSMC procedure against synthetic samples, you can use this parameter to indicate the ground truth model number. This helps with further analysis. If you use actually measured data (and don’t know the ground truth) you don’t have to set this.
  • gt_par (dict, optional) – Similar to ground_truth_model, this is only for recording purposes in the database, but not used in the ABCSMC algorithm. This stores the parameters of the ground truth model if it was synthetically obtained. Don’t give this parameter if you don’t know the ground truth.
  • meta_info (dict, optional) – Can contain an arbitrary number of keys, only for recording purposes. Store arbitrary meta information in this dictionary. Can be used for really anything. This dictionary is stored in the database.
run(minimum_epsilon: float, max_nr_populations: int, min_acceptance_rate: float = 0.0, **kwargs) → pyabc.storage.history.History

Run the ABCSMC model selection until either of the stopping criteria is met.

Parameters:
  • minimum_epsilon (float) – Stop if epsilon is smaller than minimum epsilon specified here.
  • max_nr_populations (int) – The maximum number of populations. Stop if this number is reached.
  • min_acceptance_rate (float, optional) – Minimal allowed acceptance rate. Sampling stops if a population has a lower rate.

Population after population is sampled and particles which are close enough to the observed data are accepted and added to the next population. If an adaptive Epsilon is specified (this is the default), then the acceptance threshold decreases from population to population automatically in a data dependent way.

Sampling of further populations is stopped, when either of the three stopping criteria is met:

  • the maximum number of populations max_nr_populations is reached,
  • the acceptance threshold for the last sampled population was smaller than minimum_epsilon,
  • or the acceptance rate dropped below acceptance_rate.

The value of minimum_epsilon determines the quality of the ABCSMC approximation. The smaller the better. But sampling time also increases with decreasing minimum_epsilon.

This method can be called repeatedly to sample further populations after sampling was stopped once.

set_data(observed_summary_statistics: dict, abc_options: Union[dict, str], ground_truth_model: int = -1, ground_truth_parameter: dict = None)

This method is an alias for the new method. This method is deprecated and to be removed in future releases. Use the new method instead. Note that the argument order has changed!

Warning

The method “set_data” is deprecated.