Quickstart

Model selection

Here is a small example of how to do Bayesian model selection. There are more examples in the examples section of the documentation, such as a parameter inference example using a single model only.

The notebook can be downloaded here: Quickstart.

The following classes from the pyABC package are used for this example: ABCSMC, RV, Distribution, and PercentileDistanceFunction.

Step by step explanation

Defining a model

To do model selection, we first need some models. A model, in the simplest case, is just a callable which takes a single dict as input and returns a single dict as output. The keys of the input dict are the parameters of the model; the keys of the output dict denote the summary statistics. Here, the input dict is passed as parameters and has the single parameter x, which denotes the mean of a Gaussian. The model returns the observed summary statistic y, which is just the sampled value.

In [1]:
%matplotlib inline
import os
import tempfile

import scipy.stats as st

from pyabc import (ABCSMC, RV, Distribution,
                   PercentileDistanceFunction)

# Define a Gaussian model
sigma = 0.5


def model(parameters):
    # sample from a gaussian
    y = st.norm(parameters.x, sigma).rvs()
    # return the sample as dictionary
    return {"y": y}

For model selection we usually have more than one model; these are assembled in a list. We also require a Bayesian prior over the models; the default is a uniform prior over the model classes. This concludes the model definition.

In [2]:
# We define two models, but they are identical so far
models = [model, model]


# However, our models' priors are not the same.
# Their mean differs.
mu_x_1, mu_x_2 = 0, 1
parameter_priors = [
    Distribution(x=RV("norm", mu_x_1, sigma)),
    Distribution(x=RV("norm", mu_x_2, sigma))
]
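
As a quick sanity check, we can draw samples from the priors directly. The following is a minimal sketch, not part of the original notebook; it assumes Distribution.rvs(), which returns a single parameter sample.

In [ ]:
# Draw one sample from each model's prior (illustrative check only)
for m, prior in enumerate(parameter_priors):
    print("prior of model", m, "sample:", prior.rvs())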

Configuring the ABCSMC run

Having defined the models, we can plug together the ABCSMC class. We need a distance function to measure the discrepancy between simulated and observed summary statistics.

In [3]:
# We plug all the ABC options together
abc = ABCSMC(
    models, parameter_priors,
    PercentileDistanceFunction(measures_to_use=["y"]))
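
The PercentileDistanceFunction used above, roughly speaking, compares the selected summary statistics after scaling by a percentile range. If you want full control over the comparison, a plain function can also serve as the distance. The following is a minimal sketch, assuming a callable of two summary-statistics dicts is accepted (as pyABC's distance wrappers allow); it is not part of the original notebook.

In [ ]:
# A hand-rolled alternative distance: absolute difference of "y"
def absolute_y_distance(x, x_0):
    return abs(x["y"] - x_0["y"])

# abc = ABCSMC(models, parameter_priors, absolute_y_distance)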

Setting the observed data

Actually measured data can now be passed to the ABCSMC object. This is done via the new method, which indicates that we start a new run as opposed to resuming a stored one. Moreover, we have to specify the output database in which the ABC-SMC run is logged.

In [4]:
# y_observed is the important piece here: our actual observation.
y_observed = 1
# and we define where to store the results
db_path = ("sqlite:///" +
           os.path.join(tempfile.gettempdir(), "test.db"))
abc_id = abc.new(db_path, {"y": y_observed})
INFO:Epsilon:initial epsilon is 0.4949724470321585
INFO:History:Start <ABCSMC(id=3, start_time=2018-04-08 22:10:37.202603, end_time=None)>

The new method returns an id, which identifies this ABC-SMC run in the database. We are not using this id for now, but it becomes important when you load stored data or continue an ABC-SMC run and more than one run is stored in the same database (a sketch of re-opening the database follows below).

In [5]:
print("ABC-SMC run ID:", abc_id)
ABC-SMC run ID: 3
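
For illustration, re-opening the database in a later session might look like the following sketch. It assumes pyABC's History class can be constructed from the database URL and that its id attribute selects the stored run; consult the storage API for details.

In [ ]:
from pyabc import History

# Re-open the stored results in a later session (sketch)
stored_history = History(db_path)
# select the run we just created (assumes a settable id attribute)
stored_history.id = abc_id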

Running the ABC

We run the ABCSMC, specifying the epsilon value at which to terminate. The default epsilon strategy is pyabc.epsilon.MedianEpsilon. The run terminates as soon as either the minimum epsilon or the maximum number of allowed populations is reached. The method returns a History object, which can, for example, be queried for the posterior probabilities. (A sketch of choosing the epsilon strategy explicitly follows the run output below.)

In [6]:
# We run the ABC until either criterion is met
history = abc.run(minimum_epsilon=0.05, max_nr_populations=2)
INFO:ABC:t:0 eps:0.4949724470321585
INFO:ABC:t:1 eps:0.27733174471374944
INFO:History:Done <ABCSMC(id=3, start_time=2018-04-08 22:10:37.202603, end_time=2018-04-08 22:10:45.112760)>
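
Instead of relying on the default epsilon schedule, you can also pass an epsilon strategy explicitly. A hedged sketch, assuming MedianEpsilon is importable from the pyabc top level and that the ABCSMC constructor takes an eps argument:

In [ ]:
from pyabc import MedianEpsilon

# Same setup as above, but with an explicitly chosen initial epsilon
abc_explicit_eps = ABCSMC(
    models, parameter_priors,
    PercentileDistanceFunction(measures_to_use=["y"]),
    eps=MedianEpsilon(0.5))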

Note that the History object is also always accessible from the ABCSMC object:

In [7]:
history is abc.history
Out[7]:
True

The History object can, for example, be queried for the posterior probabilities in the populations.

In [8]:
# Evaluate the model probabilities
model_probabilities = history.get_model_probabilities()
model_probabilities
Out[8]:
m         0         1
t
0  0.350000  0.650000
1  0.320771  0.679229

And now, let’s visualize the results:

In [9]:
model_probabilities.plot.bar();
[Figure: bar plot of the model probabilities per population t]
So model 1 is the more probable one, which is expected: its prior was centered at 1 and the observed data was also 1, whereas model 0's prior was centered at 0, farther away from the observed data.
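
The History object also stores the parameter posteriors themselves. A short sketch of inspecting the posterior of x under model 1, assuming History.get_distribution returns a DataFrame of particles together with their weights:

In [ ]:
# Weighted posterior sample of x for model 1, last population
df, w = history.get_distribution(m=1, t=history.max_t)
print(df["x"].describe())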