Datastore
Purpose of the datastore
The most important class here is the History class. The History class is the interface to the database in which pyABC stores and logs information during the ABCSMC run, but also the interface which allows you to query that information later on.
Initializing the database interface from a file
For querying, initialize a History object with a valid SQLAlchemy database identifier. For example, if your ABCSMC data is stored in a file “data.db”, initialize the History with:
history = History("sqlite:///data.db")
Don’t mind the three slashes. This is SQLAlchemy syntax.
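The three slashes follow from SQLAlchemy's SQLite URL form: a relative file path comes directly after "sqlite:///", while an absolute POSIX path, which itself starts with a slash, gives four slashes in total. A minimal sketch of building such an identifier is shown here; sqlite_identifier is a hypothetical helper for illustration and not part of pyABC.

```python
import os


def sqlite_identifier(path):
    """Return an SQLAlchemy SQLite URL for the database file at `path`.

    Hypothetical helper for illustration; pyABC only needs the final string.
    """
    return "sqlite:///" + path


# Relative path: three slashes, resolved against the working directory.
relative_id = sqlite_identifier("data.db")
print(relative_id)

# Absolute path: on POSIX systems the leading "/" of the path
# produces a fourth slash.
absolute_id = sqlite_identifier(os.path.abspath("data.db"))
print(absolute_id)

# Either string could then be handed to the constructor,
# e.g. History(relative_id).
```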
If more than one ABCSMC run is stored in your database file, these runs will have IDs. The first run has ID=1, the second run ID=2 and so on. By default, the first run found in the database is automatically selected. To select a specific run manually, do
history.id = n
where n is the run number, e.g. n=3.
Querying the database
The History class has a number of methods which are relevant for querying the stored data. The most important ones are:
 History.get_distribution to retrieve information on the parameter posteriors,
 History.get_model_probabilities to retrieve information on the model probabilities in case you’re doing model selection,
 History.get_all_populations to retrieve information on the evolution of the acceptance threshold and the number of sample attempts per population,
 History.get_nr_particles_per_population to retrieve the number of particles per population (this number is not necessarily constant),
 History.get_weighted_distances to retrieve the distances the parameter samples achieved,
 History.n_populations to get the total number of populations, and
 History.total_nr_simulations to get the total number of simulations, i.e. sample attempts.
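As a rough illustration of the attribute-style queries, the sketch below gathers a few run-level numbers into a dictionary. summarize_run is a hypothetical helper, and the stand-in object with made-up numbers only exists so that the example runs without a database; with pyABC you would pass a real History instance.

```python
from types import SimpleNamespace


def summarize_run(history):
    """Collect a few run-level numbers from a History-like object."""
    return {
        "n_populations": history.n_populations,
        "last_population": history.max_t,
        "total_nr_simulations": history.total_nr_simulations,
    }


# Stand-in object with made-up numbers so the sketch runs without a
# database; it is an assumption, not a pyABC class.
fake_history = SimpleNamespace(
    n_populations=5, max_t=4, total_nr_simulations=1200
)
print(summarize_run(fake_history))
```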
Use get_distribution to retrieve your posterior particle population. For example,
df, w = history.get_distribution(m)
will return a DataFrame df of parameters and an array w of weights of the particles of model m in the last available population. If you’re interested in intermediate populations, add the optional t parameter, which indicates the population number (the first population is t=0):
df, w = history.get_distribution(m, t)
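Once df and w are available, a common next step is a weighted summary of the posterior. The following is a minimal sketch of a weighted mean per parameter; weighted_means is a hypothetical helper, and the particle values are made up. It is written in plain Python so that it works with any mapping from column names to sequences, which is assumed to include a pandas DataFrame.

```python
def weighted_means(df, w):
    """Weighted mean of every parameter column.

    Works with anything that maps column names to sequences, so a
    pandas DataFrame and a plain dict behave the same here.
    """
    total = sum(w)  # normalize in case the weights do not sum to 1
    return {
        name: sum(x * wi for x, wi in zip(df[name], w)) / total
        for name in df
    }


# Made-up particles standing in for the result of
# df, w = history.get_distribution(m)
df = {"rate": [1.0, 3.0], "scale": [10.0, 20.0]}
w = [0.25, 0.75]
print(weighted_means(df, w))
```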
What can be stored as summary statistics
Currently, integers, floats, strings and in general everything that can be converted to a numpy array can be stored. In addition, pandas DataFrames can be stored.
Warning
Storage of pandas DataFrames is considered experimental at this point.

class pyabc.storage.History(db: str)
Bases: object
History for ABCSMC.
This class records the evolution of the populations and stores the ABCSMC results.
Parameters: db (str) – SQLAlchemy database identifier. 
alive_models(t) → List
Get the models which are still alive at time t.
Parameters: t (int) – Population number.
Returns: alive – A list which contains the indices of those models which are still alive.
Return type: List

all_runs()
Get all ABCSMC runs which are stored in the database.

append_population(t: int, current_epsilon: float, population: pyabc.storage.db_model.Population, nr_simulations: int, model_names)
Append population to database.
Parameters:  t (int) – Population number.
 current_epsilon (float) – Current epsilon value.
 population (Population) – List of sampled particles.
 nr_simulations (int) – The number of model evaluations for this population.
 model_names (list) – The model names.
Note. This function is called by the pyabc.ABCSMC class internally. You should most likely not find it necessary to call this method under normal circumstances.

db_size
Size of the database.
Returns: db_size – Size of the SQLite database in MB. Currently this only works for SQLite databases. Returns an error string if the DB size cannot be calculated.
Return type: int, str
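Because db_size may hold either a number or an error string, code that displays it has to handle both cases. A minimal sketch, assuming only the two return types listed above; describe_db_size is a hypothetical helper and the example values are made up.

```python
def describe_db_size(size):
    """Format the value of History.db_size for display."""
    if isinstance(size, str):
        # db_size returned an error string instead of a number.
        return "size unavailable: " + size
    return "{} MB".format(size)


# Example values only; with pyABC you would pass history.db_size.
print(describe_db_size(12))
print(describe_db_size("Cannot calculate size"))
```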

done()
Close database sessions and store end time of population.
Note. This function is called by the pyabc.ABCSMC class internally. You should most likely not find it necessary to call this method under normal circumstances.

get_all_populations()
Returns a pandas DataFrame with columns
 t: Population number
 population_end_time: The end time of the population
 samples: The number of sample attempts performed for a population
 epsilon: The acceptance threshold for the population.
Returns: all_populations – DataFrame with population info
Return type: pd.DataFrame

get_distribution(m: int, t: int = None) → (pandas.core.frame.DataFrame, numpy.ndarray)
Returns the weighted population sample as pandas DataFrame.
Parameters:  m (int) – model index
 t (int, optional) – Population number. If t is not specified, then the last population is returned.
Returns:  df, w (pandas.DataFrame, np.ndarray)
 df – is a DataFrame of parameters
 w – are the weights associated with each parameter

get_model_probabilities(t=None) → pandas.core.frame.DataFrame
Model probabilities.
Parameters: t (int or None) – Population. Defaults to None, i.e. the last population.
Returns: probabilities – Model probabilities
Return type: pd.DataFrame

get_nr_particles_per_population() → pandas.core.series.Series
Returns: nr_particles_per_population – A pandas Series containing the number of particles for each population
Return type: pd.Series

get_population_extended(*, m=None, t='last', tidy=True) → pandas.core.frame.DataFrame
Get extended population information, including parameters, distances, summary statistics, weights and more.
Parameters:  m (int, optional) – The model to query. If omitted, all models are returned
 t (str, optional) – Can be “last” or “all”. If “all”, all populations are returned. If “last”, only the last population is returned.
 tidy (bool, optional) – If True, try to return a tidy DataFrame, where the individual parameters and summary statistics are pivoted. Setting tidy to true will only work for a single model and a single population.
Returns: full_population
Return type: DataFrame

get_population_strategy()
Returns: The population strategy.
Return type: population_strategy

get_sum_stats(t: int, m: int) → (numpy.ndarray, typing.List)
Summary statistics
Parameters:  t (int) – Population number
 m (int) – Model index
Returns: w, sum_stats –
 w: the weights associated with the summary statistics
 sum_stats: list of summary statistics
Return type: np.ndarray, list

get_weighted_distances(t: Union[int, NoneType]) → pandas.core.frame.DataFrame
Population’s weighted distances to the measured sample. These weights do not necessarily sum up to 1. If more than one simulation per parameter is performed and accepted, the sum might be larger.
Parameters: t (int, None) – Population number. If t is None, the last population is selected.
Returns: median – Weighted distances. The DataFrame has column “w” for the weights and column “distance” for the distances.
Return type: DataFrame
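Since the weights need not sum to 1, any weighted summary of the distances should normalize by the sum of the weights explicitly. A minimal sketch, assuming the “w” and “distance” columns described above; mean_weighted_distance is a hypothetical helper and the values are made up.

```python
def mean_weighted_distance(weighted_distances):
    """Weighted mean of the "distance" column, normalized by sum of "w"."""
    w = list(weighted_distances["w"])
    d = list(weighted_distances["distance"])
    return sum(wi * di for wi, di in zip(w, d)) / sum(w)


# Made-up values standing in for history.get_weighted_distances(None);
# note that the weights sum to 1.25 here, not 1.
wd = {"w": [0.5, 0.5, 0.25], "distance": [1.0, 2.0, 4.0]}
print(mean_weighted_distance(wd))
```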

max_t
The population number of the last population. This is equivalent to n_populations - 1.

n_populations
Number of populations stored in the database. This is equivalent to max_t + 1.
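Because population numbers run from 0 to max_t, range(n_populations) visits every stored population. The sketch below collects the result of get_distribution for each of them. distributions_by_population and FakeHistory are hypothetical and only serve the illustration; with a real database, a model that is no longer alive in some population may need separate handling.

```python
def distributions_by_population(history, m):
    """Return {t: (df, w)} for every stored population of model m."""
    return {
        t: history.get_distribution(m, t)
        for t in range(history.n_populations)
    }


class FakeHistory:
    """Stand-in with three populations, used only to make the sketch run."""

    n_populations = 3
    max_t = 2

    def get_distribution(self, m, t=None):
        # Real pyABC returns a DataFrame and a weight array; a dict and a
        # list play those roles here.
        return {"rate": [float(t)]}, [1.0]


result = distributions_by_population(FakeHistory(), m=0)
print(sorted(result))
```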

nr_of_models_alive(t=None) → int
Number of models still alive.
Parameters: t (int) – Population number. None is for the last population.
Returns: nr_alive – Number of models still alive.
Return type: int >= 0 or None

store_initial_data(ground_truth_model: int, options: dict, observed_summary_statistics: dict, ground_truth_parameter: dict, model_names: List[str], distance_function_json_str: str, eps_function_json_str: str, population_strategy_json_str: str)
Store the initial configuration data.
Parameters:  ground_truth_model (int) – Nr of the ground truth model.
 options (dict) – Of ABC metadata
 observed_summary_statistics (dict) – the measured summary statistics
 ground_truth_parameter (dict) – the ground truth parameters
 model_names (List) – A list of model names
 distance_function_json_str (str) – The distance function represented as json string
 eps_function_json_str (str) – The epsilon represented as json string
 population_strategy_json_str (str) – The population strategy represented as json string
Note. This function is called by the pyabc.ABCSMC class internally. You should most likely not find it necessary to call this method under normal circumstances.

total_nr_simulations
Number of sample attempts for the ABC run.
Returns: Total nr of sample attempts for the ABC run.
Return type: int
