pyabc.storage

Data store

Purpose of the data store

The most important class here is History. It is the interface to the database in which pyABC stores and logs information during an ABC-SMC run, and it is also the interface for querying that information later on.

Initializing the database interface from a file

For querying, you initialize a History object with a valid SQLAlchemy database identifier. For example, if your ABC-SMC data is stored in a file “data.db”, you initialize the History with:

history = History("sqlite:///data.db")

Don’t mind the three slashes. This is SQLAlchemy syntax.
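The slash count follows from how the identifier is assembled: the prefix "sqlite:///" is followed by the file path, so an absolute POSIX path (which itself starts with a slash) ends up with four slashes. The following is a minimal sketch of that rule; the helper name sqlite_identifier is hypothetical and not part of pyABC.

```python
from typing import Optional


def sqlite_identifier(path: Optional[str] = None) -> str:
    """Build an SQLAlchemy SQLite identifier (hypothetical helper)."""
    if path is None:
        # No path at all: in-memory database.
        return "sqlite://"
    # Prefix plus path; an absolute POSIX path contributes a fourth slash.
    return "sqlite:///" + path


relative_id = sqlite_identifier("data.db")
absolute_id = sqlite_identifier("/path/to/file.db")
memory_id = sqlite_identifier()
```

The resulting strings can be passed as the db argument of History.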

If more than one ABC-SMC run is stored in your database file, the runs are distinguished by ids. The first run has id=1, the second run id=2, and so on. By default, the latest run found in the database is selected. To select a specific run n (e.g. n=3), set

history.id = n

Querying the database

The History class has a number of methods which are relevant for querying the stored data. The most important ones are:

  • History.get_distribution to retrieve information on the parameter posteriors,

  • History.get_model_probabilities to retrieve information on the model probabilities in case you’re doing model selection,

  • History.get_all_populations to retrieve information on the evolution of the acceptance threshold and the number of sample attempts per population,

  • History.get_nr_particles_per_population to retrieve the number of particles per population (this number is not necessarily constant),

  • History.get_weighted_distances to retrieve the distances the parameter samples achieved,

  • History.n_populations to get the total number of populations, and

  • History.total_nr_simulations to get the total number of simulations, i.e. sample attempts.

Use get_distribution to retrieve your posterior particle population. For example,

df, w = history.get_distribution(m)

returns a DataFrame df of parameters and an array w of weights of the particles of model m in the last available population. If you’re interested in intermediate populations, add the optional t parameter, which indicates the population number (the first population is t=0):

df, w = history.get_distribution(m, t)
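A common next step is to turn the weighted particles into point estimates. The sketch below computes the weighted mean of every parameter; the function name posterior_means is hypothetical. It uses plain Python iteration rather than pandas-specific calls and divides by the sum of the weights, so it does not rely on the weights being normalized.

```python
def posterior_means(history, m=0, t=None):
    """Weighted mean of each parameter of model m at population t."""
    df, w = history.get_distribution(m=m, t=t)
    total = sum(w)
    # Iterating over a DataFrame yields its column names (the parameters).
    return {
        name: sum(x * wi for x, wi in zip(df[name], w)) / total
        for name in df
    }
```

Leaving t at None requests the last population, as described above.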

What can be stored as summary statistics

Currently, integers, floats, strings, and in general everything that can be converted to a numpy array, can be stored. In addition, it is also possible to store pandas DataFrames.

Warning

Storage of pandas DataFrames is considered experimental at this point.

class pyabc.storage.History(db: str, stores_sum_stats: bool = True, _id: int = None, create: bool = True)

Bases: object

History for ABCSMC.

This class records the evolution of the populations and stores the ABCSMC results.

db

SQLAlchemy database identifier. For a relative path use the template “sqlite:///file.db”, for an absolute path “sqlite:////path/to/file.db”, and for an in-memory database “sqlite://”.

Type:

str

stores_sum_stats

Whether to store summary statistics to the database. Note: this is True by default and should be set to False only for testing purposes (i.e. to speed up the writing to the file system), as it cannot be guaranteed that all methods of pyabc work correctly if the summary statistics are not stored.

Type:

bool, optional (default = True)

id

The id of the ABCSMC analysis that is currently in use. If there are analyses in the database already, this defaults to the latest id. Manually set if another run is wanted.

Type:

int

__init__(db: str, stores_sum_stats: bool = True, _id: int = None, create: bool = True)

Initialize history object.

Parameters:

create – If False, an error is raised if the database does not exist.

alive_models(t: int = None) -> List

Get the models which are still alive at time t.

Parameters:

t (int, optional (default = self.max_t)) – Population index.

Returns:

alive – A list which contains the indices of those models which are still alive.

Return type:

List

all_runs()

Get all ABCSMC runs which are stored in the database.

append_population(t: int, current_epsilon: float, population: Population, nr_simulations: int, model_names)

Append population to database.

Parameters:
  • t (int) – Population number.

  • current_epsilon (float) – Current epsilon value.

  • population (Population) – List of sampled particles.

  • nr_simulations (int) – The number of model evaluations for this population.

  • model_names (list) – The model names.

Note. This function is called by the pyabc.ABCSMC class internally. You should most likely not find it necessary to call this method under normal circumstances.

property db_size: int | str

Size of the database.

Returns:

db_size – Size of the SQLite database in MB. Currently this only works for SQLite databases.

Returns an error string if the DB size cannot be calculated.

Return type:

int, str

done(end_time: datetime = None)

Close database sessions and store end time of the analysis.

Parameters:

end_time – End time of the analysis.

Note. This function is called by the pyabc.ABCSMC class internally. You should most likely not find it necessary to call this method under normal circumstances.

get_all_populations()

Returns a pandas DataFrame with columns

  • t: Population number

  • population_end_time: The end time of the population

  • samples: The number of sample attempts performed for a population

  • epsilon: The acceptance threshold for the population.

Returns:

all_populations – DataFrame with population info

Return type:

pd.DataFrame
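To show how the columns listed above can be consumed, here is a minimal sketch that flattens the returned table into (t, epsilon, samples) tuples. The function name population_table is an assumption for illustration. It only relies on column access by name, so row order is whatever the DataFrame provides.

```python
def population_table(history):
    """Return one (t, epsilon, samples) tuple per row of the table."""
    pops = history.get_all_populations()
    return [
        (int(t), float(eps), int(n))
        for t, eps, n in zip(pops["t"], pops["epsilon"], pops["samples"])
    ]
```

Such a list is convenient for printing the epsilon schedule next to the sampling effort of each population.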

get_distribution(m: int = 0, t: int = None) -> Tuple[DataFrame, ndarray]

Returns the weighted population sample for model m and timepoint t as a tuple.

Parameters:
  • m (int, optional (default = 0)) – Model index.

  • t (int, optional (default = self.max_t)) – Population index. If t is not specified, then the last population is returned.

Returns:

df, w

  • df: a DataFrame of parameters

  • w: the weights associated with each parameter sample (row of df)

Return type:

pandas.DataFrame, np.ndarray

get_ground_truth_parameter() -> Parameter

Create a pyabc.Parameter object from the ground truth parameters saved in the database, if existent.

Return type:

A pyabc.Parameter dictionary.

get_model_probabilities(t: int | None = None) -> DataFrame

Model probabilities.

Parameters:

t (int or None (default = None)) – Population index. If None, all populations of indices >= 0 are considered.

Returns:

probabilities – Model probabilities.

Return type:

pd.DataFrame

get_nr_particles_per_population() -> Series

Get the number of particles per population.

Returns:

nr_particles_per_population – A pandas Series containing the number of particles for each population.

Return type:

pd.Series

get_population(t: int = None)

Create a pyabc.Population object containing all particles, as far as those can be recreated from the database. In particular, rejected particles are currently not stored.

Parameters:

t (int, optional (default = self.max_t)) – The population index.

get_population_extended(*, m: int | None = None, t: int | str = 'last', tidy: bool = True) -> DataFrame

Get extended population information, including parameters, distances, summary statistics, weights and more.

Parameters:
  • m (int or None, optional (default = None)) – The model to query. If omitted, all models are returned.

  • t (int or str, optional (default = "last")) – Can be “last” or “all”, or a population index (i.e. an int). In case of “all”, all populations are returned. If “last”, only the last population is returned, for an int value only the corresponding population at that time index.

  • tidy (bool, optional) – If True, try to return a tidy DataFrame, where the individual parameters and summary statistics are pivoted. Setting tidy to true will only work for a single model and a single population.

Returns:

full_population

Return type:

DataFrame

get_population_strategy()

Get information on the population size strategy.

Returns:

The population strategy.

Return type:

population_strategy

get_weighted_distances(t: int = None) -> DataFrame

Population’s weighted distances to the measured sample. These weights do not necessarily sum up to 1.

Parameters:

t (int, optional (default = self.max_t)) – Population index. If t is None, the last population is selected.

Returns:

df_weighted – Weighted distances. The dataframe has column “w” for the weights and column “distance” for the distances.

Return type:

pd.DataFrame
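Because the weights in column “w” do not necessarily sum to 1, any weighted statistic should normalize explicitly. The sketch below computes the weighted mean distance of a population; the function name weighted_mean_distance is hypothetical.

```python
def weighted_mean_distance(history, t=None):
    """Weighted mean of the distances in population t."""
    df = history.get_weighted_distances(t=t)
    # Normalize by the weight sum instead of assuming it equals 1.
    total = sum(df["w"])
    return sum(d * w for d, w in zip(df["distance"], df["w"])) / total
```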

get_weighted_sum_stats(t: int = None) -> Tuple[List[float], List[dict]]

Population’s weighted summary statistics. These weights do not necessarily sum up to 1.

Parameters:

t (int, optional (default = self.max_t)) – Population index. If t is None, the latest population is selected.

Returns:

The first list contains the weights (multiplied by the model probabilities) and the second list contains the summary statistics, in the same order.

Return type:

weights, sum_stats

get_weighted_sum_stats_for_model(m: int = 0, t: int = None) -> Tuple[ndarray, List]

Summary statistics for model m. The weights sum to 1, unless there were multiple acceptances per particle.

Parameters:
  • m (int, optional (default = 0)) – Model index.

  • t (int, optional (default = self.max_t)) – Population index.

Returns:

w, sum_stats

  • w: the weights associated with the summary statistics

  • sum_stats: list of summary statistics

Return type:

np.ndarray, list
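As an example of combining the two return values, the sketch below averages one summary statistic over the particles of a model. It assumes that each entry of sum_stats is a dictionary mapping statistic names to scalar values; that layout and the function name weighted_mean_sum_stat are assumptions for illustration. The result is divided by the weight sum, which covers the case where the weights do not sum to 1.

```python
def weighted_mean_sum_stat(history, key, m=0, t=None):
    """Weighted mean of the scalar summary statistic `key` for model m."""
    w, sum_stats = history.get_weighted_sum_stats_for_model(m=m, t=t)
    total = sum(w)
    # Assumes each entry of sum_stats is a dict with a scalar under `key`.
    return sum(wi * s[key] for wi, s in zip(w, sum_stats)) / total
```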

property max_t

The population number of the last population. This is equivalent to n_populations - 1.

model_names(t: int = -1)

Get the names of alive models for population t.

Parameters:

t (int, optional (default = -1)) – Population index.

Returns:

model_names – List of model names.

Return type:

List[str]

property n_populations

Number of populations stored in the database. This is equivalent to max_t + 1.
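The relation between n_populations and max_t gives a simple way to loop over every stored population. A minimal sketch, with hypothetical function names, that gathers the weighted sample of one model for each population index:

```python
def population_indices(history):
    """Population indices 0, 1, ..., max_t, derived from n_populations."""
    return list(range(history.n_populations))


def parameter_tables_over_time(history, m=0):
    """Map each population index t to the (df, w) pair of model m."""
    return {
        t: history.get_distribution(m=m, t=t)
        for t in population_indices(history)
    }
```

If a model has died out at some t, the corresponding entry may be empty, so downstream code should be prepared for that.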

nr_of_models_alive(t: int = None) -> int

Number of models still alive.

Parameters:

t (int, optional (default = self.max_t)) – Population index.

Returns:

nr_alive – Number of models still alive. Passing t=None selects the last population.

Return type:

int >= 0 or None

observed_sum_stat()

Get the observed summary statistics.

Returns:

sum_stats_dct – The observed summary statistics.

Return type:

dict

store_initial_data(ground_truth_model: int, options: dict, observed_summary_statistics: dict, ground_truth_parameter: dict, model_names: List[str], distance_function_json_str: str, eps_function_json_str: str, population_strategy_json_str: str, start_time: datetime = None) -> None

Store the initial configuration data.

Parameters:
  • ground_truth_model – Index of the ground truth model.

  • options – Dictionary of ABC metadata.

  • observed_summary_statistics – The measured summary statistics.

  • ground_truth_parameter – The ground truth parameters.

  • model_names – A list of model names.

  • distance_function_json_str – The distance function represented as json string.

  • eps_function_json_str – The epsilon represented as json string.

  • population_strategy_json_str – The population strategy represented as json string.

  • start_time – Start time of the analysis.

Note. This function is called by the pyabc.ABCSMC class internally. You should most likely not find it necessary to call this method under normal circumstances.

store_pre_population(ground_truth_model: int, observed_summary_statistics: dict, ground_truth_parameter: dict, model_names: List[str])

Store a dummy pre-population containing some configuration data and in particular some ground truth values.

For the parameters, see store_initial_data.

Note. This function is called by the pyabc.ABCSMC class internally. You should most likely not find it necessary to call this method under normal circumstances.

property total_nr_simulations: int

Number of sample attempts for the ABC run.

Returns:

nr_sim – Total number of sample attempts for the ABC run.

Return type:

int

update_after_calibration(nr_samples: int, end_time: datetime)

Update after the calibration iteration. In particular set time and number of samples. Update the number of samples used in iteration t.

Parameters:
  • nr_samples – Number of samples reported.

  • end_time – End time of the calibration iteration.

Note. This function is called by the pyabc.ABCSMC class internally. You should most likely not find it necessary to call this method under normal circumstances.

pyabc.storage.create_sqlite_db_id(dir_: str = None, file_: str = 'pyabc_test.db')

Convenience function to create an SQLite database identifier which can be understood by SQLAlchemy.

Parameters:
  • dir_ – The base folder name. Optional, defaults to the system’s temporary directory, i.e. “/tmp/” on Linux. While this makes sense for testing purposes, for productive use a non-temporary location should be used.

  • file_ – The database file name. Optional, defaults to “pyabc_test.db”.
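The documented behavior can be approximated in a few lines. This is a sketch under the assumption that the identifier is simply the "sqlite:///" prefix followed by the joined path; it is not taken from pyABC's source, and the function name sqlite_db_id_sketch is hypothetical.

```python
import os
import tempfile


def sqlite_db_id_sketch(dir_=None, file_="pyabc_test.db"):
    """Approximate the documented behavior; not pyABC's actual code."""
    if dir_ is None:
        # Fall back to the system's temporary directory.
        dir_ = tempfile.gettempdir()
    return "sqlite:///" + os.path.join(dir_, file_)


explicit_id = sqlite_db_id_sketch("/data", "run.db")
default_id = sqlite_db_id_sketch()
```

Because the temporary directory may be cleared by the operating system, an explicit dir_ is the safer choice for results that need to be kept.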

pyabc.storage.load_dict_from_json(file_: str, key_type: type = int)

Read in a JSON file and convert the keys to key_type. Inverse to save_dict_to_json.

Parameters:
  • file_ – Name of the file to read in.

  • key_type – Type to convert the keys into.

Returns:

dct

Return type:

The json file contents.

pyabc.storage.save_dict_to_json(dct: dict, file_: str)

Save dict to file. Inverse to load_dict_from_json.

Parameters:
  • dct – The dictionary to write to file.

  • file_ – Name of the file to write to.
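The key_type argument of load_dict_from_json exists because JSON object keys are always strings, so integer keys do not survive a plain round trip. The sketch below reproduces this with the standard json module; the two function names are hypothetical stand-ins and the code is not pyABC's implementation.

```python
import json
import os
import tempfile


def save_dict_sketch(dct, file_):
    """Write a dictionary to a JSON file (int keys become strings)."""
    with open(file_, "w") as f:
        json.dump(dct, f)


def load_dict_sketch(file_, key_type=int):
    """Read a JSON file and convert every key with key_type."""
    with open(file_) as f:
        raw = json.load(f)
    return {key_type(k): v for k, v in raw.items()}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "values.json")
    save_dict_sketch({0: 1.5, 1: 0.7}, path)
    as_int = load_dict_sketch(path)                # keys restored to int
    as_str = load_dict_sketch(path, key_type=str)  # keys left as strings
```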