Aggregating and weighting diverse data

In this notebook, we illustrate how to aggregate distances operating on diverse data into a single distance, and how to combine this with an adaptive scheme for computing weights.

Aggregating diverse distance functions

We want to combine different distance metrics, each operating on a subset of the data, into a single distance value. As a toy model, assume we want to combine a Laplace (absolute difference) and a Normal (squared difference) distance.

[ ]:
# install if not done yet
!pip install pyabc --quiet
[1]:
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np

import pyabc

# true parameters (used as reference values in the plots)
p_true = {'p0': 0, 'p1': 0}


def model(p):
    # two noisy summary statistics, both with noise on the same scale (0.1)
    return {
        's0': p['p0'] + 0.1 * np.random.normal(),
        's1': p['p1'] + 0.1 * np.random.normal(),
    }


observation = {'s0': 0, 's1': 0}


def distance0(x, x_0):
    # Laplace-like distance: absolute difference on s0
    return abs(x['s0'] - x_0['s0'])


def distance1(x, x_0):
    # Normal-like distance: squared difference on s1
    return (x['s1'] - x_0['s1']) ** 2


# prior
prior = pyabc.Distribution(
    p0=pyabc.RV("uniform", -1, 2), p1=pyabc.RV("uniform", -1, 2)
)

The key is now to use pyabc.distance.AggregatedDistance to combine both distances:

[2]:
distance = pyabc.AggregatedDistance([distance0, distance1])

abc = pyabc.ABCSMC(model, prior, distance)
db_path = "sqlite:///" + os.path.join(tempfile.gettempdir(), "tmp.db")
abc.new(db_path, observation)
history1 = abc.run(max_nr_populations=6)

# plotting


def plot_history(history):
    # one KDE figure per parameter, overlaying all generations
    for par in ['p0', 'p1']:
        fig, ax = plt.subplots()
        for t in range(history.max_t + 1):
            df, w = history.get_distribution(m=0, t=t)
            pyabc.visualization.plot_kde_1d(
                df,
                w,
                xmin=-1,
                xmax=1,
                x=par,
                ax=ax,
                label=f"PDF t={t}",
                refval=p_true,
            )
        ax.legend()


plot_history(history1)
ABC.Sampler INFO: Parallelize sampling on 4 processes.
ABC.History INFO: Start <ABCSMC id=1, start_time=2021-05-19 15:19:12>
ABC INFO: Calibration sample t = -1.
ABC INFO: t: 0, eps: 7.93828347e-01.
ABC INFO: Accepted: 100 / 227 = 4.4053e-01, ESS: 1.0000e+02.
ABC INFO: t: 1, eps: 5.47513600e-01.
ABC INFO: Accepted: 100 / 227 = 4.4053e-01, ESS: 9.3165e+01.
ABC INFO: t: 2, eps: 3.48865157e-01.
ABC INFO: Accepted: 100 / 192 = 5.2083e-01, ESS: 8.7158e+01.
ABC INFO: t: 3, eps: 2.17783825e-01.
ABC INFO: Accepted: 100 / 249 = 4.0161e-01, ESS: 8.8381e+01.
ABC INFO: t: 4, eps: 1.47444558e-01.
ABC INFO: Accepted: 100 / 340 = 2.9412e-01, ESS: 9.2480e+01.
ABC INFO: t: 5, eps: 8.91141994e-02.
ABC INFO: Accepted: 100 / 394 = 2.5381e-01, ESS: 7.4999e+01.
ABC INFO: Stop: Maximum number of generations.
ABC.History INFO: Done <ABCSMC id=1, duration=0:00:03.119590, end_time=2021-05-19 15:19:15>
[Figures: 1D posterior KDEs of p0 and p1 across generations]

Weighted aggregation

A problem with the above aggregation of distance functions is that they usually vary on different scales. To account for all of them in a similar manner, one can weight them (see the sketch below).

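If the scales are known beforehand, fixed weights can also be assigned manually. A minimal sketch, assuming the weights argument of pyabc.AggregatedDistance, with purely illustrative values:

[ ]:
# hypothetical fixed weights, down-weighting the second,
# larger-scale distance
distance = pyabc.AggregatedDistance(
    [distance0, distance1],
    weights=[1, 1e-4],
)
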
Let us look at a simple example of two summary statistics which vary on very different scales:

[3]:
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np

import pyabc

p_true = {'p0': 0, 'p1': 0}


def model(p):
    return {
        's0': p['p0'] + 0.1 * np.random.normal(),
        # s1 now varies on a much larger scale than s0
        's1': p['p1'] + 100 * np.random.normal(),
    }


observation = {'s0': 0, 's1': 0}


def distance0(x, x_0):
    return abs(x['s0'] - x_0['s0'])


def distance1(x, x_0):
    return (x['s1'] - x_0['s1']) ** 2


# prior
prior = pyabc.Distribution(
    p0=pyabc.RV("uniform", -1, 2), p1=pyabc.RV("uniform", -1, 2)
)

distance = pyabc.AggregatedDistance([distance0, distance1])

abc = pyabc.ABCSMC(model, prior, distance)
db_path = "sqlite:///" + os.path.join(tempfile.gettempdir(), "tmp.db")
abc.new(db_path, observation)
history1 = abc.run(max_nr_populations=6)

# plotting


def plot_history(history):
    # one KDE figure per parameter, overlaying all generations
    for par in ['p0', 'p1']:
        fig, ax = plt.subplots()
        for t in range(history.max_t + 1):
            df, w = history.get_distribution(m=0, t=t)
            pyabc.visualization.plot_kde_1d(
                df,
                w,
                xmin=-1,
                xmax=1,
                x=par,
                ax=ax,
                label=f"PDF t={t}",
                refval=p_true,
            )
        ax.legend()


plot_history(history1)
ABC.Sampler INFO: Parallelize sampling on 4 processes.
ABC.History INFO: Start <ABCSMC id=2, start_time=2021-05-19 15:19:32>
ABC INFO: Calibration sample t = -1.
ABC INFO: t: 0, eps: 4.59935137e+03.
ABC INFO: Accepted: 100 / 215 = 4.6512e-01, ESS: 1.0000e+02.
ABC INFO: t: 1, eps: 1.09029894e+03.
ABC INFO: Accepted: 100 / 421 = 2.3753e-01, ESS: 9.3548e+01.
ABC INFO: t: 2, eps: 2.14142711e+02.
ABC INFO: Accepted: 100 / 784 = 1.2755e-01, ESS: 9.1347e+01.
ABC INFO: t: 3, eps: 4.48406680e+01.
ABC INFO: Accepted: 100 / 1955 = 5.1151e-02, ESS: 7.7921e+01.
ABC INFO: t: 4, eps: 1.34184695e+01.
ABC INFO: Accepted: 100 / 3585 = 2.7894e-02, ESS: 7.8617e+01.
ABC INFO: t: 5, eps: 4.69471462e+00.
ABC INFO: Accepted: 100 / 6668 = 1.4997e-02, ESS: 9.5092e+01.
ABC INFO: Stop: Maximum number of generations.
ABC.History INFO: Done <ABCSMC id=2, duration=0:00:10.112188, end_time=2021-05-19 15:19:42>
[Figures: 1D posterior KDEs of p0 and p1 across generations]

The algorithm has problems extracting information on the first parameter from the first summary statistic, because the second summary statistic varies on a much larger scale and dominates the aggregated distance. Let us use pyabc.distance.AdaptiveAggregatedDistance instead, which tries to find good weights itself (and even adapts these weights over time):

[4]:
# prior
prior = pyabc.Distribution(
    p0=pyabc.RV("uniform", -1, 2), p1=pyabc.RV("uniform", -1, 2)
)

distance = pyabc.AdaptiveAggregatedDistance([distance0, distance1])

abc = pyabc.ABCSMC(model, prior, distance)
db_path = "sqlite:///" + os.path.join(tempfile.gettempdir(), "tmp.db")
abc.new(db_path, observation)
history2 = abc.run(max_nr_populations=6)

plot_history(history2)
ABC.Sampler INFO: Parallelize sampling on 4 processes.
ABC.History INFO: Start <ABCSMC id=3, start_time=2021-05-19 15:19:54>
ABC INFO: Calibration sample t = -1.
ABC.Population INFO: Recording also rejected particles: True
ABC INFO: t: 0, eps: 5.56804028e-01.
ABC INFO: Accepted: 100 / 215 = 4.6512e-01, ESS: 1.0000e+02.
ABC INFO: t: 1, eps: 3.16722119e-01.
ABC INFO: Accepted: 100 / 190 = 5.2632e-01, ESS: 9.2354e+01.
ABC INFO: t: 2, eps: 2.20958442e-01.
ABC INFO: Accepted: 100 / 346 = 2.8902e-01, ESS: 9.6911e+01.
ABC INFO: t: 3, eps: 1.61422674e-01.
ABC INFO: Accepted: 100 / 408 = 2.4510e-01, ESS: 8.1748e+01.
ABC INFO: t: 4, eps: 1.16683118e-01.
ABC INFO: Accepted: 100 / 587 = 1.7036e-01, ESS: 7.4133e+01.
ABC INFO: t: 5, eps: 6.81859941e-02.
ABC INFO: Accepted: 100 / 1256 = 7.9618e-02, ESS: 8.0627e+01.
ABC INFO: Stop: Maximum number of generations.
ABC.History INFO: Done <ABCSMC id=3, duration=0:00:04.460621, end_time=2021-05-19 15:19:59>
[Figures: 1D posterior KDEs of p0 and p1 across generations]

The result is much better. We can also calculate the weights only once, during the initial calibration, by setting adaptive=False:

[5]:
# prior
prior = pyabc.Distribution(
    p0=pyabc.RV("uniform", -1, 2), p1=pyabc.RV("uniform", -1, 2)
)

distance = pyabc.AdaptiveAggregatedDistance(
    [distance0, distance1], adaptive=False
)

abc = pyabc.ABCSMC(model, prior, distance)
db_path = "sqlite:///" + os.path.join(tempfile.gettempdir(), "tmp.db")
abc.new(db_path, observation)
history3 = abc.run(max_nr_populations=6)

plot_history(history3)
ABC.Sampler INFO: Parallelize sampling on 4 processes.
ABC.History INFO: Start <ABCSMC id=4, start_time=2021-05-19 15:20:40>
ABC INFO: Calibration sample t = -1.
ABC INFO: t: 0, eps: 5.66625277e-01.
ABC INFO: Accepted: 100 / 211 = 4.7393e-01, ESS: 1.0000e+02.
ABC INFO: t: 1, eps: 3.59666990e-01.
ABC INFO: Accepted: 100 / 219 = 4.5662e-01, ESS: 9.1798e+01.
ABC INFO: t: 2, eps: 2.23159544e-01.
ABC INFO: Accepted: 100 / 300 = 3.3333e-01, ESS: 9.4351e+01.
ABC INFO: t: 3, eps: 1.44789110e-01.
ABC INFO: Accepted: 100 / 336 = 2.9762e-01, ESS: 8.3898e+01.
ABC INFO: t: 4, eps: 9.83630433e-02.
ABC INFO: Accepted: 100 / 597 = 1.6750e-01, ESS: 8.1514e+01.
ABC INFO: t: 5, eps: 5.39954272e-02.
ABC INFO: Accepted: 100 / 1284 = 7.7882e-02, ESS: 6.7990e+01.
ABC INFO: Stop: Maximum number of generations.
ABC.History INFO: Done <ABCSMC id=4, duration=0:00:03.763636, end_time=2021-05-19 15:20:44>
[Figures: 1D posterior KDEs of p0 and p1 across generations]

Here, pre-calibration performs comparably to adaptation, because the weights do not change much over time. Note that one can also specify other scale functions by passing AdaptiveAggregatedDistance(distances, scale_function=...), e.g. pyabc.distance.mean, pyabc.distance.median, or pyabc.distance.span.

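A minimal sketch of passing a different scale function, based on the signature above (pyabc.distance.median being one of the options just named):

[ ]:
# use pyabc.distance.median as the scale function
# instead of the default one
distance = pyabc.AdaptiveAggregatedDistance(
    [distance0, distance1],
    scale_function=pyabc.distance.median,
)
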
The following plots demonstrate that, compared to the first run, the second and third runs not only achieve a much better posterior approximation after the same number of generations, but also do so with a much lower number of total samples.

[6]:
histories = [history1, history2, history3]
labels = ["Standard", "Adaptive", "Pre-Calibrated"]

pyabc.visualization.plot_sample_numbers(histories, labels, rotation=45)
pyabc.visualization.plot_total_sample_numbers(
    histories, labels, yscale='log10', rotation=45
)
pyabc.visualization.plot_effective_sample_sizes(histories, labels)
[6]:
<AxesSubplot:title={'center':'Effective sample size'}, xlabel='Population index', ylabel='ESS'>
[Figures: sample numbers per generation, total sample numbers (log scale), and effective sample sizes for the three runs]