Using R with pyABC

This example illustrates how to use models, summary statistics and distance functions defined in R. We’re assuming you’re already familiar with the basic workings of pyABC. If not, consult the other tutorial examples.

Download this notebook Using R with pyABC.

In this example, we’re introducing the new class R which is our interface with R. We use this class to to load an external R script.

In [1]:
%matplotlib inline
from pyabc.external import R

r = R("myRModel.R")
import
/home/emmanuel/CodeProjects/abc/pyabc/external.py:50: UserWarning: The support of the R language for ABC-SMC is considered experimental. The API might change in the future.
  warnings.warn("The support of the R language for ABC-SMC is "

Note

R("myRModel.R") does, amongst other things, the equivalent to R’s source("myRModel.R"). That is, the entire script is executed with all the possible side effects this might have.

You can download the file here: myRModel.R. But now, let’s have a look at it.

In [2]:
r.display_source_ipython()
Out[2]:
# Blue print for pyABC parameter inference runs.
# A proposal of how to structure an R file for usage with pyABC


#' The model to be simulated.
#' In this example, it is just a multivariate normal
#' distribution. The number of parameters depends on the
#' model and can be arbitrary. However, the parameters
#' should be real numbers.
#' The return type is arbitrary as well.
myModel <- function(pars){
  x <- rnorm(1) + pars$meanX
  y <- rnorm(1) + pars$meanY
  c(x,y)  # It is not important that it is a vector.
}

#' Calculates summary statistics from whatever the model returns
#' 
#' It is important that the summary statistics have names
#' to store them correctly in pyABC's database.
#' In many cases, the summary statistics function might just
#' pass through the result of the model function if the
#' summary statistics calculation is already
#' done there. Splitting summary statistics and the model
#' makes most sense in a model selection scenario.
#' 
#' @param modelResult The data simulated by the model
#' @return Named list of summary statistics.
mySummaryStatistics <- function(modelResult){
  list(x=modelResult[1],
       y=modelResult[2],
       mtcars=mtcars,  # Can also pass data frames
       iris=iris,
       arbitraryKey="Some random text")
}

#' Calculate distance between summary statistics
#' 
#' @param sumStatSample The summary statistics of the sample
#' proposed py pyABC
#' @param sumStatData The summary statistics of the observed
#'        data for which we want to calculate the posterior.
#' @return A single float
myDistance <- function(sumStatSample, sumStatData){
  sqrt((sumStatSample$x - sumStatData$x)^2
       + abs(sumStatSample$y - sumStatData$y)^2)
}


# We store the observed data as named list
# in a variable.
# This is passed by pyABC as to myDistance
# as the sumStatData argument
mySumStatData <- list(x=4, y=8, mtcars=mtcars, iris=iris)

# The functions for the model, the summary
# statistics and distance
# have to be constructed in
# such a way that the following always makes sense:
#
# myDistance(
#   mySummaryStatistics(myModel(list(meanX=1, meanY=2))),
#   mySummaryStatistics(myModel(list(meanX=2, meanY=2))))

We see that four relevant objects are defined in the file.

  • myModel
  • mySummaryStatistics (optional)
  • myDistance
  • mySumStatData

The names of these do not matter. The mySummaryStatistics is actually optional and can be omitted in case the model calculates the summary statistics directly. We load the defined functions using the r object:

In [3]:
model = r.model("myModel")
distance = r.distance("myDistance")
sum_stat = r.summary_statistics("mySummaryStatistics")

From there on, we can use them (almost) as if they were ordinary Python functions.

In [4]:
from pyabc import Distribution, RV, ABCSMC

prior = Distribution(meanX=RV("uniform", 0, 10),
                     meanY=RV("uniform", 0, 10))
abc = ABCSMC(model, prior, distance,
             summary_statistics=sum_stat)

We also load the observation with r.observation and pass it to a new ABC run.

In [5]:
import os
from tempfile import gettempdir

db = "sqlite:///" + os.path.join(gettempdir(), "test.db")
abc.new(db, r.observation("mySumStatData"))
INFO:Epsilon:initial epsilon is 4.724742302356306
INFO:History:Start <ABCSMC(id=1, start_time=2017-09-04 16:41:49.681454, end_time=None)>
Out[5]:
1

We start a run which terminates as soon as an acceptance threshold of 0.9 or less is reached or the maximum number of 4 populations is sampled.

In [6]:
history = abc.run(minimum_epsilon=0.9, max_nr_populations=4)
INFO:ABC:t:0 eps:4.72474230236
INFO:ABC:t:1 eps:3.0097425354809477
INFO:ABC:t:2 eps:1.8946693507506824
INFO:ABC:t:3 eps:1.3138505196997436
INFO:History:Done <ABCSMC(id=1, start_time=2017-09-04 16:41:49.681454, end_time=2017-09-04 16:41:54.056108)>

Lastly, we plot the results and observe how the generations contract slowly around the oserved value. (Note, that the contraction around the observed value is a particular property of the chosen example and not always the case.)

In [7]:
from pyabc.visualization import plot_kde_2d

for t in range(history.n_populations):
    df, w = abc.history.get_distribution(0, t)
    ax = plot_kde_2d(df, w, "meanX", "meanY",
                      xmin=0, xmax=10,
                      ymin=0, ymax=10,
                      numx=100, numy=100)
    ax.scatter([4], [8],
               edgecolor="black",
               facecolor="white",
               label="Observation");
    ax.legend();
    ax.set_title("PDF t={}".format(t))
../_images/examples_using_R_16_0.png
../_images/examples_using_R_16_1.png
../_images/examples_using_R_16_2.png
../_images/examples_using_R_16_3.png

And we can also retrieve summary statistics such as a stored DataFrame, although the DataFrame was acutally defined in R.

In [8]:
history.get_sum_stats(1, 0)[1][0]["iris"].head()
Out[8]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa

Dumping the results to a file format R can read

Although you could query pyABC’s database directly from R since the database is just a SQL database (e.g. SQLite), pyABC ships with a utility for facilitate export of the database. Use the abc-dump utility provided by pyABC to dump results to file formats such as csv, feather, html, json and others. These can be easiliy read in by R. See Exporting pyABC’s database for how to use this utility.

Assume you dumped to the feather format:

abc-export --db results.db --out exported.feather --format feather

You could read the results in with the following R snippet

install.packages("feather")
install.packages("jsonlite")

library("feather")
library("jsonlite")

loadedDf <- data.frame(feather("exported.feather"))

jsonStr <- loadedDf$sumstat_ss_df[1]

sumStatDf <- fromJSON(jsonStr)

If you prefer CSV over the feather format you can also do that.