Loading [MathJax]/jax/output/CommonHTML/jax.js

Generating distributions with the Stick Breaking version of the Dirichlet Process

Posted on 28 Jun 2018

The Dirichlet Process is a prior over distributions. The basic definition relays on measure theory and has some obscure definitions for most of people. Intuitevily, imagine we want to draw a distribution G over the line of real numbers using some prior distribution P. We say that P is a Dirichlet Process if, for any possible partition of the line of real numbers, (A1,...,Am), the probabilities over the partitions are distributed according to a Dirichlet distribution:

G(A1),...,G(Am)Dirichlet(α(A1),...,α(Ak))

There are different ways to generate a distribution G. One of the most popular is the Stick Breaking construction. The trick is to realize that

G=kπkδθk

where πk weights sum up to 1, and the deltas are distributed around the line of the real numbers according to some distribution. The stick breaking construction goes as follows:

  1. Take a stick of length 1 and make infinite random breaks. Each break removes a percentage vk of the remaining stick

    vkBeta(1,α)

    The length of the removed break is

    πk=vkk1i=1(1vi)

    where k1l=1(1vl) is the remaining length of the stick at step k. At the end of the process we have a collection of segments whose length sum is 1.

  2. Draw infinite samples from a base distribution G0:

    ϕkG0

    where G0 can be any distribution such as a Gaussian, a Poisson or a Multinomial.

  3. We build our distribution as

    G=k=1πkδϕk

Here is the R code to sample from a Dirichlet Process using the Stick Breaking construction:

library(dplyr)
library(ggplot2)

#'@title Stick Breaking construction of a Dirichlet Process
#'@param n number of samples
#'@param alpha concentration parameter
#'@param G0 base measure
DP_stick_breaking <- function(n, alpha, G0) {

  # Stick breaking
  v  <- rbeta(n, 1, alpha)
  
  pi <- numeric(n)
  pi[1] <- v[1] 
  pi[2:n] <- sapply(2:n, function(i) v[i] * prod(1 - v[1:(i-1)]))
  
  # Atoms from the base measure
  y <- G0(n) 
  return(list(atoms = y,
              pis = pi))
  # Sample from the Dirichlet Process G
  #theta <- sample(y, prob = pi, replace = TRUE)
  #return(theta)
}


# Choose a base distribution
G0 <- function(n) rpois(n, 7)       # Poisson
G0 <- function(n) rnorm(n, 0, 1)   # Gaussian

# Concentration parameter around the base distribution
alphas <- c(1, 10, 100, 1000)
n <- 1000
df.G <- as.data.frame(matrix(NA, 0, 4))
names(df.G) <- c("xp", "i", "atom", "pi")

df.samples <- as.data.frame(matrix(NA, 0, 4))
names(df.samples) <- c("xp", "i", "alpha", "samples")

for(alpha in alphas){
  for(xp in 1:3){
    cat("\n", alpha)
    
    # Generate a distribution from the DP(alpha, G0)
    G <- DP_stick_breaking(n, alpha, G0)
    df.G <- bind_rows(df.G, 
                      list(xp = rep(xp, n), 
                           alpha= rep(alpha, n), 
                           i=1:n,
                           atom = G$atoms,
                           pi=G$pis))
    
    # Generate samples from the DP
    samples <- sample(G$atoms, prob = G$pis, replace = TRUE)
    df.samples <- bind_rows(df.samples, 
                            list(xp = rep(xp, n), 
                                 alpha= rep(alpha, n), 
                                 i=1:n, 
                                 sample=samples))
  }
}

# Plot the distributon drawn by the DP
df.G$alpha <- factor(df.G$alpha)
p <- ggplot(df.G, aes(x=atom, y=pi)) + 
  geom_col(width=0.01)+
  facet_grid(alpha ~ xp, scales = "free") +
  theme_bw()
print(p)
ggsave(p, filename = paste0("fig15_DP.png"), 
       height=16, width=16, units='cm')

# Plot samples from the distribution
df.samples$alpha <- factor(df.samples$alpha)
p <- ggplot(df.samples, aes(x=sample)) + 
     #geom_density() +
     #geom_histogram(bins=100, aes(y=..count../sum(..count..))) +
     geom_histogram(bins=1000) + 
     facet_grid(alpha ~ xp, scales = "free") +
     theme_bw()
print(p)
ggsave(p, filename = paste0("fig16_DPsamples.png"), 
       height=16, width=16, units='cm')

And here are the results of Dirichlet Process G generated with the Stick Breaking and different concentration α, and samples obtained from these G:

Fig.1 - (Left) Distributions drawn from a Dirichlet Processes with different concentration parameters. (Right) Samples from these distributions. Columns represent repetitions of the experiment.