Posted on 28 Jun 2018
The Dirichlet Process is a prior over distributions. The basic definition relays on measure theory and has some obscure definitions for most of people. Intuitevily, imagine we want to draw a distribution G over the line of real numbers using some prior distribution P. We say that P is a Dirichlet Process if, for any possible partition of the line of real numbers, (A1,...,Am), the probabilities over the partitions are distributed according to a Dirichlet distribution:
G(A1),...,G(Am)∼Dirichlet(α(A1),...,α(Ak))There are different ways to generate a distribution G. One of the most popular is the Stick Breaking construction. The trick is to realize that
G=∑kπkδθkwhere πk weights sum up to 1, and the deltas are distributed around the line of the real numbers according to some distribution. The stick breaking construction goes as follows:
Take a stick of length 1 and make infinite random breaks. Each break removes a percentage vk of the remaining stick
vk∼Beta(1,α)The length of the removed break is
πk=vkk−1∏i=1(1−vi)where ∏k−1l=1(1−vl) is the remaining length of the stick at step k. At the end of the process we have a collection of segments whose length sum is 1.
Draw infinite samples from a base distribution G0:
ϕk∼G0where G0 can be any distribution such as a Gaussian, a Poisson or a Multinomial.
We build our distribution as
G=∞∑k=1πkδϕkHere is the R code to sample from a Dirichlet Process using the Stick Breaking construction:
library(dplyr)
library(ggplot2)
#'@title Stick Breaking construction of a Dirichlet Process
#'@param n number of samples
#'@param alpha concentration parameter
#'@param G0 base measure
DP_stick_breaking <- function(n, alpha, G0) {
# Stick breaking
v <- rbeta(n, 1, alpha)
pi <- numeric(n)
pi[1] <- v[1]
pi[2:n] <- sapply(2:n, function(i) v[i] * prod(1 - v[1:(i-1)]))
# Atoms from the base measure
y <- G0(n)
return(list(atoms = y,
pis = pi))
# Sample from the Dirichlet Process G
#theta <- sample(y, prob = pi, replace = TRUE)
#return(theta)
}
# Choose a base distribution
G0 <- function(n) rpois(n, 7) # Poisson
G0 <- function(n) rnorm(n, 0, 1) # Gaussian
# Concentration parameter around the base distribution
alphas <- c(1, 10, 100, 1000)
n <- 1000
df.G <- as.data.frame(matrix(NA, 0, 4))
names(df.G) <- c("xp", "i", "atom", "pi")
df.samples <- as.data.frame(matrix(NA, 0, 4))
names(df.samples) <- c("xp", "i", "alpha", "samples")
for(alpha in alphas){
for(xp in 1:3){
cat("\n", alpha)
# Generate a distribution from the DP(alpha, G0)
G <- DP_stick_breaking(n, alpha, G0)
df.G <- bind_rows(df.G,
list(xp = rep(xp, n),
alpha= rep(alpha, n),
i=1:n,
atom = G$atoms,
pi=G$pis))
# Generate samples from the DP
samples <- sample(G$atoms, prob = G$pis, replace = TRUE)
df.samples <- bind_rows(df.samples,
list(xp = rep(xp, n),
alpha= rep(alpha, n),
i=1:n,
sample=samples))
}
}
# Plot the distributon drawn by the DP
df.G$alpha <- factor(df.G$alpha)
p <- ggplot(df.G, aes(x=atom, y=pi)) +
geom_col(width=0.01)+
facet_grid(alpha ~ xp, scales = "free") +
theme_bw()
print(p)
ggsave(p, filename = paste0("fig15_DP.png"),
height=16, width=16, units='cm')
# Plot samples from the distribution
df.samples$alpha <- factor(df.samples$alpha)
p <- ggplot(df.samples, aes(x=sample)) +
#geom_density() +
#geom_histogram(bins=100, aes(y=..count../sum(..count..))) +
geom_histogram(bins=1000) +
facet_grid(alpha ~ xp, scales = "free") +
theme_bw()
print(p)
ggsave(p, filename = paste0("fig16_DPsamples.png"),
height=16, width=16, units='cm')
And here are the results of Dirichlet Process G generated with the Stick Breaking and different concentration α, and samples obtained from these G: