Total Variation Distance Between a Continuous and Discrete Distributions

Question

My answer is about how to calculate this using base R.

We have two multinomial parameter vectors, θ and η. The total variation distance is equivalent to P_θ(E) - P_η(E), where E={ω | P_θ({ω})>P_η({ω})}, and ω is a vector of sample counts.

I know of two ways to evaluate P(E) in base R. One is a very simple simulation-based method. The other reframes the problem in terms of a linear combination of the counts, which is approximately normally distributed, and uses the pnorm function.

Simulation-based method

You simulate samples from each distribution, check whether they're in E using the probability mass functions, and count how often they are. I'll go through an example here. We'll assume the true distribution from your question:

          unnormalized.true <- c(5,7,10,8,14,10,15,12,10,9) true <- unnormalized.true / sum(unnormalized.true)

We'll draw a sample and estimate a new distribution using a Bayes estimator:

          set.seed(921) result <- as.vector(rmultinom(1, size = 30, prob = true)) result ##  [1] 3 6 2 0 5 3 3 4 1 3 dirichlet <- (result+1)/(30+length(true))

Calculating the probability of E under the true distribution:

          set.seed(939) true.dist <- rmultinom(10^6, 30, true) p.true.e <- mean(apply(true.dist, 2, function(x)                  dmultinom(x, 30, true) - dmultinom(x, 30, dirichlet) > 0))

Calculating the probability of E under the estimated distribution from the Bayes estimator:

          dirichlet.dist <- rmultinom(10^6, 30, dirichlet) p.dirichlet.e <- mean(apply(dirichlet.dist, 2, function(x)                  dmultinom(x, 30, true) - dmultinom(x, 30, dirichlet) > 0))

And we can subtract to get the total variation distance.

          p.true.e - p.dirichlet.e ## [1] 0.83737

Repeating this with the maximum likelihood estimate, we get a comparison of the estimators.

          mle <- result/30 mle.dist <- rmultinom(10^6, 30, mle) p.true.e2 <- mean(apply(true.dist, 2, function(x)   dmultinom(x, 30, true) - dmultinom(x, 30, mle) > 0)) p.mle.e2 <- mean(apply(mle.dist, 2, function(x)   dmultinom(x, 30, true) - dmultinom(x, 30, mle) > 0)) p.true.e2 - p.mle.e2 ## [1] 0.968301

(edited to fix a serious mistake. Previously I had re-used p.true.e in the comparison with the MLE. I forgot that the event E is defined in terms of the estimated distribution.)

Normal approximation

I think this method is actually more accurate than the simulation based method, despite the normal approximation. As you'll see, we're not taking a normal approximation to the multinomial counts, which would be unlikely to be accurate for n=30. We're taking a normal approximation to a linear combination of these counts, which is close to normal. The weakness of this method will turn out to be that it can't handle zero probabilities in the estimated distribution. That's a real problem, since handling zeros gracefully is, to me, part of the point of using total variation distance rather than Kullback-Leibler divergence. But here it is.

The following derivation yields a restatement of E:

deriv1

deriv2

deriv3

deriv4

Define

ldef

where N_i is one cell of the multinomial sample, and

lambdadef

Then, E is the event that L>0.

The reason we have a problem with a zero probability is that it causes one of the λ_i's to be infinite.

I want to verify that L is close to normally distributed, in the example from before. I'll do that by getting samples from the distribution of L, using the previous multinomial simulation:

          lambda <- log(true/dirichlet) L.true.dist <- apply(true.dist, 2, function(x) sum(lambda*x)) L.dirichlet.dist <- apply(dirichlet.dist, 2, function(x) sum(lambda*x))

Note that I'm doing the comparison between the true distribution and the Bayes estimated distribution. I can't do the one with the MLE, because my sample had a zero count.

Plotting the distribution of L and comparing to a normal fit:

          par(mfrow=c(1,2)) L.true.dist.hist <- hist(L.true.dist) L.true.dist.fit <- function(x)   length(L.true.dist) * diff(L.true.dist.hist$breaks)[1] *   dnorm(x, mean(L.true.dist), sd=sd(L.true.dist)) curve(L.true.dist.fit, add=TRUE, n=1000, col='red') L.dirichlet.dist.hist <- hist(L.dirichlet.dist) L.dirichlet.dist.fit <- function(x)   length(L.dirichlet.dist) * diff(L.dirichlet.dist.hist$breaks)[1] *   dnorm(x, mean(L.dirichlet.dist), sd=sd(L.dirichlet.dist)) curve(L.dirichlet.dist.fit, add=TRUE, n=1000, col='red') par(mfrow=c(1,1))

histograms

The distribution of L appears normal. So, instead of using simulations, we can just use pnorm. However, we need to calculate the mean and standard deviation of L. This can be done as follows.

The mean of L is

expectedvalue

where p_i is the cell probability of cell i in the distribution p. The variance is

variance

where

sigma

is the covariance matrix of the multinomial distribution. I'll calculate these moments for this example, and check them against the empirical moments in the simulation. First, for the distribution of L under the true distribution:

          n <- 30 k <- length(true) mean.L.true <- sum(lambda * n * true) # Did we get the mean right? c(mean.L.true, mean(L.true.dist)) ## [1] 3.873509 3.875547 # Covariance matrix assuming the true distribution sigma.true <- outer(1:k, 1:k, function(i,j)   ifelse(i==j, n*true[i]*(1-true[i]), -n*true[i]*true[j])) var.L.true <- t(lambda) %*% sigma.true %*% lambda # Did we get the standard deviation right? c(sqrt(var.L.true), sd(L.true.dist)) ## [1] 2.777787 2.776945

Then, the mean and variance of L under the Bayes estimate of the distribution:

          mean.L.dirichlet <- sum(lambda * n * dirichlet) # Did we get the mean right? c(mean.L.dirichlet, mean(L.dirichlet.dist)) ## [1] -3.893836 -3.895983 # Covariance matrix assuming the estimated distribution sigma.dirichlet <- outer(1:k, 1:k, function(i,j)   ifelse(i==j, n*dirichlet[i]*(1-dirichlet[i]), -n*dirichlet[i]*dirichlet[j])) var.L.dirichlet <- t(lambda) %*% sigma.dirichlet %*% lambda # Did we get the standard deviation right? c(sqrt(var.L.dirichlet), sd(L.dirichlet.dist)) ## [1] 2.796348 2.793421

With these in hand, we can calculate the total variation distance with pnorm:

          pnorm(0, mean.L.true, sd=sqrt(var.L.true), lower.tail=FALSE) -   pnorm(0, mean.L.dirichlet, sd=sqrt(var.L.true), lower.tail=FALSE) ## [1] 0.8379193 # Previous result was 0.83737

We get three digits of agreement with the simulation.

I don't know of any easy way to extend the normal approximation method to handle zero probabilities, though. I had an idea, but I got stuck trying to calculate the covariance matrix of the counts conditional on a specific cell having 0 count. I could share my progress if you think you could make something of it.

Swan Criess

Total Variation Distance Between a Continuous and Discrete Distributions

Simulation-based method

Normal approximation

Post a Comment for "Total Variation Distance Between a Continuous and Discrete Distributions"