About score matching

Aapo Hyvärinen suggested a method to estimate non-normalized statistical models by the means of score matching. There is an alternative way to derive his objective function.

We start with absolutely continuous positive in the domain D probability density function \frac{1}{Z(\theta)} u(x;\theta), where Z(\theta)= \int_D u(x;\theta) is a partition function, and u(x;\theta) is a potential. Then for any good enough probing function f(x) with f(x)p(x)=0 on the boundary of the domain (\partial D) the expected value for \frac{\partial f(x) }{ \partial x_i } by integration by parts is

\left< \frac{\partial f(x) }{ \partial x_i } \right> = \int_D \frac{\partial f(x) }{ \partial x_i } p(x) dx = \left. f(x)p(x) \right|_{\partial D} - \int_D f(x) \frac{\partial p(x) }{ \partial x_i } dx = - \int_D f(x) \frac{\partial p(x) }{ \partial x_i } \frac{p(x)}{p(x)} dx = - \int_D f(x) \frac{\partial \log p(x) }{ \partial x_i } p(x) dx = -\left< f(x) \frac{\partial \log p(x) }{ \partial x_i } \right>

In other words, for any good enough probability density function, and for any food enough probing function f(x) the following identity holds.

(1) \left< \frac{\partial f(x) }{ \partial x_i } \right> + \left< f(x) \frac{\partial \log p(x) }{ \partial x_i } \right> = 0 .

Now choosing the set of probing functions f_i(x)= \frac{1}{2} \frac{ \partial u(x;\theta) }{ \partial x_i } and summing up identities (1) for all probing function we get

2 \sum \limits_i \left< \frac{1}{2} \frac{\partial^2 u(x;\theta) }{ \partial x_i^2 } +\left( \frac{1}{2} \frac{\partial u(x;\theta) }{ \partial x_i } \right)^2 \right> = \sum \limits_i \left< \frac{\partial^2 u(x;\theta) }{ \partial x_i^2 } + \frac{1}{2} \left( \frac{\partial u(x;\theta) }{ \partial x_i } \right)^2 \right>

i.e. Aapo cost function. In other words, my interpretation of score matching is a particular way to impose identity (1) for the distribution given a set of data points via sampling mean.

Now consider exponential family of distributions, i.e. u(x;\theta)= \sum \limits_k \theta_k \phi_k(x) than the cost function reads

\sum_i \left< \vec\theta^T \frac{\partial^2 \vec\phi(x)}{ \partial x_i^2 }  + \frac{1}{2} \left( \frac{\partial \vec\phi(x)}{ \partial x_i } \vec\theta \right)^2 \right> =  \sum_i \left< \vec\theta^T \frac{\partial^2 \vec\phi(x)}{ \partial x_i^2 }  + \frac{1}{2} \vec\theta^T \frac{\partial \vec\phi(x)}{ \partial x_i } \frac{\partial \vec\phi(x)}{ \partial x_i }^T \vec\theta \right>

That will fail miserably when \left< \frac{\partial \vec\phi(x)}{ \partial x_i } \frac{\partial \vec\phi(x)}{ \partial x_i }^T \right> is ill conditioned for any realistic sample sizes. Moreover, if \left< \frac{\partial \vec\phi(x)}{ \partial x_i } \frac{\partial \vec\phi(x)}{ \partial x_i }^T \right> is singular the problem is ill defined, and one need to do some kind of regularization to obtain \theta, i.e. biasing estimated distribution.

Posted in Uncategorized | Leave a comment

On the density estimation

For probability space (\Omega, \mathcal{F}, \mathbb{P}) with A \in \mathcal{F} the indicator random variable

{\bf 1}_A : \Omega \rightarrow \mathbb{R} = \left\{ \begin{array}{cc} 1, & \omega \in A \\ 0, & \omega \notin A \end{array} \right.

Than expected value of the indicator variable is the probability of the event \omega \in A .

E( {\bf 1}_A )= \int_{\Omega} {\bf 1}_A(\omega) d \mathbb{P} = P(A).

Next we are interested in real valued random variable X:\Omega \rightarrow \mathbb{R}. We are now interested to find, assuming absolutely continuous measure \mathbb{P} , the probability density p(X), given a finite set of observations X_i= X(\omega_i),\ i=1..N .

P(A)= E( {\bf 1}_A )= \int_{\Omega} {\bf 1}_A(\omega) d \mathbb{P} = \int_{ \mathbb{R} } {\bf 1}_A (x) p(x) dx = \int_{ \mathbb{R}^2 } {\bf 1}_A (x) p(x) \delta(x-y) {\bf 1}_A (y) dx dy = \int_{ \mathbb{R}^2 } {\bf 1}_A (x) K(x, y) {\bf 1}_A (y) dx dy ,

where \delta(x-y) is Dirac delta function, and K(x, y)= p(x) \delta(x-y).

Next we want another complication – a resolution of identity, e.g. to represent all indicator function in some basis, e.g {\bf 1}_A (x) = \sum_i | i > < i | {\bf 1}_A (x)

We have

P(A)= \sum_{i,k} \int_{ \mathbb{R}^2 } {\bf 1}_A (x) | i > < i | K(x, y) | k > < k | {\bf 1}_A(y) dx dy= \sum_{i,k} \int_{ \mathbb{R}^2 } {\bf 1}_A (x) | i > \rho < k|{\bf 1}_A (y) dx dy ,

where \rho is the density matrix by the analogy with quantum mechanics.

To be concrete we can use the real valued Hermite functions \psi_n(x) as a complete basis set. Here we have \sum \limits_{n=0}^\infty \psi_n(z) \psi_n(z') = \delta( z-z') , so
P(A)= \sum_{i,k} \int_{ \mathbb{R}^2 } {\bf 1}_A (x) | i > \rho < k|{\bf 1}_A (y) dx dy =
\sum_{i,k} \int_{ \mathbb{R}^4 } {\bf 1}_A (x) \psi_i( x ) \psi_i( z_1 ) p(z_1) \delta( z_1 - z_2 ) \psi_k( z_2) \psi_k( y ) {\bf 1}_A (y) dx dy dz_1 dz_2 .

The density matrix elements \rho_{i,k}= \int_{ \mathbb{R}^2 } \psi_i( z_1 ) p(z_1) \delta( z_1 - z_2 ) \psi_k( z_2) dz_1 dz_2 = \int_{ \mathbb{R} } \psi_i( z ) p(z) \psi_k( z) dz. The density p(x)= \sum_{i,k} \psi_i( x ) \rho_{i,k} \psi_k( x ).

In order to estimate density one need to use samples to estimate density matrix, e.g. replace the integral with sample mean \hat {\rho}_{i,k} = \frac{1}{N} \psi_i( X_i ) \psi_k( X_i ) . In terms of asymptotic behavior, each element of the density matrix will converge to the true value (asymptotically unbiased), and the variance of the expected value decay as \frac{1}{N} . Think of it as a Monte-Carlo integration.

Comments.

  1. This seems (since no theorems are proven here) to be more efficient than the kernel method of density estimation.
  2. The truncated version of Hermite functions will lead to systematic bias. Since Hermite functions are the basis function of continuous Fourier transform, the bias will consist of truncating the bandwidth of probability density function.
  3. The complexity is in the complexity of matrix computation, e.g. N*M^2, where M is the number of Hermite functions used, compared to just N for kernel method.
  4. The basis set does not have to be real, and if complex resemble quantum mechanical description of reality, where \psi_k(x) represent an experimental setup, and p(x) “classical” probabilities.
  5. Depending on the M/N ratio one will get either smooth or sharp approximation of the density.
  6. The mothod allow to compare different distributions on the same basis set.
  7. There are too many details need to be filled in to become publication quality. This will take unreasonable long time for me, so if anyone sees rational grain in it and can use as the basis, please go ahead.
  8. The method works in practice. It has some additional information, not mentioned here.
Posted in Uncategorized | Leave a comment

Counterexample

There is an explicit counterexample to my proposal for polynomial complexity solution to max-cut problem (see The P-versus-NP page entry #63 ):

The quartic polynomial \sum_{i=1}^5 (x_i^2-1)^2 + ( \sum x_i )^2 -.1 is positive definite, e.g > 0 for all x, but cannot be represented as a sum of squares. On the other hand, this polynomial represent fully connected graph. The polynomial is due to Pablo Parrilo via Amir Ali Ahmadi.

Comments:
1. My understanding of it – there are too few degrees of freedom in sums of squares to capture general behavior of quartic polynomials.

2. arxive.org is taken seriously by some people.

 

Posted in Uncategorized | Leave a comment