## About score matching

Aapo Hyvärinen suggested a method to estimate non-normalized statistical models by means of score matching. Here is an alternative way to derive his objective function.

We start with a probability density function $p(x;\theta) = \frac{1}{Z(\theta)} e^{u(x;\theta)}$, absolutely continuous and positive on the domain $D$, where $Z(\theta)= \int_D e^{u(x;\theta)}\,dx$ is the partition function and $u(x;\theta)$ is the log-potential, so that $\frac{\partial \log p(x;\theta)}{\partial x_i} = \frac{\partial u(x;\theta)}{\partial x_i}$ does not involve $Z(\theta)$. Then for any sufficiently well-behaved probing function $f(x)$ with $f(x)p(x)=0$ on the boundary $\partial D$ of the domain, integration by parts gives the expected value of $\frac{\partial f(x) }{ \partial x_i }$:

$\left< \frac{\partial f(x) }{ \partial x_i } \right> =$ $\int_D \frac{\partial f(x) }{ \partial x_i } p(x)\, dx =$ $\left. f(x)p(x) \right|_{\partial D} - \int_D f(x) \frac{\partial p(x) }{ \partial x_i }\, dx =$ $- \int_D f(x) \frac{\partial p(x) }{ \partial x_i } \frac{p(x)}{p(x)}\, dx =$ $- \int_D f(x) \frac{\partial \log p(x) }{ \partial x_i } p(x)\, dx =$ $-\left< f(x) \frac{\partial \log p(x) }{ \partial x_i } \right>$,

where the boundary term vanishes by the assumption on $f$.

In other words, for any good enough probability density function $p(x)$ and any good enough probing function $f(x)$, the following identity holds:

(1) $\left< \frac{\partial f(x) }{ \partial x_i } \right> + \left< f(x) \frac{\partial \log p(x) }{ \partial x_i } \right> = 0$.
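Identity (1) is easy to check by Monte Carlo. A minimal sketch, assuming (my choices, for illustration only) a standard normal density, whose score is $\frac{\partial \log p}{\partial x} = -x$, and the probing function $f(x)=\sin x$, which satisfies the boundary condition since $f(x)p(x) \to 0$ at $\pm\infty$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)  # samples from p = N(0, 1)

# probing function f(x) = sin(x), so df/dx = cos(x)
lhs = np.cos(x).mean()            # < df/dx >
rhs = (np.sin(x) * (-x)).mean()   # < f(x) * d(log p)/dx >, score of N(0,1) is -x

print(lhs + rhs)  # ≈ 0, as identity (1) predicts
```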

Now, choosing the set of probing functions $f_i(x)= \frac{1}{2} \frac{ \partial u(x;\theta) }{ \partial x_i }$, substituting $f_i(x)$ itself for the unknown data score $\frac{\partial \log p(x) }{ \partial x_i }$ in each identity (1), and summing over $i$, we get (after multiplying by 2)

$2 \sum \limits_i \left< \frac{1}{2} \frac{\partial^2 u(x;\theta) }{ \partial x_i^2 } +\left( \frac{1}{2} \frac{\partial u(x;\theta) }{ \partial x_i } \right)^2 \right> = \sum \limits_i \left< \frac{\partial^2 u(x;\theta) }{ \partial x_i^2 } + \frac{1}{2} \left( \frac{\partial u(x;\theta) }{ \partial x_i } \right)^2 \right>$

i.e. Aapo's cost function. In other words, my interpretation of score matching is that it is a particular way to impose identity (1) on the distribution, given a set of data points, via the sample mean.
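To see the cost function at work, here is a toy sketch (my example, not from Hyvärinen's paper): fitting the precision $\theta$ of a zero-mean Gaussian with log-potential $u(x;\theta) = -\theta x^2/2$. The cost $\left< u'' + \frac{1}{2}(u')^2 \right> = -\theta + \frac{1}{2}\theta^2 \left< x^2 \right>$ is minimized in closed form by $\hat\theta = 1/\left< x^2 \right>$:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
x = rng.normal(0.0, sigma, 500_000)  # data with true precision 1/sigma^2 = 0.25

# model: u(x; theta) = -theta * x^2 / 2, i.e. p ∝ exp(-theta x^2 / 2)
# cost J(theta) = < u'' + (1/2) u'^2 > = -theta + (1/2) theta^2 <x^2>
# dJ/dtheta = 0  =>  theta_hat = 1 / <x^2>
theta_hat = 1.0 / np.mean(x**2)

print(theta_hat)  # ≈ 1/sigma^2 = 0.25
```

Note that the estimate is obtained without ever computing the partition function, which is the point of score matching.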

Now consider an exponential family of distributions, i.e. $u(x;\theta)= \sum \limits_k \theta_k \phi_k(x)$; then the cost function reads

$\sum_i \left< \vec\theta^T \frac{\partial^2 \vec\phi(x)}{ \partial x_i^2 } + \frac{1}{2} \left( \vec\theta^T \frac{\partial \vec\phi(x)}{ \partial x_i } \right)^2 \right> = \sum_i \left< \vec\theta^T \frac{\partial^2 \vec\phi(x)}{ \partial x_i^2 } + \frac{1}{2} \vec\theta^T \frac{\partial \vec\phi(x)}{ \partial x_i } \frac{\partial \vec\phi(x)}{ \partial x_i }^T \vec\theta \right>$

This cost is quadratic in $\vec\theta$, so its minimizer solves the linear system $\left( \sum_i \left< \frac{\partial \vec\phi(x)}{ \partial x_i } \frac{\partial \vec\phi(x)}{ \partial x_i }^T \right> \right) \vec\theta = -\sum_i \left< \frac{\partial^2 \vec\phi(x)}{ \partial x_i^2 } \right>$. That will fail miserably when the matrix $\sum_i \left< \frac{\partial \vec\phi(x)}{ \partial x_i } \frac{\partial \vec\phi(x)}{ \partial x_i }^T \right>$ is ill-conditioned, as it is for any realistic sample size. Moreover, if this matrix is singular, the problem is ill-defined, and one needs some kind of regularization to obtain $\vec\theta$, i.e. one biases the estimated distribution.
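For illustration (my choice of statistics, not from the note), take $\vec\phi(x) = (x,\, -x^2/2)$, so $u(x;\theta) = \theta_1 x - \theta_2 x^2/2$ is a Gaussian log-potential with $\theta_1 = \mu/\sigma^2$ and $\theta_2 = 1/\sigma^2$. The sketch below solves the linear system above by sample means, and also shows a ridge term as one example of the regularization (and hence bias) mentioned:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, 500_000)

# hypothetical sufficient statistics: phi(x) = (x, -x^2/2)
dphi  = np.stack([np.ones_like(x), -x])                  # d phi_k / dx,  shape (2, n)
d2phi = np.stack([np.zeros_like(x), -np.ones_like(x)])   # d^2 phi_k / dx^2

# quadratic cost  theta^T b + (1/2) theta^T M theta, minimized where M theta = -b
M = dphi @ dphi.T / x.size   # < phi' phi'^T >
b = d2phi.mean(axis=1)       # < phi'' >
theta_hat = np.linalg.solve(M, -b)

print(theta_hat)  # ≈ [mu/sigma^2, 1/sigma^2] = [0.25, 0.25]

# if M is ill-conditioned or singular, a ridge term stabilizes the solve
# at the price of biasing the estimate:
theta_ridge = np.linalg.solve(M + 1e-6 * np.eye(2), -b)
```

In this well-conditioned two-feature example the plain solve works; with many correlated features $\phi_k$, `np.linalg.cond(M)` grows quickly and the regularized solve becomes unavoidable.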