About score matching


Aapo Hyvärinen suggested a method to estimate non-normalized statistical models by means of score matching. There is an alternative way to derive his objective function.

We start with an absolutely continuous probability density function p(x;\theta) = \frac{1}{Z(\theta)} e^{u(x;\theta)}, positive on the domain D, where Z(\theta)= \int_D e^{u(x;\theta)} dx is the partition function and u(x;\theta) is a potential. Then for any good enough probing function f(x) with f(x)p(x)=0 on the boundary of the domain (\partial D), the expected value of \frac{\partial f(x) }{ \partial x_i } is, by integration by parts,

\left< \frac{\partial f(x) }{ \partial x_i } \right> = \int_D \frac{\partial f(x) }{ \partial x_i } p(x) dx = \left. f(x)p(x) \right|_{\partial D} - \int_D f(x) \frac{\partial p(x) }{ \partial x_i } dx = - \int_D f(x) \frac{\partial p(x) }{ \partial x_i } \frac{p(x)}{p(x)} dx = - \int_D f(x) \frac{\partial \log p(x) }{ \partial x_i } p(x) dx = -\left< f(x) \frac{\partial \log p(x) }{ \partial x_i } \right>

In other words, for any good enough probability density function and any good enough probing function f(x), the following identity holds.

(1) \left< \frac{\partial f(x) }{ \partial x_i } \right> + \left< f(x) \frac{\partial \log p(x) }{ \partial x_i } \right> = 0 .
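As a quick numerical sanity check of (1), here is a minimal Monte Carlo sketch; the standard normal density and the probing function f(x) = \sin(x) are my own illustrative choices, not part of the derivation above.

```python
import numpy as np

# Monte Carlo check of identity (1) in one dimension for a standard
# normal p(x), whose score is d log p / dx = -x, and the probing
# function f(x) = sin(x); f(x)p(x) -> 0 at the boundary (+/- infinity),
# as the derivation requires.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

term1 = np.cos(x).mean()         # < df/dx >
term2 = (np.sin(x) * -x).mean()  # < f(x) d log p / dx >
print(term1 + term2)             # ~ 0 up to Monte Carlo error
```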

Now choose the set of probing functions f_i(x)= \frac{1}{2} \frac{ \partial u(x;\theta) }{ \partial x_i }. The data score \frac{\partial \log p(x)}{\partial x_i} in (1) is unknown, but (1) lets us rewrite, for each i,

\left< f_i^2 \right> + \left< \frac{\partial f_i}{\partial x_i} \right> = \left< f_i^2 \right> - \left< f_i \frac{\partial \log p(x)}{\partial x_i} \right> = \left< \left( f_i - \frac{1}{2} \frac{\partial \log p(x)}{\partial x_i} \right)^2 \right> - \frac{1}{4} \left< \left( \frac{\partial \log p(x)}{\partial x_i} \right)^2 \right>,

which, up to a constant independent of \theta, is minimized when f_i is as close as possible (in mean square under p) to half the data score, i.e. when \frac{\partial u(x;\theta)}{\partial x_i} matches \frac{\partial \log p(x)}{\partial x_i}. Summing these quantities over i and multiplying by 2 we get

2 \sum \limits_i \left< \frac{1}{2} \frac{\partial^2 u(x;\theta) }{ \partial x_i^2 } +\left( \frac{1}{2} \frac{\partial u(x;\theta) }{ \partial x_i } \right)^2 \right> = \sum \limits_i \left< \frac{\partial^2 u(x;\theta) }{ \partial x_i^2 } + \frac{1}{2} \left( \frac{\partial u(x;\theta) }{ \partial x_i } \right)^2 \right>

i.e. Aapo's cost function. In other words, my interpretation of score matching is a particular way of imposing identity (1) on the distribution, given a set of data points, via the sample mean.
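As a toy illustration of the recipe (my own example, not from Aapo's paper): take u(x;\theta) = -\theta x^2 / 2, an unnormalized zero-mean Gaussian with precision \theta. The cost above becomes \left< -\theta + \frac{1}{2}\theta^2 x^2 \right>, minimized in closed form at \hat\theta = 1/\left< x^2 \right>, the usual precision estimate.

```python
import numpy as np

# Toy check of the cost function: u(x; theta) = -theta * x**2 / 2, so
# du/dx = -theta * x and d2u/dx2 = -theta.  The sample cost
#   J(theta) = mean(-theta + 0.5 * theta**2 * x**2)
# is quadratic in theta with minimizer theta_hat = 1 / mean(x**2).
rng = np.random.default_rng(1)
sigma = 2.0
x = rng.normal(0.0, sigma, size=100_000)

theta_hat = 1.0 / np.mean(x ** 2)
print(theta_hat, 1.0 / sigma ** 2)  # both ~ 0.25: the true precision is recovered
```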

Now consider the exponential family of distributions, i.e. u(x;\theta)= \sum \limits_k \theta_k \phi_k(x); then the cost function reads

\sum_i \left< \vec\theta^T \frac{\partial^2 \vec\phi(x)}{ \partial x_i^2 }  + \frac{1}{2} \left( \vec\theta^T \frac{\partial \vec\phi(x)}{ \partial x_i } \right)^2 \right> =  \sum_i \left< \vec\theta^T \frac{\partial^2 \vec\phi(x)}{ \partial x_i^2 }  + \frac{1}{2} \vec\theta^T \frac{\partial \vec\phi(x)}{ \partial x_i } \frac{\partial \vec\phi(x)}{ \partial x_i }^T \vec\theta \right>

Minimizing this quadratic form amounts to solving the linear system \sum_i \left< \frac{\partial \vec\phi(x)}{ \partial x_i } \frac{\partial \vec\phi(x)}{ \partial x_i }^T \right> \vec\theta = - \sum_i \left< \frac{\partial^2 \vec\phi(x)}{ \partial x_i^2 } \right>. This will fail miserably for any realistic sample size when \sum_i \left< \frac{\partial \vec\phi(x)}{ \partial x_i } \frac{\partial \vec\phi(x)}{ \partial x_i }^T \right> is ill conditioned. Moreover, if it is singular the problem is ill defined, and one needs some kind of regularization to obtain \theta, i.e. a bias in the estimated distribution.
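Here is a minimal sketch of this linear system, assuming a toy setup of my own: one-dimensional data with sufficient statistics \phi(x) = (x, x^2), so that u(x;\theta) is a Gaussian log-potential; the ridge term eps stands in for the regularization (and hence bias) just mentioned.

```python
import numpy as np

# Score matching for the exponential family with phi(x) = (x, x**2):
# minimize theta^T b + 0.5 * theta^T A theta, where
#   A = < dphi/dx dphi/dx^T >,  b = < d2phi/dx2 >,
# i.e. solve A theta = -b.  For N(mu, sigma^2) the true parameters are
# theta = (mu / sigma**2, -1 / (2 * sigma**2)).
rng = np.random.default_rng(2)
mu, sigma = 1.0, 0.5
x = rng.normal(mu, sigma, size=100_000)

dphi = np.stack([np.ones_like(x), 2 * x])                  # dphi/dx, shape (2, n)
d2phi = np.stack([np.zeros_like(x), 2 * np.ones_like(x)])  # d2phi/dx2

A = dphi @ dphi.T / x.size   # sample mean of dphi dphi^T
b = d2phi.mean(axis=1)       # sample mean of d2phi/dx2

eps = 0.0  # increase if A is ill conditioned; this biases the estimate
theta = np.linalg.solve(A + eps * np.eye(2), -b)
print(theta)  # ~ (4.0, -2.0) for mu = 1.0, sigma = 0.5
```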
