Estimating family of densities

Given a set of samples $S_i= \{X_{i,k}\}, i=1..M, k=..N$ We are interested in finding the family of the distributions that describes in some sense the whole set. The problem is ill defined. On the other hand, if samples are drawn from distributions with probability densities $p_i(x)$ than the usual measure for the differences between two distributions would be Kullback-Leibler divergence. Than, the overall measure between the family and the probability samples would be expected value of the KL-divergence $\inf_{ \{ \theta_i\} }E \left[ D_{KL}( p_i(x) || p(x, \theta_i) \right]$.

It is interesting to see what is happening when all samples are drawn from exponential family, e.g. $p_i(x)= \frac{1}{Z_i}e^{-u_i(x)}$, where $Z_i= \int_D e^{-u_i(x)} dx < \infty$ is a partition function, and $u_i$ is a potential. Here, the above-defined cost function, taking sample mean over the distributions, reads $\frac{1}{N} \sum_i \int_D p_i(x) \left( \log p(x;\theta_i) + u_i(x) \right) dx + \log Z_i =$ $\frac{1}{N} \sum_i \int_D p_i(x) \left( u_i(x) -u(x;\theta_i) \right) dx + \log Z_i- \log Z( \theta_i ) =$ $\frac{1}{N} \sum_i \int_D p_i(x) \left( u_i(x) - \sum_m \theta_{k,m} \phi_m(x ) \right) dx + \log Z_i- \log Z( \theta_i )$. That is certainly zero, when the set $\{ \phi_m(x) \}$ spans the same space as $u_i(x)$. Although $D_{KL}$ is not symmetric with respect to its arguments in the case of exponential families it does not matter, since elements under the integral are almost everywhere zero. With this respect any measure that is a function of the potential differences will discover the correct family, and by that will describe sample potentials exactly almost everywhere. From that it follows that the partition function will also coincide, and can be ignored in the cost function.

So far we have established that to find the exponential family of distributions one need to estimate sample probability densities, take their logarithm, remove the constant part representing partition function, and then perform principal value decomposition to find the basis functions $\phi_m(x)$, that in particular represents the minimal sufficient statistics with respect to the family. In other words, if we do not know the underlying family, we need to choose ambient space of functions representing individual potentials, for example the space of square integrable functions, choose the basis, represent the probability of samples in this basis, and perform principal component analysis. That achieves couple of goals. First, it removes the null space, e.g. It leaves only the sufficient statistics, and second, it is also gives relative weights of the different directions in this space. In other words, if one wants to reduce the representation of probabilities with the smaller number of statistics one have some objective measure what statistics to chose, with respect to the variance explained. (examples- discrete and continuous cases on bounded domains).

There are however two problems. One is to get good estimate of the probability densities, and the second one is that the family does not have to be exponential. And we are interested to discover the family of distributions based on the data only. We assume no additional information is available.

It the family is not of exponential family and if there is no additional information available it is still reasonable to approximate samples by the large dimensional space of functions that capture the wide class of functions, e.g. The space of square integrator functions or the class of bounded functions. The principal component analysis will sort the subsisted of functions according to the variance explained, and therefore it is possible to select the subspace that will provide minimum mean squared estimates of the samples distributions.

Now the real problem comes with the density estimation. The most popular techniques divides into two classes. The parametric estimation and non-parametric ones. The nonparametric methods are consisting basically on remembering all the data, and smoothing them somehow. The large class of such methods are usually referred as kernel density estimations. That is certainly is not helpful for estimating a paramedic family. The second class of methods assume some parametric form and minimize some cost function. This class of methods are usually redder as Bayesian. In particular very popular method of estimating parameters, assuming flat prior is maximum likelihood estimations. This set of methods require to compute partition function which is known to be hard.

There is another technique, which does not require computing partition function – score matching, proposed by Aapo Hivarinen. We would exploit a modified version here. We consider here exponential family of distributions $u_i(x)= \sum_k \theta_{i,k} \phi_k(x)$, were $\phi_k$ is the finite/countable set of basis functions spanning some Hilbert space (we need Hilbert to have inner product, that allows to perform PCA).

Given the set of “good” probing functions $f_m(x)$, each sampling distributions in the limit should satisfy the condition $\left< \frac{ \partial f_m(x) }{ \partial x_k }+ \vec \theta^T f_m(x) \frac{ \partial \vec \phi(x) }{ \partial x_k } \right> =0$ In the case of the sampling these equation can be true only approximately.

$\sum \limits_{m,k, i} \left< \frac{ \partial f_m(x) }{ \partial x_k }+ \vec \theta_i^T f_m(x) \frac{ \partial \vec \phi(x) }{ \partial x_k } \right>^2$

We are interested in finding a subspace with the basis vectors $u$, mean vector $mu$, with $\theta_i= \mu+ \vec \zeta_i u$ that minimize the cost function

$\sum \limits_{m,k,i} \left< \frac{ \partial f_m(x) }{ \partial x_k }+ (\mu + \vec \zeta_i u )^T f_m(x) \frac{ \partial \vec \phi(x) }{ \partial x_k } \right>^2$. That is quartic equation, that is hard to compute. It is also analog of the PCA with missing data.