## NP-complete problems and tautologies

Many NP complete problems can be encoded in system of polynomial equations. For example, in partition problem one is asking whether it is possible to divide a list of integers into two lists that the sum of element of each least are equal. Given the list of elements $a \in \mathbb{Z}^n$ one is looking for the element $x \in \{ \pm 1 \}^n$, such that $\sum \limits_{k=1}^n x_k a_k =0$.

To encode the problem into the system of polynomial equations over $\mathbb{C}$ one need $n$ equations of the form $x_k^2-1=0$ – that will ensure variables $x_k=\pm 1$, and the above mentioned equation $\sum \limits_{k=1}^n x_k a_k =0$. In total $n+1$ nonlinear equations. We can also change the variables that $x'_k= \frac{x_k+1}{2} \in \{0,1\}$. In some sense this variables may be easier for the analyses later.

One way to solve this equation is to linearise it, e.g. we define a new set of variables $y_{(k,m)}= x'_k x'_m,\ k,m=0..n$, and we set $x'_0=1$. Now in terms of $y_{(k,m)}$ we have system of linear equations. Unfortunately, we have lost relationship between terms. Linear system does not “know” that $y_{(2,3)}= y_2 y_3$, etc. On the other hand, if solution of the original system exists it should lay in the null space of the coefficient matrix. We will try use it. Suppose, we have null space $U=[u_1,..., u_K] \in \mathbb{C}^{ \frac{(n+1)(n+2)}{2} \times K }$, than if system of equations has the solution all monomials will be simultaneously expressed as a linear combination of null space vectors $y_{(k,m)}(c)= \sum \limits_{i=1}^K c_i u_{i,(k,m)}$. The dimensionality of null space $K$ is exactly the number of quadratic monomials minus number of equations.

So far, we were looking only at second order monomials. If we go to higher order monomials we can see that $x_{k_1} x_{k_2} x_{k_3} x_{k_4}= (x_{k_1} x_{k_2}) (x_{k_3} x_{k_4})=(x_{k_1} x_{k_3}) (x_{k_2} x_{k_4})=(x_{k_1} x_{k_4})( x_{k_2} x_{k_3})$. What we wrote is tautology – this is true independent of the values of $x_k$. This is also true for any $k_1,...,k_4$. So this tautologies can be written in terms of our new variables $y_{(k_1,k_2)}y_{(k_3,k_4)}-y_{(k_1,k_3)}y_{(k_2,k_4)}=0$, $y_{(k_1,k_2)}y_{(k_3,k_4)}-y_{(k_1,k_4)}y_{(k_2,k_3)}=0$. On the other hand, if we have $k_m = k_n \forall m \neq n$, than we does not have tautology. In any case, we can write down all tautologies for up to order 4, and their expression in terms of new variables $y$. Moreover, for each $y_{k,m}$ we have its linear expression in terms of unknowns $c_i$. So if we expand tautologies in terms of $c_i$ we will get a new set of quadratic equations, that must be satisfied, if original system has a solution.

Now we already know what to do. We need to linearise the system of quadratic equations, for which we need to introduce new variables $b_{k,m}= c_k c_m$, write down the system of linear equations, find the null space of coefficient matrix, encode $b_{m,k}$ as a linear combination of the null space basis, and finally encode forth-order monomials in terms of new coefficient of new null space basis. Ok, we can repeat this forever, just remember to write down all quadratic tautologies in terms of known monomials (e.g. $(x_1x_2x_3x_4)(x_5x_6x_7x_8)= (x_1x_2x_3x_5)(x_4x_6x_7x_8)=...$ many more tautologies). What is interesting here is how $K$ changes with iterations.

Ok, I will come back here later, when I cool down and wil have more time. Note that the number of tautologies are much more (seems to be exponentially more) than the number of monomials.

Each iteration we are squaring the number of variables which encode monomials in terms of null space. Now consider only monomials with $k_m \neq k_n, \forall m \neq n$. If we have monomial of degree $2k$ than only for that monomial we have $\frac{1}{2}\binom{2k}{k}$, since we want to choose half variables, and one half comes from the fact that multiplication is commutative. The number of monomials is $\binom{n}{2k}$. So, assuming tautologies are linearly independent, we have $K_0=n^2$, $K_{m+1}=K_m^2-\frac{1}{2}\binom{2m}{m}\binom{n}{2m} =$ $K_m^2-\frac{1}{2}\binom{n-m}{m}\binom{n}{m} \approx$ $K_m^2-\frac{1}{2} n^{2m}/m!$. So the lower bound is asymptotically …

Posted in Uncategorized | 2 Comments

## Semiprime factorization and custom proof system

Suppose we have $N=pq$, with $p$ and $q$ are unknown odd primes. We can encode factorization problem in the system of polynomial equations. For instance, $p= 1+ \sum_{k=1}^n 2^k x_k$, $q= 1+ \sum_{k=1}^n 2^k y_k$, where $n = \lfloor \log_2 N \rfloor$ is one less the number of bits required to represent $N$, and $x_k, y_k$ are binary indeterminates. By abuse of notation, we can write the following system of equations (here $x, y \in \mathbb{C}^n$)

$\left\{ \begin{array}{lll} f_1= &x_k^2-x_k & =0 \\ g_1= &y_k^2-y_k &=0 \\ h_{1,1}= &\left( 1+ \sum_{k=1}^n 2^k x_k \right) \left( 1+ \sum_{k=1}^n 2^k y_k \right) -N & = 0 \end{array} \right.$

This system has only two solutions for semiprimes – $x$ encoding $p$, $y$ encoding $q$, and vice versa. Therefore, the Groebner basis will consist of $2n$ linear equations encoding line passing through $(x,y)= (p,q)$ and $(x,y)=(q,p)$, and one quadratic equation selecting these two points along this line.

Therefore, we want to represent the linear part of Groebner basis using a linear combination of our equations with polynomial coefficients – $c+ \sum_k a_k x_k +b_k y_k = P_1 h+ \sum_k P_{2,k} f_k + P_{3,k} g_k$ with $P_{...} \in \mathbb{C}[x_1,..., x_n, y_1, ..., y_n]$, Here, the coefficients in the polynomials $P$ are treated as indeterminates. For example, for $n=1$ the correspondin system reads

$\left\{ \begin{array}{lll} f_k= &x_1^2-x_1 & =0 \\ g_k= &y_1^2-y_1 &=0 \\ h_{1,1}= &\left( 1+ 2 x_1 \right) \left( 1+ 2 y_1 \right) -N & = 0 \end{array} \right.$

The linear bases is expressed by

$\begin{array}{l} c+ a_1 x_1 +b_1 y_1 = \\ \left( r_{1,1} x_1 +r_{1,2} y_1 +r_{1,3} \right) (x_1^2-x_1)+ \\ \left( r_{2,1} x_1 +r_{2,2} y_1 +r_{2,3} \right) (y_1^2-y_1)+ \\ \left( r_{3,1} x_1 +r_{3,2} y_1 +r_{3,3} \right) (1+2 x_1+2 y_1+4 x_1 y_1-N) \end{array}$.

and we have the following system for coefficients of different monomials

$\left\{ \begin{array}{rrl} x_1^3: & r_{1,1}&=0 \\ x_1^2y_1: &r_{1,2} +4r_{3,1} &=0 \\ x_1^2:& r_{1,3} - r_{1,1} +2 r_{3,1} &=0\\ x_1y_1^2:& r_{2,1}+4r_{3,2} &=0\\ x_1y_1:& -r_{1,2}-r_{2,1} +2r_{3,1}+2r_{3,2}+4r_{3,3}&=0\\ x_1:& -a_1 -r_{1,3}-r_{3,1}(N-1)+2r_{3,3} &=0\\ y_1^3:& r_{2,2} &=0\\ y_1^2:& r_{2,3}-r_{2,2} +2r_{3,2}&=0\\ y_1:& -b_1 -r_{2,3} -r_{3,2} (N-1) +2r_{3,3} &=0\\ 1:& c+ r_{3,3} (N-1) &=0\\ \end{array} \right.$

$\left\{ \begin{array}{rrl} x_1^2y_1: &-r_{1,2} &=4r_{3,1} \\ x_1y_1^2:& -r_{2,1}&=4r_{3,2} \\ x_1^2:& -r_{1,3} &=2 r_{3,1}\\ y_1^2:& -r_{2,3} &=2r_{3,2}\\ x_1y_1:& 2r_{3,3}&=-3(r_{3,1}+r_{3,2})\\ x_1:& r_{3,1}N+3r_{3,2} &=a_1\\ y_1:& r_{3,2} N+3r_{3,1} &=b_1\\ 1:& -3(r_{3,1}+r_{3,2})(N-1) &=2c\\ \end{array} \right.$

Therefore,
$(N r_{3,1}+3r_{3,2}) x_1+ ( N r_{3,2} +3r_{3,1} ) y_1 - \frac{ 3}{2}(r_{3,1}+r_{3,2})(N-1)= \\ \left( -4r_{3,1} y_1 -2r_{3,1} \right) (x_1^2-x_1)+ \\ \left( -4r_{3,2} x_1 -2r_{3,2} \right) (y_1^2-y_1)+ \\ \left( r_{3,1} x_1 +r_{3,2} y_1 - \frac{ 3}{2}(r_{3,1}+r_{3,2}) \right) (1+2 x_1+2 y_1+4 x_1 y_1-N)$

So the left-hand side of the equation can be written as

$[r_{3,1}, r_{3,2}] \left[ \begin{array}{ccc} N &3 & \frac{ 3}{2}(1-N) \\ 3 &N &\frac{ 3}{2}(1-N) \end{array} \right] \left[ \begin{array}{c} x_1\\y_1\\1 \end{array} \right]$

What we can see here is that the matrix in the above expression has rank 2, unless $N=3$, where the rank of the matrix is 1. When $N=3$ the nullspace of the matrix is

$\left[ \begin{array}{c} 1 \\ 0 \\ 1 \end{array} \right], \left[ \begin{array}{c} -1 \\ 1 \\ 0 \end{array} \right]$

which correspond to two solutions – $x_1= 1, y_1= 0 \rightarrow 3= 3*1$, and $x_1= 0, y_1= 1 \rightarrow 3= 1*3$. Otherwise the null space is expressed as

$\left[ \begin{array}{c} \frac{3(N-1)}{2(N+3)} \\ \frac{3(N-1)}{2(N+3)} \\ 1 \end{array} \right]$

So, for $N=1$ $x_1=y_1= 0 \rightarrow 1= 1*1$, and for $N=9$ $x_1= 1, y_1= 1 \rightarrow 9= 3*3$.

Overall, despite we started with 9+3 indeterminates there are only 2 are relevant to the problem. The question is now how does it scales with $n$. Or another question, is it at all possible to find an algorithm that will select only relevant monomials in $P_{...}$.

Let’s try to extend analysis to the case $n=2$.
To be continued.

## Shor’s algorithm and period finding

Scott Aaronson give interesting interpretation of Shor’s quantum algorithm for factoring integer for the layman like me. What I do not find is the reason why period finding is difficult classically. The only implicit reason is that many smart people have tried it, and did not find an algorithm. Here I want to check the ideas in the Shor’s algorithm for the classical approach. $N$ is the number we want to factor to be consistent with Scott notation.

The main purpose for the quantum algorithm is to find the period of the function $x^r \mod N= x$ for some integer x. We need two ingredients. First, $r$ can be represented as a binary number $r= \sum_k 2^k b_k, b \in \{0, 1\}^n$. Second the natural representation of $x \in \mathbb{Z}/N$ is $e^{2 \pi i \frac{x}{N}}$. Two together give nice representation of the period as a solution of the system of polynomial equations.

Following Scott presentation, we can write $e^{ 2 \pi i \frac{x}{N} }= e^{ 2 \pi i \frac{x^r}{N} } = e^{ 2 \pi i \frac{x ^{ \left( \sum_k 2^k b_k \right) } }{N} } = \prod_k e^{ 2 \pi i \frac{x ^{ \left( 2^k b_k \right) } }{N} } = \prod_k c_k$, where

$c_k= \left \{ \begin{array}{cc} 1, & b_k= 0 \\ e^{ 2 \pi i \frac{x ^{ 2^k } }{N} }, & b_k = 1 \end{array} \right.$

To encode properties of $b_k$ we need following equations

(1) $b_k (b_k-1)= 0$.

To encode $c_k$ we need additionally

(2) $C_k(x)= \left( e^{ 2 \pi i \frac{x ^{ 2^k } }{N} } -1 \right) b_k + 1$.

and the final equations encoding periodicity

(3) $\prod_k C_k(x_m) - c_{0,x_m} = 0$.

We can write many equations of the form (3) as many as we need using different seed values $x$. (and of cause we need to check if the period is archived by the consecutive squaring). Otherwise we need to solve complicated system of equations.

The first thing is we a re going to rewrite eqns (3). We would call product term in normal form for some set of indices $S$ if it is written as

$\phi_{0} \prod_{k \in S} (b_k-\phi_k)$

Next, we peel the onion. We first are interested on the action of the $b_k+\xi$ on $b_k+\phi_k$

$(b_k+\xi)( b_k+\phi_k)= (1+\phi_k+\xi) b_k+ \phi_k \xi = (1+\phi_k+\xi) ( b_k+ \frac{\phi_k \xi}{1+\phi_k+\xi})$.

So if we choose $\xi= \zeta_k \frac{1+\phi_k}{\phi_k-\zeta_k}$ we will obtain form $b_k+\zeta_k$ with some global multiplier. Therefore, we can convert any form to any form by this action applied to all variables.

Now consider two equations of the form (3).

$\begin{array}{ll} \phi_{0,1} \prod_k (b_k-\phi_{k,1}) - 1 & =0 \\ \phi_{0,2} \prod_k (b_k-\phi_{k,2}) - 1 & =0 \end{array}$.

We are going to act on $b_1$ in the first equation in order to obtain the corresponding term in the second equation, and we are going to act on all the rest variables in the second equation in order to obtain the terms in the first equation. At the end we get

$\begin{array}{ll} \phi'_{0,1} (b_1-\phi_{1,2}) \prod_{k \neq 1} (b_k-\phi_{k,1})- (b_1-\xi_1)&=0 \\ \phi'_{0,2} (b_1-\phi_{1,2}) \prod_{k \neq 1} (b_k-\phi_{k,1})- \prod_{k \neq 1}(b_k-\xi_k)&=0 \end{array}$

The differences between two is

$\phi_0 \prod_{k \neq 1}(b_k-\xi_k) - (b_1-\xi_1)=0$

$\left \{ \begin{array}{lll} b_p \left( c_{0,2} \prod_k c_k(x_1) + c_{0,1} \prod_k c_k(x_2) \right) &= c_{0,2} \frac{ 1+c_{1,p} }{c_{1,p}} \prod_k C_k(x_1) - c_{0,2} \frac{ 1+c_{1,p}}{c_{1,p} } \prod_{k \neq p} C_k(x_1) - c_{0,1}\frac{ 1+c_{2,p} }{c_{2,p}} \prod_k C_k(x_2) + c_{0,1}\frac{ 1+c_{2,p}}{c_{2,p} } \prod_{k \neq p} C_k(x_2) & = 0\\ b_q \left( c_{0,2} \prod_k c_k(x_1) + c_{0,1} \prod_k c_k(x_2) \right) &= c_{0,2} \frac{ 1+c_{1,q} }{c_{1,q}} \prod_k C_k(x_1) - c_{0,2} \frac{ 1+c_{1,q}}{c_{1,q} } \prod_{k \neq q} C_k(x_1) - c_{0,1}\frac{ 1+c_{2,q} }{c_{2,q}} \prod_k C_k(x_2) + c_{0,1}\frac{ 1+c_{2,q}}{c_{2,q} } \prod_{k \neq q} C_k(x_2) & = 0 \\ c_{0,2} \prod_k c_k(x_1) - c_{0,1} \prod_k c_k(x_m) & =0 \end{array}\right.$

$\left \{ \begin{array}{ll} c_{0,2} \left( \frac{ 1+c_{1,p} }{c_{1,p}} - \frac{ 1+c_{2,p} }{c_{2,p}} \right) \prod_k C_k(x_1) - c_{0,2} \frac{ 1+c_{1,p}}{c_{1,p} } \prod_{k \neq p} C_k(x_1) + c_{0,1}\frac{ 1+c_{2,p}}{c_{2,p} } \prod_{k \neq p} C_k(x_2) & = 0 \\ c_{0,2} \left( \frac{ 1+c_{1,q} }{c_{1,q}} - \frac{ 1+c_{2,q} }{c_{2,q}} \right) \prod_k C_k(x_1) - c_{0,2} \frac{ 1+c_{1,q}}{c_{1,q} } \prod_{k \neq q} C_k(x_1) + c_{0,1}\frac{ 1+c_{2,q}}{c_{2,q} } \prod_{k \neq q} C_k(x_2) & = 0 \\ c_{0,2} \frac{ 1+c_{2,q} }{c_{2,q}} \prod_k c_k(x_1) - c_{0,1} \frac{ 1+c_{2,q} }{c_{2,q}} \prod_k c_k(x_m) & =0 \end{array}\right.$

In order to avoid possible multiplication to 0 $a \neq \{0,-1\}$. Let $\phi_{k,m}= e^{ 2 \pi i \frac{x_m ^{ 2^k } }{N} } -1$. Than

$(b_k +a) C_k(x_m) = \left( 1+a+\frac{1}{\phi_{k,m}}\right) C_k(x_m) - \left( 1+\frac{1}{\phi_{k,m}} \right)$,

$(b_n+a_1) \prod_k C_k(x_m) = \left( 1+a_1+\frac{1}{\phi_{k,m}}\right) \prod_k C_k(x_m) - \left( 1+\frac{1}{\phi_{k,m}} \right) \prod_{k \neq n} C_k(x_m)$,

$(b_n+a_2) (b_n+a_1) \prod_k C_k(x_m)$ $= (b_n+a_2) \left( 1+a_1+\frac{1}{\phi_{k,m}}\right) \prod_k C_k(x_m) - (b_n+a_2) \left( 1+\frac{1}{\phi_{k,m}} \right) \prod_{k \neq n} C_k(x_m)$ $= \left[ \left( 1+a_2+\frac{1}{\phi_{k,m}}\right) \left( 1+a_1+\frac{1}{\phi_{k,m}}\right) - \frac{1}{\phi_{k,m}} \left( 1+\frac{1}{\phi_{k,m}} \right) \right] \prod_k C_k(x_m)$ $- \left[ \left( 1+\frac{1}{\phi_{k,m}} \right) \left( 1+a_1+\frac{1}{\phi_{k,m}}\right) +\left( a_2+\frac{1}{\phi_{k,m}} \right) \left( 1+\frac{1}{\phi_{k,m}} \right) \right] \prod_{k \neq n} C_k(x_m)$

—— >>> Still needs update <<< —-
Applying this twice for second line in (4) with different constants $a_1, a_2$ we get

$\left \{ \begin{array}{ll} \mbox{const}_1 \prod_k c_k(x_1) +\mbox{const}_3 \prod_k c_k(x_m) + \mbox{const}_2 \prod_{k \neq n} c_k(x_1)+ \mbox{const}_4 \prod_{k \neq n} c_k(x_m) & = 0\\ \mbox{const}_5 \prod_k c_k(x_1) + \mbox{const}_7 \prod_k c_k(x_m)+ \mbox{const}_6 \prod_{k \neq n} c_k(x_1) + \mbox{const}_8 \prod_{k \neq n} c_k(x_m) & =0 \\ \mbox{const}_9 \prod_k c_k(x_1) - \mbox{const}_{10} \prod_k c_k(x_m) & =0 \end{array}\right.$

From which we can eliminate terms $\prod_k c_k(x_1)$ and $\prod_k c_k(x_m)$. Therefore, we left with the system of second line in (4) with one variable less. Applying it recursively we can get equation on a single variable find the solution and back substitute.

I leave details for experts, what values of $a_1, a_2$ should be chosen at each stage, the precision one have to keep for const values (solving it exact may require exponential number of terms), the possible linear dependencies in eliminating variable (e.g. the choice and the number of seed values $x_m$ ), etc. Don't forget I may have simple mistake, and this is bullshit.

[Update] since there is no interest, I unroll some statements. (19 Oct 2012)

Posted in Uncategorized | Tagged | 1 Comment

## On the integer factorization and Nullstellensatz linear algebra: Fighting monomials.

Previous post show the idea. The problem is that one need to find polynomials of high degree to get the number of coefficients equal to the number of monomials. Too many monomials have to be killed. Here is another way to hunt. Shor’s algorithm uses Fourier transform , and fast integer multiplication also uses discrete Fourier Transform. Let’s try it here. It is going to be only sketch here, it is neither pretended to be correct nor complete.

We have our binary represented “could be factors”

$\begin{array}{l} x= \sum \limits_{k=0}^{K_x} 2^k x_k,\\ y= \sum \limits_{k=0}^{K_y } 2^k y_k, \\ x_k^2-x_k=0,\\ y_k^2-y_k=0 \end{array}$.

Following the fast multiplication recipe we are going to pad out binary representations with zeros, e.g. now

$\begin{array}{ll} x= \sum \limits_{k=0}^{K_x+K_y} 2^k x_k, \\ y= \sum \limits_{k=0}^{K_y+K_x } 2^k y_k, \\x_k^2-x_k=0, & k=0..K_x, \\ x_k=0, & k> K_x, \\ y_k^2-y_k=0, & k=0..K_y, \\ y_k= 0, & k > K_y. \end{array}$

Now we want to apply Discrete Fourier Transform to our factors, e.g. ($K= K_x+K_y$)

$\begin{array}{ll} x_k= \frac{1}{\sqrt{K}} \sum \limits_{m=0}^{K-1} e^{ \frac{2 \pi i}{K} k m} v_m \\ y_k= \frac{1}{\sqrt{K}} \sum \limits_{m=0}^{K-1} e^{ \frac{2 \pi i}{K} k m} u_m \\ \end{array}$

We need to keep the constraints.
$\begin{array}{ll} x_k^2-x_k= \left(\frac{1}{\sqrt{K}} \sum \limits_{m=0}^{K-1} e^{ \frac{2 \pi i}{K} k m} v_m \right) \left( \frac{1}{\sqrt{K}} \sum \limits_{m=0}^{K-1} e^{ \frac{2 \pi i}{K} k p} v_p \right) - \frac{1}{\sqrt{K}} \sum \limits_{k=0}^{K-1} e^{ \frac{2 \pi i}{K} k m} v_m \\ \end{array}$

The linear combinations of constraints are still constraints, so we can make another Fourier transform for the constraints. Since it is non-singular transformation we are not loosing any constraint.

$\begin{array}{ll} F(x_k(v))= \frac{1}{\sqrt{K}} \sum \limits_{k=0}^K e^{ -\frac{2 \pi i}{K} k n} (x_k^2-x_k)= \left( K^{-\frac{3}{2}} \sum \limits_{m, p, k=0}^{K-1} e^{ \frac{2 \pi i}{K} k(m+p-n)} v_m v_p \right) - \frac{1}{K} \sum \limits_{k, m=0}^{K-1} e^{ \frac{2 \pi i}{K} k (m-n) } v_m \\ \end{array}$

Which means that each of K constraints has a single term of the first order, and exactly $K$ terms of second order. In total it contains all quadratic terms distributed across $K$ constraints such that $m+p \mod K = n$.

Now the product $x y-n$ is also containing linear in K number of terms because we need to take point-wise products of the transformed x

$x y -n = \frac{1}{\sqrt{K}} \sum \limits_{m=0}^{K-1} 2^m \sum \limits_{k=0}^{K-1} e^{ -\frac{2 \pi i}{K} k m} v_k u_k - n= 0$

Now we are ready to construct certificate. Let say of degree 4. Moreover, we want to restrict the terms in the certificate to not exceed 2 in any variable. Therefore, we have the following components.

$K-K_x+1$ terms of the form $\forall p=K_x..K-1, x_p(v) \left( \sum \limits_{k=0}^{K-1} \sum \limits_{m=k}^{K-1} \sum \limits_{n=0}^{K-1} c_{p,k,m,n}^X u_k u_m v_n \right)$ – giving $(K-K_x)*K*K (K+1)/2$ coefficients.

$K-K_y+1$ terms of the form $\forall p=K_y..K-1, y_p(v) \left( \sum \limits_{k=0}^{K-1} \sum \limits_{m=k}^{K-1} \sum \limits_{n=0}^{K-1} c_{p,k,m,n}^Y v_k v_m u_n \right)$ – giving $(K-K_y)*K*K (K+1)/2$ coefficients. Both of this conditions give $K^3(K+1)/2$

$K_x-1$ terms representing quadratic constraints for unknown x bits: $F(x_p(v)) \left( \sum \limits_{k=0}^{K-1} \sum \limits_{m=k}^{K-1} c_{p,k,m}^{X^2} u_k u_m \right)$

$K_y-1$ terms representing quadratic constraints for unknown y bits: $F(y_p(u)) \left( \sum \limits_{k=0}^{K-1} \sum \limits_{m=k}^{K-1} c_{p,k,m}^{Y^2} v_k v_m \right)$

On the other hand there are only $(K+1)^2*(K+2)^2/4$ monomials in the certificate. Therefore, there is enough coefficients, if they are linearly independent (which have to be shown) to test feasibility of factoring.
If we would be lucky to show this linear independence, and that this linear independence holds for all not factorable n, we would have polynomial algorithm for factorization, since the degree of the certificate can be bounded by 4, independent of n.

Posted in Uncategorized | | 1 Comment

## On the integer factorization and Nullstellensatz linear algebra

In integer factorization we are interested in finding the nontrivial divisor of the positive integer $n$. Suppose we have factors $p$ and $q$. We are interested in representing factors in binary form – $x= \sum \limits_{k=0}^{\lceil \log_2 n \rceil -1 } 2^k x_k, y= \sum \limits_{k=0}^{\lceil \log_2 n \rceil -1 } 2^k y_k$, than we are interested in solving system of equations:

(1) $\left\{ \begin{array}{r} x_1(x_1-1)=0 \\ ... \\ x_K(x_K-1)=0 \\ y_1(y_1-1)=0 \\ ... \\ y_K(y_K-1)=0 \\ \left( \sum \limits_{k=0}^{K} 2^k x_k \right ) \left( \sum \limits_{m=0}^{ M } 2^m y_m \right) -n =0 \end{array} \right.$

We assume $x_k, y_k \in \mathbb{C}$. Each equation except the last one has exactly two solutions – $\{0,1\}$. The last equation (given the constraint of the first equation) is encoding product of binary representation of integer numbers $x, y$ with K and M bits respectively. Therefore, the ideal of the system of polynomial equations (1) is either empty set, or the sets of points representing the possible factorization (and thus zero-dimensional). For example, it has two solution if both $p$ and $q$ are primes – $x=p,y=q$ and $x=q, y=p$. We also can assume that both $p$ and $q$ are odd primes, than $x_0=1$ and $y_0=1$.

Suppose, that we can determine the feasibility of the system (1) in a reasonable time. Then finding factors would be equal to testing recurrently whether the system is feasible with the K-th bit in x representation is equal to one or zero, and then testing next bit in feasible configuration, until all bits are determined. That would require $\log n$ testings.

Now we are interested in finding the fast algorithm for testing feasibility of system

(2) $\left\{ \begin{array}{r} x_1(x_1-1)=0 \\ ... \\ x_K(x_K-1)=0 \\ y_1(y_1-1)=0 \\ ... \\ y_K(y_K-1)=0 \\ \left( 1 + \sum \limits_{k=1}^{K} 2^k x_k \right ) \left(1+ \sum \limits_{m=1}^{ M } 2^m y_m \right) -n =0 \end{array} \right. , K+M= \lceil \log_2 n \rceil .$

Here, we will apply Nullstellensatz linear algebra (J.A. De Loera et al. / Journal of Symbolic Computation 46 (2011) 1260–1283) algorithm for testing the feasibility of the system of polynomial equations (2). The algorithm basically is based on Hilbert’s Nullstellensatz, in particular, a system of polynomial equations $f_1(x)=0, ..., f_s(x)= 0$, where $f_i \in \mathbb{K}[x_1, ..., x_m]$, and $\mathbb{K}$ is algebraically closed field has no solution in $\mathbb{K}^m$ if and only if there exist polynomials $\beta_1, ..., \beta_s \in \mathbb{K}[x_1, ... , x_m]$ such that $1= \sum \beta_i f_i$. Moreover, $1= \sum \beta_i f_i$ is called Nullstellensatz certificate. The idea behind Nullstellensatz linear algebra algorithm is to fix the degree of the polynomials $\beta_i(x)$ to $d$ and write the polynomials in the general way – $\beta_i(x)= \sum c_{i,k} x^{\alpha_k}$, where $\alpha_k \in \mathbb{N}_0^m, c_{i,k} \in \mathbb{C}$, $x^{\alpha}= \prod x_i^{\alpha_i}$, and the sum is taken over all monomials up to degree d. Then expanding the Nulsstellensatz certificate and equating coefficients in the left and right side one is looking to solve the system of linear equations on the coefficients.

For example, suppose K=M=1, then we have the following system

$\left\{ \begin{array}{r} x(x-1)=0 \\ y(y-1)=0 \\ \left( 1 + 2 x \right ) \left(1+ 2 y \right) -n =0 \end{array} \right.$

Now suppose we fix the degree of $\beta_1, \beta_2, \beta_3$ to 2, so the final polynomial is of order 4. In general $\beta_k(x,y)= c_{k,1} x^2+ c_{k,2} xy+ c_{k,3} y^2 + c_{k,4} x + c_{k,5} y + c_{k,6}$, than Nullstellensatz certificate is

$1= (c_{1,1} x^2+ c_{1,2} xy+ c_{1,3} y^2 + c_{1,4} x + c_{1,5} y + c_{1,6} )(x^2-x) +$ $(c_{2,1} x^2+ c_{2,2} xy+ c_{2,3} y^2 + c_{2,4} x + c_{2,5} y + c_{2,6} )(y^2-y) +$ $(c_{3,1} x^2+ c_{3,2} xy+ c_{3,3} y^2 + c_{3,4} x + c_{3,5} y + c_{3,6} )(4xy+2x+2y+1-n) =$

$c_{1,1} x^4 + ( c_{1,2} + 4 c_{3,1} ) x^3 y + ( c_{1,3} + c_{2,1} + 4 c_{3,2} ) x^2 y^2 + ( c_{2,2} + 4 c_{3,3} ) x y^3 + c_{2,3} y^4 + ( - c_{1,1} + c_{1,4} + 2 c_{3,1} ) x^3 + ( -c_{1,2}+ c_{1,5} - c_{2,1} + 2 c_{3,1} + 2 c_{3,2} + 4 c_{3, 4} ) x^2 y + ( -c_{1,3} - c_{2,2} + c_{2,4} +2 c_{3,3} + 2 c_{3, 2 } + 4 c_{3,5} ) x y^2 + ( c_{2,5} - c_{2,3} + 2 c_{3,3} ) y^3 + ( c_{1,6} -c_{1,4} + c_{3,1} + 2 c_{3,4} ) x^2 + ( -c_{1,5} -c_{2,4} + c_{3,2} (1-n) + 4 c_{3,6} + 2 c_{3,5} + 2 c_{3,4} ) xy + ( c_{2,6} - c_{2,5} + 2 c_{3,5} + c_{3,3} ) y^2 + ( -c_{1,6} + 2 c_{3,6} + c_{3,4} (1-n) ) x + ( -c_{2,6} + 2 c_{3,6} + c_{3,5} (1-n) ) y + c_{3,6} (1-n)$.

which defines 15 equations in 18 variables.

$\left\{ \begin{array}{r} c_{1,1}=0 \\ c_{1,2} + 4 c_{3,1}=0 \\ c_{1,3} + c_{2,1} + 4 c_{3,2}= 0 \\ c_{2,2} + 4 c_{3,3}= 0 \\ c_{2,3}= 0 \\ - c_{1,1} + c_{1,4} + 2 c_{3,1} = 0 \\ -c_{1,2}+ c_{1,5} - c_{2,1} + 2 c_{3,1} + 2 c_{3,2} + 4 c_{3, 4}=0 \\ -c_{1,3} - c_{2,2} + c_{2,4} +2 c_{3,3} + 2 c_{3, 2 } + 4 c_{3,5}= 0 \\ c_{2,5} - c_{2,3} + 2 c_{3,3}= 0 \\ c_{1,6} -c_{1,4} + c_{3,1} + 2 c_{3,4} =0 \\ -c_{1,5} -c_{2,4} + c_{3,2} (1-n) + 4 c_{3,6} + 2 c_{3,5} + 2 c_{3,4} = 0 \\ c_{2,6} - c_{2,5} + 2 c_{3,5} + c_{3,3} = 0 \\ -c_{1,6} + 2 c_{3,6} + c_{3,4} (1-n)=0 \\ -c_{2,6} + 2 c_{3,6} + c_{3,5} (1-n) =0 \\ c_{3,6} (1-n)= 1 \end{array} \right.$

This system has infinitely many solutions unless n=1,3,9, which are the composite numbers represented by the 2-bit factors. The explicit certificate for equivalent system

$\left\{ \begin{array}{r} x(x-2)=0 \\ y(y-2)=0 \\ \left( 1 + x \right ) \left(1+ y \right) -n =0 \end{array} \right.$

is

$1= \left[ \left( \frac{9}{4(n-3)(n-9)} - \frac{1}{4(n-1)(n-3)} \right) y^2 + \frac{1}{(n-1)(n-3)} \right] x( x-2 ) +$ $\left[ -\left( \frac{1}{4(n-1)(n-3)}+ \frac{3}{4(n-3)(n-9)} \right) x^2 + \frac{6}{(n-3)(n-9)} x + \frac{1}{(n-1)(n-3)}\right] y (y-2) +$ $\left[ \left( \frac{1}{2(n-1)(n-3)}- \frac{3}{2(n-3)(n-9)} \right) xy -\frac{1}{(n-1)(n-3)} x - \frac{1}{(n-1)(n-3)} y -\frac{1}{n-1} \right] [(x+1)(y+1)-n]$.

We do not know the degree of the polynomials $\beta$ in advance. On the other hand, we can construct lower and upped bound. The upper bound can be computed using Corollary of Lazard lemma (p 1264 in De Loera paper)

Corollary 2.5. Given polynomials $f_1, ... , f_s \in \mathbb{K}[x_1, ... , x_n]$ where $\mathbb{K}$ is an algebraically closed field and d = max{ deg( $f_i$) }, if $f_1, ... , f_s$ have no common zeros and $f_1, ... , f_s$ have no common zeros at infinity, then $1 = \sum \limits^s_{ i=1} \beta_i f_i$ where deg($\beta_i$) ≤ n(d − 1).

The system (1) has no common zeros if infisible, and also has no common zeros at infinity, d=2, therefore deg$(\beta_i) \leq \lceil \log n \rceil$, which lead to the linear system in $(\lceil \log n \rceil +1 ) \left( \begin{array}{c} 2 \lceil \log n \rceil \\ \lceil \log n \rceil \end{array} \right) \approx n^2 \sqrt{\lceil \log n \rceil}$ coefficients.

On the other hand, the lower bound can be computed by equating the number of monomials in the certificate and the number of coefficients. Let (M+K)=m be the total number of variables. The number of coefficients is equal to the number of monomials times the number of equations in system (1), e.g. for degree d – $\left( \begin{array}{c} m+d \\ m \end{array} \right) (m+1)$, on the other hand the number of monomials in Nullstellensatz certificate correspond to the number of monomials for the polynomial with degree d+2, e.g $\left( \begin{array}{c} m+d+2 \\ m \end{array} \right)$. Than d should satisfy

$0= (m+1) \left( \begin{array}{c} m+d \\ m \end{array} \right)- \left( \begin{array}{c} m+d+2 \\ m \end{array} \right) = \frac{(m+1) (m+d)!}{m!d!} - \frac{(m+d+2)!}{m!(d+2)!}=$ $\frac{(m+d)!}{m!(d+2)!} ( m(d+1)(d+2) - (m+d+1)(m+d) ) \Rightarrow d^2+d -m- 1=0$ $\Rightarrow d= \sqrt{ \frac{5}{4}+m} - \frac{1}{2} \approx \sqrt{m}$. Therefore, the minimal number of monomials is $(m+1) \left( \begin{array}{c} m+\sqrt{m} \\ m \end{array} \right) \approx \sqrt{m} 2^{ \sqrt{m} }$, which is subexponential, and comparable to Quadratic sieve performance. To finish the work one needs to prove that the system of equations for coefficients is solvable.

Posted in Uncategorized | 1 Comment

## On the Lonely Runner conjecture for prime number of runners

The formulation of the problem can be found here, here, or here.

Conjecture Suppose $p$ runners having distinct constant speeds start at a common point and run laps on a circular track with circumference 1. Then for any given runner, there is a time at which that runner is distance at least $\frac{1}{p}$ (along the track) away from every other runner.

We consider $p$ being odd prime number of runners with distinct speeds $v_r \in \mathbb{Z}, r=1..p-1$, and without loss of generality $v_p=0$.

Consider speeds decomposition base p $v_r= \sum \limits_{n=0}^M v_{r,n} p^n, v_{r,n} \in \mathbb{Z}/p$, with negative speeds being p-complement represented.

For each speed there is first non zero ‘digit’ base p, e.g. $n_r= \max k: v_{r, k'} = 0, \forall k' < k$. We want to rearrange speeds that the positions of the first non zero digits in descending order – $n_r \geq n_{r'}, \forall r < r';$.

For example, if $n_1=0$ than every runner has speeds $v_r \mod p \neq 0$, and at time $t=\frac{1}{p}$ their positions on a lap are $\frac{v_{r,0}}{p}$ + whole number of laps – e.g. they all are distant from the origin – Lonely Runner conjecture is satisfied.

Some terminology for positions on the lap. Sector k – $S_k = \left( \frac{k}{p}, \frac{k+1}{p} \right)$, beginning of the sector k $B_k= \frac{k}{p}$. Origin – $B_0$. Lonely Runner conjecture is satisfied when there are no runner sin $B_0, S_0, S_{p-1}$. $B_k$ is occupied if at least one of the runner at position $B_k$, and unoccupied (free) otherwise.

At time 0 all runners are at origin. If $n_1= n_{p-1}$ at time $\frac{1}{p^{n_1+1} }$ all runners are distant. Therefore, we consider the case when $n_1 > n_{p-1}$.

At time $\frac{1}{p^{n_1+1} }$ none of the runners at the origin, runners with $n_r= n_1$ are in the beginning of sectors $v_{r, n_1}$. Since there are at most $p-2$ such runners at least one $B_k, k \neq 0$ is unoccupied. Since multiplicative integer group modulo prime number is cyclic there is time $t_1= \frac{m_1}{p^{n_1+1}}, m_1 \in \mathbb{Z}/p$ when $B_{p-1}$ is free. At that time none of the runners in the last sector, none in the origin, but some may be at sector 0.

We need the following sequence of numbers $r_1 = 1, r_k = \min k': n_{k'} < n_{r_{k-1}}, k \geq 2$, e.g. $r_k$ indexing the first runners with distinct $n_r$.

Suppose we have runners at arbitrary positions $l_1, ..., l_n$ with speeds $v_1, .., v_n, v_k < p-1, \forall k$. Consider times $\frac{m}{p}, m \in \{1..p-1\}$. Since the multiplicative group of integers modulo prime p is cyclic there is at most 2 distinct $m$ when the runner is in the vicinity of origin. Therefore, if $n< \frac{p-1}{2}$ there is m when none of the runner in the vicinity of the origin.

From that it follows that if $r_{k+1} -r_k < \frac{p-1}{2}$ there is a time $t= \sum_k \frac{m_k}{p^{n_{r_k}+1}}$, where $m_k$ is a sequence of numbers that guarantee that the first $r_k$ runners neither in the first nor in the last sectors by applying previous paragraph recurrently.

The difficult case that is left is when there are two or more $r_k$, and $\exists k | r_{k+1} -r_k \geq \frac{p-1}{2}$.

Update 28.Aug.2012

The idea to use group operations and time proportional to $\frac{1}{p^k}$, where k is $\max \lceil log_p v_k \rceil$ does not work. For example, the time when no runners are near the origin for the runners with speeds 1 3 4 5 7 18 is $\frac{5}{11}$ and $\frac{6}{11}$ and for the times $\frac{m}{49}$ for all $m$ there is at least one runner in the vicinity of the origin. On the other hand, the time when closest runners are most distant from the origin is corresponding to the time when runners 4 and 7 are at the same distance from the origin on the opposite sides from the origin. e.g. the time for the maximum distance of closest runners from the origin is of the form $\frac{m}{v_k+v_i}$ for some $m, k, i$, and the question here is to check if there are different runners close to the origin. Moreover, either all $m, v_k, v_i$ have common factor, or they are relatively prime (no pair have common factors).

Posted in Uncategorized | 2 Comments

## Estimating family of densities

Given a set of samples $S_i= \{X_{i,k}\}, i=1..M, k=..N$ We are interested in finding the family of the distributions that describes in some sense the whole set. The problem is ill defined. On the other hand, if samples are drawn from distributions with probability densities $p_i(x)$ than the usual measure for the differences between two distributions would be Kullback-Leibler divergence. Than, the overall measure between the family and the probability samples would be expected value of the KL-divergence $\inf_{ \{ \theta_i\} }E \left[ D_{KL}( p_i(x) || p(x, \theta_i) \right]$.

It is interesting to see what is happening when all samples are drawn from exponential family, e.g. $p_i(x)= \frac{1}{Z_i}e^{-u_i(x)}$, where $Z_i= \int_D e^{-u_i(x)} dx < \infty$ is a partition function, and $u_i$ is a potential. Here, the above-defined cost function, taking sample mean over the distributions, reads $\frac{1}{N} \sum_i \int_D p_i(x) \left( \log p(x;\theta_i) + u_i(x) \right) dx + \log Z_i =$ $\frac{1}{N} \sum_i \int_D p_i(x) \left( u_i(x) -u(x;\theta_i) \right) dx + \log Z_i- \log Z( \theta_i ) =$ $\frac{1}{N} \sum_i \int_D p_i(x) \left( u_i(x) - \sum_m \theta_{k,m} \phi_m(x ) \right) dx + \log Z_i- \log Z( \theta_i )$. That is certainly zero, when the set $\{ \phi_m(x) \}$ spans the same space as $u_i(x)$. Although $D_{KL}$ is not symmetric with respect to its arguments in the case of exponential families it does not matter, since elements under the integral are almost everywhere zero. With this respect any measure that is a function of the potential differences will discover the correct family, and by that will describe sample potentials exactly almost everywhere. From that it follows that the partition function will also coincide, and can be ignored in the cost function.

So far we have established that to find the exponential family of distributions one need to estimate sample probability densities, take their logarithm, remove the constant part representing partition function, and then perform principal value decomposition to find the basis functions $\phi_m(x)$, that in particular represents the minimal sufficient statistics with respect to the family. In other words, if we do not know the underlying family, we need to choose ambient space of functions representing individual potentials, for example the space of square integrable functions, choose the basis, represent the probability of samples in this basis, and perform principal component analysis. That achieves couple of goals. First, it removes the null space, e.g. It leaves only the sufficient statistics, and second, it is also gives relative weights of the different directions in this space. In other words, if one wants to reduce the representation of probabilities with the smaller number of statistics one have some objective measure what statistics to chose, with respect to the variance explained. (examples- discrete and continuous cases on bounded domains).

There are however two problems. One is to get good estimate of the probability densities, and the second one is that the family does not have to be exponential. And we are interested to discover the family of distributions based on the data only. We assume no additional information is available.

It the family is not of exponential family and if there is no additional information available it is still reasonable to approximate samples by the large dimensional space of functions that capture the wide class of functions, e.g. The space of square integrator functions or the class of bounded functions. The principal component analysis will sort the subsisted of functions according to the variance explained, and therefore it is possible to select the subspace that will provide minimum mean squared estimates of the samples distributions.

Now the real problem comes with the density estimation. The most popular techniques divides into two classes. The parametric estimation and non-parametric ones. The nonparametric methods are consisting basically on remembering all the data, and smoothing them somehow. The large class of such methods are usually referred as kernel density estimations. That is certainly is not helpful for estimating a paramedic family. The second class of methods assume some parametric form and minimize some cost function. This class of methods are usually redder as Bayesian. In particular very popular method of estimating parameters, assuming flat prior is maximum likelihood estimations. This set of methods require to compute partition function which is known to be hard.

There is another technique, which does not require computing partition function – score matching, proposed by Aapo Hivarinen. We would exploit a modified version here. We consider here exponential family of distributions $u_i(x)= \sum_k \theta_{i,k} \phi_k(x)$, were $\phi_k$ is the finite/countable set of basis functions spanning some Hilbert space (we need Hilbert to have inner product, that allows to perform PCA).

Given the set of “good” probing functions $f_m(x)$, each sampling distributions in the limit should satisfy the condition $\left< \frac{ \partial f_m(x) }{ \partial x_k }+ \vec \theta^T f_m(x) \frac{ \partial \vec \phi(x) }{ \partial x_k } \right> =0$ In the case of the sampling these equation can be true only approximately.

$\sum \limits_{m,k, i} \left< \frac{ \partial f_m(x) }{ \partial x_k }+ \vec \theta_i^T f_m(x) \frac{ \partial \vec \phi(x) }{ \partial x_k } \right>^2$

We are interested in finding a subspace with the basis vectors $u$, mean vector $mu$, with $\theta_i= \mu+ \vec \zeta_i u$ that minimize the cost function

$\sum \limits_{m,k,i} \left< \frac{ \partial f_m(x) }{ \partial x_k }+ (\mu + \vec \zeta_i u )^T f_m(x) \frac{ \partial \vec \phi(x) }{ \partial x_k } \right>^2$. That is quartic equation, that is hard to compute. It is also analog of the PCA with missing data.

Aapo Hyvärinen suggested a method to estimate non-normalized statistical models by the means of score matching. There is an alternative way to derive his objective function.

We start with absolutely continuous positive in the domain $D$ probability density function $\frac{1}{Z(\theta)} u(x;\theta)$, where $Z(\theta)= \int_D u(x;\theta)$ is a partition function, and $u(x;\theta)$ is a potential. Then for any good enough probing function $f(x)$ with $f(x)p(x)=0$ on the boundary of the domain ($\partial D$) the expected value for $\frac{\partial f(x) }{ \partial x_i }$ by integration by parts is

$\left< \frac{\partial f(x) }{ \partial x_i } \right> =$ $\int_D \frac{\partial f(x) }{ \partial x_i } p(x) dx =$ $\left. f(x)p(x) \right|_{\partial D} - \int_D f(x) \frac{\partial p(x) }{ \partial x_i } dx =$ $- \int_D f(x) \frac{\partial p(x) }{ \partial x_i } \frac{p(x)}{p(x)} dx =$ $- \int_D f(x) \frac{\partial \log p(x) }{ \partial x_i } p(x) dx =$ $-\left< f(x) \frac{\partial \log p(x) }{ \partial x_i } \right>$

In other words, for any good enough probability density function, and for any food enough probing function $f(x)$ the following identity holds.

(1) $\left< \frac{\partial f(x) }{ \partial x_i } \right> + \left< f(x) \frac{\partial \log p(x) }{ \partial x_i } \right> = 0$.

Now choosing the set of probing functions $f_i(x)= \frac{1}{2} \frac{ \partial u(x;\theta) }{ \partial x_i }$ and summing up identities (1) for all probing function we get

$2 \sum \limits_i \left< \frac{1}{2} \frac{\partial^2 u(x;\theta) }{ \partial x_i^2 } +\left( \frac{1}{2} \frac{\partial u(x;\theta) }{ \partial x_i } \right)^2 \right> = \sum \limits_i \left< \frac{\partial^2 u(x;\theta) }{ \partial x_i^2 } + \frac{1}{2} \left( \frac{\partial u(x;\theta) }{ \partial x_i } \right)^2 \right>$

i.e. Aapo cost function. In other words, my interpretation of score matching is a particular way to impose identity (1) for the distribution given a set of data points via sampling mean.

Now consider exponential family of distributions, i.e. $u(x;\theta)= \sum \limits_k \theta_k \phi_k(x)$ than the cost function reads

$\sum_i \left< \vec\theta^T \frac{\partial^2 \vec\phi(x)}{ \partial x_i^2 } + \frac{1}{2} \left( \frac{\partial \vec\phi(x)}{ \partial x_i } \vec\theta \right)^2 \right> = \sum_i \left< \vec\theta^T \frac{\partial^2 \vec\phi(x)}{ \partial x_i^2 } + \frac{1}{2} \vec\theta^T \frac{\partial \vec\phi(x)}{ \partial x_i } \frac{\partial \vec\phi(x)}{ \partial x_i }^T \vec\theta \right>$

That will fail miserably when $\left< \frac{\partial \vec\phi(x)}{ \partial x_i } \frac{\partial \vec\phi(x)}{ \partial x_i }^T \right>$ is ill conditioned for any realistic sample sizes. Moreover, if $\left< \frac{\partial \vec\phi(x)}{ \partial x_i } \frac{\partial \vec\phi(x)}{ \partial x_i }^T \right>$ is singular the problem is ill defined, and one need to do some kind of regularization to obtain $\theta$, i.e. biasing estimated distribution.

## On the density estimation

For probability space $(\Omega, \mathcal{F}, \mathbb{P})$ with $A \in \mathcal{F}$ the indicator random variable

${\bf 1}_A : \Omega \rightarrow \mathbb{R} = \left\{ \begin{array}{cc} 1, & \omega \in A \\ 0, & \omega \notin A \end{array} \right.$

Than expected value of the indicator variable is the probability of the event $\omega \in A$.

$E( {\bf 1}_A )= \int_{\Omega} {\bf 1}_A(\omega) d \mathbb{P} = P(A)$.

Next we are interested in real valued random variable $X:\Omega \rightarrow \mathbb{R}$. We are now interested to find, assuming absolutely continuous measure $\mathbb{P}$, the probability density $p(X)$, given a finite set of observations $X_i= X(\omega_i),\ i=1..N$.

$P(A)= E( {\bf 1}_A )= \int_{\Omega} {\bf 1}_A(\omega) d \mathbb{P} =$ $\int_{ \mathbb{R} } {\bf 1}_A (x) p(x) dx = \int_{ \mathbb{R}^2 } {\bf 1}_A (x) p(x) \delta(x-y) {\bf 1}_A (y) dx dy = \int_{ \mathbb{R}^2 } {\bf 1}_A (x) K(x, y) {\bf 1}_A (y) dx dy$,

where $\delta(x-y)$ is Dirac delta function, and $K(x, y)= p(x) \delta(x-y)$.

Next we want another complication – a resolution of identity, e.g. to represent all indicator function in some basis, e.g ${\bf 1}_A (x) = \sum_i | i > < i | {\bf 1}_A (x)$

We have

$P(A)= \sum_{i,k} \int_{ \mathbb{R}^2 } {\bf 1}_A (x) | i >$ $< i | K(x, y) | k >$ $< k |$ ${\bf 1}_A(y) dx dy=$ $\sum_{i,k} \int_{ \mathbb{R}^2 } {\bf 1}_A (x) | i > \rho < k|{\bf 1}_A (y) dx dy$,

where $\rho$ is the density matrix by the analogy with quantum mechanics.

To be concrete we can use the real valued Hermite functions $\psi_n(x)$ as a complete basis set. Here we have $\sum \limits_{n=0}^\infty \psi_n(z) \psi_n(z') = \delta( z-z')$, so
$P(A)= \sum_{i,k} \int_{ \mathbb{R}^2 } {\bf 1}_A (x) | i > \rho < k|{\bf 1}_A (y) dx dy =$
$\sum_{i,k} \int_{ \mathbb{R}^4 } {\bf 1}_A (x) \psi_i( x ) \psi_i( z_1 ) p(z_1) \delta( z_1 - z_2 ) \psi_k( z_2) \psi_k( y ) {\bf 1}_A (y) dx dy dz_1 dz_2$.

The density matrix elements $\rho_{i,k}= \int_{ \mathbb{R}^2 } \psi_i( z_1 ) p(z_1) \delta( z_1 - z_2 ) \psi_k( z_2) dz_1 dz_2 = \int_{ \mathbb{R} } \psi_i( z ) p(z) \psi_k( z) dz$. The density $p(x)= \sum_{i,k} \psi_i( x ) \rho_{i,k} \psi_k( x )$.

In order to estimate density one need to use samples to estimate density matrix, e.g. replace the integral with sample mean $\hat {\rho}_{i,k} = \frac{1}{N} \psi_i( X_i ) \psi_k( X_i )$. In terms of asymptotic behavior, each element of the density matrix will converge to the true value (asymptotically unbiased), and the variance of the expected value decay as $\frac{1}{N}$. Think of it as a Monte-Carlo integration.

1. This seems (since no theorems are proven here) to be more efficient than the kernel method of density estimation.
2. The truncated version of Hermite functions will lead to systematic bias. Since Hermite functions are the basis function of continuous Fourier transform, the bias will consist of truncating the bandwidth of probability density function.
3. The complexity is in the complexity of matrix computation, e.g. $N*M^2$, where $M$ is the number of Hermite functions used, compared to just $N$ for kernel method.
4. The basis set does not have to be real, and if complex resemble quantum mechanical description of reality, where $\psi_k(x)$ represent an experimental setup, and $p(x)$ “classical” probabilities.
5. Depending on the $M/N$ ratio one will get either smooth or sharp approximation of the density.
6. The mothod allow to compare different distributions on the same basis set.
7. There are too many details need to be filled in to become publication quality. This will take unreasonable long time for me, so if anyone sees rational grain in it and can use as the basis, please go ahead.
8. The method works in practice. It has some additional information, not mentioned here.
The quartic polynomial $\sum_{i=1}^5 (x_i^2-1)^2 + ( \sum x_i )^2 -.1$ is positive definite, e.g > 0 for all x, but cannot be represented as a sum of squares. On the other hand, this polynomial represent fully connected graph. The polynomial is due to Pablo Parrilo via Amir Ali Ahmadi.