Training a Pairwise Ising-type Model with the Maximum Entropy Principle

The Ising model energy is defined as:
$$
E(\sigma) = -\sum_{ij}J_{ij}\sigma_i\sigma_j - \sum_{i} H_i\sigma_i
$$
Here $\sigma_i \in \{0, 1\}$ in this work.
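As a concrete illustration, here is a minimal sketch of this energy function in Python, assuming `J` is an $n\times n$ coupling matrix (with zero diagonal) and `H` a length-$n$ field vector; the names are illustrative, not from the text:

```python
import numpy as np

def ising_energy(sigma, J, H):
    """E(sigma) = -sum_{ij} J_ij sigma_i sigma_j - sum_i H_i sigma_i,
    with sigma a {0, 1}-valued configuration of length n."""
    sigma = np.asarray(sigma, dtype=float)
    return float(-sigma @ J @ sigma - H @ sigma)
```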

Let $\mathcal S$ be the set of all possible configurations, with $|\mathcal S| = 2^n$, where $n$ is the number of sites (the length of $\sigma$).
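For small $n$, the configuration space can be enumerated explicitly; a quick sketch:

```python
import numpy as np
from itertools import product

n = 4  # illustrative size
S = [np.array(s) for s in product([0, 1], repeat=n)]
assert len(S) == 2 ** n  # |S| = 2^n
```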

Under the maximum entropy principle, we maximize the entropy $S(p)=-\sum_\sigma p(\sigma)\log p(\sigma)$ subject to the constraints

$\langle \sigma_i \sigma_j\rangle^{emp} = \langle \sigma_i \sigma_j\rangle$, $\langle \sigma_i\rangle^{emp}=\langle \sigma_i\rangle$ and $\sum_\sigma p(\sigma) = 1$.
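The empirical averages on the left-hand sides are computed directly from the observed configurations; a minimal sketch, assuming `data` is an $m\times n$ binary matrix of samples (a hypothetical name):

```python
import numpy as np

def empirical_moments(data):
    """Return <sigma_i sigma_j>^emp (n x n) and <sigma_i>^emp (length n)
    from an (m, n) array of observed {0, 1} configurations."""
    data = np.asarray(data, dtype=float)
    pair_emp = data.T @ data / data.shape[0]   # <sigma_i sigma_j>^emp
    mean_emp = data.mean(axis=0)               # <sigma_i>^emp
    return pair_emp, mean_emp
```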

Combining these into the Lagrangian:

$\mathcal L(p;J,H)=S(p) -\lambda\Big(\sum_\sigma p(\sigma)-1\Big)+\sum_{ij}J_{ij}\big(\langle \sigma_i \sigma_j\rangle -\langle \sigma_i \sigma_j\rangle^{emp}\big)+\sum_{i}H_{i}\big(\langle \sigma_i \rangle -\langle \sigma_i\rangle^{emp}\big)$

Setting $\partial \mathcal L(p;J,H)/\partial p(\sigma) = 0$ gives $-\log p(\sigma) -1-\lambda+\sum_{ij}J_{ij}\sigma_i\sigma_j +\sum_iH_i\sigma_i=0$.

We now have $p(\sigma) = e^{-1-\lambda+\sum_{ij}J_{ij}\sigma_i\sigma_j +\sum_iH_i\sigma_i}=e^{-E(\sigma)}/e^{1+\lambda}$.

Since $p(\sigma)$ must sum to 1, $\lambda$ is chosen so that the factor $e^{-1-\lambda}$ normalizes $p(\sigma)$.

Let $e^{1+\lambda} = Z = \sum_{\sigma} e^{-E(\sigma)} =\sum_\sigma e^{\sum_{ij} J_{ij}\sigma_i\sigma_j+\sum_{i}H_i\sigma_i}$.

We now have $p(\sigma) = e^{-E(\sigma)}/Z = e^{\sum_{ij}J_{ij}\sigma_i\sigma_j+\sum_i H_i\sigma_i}/Z$.
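For small $n$, the partition function $Z$, the probabilities $p(\sigma)$, and the model moments can be computed by brute-force enumeration; a sketch under that assumption (function and variable names are illustrative):

```python
import numpy as np
from itertools import product

def model_moments(J, H):
    """Exact <sigma_i sigma_j>, <sigma_i>, and Z under p(sigma) = exp(-E(sigma)) / Z,
    by enumerating all 2^n configurations (feasible only for small n)."""
    n = len(H)
    configs = np.array(list(product([0, 1], repeat=n)), dtype=float)       # (2^n, n)
    energies = -np.einsum('si,ij,sj->s', configs, J, configs) - configs @ H
    weights = np.exp(-energies)
    Z = weights.sum()
    p = weights / Z                                        # Boltzmann p(sigma)
    pair = np.einsum('s,si,sj->ij', p, configs, configs)   # <sigma_i sigma_j>
    mean = p @ configs                                     # <sigma_i>
    return pair, mean, Z
```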

Let the empirical distribution be $p^{emp}$. To fit $J$ and $H$, we minimize the KL divergence between $p^{emp}$ and $p$.

Specifically,
$$
D_{KL}(p^{emp}||p) = \sum_\sigma p^{emp}(\sigma)\log\frac{p^{emp}(\sigma)}{p(\sigma)}
$$
Note that $D_{KL}(p||q)$ measures how much information is lost when approximating $p$ by $q$.
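When both distributions are available as probability vectors over the same enumeration of $\mathcal S$, the divergence is a one-liner; a sketch with illustrative names `p_emp` and `p_model`:

```python
import numpy as np

def kl_divergence(p_emp, p_model):
    """D_KL(p_emp || p_model) = sum_sigma p_emp(sigma) * log(p_emp(sigma) / p_model(sigma)).
    Terms with p_emp(sigma) = 0 contribute zero by convention."""
    p_emp = np.asarray(p_emp, dtype=float)
    p_model = np.asarray(p_model, dtype=float)
    mask = p_emp > 0
    return float(np.sum(p_emp[mask] * np.log(p_emp[mask] / p_model[mask])))
```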

Now we compute the gradient of the KL divergence with respect to $J_{ij}$ and $H_i$.

We have:
$$
\begin{aligned}
\frac{\partial D_{KL}(p^{emp}||p)}{\partial J_{ij}}
&= \frac{\partial}{\partial J_{ij}}\sum_\sigma p^{emp}(\sigma)\log\frac{p^{emp}(\sigma)}{p(\sigma)}
= \sum_\sigma p^{emp}(\sigma)\,\frac{\partial\big[\log p^{emp}(\sigma)-\log p(\sigma)\big]}{\partial J_{ij}} \\
&= -\sum_\sigma p^{emp}(\sigma)\,\frac{\partial\big[-E(\sigma)-\log Z\big]}{\partial J_{ij}}
= -\sum_\sigma p^{emp}(\sigma)\big[\sigma_i\sigma_j - \langle\sigma_i\sigma_j\rangle\big] \\
&= \langle\sigma_i\sigma_j\rangle - \langle\sigma_i\sigma_j\rangle^{emp},
\end{aligned}
$$

where we used $\log p(\sigma) = -E(\sigma)-\log Z$, $\partial(-E(\sigma))/\partial J_{ij}=\sigma_i\sigma_j$, $\partial\log Z/\partial J_{ij}=\sum_\sigma p(\sigma)\,\sigma_i\sigma_j=\langle\sigma_i\sigma_j\rangle$, and $\sum_\sigma p^{emp}(\sigma)=1$.
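This closed-form gradient can be sanity-checked against a central finite difference of $D_{KL}$ in a single coupling; a sketch under the same exact-enumeration assumption (all names hypothetical):

```python
import numpy as np
from itertools import product

def kl_of_params(J, H, p_emp):
    """D_KL(p^emp || p), where p is the Boltzmann distribution defined by (J, H)
    and p_emp is a probability vector over the same enumeration of configurations."""
    n = len(H)
    configs = np.array(list(product([0, 1], repeat=n)), dtype=float)
    energies = -np.einsum('si,ij,sj->s', configs, J, configs) - configs @ H
    log_p = -energies - np.log(np.sum(np.exp(-energies)))
    mask = p_emp > 0
    return float(np.sum(p_emp[mask] * (np.log(p_emp[mask]) - log_p[mask])))

# Perturbing a single entry J[0, 1] by +/- eps and forming the central difference
# (kl_of_params(J_plus, H, p_emp) - kl_of_params(J_minus, H, p_emp)) / (2 * eps)
# should reproduce <sigma_0 sigma_1> - <sigma_0 sigma_1>^emp at the current (J, H).
```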

Using gradient descent with learning rate $\alpha$, the update is $J_{ij} \leftarrow J_{ij} - \alpha\,(\langle\sigma_i\sigma_j\rangle - \langle\sigma_i\sigma_j\rangle^{emp})$.

Similarly,
$$
\begin{aligned}
\frac{\partial D_{KL}(p^{emp}||p)}{\partial H_{i}}
&= \frac{\partial}{\partial H_{i}}\sum_\sigma p^{emp}(\sigma)\log\frac{p^{emp}(\sigma)}{p(\sigma)}
= \sum_\sigma p^{emp}(\sigma)\,\frac{\partial\big[\log p^{emp}(\sigma)-\log p(\sigma)\big]}{\partial H_{i}} \\
&= -\sum_\sigma p^{emp}(\sigma)\,\frac{\partial\big[-E(\sigma)-\log Z\big]}{\partial H_{i}}
= -\sum_\sigma p^{emp}(\sigma)\big[\sigma_i - \langle\sigma_i\rangle\big] \\
&= \langle\sigma_i\rangle - \langle\sigma_i\rangle^{emp},
\end{aligned}
$$

using the analogous identities $\partial(-E(\sigma))/\partial H_i=\sigma_i$ and $\partial\log Z/\partial H_i=\langle\sigma_i\rangle$.
Using gradient descent, the update is $H_{i} \leftarrow H_{i} - \alpha\,(\langle\sigma_i\rangle - \langle\sigma_i\rangle^{emp})$.
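Putting the updates together, here is a minimal sketch of the whole fitting loop using exact enumeration of the $2^n$ states (feasible only for small $n$; for larger systems the model moments are typically estimated by Monte Carlo sampling instead). The function and variable names are illustrative:

```python
import numpy as np
from itertools import product

def fit_ising(data, alpha=0.1, n_steps=2000):
    """Gradient descent on D_KL(p^emp || p): match model moments to empirical moments."""
    data = np.asarray(data, dtype=float)
    m, n = data.shape
    pair_emp = data.T @ data / m                  # <sigma_i sigma_j>^emp
    mean_emp = data.mean(axis=0)                  # <sigma_i>^emp

    configs = np.array(list(product([0, 1], repeat=n)), dtype=float)
    J = np.zeros((n, n))
    H = np.zeros(n)
    for _ in range(n_steps):
        energies = -np.einsum('si,ij,sj->s', configs, J, configs) - configs @ H
        weights = np.exp(-energies)
        p = weights / weights.sum()                            # Boltzmann p(sigma)
        pair = np.einsum('s,si,sj->ij', p, configs, configs)   # <sigma_i sigma_j>
        mean = p @ configs                                     # <sigma_i>
        J -= alpha * (pair - pair_emp)                         # J update
        H -= alpha * (mean - mean_emp)                         # H update
        np.fill_diagonal(J, 0.0)  # sigma_i^2 = sigma_i for 0/1 spins, so self-couplings are absorbed into H
    return J, H
```

Calling `fit_ising(data)` on an $m\times n$ binary sample matrix would return the fitted couplings and fields; convergence can be monitored by the largest absolute mismatch between the model and empirical moments.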