Why is the integral of a kernel minimum norm interpolant the same as the kernel quadrature?
Introduction
This post is about an algorithm for numerical integration. Consider computing an integral $\int f \mathrm{d} P$ of a function $f: \mathcal{X} \to \mathbb{R}$ with respect to (say) a probability distribution $P$ on some (nice) space $\mathcal{X}$. It is often not easy to compute such an integral exactly, and we need to numerically approximate it. To proceed, suppose we have a collection $(x_1, \dots, x_n)$ of $n$ points in $\mathcal{X}$, which we use to approximate the integral.
Kernel quadrature (KQ) is an algorithm that creates an approximation of the form $\sum_{i=1}^n w_i f(x_i)$, where $w_i \in \mathbb{R}$; more precisely, it computes some weights $w_1, \dots, w_n$ to “optimally” combine the function values $f(x_1), \dots, f(x_n)$. How are the weights determined in KQ? KQ’s weights $w_{1,P}^*, \dots, w_{n,P}^*$ are given as those minimising the worst-case error
$$ \sup_{\|h\|_k \leq 1} \left \lvert \sum_{i=1}^n w_i h(x_i) - \int h \mathrm{d}P \right\rvert, $$where $\|h\|_k$ denotes the norm in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ of a kernel $k:\mathcal{X} \times \mathcal{X} \to \mathbb{R}$. In other words, the weights are chosen such that the integral approximation is good at least for the functions in the RKHS. (For this algorithm to work in practice, we need to assume that we can compute or well approximate some integral of the kernel $k$.) If you are interested, please see papers like Briol et al., 2019 or Kanagawa et al., 2020 for details.
This post is about a relation satisfied by the KQ weights:
$$ \begin{equation} \sum_{i=1}^n w_{i,P}^* f(x_i) = \int f_n \mathrm{d}P, \tag{{Rel}} \end{equation} $$where $f_n$ is the minimum norm interpolant:
$$ f_n \coloneqq \argmin_{h \in \mathcal{H}_k} \|h\|_k \quad \text{s.t.}\quad h(x_i) = f(x_i)\quad \text{for}\ i=1,\dots, n. $$This represents two ways of approximating the integral $\int f \mathrm{d} P$:
- LHS: approximate $P$ first by $\sum_{i=1}^nw^*_{i,P} \delta_{x_i}$, and estimate the integral by applying it to $f$: $\sum_{i=1}^n w_{i,P}^* f(x_i)$;
- RHS: approximate $f$ by $f_n$ and estimate the integral by $\int f_n \mathrm{d} P$,
where $\delta_{x_i}$ denotes the point evaluation at $x_i$.
One can derive this relation by working out the expressions of both sides — if you know what $w_{1,P}^*, \dots, w_{n,P}^*$ and $f_n$ look like. This post is about arriving at it without taking this route.
Some observations
Before deriving the relation above, let’s collect some observations.
Minimum norm interpolant
The minimum norm interpolant uniquely exists when the gram matrix $K\in \mathbb{R}^{n \times n}$ — whose $(i, j)$-element is $k(x_i, x_j)$ — is invertible. We can see the minimum norm interpolant as mapping $f$ (given $x_1, \dots, x_n$) to a function in the RKHS; specifically, the minimum norm interpolant $f_n$ is an element of $\mathrm{span}\{k(x_i, \cdot) \}_{i=1}^n$, which is a closed subspace. The minimum norm interpolant is kind of “orthogonal projection” onto the span; indeed, this intuition precisely holds when $f$ belongs to the RKHS.1
Kernel quadrature
KQ can also be viewed as performing orthogonal projection. Suppose we can embed $P$ to the RKHS $\mathcal{H}_k,$ that is, we assume an element $\mu_P\in \mathcal{H}_k$ satisfying $\langle h, \mu_P \rangle_k = \int h \mathrm{d}P$ for any $h \in \mathcal{H}_k$, where $\langle \cdot, \cdot \rangle_k$ is the RKHS’s inner product. The KQ weights can then be equivalently written as
$$ \begin{equation*} w_{1,P}^*,\dots, w_{n,P}^* = \argmin_{w_1,\dots, w_n \in \mathbb{R}} \left \lVert \sum_{i=1}^n w_i k(x_i, \cdot) - \mu_P\right \rVert_k. \end{equation*} $$Note that if $\mathcal{P}_n$ denotes the orthogonal projection operator onto $\mathrm{span}\{k(x_i, \cdot) \}_{i=1}^n$, then we have $\mu_{\hat{P}} = \mathcal{P}_n \mu_P$, where $\hat{P} = \sum_{i=1}^n w_{i,P}^* \delta_{x_i}$.
Derivation
The relation (Rel) can be derived as follows:
$$ \begin{align*} \int f_n \mathrm{d}P &\overset{\bf (a)}{=} \langle \mu_P, f_n\rangle_k\\ &\overset{\bf (b)}{=} \langle \mathcal{P}_n \mu_P, f_n \rangle_k\\ &\overset{\bf (c)}{=} \sum_{i=1}^n w_{i,P}^* f_n(x_i) &\overset{\bf (d)}{=} \sum_{i=1}^n w_{i,P}^* f(x_i), \end{align*} $$where
- (a) is due to the definition of the mean element $\mu_P;$
- (b) is due to the orthogonal complement of $\mu_P$ being negligible (since $f_n$ is in the span);
- (c) is the definition of the KQ weight (projection); and
- (d) follows from $f_n$ being an interpolant.
I finish this post with a hand-wavy comment. With the usual pairing of a measure and a function, the above relation essentially states that KQ is the adjoint of the operation of taking the minimum norm interpolant (which makes sense under appropriate assumptions on the integrand and the RKHS). That is, denoting the dual pairing by $(\cdot, \cdot)$,
$$ \int (\pi f) \mathrm{d}P = (P, \pi f) = (\pi^*P, f) = \int f \mathrm{d} (\pi^* P) $$where $\pi$ maps a function $f$ to the minimum norm interpolant $f_n$, and $\pi^{*}$ maps the measure to the KQ approximation. The intuition here is that taking either the KQ or the interpolant is essentially an orthogonal projection, so it does not matter which one you take first.
- Mapping to the interpolant is still a projection operator even when $f$ is not in the RKHS. (This point does not matter much in the derivation though.) ↩︎ 
