kernel functions in machine learning

The problem

When classifying data, we often want to map existing data into a higher dimensional space, in order to be able to seperate it with a hyperplane or some “not-too-complicated” structure (in case you are not yet familiar with this approach and its immanent problems of overfitting, etc. read “The Curse of Dimensionality in classification”). However, as we keep increasing dimensions, the computational cost of measuring the similiarity (e.g. with a distance in an Euclidian space) increases as well.

Mathematically speaking, we first have to transform features x and y into some higher dimensional space (map them to f(x) and f(y) so they are easier to seperate) and then calculate the inner product <f(x),f(y)>, which may be regarded as some sort of distance as the inner product returns the projection of one argument onto the other (colloquially, how much they overlap).

This procedure turns out to be “pretty expensive” (up to impossible) in terms of computational effort. The problem is the mapping into higher dimensional spaces (up to infinite dimensional spaces!), so is there any way to circumvent the computation of f(x) and f(y)?

Kernel functions to the rescue!

Let K(x, y) := <f(x), f(y)>. By some applied magic this can be computed without ever having to compute f(x) or f(y). The interesting part is that for seperating data we often don’t construct a function f and the corresponding kernel, but rather chose some of the popular kernels (like the RBF kernel, Polynomial kernel, etc.) and obtain some function f implicitly!

Example for given f

Let x = (x_1, x_2, x_3), y = (y_1, y_2, y_3)

Then for the function

f(x) := (x_1 x_1, x_1 x_2, x_1 x_3, x_2 x_1, x_2 x_2, x_2 x_3, x_3 x_1, x_3 x_2, x_3 x_3)

the kernel is K(x, y) = (<x, y>)^2.

Suppose x = (1, 2, 3), y = (4, 5, 6). Then:

f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9),\newline f(y) = (16, 20, 24, 20, 25, 36, 24, 30, 36), \newline <f(x), f(y)> = 16 + 40 + 72 + 40 + 100+ 180 + 72 + 180 + 324 = 1024

Using the Kernel instead yields:

K(x, y) = (4 + 10 + 18 ) ^2 = 32^2 = 1024

(example shamelessly stolen from Lili Jiang’s excellent answer on Quora)