מילון מונחים

בחר אחת ממילות המפתח משמאל...

Machine LearningSupport Vector Machines

זמן קריאה: ~60 min

Logistic regression identifies linear decision boundaries by modeling the regression function r(\mathbf{x}) = \mathbb{P}(Y = 1 | \mathbf{X} = \mathbf{x}). In this section, we will find decision boundaries directly based on desirable features of those boundaries. We begin with a warm-up exercise on the geometry of planes in \mathbb{R}^3.

Example
Find the distance from the plane 3x + 2y + z = 6 to the point P = (4,7,1).

Solution. The vector [3,2,1] is perpendicular to the given plane, so moving t units directly away from some point in the plane means adding the vector

\begin{align*} t\frac{[3,2,1]}{|[3,2,1]|} = t\frac{[3,2,1]}{\sqrt{14}}\end{align*}

to that point. The value of the function 3x + 2y + z is 6 for any point in the plane, and adding t\frac{[3,2,1]}{\sqrt{14}} increases that value by t\sqrt{14}:

\begin{align*} 3\left(x + \frac{3t}{\sqrt{14}}\right) + 2\left(y + \frac{2t}{\sqrt{14}}\right) + 1\left(z + \frac{t}{\sqrt{14}}\right) = 3x + 2y + z + \frac{9t + 4t + t}{\sqrt{14}} = 6 + t\sqrt{14}.\end{align*}

Since the value of 3x + 2y + z is equal to 3(4) + 2(7) + 1(1) = 27 at the point P, it changes by 27 - 6 = 21 as you move from the nearest point on the plane to P. We solve t\sqrt{14} = 21 to find that the distance from the plane to the point is \boxed{21/\sqrt{14}}.

Example
Find the distance from the hyperplane \{\mathbf{x} \in \mathbb{R}^n : \boldsymbol{\beta} \cdot \mathbf{x} + \alpha = 0\} to the point \mathbf{x} \in \mathbb{R}^n.

Solution. Generalizing the idea we developed in the previous problem, we can say that \boldsymbol{\beta} is the normal vector to the hyperplane, and moving t units directly away from the hyperplane corresponds to adding t|\boldsymbol{\beta}| to the value of \boldsymbol{\beta}' \mathbf{x} + \alpha. Therefore, we can check the value of the function \mathbf{x} \mapsto \boldsymbol{\beta}' \mathbf{x} + \alpha at our point \mathbf{x} and divide by |\boldsymbol{\beta}| to find the distance from \mathbf{x} to the plane. So our distance formula is \frac{|\boldsymbol{\beta}' \mathbf{x} + \alpha|}{|\boldsymbol{\beta}|}.

Example
Simulate data for a binary classification problem in the plane for which the two classes can be separated by a line. Write a Julia function for finding the thickest slab which separates the two classes.

Solution. Suppose that the observations are \{(\mathbf{x}_i,y_i)\}_{i=1}^n, where y_i \in \{-1,1\} for each 1 \leq i \leq n. Let us describe the separating slab as \{\mathbf{x} \in \mathbb{R}^2 : -1 \leq \boldsymbol{\beta}' \mathbf{x} + \alpha \leq 1 \}. The width of this slab is 2/|\boldsymbol{\beta}|, by the preceding example. We can check whether a point is on the correct side of the slab by checking whether \boldsymbol{\beta} ' \mathbf{x} + \alpha \geq 1 for points of class 1 and less than or equal to -1 for points of class -1. More succinctly, we can check whether y_i(\boldsymbol{\beta} \cdot \mathbf{x}_i + \alpha) \geq 1 for all 1 \leq i \leq n.

So, we are looking for the values of \boldsymbol{\beta} and \alpha which minimize |\boldsymbol{\beta}| subject to the conditions y_i(\boldsymbol{\beta}' \mathbf{x}_i + \alpha) \ge 1 for all 1 \leq i \leq n. This is a constrained optimization problem, since we are looking to maximize the value of a function over a domain defined by some constraining inequalities.

Constrained optimization is a ubiquitous problem in applied mathematics, and numerous solvers exist for them. Most of these solvers are written in low-level languages like C and have bindings for the most popular high-level languages (Julia, Python, R, etc.). No solver is uniformly better than the others, so solving a constrained optimization problem often entails trying different solvers to see which works best for your problem. Julia has a package which makes this process easier by providing a single interface for problem encoding (JuMP). Let's begin by sampling our points and looking at a scatter plot. We go ahead and load the Ipopt package, which provides the solver we'll use.

using Plots, JuMP, Ipopt, Random; Random.seed!(1234);
gr(aspect_ratio=1, size=(500,500))

Let's sample the points from each class from a multivariate normal distribution. We'll make the mean of the second distribution a parameter of the sampling function so we can change it in the next example.

function samplepoint(μ=[3,3])
    class = rand([-1,1])
    if class == -1
        X = randn(2)
    else
        X = μ + [1 -1/2; -1/2 1] * randn(2)
    end
    (X,class)
end
n = 100
observations = [samplepoint() for i in 1:n]
ys = [y for (x,y) in observations]
scatter([(x₁,x₂) for ((x₁,x₂),y) in observations], group=ys)

Next we describe our constrained optimization problem in JuMP. The data for the optimization is stored in a Model object, and the solver can be specified when the model is instantiated.

m = Model(with_optimizer(Ipopt.Optimizer,print_level=0))

We add variables to the model with the @variable macro, and we add constraints with the @constraint macro. We add the function we want to minimize or maximize with the objective macro.

@variable(m,β[1:2]) # adds β[1] and β[2] at the same time
@variable(m,α)
for (x,y) in observations
    @constraint(m,y*(β'*x + α) ≥ 1)
end
@objective(m,Min,β[1]^2+β[2]^2)

When we call solve on the model, it makes the optimizing values of the variables retrievable via JuMP.value.

optimize!(m)
β, α = JuMP.value.(β), JuMP.value(α)

Now we can plot our separating line.

l(x₁) = (-α - β[1]*x₁)/β[2]
xs = [-2,5]
plot!(xs,[l(x₁) for x₁ in xs], label="")

We can see that there are necessarily observations of each class which lie on the boundary of the separating slab. These observations are called support vectors. If a sample is not a support vector, then moving it slightly or removing it does not change the separating hyperplane we find.

The approach of finding the thickest separating slab is called the hard-margin SVM (support vector machine). A different approach (the soft-margin SVM) is required if the training observations for the two classes are not separable by a hyperplane.

Example
Consider a binary classification problem in \mathbb{R}^2 for which the training observations \{(\mathbf{x}_i,y_i)\}_{i=1}^n are not separable by a line. Explain why

\begin{align*} L(\boldsymbol{\beta},\alpha) = \lambda |\boldsymbol{\beta}|^2 + \frac{1}{n}\sum_{i=1}^{n}\big[1-y_i(\boldsymbol{\beta}\cdot \mathbf{x}_i + \alpha)\big]_+\end{align*}

is a reasonable quantity to minimize.

Solution. Minimizing the first term \lambda |\boldsymbol{\beta}|^2 is the same as minimizing the value of |\boldsymbol{\beta}|, which is the hard-margin objective function. The second term penalizes points which are not on the right side of the slab (in units of margin widths: points on the slab midline get a penalty of 1, on the wrong edge of the slab they get a penalty of 2, and so on). Thus this term imposes misclassification penalty. The parameter \lambda governs the tradeoff between the large-margin incentive of the first term and the correctness incentive of the second.

Example (Soft-margin SVM)
Simulate some overlapping data and minimize the soft-margin loss function. Select \lambda by leave-one-out cross-validation.

Solution. We begin by generating our observations. To create overlap, we make the mean of the second distribution closer to the origin.

observations = [samplepoint([1,1]) for i in 1:n]
ys = [y for (x,y) in observations]
scatter([(x₁,x₂) for ((x₁,x₂),y) in observations],group=ys)

Next we define our loss function, including a version that takes \boldsymbol{\beta} and \alpha together in a single vector called params.

L(λ,β,α,observations) =
    λ*norm(β)^2 + 1/n*sum(max(0,1-y*(β'*x + α)) for (x,y) in observations)
L(λ,params,observations) = L(λ,params[1:end-1],params[end],observations)

Since this optimization problem is unconstrained, we can use Optim to do the optimization. We define a function SVM which returns the values of \boldsymbol{\beta} and \alpha which minimize the empirical loss:

using Statistics, LinearAlgebra, Optim
function SVM(λ,observations)
    params = optimize(params->L(λ,params,observations), ones(3), BFGS()).minimizer
    params[1:end-1], params[end]
end

To choose \lambda, we write some functions to perform cross-validation:

function errorrate(β,α,observations)
    count(y * (β'*x + α) < 0 for (x,y) in observations)
end
function CV(λ,observations,i)
    β, α = SVM(λ,[observations[1:i-1];observations[i+1:end]])
    x, y = observations[i]
    y * (β'*x + α) < 0
end
function CV(λ,observations)
    mean(CV(λ,observations,i) for i in 1:length(observations))
end

Finally, we optimize over \lambda:

λ₁, λ₂ = 0.001, 0.25        
λ = optimize(λ->CV(first(λ),observations),λ₁,λ₂).minimizer
β,α = SVM(λ,observations)
l(x₁) = (-α - β[1]x₁)/β[2]
xs = [-1.5,2]
plot!(xs, l, legend = false)

Kernelized support vector machines

Exercise

Support vector machines can be used to find nonlinear separating boundaries, because we can map the feature vectors into a higher-dimensional space and use a support vector machine find a separating hyperplane in that space.

Find a map from \mathbb{R}^2 to a higher dimensional space such that the two classes shown in the figure can be separated with a hyperplane.

Solution. The distinguishing feature of the points is distance from the origin, so we supplement the features x_1 and x_2 with the combination x_1^2 + x_2^2. So our map is

\begin{align*} \phi(x_1,x_2) = (x_1,x_2,x_1^2+x_2^2).\end{align*}

We can see that the points can be separated by a plane in \mathbb{R}^3, so we could do a hard-margin SVM at this stage. To classify a test point (x_1,x_2), we can just check which side of the plane the point \phi(x_1, x_2) is on.

We can map points into a higher-dimensional space to achieve linear separation.

This example might have seemed a bit fortuitous, and it is structured so as to apparently rely on human insight to choose the appropriate map to \mathbb{R}^3. We can come up with a more general approach which leads to a new geometric perspective on kernelized support vector machines.

We begin with an overview of the mathematical theory underlying the support vector machine. This content tends to get notationally heavy, so we will adopt a vectorized notation which is more condensed and which also supports convenient translation to code.

We begin by slightly reformulating the soft-margin support vector machine. The SVM is a prediction function of the form

\begin{align*}\mathbf{x}\mapsto \operatorname{sign}(\mathbf{x}\cdot \boldsymbol{\beta} + \alpha),\end{align*}

where \boldsymbol{\beta} \in \mathbb{R}^d and \alpha \in \mathbb{R}. To train a support vector machine on a set of training data (X, \mathbf{y}), with X \in \mathbb{R}^{n \times d} and \mathbf{y} \in \{-1,1\}^n, we choose a value C > 0 and solve the optimization problem

\begin{align*} &\text{minimize} \quad \frac{1}{2}\|\boldsymbol{\beta}\|^2 + C \mathbf{1}'\boldsymbol{\zeta}\\ &\text{subject to} \quad \mathbf{y} \odot (X\boldsymbol{\beta} + \alpha\mathbf{1}) \succcurlyeq \mathbf{1} - \boldsymbol{\zeta} \text{ and } \boldsymbol{\zeta} \succcurlyeq 0.\end{align*}

where \odot indicates elementwise multiplication and \succcurlyeq indicates that the elementwise comparison holds for every element. The parameter C governs the tradeoff between the margin width term \frac{1}{2}|\boldsymbol{\beta}|^2 and the classification penalty \operatorname{sum}(\boldsymbol{\zeta}). Note that this is equivalent to minimizing

\begin{align*}L(\boldsymbol{\beta},\alpha) = \lambda |\boldsymbol{\beta}|^2 + \frac{1}{n}\sum_{i=1}^{n}\big[1-y_i(\boldsymbol{\beta}\cdot \mathbf{x}_i + \alpha)\big]_+,\end{align*}

because if we multiply this expression by \frac{1}{2\lambda} and set C = \frac{1}{2\lambda n}, and we can define \boldsymbol{\zeta} to be the vector whose i\text{th} component is [1-y_i((\boldsymbol{\beta} \boldsymbol{x})_i + \alpha))]_{+}.

Next, we fold the constraints into the objective function by defining a function H so that H(\mathbf{x}) = 0 when \mathbf{x} \preccurlyeq 0 and H(\mathbf{x}) = \infty otherwise. Then our optimization problem is equivalent to

\begin{align*}\frac{1}{2}\|\boldsymbol{\beta}\|^2 + C \mathbf{1}'\boldsymbol{\zeta} + H(\mathbf{1} - \boldsymbol{\zeta} - \mathbf{y} \odot (X\boldsymbol{\beta} + \alpha\mathbf{1})) + H(-\boldsymbol{\zeta}),\end{align*}

since the terms involving H enforce the constraints by returning when the constraints are not satisfied. Note that whenever \boldsymbol{\eta} \succcurlyeq 0, we have H(\mathbf{x}) \geq \boldsymbol{\eta}' \mathbf{x}. Furthermore, if we take the maximum of \boldsymbol{\eta}' \mathbf{x} over all \boldsymbol{\eta} \succcurlyeq 0, we get H(\mathbf{x}). Therefore, the optimization problem can be rewritten as

\begin{align*}\min_{\alpha, \boldsymbol{\beta}, \boldsymbol{\zeta}} \max_{\boldsymbol{\eta}, \boldsymbol{\theta} \succcurlyeq \boldsymbol{0}}\left[\frac{1}{2}\|\boldsymbol{\beta}\|^2 + C \mathbf{1}'\boldsymbol{\zeta} + \boldsymbol{\eta}'(\mathbf{1} - \boldsymbol{\zeta} - \mathbf{y} \odot (X\boldsymbol{\beta} + \alpha\mathbf{1})) - \boldsymbol{\theta}'\zeta\right].\end{align*}

Next, we swap the min and max operations. The resulting optimization problem—called the dual problem—is different, but the maximal value of the objective function in the new optimization problem does provide a lower bound for the first function. Here's the general idea (with F as a real-valued function of two variables):

\begin{align*}F(x,y) &\leq \max_{y} F(x,y)\text{ for all $x,y$}\implies \\ \min_{x}F(x,y) &\leq \min_x\max_{y} F(x,y) \text{ for all $y$} \implies \\ \max_{y}\min_{x}F(x,y) &\leq \min_x\max_{y} F(x,y).\end{align*}

Furthermore, a result called Slater's Theorem implies that in this case, we have strong duality, meaning that the two sides of the equation are actually equal.

Performing the max/min swap and re-arranging terms a bit, we get

\begin{align*}\max_{\boldsymbol{\eta}, \boldsymbol{\theta}\succcurlyeq \boldsymbol{0}}\min_{\alpha, \boldsymbol{\beta}, \boldsymbol{\zeta}} \Bigg[ &\frac{1}{2}\|\boldsymbol{\beta}\|^2 - \boldsymbol{\eta}' (\mathbf{y} \odot X\boldsymbol{\beta}) + \boldsymbol{\eta}' \mathbf{1} + \\ &(C \mathbf{1}' - \boldsymbol{\eta}' - \boldsymbol{\theta}') \boldsymbol{\zeta} \\ &-\alpha\boldsymbol{\eta}'\mathbf{y}\Bigg].\end{align*}

The first line depends only on \boldsymbol{\beta}, the second only on \boldsymbol{\zeta}, and the third only on \alpha. So we can minimize the function over \alpha, \boldsymbol{\beta}, \boldsymbol{\zeta} by minimizing each line individually. To minimize the first term, we rewrite \boldsymbol{\eta}' (\mathbf{y} \odot X\boldsymbol{\beta}) as (\boldsymbol{\eta}' \odot \mathbf{y}') X\boldsymbol{\beta}) differentiate with respect to \boldsymbol{\beta} to get \boldsymbol{\beta}' - (\boldsymbol{\eta}'\odot \boldsymbol{y}')X, which we can solve to find that \boldsymbol{\beta} = X'(\boldsymbol{\eta} \odot \boldsymbol{y}).

The minimum of the second term is -\infty unless \boldsymbol{\theta} + \boldsymbol{\eta} = C\mathbf{1}, because if any component of the vector multiplying \zeta were nonzero, then we could achieve arbitrarily large negative values for the objective function by choosing a suitably large value for \zeta in that component. Likewise, the minimum of the third term is -\infty unless \boldsymbol{\eta}'\boldsymbol{y} is equal to .

Therefore, the outside maximization over \boldsymbol{\eta} and \boldsymbol{\theta} will have to set \boldsymbol{\theta} = C\mathbf{1} - \boldsymbol{\eta} and ensure that the equation \boldsymbol{\eta}'\boldsymbol{y} = 0 is satisfied. All together, substituting \boldsymbol{\beta} = X'(\boldsymbol{\eta} \odot \boldsymbol{y}) into the objective function, we get the dual problem

\begin{align*}&\text{maximize} \quad -\frac{1}{2}(\boldsymbol{\eta} \odot \mathbf{y})'XX' (\boldsymbol{\eta} \odot \mathbf{y}) + \boldsymbol{1}'\boldsymbol{\eta} \\ &\text{subject to} \quad 0 \preccurlyeq \boldsymbol{\eta} \preccurlyeq C \text{ and } \boldsymbol{\eta}' \mathbf{y} = 0.\end{align*}

Exercise

Show that we can give the dual problem the following interpretation, thinking of the entries of \boldsymbol{\eta} as weights for the training points:

  • Each weight must be between 0 and C
  • Equal amounts of weight must be assigned in total to the +1 training points and to the -1 training points.
  • The objective function includes one term for the total amount of weight assigned and one term which is equal to -\frac{1}{2} times the squared norm of the vector from the \boldsymbol{\eta}-weighted sum of negative training observations to the \boldsymbol{\eta}-weighted sum of positive training observations.

Note: the figure shows that \boldsymbol{\eta} values for each point, together with the \boldsymbol{\eta}-weighted sums for both classes (each divided by 3, so that their relation to the original points can be more readily visualized). The vector connecting these two points is \beta.

Solution. The first statement says that 0 \preccurlyeq \boldsymbol{\eta} \preccurlyeq C, which is indeed one of the constraints. The second statement is equivalent to \boldsymbol{\eta}' \boldsymbol{y} = 0, since dotting with \boldsymbol{y} yields the difference between the entries corresponding to positive training examples and the entries corresponding to negative training examples.

The objective function includes one term for the total amount of weight (that's \boldsymbol{1}'\boldsymbol{\eta}), and one term which is -\frac{1}{2}|\boldsymbol{\beta}|^2, where \boldsymbol{\beta} = X'(\boldsymbol{\eta} \odot \boldsymbol{y}). We can see that the vector \beta is in fact the difference between the \boldsymbol{\eta}-weighted sums of the negative and positive training examples by writing X'(\boldsymbol{\eta} \odot \boldsymbol{y}) as (X'_{+1}\boldsymbol{\eta}_{+1} - X'_{-1}\boldsymbol{\eta}_{-1}), where the +1 and -1 subscripts mean "discard rows corresponding to negative observations" and "discard rows corresponding to positive observations", respectively.

If we solve this problem, we can substitute \boldsymbol{\beta} = X'(\widehat{\boldsymbol{\eta}} \odot \boldsymbol{y}) to find the optimizing value of \boldsymbol{\beta} in the original problem. Although \alpha got lost in the translation from the original problem to the dual problem, we can recover its value as well. We'll do this by identifying points on the edge of the separating slab, using the following observation: consider a set of values which solves the optimization problem in the form

\begin{align*}\min_{\alpha, \boldsymbol{\beta}, \boldsymbol{\zeta}} \max_{\boldsymbol{\eta}, \boldsymbol{\theta} \succcurlyeq \boldsymbol{0}}\left[\frac{1}{2}\|\boldsymbol{\beta}\|^2 + C \mathbf{1}'\boldsymbol{\zeta} + \boldsymbol{\eta}'(\mathbf{1} - \boldsymbol{\zeta} - \mathbf{y} \odot (X\boldsymbol{\beta} + \alpha\mathbf{1})) - \boldsymbol{\theta}'\zeta\right]\end{align*}

If i is an index such that \zeta_i > 0 (in other words, the i\text{th} training point is inside the slab or on the wrong side of it), then \theta_i must be zero, since otherwise we could lower the value of the objective function by nudging \theta_i down slightly. This implies that \eta_i = C.

Likewise, if i\text{th} component of \boldsymbol{1} - \boldsymbol{\zeta} - \boldsymbol{y} \odot (X \boldsymbol{\beta} + \alpha \boldsymbol{1}) is negative (in other words, the i\text{th} training point is safely on the correct side of the slab, and not on the edge), then we must have \eta_i = 0.

Putting these two observations together, we conclude that points on the edge of the slab may be detected by looking for the components of \eta which are strictly between 0 and C. Since y_i((X\beta)_i+ \alpha) = 1 for training points (\mathbf{x}_i, y_i) on the edge of the slab, we can identify the value of \alpha by solving this equation for any index i such that 0 < \eta_i < C. We multiply both sides of the equation by y_i to get (X\beta)_i + \alpha = y_i, and that implies that \alpha = (\mathbf{y} - X\boldsymbol{\beta})_i.

In summary, the optimizing value of \alpha in the original problem may also be obtained looking at any component of

\begin{align*} \mathbf{y} - X\widehat{\boldsymbol{\beta}}\end{align*}

for which the corresponding entry of \boldsymbol{\eta} is strictly between 0 and C. A couple caveats when working with solutions obtained numerically: we should (i) include a small tolerance when determining which entries of \boldsymbol{\eta} are deemed to be strictly between 0 and C, and (ii) average all entries of \mathbf{y} - X\widehat{\boldsymbol{\beta}} for which \eta_i is strictly between 0 and C.

Exercise
Execute the code cell below to observe that the values of \eta do indeed turn out to be 0 on the correct side of the slab, C inside the slab or on the wrong side, and strictly between 0 and C only on the edge of the slab. Experiment with different values of C.

include("data-gymnasia/svmplot.jl")
using Random; Random.seed!(1)

# number of training points of each class
n = 20
# generate training observations
X = [randn(n) .+ 1 randn(n) .+ 1
     randn(n) .- 1 randn(n)]
y = repeat([1,-1], inner = n)

# initialize and train an SVM:
S = SVM(X, y)
dualfit!(S, C = 0.5)

# visualize the SVM, together with annotations showing
# the η values
plot(S, (S,i) -> S.η[i])

The Kernel Trick

The reason for formulating the dual problem is that it permits the application of a useful and extremely common technique called the kernel trick. The idea is that if we apply a transformation \phi:\mathbb{R}^d \to \mathbb{R}^d to each row of X and call the resulting matrix \phi(X), then the resulting change to the dual problem is just to replace XX' with \phi(X) \phi(X)'. This matrix's entries consist entirely of dot products of rows of \phi(X) with rows of \phi(X), so we can solve the problem as long as we can calculate \phi(\mathbf{x})\cdot \phi(\mathbf{y}) for all \mathbf{x} and \mathbf{y} in \mathbb{R}^d. The function K = (\mathbf{x}, \mathbf{y}) \mapsto \phi(\mathbf{x})\cdot \phi(\mathbf{y}) is called the kernel associated with the transformation \phi. Typically we ignore \phi and use one of the following kernel functions:

\begin{align*} \text{linear} \quad K(\mathbf{x},\mathbf{y}) &= \mathbf{x}' \mathbf{y} \\ \text{degree-$d$ polynomial} \quad K(\mathbf{x},\mathbf{y}) &= (\gamma \mathbf{x}' \mathbf{y} + r)^d \\ \text{Gaussian radial basis function} \quad K(\mathbf{x},\mathbf{y}) &= \exp(-\gamma \|\mathbf{x} - \mathbf{y}\|^2) \\ \text{sigmoid} \quad K(\mathbf{x},\mathbf{y}) &= \tanh(\gamma \mathbf{x}'\mathbf{y}+ r),\end{align*}

where \gamma and r are hyperparameters.

To bring it all together, suppose that K is a kernel function and \mathcal{K} = {K(\mathbf{x}_i, \mathbf{x}_j)}_{1 \leq i,j \leq n} is the resulting matrix of kernel values (where \mathbf{x}_i is the i\text{th} row of X). Then the support vector machine with kernel K is obtained by letting \widehat{\boldsymbol{\eta}} be the optimizing vector \boldsymbol{\eta} in the optimization problem

\begin{align*} &\text{minimize} \quad \frac{1}{2}(\boldsymbol{\eta} \odot \mathbf{y})'\mathcal{K} (\boldsymbol{\eta} \odot \mathbf{y}) - \operatorname{sum}(\boldsymbol{\eta}) \\ &\text{subject to} \quad 0 \preccurlyeq \boldsymbol{\eta} \preccurlyeq C \text{ and } \boldsymbol{\eta}' \mathbf{y} = 0.\end{align*}

The prediction vector for an n_{\mathrm{test}} \times n feature matrix X_{\mathrm{test}} is

\begin{align*} \operatorname{sign}(\phi(X) \widehat{\boldsymbol{\beta}} + \alpha\boldsymbol{1}) = \operatorname{sign}(\phi(X) \phi(X)' (\widehat{\boldsymbol{\eta}} \odot \mathbf{y}) + \alpha\boldsymbol{1}) = \operatorname{sign}(\mathcal{K}_{\mathrm{test}}(\widehat{\boldsymbol{\eta}} \odot \mathbf{y}) + \alpha \boldsymbol{1}),\end{align*}

where \mathcal{K}_{\mathrm{test}} is the n_{\mathrm{test}} \times n matrix whose $(i,j)$th entry is obtained by applying K to the i\text{th} row of X_{\mathrm{test}} and the j\text{th} row of X, and where \widehat{b} is any entry of

\begin{align*} \mathcal{K}(\widehat{\boldsymbol{\eta}} \odot \mathbf{y}) - \mathbf{y}\end{align*}

for which the corresponding entry in \widehat{\boldsymbol{\eta}} is strictly between 0 and C.

Exercise

Consider applying an SVM with Gaussian kernel K(\mathbf{x},\mathbf{y}) = \mathrm{e}^{-\gamma |\mathbf{x} - \mathbf{y} |^2} to a binary classification problem. Show that the function which maps a feature vector \mathbf{x} to its predicted class can be written as a composition of the signum function and a constant plus a linear combination of translations of the function \mathbf{x}\mapsto\operatorname{e}^{-\gamma |\mathbf{x}|^2}. Where are these translations centered? Use this observation to reason about the behavior of the classifier when \gamma is large.

(In the scatter plot, the two axes represent features, and class is indicated with + or \times. In the surface plot, the vertical axis represents predicted class (before the signum function is applied), while the other two axes represent features.)

Bruno
Bruno Bruno