In this blog post, we derive the dual problem for the Weston-Watkins support vector machine (SVM). The dual optimization problem for the Weston-Watkins SVM can be found in the literature, e.g., Keerthi et al. (2008) and Weston and Watkins (1999), but these works omit the tedious derivation.
Setup and notation
Let \(k \ge 2\) be an integer representing the number of classes and \([k] = \{1,\dots, k\}\).
Let \(n\) be the size of the training data.
For each \(i \in [n]\), \(x_i \in \mathbb{R}^d\) is a column vector and \(y_i \in [k]\).
Let \(W = [w_1,\dots, w_k] \in \mathbb{R}^{d\times k}\) where \(w_j\) is the \(j\)-th column of \(W\).
Let \(e_i \in \mathbb{R}^k\) be the \(i\)-th elementary basis (column) vector.
Let \(M \in \mathbb{R}_{>0}\). We will only be interested in the cases \(M = 1\) and \(M = 1/2\).
Primal problem
The Weston-Watkins SVM minimizes over \(W\) the following regularized empirical risk:
\[ \frac{1}{2}\|W\|_F^2 + C \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \max \{0, 1 - M(w_{y_i}' x_i - w_j'x_i)\}. \]
When \(M = 1\), we recover the original formulation of Weston and Watkins (1999). When \(M = 1/2\), we get the formulation of Doǧan, Glasmachers, and Igel (2016).
If \(\widetilde W = [\widetilde w_1,\dots, \widetilde w_k]\) is the optimizer, then the classifier is \[ x \mapsto \mathrm{argmax}_{j \in [k]} \widetilde w_j 'x. \]
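To make the notation concrete, here is a minimal numpy sketch (with made-up data and arbitrarily chosen \(C\) and \(M\); labels are 0-indexed) that evaluates the regularized empirical risk and the resulting classifier:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 20, 5, 3           # training set size, dimension, number of classes
C, M = 1.0, 1.0              # hyperparameters (M = 1 here; M = 1/2 gives the other variant)
X = rng.normal(size=(n, d))  # row i is x_i'
y = rng.integers(k, size=n)  # labels in {0, ..., k-1}
W = rng.normal(size=(d, k))  # column j is w_j

def primal_objective(W):
    """(1/2)||W||_F^2 + C sum_i sum_{j != y_i} max(0, 1 - M(w_{y_i}'x_i - w_j'x_i))."""
    scores = X @ W                                        # scores[i, j] = w_j' x_i
    margins = scores[np.arange(n), y][:, None] - scores   # w_{y_i}'x_i - w_j'x_i
    hinge = np.maximum(0.0, 1.0 - M * margins)
    hinge[np.arange(n), y] = 0.0                          # the inner sum excludes j = y_i
    return 0.5 * np.sum(W**2) + C * hinge.sum()

def classify(x):
    """The classifier x |-> argmax_j w_j' x."""
    return int(np.argmax(W.T @ x))

print(primal_objective(W), classify(X[0]))
```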
Now, for each \(j, l \in [k]\), let \(\Delta_{j,l} = e_j - e_l\). Then we can rewrite the regularized empirical risk as
\[ \frac{1}{2}\|W\|_F^2 + C \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \max \{0, 1 - M \Delta_{y_i, j}' W' x_i\}. \]
Introducing slack variables \(\xi_{ij}\), we can equivalently minimize
\[ \frac{1}{2}\|W\|_F^2 + C \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \xi_{ij} \]
subject to
\[ \begin{cases} \xi_{ij} \ge 0 \\ \xi_{ij} \ge 1 - M\Delta_{y_i, j}' W' x_i \end{cases} \]
or, equivalently, subject to
\[ \begin{cases} 0 \ge 1 - M\mathrm{tr}( W' x_i\Delta_{y_i, j}') - \xi_{ij}\\ 0 \ge -\xi_{ij} \end{cases} \]
where \(\mathrm{tr}\) is the trace operator. Here we used that \(\Delta_{y_i, j}' W' x_i\) is a scalar, and hence equal to its own trace, together with the cyclic property of the trace: \(\Delta_{y_i, j}' W' x_i = \mathrm{tr}(\Delta_{y_i, j}' W' x_i) = \mathrm{tr}( W' x_i\Delta_{y_i, j}')\). We observe that \(W' x_i\Delta_{y_i, j}' \in \mathbb{R}^{k\times k}\).
Dual problem
Let \(\alpha_{ij} \ge 0\) and \(\beta_{ij} \ge 0\) be the dual variables for the above constraints, respectively.
The Lagrangian is
\[ L(W, \xi, \alpha, \beta) = \frac{1}{2}\|W\|_F^2 + \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \left( C \xi_{ij} - \beta_{ij}\xi_{ij} + \alpha_{ij} (1 - M\mathrm{tr}( W' x_i\Delta_{y_i, j}') - \xi_{ij}) \right). \]
Rearranging, we get
\[ L(W, \xi, \alpha, \beta) = \frac{1}{2}\|W\|_F^2 + \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \left( \xi_{ij} (C - \beta_{ij} -\alpha_{ij}) + \alpha_{ij} (1 - M\mathrm{tr}( W' x_i\Delta_{y_i, j}')) \right). \]
Setting to zero the gradient of \(L\) with respect to \(W\), and using the identities \(\nabla_W \frac{1}{2}\|W\|_F^2 = W\) and \(\nabla_W \mathrm{tr}(W'A) = A\) for a fixed matrix \(A\), we get
\[ 0 = \nabla_W L = W - M \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \alpha_{ij} x_i\Delta_{y_i, j}' \]
Equivalently, \[ W = M \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \alpha_{ij} x_i\Delta_{y_i, j}' \]
Next, setting to zero the gradient of \(L\) with respect to \(\xi_{ij}\), we get
\[ 0 = \nabla_{\xi_{ij}} L = C - \beta_{ij} - \alpha_{ij}. \]
Since \(\beta_{ij} \ge 0\), this gives the constraint \(\alpha_{ij} \le C\). Moreover, substituting \(\beta_{ij} = C - \alpha_{ij}\) removes the dependence of the Lagrangian on \(\xi_{ij}\) and \(\beta_{ij}\), and so
\[ L(W, \alpha) = \frac{1}{2}\|W\|_F^2 + \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \alpha_{ij} (1 - M\mathrm{tr}( W' x_i\Delta_{y_i, j}')) \]
Now, \[ \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \alpha_{ij} \mathrm{tr}( W' x_i\Delta_{y_i, j}') = \mathrm{tr}\left( W' \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \alpha_{ij} x_i\Delta_{y_i, j}'\right) = \frac{1}{M} \mathrm{tr}(W'W) = \frac{1}{M} \|W\|_F^2. \]
Hence, we have
\[ L(W, \alpha) = -\frac{1}{2}\|W\|_F^2 + \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \alpha_{ij} \]
Now, let \(\alpha_{iy_i} = -\sum_{j\in[k]: j \ne y_i} \alpha_{ij}\). Using the definition of \(\Delta_{y_i, j} = e_{y_i} - e_j\), we get \[ \frac{1}{M} W = \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \alpha_{ij} x_i\Delta_{y_i, j}' = \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} - \alpha_{ij} x_i e_j' + \sum_{i=1}^n \Big( \sum_{j \in [k]: j \ne y_i} \alpha_{ij} \Big) x_i e_{y_i}' = - \sum_{i=1}^n \sum_{j \in [k]} \alpha_{ij} x_i e_j'. \]
Now, let us define the column vector \[ \alpha_i = \begin{bmatrix} \alpha_{i1}\\ \vdots \\ \alpha_{ik} \end{bmatrix} \in \mathbb{R}^k \] Then we have \[ W = - M \sum_{i = 1}^n x_i \alpha_i' \] where we observe that \(x_i \alpha_i' \in \mathbb{R}^{d \times k}\).
Now, \[ \frac{1}{M^2} \|W\|_F^2 = \frac{1}{M^2} \mathrm{tr} (W'W) = \mathrm{tr} \left( \sum_{i, s\in [n]} \alpha_s x_s' x_i \alpha_i' \right) = \mathrm{tr} \left( \sum_{i, s\in [n]} x_s' x_i \alpha_i' \alpha_s \right) = \sum_{i, s\in [n]} x_s' x_i \alpha_i' \alpha_s. \] This eliminates the dependencies in the Lagrangian on \(W\) and so \[ L(\alpha) = -\frac{M^2}{2} \sum_{i, s\in [n]} x_s' x_i \alpha_i' \alpha_s + \sum_{i \in [n]} \sum_{j \in [k]: j \ne y_i} \alpha_{ij} \]
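As a sanity check, the identity \(\|W\|_F^2 = M^2 \sum_{i,s\in[n]} x_s'x_i \alpha_i'\alpha_s\) can be verified numerically. The sketch below uses made-up data and a random (not necessarily feasible) \(\alpha\), since the identity only relies on \(W = -M\sum_i x_i\alpha_i'\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, M = 10, 4, 3, 0.5
X = rng.normal(size=(n, d))     # row i is x_i'
A = rng.normal(size=(n, k))     # row i is alpha_i'
W = -M * X.T @ A                # W = -M sum_i x_i alpha_i'

lhs = np.sum(W**2)                              # ||W||_F^2
G = X @ X.T                                     # Gram matrix, G[i, s] = x_i' x_s
rhs = M**2 * np.einsum("is,it,st->", G, A, A)   # M^2 sum_{i,s} x_s'x_i alpha_i'alpha_s
print(np.allclose(lhs, rhs))                    # True
```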
Expression for the dual problem
Thus, the dual problem is
\[ \mathrm{minimize}_{\alpha} \quad f(\alpha) := \frac{M^2}{2} \sum_{i, s\in [n]} x_s' x_i \alpha_i' \alpha_s - \sum_{i \in [n]} \sum_{j \in [k]: j \ne y_i} \alpha_{ij} \] subject to \[ \begin{cases} 0 \le \alpha_{ij} \le C &: j \ne y_i\\ \alpha_{iy_i} = -\sum_{j \in [k]: j \ne y_i} \alpha_{ij} &: j = y_i. \end{cases} \]
When \(M = 1\), the above is the same as equation (16) of Keerthi et al. (2008). Actually, there is a typo in (16): the \(\alpha_{ij}\)s should have a negative sign in front. Later, Keerthi et al. (2008) compute the derivative of \(f\) with the correct sign.
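For concreteness, here is a numpy sketch (made-up data again) that builds a feasible \(\alpha\) and evaluates the dual objective \(f\); the clipping used to produce the free entries is just an illustrative way to land in the box \([0, C]\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 20, 5, 3
C, M = 1.0, 0.5
X = rng.normal(size=(n, d))
y = rng.integers(k, size=n)

# A feasible alpha: free entries in [0, C], and alpha_{i y_i} = -sum_{j != y_i} alpha_{ij}.
A = np.clip(rng.normal(size=(n, k)), 0.0, C)
A[np.arange(n), y] = 0.0
A[np.arange(n), y] = -A.sum(axis=1)

def dual_objective(A):
    """f(alpha) = (M^2/2) sum_{i,s} x_s'x_i alpha_i'alpha_s - sum_i sum_{j != y_i} alpha_{ij}."""
    G = X @ X.T
    quad = 0.5 * M**2 * np.einsum("is,it,st->", G, A, A)
    lin = A.sum() - A[np.arange(n), y].sum()    # sum over j != y_i only
    return quad - lin

print(dual_objective(A))
```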
Derivatives
In this section, we compute \(\frac{\partial f}{\partial \alpha_{ij}}\). Throughout, fix \(i \in [n]\) and \(j \in [k]\) such that \(j \ne y_i\).
First, let \(i \ne s\) and consider the term \[ x_s' x_i \alpha_i'\alpha_s = x_s' x_i (\alpha_{ij} \alpha_{sj} + \alpha_{iy_i} \alpha_{sy_i} + \mathrm{constant} ) \] where \(\mathrm{constant}\) collects terms that do not depend on \(\alpha_{ij}\).
Taking derivative of \(x_s' x_i \alpha_i'\alpha_s\) with respect to \(\alpha_{ij}\), we get
\[ x_s' x_i (\alpha_{sj} - \alpha_{sy_i}) \] where we recall that \(\alpha_{iy_i} = \sum_{l \in [k]: l \ne y_i} -\alpha_{il}\).
Similarly, when \(i = s\), we have that the derivative of \(x_i' x_i \alpha_i'\alpha_i\) is
\[ 2 x_i' x_i (\alpha_{ij} - \alpha_{i y_i}). \]
From this, we get that \[ \frac{\partial f}{\partial \alpha_{ij}} (\alpha) = -1 + M^2 \sum_{s \in [n]} x_i' x_s (\alpha_{sj} - \alpha_{s y_i}) \]
Now, observe that
\[ \sum_{s \in [n]} x_i' x_s (\alpha_{sj} - \alpha_{s y_i}) = x_i ' \left(\sum_{s \in [n]} (\alpha_{sj} -\alpha_{s y_i}) x_s\right) \]
Recall from earlier that
\[ W = - M \sum_{i = 1}^n x_i \alpha_i' \]
Thus, \[ w_j = We_j = - M \sum_{s = 1}^n \alpha_{sj}x_s \quad \mbox{and} \quad w_{y_i}= We_{y_i} = - M \sum_{s = 1}^n \alpha_{sy_i}x_{s}. \]
Thus, we get \[ Mx_i ' \left(\sum_{s \in [n]} (\alpha_{sj} -\alpha_{s y_i}) x_s\right) = w_{y_i}' x_i - w_{j} ' x_i. \]
From this, we conclude that \[ \frac{\partial f}{\partial \alpha_{ij}} (\alpha) = M(w_{y_i}' x_i - w_{j}'x_i) -1. \] Since we are minimizing, it is more convenient to consider the negative gradient: \[ -\frac{\partial f}{\partial \alpha_{ij}} (\alpha) = 1- M (w_{y_i}' x_i - w_{j}'x_i). \] Again, when \(M = 1\), note that this is equivalent to equation (17) from Keerthi et al. (2008).
When \(M = 1/2\), this is equivalent to Shark’s calcGradient function, where the negative gradient is computed on line 415.
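The closed-form gradient can be checked numerically. In the sketch below (made-up data, \(M = 1/2\)), \(W\) is reconstructed from \(\alpha\) via \(W = -M\sum_i x_i\alpha_i'\), and the formula above is compared with a finite difference taken along the feasible direction, i.e., perturbing \(\alpha_{ij}\) by \(\varepsilon\) and \(\alpha_{iy_i}\) by \(-\varepsilon\):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 15, 4, 3
C, M = 1.0, 0.5
X = rng.normal(size=(n, d))
y = rng.integers(k, size=n)

A = np.clip(rng.normal(size=(n, k)), 0.0, C)   # feasible alpha, as before
A[np.arange(n), y] = 0.0
A[np.arange(n), y] = -A.sum(axis=1)

def f(A):
    G = X @ X.T
    return 0.5 * M**2 * np.einsum("is,it,st->", G, A, A) - (A.sum() - A[np.arange(n), y].sum())

i, j = 3, (y[3] + 1) % k            # some coordinate with j != y_i
W = -M * X.T @ A                    # W = -M sum_s x_s alpha_s'
closed_form = M * (W[:, y[i]] - W[:, j]) @ X[i] - 1.0   # df/dalpha_{ij}

eps = 1e-6                          # finite difference respecting alpha_{iy_i} = -sum_{j != y_i} alpha_{ij}
Ap = A.copy(); Ap[i, j] += eps; Ap[i, y[i]] -= eps
numeric = (f(Ap) - f(A)) / eps
print(closed_form, numeric)         # these agree up to O(eps)
```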
Coordinate descent in Shark
In this section, we fix \(i \in [n]\). We will explain how Shark (Igel, Heidrich-Meisner, and Glasmachers (2008)) solves the subproblem:
\[ \mathrm{minimize}_{\alpha_{i1},\dots, \alpha_{ik}} \quad f(\alpha) = \frac{M^2}{2} \sum_{r, s\in [n]} x_s' x_r \alpha_r' \alpha_s - \sum_{r \in [n]} \sum_{j \in [k]: j \ne y_r} \alpha_{rj} \] subject to \[ \begin{cases} 0 \le \alpha_{ij} \le C &: j \ne y_i\\ \alpha_{iy_i} = -\sum_{j \in [k]: j \ne y_i} \alpha_{ij} &: j = y_i. \end{cases} \]
Note that \(\alpha_{st}\) is fixed for all \(s \ne i\) and all \(t \in [k]\). By cycling through \(i\in [n]\) and (approximately) solving the above subproblem, we get a form of coordinate descent.
Let \(\widehat{\alpha}\) be the next iterate of \(\alpha\) where
\[ \widehat{\alpha}_s = \alpha_s,\quad \forall s \ne i. \]
Then we have
\[ f(\widehat{\alpha}) = \frac{M^2}{2}\|x_i\|^2 \|\widehat\alpha_i\|^2 + M^2 x_i' \left( \sum_{s \in [n]: s \ne i} x_s \alpha_s' \right) \widehat \alpha_i - \sum_{j \in [k]: j \ne y_i} \widehat{\alpha}_{ij} +C_i \] where \(C_i \in \mathbb{R}\) does not depend on \(\widehat{\alpha}_{i}\).
Now, observe that
\[ x_i' \left( \sum_{s \in [n]: s \ne i} x_s \alpha_s' \right) \widehat \alpha_i = x_i' \left( \sum_{s \in [n]} x_s \alpha_s' \right) \widehat \alpha_i - x'_ix_i \alpha_i' \widehat\alpha_i = - \frac{1}{M} x_i' W \widehat \alpha_i - \|x_i\|^2 \alpha_i' \widehat\alpha_i \] where we used \(\sum_{s \in [n]} x_s \alpha_s' = -\frac{1}{M} W\).
Thus \[ f(\widehat{\alpha}) = \frac{M^2}{2}\|x_i\|^2 \|\widehat\alpha_i\|^2 -M x_i' W\widehat{\alpha}_i -M^2\|x_i\|^2\alpha_i' \widehat{\alpha}_i - \sum_{j \in [k]: j \ne y_i} \widehat{\alpha}_{ij} +C_i. \]
Let \(\mathbb{1}_{\neg y_i} \in \mathbb{R}^k\) be the vector whose entries are all ones except the \(y_i\)-th entry, which is zero. The above can be written succinctly as
\[ f(\widehat{\alpha}) = \frac{M^2}{2}\|x_i\|^2 \|\widehat\alpha_i\|^2 - \widehat{\alpha}_i' ( M W' x_i + M^2\|x_i\|^2 \alpha_i + \mathbb{1}_{\neg y_i} ) +C_i. \]
Update of a single block dual variable
Shark considers updates of the following form:
\[ \widehat{\alpha}_s = \begin{cases} \alpha_s &: s \ne i\\ \alpha_i + \mu &: s = i \end{cases} \] where \(\mu \in \mathbb{R}^k\) is a step vector satisfying
\[
\mu_{y_i} = -\sum_{t \in [k]: t \ne y_i} \mu_t.
\]
The above constraint is to ensure that \(\widehat{\alpha}\) remains (dual) feasible.
The update step is implemented in
the updateWeightVectors
function assuming that \(\mu\) is given. In the next section, we discuss how Shark computes the \(\mu\) vector.
Computing the step vector
Let \(m\) be arbitrary such that \(\alpha_{ij} + m \in [0,C]\). We consider the updates \[ \begin{cases} \hat{\alpha}_{it} = \alpha_{it} &: t \not\in \{j,y_i\} \\ \hat{\alpha}_{it} = \alpha_{it} + m &: t =j \\ \hat{\alpha}_{it} = \alpha_{it} - m &: t =y_i \\ \end{cases} \]
Similarly, define
\[ \widehat W = [\hat w_1, \dots, \hat w_k] = - M \sum_{s = 1}^n x_s \hat \alpha_s'. \]
Then by construction, we have for \(t \in [k]\) that
\[ \hat w_t = \begin{cases} w_t &: t \not \in \{j, y_i\}\\ w_j - Mmx_i &: t = j \\ w_{y_i} + Mmx_i &: t = y_i. \end{cases} \]
Thus, we have \[ -\frac{\partial f}{\partial \alpha_{it}}( \hat \alpha) = 1 - M(\hat w_{y_i} ' x_i - \hat w_t' x_i) = \begin{cases} 1 - M(w_{y_i} ' x_i - w_j' x_i) - 2 M^2m \|x_i\|^2 &: t = j \\ 1 - M(w_{y_i} ' x_i - w_t' x_i) - M^2m \|x_i\|^2 &: t \not\in \{j, y_i\}. \\ \end{cases} \]
Thus, \[ -\frac{\partial f}{\partial \alpha_{it}}( \hat \alpha) = \begin{cases} -\frac{\partial f}{\partial \alpha_{it}}( \alpha) - 2 M^2m \|x_i\|^2 &: t = j \\ -\frac{\partial f}{\partial \alpha_{it}}( \alpha) - M^2m \|x_i\|^2 &: t \not\in \{j, y_i\}. \\ \end{cases} \]
In Shark, we have \(M = 1/2\) and so the gradient update is
\[ -\frac{\partial f}{\partial \alpha_{it}}( \hat \alpha) = \begin{cases} -\frac{\partial f}{\partial \alpha_{it}}( \alpha) - m \frac{\|x_i\|^2}{2} &: t = j, \\ -\frac{\partial f}{\partial \alpha_{it}}( \alpha) - 0.5\, m \frac{\|x_i\|^2}{2} &: t \not\in \{j, y_i\}. \end{cases} \]
This is the mathematical justification behind this code snippet from Shark.
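The two cases above can also be verified numerically. The sketch below (made-up data; an illustrative check, not Shark's code) applies \(\alpha_{ij} \leftarrow \alpha_{ij} + m\), \(\alpha_{iy_i} \leftarrow \alpha_{iy_i} - m\) and confirms that the negative gradient drops by \(2M^2 m \|x_i\|^2\) in coordinate \(j\) and by \(M^2 m \|x_i\|^2\) in the coordinates \(t \not\in \{j, y_i\}\):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 15, 4, 4
C, M = 1.0, 0.5
X = rng.normal(size=(n, d))
y = rng.integers(k, size=n)

A = np.clip(rng.normal(size=(n, k)), 0.0, C)   # feasible alpha, as before
A[np.arange(n), y] = 0.0
A[np.arange(n), y] = -A.sum(axis=1)

def neg_grad(A, i):
    """Vector of -df/dalpha_{it}(alpha) = 1 - M(w_{y_i}'x_i - w_t'x_i) over t."""
    W = -M * X.T @ A                # W = -M sum_s x_s alpha_s'
    s = W.T @ X[i]                  # s[t] = w_t' x_i
    return 1.0 - M * (s[y[i]] - s)

i, j, m = 2, (y[2] + 1) % k, 0.1
before = neg_grad(A, i)
A2 = A.copy(); A2[i, j] += m; A2[i, y[i]] -= m
after = neg_grad(A2, i)

delta = after - before
sq = X[i] @ X[i]                                     # ||x_i||^2
others = [t for t in range(k) if t not in (j, y[i])]
print(np.isclose(delta[j], -2 * M**2 * m * sq))      # True
print(np.allclose(delta[others], -M**2 * m * sq))    # True
```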
Greedy step size
Consider a quadratic function \[ \Phi(x) = ax^2 + bx+c, \qquad a > 0. \] The derivative at \(x\) is \[ \Phi'(x) = 2ax + b. \] Setting this to zero, the minimizer is \[ x^* = -\frac{b}{2a} = x + \left(-\frac{b}{2a}-x\right) = x + m. \]
Next, observe that \[ m= -\frac{b}{2a}-x = -\frac{1}{2a} (2ax+b) = -\frac{1}{2a} \Phi'(x) \]
We now apply this principle to the subproblem above with \(x = \alpha_{ij}\), \(\Phi'(x) = \frac{\partial f}{\partial \alpha_{ij}} (\alpha)\), and \(a =M^2\|x_i\|^2\); the value of \(a\) comes from the second derivative \(\frac{\partial^2 f}{\partial \alpha_{ij}^2} = 2M^2\|x_i\|^2 = 2a\), which follows from the gradient computed earlier. Plugging everything in, we have
\[ m = \frac{1}{2(M\|x_i\|)^2} \left(- \frac{\partial f}{\partial \alpha_{ij}} (\alpha)\right) = - \frac{\partial f}{\partial \alpha_{ij}} (\alpha) \left(\frac{1}{2(M\|x_i\|)^2}\right). \]
Since \(M = 1/2\), we have
\[ m = - \frac{\partial f}{\partial \alpha_{ij}} (\alpha) \frac{2}{\|x_i\|^2}. \]
This is the update at line 464.
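Putting the pieces together, here is a sketch of one greedy coordinate update on made-up data (an illustrative reimplementation under the conventions of this post, not Shark's actual code): compute the negative gradient, take the Newton step \(m\), clip it so that \(\alpha_{ij} + m\) stays in \([0, C]\), and update \(\alpha_{ij}\) and \(\alpha_{iy_i}\):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 20, 5, 3
C, M = 1.0, 0.5
X = rng.normal(size=(n, d))
y = rng.integers(k, size=n)

A = np.clip(rng.normal(size=(n, k)), 0.0, C)   # feasible starting point
A[np.arange(n), y] = 0.0
A[np.arange(n), y] = -A.sum(axis=1)

def dual_objective(A):
    G = X @ X.T
    return 0.5 * M**2 * np.einsum("is,it,st->", G, A, A) - (A.sum() - A[np.arange(n), y].sum())

def coordinate_step(A, i, j):
    """One clipped Newton step on alpha_{ij} (j != y_i), keeping alpha feasible."""
    W = -M * X.T @ A
    neg_grad = 1.0 - M * (W[:, y[i]] - W[:, j]) @ X[i]   # -df/dalpha_{ij}
    m = neg_grad / (2.0 * M**2 * (X[i] @ X[i]))          # unconstrained step from the 1-D quadratic
    m = np.clip(A[i, j] + m, 0.0, C) - A[i, j]           # keep alpha_{ij} in [0, C]
    A = A.copy()
    A[i, j] += m
    A[i, y[i]] -= m                                      # maintain alpha_{iy_i} = -sum_{j != y_i} alpha_{ij}
    return A

i = 0
j = (y[i] + 1) % k
print(dual_objective(A))
A = coordinate_step(A, i, j)
print(dual_objective(A))   # never larger than before
```

Cycling such steps over \(i \in [n]\) and \(j \ne y_i\) gives the coordinate descent scheme described above.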
Doǧan, Ürün, Tobias Glasmachers, and Christian Igel. 2016. “A Unified View on Multi-Class Support Vector Classification.” Journal of Machine Learning Research 17 (1): 1550–1831.
Igel, Christian, Verena Heidrich-Meisner, and Tobias Glasmachers. 2008. “Shark.” Journal of Machine Learning Research 9 (Jun): 993–96.
Keerthi, S. Sathiya, Sellamanickam Sundararajan, Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. 2008. “A Sequential Dual Method for Large Scale Multi-Class Linear SVMs.” In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 408–16.
Weston, Jason, and Chris Watkins. 1999. “Support Vector Machines for Multi-Class Pattern Recognition.” In Proceedings of the 7th European Symposium on Artificial Neural Networks.