Website of Machine Learning Foundations by Hsuan-Tien Lin: https://www.csie.ntu.edu.tw/~htlin/mooc/.

Question 1

Overfitting and Deterministic Noise

Deterministic noise depends on $\mathcal{H}$, as some models approximate $f$ better than others. Assume $\mathcal{H}' \subset \mathcal{H}$ and that $f$ is fixed. In general (but not necessarily in all cases), if we use $\mathcal{H}'$ instead of $\mathcal{H}$, how does deterministic noise behave?

  • In general, deterministic noise will decrease.
  • In general, deterministic noise will increase.
  • In general, deterministic noise will be the same.
  • If deterministic noise will increase, else it will decrease.
  • If deterministic noise will decrease, else it will increase.

Sol: In general, deterministic noise will increase, because the hypothesis set gets simpler, which makes it less likely that we can find a good $h^{*}$ (recall that deterministic noise is the difference between the best hypothesis $h^{*}$ in the given hypothesis set and the target function $f$).

Question 2

Regularization With Polynomials

Polynomial models can be viewed as linear models in a space $\mathcal{Z}$, under a nonlinear transform $\Phi: \mathcal{X} \rightarrow \mathcal{Z}$. Here, $\Phi$ transforms the scalar $x$ into a vector $\mathbf{z}$ of Legendre polynomials, $\mathbf{z}=\left(1, L_{1}(x), L_{2}(x), \ldots, L_{Q}(x)\right)$. Our hypothesis set $\mathcal{H}_{Q}$ will be expressed as a linear combination of these polynomials,

$$\mathcal{H}_{Q}=\left\{h \;\middle|\; h(x)=\mathbf{w}^{T} \mathbf{z}=\sum_{q=0}^{Q} w_{q} L_{q}(x)\right\},$$

where $L_{0}(x)=1$.

Consider the following hypothesis set defined by the constraint:

$$\mathcal{H}\left(Q, c, Q_{0}\right)=\left\{h \;\middle|\; h(x)=\mathbf{w}^{T} \mathbf{z} \in \mathcal{H}_{Q} ;\; w_{q}=c \text{ for } q \geq Q_{0}\right\}.$$

Which of the following statements is correct?

  • $\mathcal{H}(10,0,3) \cup \mathcal{H}(10,1,4)=\mathcal{H}_{3}$
  • $\mathcal{H}(10,1,3) \cap \mathcal{H}(10,1,4)=\mathcal{H}_{1}$
  • $\mathcal{H}(10,0,3) \cap \mathcal{H}(10,0,4)=\mathcal{H}_{2}$
  • $\mathcal{H}(10,0,3) \cup \mathcal{H}(10,0,4)=\mathcal{H}_{4}$
  • none of the other choices

Sol:

$\mathcal{H}(10,0,3)$: $w_{q}=0$ for $q \geq 3$, so it contains polynomials of the form $w_{0} L_{0}(x)+w_{1} L_{1}(x)+w_{2} L_{2}(x)$;

$\mathcal{H}(10,0,4)$: $w_{q}=0$ for $q \geq 4$, so it contains polynomials of the form $w_{0} L_{0}(x)+w_{1} L_{1}(x)+w_{2} L_{2}(x)+w_{3} L_{3}(x)$.

Therefore, $\mathcal{H}(10,0,3) \cap \mathcal{H}(10,0,4)=\mathcal{H}_{2}$ is the answer.

Question 3

Regularization and Weight Decay

Consider the augmented error

$$E_{\mathrm{aug}}(\mathbf{w})=E_{\mathrm{in}}(\mathbf{w})+\frac{\lambda}{N} \mathbf{w}^{T} \mathbf{w}$$

with some $\lambda>0$.

If we want to minimize the augmented error by gradient descent, with $\eta$ as the learning rate, which of the following is the correct update rule?

  • $\mathbf{w}(t+1) \longleftarrow \mathbf{w}(t)-\eta \nabla E_{\mathrm{in}}(\mathbf{w}(t))$
  • $\mathbf{w}(t+1) \longleftarrow\left(1-\frac{2 \eta \lambda}{N}\right) \mathbf{w}(t)-\eta \nabla E_{\mathrm{in}}(\mathbf{w}(t))$
  • $\mathbf{w}(t+1) \longleftarrow\left(1+\frac{2 \eta \lambda}{N}\right) \mathbf{w}(t)-\eta \nabla E_{\mathrm{in}}(\mathbf{w}(t))$
  • $\mathbf{w}(t+1) \longleftarrow\left(1+\frac{\eta \lambda}{N}\right) \mathbf{w}(t)-\eta \nabla E_{\mathrm{in}}(\mathbf{w}(t))$
  • $\mathbf{w}(t+1) \longleftarrow\left(1-\frac{\eta \lambda}{N}\right) \mathbf{w}(t)-\eta \nabla E_{\mathrm{in}}(\mathbf{w}(t))$

Sol: Since

$$\nabla E_{\mathrm{aug}}(\mathbf{w})=\nabla E_{\mathrm{in}}(\mathbf{w})+\frac{2 \lambda}{N} \mathbf{w},$$

we have the update rule

$$\mathbf{w}(t+1) \longleftarrow \mathbf{w}(t)-\eta\left(\nabla E_{\mathrm{in}}(\mathbf{w}(t))+\frac{2 \lambda}{N} \mathbf{w}(t)\right)=\left(1-\frac{2 \eta \lambda}{N}\right) \mathbf{w}(t)-\eta \nabla E_{\mathrm{in}}(\mathbf{w}(t)).$$
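
As a quick sanity check, here is a minimal sketch (with a toy squared-error $E_{\mathrm{in}}$ and made-up values of $\eta$, $\lambda$, $N$, none of which come from the problem) showing that the weight-decay form above is the same as a plain gradient step on $E_{\mathrm{aug}}$:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
eta, lambd, N = 0.1, 2.0, len(y)

def grad_E_in(w):
    # gradient of E_in(w) = (1/N) * ||Xw - y||^2
    return 2 / N * X.T @ (X @ w - y)

# one step directly on E_aug(w) = E_in(w) + (lambda/N) * w^T w
w_direct = w - eta * (grad_E_in(w) + 2 * lambd / N * w)
# the same step written in the weight-decay form
w_decay = (1 - 2 * eta * lambd / N) * w - eta * grad_E_in(w)

print(np.allclose(w_direct, w_decay))  # True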

Question 4

Let $\mathbf{w}_{\text{lin}}$ be the optimal solution for the plain-vanilla linear regression and $\mathbf{w}_{\text{reg}}(\lambda)$ be the optimal solution for the formulation above. Select all the correct statements below:

  • $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|>\left\|\mathbf{w}_{\text{lin}}\right\|$ for any $\lambda>0$
  • $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|$ is a non-decreasing function of $\lambda$ for $\lambda \geq 0$
  • $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|$ is a non-increasing function of $\lambda$ for $\lambda \geq 0$
  • $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|=\left\|\mathbf{w}_{\text{lin}}\right\|$ for any $\lambda>0$
  • none of the other choices

Sol:

From the constrained minimization of $E_{\mathrm{in}}$:

$$\min _{\mathbf{w}} E_{\mathrm{in}}(\mathbf{w}) \quad \text{subject to} \quad \mathbf{w}^{T} \mathbf{w} \leq C,$$

which is equivalent to the unconstrained minimization of $E_{\mathrm{aug}}(\mathbf{w})=E_{\mathrm{in}}(\mathbf{w})+\frac{\lambda}{N} \mathbf{w}^{T} \mathbf{w}$, we can deduce that if $\mathbf{w}_{\text{lin}}$ satisfies the constraint $\mathbf{w}^{T} \mathbf{w} \leq C$, then $\mathbf{w}_{\text{reg}}(\lambda)=\mathbf{w}_{\text{lin}}$; otherwise, $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|<\left\|\mathbf{w}_{\text{lin}}\right\|$.

As for the monotonicity of $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|$: increasing $\lambda$ corresponds to decreasing $C$, which restricts the growth of $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|$; hence $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|$ is non-increasing in $\lambda$.
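
A minimal numeric sketch (on a random toy regression problem, not the course data) illustrating that $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|$ never grows as $\lambda$ increases:

import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(0)
N, d = 50, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

for lambd in [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    # minimizer of (1/N)||Xw - y||^2 + (lambda/N) w^T w
    w_reg = inv(X.T @ X + lambd * np.eye(d)) @ X.T @ y
    print(lambd, np.linalg.norm(w_reg))  # the norm is non-increasing in lambda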

Question 5

Leave-One-Out Cross-Validation

You are given the data points $(-1,0)$, $(\rho, 1)$, $(1,0)$ with $\rho \geq 0$, and a choice between two models: constant $h_{0}(x)=b_{0}$ and linear $h_{1}(x)=a_{1} x+b_{1}$. For which value of $\rho$ would the two models be tied using leave-one-out cross-validation with the squared error measure?

  • $\sqrt{\sqrt{3}+4}$
  • $\sqrt{\sqrt{3}-1}$
  • $\sqrt{9+4 \sqrt{6}}$
  • $\sqrt{9-\sqrt{6}}$
  • none of the other choices

Sol:

For the first model, which is a constant model, the best hypothesis we can choose over a training set is the average of its $y$ values. Leaving out each of the three points in turn gives the constants $\frac{1}{2}$, $0$, and $\frac{1}{2}$, so substituting into the squared error:

$$E_{\mathrm{cv}}\left(h_{0}\right)=\frac{1}{3}\left(\left(\tfrac{1}{2}-0\right)^{2}+(0-1)^{2}+\left(\tfrac{1}{2}-0\right)^{2}\right)=\frac{1}{2}.$$

As for the second model, the best line using two points is exactly the line through them, and hence the three leave-one-out errors are

$$e_{1}=\left(\frac{-2}{\rho-1}\right)^{2}, \quad e_{2}=1, \quad e_{3}=\left(\frac{2}{\rho+1}\right)^{2}.$$

Then

$$E_{\mathrm{cv}}\left(h_{1}\right)=\frac{1}{3}\left(\frac{4}{(\rho-1)^{2}}+1+\frac{4}{(\rho+1)^{2}}\right).$$

Equating $E_{\mathrm{cv}}\left(h_{0}\right)$ and $E_{\mathrm{cv}}\left(h_{1}\right)$, we can solve for $\rho=\sqrt{9+4 \sqrt{6}}$.
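
A minimal sketch (not part of the original solution) that checks the tie numerically by running leave-one-out cross-validation for both models at $\rho=\sqrt{9+4 \sqrt{6}}$:

import numpy as np

rho = np.sqrt(9 + 4 * np.sqrt(6))
pts = np.array([[-1.0, 0.0], [rho, 1.0], [1.0, 0.0]])

def fit_const(train):
    # best constant under squared error: the mean of the y values
    b = train[:, 1].mean()
    return lambda x: b

def fit_line(train):
    # the line through the two remaining points
    (x1, y1), (x2, y2) = train
    a = (y2 - y1) / (x2 - x1)
    return lambda x: a * (x - x1) + y1

def loocv(fit):
    errs = []
    for i in range(len(pts)):
        h = fit(np.delete(pts, i, axis=0))
        x, y = pts[i]
        errs.append((h(x) - y) ** 2)
    return np.mean(errs)

print(loocv(fit_const), loocv(fit_line))  # both are 0.5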

Question 6

Learning Principles

In Problems 6-7, suppose that for 5 weeks in a row, a letter arrives in the mail that predicts the outcome of the upcoming Monday night baseball game. (Assume there are no ties.) You keenly watch each Monday and, to your surprise, the prediction is correct each time. On the day after the fifth game, a letter arrives, stating that if you wish to see next week's prediction, a payment of NTD 1,000 is required. Which of the following statements is true?

  • To make sure that at least one person receives correct predictions on all 5 games from him, at least 64 letters are sent before the fifth game.
  • If the sender wants to make sure that at least one person receives correct predictions on all 5 games from him, the sender should target to begin with at least 5 people.
  • To make sure that at least one person receives correct predictions on all 5 games from him, after the first letter ‘predicts’ the outcome of the first game, the sender should target at least 16 people with the second letter.
  • none of the other choices.

Sol: If we just randomly guess the result, the probability of getting the correct result is $\frac{1}{2}$, so the probability of making 5 consecutive correct 'predictions' by random guessing is $\left(\frac{1}{2}\right)^{5}=\frac{1}{32}$.

The sender's 'strategy' is as follows:

  1. Before the first game: mail out 32 letters, 16 predicting that A wins and the other 16 that B wins, so half of them will be correct.
  2. Before the second game: among the 16 people who received a correct prediction the first time, mail 8 of them letters predicting that A wins and the other 8 letters predicting that B wins.
  3. Repeat the same process until the fifth game.

After the fifth game, there will be exactly one person who has received correct predictions on all 5 games. We then mail that person the payment letter.

So the correct answer is the third statement.

Question 7

Suppose the cost of printing and mailing out each letter is NTD 10. If the sender sends out the minimum number of letters, how much money can be made when the above 'fraud' succeeds once, that is, when one of the recipients does send him NTD 1,000 to receive the prediction of the 6th game?

  • NTD 400
  • NTD 340
  • NTD 460
  • NTD 430
  • NTD 370

Sol: The total number of letters the sender needs to send is

$$32+16+8+4+2+1=63,$$

where the last '1' is the letter with the payment requirement attached. So the money that can be made is $1000-63 \times 10=$ NTD 370.
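
A tiny sketch of the same arithmetic:

letters = 32 + 16 + 8 + 4 + 2 + 1  # prediction letters plus the final payment letter
profit = 1000 - 10 * letters       # NTD 1,000 received minus NTD 10 per letter
print(letters, profit)             # 63 370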

Question 8

For Problems 8-10, we consider the following. In our credit card example, the bank starts with some vague idea of what constitutes a good credit risk. So, as customers $\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{N}$ arrive, the bank applies its vague idea to approve credit cards for some of these customers based on a formula $a(\mathbf{x})$. Then, only those who get credit cards are monitored to see if they default or not. For simplicity, suppose that the first $N=10{,}000$ customers were given credit cards by the credit approval function $a(\mathbf{x})$. Now that the bank knows the behavior of these customers, it comes to you to improve its algorithm for approving credit. The bank gives you the data $\left(\mathbf{x}_{1}, y_{1}\right), \ldots,\left(\mathbf{x}_{N}, y_{N}\right)$. Before you look at the data, you do mathematical derivations and come up with a credit approval function $g$. You now test it on the data and, to your delight, obtain perfect prediction.

What is $M$, the size of your hypothesis set?

  • $N$
  • $1$
  • $N^2$
  • $2^N$
  • We have no idea about it

Sol: Since we come up with one particular credit approval function before looking at the data, this is the only function we are considering, so the size of the hypothesis set is $M=1$.

Question 9

With such an $M$, what does the Hoeffding bound say about the probability that the true average error rate of $g$ is worse than 1% for $N=10{,}000$?

Sol: By the Hoeffding inequality (with one fixed hypothesis),

$$\mathbb{P}\left[\left|E_{\text{in}}(g)-E_{\text{out}}(g)\right|>\epsilon\right] \leq 2 \exp \left(-2 \epsilon^{2} N\right),$$

where in this case, with $E_{\text{in}}(g)=0$, $\epsilon=0.01$ and $N=10{,}000$, the probability is at most $2 \exp \left(-2 \cdot 0.01^{2} \cdot 10000\right)=2 e^{-2} \approx 0.271$.
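
A small sketch of the same computation (using the $\epsilon=0.01$ and $N=10{,}000$ from the setup above):

import math

eps, N = 0.01, 10_000
bound = 2 * math.exp(-2 * eps ** 2 * N)
print(bound)  # about 0.271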

Question 10

You assure the bank that you have got a system $g$ for approving credit cards for new customers, which is nearly error-free. Your confidence is given by your answer to the previous question. The bank is thrilled and uses your $g$ to approve credit for new customers. To their dismay, more than half their credit cards are being defaulted on. Assume that the customers that were sent to the old credit approval function $a(\mathbf{x})$ and the customers that were sent to your $g$ are indeed i.i.d. from the same distribution, and the bank is lucky enough (so the 'bad luck' that "the true error of $g$ is worse than 1%" does not happen).

Select all the true claims for this situation.

  • By applying to approve credit for new customers, the performance of the overall credit approval system can be improved with guarantee provided by the previous problem.
  • By applying to approve credit for new customers, the performance of the overall credit approval system can be improved with guarantee provided by the previous problem.
  • By applying to approve credit for new customers, the performance of the overall credit approval system can be improved with guarantee provided by the previous problem.
  • By applying to approve credit for new customers, the performance of the overall credit approval system can be improved with guarantee provided by the previous problem.

Sol: The data we have are not clean, in the sense that they were contaminated by the approval function $a(\mathbf{x})$: we only observe the label $y_{n}$ for customers that $a$ approved. Our prediction function $g$ is therefore reliable only on customers that would also be approved by $a$. Hence, we should use the conjunction $a(\mathbf{x})$ AND $g(\mathbf{x})$ to approve credit.
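
A tiny sketch of that decision rule; the feature names and thresholds below are purely hypothetical placeholders:

# combine the bank's old approval rule a(x) with the learned rule g(x),
# because g was trained only on customers that a(x) had already approved
def a(x):
    return x["income"] > 30_000    # placeholder for the bank's vague rule

def g(x):
    return x["debt_ratio"] < 0.4   # placeholder for the learned rule

def approve(x):
    return a(x) and g(x)           # the conjunction recommended above

print(approve({"income": 50_000, "debt_ratio": 0.2}))  # True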

Question 11

Virtual Examples and Regularization

Consider linear regression with virtual examples. That is, we add $K$ virtual examples $\left(\tilde{\mathbf{x}}_{1}, \tilde{y}_{1}\right),\left(\tilde{\mathbf{x}}_{2}, \tilde{y}_{2}\right), \ldots,\left(\tilde{\mathbf{x}}_{K}, \tilde{y}_{K}\right)$ to the training data set, and solve

$$\min _{\mathbf{w}} \frac{1}{N+K}\left(\|\mathbf{y}-\mathbf{X} \mathbf{w}\|^{2}+\|\tilde{\mathbf{y}}-\tilde{\mathbf{X}} \mathbf{w}\|^{2}\right).$$

We will show that using some 'special' virtual examples, which were claimed to be a possible way to combat overfitting in Lecture 9, is related to regularization, another possible way to combat overfitting discussed in Lecture 10. Let $\tilde{\mathbf{X}}=\left[\tilde{\mathbf{x}}_{1}, \tilde{\mathbf{x}}_{2}, \ldots, \tilde{\mathbf{x}}_{K}\right]^{T}$ and $\tilde{\mathbf{y}}=\left[\tilde{y}_{1}, \tilde{y}_{2}, \ldots, \tilde{y}_{K}\right]^{T}$.

What is the optimal $\mathbf{w}$ to the optimization problem above, assuming that all the inversions exist?

  • $\left(\mathbf{X}^{T} \mathbf{X}+\tilde{\mathbf{X}}^{T} \tilde{\mathbf{X}}\right)^{-1}\left(\mathbf{X}^{T} \mathbf{y}+\tilde{\mathbf{X}}^{T} \tilde{\mathbf{y}}\right)$
  • $\left(\mathbf{X}^{T} \mathbf{X}+\tilde{\mathbf{X}}^{T} \tilde{\mathbf{X}}\right)^{-1}\left(\tilde{\mathbf{X}}^{T} \tilde{\mathbf{y}}\right)$
  • $\left(\mathbf{X}^{T} \mathbf{X}\right)^{-1}\left(\tilde{\mathbf{X}}^{T} \tilde{\mathbf{y}}\right)$
  • $\left(\mathbf{X}^{T} \mathbf{X}\right)^{-1}\left(\mathbf{X}^{T} \mathbf{y}+\tilde{\mathbf{X}}^{T} \tilde{\mathbf{y}}\right)$
  • none of the other choices

Sol: Taking the gradient of the objective and setting it to $\mathbf{0}$:

$$\frac{2}{N+K}\left(\mathbf{X}^{T} \mathbf{X} \mathbf{w}-\mathbf{X}^{T} \mathbf{y}+\tilde{\mathbf{X}}^{T} \tilde{\mathbf{X}} \mathbf{w}-\tilde{\mathbf{X}}^{T} \tilde{\mathbf{y}}\right)=\mathbf{0} \;\Longrightarrow\; \mathbf{w}=\left(\mathbf{X}^{T} \mathbf{X}+\tilde{\mathbf{X}}^{T} \tilde{\mathbf{X}}\right)^{-1}\left(\mathbf{X}^{T} \mathbf{y}+\tilde{\mathbf{X}}^{T} \tilde{\mathbf{y}}\right).$$

Question 12

For what $\tilde{\mathbf{X}}$ and $\tilde{\mathbf{y}}$ will the solution of this linear regression equal

$$\mathbf{w}_{\text{reg}}=\underset{\mathbf{w}}{\operatorname{argmin}}\; \frac{\lambda}{N}\|\mathbf{w}\|^{2}+\frac{1}{N}\|\mathbf{X} \mathbf{w}-\mathbf{y}\|^{2}\;?$$

Sol: Take $\tilde{\mathbf{X}}=\sqrt{\lambda} \mathbf{I}$ and $\tilde{\mathbf{y}}=\mathbf{0}$; then

$$\mathbf{w}=\left(\mathbf{X}^{T} \mathbf{X}+\lambda \mathbf{I}\right)^{-1}\left(\mathbf{X}^{T} \mathbf{y}+\mathbf{0}\right)=\left(\mathbf{X}^{T} \mathbf{X}+\lambda \mathbf{I}\right)^{-1} \mathbf{X}^{T} \mathbf{y}=\mathbf{w}_{\text{reg}}.$$

Thus, in this case the two solutions are equal.
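
A minimal numeric sketch (random toy data, not the course data set) checking that appending the virtual examples $\tilde{\mathbf{X}}=\sqrt{\lambda} \mathbf{I}$, $\tilde{\mathbf{y}}=\mathbf{0}$ to a plain least-squares problem reproduces the ridge-regression solution:

import numpy as np
from numpy.linalg import pinv

rng = np.random.default_rng(0)
N, d, lambd = 30, 4, 2.0
X = rng.normal(size=(N, d))
y = rng.normal(size=N)

# ridge regression solution
w_reg = pinv(X.T @ X + lambd * np.eye(d)) @ X.T @ y

# plain least squares on the augmented data
X_aug = np.vstack([X, np.sqrt(lambd) * np.eye(d)])
y_aug = np.concatenate([y, np.zeros(d)])
w_virtual = pinv(X_aug.T @ X_aug) @ X_aug.T @ y_aug

print(np.allclose(w_reg, w_virtual))  # True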

Question 13

Experiment with Regularized Linear Regression and Validation

Consider regularized linear regression (also called ridge regression) for classification:

$$\mathbf{w}_{\text{reg}}=\underset{\mathbf{w}}{\operatorname{argmin}}\; \frac{\lambda}{N}\|\mathbf{w}\|^{2}+\frac{1}{N}\|\mathbf{X} \mathbf{w}-\mathbf{y}\|^{2}.$$

Run the algorithm on the data set hw4_train.dat as $\mathcal{D}$, and use the set hw4_test.dat for evaluating $E_{\text{out}}$.

Because the data sets are for classification, please consider only the 0/1 error for all the problems below.

Let $\lambda=10$. Which of the following is the corresponding $E_{\text{in}}$ and $E_{\text{out}}$?

  • $E_{\text{in}}=0.015, E_{\text{out}}=0.020$
  • $E_{\text{in}}=0.030, E_{\text{out}}=0.015$
  • $E_{\text{in}}=0.020, E_{\text{out}}=0.010$
  • $E_{\text{in}}=0.035, E_{\text{out}}=0.020$
  • $E_{\text{in}}=0.050, E_{\text{out}}=0.045$

Sol:

import pandas as pd
import numpy as np
from numpy.linalg import pinv
def load_data(filename):
    df = pd.read_csv(filename, header=None, sep=r'\s+')
    X_df, y_df = df.iloc[:, :-1], df.iloc[:, -1]
    X, y = X_df.to_numpy(), y_df.to_numpy()
    n, _ = X.shape
    X = np.c_[np.ones((n, 1)), X]
    return X, y
# Read in training and test data
X, y = load_data('hw4_train.dat')
X_test, y_test = load_data('hw4_test.dat')
def ridge_reg(X, y, lambd):
    """
    Args:
        X: ndarray of shape = [N, d + 1]
        y: ndarray of shape = [N, ]
        lambd: float
    Returns:
        w: ndarray of shape = [d + 1, ]
    """
    _, d = X.shape
    w = pinv(X.T @ X + lambd * np.eye(d)) @ X.T @ y
    return w
w_ridge = ridge_reg(X, y, lambd=10)
w_ridge

Output:

array([-0.93238149,  1.04618645,  1.046171  ])
def calc_err(X, y, w):
    """
    Args:
        X: ndarray of shape = [N, d + 1]
        y: ndarray of shape = [N, ] 
        w: ndarray of shape = [d + 1, ]
    Returns:
        err: float
    """
    
    y_hat = np.sign(X @ w)
    err = np.mean(y_hat != y)
    return err
E_in = calc_err(X, y, w_ridge)
E_out = calc_err(X_test, y_test, w_ridge)

print(f"E_in: {E_in}\nE_out: {E_out}")

Output:

E_in: 0.05
E_out: 0.045

Question 14

Among $\log _{10} \lambda=\{2,1,0,-1, \ldots,-8,-9,-10\}$, what is the $\lambda$ with the minimum $E_{\text{in}}$? Compute $\lambda$ and its corresponding $E_{\text{in}}$ and $E_{\text{out}}$, and then select the closest answer. Break the tie by selecting the largest $\lambda$.

Sol:

lambd_vals = np.logspace(start=2, stop=-10, num=13)
E_in_list = []
E_out_list = []

for lambd in lambd_vals:
    w = ridge_reg(X, y, lambd)
    E_in = calc_err(X, y, w)
    E_out = calc_err(X_test, y_test, w)
    E_in_list.append(E_in)
    E_out_list.append(E_out)
for lambd, E_in, E_out in zip(lambd_vals, E_in_list, E_out_list):
    print(f"lambda={lambd:>6}, E_in={E_in:>6}, E_out={E_out:>6}")

Output:

lambda= 100.0, E_in=  0.24, E_out= 0.261
lambda=  10.0, E_in=  0.05, E_out= 0.045
lambda=   1.0, E_in= 0.035, E_out=  0.02
lambda=   0.1, E_in= 0.035, E_out= 0.016
lambda=  0.01, E_in=  0.03, E_out= 0.016
lambda= 0.001, E_in=  0.03, E_out= 0.016
lambda=0.0001, E_in=  0.03, E_out= 0.016
lambda= 1e-05, E_in=  0.03, E_out= 0.016
lambda= 1e-06, E_in= 0.035, E_out= 0.016
lambda= 1e-07, E_in=  0.03, E_out= 0.015
lambda= 1e-08, E_in= 0.015, E_out=  0.02
lambda= 1e-09, E_in= 0.015, E_out=  0.02
lambda= 1e-10, E_in= 0.015, E_out=  0.02

Thus, the answer is $\log _{10} \lambda=-8$, with $E_{\text{in}}=0.015$ and $E_{\text{out}}=0.020$.

Question 15

Among $\log _{10} \lambda=\{2,1,0,-1, \ldots,-8,-9,-10\}$, what is the $\lambda$ with the minimum $E_{\text{out}}$? Compute $\lambda$ and its corresponding $E_{\text{in}}$ and $E_{\text{out}}$, and then select the closest answer. Break the tie by selecting the largest $\lambda$.

Sol: From the table above, the minimum $E_{\text{out}}=0.015$ occurs at $\log _{10} \lambda=-7$, with $E_{\text{in}}=0.030$.
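
A short sketch showing how to pick this answer programmatically from the lists computed above; because lambd_vals is sorted from largest to smallest, np.argmin (which returns the first minimum) automatically breaks ties toward the largest $\lambda$:

best = np.argmin(E_out_list)  # first (i.e., largest-lambda) minimizer of E_out
print(lambd_vals[best], E_in_list[best], E_out_list[best])
# 1e-07 0.03 0.015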

Question 16

Now split the given training examples in $\mathcal{D}$ into the first 120 examples for $\mathcal{D}_{\text{train}}$ and the last 80 for $\mathcal{D}_{\text{val}}$.

Ideally, you should do the 120/80 split randomly. Because the given examples are already randomly permuted, however, we will use a fixed split for the purpose of this problem.

Run the algorithm on $\mathcal{D}_{\text{train}}$ to get $g_{\lambda}^{-}$, and validate $g_{\lambda}^{-}$ with $\mathcal{D}_{\text{val}}$.

Among $\log _{10} \lambda=\{2,1,0,-1, \ldots,-8,-9,-10\}$, what is the $\lambda$ with the minimum $E_{\text{train}}\left(g_{\lambda}^{-}\right)$? Compute $\lambda$ and the corresponding $E_{\text{train}}\left(g_{\lambda}^{-}\right)$, $E_{\text{val}}\left(g_{\lambda}^{-}\right)$ and $E_{\text{out}}\left(g_{\lambda}^{-}\right)$, and then select the closest answer. Break the tie by selecting the largest $\lambda$.

Sol:

# Split train/val= 120/80
X_train, X_val = X[:120], X[120:]
y_train, y_val = y[:120], y[120:]


E_train_list = []
E_val_list = []
E_out_list = []

for lambd in lambd_vals:
    w = ridge_reg(X_train, y_train, lambd)
    E_train = calc_err(X_train, y_train, w)
    E_val = calc_err(X_val, y_val, w)
    E_out = calc_err(X_test, y_test, w)
    E_train_list.append(E_train)
    E_val_list.append(E_val)
    E_out_list.append(E_out)
for lambd, E_train, E_val, E_out in zip(lambd_vals, E_train_list, E_val_list, E_out_list):
    print(f"lambda={lambd:>6}, E_train={E_train:>6f}, E_val={E_val:>6f}, E_out={E_out:>6f}")

Output:

lambda= 100.0, E_train=0.341667, E_val=0.412500, E_out=0.414000
lambda=  10.0, E_train=0.075000, E_val=0.125000, E_out=0.080000
lambda=   1.0, E_train=0.033333, E_val=0.037500, E_out=0.028000
lambda=   0.1, E_train=0.033333, E_val=0.037500, E_out=0.022000
lambda=  0.01, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda= 0.001, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda=0.0001, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda= 1e-05, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda= 1e-06, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda= 1e-07, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda= 1e-08, E_train=0.000000, E_val=0.050000, E_out=0.025000
lambda= 1e-09, E_train=0.000000, E_val=0.100000, E_out=0.038000
lambda= 1e-10, E_train=0.008333, E_val=0.125000, E_out=0.040000

So the answer is $\log _{10} \lambda=-8$, with $E_{\text{train}}=0.000$, $E_{\text{val}}=0.050$ and $E_{\text{out}}=0.025$.

Question 17

Among $\log _{10} \lambda=\{2,1,0,-1, \ldots,-8,-9,-10\}$, what is the $\lambda$ with the minimum $E_{\text{val}}\left(g_{\lambda}^{-}\right)$? Compute $\lambda$ and the corresponding $E_{\text{train}}\left(g_{\lambda}^{-}\right)$, $E_{\text{val}}\left(g_{\lambda}^{-}\right)$ and $E_{\text{out}}\left(g_{\lambda}^{-}\right)$, and then select the closest answer. Break the tie by selecting the largest $\lambda$.

Sol: From the table above, the minimum $E_{\text{val}}=0.0375$ is achieved for $\log _{10} \lambda=0,-1, \ldots,-7$; breaking the tie with the largest $\lambda$ gives $\log _{10} \lambda=0$, with $E_{\text{train}}=0.033$, $E_{\text{val}}=0.0375$ and $E_{\text{out}}=0.028$. So the closest answer is $\log _{10} \lambda=0$.

Question 18

Run the algorithm with the optimal $\lambda$ of the previous problem on the whole $\mathcal{D}$ to get $g_{\lambda}$. Compute $E_{\text{in}}\left(g_{\lambda}\right)$ and $E_{\text{out}}\left(g_{\lambda}\right)$, and then select the closest answer.

  • $E_{i n}\left(g_{\lambda}\right)=0.035, E_{o u t}\left(g_{\lambda}\right)=0.020$
  • $E_{i n}\left(g_{\lambda}\right)=0.055, E_{o u t}\left(g_{\lambda}\right)=0.020$
  • $E_{i n}\left(g_{\lambda}\right)=0.015, E_{o u t}\left(g_{\lambda}\right)=0.020$
  • $E_{i n}\left(g_{\lambda}\right)=0.045, E_{o u t}\left(g_{\lambda}\right)=0.030$
  • $E_{i n}\left(g_{\lambda}\right)=0.025, E_{o u t}\left(g_{\lambda}\right)=0.030$

Sol:

lambd = 1
w = ridge_reg(X, y, lambd)
E_in = calc_err(X, y, w)
E_out = calc_err(X_test, y_test, w)

print(f"E_in: {E_in}\nE_out: {E_out}")

Output:

E_in: 0.035
E_out: 0.02

Question 19

Now split the given training examples in $\mathcal{D}$ into five folds, the first 40 being fold 1, the next 40 being fold 2, and so on. Again, we take a fixed split because the given examples are already randomly permuted.

Among $\log _{10} \lambda=\{2,1,0,-1, \ldots,-8,-9,-10\}$, what is the $\lambda$ with the minimum $E_{\mathrm{cv}}$, where $E_{\mathrm{cv}}$ comes from the five folds defined above? Compute $\lambda$ and the corresponding $E_{\mathrm{cv}}$, and then select the closest answer. Break the tie by selecting the largest $\lambda$.

Sol:

num_of_folds = 5
E_cv_list = []

for lambd in lambd_vals:
    sum_of_cv_error = 0
    for k in range(num_of_folds):
        k_th_val_fold = slice(k * 40, (k + 1) * 40)
        k_th_train_fold = np.r_[slice(0, k * 40), slice((k + 1) * 40, 200)]
        X_val, y_val = X[k_th_val_fold], y[k_th_val_fold]
        X_train, y_train = X[k_th_train_fold], y[k_th_train_fold]
        w = ridge_reg(X_train, y_train, lambd)
        sum_of_cv_error += calc_err(X_val, y_val, w)
    E_cv = sum_of_cv_error / num_of_folds
    E_cv_list.append(E_cv)

for lambd, E_cv in zip(lambd_vals, E_cv_list):
    print(f"lambda={lambd:>6}, E_cv={E_cv:>6f}")

Output:

lambda= 100.0, E_cv=0.290000
lambda=  10.0, E_cv=0.060000
lambda=   1.0, E_cv=0.035000
lambda=   0.1, E_cv=0.035000
lambda=  0.01, E_cv=0.035000
lambda= 0.001, E_cv=0.035000
lambda=0.0001, E_cv=0.035000
lambda= 1e-05, E_cv=0.035000
lambda= 1e-06, E_cv=0.035000
lambda= 1e-07, E_cv=0.035000
lambda= 1e-08, E_cv=0.030000
lambda= 1e-09, E_cv=0.050000
lambda= 1e-10, E_cv=0.050000

So the answer is $\log _{10} \lambda=-8$, with $E_{\mathrm{cv}}=0.030$.

Question 20

Run the algorithm with the optimal $\lambda$ of the previous problem on the whole $\mathcal{D}$ to get $g_{\lambda}$. Compute $E_{\text{in}}\left(g_{\lambda}\right)$ and $E_{\text{out}}\left(g_{\lambda}\right)$, and then select the closest answer.

  • $E_{i n}\left(g_{\lambda}\right)=0.025, E_{o u t}\left(g_{\lambda}\right)=0.020$
  • $E_{i n}\left(g_{\lambda}\right)=0.035, E_{o u t}\left(g_{\lambda}\right)=0.030$
  • $E_{i n}\left(g_{\lambda}\right)=0.005, E_{o u t}\left(g_{\lambda}\right)=0.010$
  • $E_{i n}\left(g_{\lambda}\right)=0.015, E_{o u t}\left(g_{\lambda}\right)=0.020$
  • $E_{i n}\left(g_{\lambda}\right)=0.045, E_{o u t}\left(g_{\lambda}\right)=0.020$

Sol:

lambd = 1e-8
w = ridge_reg(X, y, lambd)
E_in = calc_err(X, y, w)
E_out = calc_err(X_test, y_test, w)

print(f"E_in: {E_in}\nE_out: {E_out}")

Output:

E_in: 0.015
E_out: 0.02
