Website of Machine Learning Foundations by Hsuan-Tien Lin: https://www.csie.ntu.edu.tw/~htlin/mooc/.

Question 1

Overfitting and Deterministic Noise

Deterministic noise depends on $\mathcal{H}$, as some models approximate $f$ better than others. Assume $\mathcal{H}' \subset \mathcal{H}$ and that $f$ is fixed. In general (but not necessarily in all cases), if we use $\mathcal{H}'$ instead of $\mathcal{H}$, how does deterministic noise behave?

  • In general, deterministic noise will decrease.
  • In general, deterministic noise will increase.
  • In general, deterministic noise will be the same.
  • If deterministic noise will increase, else it will decrease.
  • If deterministic noise will decrease, else it will increase.

Sol: In general, deterministic noise will increase, because the hypothesis set gets simpler, which makes it less likely that we can find a good $h^{*}$ (recall that deterministic noise is the difference between the best hypothesis $h^{*}$ in the given hypothesis set and the target function $f$).

Question 2

Regularization With Polynomials

Polynomial models can be viewed as linear models in a space $\mathcal{Z}$, under a nonlinear transform $\Phi: \mathcal{X} \rightarrow \mathcal{Z}$. Here, $\Phi$ transforms the scalar $x$ into a vector $\mathbf{z}$ of Legendre polynomials, $\mathbf{z}=\left(1, L_{1}(x), L_{2}(x), \ldots, L_{Q}(x)\right)$. Our hypothesis set $\mathcal{H}_{Q}$ will be expressed as a linear combination of these polynomials,

$$\mathcal{H}_{Q}=\left\{h \;\middle|\; h(x)=\mathbf{w}^{T} \mathbf{z}=\sum_{q=0}^{Q} w_{q} L_{q}(x)\right\},$$

where $L_{0}(x)=1$.

Consider the following hypothesis set defined by the constraint:

$$\mathcal{H}\left(Q, c, Q_{0}\right)=\left\{h \;\middle|\; h(x)=\mathbf{w}^{T} \mathbf{z} \in \mathcal{H}_{Q} ;\; w_{q}=c \text{ for } q \geq Q_{0}\right\}.$$

Which of the following statements is correct?

  • $\mathcal{H}(10,0,3) \cup \mathcal{H}(10,1,4)=\mathcal{H}_{3}$
  • $\mathcal{H}(10,1,3) \cap \mathcal{H}(10,1,4)=\mathcal{H}_{1}$
  • $\mathcal{H}(10,0,3) \cap \mathcal{H}(10,0,4)=\mathcal{H}_{2}$
  • $\mathcal{H}(10,0,3) \cup \mathcal{H}(10,0,4)=\mathcal{H}_{4}$
  • none of the other choices

Sol:

$\mathcal{H}(10,0,3)$: $w_{q}=0$ for $q \geq 3$, so it contains polynomials of the form $w_{0} L_{0}(x)+w_{1} L_{1}(x)+w_{2} L_{2}(x)$;

$\mathcal{H}(10,0,4)$: $w_{q}=0$ for $q \geq 4$, so it contains polynomials of the form $w_{0} L_{0}(x)+w_{1} L_{1}(x)+w_{2} L_{2}(x)+w_{3} L_{3}(x)$.

Therefore, $\mathcal{H}(10,0,3) \cap \mathcal{H}(10,0,4)=\mathcal{H}_{2}$ is the answer.

Question 3

Regularization and Weight Decay

Consider the augmented error

$$E_{\mathrm{aug}}(\mathbf{w})=E_{\mathrm{in}}(\mathbf{w})+\frac{\lambda}{N} \mathbf{w}^{T} \mathbf{w}$$

with some $\lambda>0$.

If we want to minimize the augmented error by gradient descent, with $\eta$ as the learning rate, which of the following is the correct update rule?

  • $\mathbf{w}(t+1) \longleftarrow \mathbf{w}(t)-\eta \nabla E_{\mathrm{in}}(\mathbf{w}(t))$
  • $\mathbf{w}(t+1) \longleftarrow\left(1-\frac{2 \eta \lambda}{N}\right) \mathbf{w}(t)-\eta \nabla E_{\mathrm{in}}(\mathbf{w}(t))$
  • $\mathbf{w}(t+1) \longleftarrow\left(1+\frac{2 \eta \lambda}{N}\right) \mathbf{w}(t)-\eta \nabla E_{\mathrm{in}}(\mathbf{w}(t))$
  • $\mathbf{w}(t+1) \longleftarrow\left(1+\frac{\eta \lambda}{N}\right) \mathbf{w}(t)-\eta \nabla E_{\mathrm{in}}(\mathbf{w}(t))$
  • $\mathbf{w}(t+1) \longleftarrow\left(1-\frac{\eta \lambda}{N}\right) \mathbf{w}(t)-\eta \nabla E_{\mathrm{in}}(\mathbf{w}(t))$

Sol: Since

$$\nabla E_{\mathrm{aug}}(\mathbf{w})=\nabla E_{\mathrm{in}}(\mathbf{w})+\frac{2 \lambda}{N} \mathbf{w},$$

we have the update rule

$$\mathbf{w}(t+1) \longleftarrow \mathbf{w}(t)-\eta\left(\nabla E_{\mathrm{in}}(\mathbf{w}(t))+\frac{2 \lambda}{N} \mathbf{w}(t)\right)=\left(1-\frac{2 \eta \lambda}{N}\right) \mathbf{w}(t)-\eta \nabla E_{\mathrm{in}}(\mathbf{w}(t)).$$
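
As a quick sanity check, here is a minimal sketch (with a toy squared-error $E_{\mathrm{in}}$ and made-up values of $\eta$, $\lambda$, $N$, none of which come from the problem) showing that the weight-decay form above is the same as a plain gradient step on $E_{\mathrm{aug}}$:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
eta, lambd, N = 0.1, 2.0, len(y)

def grad_E_in(w):
    # gradient of E_in(w) = (1/N) * ||Xw - y||^2
    return 2 / N * X.T @ (X @ w - y)

# one step directly on E_aug(w) = E_in(w) + (lambda/N) * w^T w
w_direct = w - eta * (grad_E_in(w) + 2 * lambd / N * w)
# the same step written in the weight-decay form
w_decay = (1 - 2 * eta * lambd / N) * w - eta * grad_E_in(w)

print(np.allclose(w_direct, w_decay))  # True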

Question 4

Let $\mathbf{w}_{\text{lin}}$ be the optimal solution for the plain-vanilla linear regression and $\mathbf{w}_{\text{reg}}(\lambda)$ be the optimal solution for the formulation above. Select all the correct statements below:

  • $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|>\left\|\mathbf{w}_{\text{lin}}\right\|$ for any $\lambda>0$
  • $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|$ is a non-decreasing function of $\lambda$ for $\lambda \geq 0$
  • $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|$ is a non-increasing function of $\lambda$ for $\lambda \geq 0$
  • $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|=\left\|\mathbf{w}_{\text{lin}}\right\|$ for any $\lambda>0$
  • none of the other choices

Sol:

From the constrained minimization of $E_{\mathrm{in}}$:

$$\min _{\mathbf{w}} E_{\mathrm{in}}(\mathbf{w}) \quad \text{subject to} \quad \mathbf{w}^{T} \mathbf{w} \leq C,$$

which is equivalent to the unconstrained minimization of $E_{\mathrm{aug}}(\mathbf{w})=E_{\mathrm{in}}(\mathbf{w})+\frac{\lambda}{N} \mathbf{w}^{T} \mathbf{w}$, we can deduce that if $\mathbf{w}_{\text{lin}}$ satisfies the constraint $\mathbf{w}^{T} \mathbf{w} \leq C$, then $\mathbf{w}_{\text{reg}}(\lambda)=\mathbf{w}_{\text{lin}}$; otherwise, $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|<\left\|\mathbf{w}_{\text{lin}}\right\|$.

As for the monotonicity of $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|$: increasing $\lambda$ corresponds to decreasing $C$, which restricts the growth of $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|$; hence $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|$ is non-increasing in $\lambda$.
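
A minimal numeric sketch (on a random toy regression problem, not the course data) illustrating that $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|$ never grows as $\lambda$ increases:

import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(0)
N, d = 50, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

for lambd in [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    # minimizer of (1/N)||Xw - y||^2 + (lambda/N) w^T w
    w_reg = inv(X.T @ X + lambd * np.eye(d)) @ X.T @ y
    print(lambd, np.linalg.norm(w_reg))  # the norm is non-increasing in lambda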

Question 5

Leave-One-Out Cross-Validation

You are given the data points $(-1,0)$, $(\rho, 1)$, $(1,0)$ with $\rho \geq 0$, and a choice between two models: constant $h_{0}(x)=b_{0}$ and linear $h_{1}(x)=a_{1} x+b_{1}$. For which value of $\rho$ would the two models be tied using leave-one-out cross-validation with the squared error measure?

  • $\sqrt{\sqrt{3}+4}$
  • $\sqrt{\sqrt{3}-1}$
  • $\sqrt{9+4 \sqrt{6}}$
  • $\sqrt{9-\sqrt{6}}$
  • none of the other choices

Sol:

For the first model, which is a constant model, the best hypothesis we can choose over a training set is the average of its $y$ values. Leaving out each of the three points in turn gives the constants $\frac{1}{2}$, $0$, and $\frac{1}{2}$, so substituting into the squared error:

$$E_{\mathrm{cv}}\left(h_{0}\right)=\frac{1}{3}\left(\left(\tfrac{1}{2}-0\right)^{2}+(0-1)^{2}+\left(\tfrac{1}{2}-0\right)^{2}\right)=\frac{1}{2}.$$

As for the second model, the best line using two points is exactly the line through them, and hence the three leave-one-out errors are

$$e_{1}=\left(\frac{-2}{\rho-1}\right)^{2}, \quad e_{2}=1, \quad e_{3}=\left(\frac{2}{\rho+1}\right)^{2}.$$

Then

$$E_{\mathrm{cv}}\left(h_{1}\right)=\frac{1}{3}\left(\frac{4}{(\rho-1)^{2}}+1+\frac{4}{(\rho+1)^{2}}\right).$$

Equating $E_{\mathrm{cv}}\left(h_{0}\right)$ and $E_{\mathrm{cv}}\left(h_{1}\right)$, we can solve for $\rho=\sqrt{9+4 \sqrt{6}}$.
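
A minimal sketch (not part of the original solution) that checks the tie numerically by running leave-one-out cross-validation for both models at $\rho=\sqrt{9+4 \sqrt{6}}$:

import numpy as np

rho = np.sqrt(9 + 4 * np.sqrt(6))
pts = np.array([[-1.0, 0.0], [rho, 1.0], [1.0, 0.0]])

def fit_const(train):
    # best constant under squared error: the mean of the y values
    b = train[:, 1].mean()
    return lambda x: b

def fit_line(train):
    # the line through the two remaining points
    (x1, y1), (x2, y2) = train
    a = (y2 - y1) / (x2 - x1)
    return lambda x: a * (x - x1) + y1

def loocv(fit):
    errs = []
    for i in range(len(pts)):
        h = fit(np.delete(pts, i, axis=0))
        x, y = pts[i]
        errs.append((h(x) - y) ** 2)
    return np.mean(errs)

print(loocv(fit_const), loocv(fit_line))  # both are 0.5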

Question 6

Learning Principles

In Problems 6-7, suppose that for 5 weeks in a row, a letter arrives in the mail that predicts the outcome of the upcoming Monday night baseball game. (Assume there are no ties.) You keenly watch each Monday and, to your surprise, the prediction is correct each time. On the day after the fifth game, a letter arrives, stating that if you wish to see next week's prediction, a payment of NTD 1,000 is required. Which of the following statements is true?

  • To make sure that at least one person receives correct predictions on all 5 games from him, at least 64 letters are sent before the fifth game.
  • If the sender wants to make sure that at least one person receives correct predictions on all 5 games from him, the sender should target to begin with at least 5 people.
  • To make sure that at least one person receives correct predictions on all 5 games from him, after the first letter ‘predicts’ the outcome of the first game, the sender should target at least 16 people with the second letter.
  • none of the other choices.

Sol: If we just randomly guess the result, the probability of getting the correct result is $\frac{1}{2}$, so the probability of making 5 consecutive correct 'predictions' by random guessing is $\left(\frac{1}{2}\right)^{5}=\frac{1}{32}$.

The sender's 'strategy' is as follows:

  1. Before the first game: mail out 32 letters, 16 predicting that A wins and the other 16 that B wins, so half of them will be correct.
  2. Before the second game: among the 16 people who received a correct prediction the first time, mail 8 of them letters predicting that A wins and the other 8 letters predicting that B wins.
  3. Repeat the same process until the fifth game.

After the fifth game, there will be exactly one person who has received correct predictions on all 5 games. We then mail that person the payment letter.

So the correct answer is the third statement.

Question 7

Suppose the cost of printing and mailing out each letter is NTD 10. If the sender sends out the minimum number of letters, how much money can be made when the above 'fraud' succeeds once, that is, when one of the recipients does send him NTD 1,000 to receive the prediction of the 6th game?

  • NTD 400
  • NTD 340
  • NTD 460
  • NTD 430
  • NTD 370

Sol: The total number of letters the sender needs to send is

$$32+16+8+4+2+1=63,$$

where the last '1' is the letter with the payment requirement attached. So the money that can be made is $1000-63 \times 10=$ NTD 370.
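
A tiny sketch of the same arithmetic:

letters = 32 + 16 + 8 + 4 + 2 + 1  # prediction letters plus the final payment letter
profit = 1000 - 10 * letters       # NTD 1,000 received minus NTD 10 per letter
print(letters, profit)             # 63 370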

Question 8

For Problems 8-10, we consider the following. In our credit card example, the bank starts with some vague idea of what constitutes a good credit risk. So, as customers $\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{N}$ arrive, the bank applies its vague idea to approve credit cards for some of these customers based on a formula $a(\mathbf{x})$. Then, only those who get credit cards are monitored to see if they default or not. For simplicity, suppose that the first $N=10{,}000$ customers were given credit cards by the credit approval function $a(\mathbf{x})$. Now that the bank knows the behavior of these customers, it comes to you to improve its algorithm for approving credit. The bank gives you the data $\left(\mathbf{x}_{1}, y_{1}\right), \ldots,\left(\mathbf{x}_{N}, y_{N}\right)$. Before you look at the data, you do mathematical derivations and come up with a credit approval function $g$. You now test it on the data and, to your delight, obtain perfect prediction.

What is $M$, the size of your hypothesis set?

  • $N$
  • $1$
  • $N^2$
  • $2^N$
  • We have no idea about it

Sol: Since we come up with one particular credit approval function before looking at the data, this is the only function we are considering, so the size of the hypothesis set is $M=1$.

Question 9

With such an $M$, what does the Hoeffding bound say about the probability that the true average error rate of $g$ is worse than 1% for $N=10{,}000$?

Sol: By the Hoeffding inequality (with one fixed hypothesis),

$$\mathbb{P}\left[\left|E_{\text{in}}(g)-E_{\text{out}}(g)\right|>\epsilon\right] \leq 2 \exp \left(-2 \epsilon^{2} N\right),$$

where in this case, with $E_{\text{in}}(g)=0$, $\epsilon=0.01$ and $N=10{,}000$, the probability is at most $2 \exp \left(-2 \cdot 0.01^{2} \cdot 10000\right)=2 e^{-2} \approx 0.271$.
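
A small sketch of the same computation (using the $\epsilon=0.01$ and $N=10{,}000$ from the setup above):

import math

eps, N = 0.01, 10_000
bound = 2 * math.exp(-2 * eps ** 2 * N)
print(bound)  # about 0.271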

Question 10

You assure the bank that you have got a system $g$ for approving credit cards for new customers, which is nearly error-free. Your confidence is given by your answer to the previous question. The bank is thrilled and uses your $g$ to approve credit for new customers. To their dismay, more than half their credit cards are being defaulted on. Assume that the customers that were sent to the old credit approval function $a(\mathbf{x})$ and the customers that were sent to your $g$ are indeed i.i.d. from the same distribution, and the bank is lucky enough (so the 'bad luck' that "the true error of $g$ is worse than 1%" does not happen).

Select all the true claims for this situation.

  • By applying to approve credit for new customers, the performance of the overall credit approval system can be improved with guarantee provided by the previous problem.
  • By applying to approve credit for new customers, the performance of the overall credit approval system can be improved with guarantee provided by the previous problem.
  • By applying to approve credit for new customers, the performance of the overall credit approval system can be improved with guarantee provided by the previous problem.
  • By applying to approve credit for new customers, the performance of the overall credit approval system can be improved with guarantee provided by the previous problem.

Sol: The data we have are not clean, in the sense that they were contaminated by the approval function $a(\mathbf{x})$: we only observe the label $y_{n}$ for customers that $a$ approved. Our prediction function $g$ is therefore reliable only on customers that would also be approved by $a$. Hence, we should use the conjunction $a(\mathbf{x})$ AND $g(\mathbf{x})$ to approve credit.
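
A tiny sketch of that decision rule; the feature names and thresholds below are purely hypothetical placeholders:

# combine the bank's old approval rule a(x) with the learned rule g(x),
# because g was trained only on customers that a(x) had already approved
def a(x):
    return x["income"] > 30_000    # placeholder for the bank's vague rule

def g(x):
    return x["debt_ratio"] < 0.4   # placeholder for the learned rule

def approve(x):
    return a(x) and g(x)           # the conjunction recommended above

print(approve({"income": 50_000, "debt_ratio": 0.2}))  # True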

Question 11

Virtual Examples and Regularization

Consider linear regression with virtual examples. That is, we add $K$ virtual examples $\left(\tilde{\mathbf{x}}_{1}, \tilde{y}_{1}\right),\left(\tilde{\mathbf{x}}_{2}, \tilde{y}_{2}\right), \ldots,\left(\tilde{\mathbf{x}}_{K}, \tilde{y}_{K}\right)$ to the training data set, and solve

$$\min _{\mathbf{w}} \frac{1}{N+K}\left(\|\mathbf{y}-\mathbf{X} \mathbf{w}\|^{2}+\|\tilde{\mathbf{y}}-\tilde{\mathbf{X}} \mathbf{w}\|^{2}\right).$$

We will show that using some 'special' virtual examples, which were claimed to be a possible way to combat overfitting in Lecture 9, is related to regularization, another possible way to combat overfitting discussed in Lecture 10. Let $\tilde{\mathbf{X}}=\left[\tilde{\mathbf{x}}_{1}, \tilde{\mathbf{x}}_{2}, \ldots, \tilde{\mathbf{x}}_{K}\right]^{T}$ and $\tilde{\mathbf{y}}=\left[\tilde{y}_{1}, \tilde{y}_{2}, \ldots, \tilde{y}_{K}\right]^{T}$.

What is the optimal $\mathbf{w}$ to the optimization problem above, assuming that all the inversions exist?

  • $\left(\mathbf{X}^{T} \mathbf{X}+\tilde{\mathbf{X}}^{T} \tilde{\mathbf{X}}\right)^{-1}\left(\mathbf{X}^{T} \mathbf{y}+\tilde{\mathbf{X}}^{T} \tilde{\mathbf{y}}\right)$
  • $\left(\mathbf{X}^{T} \mathbf{X}+\tilde{\mathbf{X}}^{T} \tilde{\mathbf{X}}\right)^{-1}\left(\tilde{\mathbf{X}}^{T} \tilde{\mathbf{y}}\right)$
  • $\left(\mathbf{X}^{T} \mathbf{X}\right)^{-1}\left(\tilde{\mathbf{X}}^{T} \tilde{\mathbf{y}}\right)$
  • $\left(\mathbf{X}^{T} \mathbf{X}\right)^{-1}\left(\mathbf{X}^{T} \mathbf{y}+\tilde{\mathbf{X}}^{T} \tilde{\mathbf{y}}\right)$
  • none of the other choices

Sol: Taking the gradient of the objective and setting it to $\mathbf{0}$:

$$\frac{2}{N+K}\left(\mathbf{X}^{T} \mathbf{X} \mathbf{w}-\mathbf{X}^{T} \mathbf{y}+\tilde{\mathbf{X}}^{T} \tilde{\mathbf{X}} \mathbf{w}-\tilde{\mathbf{X}}^{T} \tilde{\mathbf{y}}\right)=\mathbf{0} \;\Longrightarrow\; \mathbf{w}=\left(\mathbf{X}^{T} \mathbf{X}+\tilde{\mathbf{X}}^{T} \tilde{\mathbf{X}}\right)^{-1}\left(\mathbf{X}^{T} \mathbf{y}+\tilde{\mathbf{X}}^{T} \tilde{\mathbf{y}}\right).$$

Question 12

For what $\tilde{\mathbf{X}}$ and $\tilde{\mathbf{y}}$ will the solution of this linear regression equal

$$\mathbf{w}_{\text{reg}}=\underset{\mathbf{w}}{\operatorname{argmin}}\; \frac{\lambda}{N}\|\mathbf{w}\|^{2}+\frac{1}{N}\|\mathbf{X} \mathbf{w}-\mathbf{y}\|^{2}\;?$$

Sol: Take $\tilde{\mathbf{X}}=\sqrt{\lambda} \mathbf{I}$ and $\tilde{\mathbf{y}}=\mathbf{0}$; then

$$\mathbf{w}=\left(\mathbf{X}^{T} \mathbf{X}+\lambda \mathbf{I}\right)^{-1}\left(\mathbf{X}^{T} \mathbf{y}+\mathbf{0}\right)=\left(\mathbf{X}^{T} \mathbf{X}+\lambda \mathbf{I}\right)^{-1} \mathbf{X}^{T} \mathbf{y}=\mathbf{w}_{\text{reg}}.$$

Thus, in this case the two solutions are equal.
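
A minimal numeric sketch (random toy data, not the course data set) checking that appending the virtual examples $\tilde{\mathbf{X}}=\sqrt{\lambda} \mathbf{I}$, $\tilde{\mathbf{y}}=\mathbf{0}$ to a plain least-squares problem reproduces the ridge-regression solution:

import numpy as np
from numpy.linalg import pinv

rng = np.random.default_rng(0)
N, d, lambd = 30, 4, 2.0
X = rng.normal(size=(N, d))
y = rng.normal(size=N)

# ridge regression solution
w_reg = pinv(X.T @ X + lambd * np.eye(d)) @ X.T @ y

# plain least squares on the augmented data
X_aug = np.vstack([X, np.sqrt(lambd) * np.eye(d)])
y_aug = np.concatenate([y, np.zeros(d)])
w_virtual = pinv(X_aug.T @ X_aug) @ X_aug.T @ y_aug

print(np.allclose(w_reg, w_virtual))  # True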

Question 13

Experiment with Regularized Linear Regression and Validation

Consider regularized linear regression (also called ridge regression) for classification:

$$\mathbf{w}_{\text{reg}}=\underset{\mathbf{w}}{\operatorname{argmin}}\; \frac{\lambda}{N}\|\mathbf{w}\|^{2}+\frac{1}{N}\|\mathbf{X} \mathbf{w}-\mathbf{y}\|^{2}.$$

Run the algorithm on the data set hw4_train.dat as $\mathcal{D}$, and use the set hw4_test.dat for evaluating $E_{\text{out}}$.

Because the data sets are for classification, please consider only the 0/1 error for all the problems below.

Let $\lambda=10$. Which of the following is the corresponding $E_{\text{in}}$ and $E_{\text{out}}$?

  • $E_{\text{in}}=0.015, E_{\text{out}}=0.020$
  • $E_{\text{in}}=0.030, E_{\text{out}}=0.015$
  • $E_{\text{in}}=0.020, E_{\text{out}}=0.010$
  • $E_{\text{in}}=0.035, E_{\text{out}}=0.020$
  • $E_{\text{in}}=0.050, E_{\text{out}}=0.045$

Sol:

import pandas as pd
import numpy as np
from numpy.linalg import pinv
def load_data(filename):
    df = pd.read_csv(filename, header=None, sep=r'\s+')
    X_df, y_df = df.iloc[:, :-1], df.iloc[:, -1]
    X, y = X_df.to_numpy(), y_df.to_numpy()
    n, _ = X.shape
    X = np.c_[np.ones((n, 1)), X]
    return X, y
# Read in training and test data
X, y = load_data('hw4_train.dat')
X_test, y_test = load_data('hw4_test.dat')
def ridge_reg(X, y, lambd):
    """
    Args:
        X: ndarray of shape = [N, d + 1]
        y: ndarray of shape = [N, ]
        lambd: float
    Returns:
        w: ndarray of shape = [d + 1, ]
    """
    _, d = X.shape
    w = pinv(X.T @ X + lambd * np.eye(d)) @ X.T @ y
    return w
w_ridge = ridge_reg(X, y, lambd=10)
w_ridge

Output:

array([-0.93238149,  1.04618645,  1.046171  ])
def calc_err(X, y, w):
    """
    Args:
        X: ndarray of shape = [N, d + 1]
        y: ndarray of shape = [N, ] 
        w: ndarray of shape = [d + 1, ]
    Returns:
        err: float
    """
    
    y_hat = np.sign(X @ w)
    err = np.mean(y_hat != y)
    return err
E_in = calc_err(X, y, w_ridge)
E_out = calc_err(X_test, y_test, w_ridge)

print(f"E_in: {E_in}\nE_out: {E_out}")

Output:

E_in: 0.05
E_out: 0.045

Question 14

Among $\log _{10} \lambda=\{2,1,0,-1, \ldots,-8,-9,-10\}$, what is the $\lambda$ with the minimum $E_{\text{in}}$? Compute $\lambda$ and its corresponding $E_{\text{in}}$ and $E_{\text{out}}$, and then select the closest answer. Break the tie by selecting the largest $\lambda$.

Sol:

lambd_vals = np.logspace(start=2, stop=-10, num=13)
E_in_list = []
E_out_list = []

for lambd in lambd_vals:
    w = ridge_reg(X, y, lambd)
    E_in = calc_err(X, y, w)
    E_out = calc_err(X_test, y_test, w)
    E_in_list.append(E_in)
    E_out_list.append(E_out)
for lambd, E_in, E_out in zip(lambd_vals, E_in_list, E_out_list):
    print(f"lambda={lambd:>6}, E_in={E_in:>6}, E_out={E_out:>6}")

Output:

lambda= 100.0, E_in=  0.24, E_out= 0.261
lambda=  10.0, E_in=  0.05, E_out= 0.045
lambda=   1.0, E_in= 0.035, E_out=  0.02
lambda=   0.1, E_in= 0.035, E_out= 0.016
lambda=  0.01, E_in=  0.03, E_out= 0.016
lambda= 0.001, E_in=  0.03, E_out= 0.016
lambda=0.0001, E_in=  0.03, E_out= 0.016
lambda= 1e-05, E_in=  0.03, E_out= 0.016
lambda= 1e-06, E_in= 0.035, E_out= 0.016
lambda= 1e-07, E_in=  0.03, E_out= 0.015
lambda= 1e-08, E_in= 0.015, E_out=  0.02
lambda= 1e-09, E_in= 0.015, E_out=  0.02
lambda= 1e-10, E_in= 0.015, E_out=  0.02

Thus, the answer is $\log _{10} \lambda=-8$, with $E_{\text{in}}=0.015$ and $E_{\text{out}}=0.020$.

Question 15

Among $\log _{10} \lambda=\{2,1,0,-1, \ldots,-8,-9,-10\}$, what is the $\lambda$ with the minimum $E_{\text{out}}$? Compute $\lambda$ and its corresponding $E_{\text{in}}$ and $E_{\text{out}}$, and then select the closest answer. Break the tie by selecting the largest $\lambda$.

Sol: From the table above, the minimum $E_{\text{out}}=0.015$ occurs at $\log _{10} \lambda=-7$, with $E_{\text{in}}=0.030$.
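
A short sketch showing how to pick this answer programmatically from the lists computed above; because lambd_vals is sorted from largest to smallest, np.argmin (which returns the first minimum) automatically breaks ties toward the largest $\lambda$:

best = np.argmin(E_out_list)  # first (i.e., largest-lambda) minimizer of E_out
print(lambd_vals[best], E_in_list[best], E_out_list[best])
# 1e-07 0.03 0.015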

Question 16

Now split the given training examples in $\mathcal{D}$ into the first 120 examples for $\mathcal{D}_{\text{train}}$ and the last 80 for $\mathcal{D}_{\text{val}}$.

Ideally, you should do the 120/80 split randomly. Because the given examples are already randomly permuted, however, we will use a fixed split for the purpose of this problem.

Run the algorithm on $\mathcal{D}_{\text{train}}$ to get $g_{\lambda}^{-}$, and validate $g_{\lambda}^{-}$ with $\mathcal{D}_{\text{val}}$.

Among $\log _{10} \lambda=\{2,1,0,-1, \ldots,-8,-9,-10\}$, what is the $\lambda$ with the minimum $E_{\text{train}}\left(g_{\lambda}^{-}\right)$? Compute $\lambda$ and the corresponding $E_{\text{train}}\left(g_{\lambda}^{-}\right)$, $E_{\text{val}}\left(g_{\lambda}^{-}\right)$ and $E_{\text{out}}\left(g_{\lambda}^{-}\right)$, and then select the closest answer. Break the tie by selecting the largest $\lambda$.

Sol:

# Split train/val= 120/80
X_train, X_val = X[:120], X[120:]
y_train, y_val = y[:120], y[120:]


E_train_list = []
E_val_list = []
E_out_list = []

for lambd in lambd_vals:
    w = ridge_reg(X_train, y_train, lambd)
    E_train = calc_err(X_train, y_train, w)
    E_val = calc_err(X_val, y_val, w)
    E_out = calc_err(X_test, y_test, w)
    E_train_list.append(E_train)
    E_val_list.append(E_val)
    E_out_list.append(E_out)
for lambd, E_train, E_val, E_out in zip(lambd_vals, E_train_list, E_val_list, E_out_list):
    print(f"lambda={lambd:>6}, E_train={E_train:>6f}, E_val={E_val:>6f}, E_out={E_out:>6f}")

Output:

lambda= 100.0, E_train=0.341667, E_val=0.412500, E_out=0.414000
lambda=  10.0, E_train=0.075000, E_val=0.125000, E_out=0.080000
lambda=   1.0, E_train=0.033333, E_val=0.037500, E_out=0.028000
lambda=   0.1, E_train=0.033333, E_val=0.037500, E_out=0.022000
lambda=  0.01, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda= 0.001, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda=0.0001, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda= 1e-05, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda= 1e-06, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda= 1e-07, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda= 1e-08, E_train=0.000000, E_val=0.050000, E_out=0.025000
lambda= 1e-09, E_train=0.000000, E_val=0.100000, E_out=0.038000
lambda= 1e-10, E_train=0.008333, E_val=0.125000, E_out=0.040000

So the answer is $\log _{10} \lambda=-8$, with $E_{\text{train}}=0.000$, $E_{\text{val}}=0.050$ and $E_{\text{out}}=0.025$.

Question 17

Among $\log _{10} \lambda=\{2,1,0,-1, \ldots,-8,-9,-10\}$, what is the $\lambda$ with the minimum $E_{\text{val}}\left(g_{\lambda}^{-}\right)$? Compute $\lambda$ and the corresponding $E_{\text{train}}\left(g_{\lambda}^{-}\right)$, $E_{\text{val}}\left(g_{\lambda}^{-}\right)$ and $E_{\text{out}}\left(g_{\lambda}^{-}\right)$, and then select the closest answer. Break the tie by selecting the largest $\lambda$.

Sol: From the table above, the minimum $E_{\text{val}}=0.0375$ is achieved for $\log _{10} \lambda=0,-1, \ldots,-7$; breaking the tie with the largest $\lambda$ gives $\log _{10} \lambda=0$, with $E_{\text{train}}=0.033$, $E_{\text{val}}=0.0375$ and $E_{\text{out}}=0.028$. So the closest answer is $\log _{10} \lambda=0$.

Question 18

Run the algorithm with the optimal $\lambda$ of the previous problem on the whole $\mathcal{D}$ to get $g_{\lambda}$. Compute $E_{\text{in}}\left(g_{\lambda}\right)$ and $E_{\text{out}}\left(g_{\lambda}\right)$, and then select the closest answer.

  • $E_{i n}\left(g_{\lambda}\right)=0.035, E_{o u t}\left(g_{\lambda}\right)=0.020$
  • $E_{i n}\left(g_{\lambda}\right)=0.055, E_{o u t}\left(g_{\lambda}\right)=0.020$
  • $E_{i n}\left(g_{\lambda}\right)=0.015, E_{o u t}\left(g_{\lambda}\right)=0.020$
  • $E_{i n}\left(g_{\lambda}\right)=0.045, E_{o u t}\left(g_{\lambda}\right)=0.030$
  • $E_{i n}\left(g_{\lambda}\right)=0.025, E_{o u t}\left(g_{\lambda}\right)=0.030$

Sol:

lambd = 1
w = ridge_reg(X, y, lambd)
E_in = calc_err(X, y, w)
E_out = calc_err(X_test, y_test, w)

print(f"E_in: {E_in}\nE_out: {E_out}")

Output:

E_in: 0.035
E_out: 0.02

Question 19

Now split the given training examples in $\mathcal{D}$ into five folds, the first 40 being fold 1, the next 40 being fold 2, and so on. Again, we take a fixed split because the given examples are already randomly permuted.

Among $\log _{10} \lambda=\{2,1,0,-1, \ldots,-8,-9,-10\}$, what is the $\lambda$ with the minimum $E_{\mathrm{cv}}$, where $E_{\mathrm{cv}}$ comes from the five folds defined above? Compute $\lambda$ and the corresponding $E_{\mathrm{cv}}$, and then select the closest answer. Break the tie by selecting the largest $\lambda$.

Sol:

num_of_folds = 5
E_cv_list = []

for lambd in lambd_vals:
    sum_of_cv_error = 0
    for k in range(num_of_folds):
        k_th_val_fold = slice(k * 40, (k + 1) * 40)
        k_th_train_fold = np.r_[slice(0, k * 40), slice((k + 1) * 40, 200)]
        X_val, y_val = X[k_th_val_fold], y[k_th_val_fold]
        X_train, y_train = X[k_th_train_fold], y[k_th_train_fold]
        w = ridge_reg(X_train, y_train, lambd)
        sum_of_cv_error += calc_err(X_val, y_val, w)
    E_cv = sum_of_cv_error / num_of_folds
    E_cv_list.append(E_cv)

for lambd, E_cv in zip(lambd_vals, E_cv_list):
    print(f"lambda={lambd:>6}, E_cv={E_cv:>6f}")

Output:

lambda= 100.0, E_cv=0.290000
lambda=  10.0, E_cv=0.060000
lambda=   1.0, E_cv=0.035000
lambda=   0.1, E_cv=0.035000
lambda=  0.01, E_cv=0.035000
lambda= 0.001, E_cv=0.035000
lambda=0.0001, E_cv=0.035000
lambda= 1e-05, E_cv=0.035000
lambda= 1e-06, E_cv=0.035000
lambda= 1e-07, E_cv=0.035000
lambda= 1e-08, E_cv=0.030000
lambda= 1e-09, E_cv=0.050000
lambda= 1e-10, E_cv=0.050000

So the answer is $\log _{10} \lambda=-8$, with $E_{\mathrm{cv}}=0.030$.

Question 20

Run the algorithm with the optimal $\lambda$ of the previous problem on the whole $\mathcal{D}$ to get $g_{\lambda}$. Compute $E_{\text{in}}\left(g_{\lambda}\right)$ and $E_{\text{out}}\left(g_{\lambda}\right)$, and then select the closest answer.

  • $E_{i n}\left(g_{\lambda}\right)=0.025, E_{o u t}\left(g_{\lambda}\right)=0.020$
  • $E_{i n}\left(g_{\lambda}\right)=0.035, E_{o u t}\left(g_{\lambda}\right)=0.030$
  • $E_{i n}\left(g_{\lambda}\right)=0.005, E_{o u t}\left(g_{\lambda}\right)=0.010$
  • $E_{i n}\left(g_{\lambda}\right)=0.015, E_{o u t}\left(g_{\lambda}\right)=0.020$
  • $E_{i n}\left(g_{\lambda}\right)=0.045, E_{o u t}\left(g_{\lambda}\right)=0.020$

Sol:

lambd = 1e-8
w = ridge_reg(X, y, lambd)
E_in = calc_err(X, y, w)
E_out = calc_err(X_test, y_test, w)

print(f"E_in: {E_in}\nE_out: {E_out}")

Output:

E_in: 0.015
E_out: 0.02
