Machine Learning Foundations Homework 4
Website of Machine Learning Foundations by Hsuan-Tien Lin: https://www.csie.ntu.edu.tw/~htlin/mooc/.
Question 1
Overfitting and Deterministic Noise
Deterministic noise depends on $\mathcal{H}$, as some models approximate $f$ better than others. Assume $\mathcal{H}' \subset \mathcal{H}$ and that $f$ is fixed. In general (but not necessarily in all cases), if we use $\mathcal{H}'$ instead of $\mathcal{H}$, how does deterministic noise behave?
- In general, deterministic noise will decrease.
- In general, deterministic noise will increase.
- In general, deterministic noise will be the same.
- If deterministic noise will increase, else it will decrease.
- If deterministic noise will decrease, else it will increase.
Sol: In general, deterministic noise will increase because our hypothesis set gets simpler, which makes it less likely to obtain a good $h^*$ (recall that deterministic noise is the difference between the best hypothesis $h^* \in \mathcal{H}$ and the target function $f$).
Question 2
Regularization With Polynomials
Polynomial models can be viewed as linear models in a space $\mathcal{Z}$, under a nonlinear transform $\Phi: \mathcal{X} \to \mathcal{Z}$. Here, $\Phi$ transforms the scalar $x$ into a vector $\mathbf{z}$ of Legendre polynomials, $\mathbf{z} = \left(1, L_1(x), L_2(x), \ldots, L_Q(x)\right)$. Our hypothesis set will be expressed as a linear combination of these polynomials,
$$\mathcal{H}_Q = \left\{ h \,\middle|\, h(x) = \mathbf{w}^T \mathbf{z} = \sum_{q=0}^{Q} w_q L_q(x) \right\},$$
where $L_0(x) = 1$.
Consider the following hypothesis set defined by the constraint:
$$\mathcal{H}(Q, c, Q_o) = \left\{ h \,\middle|\, h(x) = \mathbf{w}^T \mathbf{z} \in \mathcal{H}_Q;\ w_q = c \text{ for } q \geq Q_o \right\}.$$
Which of the following statements is correct?
- $\mathcal{H}(10,0,3) \cup \mathcal{H}(10,1,4)=\mathcal{H}_{3}$
- $\mathcal{H}(10,1,3) \cap \mathcal{H}(10,1,4)=\mathcal{H}_{1}$
- $\mathcal{H}(10,0,3) \cap \mathcal{H}(10,0,4)=\mathcal{H}_{2}$
- $\mathcal{H}(10,0,3) \cup \mathcal{H}(10,0,4)=\mathcal{H}_{4}$
- none of the other choices
Sol:
$\mathcal{H}(10,0,3)$: $w_q = 0$ for $q \geq 3$, so it contains polynomials of the form $w_0 + w_1 L_1(x) + w_2 L_2(x)$;
$\mathcal{H}(10,0,4)$: $w_q = 0$ for $q \geq 4$, so it contains polynomials of the form $w_0 + w_1 L_1(x) + w_2 L_2(x) + w_3 L_3(x)$.
Therefore, $\mathcal{H}(10,0,3) \cap \mathcal{H}(10,0,4) = \mathcal{H}(10,0,3) = \mathcal{H}_2$, so the third choice is the answer.
Question 3
Regularization and Weight Decay
Consider the augmented error
$$E_{\text{aug}}(\mathbf{w}) = E_{\text{in}}(\mathbf{w}) + \frac{\lambda}{N}\mathbf{w}^T\mathbf{w}$$
with some $\lambda > 0$.
If we want to minimize the augmented error by gradient descent with $\eta$ as the learning rate, which of the following is the correct update rule?
- $\mathbf{w}(t+1) \longleftarrow \mathbf{w}(t)-\eta \nabla E_{\mathrm{in}}(\mathbf{w}(t))$
- $\mathbf{w}(t+1) \longleftarrow\left(1-\frac{2 \eta \lambda}{N}\right) \mathbf{w}(t)-\eta \nabla E_{\mathrm{in}}(\mathbf{w}(t))$
- $\mathbf{w}(t+1) \longleftarrow\left(1+\frac{2 \eta \lambda}{N}\right) \mathbf{w}(t)-\eta \nabla E_{\mathrm{in}}(\mathbf{w}(t))$
- $\mathbf{w}(t+1) \longleftarrow\left(1+\frac{\eta \lambda}{N}\right) \mathbf{w}(t)-\eta \nabla E_{\mathrm{in}}(\mathbf{w}(t))$
- $\mathbf{w}(t+1) \longleftarrow\left(1-\frac{\eta \lambda}{N}\right) \mathbf{w}(t)-\eta \nabla E_{\mathrm{in}}(\mathbf{w}(t))$
Sol: Since
$$\nabla E_{\text{aug}}(\mathbf{w}) = \nabla E_{\text{in}}(\mathbf{w}) + \frac{2\lambda}{N}\mathbf{w},$$
we have the update rule
$$\mathbf{w}(t+1) \longleftarrow \mathbf{w}(t) - \eta \nabla E_{\text{aug}}(\mathbf{w}(t)) = \left(1 - \frac{2\eta\lambda}{N}\right)\mathbf{w}(t) - \eta \nabla E_{\text{in}}(\mathbf{w}(t)).$$
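As an illustration (not part of the original problem), here is a minimal numpy sketch of this update rule applied to squared-error $E_{\text{in}}$; the synthetic data, step size, and iteration count are assumptions for demonstration only:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                            # 20 synthetic examples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
N, lambd, eta = len(y), 0.1, 0.05

w = np.zeros(3)
for _ in range(1000):
    grad_in = 2 / N * X.T @ (X @ w - y)                 # gradient of squared-error E_in
    w = (1 - 2 * eta * lambd / N) * w - eta * grad_in   # weight-decay update rule

# For squared error the minimizer of E_aug has a closed form, so we can verify:
w_closed = np.linalg.solve(X.T @ X + lambd * np.eye(3), X.T @ y)
print(np.allclose(w, w_closed, atol=1e-6))              # True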
Question 4
Let $\mathbf{w}_{\text{lin}}$ be the optimal solution for the plain-vanilla linear regression and $\mathbf{w}_{\text{reg}}(\lambda)$ be the optimal solution for the regularized formulation above. Select all the correct statements below:
- $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\| > \left\|\mathbf{w}_{\text{lin}}\right\|$ for any $\lambda > 0$
- $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|$ is a non-decreasing function of $\lambda$ for $\lambda \geq 0$
- $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|$ is a non-increasing function of $\lambda$ for $\lambda \geq 0$
- $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\| = \left\|\mathbf{w}_{\text{lin}}\right\|$ for any $\lambda > 0$
- none of the other choices
Sol:
From the constrained minimization of $E_{\text{in}}$:
$$\min_{\mathbf{w}} E_{\text{in}}(\mathbf{w}) \quad \text{subject to} \quad \mathbf{w}^T\mathbf{w} \leq C,$$
which is equivalent to the unconstrained minimization of $E_{\text{aug}}$, we can deduce that if $\mathbf{w}_{\text{lin}}$ satisfies the constraint $\mathbf{w}_{\text{lin}}^T\mathbf{w}_{\text{lin}} \leq C$, then $\mathbf{w}_{\text{reg}} = \mathbf{w}_{\text{lin}}$; otherwise, $\mathbf{w}_{\text{reg}}^T\mathbf{w}_{\text{reg}} = C < \mathbf{w}_{\text{lin}}^T\mathbf{w}_{\text{lin}}$.
As for the monotonicity of $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|$, increasing $\lambda$ is equivalent to decreasing $C$, which restricts the growth of $\left\|\mathbf{w}_{\text{reg}}\right\|$; hence $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|$ is a non-increasing function of $\lambda$.
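A quick numerical illustration that $\left\|\mathbf{w}_{\text{reg}}(\lambda)\right\|$ never increases as $\lambda$ grows; the synthetic data below are an assumption for demonstration only:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)

prev_norm = np.inf
for lambd in [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]:
    # ridge solution w_reg(lambda) = (X^T X + lambda I)^{-1} X^T y
    w_reg = np.linalg.solve(X.T @ X + lambd * np.eye(5), X.T @ y)
    norm = np.linalg.norm(w_reg)
    assert norm <= prev_norm + 1e-12   # the norm never increases as lambda grows
    prev_norm = norm
    print(f"lambda={lambd:>6}: ||w_reg|| = {norm:.4f}")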
Question 5
Leave-One-Out Cross-Validation
You are given the data points $(-1, 0)$, $(\rho, 1)$, $(1, 0)$ with $\rho \geq 0$, and a choice between two models: constant $h_0(x) = b_0$ and linear $h_1(x) = a_1 x + b_1$. For which value of $\rho$ would the two models be tied using leave-one-out cross-validation with the squared error measure?
- $\sqrt{\sqrt{3}+4}$
- $\sqrt{\sqrt{3}-1}$
- $\sqrt{9+4 \sqrt{6}}$
- $\sqrt{9-\sqrt{6}}$
- none of the other choices
Sol:
For the first model, which is a constant model, the best $b_0$ we can choose over a two-point training set is the average of the two $y$ values. Leaving out each point in turn gives the errors
$$e_1 = \left(\tfrac{1}{2} - 0\right)^2 = \tfrac{1}{4}, \qquad e_2 = (0 - 1)^2 = 1, \qquad e_3 = \left(\tfrac{1}{2} - 0\right)^2 = \tfrac{1}{4},$$
so $E_{\text{loocv}}(h_0) = \frac{1}{3}\left(\frac{1}{4} + 1 + \frac{1}{4}\right) = \frac{1}{2}$.
As for the second model, the best line through two points is exactly the line joining them, and hence
$$e_1 = \left(\frac{-2}{\rho - 1}\right)^2, \qquad e_2 = 1, \qquad e_3 = \left(\frac{2}{\rho + 1}\right)^2,$$
so $E_{\text{loocv}}(h_1) = \frac{1}{3}\left(\frac{4}{(\rho - 1)^2} + 1 + \frac{4}{(\rho + 1)^2}\right)$.
Equating $E_{\text{loocv}}(h_0)$ and $E_{\text{loocv}}(h_1)$ gives $\frac{4}{(\rho-1)^2} + \frac{4}{(\rho+1)^2} = \frac{1}{2}$, whose positive solution is $\rho^2 = 9 + 4\sqrt{6}$, i.e., $\rho = \sqrt{9 + 4\sqrt{6}}$.
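A short numerical check of this value of $\rho$; the helper functions below are only for verification and are not part of the original solution:

import numpy as np

rho = np.sqrt(9 + 4 * np.sqrt(6))
pts = [(-1.0, 0.0), (rho, 1.0), (1.0, 0.0)]

def fit_const(train):
    c = np.mean([y for _, y in train])
    return lambda x: c

def fit_line(train):
    (x1, y1), (x2, y2) = train
    a = (y2 - y1) / (x2 - x1)
    return lambda x: a * (x - x1) + y1

def loocv(points, fit):
    errs = []
    for i, (x, y) in enumerate(points):
        h = fit([p for j, p in enumerate(points) if j != i])  # fit on the other two points
        errs.append((h(x) - y) ** 2)                          # squared error on the held-out point
    return np.mean(errs)

print(loocv(pts, fit_const), loocv(pts, fit_line))  # both evaluate to 0.5 at the tie value of rho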
Question 6
Learning Principles
In Problems 6-7, suppose that for 5 weeks in a row, a letter arrives in the mail that predicts the outcome of the upcoming Monday night baseball game. (Assume there are no ties.) You keenly watch each Monday and, to your surprise, the prediction is correct each time. On the day after the fifth game, a letter arrives, stating that if you wish to see next week’s prediction, a payment of NTD 1,000 is required. Which of the following statements is true?
- To make sure that at least one person receives correct predictions on all 5 games from him, at least 64 letters are sent before the fifth game.
- If the sender wants to make sure that at least one person receives correct predictions on all 5 games from him, the sender should target to begin with at least 5 people.
- To make sure that at least one person receives correct predictions on all 5 games from him, after the first letter ‘predicts’ the outcome of the first game, the sender should target at least 16 people with the second letter.
- none of the other choices.
Sol: If we just randomly guess the result, then the probability of getting the correct result is $\frac{1}{2}$. So the probability of making 5 consecutive correct ‘predictions’ by random guessing is $\left(\frac{1}{2}\right)^5 = \frac{1}{32}$.
The sender’s ‘strategy’ is as follows:
- Before the first game: mail out 32 letters, 16 predicting that A wins and the other 16 that B wins, so half of them will be correct.
- Before the second game: among the 16 people who received a correct first prediction, mail 8 of them letters predicting that A wins and the other 8 letters predicting that B wins.
- Repeat the same process until the fifth game.
After the fifth game, there will be exactly one person who has received correct predictions on all 5 games. We then mail that person the payment-request letter.
So the correct answer is the third statement.
Question 7
Suppose the cost of printing and mailing out each letter is NTD 10. If the sender sends the minimum number of letters out, how much money can be made for the above ‘fraud’ to succeed once? That is, one of the recipients does send him NTD 1,000 to receive the prediction of the 6-th game.
- NTD 400
- NTD 340
- NTD 460
- NTD 430
- NTD 370
Sol: The total number of letters the sender needs to send is
$$32 + 16 + 8 + 4 + 2 + 1 = 63,$$
where the last ‘1’ is the letter with the payment requirement attached. So the money that can be made is
$$1000 - 63 \times 10 = \text{NTD } 370.$$
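The same count as a quick check in Python:

letters = 32 + 16 + 8 + 4 + 2 + 1        # letters sent, including the final payment-request letter
print(letters, 1000 - 10 * letters)      # 63 letters; profit = 1000 - 630 = NTD 370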
Question 8
For Problems 8-10, we consider the following. In our credit card example, the bank starts with some vague idea of what constitutes a good credit risk. So, as customers arrive, the bank applies its vague idea to approve credit cards for some of these customers based on a formula $a_1(\mathbf{x})$. Then, only those who get credit cards are monitored to see if they default or not. For simplicity, suppose that the first $N = 10000$ customers were given credit cards by the credit approval function $a_1$. Now that the bank knows the behavior of these customers, it comes to you to improve its algorithm for approving credit. The bank gives you the data $(\mathbf{x}_n, y_n)$ for these $N$ customers. Before you look at the data, you do mathematical derivations and come up with a credit approval function $g$. You now test it on the data and, to your delight, obtain perfect prediction.
What is $M$, the size of your hypothesis set?
- $N$
- $1$
- $N^2$
- $2^N$
- We have no idea about it
Sol: Since we come up with one particular credit approval function $g$ before looking at the data, this is the only function we are considering, so the size of the hypothesis set is $M = 1$.
Question 9
With such an $M$, what does the Hoeffding bound say about the probability that the true average error rate of $g$ is worse than 1% for $N = 10000$?
Sol: By the Hoeffding inequality (with one fixed hypothesis),
$$\mathbb{P}\left[\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\right] \leq 2\exp\left(-2\epsilon^2 N\right),$$
where in this case $\epsilon = 0.01$ and $N = 10000$, so the bound on the probability is $2e^{-2} \approx 0.271$.
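A one-line check of the bound’s numeric value:

import numpy as np
print(2 * np.exp(-2 * 0.01 ** 2 * 10000))   # 2 * exp(-2) ≈ 0.2707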
Question 10
You assure the bank that you have got a system $g$ for approving credit cards for new customers, which is nearly error-free. Your confidence is given by your answer to the previous question. The bank is thrilled and uses your $g$ to approve credit for new customers. To their dismay, more than half their credit cards are being defaulted on. Assume that the customers that were sent to the old credit approval function $a_1$ and the customers that were sent to your $g$ are indeed i.i.d. from the same distribution, and the bank is lucky enough (so the ‘bad luck’ that “the true error of $g$ is worse than 1%” does not happen).
Select all the true claims for this situation.
- By applying $g$ to approve credit for new customers, the performance of the overall credit approval system can be improved with the guarantee provided by the previous problem.
- By applying $a_1$ to approve credit for new customers, the performance of the overall credit approval system can be improved with the guarantee provided by the previous problem.
- By applying ‘$a_1$ OR $g$’ to approve credit for new customers, the performance of the overall credit approval system can be improved with the guarantee provided by the previous problem.
- By applying ‘$a_1$ AND $g$’ to approve credit for new customers, the performance of the overall credit approval system can be improved with the guarantee provided by the previous problem.
Sol: The data we have are not clean, in the sense that they were contaminated by the function $a_1$: we only observed a label $y_n$ for customers that $a_1$ approved. Our prediction function $g$ is therefore reliable only on data that are indeed the result of the function $a_1$. Therefore, we should use the conjunction ‘$a_1$ AND $g$’ to approve credit.
Question 11
Virtual Examples and Regularization
Consider linear regression with virtual examples. That is, we add $K$ virtual examples $(\tilde{\mathbf{x}}_1, \tilde{y}_1), (\tilde{\mathbf{x}}_2, \tilde{y}_2), \ldots, (\tilde{\mathbf{x}}_K, \tilde{y}_K)$ to the training data set, and solve
$$\min_{\mathbf{w}} \frac{1}{N+K}\left(\sum_{n=1}^{N}\left(y_n - \mathbf{w}^T\mathbf{x}_n\right)^2 + \sum_{k=1}^{K}\left(\tilde{y}_k - \mathbf{w}^T\tilde{\mathbf{x}}_k\right)^2\right).$$
We will show that using some ‘special’ virtual examples, which were claimed to be a possible way to combat overfitting in Lecture 9, is related to regularization, another possible way to combat overfitting discussed in Lecture 10. Let $\tilde{\mathbf{X}} = \left[\tilde{\mathbf{x}}_1\ \tilde{\mathbf{x}}_2\ \cdots\ \tilde{\mathbf{x}}_K\right]^T$ and $\tilde{\mathbf{y}} = \left[\tilde{y}_1\ \tilde{y}_2\ \cdots\ \tilde{y}_K\right]^T$.
What is the optimal $\mathbf{w}$ to the optimization problem above, assuming that all the inversions exist?
- $\left(\mathbf{X}^{T} \mathbf{X}+\tilde{\mathbf{X}}^{T} \tilde{\mathbf{X}}\right)^{-1}\left(\mathbf{X}^{T} \mathbf{y}+\tilde{\mathbf{X}}^{T} \tilde{\mathbf{y}}\right)$
- $\left(\mathbf{X}^{T} \mathbf{X}+\tilde{\mathbf{X}}^{T} \tilde{\mathbf{X}}\right)^{-1}\left(\tilde{\mathbf{X}}^{T} \tilde{\mathbf{y}}\right)$
- $\left(\mathbf{X}^{T} \mathbf{X}\right)^{-1}\left(\tilde{\mathbf{X}}^{T} \tilde{\mathbf{y}}\right)$
- $\left(\mathbf{X}^{T} \mathbf{X}\right)^{-1}\left(\mathbf{X}^{T} \mathbf{y}+\tilde{\mathbf{X}}^{T} \tilde{\mathbf{y}}\right)$
- none of the other choices
Sol:
Take the gradient and set it to $\mathbf{0}$:
$$-2\left(\mathbf{X}^T\mathbf{y} + \tilde{\mathbf{X}}^T\tilde{\mathbf{y}}\right) + 2\left(\mathbf{X}^T\mathbf{X} + \tilde{\mathbf{X}}^T\tilde{\mathbf{X}}\right)\mathbf{w} = \mathbf{0} \implies \mathbf{w} = \left(\mathbf{X}^T\mathbf{X} + \tilde{\mathbf{X}}^T\tilde{\mathbf{X}}\right)^{-1}\left(\mathbf{X}^T\mathbf{y} + \tilde{\mathbf{X}}^T\tilde{\mathbf{y}}\right),$$
which is the first choice.
Question 12
For what $\tilde{\mathbf{X}}$ and $\tilde{\mathbf{y}}$ will the solution of this linear regression equal
$$\mathbf{w}_{\text{reg}} = \operatorname*{argmin}_{\mathbf{w}} \frac{\lambda}{N}\|\mathbf{w}\|^2 + \frac{1}{N}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2\,?$$
Sol: Take $\tilde{\mathbf{X}} = \sqrt{\lambda}\,\mathbf{I}$ and $\tilde{\mathbf{y}} = \mathbf{0}$, then
$$\mathbf{w} = \left(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}\right)^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{w}_{\text{reg}}.$$
Thus, in this case the two solutions are equal.
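A small numpy sketch verifying this equivalence; the synthetic data and dimensions below are assumptions for illustration only:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
lambd = 3.0

# Ridge (regularized) solution
w_reg = np.linalg.solve(X.T @ X + lambd * np.eye(4), X.T @ y)

# Plain least squares on the data augmented with the virtual examples
# X_tilde = sqrt(lambda) * I, y_tilde = 0
X_aug = np.vstack([X, np.sqrt(lambd) * np.eye(4)])
y_aug = np.concatenate([y, np.zeros(4)])
w_virtual, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(w_reg, w_virtual))  # True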
Question 13
Experiment with Regularized Linear Regression and Validation
Consider regularized linear regression (also called ridge regression) for classification:
$$\mathbf{w}_{\text{reg}} = \operatorname*{argmin}_{\mathbf{w}} \frac{\lambda}{N}\|\mathbf{w}\|^2 + \frac{1}{N}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2.$$
Run the algorithm on the following data set as $\mathcal{D}$: hw4_train.dat
and the following set for evaluating $E_{\text{out}}$: hw4_test.dat
Because the data sets are for classification, please consider only the 0/1 error for all the problems below.
Let $\lambda = 10$. Which of the following is the corresponding $E_{\text{in}}$ and $E_{\text{out}}$?
- $E_{\text{in}}=0.015, E_{\text{out}}=0.020$
- $E_{\text{in}}=0.030, E_{\text{out}}=0.015$
- $E_{\text{in}}=0.020, E_{\text{out}}=0.010$
- $E_{\text{in}}=0.035, E_{\text{out}}=0.020$
- $E_{\text{in}}=0.050, E_{\text{out}}=0.045$
Sol:
import pandas as pd
import numpy as np
from numpy.linalg import pinv

def load_data(filename):
    df = pd.read_csv(filename, header=None, sep=r'\s+')
    X_df, y_df = df.iloc[:, :-1], df.iloc[:, -1]
    X, y = X_df.to_numpy(), y_df.to_numpy()
    n, _ = X.shape
    X = np.c_[np.ones((n, 1)), X]  # prepend the bias column x_0 = 1
    return X, y

# Read in training and test data
X, y = load_data('hw4_train.dat')
X_test, y_test = load_data('hw4_test.dat')

def ridge_reg(X, y, lambd):
    """
    Args:
        X: ndarray of shape = [N, d + 1]
        y: ndarray of shape = [N, ]
        lambd: float
    Returns:
        w: ndarray of shape = [d + 1, ]
    """
    _, d = X.shape
    w = pinv(X.T @ X + lambd * np.eye(d)) @ X.T @ y
    return w

w_ridge = ridge_reg(X, y, lambd=10)
w_ridge
Output:
array([-0.93238149, 1.04618645, 1.046171 ])
def calc_err(X, y, w):
    """
    Args:
        X: ndarray of shape = [N, d + 1]
        y: ndarray of shape = [N, ]
        w: ndarray of shape = [d + 1, ]
    Returns:
        err: float
    """
    y_hat = np.sign(X @ w)     # classify by the sign of the linear score
    err = np.mean(y_hat != y)  # 0/1 error
    return err

E_in = calc_err(X, y, w_ridge)
E_out = calc_err(X_test, y_test, w_ridge)
print(f"E_in: {E_in}\nE_out: {E_out}")
Output:
E_in: 0.05
E_out: 0.045
Question 14
Among $\log_{10}\lambda = \{2, 1, 0, -1, \ldots, -8, -9, -10\}$, what is the $\lambda$ with the minimum $E_{\text{in}}$? Compute $\lambda$ and its corresponding $E_{\text{in}}$ and $E_{\text{out}}$, then select the closest answer. Break the tie by selecting the largest $\lambda$.
Sol:
lambd_vals = np.logspace(start=2, stop=-10, num=13)
E_in_list = []
E_out_list = []
for lambd in lambd_vals:
    w = ridge_reg(X, y, lambd)
    E_in = calc_err(X, y, w)
    E_out = calc_err(X_test, y_test, w)
    E_in_list.append(E_in)
    E_out_list.append(E_out)

for lambd, E_in, E_out in zip(lambd_vals, E_in_list, E_out_list):
    print(f"lambda={lambd:>6}, E_in={E_in:>6}, E_out={E_out:>6}")
Output:
lambda= 100.0, E_in= 0.24, E_out= 0.261
lambda= 10.0, E_in= 0.05, E_out= 0.045
lambda= 1.0, E_in= 0.035, E_out= 0.02
lambda= 0.1, E_in= 0.035, E_out= 0.016
lambda= 0.01, E_in= 0.03, E_out= 0.016
lambda= 0.001, E_in= 0.03, E_out= 0.016
lambda=0.0001, E_in= 0.03, E_out= 0.016
lambda= 1e-05, E_in= 0.03, E_out= 0.016
lambda= 1e-06, E_in= 0.035, E_out= 0.016
lambda= 1e-07, E_in= 0.03, E_out= 0.015
lambda= 1e-08, E_in= 0.015, E_out= 0.02
lambda= 1e-09, E_in= 0.015, E_out= 0.02
lambda= 1e-10, E_in= 0.015, E_out= 0.02
Thus, the answer is $\log_{10}\lambda = -8$, with $E_{\text{in}} = 0.015$ and $E_{\text{out}} = 0.020$ (the minimum $E_{\text{in}} = 0.015$ is shared by $\lambda = 10^{-8}, 10^{-9}, 10^{-10}$, and the tie is broken by the largest $\lambda$).
Question 15
Among $\log_{10}\lambda = \{2, 1, 0, -1, \ldots, -8, -9, -10\}$, what is the $\lambda$ with the minimum $E_{\text{out}}$? Compute $\lambda$ and its corresponding $E_{\text{in}}$ and $E_{\text{out}}$, then select the closest answer. Break the tie by selecting the largest $\lambda$.
Sol: From the table above, the minimum $E_{\text{out}} = 0.015$ occurs only at $\lambda = 10^{-7}$, so the answer is $\log_{10}\lambda = -7$, with $E_{\text{in}} = 0.030$ and $E_{\text{out}} = 0.015$.
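If one wants to automate the “break ties by the largest $\lambda$” rule for Questions 14-15, a small helper (an add-on, not part of the original code) can exploit the fact that lambd_vals is sorted from largest to smallest:

def best_lambda(lambd_vals, errors):
    # np.argmin returns the first index of the minimum; since lambd_vals is
    # sorted from largest to smallest, the first minimum is the largest lambda.
    idx = int(np.argmin(errors))
    return lambd_vals[idx]

print(best_lambda(lambd_vals, E_in_list))   # 1e-08  (Question 14)
print(best_lambda(lambd_vals, E_out_list))  # 1e-07  (Question 15)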
Question 16
Now split the given training examples in $\mathcal{D}$ into the first 120 examples as $\mathcal{D}_{\text{train}}$ and the last 80 examples as $\mathcal{D}_{\text{val}}$.
Ideally, you should randomly do the 120/80 split. Because the given examples are already randomly permuted, however, we will use this fixed split for the purpose of this problem.
Run the algorithm on $\mathcal{D}_{\text{train}}$ to get $\mathbf{w}_{\lambda}^{-}$, and validate $\mathbf{w}_{\lambda}^{-}$ with $\mathcal{D}_{\text{val}}$.
Among $\log_{10}\lambda = \{2, 1, 0, -1, \ldots, -8, -9, -10\}$, what is the $\lambda$ with the minimum $E_{\text{train}}(\mathbf{w}_{\lambda}^{-})$? Compute $\lambda$ and the corresponding $E_{\text{train}}(\mathbf{w}_{\lambda}^{-})$, $E_{\text{val}}(\mathbf{w}_{\lambda}^{-})$ and $E_{\text{out}}(\mathbf{w}_{\lambda}^{-})$, and then select the closest answer. Break the tie by selecting the largest $\lambda$.
Sol:
# Split train/val = 120/80
X_train, X_val = X[:120], X[120:]
y_train, y_val = y[:120], y[120:]

E_train_list = []
E_val_list = []
E_out_list = []
for lambd in lambd_vals:
    w = ridge_reg(X_train, y_train, lambd)
    E_train = calc_err(X_train, y_train, w)
    E_val = calc_err(X_val, y_val, w)
    E_out = calc_err(X_test, y_test, w)
    E_train_list.append(E_train)
    E_val_list.append(E_val)
    E_out_list.append(E_out)

for lambd, E_train, E_val, E_out in zip(lambd_vals, E_train_list, E_val_list, E_out_list):
    print(f"lambda={lambd:>6}, E_train={E_train:>6f}, E_val={E_val:>6f}, E_out={E_out:>6f}")
Output:
lambda= 100.0, E_train=0.341667, E_val=0.412500, E_out=0.414000
lambda= 10.0, E_train=0.075000, E_val=0.125000, E_out=0.080000
lambda= 1.0, E_train=0.033333, E_val=0.037500, E_out=0.028000
lambda= 0.1, E_train=0.033333, E_val=0.037500, E_out=0.022000
lambda= 0.01, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda= 0.001, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda=0.0001, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda= 1e-05, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda= 1e-06, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda= 1e-07, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda= 1e-08, E_train=0.000000, E_val=0.050000, E_out=0.025000
lambda= 1e-09, E_train=0.000000, E_val=0.100000, E_out=0.038000
lambda= 1e-10, E_train=0.008333, E_val=0.125000, E_out=0.040000
So the answer is $\log_{10}\lambda = -8$: the minimum $E_{\text{train}} = 0$ occurs at $\lambda = 10^{-8}$ and $10^{-9}$, and the largest of these is $\lambda = 10^{-8}$, with $E_{\text{train}} = 0.000$, $E_{\text{val}} = 0.050$, and $E_{\text{out}} = 0.025$.
Question 17
Among $\log_{10}\lambda = \{2, 1, 0, -1, \ldots, -8, -9, -10\}$, what is the $\lambda$ with the minimum $E_{\text{val}}(\mathbf{w}_{\lambda}^{-})$? Compute $\lambda$ and the corresponding $E_{\text{train}}(\mathbf{w}_{\lambda}^{-})$, $E_{\text{val}}(\mathbf{w}_{\lambda}^{-})$ and $E_{\text{out}}(\mathbf{w}_{\lambda}^{-})$, and then select the closest answer. Break the tie by selecting the largest $\lambda$.
Sol: The minimum $E_{\text{val}} = 0.0375$ is shared by $\lambda = 1$ down to $\lambda = 10^{-7}$, and the largest of these is $\lambda = 1$, so the answer is $\log_{10}\lambda = 0$, with $E_{\text{train}} \approx 0.033$, $E_{\text{val}} = 0.0375$, and $E_{\text{out}} = 0.028$.
Question 18
Run the algorithm with the optimal $\lambda$ of the previous problem on the whole $\mathcal{D}$ to get $g_{\lambda}$. Compute $E_{\text{in}}(g_{\lambda})$ and $E_{\text{out}}(g_{\lambda})$, and then select the closest answer.
- $E_{i n}\left(g_{\lambda}\right)=0.035, E_{o u t}\left(g_{\lambda}\right)=0.020$
- $E_{i n}\left(g_{\lambda}\right)=0.055, E_{o u t}\left(g_{\lambda}\right)=0.020$
- $E_{i n}\left(g_{\lambda}\right)=0.015, E_{o u t}\left(g_{\lambda}\right)=0.020$
- $E_{i n}\left(g_{\lambda}\right)=0.045, E_{o u t}\left(g_{\lambda}\right)=0.030$
- $E_{i n}\left(g_{\lambda}\right)=0.025, E_{o u t}\left(g_{\lambda}\right)=0.030$
Sol:
lambd = 1  # the optimal lambda from Question 17
w = ridge_reg(X, y, lambd)
E_in = calc_err(X, y, w)
E_out = calc_err(X_test, y_test, w)
print(f"E_in: {E_in}\nE_out: {E_out}")
Output:
E_in: 0.035
E_out: 0.02
Question 19
Now split the given training examples in $\mathcal{D}$ into five folds: the first 40 being fold 1, the next 40 being fold 2, and so on. Again, we take a fixed split because the given examples are already randomly permuted.
Among $\log_{10}\lambda = \{2, 1, 0, -1, \ldots, -8, -9, -10\}$, what is the $\lambda$ with the minimum $E_{cv}$, where $E_{cv}$ comes from the five folds defined above? Compute $\lambda$ and the corresponding $E_{cv}$, then select the closest answer. Break the tie by selecting the largest $\lambda$.
Sol:
num_of_folds = 5
E_cv_list = []
for lambd in lambd_vals:
    sum_of_cv_error = 0
    for k in range(num_of_folds):
        k_th_val_fold = slice(k * 40, (k + 1) * 40)
        k_th_train_fold = np.r_[slice(0, k * 40), slice((k + 1) * 40, 200)]
        X_val, y_val = X[k_th_val_fold], y[k_th_val_fold]
        X_train, y_train = X[k_th_train_fold], y[k_th_train_fold]
        w = ridge_reg(X_train, y_train, lambd)
        sum_of_cv_error += calc_err(X_val, y_val, w)
    E_cv = sum_of_cv_error / num_of_folds
    E_cv_list.append(E_cv)

for lambd, E_cv in zip(lambd_vals, E_cv_list):
    print(f"lambda={lambd:>6}, E_cv={E_cv:>6f}")
Output:
lambda= 100.0, E_cv=0.290000
lambda= 10.0, E_cv=0.060000
lambda= 1.0, E_cv=0.035000
lambda= 0.1, E_cv=0.035000
lambda= 0.01, E_cv=0.035000
lambda= 0.001, E_cv=0.035000
lambda=0.0001, E_cv=0.035000
lambda= 1e-05, E_cv=0.035000
lambda= 1e-06, E_cv=0.035000
lambda= 1e-07, E_cv=0.035000
lambda= 1e-08, E_cv=0.030000
lambda= 1e-09, E_cv=0.050000
lambda= 1e-10, E_cv=0.050000
So the answer is $\log_{10}\lambda = -8$, with $E_{cv} = 0.030$.
Question 20
Run the algorithm with the optimal $\lambda$ of the previous problem on the whole $\mathcal{D}$ to get $g_{\lambda}$. Compute $E_{\text{in}}(g_{\lambda})$ and $E_{\text{out}}(g_{\lambda})$, and then select the closest answer.
- $E_{i n}\left(g_{\lambda}\right)=0.025, E_{o u t}\left(g_{\lambda}\right)=0.020$
- $E_{i n}\left(g_{\lambda}\right)=0.035, E_{o u t}\left(g_{\lambda}\right)=0.030$
- $E_{i n}\left(g_{\lambda}\right)=0.005, E_{o u t}\left(g_{\lambda}\right)=0.010$
- $E_{i n}\left(g_{\lambda}\right)=0.015, E_{o u t}\left(g_{\lambda}\right)=0.020$
- $E_{i n}\left(g_{\lambda}\right)=0.045, E_{o u t}\left(g_{\lambda}\right)=0.020$
Sol:
lambd = 1e-8  # the optimal lambda from Question 19
w = ridge_reg(X, y, lambd)
E_in = calc_err(X, y, w)
E_out = calc_err(X_test, y_test, w)
print(f"E_in: {E_in}\nE_out: {E_out}")
Output:
E_in: 0.015
E_out: 0.02