Website of Machine Learning Foundations by Hsuan-Tien Lin: https://www.csie.ntu.edu.tw/~htlin/mooc/.

Question 1

Questions 1-2 are about linear regression.

Consider a noisy target $y = \mathbf{w}_f^T\mathbf{x} + \epsilon$, where $\mathbf{x} \in \mathbb{R}^{d+1}$ (with the added coordinate $x_0 = 1$), $\mathbf{w}_f \in \mathbb{R}^{d+1}$ is an unknown vector, and $\epsilon$ is a noise term with zero mean and variance $\sigma^2$. Assume $\epsilon$ is independent of $\mathbf{x}$ and of all other $\epsilon$'s. If linear regression is carried out using a training data set $\mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$, and outputs the parameter vector $\mathbf{w}_{\text{lin}}$, it can be shown that the expected in-sample error $E_{\text{in}}$ with respect to $\mathcal{D}$ is given by:

$$\mathbb{E}_{\mathcal{D}}\big[E_{\text{in}}(\mathbf{w}_{\text{lin}})\big] = \sigma^2\left(1 - \frac{d+1}{N}\right)$$

For $\sigma = 0.1$ and $d = 8$, which among the following choices is the smallest number of examples $N$ that will result in an expected $E_{\text{in}}$ greater than 0.008?

  • 500
  • 25
  • 100
  • 1000
  • 10

Sol:

sigma = 0.1
d = 8

expected_E_in = lambda N: sigma**2 * (1 - (d + 1) / N)
for N in [500, 25, 100, 1000, 10]:
    print(f"{N:>4}: {expected_E_in(N)}")

Output:

 500: 0.009820000000000002
  25: 0.006400000000000001
 100: 0.009100000000000002
1000: 0.009910000000000002
  10: 0.001

So the answer is 100, the smallest of the choices whose expected $E_{\text{in}}$ exceeds 0.008.

Question 2

Recall that we have introduced the hat matrix $H = X(X^TX)^{-1}X^T$ in class, where $X \in \mathbb{R}^{N \times (d+1)}$ for $N$ examples and $d+1$ features. Assume $X^TX$ is invertible. Which statements of $H$ are true?

  • (a) $H$ is always positive semi-definite.
  • (b) $H$ is always invertible.
  • (c) some eigenvalues of $H$ are possibly bigger than 1.
  • (d) $d+1$ eigenvalues of $H$ are exactly 1.
  • (e) $HH = H$.

Sol:

  • (a): true, since the eigenvalues of $H$ are all non-negative (they are either 0 or 1).
  • (b): false, since the rank of $H$ is $d+1$, which may be less than $N$.
  • (c): false, for the same reason: every eigenvalue of $H$ is either 0 or 1.
  • (d): true.
  • (e): true, since $H$ is an orthogonal projection matrix; projecting twice has no additional effect (see the numerical check below).
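These claims can be verified numerically on a random design matrix; the matrix size and the random seed below are arbitrary choices for illustration, not part of the problem.

import numpy as np

np.random.seed(0)
N, d = 20, 4
X = np.c_[np.ones((N, 1)), np.random.randn(N, d)]  # N x (d+1) design matrix
H = X @ np.linalg.inv(X.T @ X) @ X.T                # hat matrix

print(np.round(np.linalg.eigvalsh(H), 6))  # d+1 eigenvalues are 1, the remaining N-d-1 are 0
print(np.allclose(H @ H, H))               # True: H is idempotent (an orthogonal projection)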

Question 3

Questions 3-5 are about error and SGD.

Which of the following is an upper bound of $[\![\mathrm{sign}(\mathbf{w}^T\mathbf{x}) \neq y]\!]$ for $y \in \{-1, +1\}$?

Sol:

If we plot the candidate error functions against $y\mathbf{w}^T\mathbf{x}$ together with the 0/1 error, we can see that only the second error function upper-bounds the 0/1 error everywhere.
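This can also be checked numerically. The sketch below assumes the option in question is the squared hinge $(\max(0, 1 - y\mathbf{w}^T\mathbf{x}))^2$ and compares it against the 0/1 error over a range of scores $s = y\mathbf{w}^T\mathbf{x}$:

import numpy as np

s = np.linspace(-3, 3, 601)            # s = y * w^T x
zero_one = (s <= 0).astype(float)      # 0/1 error: sign(w^T x) != y
sq_hinge = np.maximum(0, 1 - s) ** 2   # squared-hinge candidate

print(np.all(sq_hinge >= zero_one))    # True: the squared hinge upper-bounds the 0/1 error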

Question 4

Which of the following is not an everywhere-differentiable function of $\mathbf{w}$?

Sol:

The error function $\mathrm{err}(\mathbf{w}) = \max(0, -y\mathbf{w}^T\mathbf{x})$ is not differentiable at points where $y\mathbf{w}^T\mathbf{x} = 0$.

Question 5

When using SGD on the following error functions and 'ignoring' some singular points that are not differentiable, which of the following error functions results in PLA?

Sol: The answer is $\mathrm{err}(\mathbf{w}) = \max(0, -y\mathbf{w}^T\mathbf{x})$.

At points where $y\mathbf{w}^T\mathbf{x} \neq 0$, this error function is differentiable.

If $y\mathbf{w}^T\mathbf{x} > 0$, then $\mathrm{err}(\mathbf{w}) = 0$ and $\nabla\,\mathrm{err}(\mathbf{w}) = \mathbf{0}$.

If $y\mathbf{w}^T\mathbf{x} < 0$, then $\mathrm{err}(\mathbf{w}) = -y\mathbf{w}^T\mathbf{x}$ and $\nabla\,\mathrm{err}(\mathbf{w}) = -y\mathbf{x}$.

We can combine the two cases together as

$$\nabla\,\mathrm{err}(\mathbf{w}) = -[\![\mathrm{sign}(\mathbf{w}^T\mathbf{x}) \neq y]\!]\,y\mathbf{x}.$$

In PLA, we update only at a misclassified point, i.e., when $\mathrm{sign}(\mathbf{w}^T\mathbf{x}) \neq y$, and the update is exactly $+y\mathbf{x}$, which is the opposite of the gradient above. So one SGD step on this error function (with learning rate 1) is exactly one PLA update.
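As a quick sanity check, the sketch below compares one SGD step on this error function (with $\eta = 1$) to one PLA update at a misclassified point; the weights and the example point are made up for illustration.

import numpy as np

w = np.array([0.0, -1.0, 0.5])
x = np.array([1.0, 2.0, 1.0])    # includes the x0 = 1 coordinate
y = 1.0                          # misclassified here: sign(w^T x) = -1 != y

# gradient of max(0, -y w^T x): -y x at a misclassified point, 0 otherwise
grad = -y * x if y * (w @ x) < 0 else np.zeros_like(x)

sgd_update = w - 1.0 * grad      # SGD step with learning rate 1
pla_update = w + y * x           # PLA update
print(np.allclose(sgd_update, pla_update))   # True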

Question 6

For Questions 6-10, you will play with the gradient descent algorithm and its variants. Consider the function

$$E(u, v) = e^u + e^{2v} + e^{uv} + u^2 - 2uv + 2v^2 - 3u - 2v.$$

What is the gradient $\nabla E(u, v)$ around $(u, v) = (0, 0)$?

Sol:

The gradient of $E$ is

$$\nabla E(u, v) = \begin{pmatrix} e^u + v e^{uv} + 2u - 2v - 3 \\ 2e^{2v} + u e^{uv} - 2u + 4v - 2 \end{pmatrix}.$$

So $\nabla E(0, 0) = (-2, 0)^T$.
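A finite-difference check of this gradient at $(0, 0)$ (a verification sketch only; it redefines $E$ so that it runs standalone):

import numpy as np

E = lambda u, v: np.exp(u) + np.exp(2 * v) + np.exp(u * v) + \
                 u ** 2 - 2 * u * v + 2 * v ** 2 - 3 * u - 2 * v

eps = 1e-6
du = (E(eps, 0) - E(-eps, 0)) / (2 * eps)  # numerical dE/du at (0, 0)
dv = (E(0, eps) - E(0, -eps)) / (2 * eps)  # numerical dE/dv at (0, 0)
print(round(du, 6), round(dv, 6))           # approximately -2.0 and 0.0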

Question 7

In class, we have taught that the update rule of the gradient descent algorithm is

$$(u_{t+1}, v_{t+1}) = (u_t, v_t) - \eta \nabla E(u_t, v_t).$$

Please start from $(u_0, v_0) = (0, 0)$ and fix $\eta = 0.01$. What is $E(u_5, v_5)$ after five updates?

Sol:

import numpy as np

E = lambda u, v: np.exp(u) + np.exp(2 * v) + np.exp(u * v) + \
                    u ** 2 - 2 * u * v + 2 * v ** 2 - 3 * u - 2 * v

grad_E = lambda u, v: np.array([np.exp(u) + v * np.exp(u * v) + 2 * u - 2 * v - 3,
                                2 * np.exp(2 * v) + u * np.exp(u * v) - 2 * u + 4 * v - 2])
eta = 0.01
ut, vt = 0, 0

for i in range(5):
    ut, vt = (ut, vt) - eta * grad_E(ut, vt)

# After 5 updates    
E(ut, vt) 

Output:

2.8250003566832635

So $E(u_5, v_5)$ is about 2.825.

Question 8

Continuing from Question 7, we approximate $E(u + \Delta u, v + \Delta v)$ by $\hat{E}_2(\Delta u, \Delta v)$, where $\hat{E}_2$ is the second-order Taylor expansion of $E$ around $(u, v)$. Suppose

$$\hat{E}_2(\Delta u, \Delta v) = b_{uu}(\Delta u)^2 + b_{vv}(\Delta v)^2 + b_{uv}(\Delta u)(\Delta v) + b_u\,\Delta u + b_v\,\Delta v + b.$$

What are the values of $(b_{uu}, b_{vv}, b_{uv}, b_u, b_v, b)$ around $(u, v) = (0, 0)$?

Sol: The Hessian matrix of $E$ is

$$\nabla^2 E(u, v) = \begin{bmatrix} e^u + v^2 e^{uv} + 2 & e^{uv} + uv\,e^{uv} - 2 \\ e^{uv} + uv\,e^{uv} - 2 & 4e^{2v} + u^2 e^{uv} + 4 \end{bmatrix}.$$

Then

$$\nabla^2 E(0, 0) = \begin{bmatrix} 3 & -1 \\ -1 & 8 \end{bmatrix}.$$

If we denote $\Delta\mathbf{w} = (\Delta u, \Delta v)^T$, then the second-order Taylor expansion of $E$ around $(u, v)$ can also be written as

$$\hat{E}_2(\Delta u, \Delta v) = E(u, v) + \nabla E(u, v)^T \Delta\mathbf{w} + \frac{1}{2}\,\Delta\mathbf{w}^T\,\nabla^2 E(u, v)\,\Delta\mathbf{w}.$$

So $b_{uu} = \frac{1}{2}\frac{\partial^2 E}{\partial u^2}(0,0) = 1.5$, $b_{vv} = \frac{1}{2}\frac{\partial^2 E}{\partial v^2}(0,0) = 4$, and $b_{uv} = \frac{\partial^2 E}{\partial u\,\partial v}(0,0) = -1$. By Question 6, $(b_u, b_v) = \nabla E(0, 0) = (-2, 0)$, and $b = E(0, 0) = 3$.
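A quick numeric sanity check of these coefficients: near $(0, 0)$ the quadratic model should match $E$ up to third-order terms (the test offsets below are arbitrary).

import numpy as np

E = lambda u, v: np.exp(u) + np.exp(2 * v) + np.exp(u * v) + \
                 u ** 2 - 2 * u * v + 2 * v ** 2 - 3 * u - 2 * v
E2 = lambda du, dv: 1.5 * du**2 + 4 * dv**2 - du * dv - 2 * du + 0 * dv + 3  # coefficients derived above

for du, dv in [(0.01, 0.0), (0.0, 0.01), (0.01, -0.01)]:
    print(abs(E(du, dv) - E2(du, dv)))   # all differences are on the order of 1e-6 or smaller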

Question 9

Continue from Question 8, denote the Hessian matrix by $\nabla^2 E(u, v)$, and assume that the Hessian matrix is positive definite. What is the optimal $(\Delta u, \Delta v)$ that minimizes $\hat{E}_2(\Delta u, \Delta v)$? The direction is called the Newton direction.

Sol:

The second-order Taylor expansion of $E$ around $(u, v)$ can be written as

$$\hat{E}_2(\Delta\mathbf{w}) = E(u, v) + \nabla E(u, v)^T \Delta\mathbf{w} + \frac{1}{2}\,\Delta\mathbf{w}^T\,\nabla^2 E(u, v)\,\Delta\mathbf{w}, \qquad \Delta\mathbf{w} = (\Delta u, \Delta v)^T.$$

Taking the derivative with respect to $\Delta u$ and $\Delta v$ (i.e., with respect to $\Delta\mathbf{w}$) gives

$$\nabla_{\Delta\mathbf{w}}\hat{E}_2 = \nabla E(u, v) + \nabla^2 E(u, v)\,\Delta\mathbf{w}.$$

Setting this to zero (which gives the minimizer because the Hessian is positive definite), we obtain the Newton direction

$$\Delta\mathbf{w} = -\left(\nabla^2 E(u, v)\right)^{-1}\nabla E(u, v).$$

Question 10

Using the Newton direction (without $\eta$) to update, please start from $(u_0, v_0) = (0, 0)$. What is $E(u_5, v_5)$ after five updates?

from numpy.linalg import inv

# Define Hessian matrix
H_E = lambda u, v: np.array([[np.exp(u) + v ** 2 * np.exp(u * v) + 2, np.exp(u * v) + u * v * np.exp(u * v) - 2],
                             [np.exp(u * v) + u * v * np.exp(u * v) - 2, 4 * np.exp(2 * v) + u ** 2 * np.exp(u * v) + 4]])

ut, vt = 0, 0

for i in range(5):
    newton_dir = -inv(H_E(ut, vt)) @ grad_E(ut, vt)
    ut, vt = (ut, vt) + newton_dir

# After 5 updates    
E(ut, vt)  

Output:

2.360823345643139

As we can see, the value of $E$ drops more quickly with Newton's method than with gradient descent in this example, but the gain comes at the cost of computing (and inverting) the Hessian matrix.

Question 11

For Questions 11-12, you will play with feature transforms.

Consider six inputs $\mathbf{x}_1, \ldots, \mathbf{x}_6 \in \mathbb{R}^2$. What is the biggest subset of those input vectors that can be shattered by the union of quadratic, linear, or constant hypotheses of $\mathbf{x}$?

Sol:

Consider the general quadratic transform of $\mathbf{x} = (x_1, x_2)$ given by

$$\Phi_2(\mathbf{x}) = (1, x_1, x_2, x_1^2, x_1x_2, x_2^2).$$

In our case, $\Phi_2$ takes the inputs $\mathbf{x}_1, \ldots, \mathbf{x}_6$ to $\mathbf{z}_1, \ldots, \mathbf{z}_6$; stacking these transformed vectors row by row gives a $6 \times 6$ matrix $Z$.

This matrix has full rank, which means that for any $\mathbf{y} = (y_1, \ldots, y_6)^T \in \{-1, +1\}^6$, the equation

$$Z\tilde{\mathbf{w}} = \mathbf{y}$$

has a solution $\tilde{\mathbf{w}}$, and that solution classifies every $\mathbf{z}_n$ with label $y_n$.

Thus, these six points can be shattered by lines in the transformed space. However, lines in the transformed space correspond exactly to quadratic curves in the original space, and hence all six input vectors can be shattered by the union of quadratic, linear, or constant hypotheses of $\mathbf{x}$. A rank check of this kind is sketched below.
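The shattering argument boils down to a rank computation. The six points below are illustrative stand-ins (not necessarily the ones in the original problem statement); for any six points whose quadratic transform has rank 6, the same conclusion follows.

import numpy as np

# six illustrative 2-D inputs (stand-ins for the x_1, ..., x_6 of the problem)
X = np.array([[1, 1], [1, -1], [-1, -1], [-1, 1], [0, 0], [1, 0]], dtype=float)

# general quadratic transform Phi_2(x) = (1, x1, x2, x1^2, x1*x2, x2^2)
Z = np.column_stack([np.ones(len(X)), X[:, 0], X[:, 1],
                     X[:, 0] ** 2, X[:, 0] * X[:, 1], X[:, 1] ** 2])

print(np.linalg.matrix_rank(Z))  # 6: every labeling y in {-1,+1}^6 admits a solution of Z w = y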

Question 12

Assume that a transformer peeks the data and decides the following transform $\Phi$ "intelligently" from the data of size $N$. The transform maps $\mathbf{x}$ to $\mathbf{z} = \Phi(\mathbf{x}) \in \mathbb{R}^N$, where

$$z_n = (\Phi(\mathbf{x}))_n = [\![\mathbf{x} = \mathbf{x}_n]\!].$$

Consider a learning algorithm that performs linear classification after the feature transform. That is, the algorithm effectively works on an $\mathcal{H}_\Phi$ that includes all possible $\Phi$. What is $d_{vc}(\mathcal{H}_\Phi)$ (i.e., the maximum number of points that can be shattered by the process above)?

Sol:

If we apply $\Phi$ to the training data $\mathbf{x}_1, \ldots, \mathbf{x}_N$, then

$$\Phi(\mathbf{x}_n) = \mathbf{e}_n,$$

where $\mathbf{e}_n$ is the $n$-th standard basis vector in $\mathbb{R}^N$.

If we put these vectors in a matrix $Z$ row by row, then $Z$ is exactly the identity matrix $I_N$. So, by the same argument as in Question 11, the training data can be shattered by $\mathcal{H}_\Phi$. This is true for all $N$, so the VC dimension of this hypothesis set is $\infty$.
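A tiny sketch of the peeking transform on made-up data (three arbitrary, distinct training points), showing that the transformed design matrix is the identity:

import numpy as np

X = np.array([[0.3, -1.2], [0.3, 0.7], [2.0, 0.1]])   # arbitrary distinct training points, N = 3

def peek_transform(x, X_train):
    # (Phi(x))_n = [[x == x_n]]
    return np.array([float(np.array_equal(x, xn)) for xn in X_train])

Z = np.array([peek_transform(xn, X) for xn in X])
print(Z)   # the identity matrix I_N: each training point maps to a distinct standard basis vector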

Question 13

For Questions 13-15, you will play with linear regression and feature transforms. Consider the target function

$$f(x_1, x_2) = \mathrm{sign}(x_1^2 + x_2^2 - 0.6).$$

Generate a training set of $N = 1000$ points on $\mathcal{X} = [-1, 1] \times [-1, 1]$ with uniform probability of picking each $\mathbf{x} \in \mathcal{X}$. Generate simulated noise by flipping the sign of the output in a random 10% subset of the generated training set.

Carry out linear regression without transformation, i.e., with feature vector $(1, x_1, x_2)$, to find the weight $\mathbf{w}_{\text{lin}}$, and use $\mathrm{sign}(\mathbf{w}_{\text{lin}}^T\mathbf{x})$ directly for classification. What is the closest value to the classification (0/1) in-sample error ($E_{\text{in}}$)? Run the experiment 1000 times and take the average in order to reduce variation in your results.

import random
from matplotlib import pyplot as plt
from numpy.linalg import pinv
# Target function
# def f(x1, x2):
#     def sign(x):
#         return 1 if x > 0 else -1
    
#     return sign(x1 ** 2 + x2 ** 2 - 0.6)

def generate_data(N):
    """
    Args:
        N: int
    Returns:
        X: ndarray of shape = [N, d]
        y: ndarray of shape = [N, ]
    """
    X = np.random.uniform(low=-1.0, high=1.0, size=(N, 2))
    y = np.zeros(N)
    
    y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 0.6)
#     # Below does not take advantage of the numpy vectorized operation and is slow
#     i = 0
#     for x in X:
#         y[i] = f(*x)
#         i += 1
    
    # Add noise to y
    p = 0.1
    offset = int(N * p)
    random_index = np.random.permutation(N)
    
    noisy_y_index = random_index[:offset]
    y[noisy_y_index] *= -1
    
    return X, y
N = 1000
X, y = generate_data(N)
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
ax.plot(X[y==1][:, 0], X[y==1][:, 1], color='blue', marker='o', linestyle='None', label='+1')
ax.plot(X[y==-1][:, 0], X[y==-1][:, 1], color='red', marker='x', linestyle='None', label='-1'); 
[Figure: scatter plot of the generated training data, with +1 points as blue circles and -1 points as red crosses]
def lin(X, y):
    """
    Args:
        X: ndarray of shape = [N, d]
        y: ndarray of shape = [N, ]    
    Returns:
        w: ndarray of shape = [d, ]
    """
    N = X.shape[0]
    X = np.c_[np.ones((N, 1)), X]
    w = pinv(X.T @ X) @ X.T @ y
    return w
  
def calc_err(X, y, w):
    """
    Args:
        X: ndarray of shape = [N, d]
        y: ndarray of shape = [N, ] 
        w: ndarray of shape = [d, ]
    Returns:
        err: float
    """
    N = X.shape[0]
    X = np.c_[np.ones((N, 1)), X]
    
    y_hat = np.sign(X @ w)
    err = np.mean(y_hat != y)
    return err  
w_lin = lin(X, y)
calc_err(X, y, w_lin)

Output:

0.467
def simulation_lin_no_trans(N):
    err_list = []
    for _ in range(1000):    
        X, y = generate_data(N)    
        w_lin = lin(X, y)
        err = calc_err(X, y, w_lin)
        err_list.append(err)
    avg_err = np.mean(err_list)
    return avg_err
  
simulation_lin_no_trans(N=1000)  

Output:

0.508207

Question 14

Now, transform the training data into the following nonlinear feature vector:

$$(1, x_1, x_2, x_1x_2, x_1^2, x_2^2)$$

Find the vector $\tilde{\mathbf{w}}$ that corresponds to the solution of linear regression on the transformed data, and use it for classification.

Which of the following hypotheses is closest to the one you find using linear regression on the transformed input? Closest here means the one that agrees the most with your hypothesis (has the highest probability of agreeing on a randomly selected point).

# def quadratic(x1, x2):
#     return np.array([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

def transform_X(X):
    """
    Args:
        X: ndarray of shape = [N, d]
    Returns:
        Z: ndarray of shape = [N, d]
    """
    N = X.shape[0]
    d = 5
    Z = np.zeros((N, d))
    Z[:, :2] = X[:]
    Z[:, 2] = X[:, 0] * X[:, 1]
    Z[:, 3] = X[:, 0] ** 2
    Z[:, 4] = X[:, 1] ** 2
    
    # Below does not take advantage of the numpy vectorized operation and is slow
#     i = 0
#     for x in X:
#         Z[i] = quadratic(*x)
#         i += 1    
    return Z
X, y = generate_data(N)  
Z = transform_X(X)
w_lin = lin(Z, y)
w_lin

Output:

array([-0.9914943 , -0.07906799,  0.1171203 , -0.02982396,  1.53896951, 1.62686342])

So the closest hypothesis among the given choices is

$$g(x_1, x_2) = \mathrm{sign}(-1 - 0.05x_1 + 0.08x_2 + 0.13x_1x_2 + 1.5x_1^2 + 1.5x_2^2),$$

whose coefficients are closest to the fitted weights above.

We can also run the simulation with the transformed data to see that the classification error drops dramatically:

def simulation_lin_with_trans(N):
    err_list = []
    for _ in range(1000):    
        X, y = generate_data(N)  
        Z = transform_X(X)
        w_lin = lin(Z, y)
        err = calc_err(Z, y, w_lin)
        err_list.append(err)
    avg_err = np.mean(err_list)
    return avg_err
    
simulation_lin_with_trans(N=1000)    

Output:

0.124313

Question 15

What is the closest value to the classification out-of-sample error of your hypothesis? Estimate it by generating a new set of 1000 points and adding noise as before. Average over 1000 runs to reduce the variation in your results.

def test_lin(N):
    err_list = []
    for _ in range(1000):    
        X, y = generate_data(N)  
        Z = transform_X(X)
        w_lin = lin(Z, y)
        
        X_test, y_test = generate_data(N)
        Z_test = transform_X(X_test)
        # Calculate E_out
        err = calc_err(Z_test, y_test, w_lin)
        err_list.append(err)
    avg_err = np.mean(err_list)
    return avg_err
    
test_lin(N=1000)    

Output:

0.126341

Question 16

For Questions 16-17, you will derive an algorithm for multinomial (multiclass) logistic regression (MLR). For a $K$-class classification problem, we will denote the output space $\mathcal{Y} = \{1, 2, \ldots, K\}$. The hypotheses considered by MLR are indexed by a list of $K$ weight vectors $(\mathbf{w}_1, \ldots, \mathbf{w}_K)$, each weight vector of length $d+1$. Each list represents a hypothesis

$$h_y(\mathbf{x}) = \frac{\exp(\mathbf{w}_y^T\mathbf{x})}{\sum_{i=1}^{K}\exp(\mathbf{w}_i^T\mathbf{x})}$$

that can be used to approximate the target distribution $P(y \mid \mathbf{x})$. MLR then seeks the maximum likelihood solution over all such hypotheses.

For general $K$, derive an $E_{\text{in}}(\mathbf{w}_1, \ldots, \mathbf{w}_K)$ like that on page 11 of the Lecture 10 slides by minimizing the negative log likelihood.

Sol:

The likelihood function of the multinomial logistic hypothesis on the data $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$ is

$$\text{likelihood}(h) = \prod_{n=1}^{N} P(\mathbf{x}_n)\,h_{y_n}(\mathbf{x}_n).$$

So the log likelihood is

$$\sum_{n=1}^{N} \ln h_{y_n}(\mathbf{x}_n) + \sum_{n=1}^{N} \ln P(\mathbf{x}_n),$$

and maximizing the log likelihood is equivalent to minimizing its negative, i.e.,

$$-\sum_{n=1}^{N} \ln h_{y_n}(\mathbf{x}_n) - \sum_{n=1}^{N} \ln P(\mathbf{x}_n).$$

Since the second term does not depend on $(\mathbf{w}_1, \ldots, \mathbf{w}_K)$, we can simply ignore it. Then

$$-\sum_{n=1}^{N} \ln h_{y_n}(\mathbf{x}_n) = \sum_{n=1}^{N}\left(\ln\Big(\sum_{i=1}^{K}\exp(\mathbf{w}_i^T\mathbf{x}_n)\Big) - \mathbf{w}_{y_n}^T\mathbf{x}_n\right).$$

Therefore, the error function we want to minimize is

$$E_{\text{in}}(\mathbf{w}_1, \ldots, \mathbf{w}_K) = \frac{1}{N}\sum_{n=1}^{N}\left(\ln\Big(\sum_{i=1}^{K}\exp(\mathbf{w}_i^T\mathbf{x}_n)\Big) - \mathbf{w}_{y_n}^T\mathbf{x}_n\right).$$
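A small numeric sketch of this $E_{\text{in}}$ (the softmax cross-entropy) on made-up data; the sizes, seed, and 0-based class labels below are choices for illustration only:

import numpy as np

def mlr_Ein(W, X, y):
    """W: (K, d+1) stacked weight vectors; X: (N, d+1) with x0 = 1; y: (N,) labels in {0, ..., K-1}."""
    S = X @ W.T                                   # scores w_i^T x_n, shape (N, K)
    log_norm = np.log(np.exp(S).sum(axis=1))      # ln sum_i exp(w_i^T x_n)
    return np.mean(log_norm - S[np.arange(len(y)), y])

rng = np.random.default_rng(0)
X = np.c_[np.ones(6), rng.normal(size=(6, 2))]    # N = 6 examples, d = 2
y = rng.integers(0, 3, size=6)                    # K = 3 classes
W = rng.normal(size=(3, 3))                       # K x (d+1)
print(mlr_Ein(W, X, y))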

Question 17

For the $E_{\text{in}}$ derived above, its gradient $\nabla E_{\text{in}}$ can be represented by $\left(\frac{\partial E_{\text{in}}}{\partial \mathbf{w}_1}, \frac{\partial E_{\text{in}}}{\partial \mathbf{w}_2}, \ldots, \frac{\partial E_{\text{in}}}{\partial \mathbf{w}_K}\right)$. Write down $\frac{\partial E_{\text{in}}}{\partial \mathbf{w}_i}$.

Sol:

Taking the derivative with respect to $\mathbf{w}_i$, we obtain

$$\frac{\partial E_{\text{in}}}{\partial \mathbf{w}_i} = \frac{1}{N}\sum_{n=1}^{N}\left(h_i(\mathbf{x}_n) - [\![y_n = i]\!]\right)\mathbf{x}_n.$$
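This expression can be checked against a finite-difference approximation. The sketch below is self-contained and reuses the same made-up data layout as the sketch after Question 16:

import numpy as np

def mlr_Ein(W, X, y):
    S = X @ W.T
    return np.mean(np.log(np.exp(S).sum(axis=1)) - S[np.arange(len(y)), y])

def mlr_grad_wi(W, X, y, i):
    # (1/N) * sum_n (h_i(x_n) - [[y_n = i]]) x_n
    S = X @ W.T
    h = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)
    return ((h[:, i] - (y == i)) @ X) / len(y)

rng = np.random.default_rng(0)
X = np.c_[np.ones(6), rng.normal(size=(6, 2))]
y = rng.integers(0, 3, size=6)
W = rng.normal(size=(3, 3))

i, j, eps = 1, 2, 1e-6                      # class index i, coordinate j
Wp, Wm = W.copy(), W.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
print(mlr_grad_wi(W, X, y, i)[j])                           # analytic partial derivative
print((mlr_Ein(Wp, X, y) - mlr_Ein(Wm, X, y)) / (2 * eps))  # numerical approximation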

Question 18

For Questions 18-20, you will play with logistic regression.

Please use the data set hw3_train.dat for training and the data set hw3_test.dat for testing.

Implement the fixed learning rate gradient descent algorithm for logistic regression. Run the algorithm with $\eta = 0.001$ and $T = 2000$. What is $E_{\text{out}}(g)$ from your algorithm, evaluated using the 0/1 error on the test set?

def sigmoid(s):
    return 1 / (1 + np.exp(-s))
def calc_grad_err(X, y, w):
    """
    Args:
        X: ndarray of shape = [N, d + 1]
        y: ndarray of shape = [N, 1]
        w: ndarray of shape = [d + 1, 1]
    returns:
        grad: ndarray of shape = [d + 1, 1]
    """
    N = X.shape[0]
    grad = (X * (-y)).T @ sigmoid(X @ w * (-y)) / N
    return grad
def lr(X, y, eta):
    """
    Args:
        X: ndarray of shape = [N, d]
        y: ndarray of shape = [N, ]
        eta: float
    Returns:
        wt: ndarray of shape = [d + 1, ]
    """
    N, d = X.shape
    X = np.c_[np.ones((N, 1)), X]
#     wt = np.random.normal(size=(d+1, 1))  # initializing with a normal distribution needs more steps to converge in this example
    wt = np.zeros((d + 1, 1))               # initializing all parameters to zero converges more quickly here
    y = y.reshape(-1, 1)                    # reshape y to a 2-D ndarray to keep `calc_grad_err` simple
     
    T = 2000
    for t in range(T):
        grad_t = calc_grad_err(X, y, wt)
        wt -= eta * grad_t

    wt = wt.flatten()
    return wt
    
def calc_zero_one_err(X, y, w):
    """
    Args:
        X: ndarray of shape = [N, d]
        y: ndarray of shape = [N, ]
        w: ndarray of shape = [d + 1, ]
    Returns:
        avg_err: float
    """
    N = X.shape[0]
    X = np.c_[np.ones((N, 1)), X]
    y_hat = np.sign(X @ w)  # h(x) > 1/2 <==> w^T x > 0
    avg_err = np.mean(y_hat != y)
    return avg_err
import pandas as pd
# Read in training and test data
df_train = pd.read_csv('hw3_train.dat', header=None, sep=r'\s+')
X_train_df, y_train_df = df_train.iloc[:, :-1], df_train.iloc[:, -1]
X_train, y_train = X_train_df.to_numpy(), y_train_df.to_numpy()

df_test = pd.read_csv('hw3_test.dat', header=None, sep=r'\s+')
X_test_df, y_test_df = df_test.iloc[:, :-1], df_test.iloc[:, -1]
X_test, y_test = X_test_df.to_numpy(), y_test_df.to_numpy()
eta = 0.001
w_lr = lr(X_train, y_train, eta)
# E_in
calc_zero_one_err(X_train, y_train, w_lr)

Output:

0.466
# E_out
calc_zero_one_err(X_test, y_test, w_lr)

Output:

0.475

Question 19

Implement the fixed learning rate gradient descent algorithm for logistic regression. Run the algorithm with $\eta = 0.01$ and $T = 2000$. What is $E_{\text{out}}(g)$ from your algorithm, evaluated using the 0/1 error on the test set?

eta = 0.01
w_lr = lr(X_train, y_train, eta)
# E_in
calc_zero_one_err(X_train, y_train, w_lr)

Output:

0.197
# E_out
calc_zero_one_err(X_test, y_test, w_lr)

Output:

0.22

Question 20

Implement the fixed learning rate stochastic gradient descent algorithm for logistic regression. Instead of randomly choosing $n$ in each iteration, please simply pick the example with the cyclic order $1, 2, \ldots, N, 1, 2, \ldots$

Run the algorithm with $\eta = 0.001$ and $T = 2000$. What is $E_{\text{out}}(g)$ from your algorithm, evaluated using the 0/1 error on the test set?

def lr_sgd(X, y, eta):
    """
    Args:
        X: ndarray of shape = [N, d]
        y: ndarray of shape = [N, ]
        eta: float
    Returns:
        wt: ndarray of shape = [d + 1, ]
    """
    N, d = X.shape
    X = np.c_[np.ones((N, 1)), X]
    wt = np.zeros((d + 1, 1))               
    y = y.reshape(-1, 1)                    
     
    T = 2000
    n = 0
    for t in range(T):
        xt, yt = X[n].reshape(-1, 1).T, y[n]
        grad_t = calc_grad_err(xt, yt, wt)
        wt -= eta * grad_t
        n = n + 1 if n != N - 1 else 0

    wt = wt.flatten()
    return wt
    
eta = 0.001
w_lr = lr_sgd(X_train, y_train, eta)
# E_in
calc_zero_one_err(X_train, y_train, w_lr)

Output:

0.464
# E_out
calc_zero_one_err(X_test, y_test, w_lr)

Output:

0.473

So for this data set, SGD with $T = 2000$ iterations has not yet caught up with full gradient descent; it needs more iterations to converge.
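One way to check this (a sketch only; it reuses calc_grad_err, calc_zero_one_err, and the training/test arrays defined above, and no outputs are claimed here) is to expose the number of iterations as a parameter and re-run the experiment with a larger budget:

def lr_sgd_T(X, y, eta, T):
    # same as lr_sgd above, but with the number of iterations T exposed as a parameter
    N, d = X.shape
    X = np.c_[np.ones((N, 1)), X]
    wt = np.zeros((d + 1, 1))
    y = y.reshape(-1, 1)
    n = 0
    for t in range(T):
        xt, yt = X[n].reshape(-1, 1).T, y[n]
        wt -= eta * calc_grad_err(xt, yt, wt)
        n = n + 1 if n != N - 1 else 0
    return wt.flatten()

# e.g., compare the original budget with a 10x larger one
for T in [2000, 20000]:
    w = lr_sgd_T(X_train, y_train, eta=0.001, T=T)
    print(T, calc_zero_one_err(X_test, y_test, w))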
