An explanation of learning rate, data preprocessing, and overfitting in machine learning
Learning Rate
- Learning rate is a value indicating how much the weights and biases are changed at each update step.
$$ W_{n} = W_{n-1} - \alpha \cdot \sum \frac{\partial Cost(W)}{\partial W} $$ $$ b_{n} = b_{n-1} - \alpha \cdot \sum \frac{\partial Cost(b)}{\partial b} $$
- If the learning rate is large, the weights and biases move a long distance along the cost function at each step; a minimal update step is sketched below.
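As a minimal sketch of the update rule above (the data, the hypothesis H(x) = w * x + b, and the step count here are illustrative assumptions, not from the original post), one run of gradient descent in Python could look like this:
# Minimal sketch of gradient descent for H(x) = w * x + b with mean squared error.
# The data, learning rate, and step count are made up for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]      # target relation: y = 2x, so ideally w -> 2, b -> 0
w, b = 0.0, 0.0
alpha = 0.01                   # learning rate

for step in range(2000):
    # Gradients of 1/m * sum((w*x + b - y)^2) with respect to w and b
    grad_w = sum(2 * (w * xi + b - yi) * xi for xi, yi in zip(x, y)) / len(x)
    grad_b = sum(2 * (w * xi + b - yi) for xi, yi in zip(x, y)) / len(x)
    # A larger alpha takes a larger step along the cost surface
    w -= alpha * grad_w
    b -= alpha * grad_b

print(w, b)                    # approaches w = 2, b = 0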
Overshooting - Too Large Learning Rate
- If the learning rate is too large, the weight overshoots the minimum, as shown below.
import matplotlib.pyplot as plt

def calc_cost(w):
    # hypothesis = w * x
    hypo = [w * _x for _x in x]
    # MSE: 1 / m * sum((w * x - y)^2)
    _mse = [(_hypo - _answer) ** 2 for _hypo, _answer in zip(hypo, y)]
    return sum(_mse) / len(x)

# Generate input and answer: y = x
x = [i * 0.01 for i in range(-100, 100)]
y = [i for i in x]

# To show the effect of a large learning rate,
# the weight range for the cost curve is wide
W = [i * 0.001 for i in range(-5000, 5001)]

# Draw the cost function
costs = [calc_cost(w) for w in W]
plt.plot(W, costs, "r")

# Start from a particular point
w = 0.18
learning_rate = 100

# Weight range used only to scale the update step
W = [i * 0.001 for i in range(-1000, 1001)]

# Cost and gradient descent
for i in range(3):
    # Calculate and plot the current cost
    _cost = calc_cost(w)
    plt.plot(w, _cost, "o",
             label="Trial: {0} W: {1:3.2f}, Cost(W): {2:3.2f}".format(i, w, _cost))
    # Gradient: sum((w * x - y) * x)
    gradients = [((w * _input) - _answer) * _input for _input, _answer in zip(x, y)]
    sumGrad = sum(gradients)
    # Descend and update w
    w = w - learning_rate / len(W) * sumGrad

plt.title("Effect of Large Learning Rate")
plt.xlabel("W")
plt.ylabel("Cost(W)")
plt.xlim(-5, 5)
plt.grid()
plt.legend(numpoints=1, loc="upper right")
plt.show()
Image 1. Effect of large learning rate
- A large learning rate drives the weight and the cost in the wrong direction.
- In this example, the weight climbed up the cost curve, and the cost became high.
- If the learning rate is too small, training barely makes progress, as shown below.
import matplotlib.pyplot as plt

def calc_cost(w):
    # hypothesis = w * x
    hypo = [w * _x for _x in x]
    # MSE: 1 / m * sum((w * x - y)^2)
    _mse = [(_hypo - _answer) ** 2 for _hypo, _answer in zip(hypo, y)]
    return sum(_mse) / len(x)

# Generate input and answer: y = x
x = [i * 0.01 for i in range(-100, 100)]
y = [i for i in x]

# Weight range for the cost curve
W = [i * 0.001 for i in range(0, 2001)]

# Draw the cost function
costs = [calc_cost(w) for w in W]
plt.plot(W, costs, "r")

# Start from a particular point
w = 0.18
learning_rate = 0.000001

# Weight range used only to scale the update step
W = [i * 0.001 for i in range(-1000, 1001)]

# Cost and gradient descent
for i in range(10000):
    # Calculate and plot the current cost
    _cost = calc_cost(w)
    plt.plot(w, _cost, "o")
    # Gradient: sum((w * x - y) * x)
    gradients = [((w * _input) - _answer) * _input for _input, _answer in zip(x, y)]
    sumGrad = sum(gradients)
    # Descend and update w
    w = w - learning_rate / len(W) * sumGrad

plt.title("Effect of Small Learning Rate")
plt.xlabel("W")
plt.ylabel("Cost(W)")
plt.xlim(0.179, 0.181)
plt.grid()
plt.show()
Image 2. Effect of small learning rate
- The weight moves too slowly.
- In this example, the weight is updated 10,000 times, but it is still near the starting point.
- Therefore, it is very important to set the learning rate properly.
- For now, the learning rate is set intuitively; how to choose an optimized value will be introduced later.
Data Preprocessing
- Input data can come from anywhere, so its ranges vary. Some data will lie between -1 ~ 1, and other data will lie between -100 ~ 100.
- If we have 2 factors, A and B, for 1 training instance, and A's range is -1 ~ 1 while B's range is -1000 ~ 1000, it is difficult to optimize the weights. Since A's range is much smaller than B's, A is more sensitive.
from matplotlib import patches
import matplotlib.pyplot as plt

# Set size of plot
plt.figure(figsize=(7, 7))
ax = plt.gca()

# Contours are narrow along A and wide along B
xcenter, ycenter = 0, 0
width, height = 2, 25

# Draw shrinking ellipses to mimic contour lines of the cost
for i in range(20):
    width = width - width * 0.1
    height = height - height * 0.1
    e1 = patches.Ellipse((xcenter, ycenter), width, height,
                         fill=False, zorder=2)
    ax.add_patch(e1)

plt.xlim(-10, 10)
plt.ylim(-10, 10)
plt.xlabel("A")
plt.ylabel("B")
plt.title("Ellipse for Different Ranges' Data")
plt.show()
Image 3. Ellipse for different ranges' data
- In this example, reducing the value of A moves the point toward the center more effectively than reducing B. This is unfair for B. Therefore, the local minimum that is found might not be a global minimum, but only a minimum for A, because A has a stronger effect on gradient descent, which seeks the shortest path to a local minimum.
- To overcome this problem, preprocessing is useful. - CS231n
- zero-centered data
- normalized data
Image 4. Data preprocessing
- Standardization is one of the most popular normalization methods.
$$ X^{'} = \frac{X - \mu}{\sigma} $$
- \( \mu \) is the mean of X, and \( \sigma \) is the standard deviation of X.
- This can be represented by the code below in Python.
X_std[:, 0] = (X[:, 0] - X[:, 0].mean()) / X[:, 0].std()
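As a minimal, self-contained sketch (the feature matrix X below is made-up illustration data, not from the original post), the same standardization applied to every column with NumPy could look like this:
import numpy as np

# Hypothetical data: column 0 ranges roughly -1 ~ 1, column 1 roughly -1000 ~ 1000
X = np.array([[ 0.5,  300.0],
              [-0.2, -750.0],
              [ 0.9,  120.0],
              [-0.7,  980.0]])

# Standardize column by column: X' = (X - mean) / std
X_std = np.empty_like(X)
X_std[:, 0] = (X[:, 0] - X[:, 0].mean()) / X[:, 0].std()
X_std[:, 1] = (X[:, 1] - X[:, 1].mean()) / X[:, 1].std()

# Equivalent vectorized form for all columns at once
X_std_all = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std)       # each column now has mean 0 and standard deviation 1
print(X_std_all)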
Overfitting
- A model describes random error or noise instead of the underlying relationship. - Wiki
- The main task of machine learning is to find a general model that fits a set of training data. However, some training data might be noise, which can disturb training. Furthermore, noisy points can distort the model toward themselves. In that case, we say the model is overfitting.
Image 5. Overfitting
- The black line in the image is the general decision model. However, when some noise points strongly affect the training, the model can become the green line.
- To overcome overfitting:
  - Train with more training data
  - Reduce the number of factors
- Also, there are some techniques to reduce overfitting, such as regularization.
Regularization
- Weight decay
  - Keeps the weights from becoming too large.
  - To do that, add the square of the weights to the cost function, scaled by the regularization strength \(\lambda\).
$$ Cost(W) = \frac{1}{N} \sum_{i} Diff(H(X_i), Y_i) + \lambda \cdot \sum W^2 $$ $$ W = W - \alpha \frac{\partial}{\partial W} Cost(W) $$
- Overfitting usually appears when the weights are too large.
- If the square of the weights is added to the cost function, gradient descent shrinks the weights faster. Therefore, it helps to avoid overfitting.
- The regularization strength is similar to the learning rate. If it is large, the penalty on the weights is high. As a result, the weights become smaller faster.
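As a minimal sketch of weight decay under these ideas (the data, learning rate, and \(\lambda\) below are illustrative assumptions), the squared-weight penalty can be added to the cost and its gradient like this:
# Minimal sketch of L2 regularization (weight decay) for H(x) = w * x with MSE.
# Data, learning rate, and lambda are made up for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]       # roughly y = 2x with a little noise
w = 0.0
alpha = 0.01                    # learning rate
lam = 0.1                       # regularization strength (lambda)

def cost(w):
    mse = sum((w * xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)
    return mse + lam * w ** 2   # penalty grows with the squared weight

for step in range(2000):
    grad_mse = sum(2 * (w * xi - yi) * xi for xi, yi in zip(x, y)) / len(x)
    grad_reg = 2 * lam * w      # gradient of lambda * w^2 pulls w toward 0
    w -= alpha * (grad_mse + grad_reg)

print(w, cost(w))               # w ends up slightly below the unregularized fit
Increasing lam in this sketch pushes w further toward zero, which is the weight-decay effect described above.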