An explanation of learning rate, data preprocessing, and overfitting in machine learning
Learning Rate
- Learning rate is a value indicating how much the weights and biases are changed at each update step.
$$ W_{n} = W_{n-1} - \alpha \cdot \sum \frac{\partial Cost(W)}{\partial W} $$ $$ b_{n} = b_{n-1} - \alpha \cdot \sum \frac{\partial Cost(b)}{\partial b} $$
- If the learning rate is large, the weights and biases move a long distance along the cost function at each step; a minimal update step is sketched below.
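As a minimal sketch of the update rule above (the data, the hypothesis H(x) = w * x + b, and the step count here are illustrative assumptions, not from the original post), one run of gradient descent in Python could look like this:
# Minimal sketch of gradient descent for H(x) = w * x + b with mean squared error.
# The data, learning rate, and step count are made up for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]      # target relation: y = 2x, so ideally w -> 2, b -> 0
w, b = 0.0, 0.0
alpha = 0.01                   # learning rate

for step in range(2000):
    # Gradients of 1/m * sum((w*x + b - y)^2) with respect to w and b
    grad_w = sum(2 * (w * xi + b - yi) * xi for xi, yi in zip(x, y)) / len(x)
    grad_b = sum(2 * (w * xi + b - yi) for xi, yi in zip(x, y)) / len(x)
    # A larger alpha takes a larger step along the cost surface
    w -= alpha * grad_w
    b -= alpha * grad_b

print(w, b)                    # approaches w = 2, b = 0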
Overshooting - Too Large Learning Rate
- If the learning rate is too large, the weight overshoots the minimum, as shown below.
import matplotlib.pyplot as plt

def calc_cost(w):
    # hypothesis = w * x
    hypo = [w * _x for _x in x]
    # MSE: 1 / m * sum((w * x - y)^2)
    _mse = [(_hypo - _answer) ** 2 for _hypo, _answer in zip(hypo, y)]
    return sum(_mse) / len(x)

# Generate input and answer: y = x
x = [i * 0.01 for i in range(-100, 100)]
y = [i for i in x]

# To show the effect of a large learning rate,
# the weight range for the cost curve is wide
W = [i * 0.001 for i in range(-5000, 5001)]

# Draw the cost function
costs = [calc_cost(w) for w in W]
plt.plot(W, costs, "r")

# Start from a particular point
w = 0.18
learning_rate = 100

# Weight range used only to scale the update step
W = [i * 0.001 for i in range(-1000, 1001)]

# Cost and gradient descent
for i in range(3):
    # Calculate and plot the current cost
    _cost = calc_cost(w)
    plt.plot(w, _cost, "o",
             label="Trial: {0} W: {1:3.2f}, Cost(W): {2:3.2f}".format(i, w, _cost))
    # Gradient: sum((w * x - y) * x)
    gradients = [((w * _input) - _answer) * _input for _input, _answer in zip(x, y)]
    sumGrad = sum(gradients)
    # Descend and update w
    w = w - learning_rate / len(W) * sumGrad

plt.title("Effect of Large Learning Rate")
plt.xlabel("W")
plt.ylabel("Cost(W)")
plt.xlim(-5, 5)
plt.grid()
plt.legend(numpoints=1, loc="upper right")
plt.show()
Image 1. Effect of large learning rate
- A large learning rate drives the weight and the cost in the wrong direction.
- In this example, the weight climbed up the cost curve, and the cost became high.
- If the learning rate is too small, training barely makes progress, as shown below.
import matplotlib.pyplot as plt

def calc_cost(w):
    # hypothesis = w * x
    hypo = [w * _x for _x in x]
    # MSE: 1 / m * sum((w * x - y)^2)
    _mse = [(_hypo - _answer) ** 2 for _hypo, _answer in zip(hypo, y)]
    return sum(_mse) / len(x)

# Generate input and answer: y = x
x = [i * 0.01 for i in range(-100, 100)]
y = [i for i in x]

# Weight range for the cost curve
W = [i * 0.001 for i in range(0, 2001)]

# Draw the cost function
costs = [calc_cost(w) for w in W]
plt.plot(W, costs, "r")

# Start from a particular point
w = 0.18
learning_rate = 0.000001

# Weight range used only to scale the update step
W = [i * 0.001 for i in range(-1000, 1001)]

# Cost and gradient descent
for i in range(10000):
    # Calculate and plot the current cost
    _cost = calc_cost(w)
    plt.plot(w, _cost, "o")
    # Gradient: sum((w * x - y) * x)
    gradients = [((w * _input) - _answer) * _input for _input, _answer in zip(x, y)]
    sumGrad = sum(gradients)
    # Descend and update w
    w = w - learning_rate / len(W) * sumGrad

plt.title("Effect of Small Learning Rate")
plt.xlabel("W")
plt.ylabel("Cost(W)")
plt.xlim(0.179, 0.181)
plt.grid()
plt.show()
Image 2. Effect of small learning rate
- The weight moves too slowly.
- In this example, the weight is updated 10,000 times, but it is still near the starting point.
- Therefore, it is very important to set the learning rate properly.
- For now, the learning rate is set intuitively; how to choose an optimized value will be introduced later.
Data Preprocessing
- Input data can come from anywhere, so its ranges vary. Some data will lie between -1 ~ 1, and other data will lie between -100 ~ 100.
- If we have 2 factors, A and B, for 1 training instance, and A's range is -1 ~ 1 while B's range is -1000 ~ 1000, it is difficult to optimize the weights. Since A's range is much smaller than B's, A is more sensitive.
from matplotlib import patches
import matplotlib.pyplot as plt

# Set size of plot
plt.figure(figsize=(7, 7))
ax = plt.gca()

# Contours are narrow along A and wide along B
xcenter, ycenter = 0, 0
width, height = 2, 25

# Draw shrinking ellipses to mimic contour lines of the cost
for i in range(20):
    width = width - width * 0.1
    height = height - height * 0.1
    e1 = patches.Ellipse((xcenter, ycenter), width, height,
                         fill=False, zorder=2)
    ax.add_patch(e1)

plt.xlim(-10, 10)
plt.ylim(-10, 10)
plt.xlabel("A")
plt.ylabel("B")
plt.title("Ellipse for Different Ranges' Data")
plt.show()
Image 3. Ellipse for different ranges' data
- In this example, reducing the value of A moves the point toward the center more effectively than reducing B. This is unfair for B. Therefore, the local minimum that is found might not be a global minimum, but only a minimum for A, because A has a stronger effect on gradient descent, which seeks the shortest path to a local minimum.
- To overcome this problem, preprocessing is useful. - CS231n
- zero-centered data
- normalized data
Image 4. Data preprocessing
- Standardization is one of the most popular normalization methods.
$$ X^{'} = \frac{X - \mu}{\sigma} $$
- \( \mu \) is the mean of X, and \( \sigma \) is the standard deviation of X.
- This can be represented by the code below in Python.
X_std[:, 0] = (X[:, 0] - X[:, 0].mean()) / X[:, 0].std()
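As a minimal, self-contained sketch (the feature matrix X below is made-up illustration data, not from the original post), the same standardization applied to every column with NumPy could look like this:
import numpy as np

# Hypothetical data: column 0 ranges roughly -1 ~ 1, column 1 roughly -1000 ~ 1000
X = np.array([[ 0.5,  300.0],
              [-0.2, -750.0],
              [ 0.9,  120.0],
              [-0.7,  980.0]])

# Standardize column by column: X' = (X - mean) / std
X_std = np.empty_like(X)
X_std[:, 0] = (X[:, 0] - X[:, 0].mean()) / X[:, 0].std()
X_std[:, 1] = (X[:, 1] - X[:, 1].mean()) / X[:, 1].std()

# Equivalent vectorized form for all columns at once
X_std_all = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std)       # each column now has mean 0 and standard deviation 1
print(X_std_all)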
Overfitting
- A model describes random error or noise instead of the underlying relationship. - Wiki
- The main task of machine learning is to find a general model that fits a set of training data. However, some training data might be noise, which can disturb training. Furthermore, noisy points can distort the model toward themselves. In that case, we say the model is overfitting.
Image 5. Overfitting
- The black line in the image is the general decision model. However, when some noise points strongly affect the training, the model can become the green line.
- To overcome overfitting:
  - Train with more training data
  - Reduce the number of factors
- Also, there are some techniques to reduce overfitting, such as regularization.
Regularization
- Weight decay
  - Keeps the weights from becoming too large.
  - To do that, add the square of the weights to the cost function, scaled by the regularization strength \(\lambda\).
$$ Cost(W) = \frac{1}{N} \sum_{i} Diff(H(X_i), Y_i) + \lambda \cdot \sum W^2 $$ $$ W = W - \alpha \frac{\partial}{\partial W} Cost(W) $$
- Overfitting usually appears when the weights are too large.
- If the square of the weights is added to the cost function, gradient descent shrinks the weights faster. Therefore, it helps to avoid overfitting.
- The regularization strength is similar to the learning rate. If it is large, the penalty on the weights is high. As a result, the weights become smaller faster.
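As a minimal sketch of weight decay under these ideas (the data, learning rate, and \(\lambda\) below are illustrative assumptions), the squared-weight penalty can be added to the cost and its gradient like this:
# Minimal sketch of L2 regularization (weight decay) for H(x) = w * x with MSE.
# Data, learning rate, and lambda are made up for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]       # roughly y = 2x with a little noise
w = 0.0
alpha = 0.01                    # learning rate
lam = 0.1                       # regularization strength (lambda)

def cost(w):
    mse = sum((w * xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)
    return mse + lam * w ** 2   # penalty grows with the squared weight

for step in range(2000):
    grad_mse = sum(2 * (w * xi - yi) * xi for xi, yi in zip(x, y)) / len(x)
    grad_reg = 2 * lam * w      # gradient of lambda * w^2 pulls w toward 0
    w -= alpha * (grad_mse + grad_reg)

print(w, cost(w))               # w ends up slightly below the unregularized fit
Increasing lam in this sketch pushes w further toward zero, which is the weight-decay effect described above.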