Gradient descent for linear regression
Gradient Descent Algorithm
- A first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point. - Wiki
- Gradient Descent Algorithm repeatedly steps down the slope of the cost function at the current point until it reaches a local minimum.
- The new W and b are obtained by subtracting from the old values the product of the learning rate and the partial derivatives of the cost function at the current point (W, b):
$$ W = W - \alpha {\partial \over\partial W} cost(W, b) $$
$$ b = b - \alpha {\partial \over\partial b} cost(W, b) $$
- \( \alpha \) is the learning rate. It controls how large a step is taken along the cost function; if it is large, W and b change a lot at each update.
- The learning rate is a tunable value, usually less than 1. It is also common to make it smaller as the training runs progress (a small sketch of the update rule follows below).
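The two update equations can be tried directly. Below is a minimal sketch (not from the original post) that applies them to the mean squared error cost of a tiny hand-made dataset; the partial derivatives are estimated numerically with finite differences, so nothing beyond the two equations above is assumed.

def cost(W, b, x, y):
    # cost(W, b) = 1 / m * sum((W * x_i + b - y_i)^2)
    return sum((W * _x + b - _y) ** 2 for _x, _y in zip(x, y)) / len(x)

# Tiny hand-made dataset generated by W = 2, b = 0 (illustrative values)
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
W, b = 0.0, 0.0
alpha = 0.1   # learning rate
eps = 1e-6    # finite-difference step

for _ in range(500):
    # numeric partial derivatives of the cost at the current (W, b)
    dW = (cost(W + eps, b, x, y) - cost(W - eps, b, x, y)) / (2 * eps)
    db = (cost(W, b + eps, x, y) - cost(W, b - eps, x, y)) / (2 * eps)
    # W = W - alpha * d/dW cost(W, b),  b = b - alpha * d/db cost(W, b)
    W = W - alpha * dW
    b = b - alpha * db

print(W, b)   # converges to approximately W = 2, b = 0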
Gradient Descent Algorithm for Linear Regression
- Gradient descent update rule for simplified linear regression (hypothesis \( W * x \), no bias term)
$$ W_{new} = W - \alpha {\frac{1}{m} \sum_{i=1}^m ((W * x_{i})-y_{i}) * x_i} $$
- Derivation of the update rule for simplified linear regression. Scaling the cost by a constant does not move its minimum (the scaling is absorbed into the learning rate \( \alpha \)), so \( \frac{1}{m} \) can be replaced with \( \frac{1}{2m} \) to cancel the factor of 2 produced by the derivative.
$$ cost(W) = \frac{1}{m} \sum_{i=1}^m ((W * x_{i})-y_{i})^2 $$
$$ W_{new} = W - \alpha {\partial \over\partial W} cost(W) $$
$$ W_{new} = W - \alpha {\partial \over\partial W} {\frac{1}{m} \sum_{i=1}^m ((W * x_{i})-y_{i})^2} $$
$$ W_{new} = W - \alpha {\partial \over\partial W} {\frac{1}{2m} \sum_{i=1}^m ((W * x_{i})-y_{i})^2} $$
$$ W_{new} = W - \alpha {\frac{1}{2m} \sum_{i=1}^m 2((W * x_{i})-y_{i}) * x_i} $$
$$ W_{new} = W - \alpha {\frac{1}{m} \sum_{i=1}^m ((W * x_{i})-y_{i}) * x_i} $$
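The following Python script implements this update rule for the simplified model (no bias term): it draws the cost curve over a range of W values and then marks ten gradient descent steps starting from W = 0.18.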
import numpy as np
import matplotlib.pyplot as plt
def calc_cost(w):
    # hypothesis = W * x
    hypo = [w * _x for _x in x]
    # squared error for each sample
    _mse = list(map(lambda _hypo, _answer: (_hypo - _answer) ** 2, hypo, y))
    sumMse = sum(_mse)
    # cost(W) = 1 / m * sum((W * x - y)^2), where m is the number of samples
    return 1 / len(x) * sumMse
# Set size of plot
width = 8
height = 7
plt.figure(figsize=(width, height))
# number of points
num_points = 300
# For the sample data
# Answer values
x_ans = [i * 0.01 for i in range(-200, 200)]
W_ans = 0.1
# Generate normal random values for input and output
x = [np.random.normal(0.0, 0.55) for i in range(num_points)]
y = [W_ans * _x + np.random.normal(0.0, 0.03) for _x in x]
# For the Mean Square Error
# Weight from 0 to 0.2
W = [i * 0.001 for i in range(0, 201)]
# Cost
cost = []
# Draw cost function
for w in W:
    cost.append(calc_cost(w))
plt.plot(W, cost, "r")
# Start from particular points
w = 0.18
learning_rate = 0.1
# Cost and Gradient descent
for i in range(10):
    # Calculate cost
    _cost = calc_cost(w)
    # Draw calculated value
    plt.plot(w, _cost, "o",
             label="W: {0:5.3f}, Cost(W): {1:5.5f}".format(w, _cost))
    # Calculate gradient: (1 / m) * sum((W * x - y) * x)
    gradients = list(map(
        lambda _input, _answer: ((w * _input) - _answer) * _input, x, y))
    sumGrad = sum(gradients)
    # Descend and update w
    w = w - learning_rate / len(x) * sumGrad
plt.title("Gradient Descent Algorithm")
plt.xlim(0.15, 0.19)
plt.ylim(0.0025, 0.0045)
plt.xlabel("W")
plt.ylabel("Cost(W)")
plt.grid()
plt.legend(numpoints=1,loc='lower right')
plt.show()
Convex Function
- Gradient Descent Algorithm follows the partial derivative (the local slope of the cost), so it has a weakness.
- It converges to a point that is a local minimum. However, a cost function can have several local minima, not just one, as in the image below. What we actually want is the global minimum, the lowest point of all.
- Because of this, the starting value matters a lot in such cases (a small sketch of this is given after this list).
- However, if the cost function is a convex function, this problem does not arise.
- A convex function, as in the image below, has only one local minimum, and that minimum is the global minimum.
- Therefore, when you solve a problem with machine learning, you should try to use a cost function that is convex (a quick convexity check for the squared-error cost follows below). If that is not possible, you need techniques for choosing a better starting point, or you have to run the training from multiple starting points.
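To make the starting point issue concrete, here is a small sketch (not from the original post) that runs the same update rule on a hand-picked non-convex cost \( cost(w) = w^4 - 3w^2 + w \). Depending on where it starts, gradient descent settles in different minima.

def grad(w):
    # derivative of the non-convex cost(w) = w^4 - 3w^2 + w
    return 4 * w ** 3 - 6 * w + 1

for w_start in (-2.0, 2.0):
    w = w_start
    for _ in range(200):
        w = w - 0.01 * grad(w)
    print("start: {0:+.1f} -> settles near w = {1:+.3f}".format(w_start, w))

# Starting from -2.0 it reaches the global minimum near w = -1.30;
# starting from +2.0 it gets stuck in the higher local minimum near w = +1.13.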
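For the simplified linear regression cost used above, convexity in W can be verified directly: the second derivative is a nonnegative constant, so there is a single minimum and it is the global one.
$$ {\partial^2 \over \partial W^2} cost(W) = {\partial \over\partial W} \left( \frac{2}{m} \sum_{i=1}^m ((W * x_{i})-y_{i}) * x_i \right) = \frac{2}{m} \sum_{i=1}^m x_i^2 \geq 0 $$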