Cost function and gradient descent for binary classification
ToC
- Mean Square Error for Binary Classification
- Cross Entropy Error (CEE) for Binary Classification
- Gradient Descent for Binary Classification
Mean Square Error for Binary Classification
- MSE (Mean Square Error) is the representative cost function for linear regression.
- Let's apply MSE to binary classification with the sigmoid hypothesis; the cost that the code evaluates is written out below.
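- Written out (this formula is a sketch added for reference; m denotes the number of samples and $x^{(i)}, y^{(i)}$ the i-th input and answer, borrowing the notation of the CEE section below), the cost is:
$$ H(x) = \frac{1}{1 + e^{-Wx}} $$ $$ \text{cost}(W) = \frac{1}{m} \sum_{i=1}^{m} \left( H(x^{(i)}) - y^{(i)} \right)^2 $$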
import numpy as np
import matplotlib.pyplot as plt
# Simplified hypothesis - sigmoid function
# w is taken from the loop variable below
hypo = lambda _x: 1 / (1 + np.exp(-w * _x))
# Input
X = [i * 0.01 for i in range(-200, 200)]
# Answer - if x is less than 0, y is 0.
# If x is larger than 0, y is 1.
Y = [i > 0 for i in range(-200, 200)]
# Mean Square Error for a single sample
cost = lambda _output, _answer: (_output - _answer) ** 2
costs = []
# Weights to evaluate
W = [i * 0.1 for i in range(-1000, 1001)]
for w in W:
    _hypo = list(map(hypo, X))
    diffSqrts = list(map(cost, _hypo, Y))
    sumDiffSqrt = sum(diffSqrts)
    # Average over the number of samples
    costs.append(sumDiffSqrt / len(X))
# Draw cost function
plt.plot(W, costs)
plt.title("Mean Square Error")
plt.xlabel("W")
plt.ylabel("Cost(W)")
plt.show()
Image 1. MSE for binary classification
- Contrary to expectation, the MSE curve for binary classification looks almost like a reversed sigmoid function.
- The curve is nearly flat over most of the range of W, so trying to find the global minimum on this graph is meaningless.
- Therefore, MSE cannot be used as the cost function for binary classification.
Cross Entropy Error (CEE) for Binary Classification
- For binary classification, CEE is used as the cost function.
- The simplified CEE for binary classification is:
$$ H(x) = \frac{1}{1 + e^{-Wx}} $$ $$ c(H(x), y) = \begin{cases} -\log(H(x)) & : y = 1 \\ -\log(1-H(x)) & : y = 0 \end{cases} $$ $$ c(H(x), y) = -y \cdot \log(H(x)) - (1-y) \cdot \log(1-H(x)) $$ $$ \text{cost}(W) = \frac{1}{m} \sum_{i=1}^{m} c(H(x^{(i)}), y^{(i)}) $$
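- As a quick check (added here for reference, not part of the original derivation), plugging y = 1 and y = 0 into the combined formula recovers the two cases above:
$$ c(H(x), 1) = -1 \cdot \log(H(x)) - 0 \cdot \log(1-H(x)) = -\log(H(x)) $$ $$ c(H(x), 0) = -0 \cdot \log(H(x)) - 1 \cdot \log(1-H(x)) = -\log(1-H(x)) $$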
import numpy as np
import matplotlib.pyplot as plt
# Result of the simplified hypothesis - sigmoid outputs between 0 and 1
H = [i * 0.001 for i in range(1, 1000)]
# Answer - if x is less than 0, y is 0.
# If x is larger than 0, y is 1.
Y = [i > 0 for i in range(-100, 100)]
# Cross Entropy Error - h is taken from the loop variable below
cost0 = lambda _y: -(1 - _y) * np.log(1 - h)
cost1 = lambda _y: -_y * np.log(h)
cost = lambda _y: -(1 - _y) * np.log(1 - h) - _y * np.log(h)
# Lists for outputs
costs0 = []
costs1 = []
costs = []
for h in H:
    # For y = 0
    errors0 = list(map(cost0, Y))
    costs0.append(sum(errors0) / len(Y))
    # For y = 1
    errors1 = list(map(cost1, Y))
    costs1.append(sum(errors1) / len(Y))
    # For both
    errors = list(map(cost, Y))
    costs.append(sum(errors) / len(Y))
# Graphs
plt.plot(H, costs0, label="y = 0")
plt.plot(H, costs1, label="y = 1")
plt.plot(H, costs, label="y = 0 or 1")
plt.title("Cross Entropy Error")
plt.xlabel("H")
plt.ylabel("Cost(H)")
plt.xlim(-0.1, 1.1)
plt.grid()
plt.legend(loc="upper center")
plt.show()
Image 2. CEE for binary classification
- To keep the graph simple, the x axis is the output h of the hypothesis rather than the weight.
- The hypothesis comes from the sigmoid function, so h is between 0 and 1.
- When y = 1, the error becomes smaller as h grows; when h is 1, the error is 0.
- The error for y = 0 has the reversed shape: it grows as h grows.
- Therefore, if the hypothesis is close to the answer y, the error is small; if not, the error approaches infinity (see the numeric check after this list).
- This is a desirable property for a cost function.
- Furthermore, the curve for the combined y = 0 and y = 1 cost is convex, similar to the graph of a quadratic equation.
- As a result, the gradient descent algorithm is also effective at finding the global minimum for binary classification.
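- A minimal numeric check (added here, not part of the original code; it reuses the combined formula above) shows both regimes:
import numpy as np
# Combined cross entropy error for a single prediction h and answer y
cee = lambda h, y: -y * np.log(h) - (1 - y) * np.log(1 - h)
# Prediction close to the answer -> error near 0
print(cee(0.99, 1))    # about 0.01
print(cee(0.01, 0))    # about 0.01
# Prediction far from the answer -> error grows toward infinity
print(cee(0.0001, 1))  # about 9.2
print(cee(0.9999, 0))  # about 9.2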
Gradient Descent for Binary Classification
- The update rule is the same as the one used for linear regression.
$$ W = W - \alpha \frac{\partial}{\partial W} \text{cost}(W) $$
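- For reference (this closed form is a standard result for the sigmoid hypothesis with the CEE cost and is not derived in these notes; the code below approximates the derivative numerically instead of using it):
$$ \frac{\partial}{\partial W} \text{cost}(W) = \frac{1}{m} \sum_{i=1}^{m} \left( H(x^{(i)}) - y^{(i)} \right) x^{(i)} $$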
import numpy as np
import matplotlib.pyplot as plt
# Hypothesis - sigmoid function
hypo = lambda _w, _x: 1 / (1 + np.exp(-_w * _x))
# Cost - cross entropy error for a single sample
cost = lambda _hypo, _y: -(1 - _y) * np.log(1 - _hypo) - _y * np.log(_hypo)
# Gradient
def cost_gradient(w, X, Y):
    # Central difference to approximate the derivative numerically
    # h is the small change applied to w
    h = 1e-2
    # Keep the original w while evaluating w + h and w - h
    tmp_val = w
    # Calculate forward values: cost(w + h)
    W = [tmp_val + h for i in range(len(X))]
    _hypo = list(map(hypo, W, X))
    fxh1 = sum(list(map(cost, _hypo, Y))) / len(W)
    # Calculate backward values: cost(w - h)
    W = [tmp_val - h for i in range(len(X))]
    _hypo = list(map(hypo, W, X))
    fxh2 = sum(list(map(cost, _hypo, Y))) / len(W)
    # Slope between the two evaluations
    grad = (fxh1 - fxh2) / (2 * h)
    return grad
# Input
X = [i * 0.01 for i in range(-200, 200)]
# Answer
Y = [i > 0 for i in range(-200, 200)]
# Weight
w = 0.6
# Learning rate
lr = 0.1
costs = []
trials = [i for i in range(10)]
for t in trials:
    W = [w for i in range(len(X))]
    _hypo = list(map(hypo, W, X))
    _cost = sum(list(map(cost, _hypo, Y))) / len(_hypo)
    costs.append(_cost)
    grad = cost_gradient(w, X, Y)
    w = w - lr * grad
    plt.plot(t, _cost, "o", label="w = {0:4.3f}".format(w))
plt.plot(trials, costs)
plt.xlabel("trials")
plt.ylabel("cost")
plt.grid()
plt.legend(numpoints=1, loc="upper right", ncol=2)
plt.show()
Image 3. Gradient descent for binary classification
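- As a sanity check (a minimal sketch added for reference; it assumes the hypo, cost, cost_gradient, X, and Y definitions above are in scope), the numerical gradient can be compared with the closed-form gradient shown earlier:
# Closed-form gradient: d cost / dW = mean((H(x) - y) * x)
def analytic_gradient(w, X, Y):
    return sum((hypo(w, x) - y) * x for x, y in zip(X, Y)) / len(X)

w_check = 0.6
print(cost_gradient(w_check, X, Y))      # numerical estimate
print(analytic_gradient(w_check, X, Y))  # closed-form value, should be close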