Back Propagation in Deep Neural Networks
Affine
- A simple single-factor affine transformation is
$$ Y = W \cdot X + b $$
- The affine transformation can also be represented as a computational graph.
- Here is the affine layer as a Python class.
import numpy as np

class Affine():
    def __init__(self, W, b):
        self.W = W
        self.b = b
        self.x = None
        self.dW = None
        self.db = None

    def forward(self, x):
        # Cache the input; it is needed for the backward pass.
        self.x = x
        return np.dot(x, self.W) + self.b

    def backward(self, d):
        # Gradients of the weights and the bias.
        self.dW = np.dot(self.x.T, d)
        self.db = np.sum(d, axis=0)
        # Gradient with respect to the input, passed to the previous layer.
        dx = np.dot(d, self.W.T)
        return dx
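- For reference, with the batched forward pass \( Y = X \cdot W + b \) used in the code, the backward pass computes \( dX = d \cdot W^T \), \( dW = X^T \cdot d \), and \( db = \sum_{batch} d \).
- Usage example (a minimal sketch; the shapes and values below are illustrative assumptions, not from the original):
# Minimal usage sketch for the Affine layer above.
np.random.seed(0)
x = np.random.randn(4, 3)                # batch of 4 inputs with 3 features
W = np.random.randn(3, 2)                # weights: 3 inputs -> 2 outputs
b = np.zeros(2)

affine = Affine(W, b)
y = affine.forward(x)                    # shape (4, 2)
d = np.ones_like(y)                      # pretend upstream gradient
dx = affine.backward(d)

print(y.shape, dx.shape)                 # (4, 2) (4, 3)
print(affine.dW.shape, affine.db.shape)  # (3, 2) (2,)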
Sigmoid
- Sigmoid is a popular activation function.
- Equation
$$ Y = \frac{1}{1 + \exp(-X)} $$
- Graph
- Python class
import numpy as np

class Sigmoid():
    def __init__(self):
        self.value = None

    def forward(self, x):
        out = 1 / (1 + np.exp(-x))
        # Cache the output; the local gradient is expressed with the output.
        self.value = out
        return out

    def backward(self, d):
        # dY/dX = Y * (1 - Y)
        dx = d * (1 - self.value) * self.value
        return dx
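- Because \( \frac{\partial{Y}}{\partial{X}} = Y(1 - Y) \), the backward pass can be checked against a numerical derivative. A minimal sketch (the input values are illustrative assumptions):
# Sanity check: compare the analytic gradient with a numerical one.
sigmoid = Sigmoid()
x = np.array([[-1.0, 0.0, 2.0]])
y = sigmoid.forward(x)
dx = sigmoid.backward(np.ones_like(y))   # analytic: y * (1 - y)

eps = 1e-5
numerical = (sigmoid.forward(x + eps) - sigmoid.forward(x - eps)) / (2 * eps)
print(np.max(np.abs(dx - numerical)))    # should be close to 0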
Softmax-with-loss
- Softmax-with-loss combines softmax with the cross entropy error cost function.
- During training, the cost must be calculated to update the weights and biases, so softmax-with-loss is the appropriate layer for training. For inference, softmax is not necessary, because the label with the highest score is chosen either way.
- For example, assume a neural network that classifies inputs into 3 labels.
- The graph is complicated, so the forward and backward graphs are shown separately; softmax and cross entropy error are also drawn as separate sub-graphs.
- Forward graph: Input -> Softmax -> Cross Entropy Error
- Forward graph: Softmax -> Cross Entropy Error -> Output
- L1, L2, and L3 are the labels.
- Backward graph: Output -> Cross Entropy Error -> Softmax
- The output of softmax-with-loss, Y, is the cost. Therefore, the differential value of cost node is \( \frac{\partial{COST}}{\partial{Y}} = \frac{\partial{Y}}{\partial{Y}} = 1 \).
- Backward graph: Cross Entropy Error -> Softmax -> Input
- If a node fans its output out to multiple nodes in the forward pass, it receives multiple gradients in the backward pass. In that case, the incoming gradients are summed. See the RECIP node.
- Python code
import numpy as np

class SoftmaxWithLoss():
    def __init__(self):
        self.loss = None
        self.Y = None
        self.labels = None

    def forward(self, X, labels):
        self.labels = labels
        self.Y = self.softmax(X)
        self.loss = self.cross_entropy_error(self.Y, self.labels)
        return self.loss

    def backward(self, d=1):
        batch_size = self.labels.shape[0]
        # Combined gradient of softmax and cross entropy error.
        dx = (self.Y - self.labels) / batch_size
        return dx

    def softmax(self, X):
        ret = None
        if X.ndim == 2:
            X = X.T
            # Subtract the per-sample max to avoid overflow.
            X = X - np.max(X, axis=0)
            Y = np.exp(X) / np.sum(np.exp(X), axis=0)
            ret = Y.T
        else:
            # To avoid overflow
            X = X - np.max(X)
            ret = np.exp(X) / np.sum(np.exp(X))
        return ret

    def cross_entropy_error(self, Y, labels):
        # Translate one-hot encoded labels to answer index.
        labels = labels.argmax(axis=1)
        batch_size = Y.shape[0]
        log_val = np.log(Y[np.arange(batch_size), labels])
        return -np.sum(log_val) / batch_size
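- Note that backward() returns (Y - labels) / batch_size, the well-known combined gradient of softmax followed by cross entropy error, so the two sub-graphs do not need to be chained explicitly in code.
- Usage example (a minimal sketch; the scores and one-hot labels below are illustrative assumptions):
# Minimal usage sketch with made-up scores and one-hot labels.
layer = SoftmaxWithLoss()
X = np.array([[2.0, 1.0, 0.1],
              [0.2, 3.0, 0.3]])          # raw scores for 2 samples, 3 labels
labels = np.array([[1, 0, 0],
                   [0, 1, 0]])           # one-hot answers
loss = layer.forward(X, labels)
dX = layer.backward()                    # (Y - labels) / batch_size
print(loss, dX.shape)                    # scalar cost, (2, 3)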
Rectified Linear Unit (ReLU)
- ReLU is the most widely used activation function. - Wiki
- The derivative of ReLU is either 1 or 0, so it is cheap to compute.
- It will be explained in detail later; here, only its back propagation is covered.
- Equation
$$ Y = \begin{cases} X & : X > 0 \\ 0 & : X \le 0 \end{cases} $$
$$ \frac{\partial{Y}}{\partial{X}} = \begin{cases} 1 & : X > 0 \\ 0 & : X \le 0 \end{cases} $$
- Graph if X is larger than 0.
- Graph if X is less than or equal to 0.
- Python code
import numpy as np

class RELU():
    def __init__(self):
        self.mask = None

    def forward(self, X):
        # Remember which elements were zeroed out.
        self.mask = (X <= 0)
        out = X.copy()
        out[self.mask] = 0
        return out

    def backward(self, d):
        # Gradient is 0 where the input was <= 0, and passes through otherwise.
        d[self.mask] = 0
        dx = d
        return dx
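- Putting the layers together, below is a sketch of one training step through a tiny 2-layer network, assuming the Affine, RELU, and SoftmaxWithLoss classes above are defined in the same script. The layer sizes, data, and learning rate are illustrative assumptions.
# A tiny 2-layer network: Affine -> RELU -> Affine -> SoftmaxWithLoss.
np.random.seed(0)
X = np.random.randn(4, 5)                    # 4 samples, 5 features
labels = np.eye(3)[[0, 2, 1, 0]]             # one-hot labels for 3 classes

affine1 = Affine(0.01 * np.random.randn(5, 10), np.zeros(10))
relu = RELU()
affine2 = Affine(0.01 * np.random.randn(10, 3), np.zeros(3))
loss_layer = SoftmaxWithLoss()

# Forward pass: compute the cost.
scores = affine2.forward(relu.forward(affine1.forward(X)))
loss = loss_layer.forward(scores, labels)

# Backward pass: propagate gradients in reverse order.
d = loss_layer.backward()
d = affine2.backward(d)
d = relu.backward(d)
d = affine1.backward(d)

# Simple gradient descent update with the accumulated gradients.
lr = 0.1
affine1.W -= lr * affine1.dW
affine1.b -= lr * affine1.db
affine2.W -= lr * affine2.dW
affine2.b -= lr * affine2.db
print(loss)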