24. ReLU and Activation Functions

A DNN for the XOR problem, the vanishing gradient problem, and the benefit of ReLU


Vanishing Gradient Problem

  • Here is a DNN for the XOR problem with 9 hidden layers.
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# For reproducibility
tf.set_random_seed(777)

# Learning rate
learning_rate = 0.1

# Input data
x_data = [[0, 0],
          [0, 1],
          [1, 0],
          [1, 1]]
# Labels
y_data = [[0],
          [1],
          [1],
          [0]]

# Input array
x_data = np.array(x_data, dtype=np.float32)
# Label array
y_data = np.array(y_data, dtype=np.float32)

# Placeholders for inputs and labels
X = tf.placeholder(tf.float32, [None, 2])
Y = tf.placeholder(tf.float32, [None, 1])

# Weights for each layer
W_i = tf.Variable(tf.random_uniform([2, 5], -1.0, 1.0), name='weight_input')
W_h1 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight_hidden_1')
W_h2 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight_hidden_2')
W_h3 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight_hidden_3')
W_h4 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight_hidden_4')
W_h5 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight_hidden_5')
W_h6 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight_hidden_6')
W_h7 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight_hidden_7')
W_h8 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight_hidden_8')
W_h9 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight_hidden_9')
W_o = tf.Variable(tf.random_uniform([5, 1], -1.0, 1.0), name='weight_output')

# Biases for each layer
b_i = tf.Variable(tf.zeros([5]), name='bias_input')
b_h1 = tf.Variable(tf.zeros([5]), name='bias_hidden_1')
b_h2 = tf.Variable(tf.zeros([5]), name='bias_hidden_2')
b_h3 = tf.Variable(tf.zeros([5]), name='bias_hidden_3')
b_h4 = tf.Variable(tf.zeros([5]), name='bias_hidden_4')
b_h5 = tf.Variable(tf.zeros([5]), name='bias_hidden_5')
b_h6 = tf.Variable(tf.zeros([5]), name='bias_hidden_6')
b_h7 = tf.Variable(tf.zeros([5]), name='bias_hidden_7')
b_h8 = tf.Variable(tf.zeros([5]), name='bias_hidden_8')
b_h9 = tf.Variable(tf.zeros([5]), name='bias_hidden_9')
b_o = tf.Variable(tf.zeros([1]), name='bias_output')

# Layers
L_i = tf.sigmoid(tf.matmul(X, W_i) + b_i)
L_h1 = tf.sigmoid(tf.matmul(L_i, W_h1) + b_h1)
L_h2 = tf.sigmoid(tf.matmul(L_h1, W_h2) + b_h2)
L_h3 = tf.sigmoid(tf.matmul(L_h2, W_h3) + b_h3)
L_h4 = tf.sigmoid(tf.matmul(L_h3, W_h4) + b_h4)
L_h5 = tf.sigmoid(tf.matmul(L_h4, W_h5) + b_h5)
L_h6 = tf.sigmoid(tf.matmul(L_h5, W_h6) + b_h6)
L_h7 = tf.sigmoid(tf.matmul(L_h6, W_h7) + b_h7)
L_h8 = tf.sigmoid(tf.matmul(L_h7, W_h8) + b_h8)
L_h9 = tf.sigmoid(tf.matmul(L_h8, W_h9) + b_h9)
hypothesis = tf.sigmoid(tf.matmul(L_h9, W_o) + b_o)

# Cost function
cost = -tf.reduce_mean(Y * tf.log(hypothesis) + (1 - Y) *
                       tf.log(1 - hypothesis))

# Optimizer
train = tf.train.\
            GradientDescentOptimizer(learning_rate=learning_rate).\
            minimize(cost)

# Set threshold.
#  True if hypothesis>0.5 else False
predicted = tf.cast(hypothesis > 0.5, dtype=tf.float32)

# Accuracy
accuracy = tf.reduce_mean(tf.cast(tf.equal(predicted, Y),\
            dtype=tf.float32))

costs = []
accs = []

# Launch graph
with tf.Session() as sess:
    # Initialize TensorFlow variables
    sess.run(tf.global_variables_initializer())

    for step in range(10001):
        # Train
        sess.run(train, feed_dict={X: x_data, Y: y_data})
        _cost = sess.run(cost, feed_dict={
                X: x_data, Y: y_data})
        costs.append(_cost)
        _acc = sess.run(accuracy, feed_dict={X: x_data, Y: y_data})
        accs.append(_acc)

    h, c, a = sess.run([hypothesis, predicted, accuracy],
                       feed_dict={X: x_data, Y: y_data})
    print("\nHypothesis: ", h, "\nCorrect: ", c, "\nAccuracy: ", a)

steps = [i for i in range(len(accs))]

plt.plot(steps, costs)
plt.title("Costs")
plt.xlabel("Steps")
plt.ylabel("Cost")
plt.show()

plt.plot(steps, accs)
plt.title("Accuracies")
plt.xlabel("Steps")
plt.ylabel("Accuracy")
plt.show()
Hypothesis:  [[ 0.49999905]
 [ 0.50000137]
 [ 0.49999875]
 [ 0.50000113]] 
Correct:  [[ 0.]
 [ 1.]
 [ 0.]
 [ 1.]] 
Accuracy:  0.5
  • Its cost and accuracy over the training steps are shown below.
Image 1. Costs
Image 2. Accuracy
  • Even though the test data set is the same as the training data set, the accuracy is not 100%.
  • In the previous post, a 2-layer DNN was verified to work well for the XOR problem. However, this DNN has 9 hidden layers and fails to learn XOR.
  • This is because of the sigmoid. Stacked sigmoid layers attenuate the effect of each weight and bias along the back propagation.
  • In the graph below, S denotes a sigmoid node.

$$ \frac{\partial COST}{\partial Y} \tag{1} $$
$$ T2 \cdot \frac{\partial COST}{\partial Y} \tag{2} $$
$$ T3 \cdot \frac{\partial COST}{\partial Y} \tag{3} $$
$$ e^{-X2} \cdot T2^2 \cdot \frac{\partial COST}{\partial Y} \tag{4} $$
$$ e^{-X2} \cdot T2^2 \cdot T1 \cdot \frac{\partial COST}{\partial Y} \tag{5} $$
$$ e^{-X2} \cdot T2^2 \cdot L \cdot \frac{\partial COST}{\partial Y} \tag{6} $$
$$ e^{-X2} \cdot T2^2 \cdot L \cdot e^{-X1} \cdot L^2 \cdot \frac{\partial COST}{\partial Y} \tag{7} $$
$$ e^{-X2} \cdot T2^2 \cdot L \cdot e^{-X1} \cdot L^2 \cdot X \cdot \frac{\partial COST}{\partial Y} \tag{8} $$
$$ e^{-X2} \cdot T2^2 \cdot L \cdot e^{-X1} \cdot L^2 \cdot K \cdot \frac{\partial COST}{\partial Y} \tag{9} $$

  • The output of a sigmoid is a decimal value between 0 and 1. This means T1, T2, and T3 in the equations keep shrinking the derivative values, so the last node of the back propagation receives an update value that is almost 0.
  • Therefore, in a deep network, sigmoid effectively prevents the earlier weights and biases from being updated. This is called the Vanishing Gradient Problem. The small sketch below illustrates the shrinking numerically.
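  • As a minimal numeric sketch of this shrinking (NumPy only, with made-up pre-activation values, not taken from the network above): the derivative of the sigmoid is at most 0.25, so multiplying ten such local derivatives together already gives an almost-zero gradient.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d(sigmoid)/dx = sigmoid(x) * (1 - sigmoid(x)), at most 0.25
    s = sigmoid(x)
    return s * (1.0 - s)

# Hypothetical pre-activation values for 10 stacked sigmoid layers
pre_activations = np.linspace(-2.0, 2.0, 10)

# Back propagation multiplies one local derivative per layer
local_grads = sigmoid_grad(pre_activations)
print("local derivatives:", np.round(local_grads, 3))
print("product over 10 layers:", np.prod(local_grads))  # ~1e-8, effectively zero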

ReLU

$$ Y = \begin{cases} X & : X > 0 \\ 0 & : X \le 0 \end{cases} $$

  • If the sigmoid is replaced by ReLU, T1, T2, and T3 become X1, X2, and X3 themselves (the local derivative is 1 for positive inputs). Therefore the derivative values are not shrunk along the back propagation, as the sketch below shows.
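  • A minimal counterpart sketch (NumPy only, hypothetical positive pre-activation values): the local derivative of ReLU is exactly 1 wherever the input is positive, so the product of the local derivatives does not shrink.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # d(ReLU)/dx is 1 for x > 0 and 0 for x <= 0
    return (x > 0).astype(np.float32)

# Hypothetical positive pre-activation values for 10 stacked ReLU layers
pre_activations = np.linspace(0.5, 2.0, 10)

print("outputs:", relu(pre_activations))                # identical to the inputs
local_grads = relu_grad(pre_activations)
print("local derivatives:", local_grads)                # all 1.0
print("product over 10 layers:", np.prod(local_grads))  # 1.0, nothing vanishes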

Activation Function for Output Layer

  • However, recall that the reason sigmoid is used in the first place is to normalize its input into the range 0 to 1 - 09. Binary Classification.
  • This property makes the neuron insensitive to very high or very low inputs and lets it return the probability of the answer.
  • ReLU, on the other hand, passes any input larger than 0 through unchanged, so it does not bound large inputs. Used at the output, it can lead to wrong training and does not return a probability.
  • To prevent the network from training incorrectly, sigmoid should be used as the activation function of the output layer, while ReLU is better for the hidden layers. In that case the shrinking side effect of sigmoid happens only at the first step of the back propagation.
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# For reproducibility
tf.set_random_seed(777)

# Learning rate
learning_rate = 0.1

# Input data
x_data = [[0, 0],
          [0, 1],
          [1, 0],
          [1, 1]]
# Labels
y_data = [[0],
          [1],
          [1],
          [0]]

# Input array
x_data = np.array(x_data, dtype=np.float32)
# Label array
y_data = np.array(y_data, dtype=np.float32)

# Placeholders for inputs and labels
X = tf.placeholder(tf.float32, [None, 2])
Y = tf.placeholder(tf.float32, [None, 1])

# Weights for each layer
W_i = tf.Variable(tf.random_uniform([2, 5], -1.0, 1.0), name='weight_input')
W_h1 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight_hidden_1')
W_h2 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight_hidden_2')
W_h3 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight_hidden_3')
W_h4 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight_hidden_4')
W_h5 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight_hidden_5')
W_h6 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight_hidden_6')
W_h7 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight_hidden_7')
W_h8 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight_hidden_8')
W_h9 = tf.Variable(tf.random_uniform([5, 5], -1.0, 1.0), name='weight_hidden_9')
W_o = tf.Variable(tf.random_uniform([5, 1], -1.0, 1.0), name='weight_output')

# Biases for each layer
b_i = tf.Variable(tf.zeros([5]), name='bias_input')
b_h1 = tf.Variable(tf.zeros([5]), name='bias_hidden_1')
b_h2 = tf.Variable(tf.zeros([5]), name='bias_hidden_2')
b_h3 = tf.Variable(tf.zeros([5]), name='bias_hidden_3')
b_h4 = tf.Variable(tf.zeros([5]), name='bias_hidden_4')
b_h5 = tf.Variable(tf.zeros([5]), name='bias_hidden_5')
b_h6 = tf.Variable(tf.zeros([5]), name='bias_hidden_6')
b_h7 = tf.Variable(tf.zeros([5]), name='bias_hidden_7')
b_h8 = tf.Variable(tf.zeros([5]), name='bias_hidden_8')
b_h9 = tf.Variable(tf.zeros([5]), name='bias_hidden_9')
b_o = tf.Variable(tf.zeros([1]), name='bias_output')

# Layers
L_i = tf.nn.relu(tf.matmul(X, W_i) + b_i)
L_h1 = tf.nn.relu(tf.matmul(L_i, W_h1) + b_h1)
L_h2 = tf.nn.relu(tf.matmul(L_h1, W_h2) + b_h2)
L_h3 = tf.nn.relu(tf.matmul(L_h2, W_h3) + b_h3)
L_h4 = tf.nn.relu(tf.matmul(L_h3, W_h4) + b_h4)
L_h5 = tf.nn.relu(tf.matmul(L_h4, W_h5) + b_h5)
L_h6 = tf.nn.relu(tf.matmul(L_h5, W_h6) + b_h6)
L_h7 = tf.nn.relu(tf.matmul(L_h6, W_h7) + b_h7)
L_h8 = tf.nn.relu(tf.matmul(L_h7, W_h8) + b_h8)
L_h9 = tf.nn.relu(tf.matmul(L_h8, W_h9) + b_h9)
L_o = tf.sigmoid(tf.matmul(L_h9, W_o) + b_o)

hypothesis = L_o

# Cost function
cost = -tf.reduce_mean(Y * tf.log(hypothesis) + (1 - Y) *
                       tf.log(1 - hypothesis))

# Optimizer
train = tf.train.\
            GradientDescentOptimizer(learning_rate=learning_rate).\
            minimize(cost)

# Set threshold.
#  True if hypothesis>0.5 else False
predicted = tf.cast(hypothesis > 0.5, dtype=tf.float32)

# Accuracy
accuracy = tf.reduce_mean(tf.cast(tf.equal(predicted, Y),\
                dtype=tf.float32))

costs = []
accs = []

# Launch graph
with tf.Session() as sess:
    # Initialize TensorFlow variables
    sess.run(tf.global_variables_initializer())

    for step in range(10001):
        # Train
        sess.run(train, feed_dict={X: x_data, Y: y_data})
        _cost = sess.run(cost, feed_dict={
                X: x_data, Y: y_data})
        costs.append(_cost)
        _acc = sess.run(accuracy, feed_dict={X: x_data, Y: y_data})
        accs.append(_acc)

    h, c, a = sess.run([hypothesis, predicted, accuracy],
                       feed_dict={X: x_data, Y: y_data})
    print("\nHypothesis: ", h, "\nCorrect: ", c, "\nAccuracy: ", a)

steps = [i for i in range(len(accs))]

plt.plot(steps, costs)
plt.title("Costs")
plt.xlabel("Steps")
plt.ylabel("Cost")
plt.show()

plt.plot(steps, accs)
plt.title("Accuracies")
plt.xlabel("Steps")
plt.ylabel("Accuracy")
plt.show()
Hypothesis:  [[ 0.00202512]
 [ 0.99999821]
 [ 0.99999785]
 [ 0.00202512]] 
Correct:  [[ 0.]
 [ 1.]
 [ 1.]
 [ 0.]] 
Accuracy:  1.0
Image 3. Costs
Image 4. Accuracy
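  • As an aside (a minimal sketch, not part of the original experiments): the hand-written cross-entropy used above can hit log(0) when the sigmoid saturates. TensorFlow 1.x offers tf.nn.sigmoid_cross_entropy_with_logits, which computes the same loss from the raw logits in a numerically stable way. The snippet below reuses the L_h9, W_o, b_o, and Y tensors from the ReLU code above.
# Keep the last layer as raw logits instead of applying sigmoid by hand
logits = tf.matmul(L_h9, W_o) + b_o
hypothesis = tf.sigmoid(logits)

# Equivalent to -mean(Y*log(h) + (1-Y)*log(1-h)), but avoids log(0)
cost = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=Y, logits=logits))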

Type of Activation Functions

  • Besides sigmoid and ReLU, there are many other activation functions, such as:

    • tanh
    • Leaky ReLU
    • Maxout
    • ELU
  • The table below lists representative activation functions for the hidden and output layers by problem type (a minimal sketch of calling them in TensorFlow follows Table 1).

Problem    Hidden Layer    Output Layer
Linear     Identity        Identity
Logistic   ReLU            Sigmoid
Softmax    ReLU            Softmax
Table 1. Activation function for each layer
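  • As a minimal sketch, most of these activation functions are available as ready-made ops in TensorFlow 1.x (Maxout is not a single core op and is usually built from a reshape and a max), so switching the activation of a hidden layer only changes one function call. The placeholder and variable shapes below match the XOR network above.
import tensorflow as tf

X = tf.placeholder(tf.float32, [None, 2])
W = tf.Variable(tf.random_uniform([2, 5], -1.0, 1.0))
b = tf.Variable(tf.zeros([5]))
logits = tf.matmul(X, W) + b

# The same affine layer wrapped in different activation functions
L_sigmoid = tf.sigmoid(logits)
L_tanh = tf.tanh(logits)
L_relu = tf.nn.relu(logits)
L_leaky_relu = tf.nn.leaky_relu(logits)   # available from TF 1.4
L_elu = tf.nn.elu(logits)
L_softmax = tf.nn.softmax(logits)         # typically used for the output layer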
