
Activation Functions

Neuron Model

A neuron in an artificial neural network sums the products of its inputs and their associated weights (also called connection weights) and produces an output by passing that sum through an activation/transfer function. The activation function maps the output to a desired range, such as 0 to 1, -1 to 1, or 0 to infinity. This output then acts as an input to the neurons of the next layer.


For input x0, the connection weight is w0. Inside the cell, the first part computes the sum of the products of the inputs and their associated weights, and the second part applies a transfer/activation function to that sum.
There is no activation at the input unit of a neuron, or we can say it is the identity function (f(x) = x).

All the neurons of a particular layer use the same activation function, but different layers can use different activation functions. For example, a classification network generally uses sigmoid/ReLU activations in the intermediate layers and softmax at the output layer.
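As an illustration, here is a minimal NumPy sketch of a single neuron's forward pass; the names (neuron_forward, weights, bias) are placeholders for this example, and sigmoid is used as the activation just for concreteness.

    import numpy as np

    def sigmoid(z):
        # squashes any real number to a value between 0 and 1
        return 1.0 / (1.0 + np.exp(-z))

    def neuron_forward(x, weights, bias, activation=sigmoid):
        # first half: sum of the products of inputs and connection weights, plus bias
        z = np.dot(weights, x) + bias
        # second half: apply the transfer/activation function to that sum
        return activation(z)

    x = np.array([0.5, -1.2, 3.0])   # inputs x0, x1, x2
    w = np.array([0.4, 0.1, -0.7])   # connection weights w0, w1, w2
    b = 0.2
    print(neuron_forward(x, w, b))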

Why do we need activation functions?


The one-line answer is: "to add non-linearity".

We want our neural networks to be universal function approximators, meaning they can learn any complex function that maps inputs to outputs. If we do not apply a non-linear activation function, the final output is simply a linear function of the input, which cannot capture the mapping for complex, high-dimensional data such as images, videos, or speech. A non-linear function is therefore needed for universal function approximation.
The universal approximation theorem states that, under certain conditions, for any continuous function f: [0,1]^d → R and any ε > 0, there exists a neural network with one hidden layer and a sufficiently large number of hidden units m that approximates f on [0,1]^d uniformly to within ε. One of the conditions for the theorem to hold is that the network is a composition of non-linear activation functions: if only linear functions are used, the theorem no longer applies.

What if we add more layers without using any non-linear function?


A neural network with linear activation functions and n layers, each with some m hidden units, is equivalent to a linear network without any hidden layers, because the composition of any number of linear functions is itself a linear function. This can be shown with the equations below.

Y = h(x)

  = b_n + W_n(b_{n-1} + W_{n-1}(…(b_1 + W_1 x)…))

  = b_n + W_n b_{n-1} + W_n W_{n-1} b_{n-2} + … + W_n W_{n-1} … W_2 b_1 + W_n W_{n-1} … W_1 x

  = b' + W' x

where W' = W_n W_{n-1} … W_1 and b' collects all the bias terms, so the whole network reduces to a single linear layer.

Hence adding more layers ("going deep") does not help a linear neural network at all, unlike a non-linear neural network.
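As a quick numerical sanity check (a sketch using NumPy with random weights; the layer sizes are arbitrary), stacking three purely linear layers gives exactly the same outputs as the single collapsed layer derived above:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=4)                         # an arbitrary input vector

    # three layers with purely linear (identity) activations
    W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
    W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)
    W3, b3 = rng.normal(size=(2, 3)), rng.normal(size=2)

    # forward pass through the three linear layers
    y_deep = b3 + W3 @ (b2 + W2 @ (b1 + W1 @ x))

    # equivalent single layer: W' = W3 W2 W1, b' collects the bias terms
    W_prime = W3 @ W2 @ W1
    b_prime = b3 + W3 @ b2 + W3 @ W2 @ b1
    y_shallow = b_prime + W_prime @ x

    print(np.allclose(y_deep, y_shallow))          # True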

Therefore, we need to apply a non-linear activation function so that the network is powerful enough to represent complex non-linear mappings between inputs and outputs for any data.

Commonly used Activation Functions:


Sigmoid



  • f(x) = 1 / (1 + exp(-x))
  • It squashes the input into the range [0, 1].
  • Derivative: z(1 - z), where z = f(x) (see the sketch below).
  • Historically popular, since it has a nice interpretation as the saturating firing rate of a neuron.
  • exp() is somewhat expensive to compute.
  • Saturated neurons kill the gradient.
  • Sigmoid outputs are not zero-centered.
  • Slow convergence.
  • Vanishing gradient problem.
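A minimal NumPy sketch of the sigmoid and its derivative (the derivative is written in terms of the output z = f(x), as noted above):

    import numpy as np

    def sigmoid(x):
        # f(x) = 1 / (1 + exp(-x)), output between 0 and 1
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        # derivative z * (1 - z), where z = sigmoid(x)
        z = sigmoid(x)
        return z * (1.0 - z)

    x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
    print(sigmoid(x))        # saturates towards 0 and 1 at the extremes
    print(sigmoid_grad(x))   # gradient is close to 0 for saturated inputs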

Hyperbolic Tangent Function (tanh)

  • f(x) = tanh(x)
  • Squashes the input into the range [-1, 1].
  • Derivative: 1 - z², where z = f(x) (see the sketch below).
  • Zero-centered output.
  • exp() is somewhat expensive to compute.
  • Saturated neurons kill the gradient.
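The same kind of sketch for tanh and its derivative 1 - z², where z = tanh(x) (NumPy already provides np.tanh):

    import numpy as np

    def tanh_grad(x):
        # derivative 1 - z^2, where z = tanh(x)
        z = np.tanh(x)
        return 1.0 - z ** 2

    x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
    print(np.tanh(x))      # zero-centered, saturates towards -1 and 1
    print(tanh_grad(x))    # gradient vanishes for saturated inputs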

Rectified Linear Unit (ReLU)

  • f(x) = max(0, x)
  • Derivative: 1 if x > 0, else 0 (see the sketch below).
  • Range: [0, ∞)
  • Does not saturate (in the positive region).
  • Computationally very efficient.
  • Converges much faster than sigmoid/tanh.
  • Not zero-centered.
  • Neurons can sometimes "die": if a neuron's pre-activation is always negative, its gradient is always zero and it stops learning.
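A minimal sketch of ReLU and its derivative:

    import numpy as np

    def relu(x):
        # f(x) = max(0, x)
        return np.maximum(0.0, x)

    def relu_grad(x):
        # derivative is 1 for x > 0, and 0 otherwise
        return (x > 0).astype(float)

    x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
    print(relu(x))       # negative inputs are clipped to 0
    print(relu_grad(x))  # gradient is 0 for x <= 0, which is why neurons can "die"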

Leaky ReLU

  • f(x) = max(0.01x, x)
  • All the benefits of ReLU.
  • Does not die, since the gradient is non-zero for negative inputs (see the sketch below).
  • Parametric ReLU (PReLU): f(x) = max(ax, x), where a is a learnable parameter.
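And a sketch of Leaky ReLU; with a learnable slope a, the same form becomes PReLU (the 0.01 default is just the commonly quoted leak value):

    import numpy as np

    def leaky_relu(x, a=0.01):
        # f(x) = max(a * x, x); with a learnable `a` this is PReLU
        return np.maximum(a * x, x)

    def leaky_relu_grad(x, a=0.01):
        # derivative is 1 for x > 0 and a otherwise, so it never vanishes completely
        return np.where(x > 0, 1.0, a)

    x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
    print(leaky_relu(x))
    print(leaky_relu_grad(x))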

Quick Summary 

  • Use ReLU, but be careful about your learning rate.
  • Try out Leaky ReLU/ELU.
  • You can try tanh, but don't expect much.
  • Don't use sigmoid in intermediate layers.

