What is the vanishing gradient problem in deep neural networks and which activation function helps mitigate it?

Data Science with Python — Hard

Key points

  • Gradients shrink as they are backpropagated through deep networks with sigmoid/tanh activations, because each layer multiplies them by the activation's local derivative (at most 0.25 for sigmoid, at most 1 for tanh), so the product decays with depth
  • Early layers therefore receive almost no gradient signal and learn very slowly
  • ReLU helps mitigate this: its gradient is exactly 1 for positive inputs, so the backpropagated signal is not repeatedly shrunk (see the sketch after this list)
  • Saturating activations, whose derivatives approach zero for large-magnitude inputs, exacerbate the vanishing gradient problem

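To make the multiplicative shrinkage concrete, here is a minimal NumPy sketch that backpropagates a unit gradient through a stack of dense layers and compares how much of it reaches the input under sigmoid versus ReLU. The depth of 20, width of 64, the 1/sqrt(width) initialization, and the helper name first_layer_grad_norm are illustrative assumptions, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)

def first_layer_grad_norm(activation, depth=20, width=64):
    """Run one forward/backward pass through `depth` dense layers and
    return the L2 norm of the gradient that flows back to the input."""
    x = rng.standard_normal(width)
    # Illustrative 1/sqrt(width) weight initialization.
    weights = [rng.standard_normal((width, width)) / np.sqrt(width)
               for _ in range(depth)]

    # Forward pass, caching pre-activations z for the backward pass.
    pre_acts, h = [], x
    for W in weights:
        z = W @ h
        pre_acts.append(z)
        h = np.maximum(z, 0.0) if activation == "relu" else 1.0 / (1.0 + np.exp(-z))

    # Backward pass: start from a unit upstream gradient and propagate it back,
    # multiplying by each layer's local activation derivative at every step.
    grad = np.ones(width)
    for W, z in zip(reversed(weights), reversed(pre_acts)):
        if activation == "relu":
            local = (z > 0).astype(float)   # derivative is exactly 1 for positive inputs
        else:
            s = 1.0 / (1.0 + np.exp(-z))
            local = s * (1.0 - s)           # sigmoid derivative is at most 0.25
        grad = W.T @ (grad * local)

    return np.linalg.norm(grad)

print("sigmoid:", first_layer_grad_norm("sigmoid"))  # tiny: early layers barely update
print("relu:   ", first_layer_grad_norm("relu"))     # many orders of magnitude larger
```

Running the sketch shows the sigmoid network's input-layer gradient collapsing toward zero while the ReLU network's stays many orders of magnitude larger, which is exactly the behavior the key points describe.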