What is the vanishing gradient problem in deep neural networks and which activation function helps mitigate it?

Data Science with Python — Hard

Key points

  • Gradients shrink as they are backpropagated through deep networks with sigmoid/tanh activations, because each layer multiplies them by the activation's local derivative (at most 0.25 for sigmoid, at most 1 for tanh), so the product decays with depth
  • Early layers therefore receive almost no gradient signal and learn very slowly
  • ReLU helps mitigate this: its gradient is exactly 1 for positive inputs, so the backpropagated signal is not repeatedly shrunk (see the sketch after this list)
  • Saturating activations, whose derivatives approach zero for large-magnitude inputs, exacerbate the vanishing gradient problem

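To make the multiplicative shrinkage concrete, here is a minimal NumPy sketch that backpropagates a unit gradient through a stack of dense layers and compares how much of it reaches the input under sigmoid versus ReLU. The depth of 20, width of 64, the 1/sqrt(width) initialization, and the helper name first_layer_grad_norm are illustrative assumptions, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)

def first_layer_grad_norm(activation, depth=20, width=64):
    """Run one forward/backward pass through `depth` dense layers and
    return the L2 norm of the gradient that flows back to the input."""
    x = rng.standard_normal(width)
    # Illustrative 1/sqrt(width) weight initialization.
    weights = [rng.standard_normal((width, width)) / np.sqrt(width)
               for _ in range(depth)]

    # Forward pass, caching pre-activations z for the backward pass.
    pre_acts, h = [], x
    for W in weights:
        z = W @ h
        pre_acts.append(z)
        h = np.maximum(z, 0.0) if activation == "relu" else 1.0 / (1.0 + np.exp(-z))

    # Backward pass: start from a unit upstream gradient and propagate it back,
    # multiplying by each layer's local activation derivative at every step.
    grad = np.ones(width)
    for W, z in zip(reversed(weights), reversed(pre_acts)):
        if activation == "relu":
            local = (z > 0).astype(float)   # derivative is exactly 1 for positive inputs
        else:
            s = 1.0 / (1.0 + np.exp(-z))
            local = s * (1.0 - s)           # sigmoid derivative is at most 0.25
        grad = W.T @ (grad * local)

    return np.linalg.norm(grad)

print("sigmoid:", first_layer_grad_norm("sigmoid"))  # tiny: early layers barely update
print("relu:   ", first_layer_grad_norm("relu"))     # many orders of magnitude larger
```

Running the sketch shows the sigmoid network's input-layer gradient collapsing toward zero while the ReLU network's stays many orders of magnitude larger, which is exactly the behavior the key points describe.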