Data Science with Python — Hard
Key points
- Gradients shrink as they are backpropagated through deep networks with sigmoid/tanh activations, since each layer scales the gradient by the activation's derivative (at most 0.25 for sigmoid, at most 1 for tanh)
- Early layers therefore receive near-zero gradients and learn far more slowly than later layers
- ReLU keeps a constant gradient of 1 for positive inputs, so it does not shrink the gradient layer by layer (see the sketch after this list)
- Saturating activations exacerbate the vanishing gradient problem: once a unit's pre-activation is large in magnitude, its local derivative is close to zero
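
The contrast in the list can be checked numerically. Below is a minimal NumPy sketch (not from the original material; the depth, width, and weight scale are illustrative assumptions) that backpropagates a unit gradient through a stack of dense layers and reports how large the gradient is by the time it reaches the first layer, once with sigmoid activations and once with ReLU.

```python
# Minimal sketch: compare gradient magnitudes under sigmoid vs ReLU.
# Depth, width, and weight scale below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gradient_norms(activation, depth=20, width=64):
    """Run a forward/backward pass through `depth` dense layers and
    return the gradient norm reaching each layer's input."""
    x = rng.normal(size=(width,))
    weights = [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
               for _ in range(depth)]

    # Forward pass: cache pre-activations for the backward pass.
    pre_acts, h = [], x
    for W in weights:
        z = W @ h
        pre_acts.append(z)
        h = sigmoid(z) if activation == "sigmoid" else np.maximum(z, 0.0)

    # Backward pass: start from a unit gradient at the output and
    # multiply by each layer's local activation derivative.
    grad = np.ones(width)
    norms = []
    for W, z in zip(reversed(weights), reversed(pre_acts)):
        if activation == "sigmoid":
            local = sigmoid(z) * (1.0 - sigmoid(z))   # derivative <= 0.25
        else:
            local = (z > 0).astype(float)             # derivative is 0 or 1
        grad = W.T @ (grad * local)
        norms.append(np.linalg.norm(grad))
    return norms[::-1]  # index 0 = earliest (first) layer

for act in ("sigmoid", "relu"):
    norms = gradient_norms(act)
    print(f"{act:8s} | first-layer grad norm: {norms[0]:.2e} "
          f"| last-layer grad norm: {norms[-1]:.2e}")
```

Running this, the sigmoid network's first-layer gradient norm collapses toward zero, because every layer multiplies the gradient by a derivative no larger than 0.25, while the ReLU network keeps the gradient near its original scale (it is only attenuated where units are inactive).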