Machine Learning — Hard
Key points
- Attention masking in Transformers is essential for autoregressive training, since the model must predict each token using only earlier context
- Causal masking in the decoder prevents future positions from influencing the current output (see the sketch after this list)
- Padding masks prevent attention to padded positions, which carry no information and would otherwise dilute the attention weights
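A minimal NumPy sketch of both ideas, assuming single-head attention over one sequence (the function name masked_attention and the -1e9 fill value are illustrative choices, not a specific library's API): future key positions are blocked with an upper-triangular causal mask, and padded key positions are blocked with a boolean padding mask, both applied to the scores before the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(q, k, v, pad_mask=None, causal=True):
    """Scaled dot-product attention with optional causal and padding masks.

    q, k, v: (seq_len, d) arrays.
    pad_mask: boolean (seq_len,) array, True for real tokens, False for padding.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)  # (seq_len, seq_len) similarity scores

    if causal:
        # Strict upper triangle marks "future" key positions; block them
        # so position i can only attend to positions <= i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)

    if pad_mask is not None:
        # Block attention *to* padded key positions for every query.
        scores = np.where(pad_mask[None, :], scores, -1e9)

    weights = softmax(scores, axis=-1)  # rows sum to 1 over allowed keys
    return weights @ v, weights

# Toy example: 4 positions, the last one is padding.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))
pad_mask = np.array([True, True, True, False])

out, w = masked_attention(q, k, v, pad_mask=pad_mask)
print(np.round(w, 3))  # upper triangle and the padded key column are ~0
```

Using a large negative fill before the softmax (rather than zeroing weights afterwards) keeps each row a proper probability distribution over the positions that remain visible.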