What is the purpose of attention masking in Transformer models during training?

Machine Learning — Hard


Key points

  • Attention masking lets Transformers train on whole sequences in parallel without leaking information the model should not yet have seen
  • Causal (look-ahead) masking in the decoder blocks attention to future positions, so each token is predicted only from earlier tokens, matching the autoregressive setting at inference time
  • Padding masks prevent attention to padding tokens added to equalize sequence lengths within a batch (see the sketch after this list)
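The sketch below illustrates both masks in PyTorch. It is a minimal, illustrative implementation, not a reference one: the function names (`build_attention_mask`, `masked_attention`), the `pad_positions` tensor, and the tensor shapes are assumptions chosen for clarity. The key idea is that disallowed positions are set to negative infinity before the softmax, so they receive exactly zero attention weight.

```python
import torch

def build_attention_mask(seq_len: int, pad_positions: torch.Tensor) -> torch.Tensor:
    # pad_positions: (batch, seq_len) bool, True where a token is padding.
    # Causal part: query position i may attend only to key positions j <= i.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Padding part: no query may attend to a padded key.
    key_ok = ~pad_positions                       # (batch, seq_len)
    # Broadcast to (batch, query, key): both conditions must hold.
    return causal.unsqueeze(0) & key_ok.unsqueeze(1)

def masked_attention(q, k, v, allowed):
    # Scaled dot-product attention. Disallowed positions get -inf before
    # the softmax, so their attention weight is exactly zero.
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~allowed, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

A small usage example, again with made-up shapes:

```python
batch, seq_len, d_model = 2, 5, 8
q = k = v = torch.randn(batch, seq_len, d_model)
pad = torch.zeros(batch, seq_len, dtype=torch.bool)
pad[0, 3:] = True   # pretend the first sequence is right-padded by 2 tokens
out = masked_attention(q, k, v, build_attention_mask(seq_len, pad))
```

Because the full (batch, query, key) mask combines both conditions, a single `masked_fill` handles causal and padding constraints at once; during training, the whole target sequence can then be processed in one parallel pass without any token seeing its own future.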
