AI Fundamentals — Hard

What is ‘sparse attention’ and why was it developed as an alternative to full self-attention?

Key points

  • Sparse attention lets each token attend to only a selected subset of positions (e.g., a local window, strided positions, or a few global tokens) rather than every other token
  • This cuts the O(n²) compute and memory cost of full self-attention, making long sequences tractable
  • It serves as a drop-in alternative to full self-attention inside transformer layers (a minimal sketch follows below)

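Below is a minimal NumPy sketch of one common sparse pattern, a sliding (local) window. The function name, window size, and shapes are illustrative assumptions, not taken from any particular library; the dense mask is used only for clarity, whereas real implementations compute just the unmasked entries to actually realize the savings.

```python
# Minimal sketch of sliding-window ("local") sparse attention in NumPy.
# Illustrative only: names, window size, and shapes are assumptions.
import numpy as np

def sparse_local_attention(q, k, v, window=2):
    """Each query attends only to keys within `window` positions of itself,
    so the useful work per query is O(window) instead of O(sequence length)."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)          # (n, n) raw attention scores

    # Band mask: True where |i - j| <= window, False elsewhere.
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window

    # Positions outside the window get -inf so softmax gives them zero weight.
    scores = np.where(mask, scores, -np.inf)

    # Row-wise softmax over the allowed positions only.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                      # (n, d) attended values

# Tiny usage example on random data.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4))
k = rng.normal(size=(8, 4))
v = rng.normal(size=(8, 4))
out = sparse_local_attention(q, k, v, window=2)
print(out.shape)  # (8, 4)
```

Note that this sketch still builds the full n × n score matrix for simplicity; production sparse-attention kernels materialize only the in-window entries, which is where the memory and compute reduction comes from.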