What is ‘multi-head attention’ in the transformer architecture?

AI Fundamentals — Hard

Key points

  • Multi-head attention runs several attention operations ("heads") in parallel over the same input
  • Each head applies its own learned query, key, and value projections
  • This lets the model attend to information from different representation subspaces at different positions
  • Concatenating the head outputs and applying a final output projection improves the model's ability to capture complex relationships (a sketch follows this list)
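
A minimal NumPy sketch of the idea, assuming single-sequence self-attention with no masking; the function name, weight shapes, and random toy values are illustrative rather than taken from any particular library:

    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax along the chosen axis.
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_attention(x, weights, num_heads):
        """Self-attention with several heads, each using its own projections.

        x: (seq_len, d_model) input sequence.
        weights: 'W_q', 'W_k', 'W_v' of shape (num_heads, d_model, d_k)
                 and 'W_o' of shape (num_heads * d_k, d_model).
        """
        seq_len, d_model = x.shape
        d_k = d_model // num_heads

        head_outputs = []
        for h in range(num_heads):
            # Each head projects the same input into its own subspace.
            Q = x @ weights['W_q'][h]          # (seq_len, d_k)
            K = x @ weights['W_k'][h]          # (seq_len, d_k)
            V = x @ weights['W_v'][h]          # (seq_len, d_k)

            # Scaled dot-product attention within this head.
            scores = Q @ K.T / np.sqrt(d_k)    # (seq_len, seq_len)
            attn = softmax(scores, axis=-1)
            head_outputs.append(attn @ V)      # (seq_len, d_k)

        # Concatenate the heads and mix them with the output projection.
        concat = np.concatenate(head_outputs, axis=-1)  # (seq_len, num_heads * d_k)
        return concat @ weights['W_o']                   # (seq_len, d_model)

    # Toy usage with random weights (illustrative values only).
    rng = np.random.default_rng(0)
    seq_len, d_model, num_heads = 5, 16, 4
    d_k = d_model // num_heads
    weights = {
        'W_q': rng.normal(size=(num_heads, d_model, d_k)),
        'W_k': rng.normal(size=(num_heads, d_model, d_k)),
        'W_v': rng.normal(size=(num_heads, d_model, d_k)),
        'W_o': rng.normal(size=(num_heads * d_k, d_model)),
    }
    x = rng.normal(size=(seq_len, d_model))
    print(multi_head_attention(x, weights, num_heads).shape)  # (5, 16)

In real implementations the per-head projections are usually fused into single matrix multiplies for efficiency, but the explicit per-head loop above makes the "separate learned subspaces" idea easier to see.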
