What is ‘multi-head attention’ in the transformer architecture?

AI Fundamentals — Hard

Key points

  • Multi-head attention runs several attention operations ("heads") in parallel over the same input
  • Each head applies its own learned query, key, and value projections
  • This lets the model attend to information from different representation subspaces at different positions
  • Concatenating the head outputs and applying a final output projection improves the model's ability to capture complex relationships (a sketch follows this list)
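
A minimal NumPy sketch of the idea, assuming single-sequence self-attention with no masking; the function name, weight shapes, and random toy values are illustrative rather than taken from any particular library:

    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax along the chosen axis.
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_attention(x, weights, num_heads):
        """Self-attention with several heads, each using its own projections.

        x: (seq_len, d_model) input sequence.
        weights: 'W_q', 'W_k', 'W_v' of shape (num_heads, d_model, d_k)
                 and 'W_o' of shape (num_heads * d_k, d_model).
        """
        seq_len, d_model = x.shape
        d_k = d_model // num_heads

        head_outputs = []
        for h in range(num_heads):
            # Each head projects the same input into its own subspace.
            Q = x @ weights['W_q'][h]          # (seq_len, d_k)
            K = x @ weights['W_k'][h]          # (seq_len, d_k)
            V = x @ weights['W_v'][h]          # (seq_len, d_k)

            # Scaled dot-product attention within this head.
            scores = Q @ K.T / np.sqrt(d_k)    # (seq_len, seq_len)
            attn = softmax(scores, axis=-1)
            head_outputs.append(attn @ V)      # (seq_len, d_k)

        # Concatenate the heads and mix them with the output projection.
        concat = np.concatenate(head_outputs, axis=-1)  # (seq_len, num_heads * d_k)
        return concat @ weights['W_o']                   # (seq_len, d_model)

    # Toy usage with random weights (illustrative values only).
    rng = np.random.default_rng(0)
    seq_len, d_model, num_heads = 5, 16, 4
    d_k = d_model // num_heads
    weights = {
        'W_q': rng.normal(size=(num_heads, d_model, d_k)),
        'W_k': rng.normal(size=(num_heads, d_model, d_k)),
        'W_v': rng.normal(size=(num_heads, d_model, d_k)),
        'W_o': rng.normal(size=(num_heads * d_k, d_model)),
    }
    x = rng.normal(size=(seq_len, d_model))
    print(multi_head_attention(x, weights, num_heads).shape)  # (5, 16)

In real implementations the per-head projections are usually fused into single matrix multiplies for efficiency, but the explicit per-head loop above makes the "separate learned subspaces" idea easier to see.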
