A tutorial on the core math behind the multi-headed attention transformer block
GPT, like other LLMs and LMMs, is an amazing advance in AI. But how do they work? They all use an ML model called a transformer. Transformers allow AI to learn the complex relationships between tokens in the training data; in other words, to learn the semantics, grammar, and even the underlying knowledge encoded in natural language and images.
This tutorial focuses on the core math that makes a transformer block work, using multi-headed attention together with position and token embeddings.
Both the descriptive explanations and the code samples in this tutorial were generated by ChatGPT. In some cases the initial code contained minor errors; these were fixed by feeding the errors back into GPT-4, which then generated corrected code.
This is an advanced tutorial that builds the main components of the Transformer model, the multi-headed attention mechanism and the position and token embeddings, from scratch in PyTorch.
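As a rough preview of the two components the tutorial builds, the sketch below shows a token-plus-position embedding layer and a single multi-headed self-attention layer in PyTorch. This is a minimal illustration rather than the tutorial's actual code; the class names and hyperparameters (vocab_size, max_len, d_model, n_heads) are assumptions chosen for the example.

```python
# Minimal sketch of token + position embeddings and multi-headed self-attention.
# Names and hyperparameters are illustrative, not the tutorial's final code.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenAndPositionEmbedding(nn.Module):
    def __init__(self, vocab_size, max_len, d_model):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)  # one vector per token id
        self.pos_emb = nn.Embedding(max_len, d_model)        # one vector per position

    def forward(self, token_ids):                             # (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # Summing token and position embeddings gives each token a position-aware vector.
        return self.token_emb(token_ids) + self.pos_emb(positions)

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # project input to queries, keys, values
        self.out = nn.Linear(d_model, d_model)      # recombine the heads

    def forward(self, x):                            # (batch, seq_len, d_model)
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split the model dimension into n_heads independent attention heads.
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention: softmax(Q K^T / sqrt(d_head)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = F.softmax(scores, dim=-1)
        context = weights @ v                                   # (b, n_heads, t, d_head)
        context = context.transpose(1, 2).reshape(b, t, d)      # merge heads back together
        return self.out(context)

# Usage: embed a toy batch of token ids, then apply self-attention.
emb = TokenAndPositionEmbedding(vocab_size=100, max_len=16, d_model=32)
attn = MultiHeadSelfAttention(d_model=32, n_heads=4)
tokens = torch.randint(0, 100, (2, 10))        # batch of 2 sequences, length 10
print(attn(emb(tokens)).shape)                 # torch.Size([2, 10, 32])
```

The tutorial works through this math step by step, so the compressed version above is only meant to show how the embedding and attention pieces fit together.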