A hands-on workshop where you write every piece of a GPT training pipeline yourself, understanding what each component does and why.
Andrej Karpathy's nanoGPT was my first real exposure to LLMs and transformers. Seeing how a working language model could be built in a few hundred lines of PyTorch completely changed how I thought about AI and inspired me to go deeper into the space.
This workshop is my attempt to give others that same experience. nanoGPT targets reproducing GPT-2 (124M params) and covers a lot of ground. This project strips it down to the essentials and scales it to a ~10M param model that trains on a laptop in under an hour, designed to be completed in a single workshop session.
A working GPT model trained from scratch on your MacBook, capable of generating Shakespeare-like text. You'll write:
- a character-level tokenizer
- model.py: the full GPT architecture
- train.py: the complete training pipeline
- generate.py: inference and sampling
Training automatically uses the Apple Silicon GPU (MPS), an NVIDIA GPU (CUDA), or the CPU. Also works on Google Colab: upload the files and run with !python train.py.
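For reference, a minimal sketch of that kind of automatic device selection (pick_device is my name; train.py's actual logic may differ):

```python
import torch

def pick_device() -> str:
    # Prefer CUDA, then Apple's MPS backend, then fall back to CPU.
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()
print(f"training on {device}")
```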
Install uv if you don't have it:
```bash
# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```
Then set up the project:
```bash
uv sync
mkdir scratchpad && cd scratchpad
```
If you don't have a local setup, upload the repo to Colab and install dependencies:
```bash
!pip install torch numpy tqdm tiktoken
```
Upload data/shakespeare.txt to your Colab files, then write your code in notebook cells or upload .py files and run them with !python train.py.
Work through the docs in order. Each part walks you through writing a piece of the pipeline, explaining what each component does and why. By the end, you'll have a working model.py, train.py, and generate.py that you wrote yourself.
| Part | What You'll Write | Concepts |
|---|---|---|
| Part 1: Tokenization | Character-level tokenizer | Character encoding, vocabulary size, why BPE fails on small data |
| Part 2: The Transformer | Full GPT model architecture | Embeddings, self-attention, layer norm, MLP blocks |
| Part 3: The Training Loop | Complete training pipeline (sketched after this table) | Loss functions, AdamW, gradient clipping, LR scheduling |
| Part 4: Text Generation | Inference and sampling (sketched after this table) | Temperature, top-k, autoregressive decoding |
| Part 5: Putting It All Together | Train on real data, experiment | Loss curves, scaling experiments, next steps |
| Part 6: Competition | Train the best AI poet | Find datasets, scale up, submit your best poem |
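As a preview of Part 3, here's a hedged sketch of the training loop's shape: AdamW, gradient clipping, and a warmup-plus-cosine LR schedule. The toy model and random batches are stand-ins so the sketch runs on its own; in the workshop they come from your model.py and data loader:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins so the sketch runs end to end (not the workshop's real model).
vocab_size, block_size, batch_size, max_iters = 65, 32, 8, 200
model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))

def get_batch():
    x = torch.randint(vocab_size, (batch_size, block_size))
    y = torch.randint(vocab_size, (batch_size, block_size))
    return x, y

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

def lr_at(step, max_steps, max_lr=3e-4, min_lr=3e-5, warmup=20):
    # Linear warmup, then cosine decay (one common schedule).
    if step < warmup:
        return max_lr * (step + 1) / warmup
    t = (step - warmup) / max(1, max_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

for step in range(max_iters):
    for g in optimizer.param_groups:
        g["lr"] = lr_at(step, max_iters)            # LR scheduling
    xb, yb = get_batch()
    logits = model(xb)                              # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
    optimizer.step()
```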
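Likewise for Part 4, a hedged sketch of temperature and top-k sampling (sample_next is my name; the real generate.py may structure this differently):

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=0.8, top_k=40):
    # logits: (vocab_size,) raw scores for the next token.
    logits = logits / temperature              # <1 sharpens, >1 flattens
    if top_k is not None:
        v, _ = torch.topk(logits, top_k)
        # Drop everything below the k-th best score.
        logits = logits.masked_fill(logits < v[-1], float("-inf"))
    probs = F.softmax(logits, dim=-1)          # probability over next token
    return torch.multinomial(probs, num_samples=1)

next_id = sample_next(torch.randn(65))         # 65 = character vocab size
```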
```
Input Text
         │
         ▼
┌─────────────────┐
│    Tokenizer    │  "hello" → [20, 43, 50, 50, 53]  (character-level)
└────────┬────────┘
         ▼
┌─────────────────┐
│  Token Embed +  │  token IDs → vectors (n_embd dimensions)
│  Position Embed │  + positional information
└────────┬────────┘
         ▼
┌─────────────────┐
│  Transformer    │  × n_layer
│  Block:         │
│  ┌───────────┐  │
│  │ LayerNorm │  │
│  │ Self-Attn │  │  n_head parallel attention heads
│  │ + Residual│  │
│  ├───────────┤  │
│  │ LayerNorm │  │
│  │ MLP (FFN) │  │  expand 4x, GELU, project back
│  │ + Residual│  │
│  └───────────┘  │
└────────┬────────┘
         ▼
┌─────────────────┐
│    LayerNorm    │
│     Linear      │  logits → vocab_size outputs (probability over next token)
└─────────────────┘
```
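To connect the diagram to code, here's a hedged sketch of one pre-norm block. It leans on nn.MultiheadAttention as a shortcut (in Part 2 you write the attention math yourself), and Block is my naming, not necessarily model.py's:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm transformer block: LN -> self-attention -> residual,
    then LN -> MLP (expand 4x, GELU, project back) -> residual."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # expand 4x
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),   # project back
        )

    def forward(self, x):
        T = x.size(1)
        # Causal mask: position t may only attend to positions <= t.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                     # residual around attention
        x = x + self.mlp(self.ln2(x))        # residual around MLP
        return x

out = Block(n_embd=384, n_head=6)(torch.randn(1, 16, 384))  # (B, T, n_embd)
```

The pre-norm layout (LayerNorm before each sublayer, residual around it) is what the diagram shows and what GPT-2-style models use.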
| Config | Params | n_layer | n_head | n_embd | Train Time (M3 Pro) |
|---|---|---|---|---|---|
| Tiny | ~0.5M | 2 | 2 | 128 | ~5 min |
| Small | ~4M | 4 | 4 | 256 | ~20 min |
| Medium (default) | ~10M | 6 | 6 | 384 | ~45 min |
All configs use character-level tokenization (vocab_size=65) and block_size=256.
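For a rough sanity check on those counts, here's my back-of-the-envelope estimate (not the repo's code): each transformer block carries roughly 12·n_embd² weights (about 4·n_embd² in attention and 8·n_embd² in the MLP), plus the two embedding tables:

```python
def approx_params(n_layer, n_embd, vocab_size=65, block_size=256):
    blocks = 12 * n_layer * n_embd ** 2          # ~4x attn + ~8x MLP weights
    embeds = (vocab_size + block_size) * n_embd  # token + position tables
    return blocks + embeds

for name, (nl, ne) in {"Tiny": (2, 128), "Small": (4, 256), "Medium": (6, 384)}.items():
    print(f"{name}: ~{approx_params(nl, ne) / 1e6:.1f}M")
# Tiny: ~0.4M, Small: ~3.2M, Medium: ~10.7M. Biases, layernorms, and the
# output head nudge the real totals, so the table's figures differ slightly.
```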
This workshop uses character-level tokenization on Shakespeare. BPE tokenization (GPT-2's 50k vocab) doesn't work on small datasets: most token bigrams are too rare for the model to learn patterns from.
| Tokenizer | Vocab Size | Dataset Size Needed |
|---|---|---|
| Character-level | ~65 | Small (Shakespeare, ~1MB) |
| BPE (tiktoken) | 50,257 | Large (TinyStories+, 100MB+) |
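The character-level side is small enough to sketch whole. Something like what you'll build in Part 1 (variable names are mine):

```python
# Build the vocabulary from every distinct character in the corpus.
text = open("data/shakespeare.txt").read()
chars = sorted(set(text))                      # ~65 unique characters
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
itos = {i: ch for ch, i in stoi.items()}       # id -> char

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

print(len(chars))               # vocab_size
print(decode(encode("hello")))  # round-trips to "hello"
```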
Part 5 covers switching to BPE for larger datasets.
The engineering was horrible and very ad hoc, but I learned a lot. Results were ok-ish (I classified tweets), but it gave me a good perspective on the sheer GPU power (and engineering challenges) one would need to do this seriously. I didn't fully grasp the potential of generating output, but I spent quite some time chuckling at generated tweets (I was just curious to try it).
But that's just me. I think it's more useful to understand the how and why before training an LLM.
[0] https://github.com/rasbt/LLMs-from-scratch
[1] https://www.manning.com/books/build-a-large-language-model-f...
[2] https://magazine.sebastianraschka.com/p/coding-llms-from-the...
A series of Jupyter notebooks explaining the whole machine-learning machinery, from the beginning:
https://github.com/nickyreinert/DeepLearning-with-PyTorch-fr...
and, of course, also how to build an LLM from scratch:
https://github.com/nickyreinert/basic-llm-with-pytorch/blob/...
I doubt you have a machine big enough to make it "Large".
And it's paired with 48 processor cores! I mean, they don't even support AVX512 but they can do math!
I could totally train an LLM! Or at least my family could... might need my kid to pick up and carry on the project.
But in all seriousness... you either missed the point, are being needlessly pedantic, or are... wrong?
This is about learning concepts, and the rest of this is mostly moot.
On the pedantic-or-wrong note: what is the documented cutoff for a "large" language model? Because GPT-2 was and is described as a "large" language model, and it had 1.5B parameters. You can just about get a consumer GPU capable of training that for about $400 these days.
I'm not saying it's worth it, but you don't need to buy a GPU yourself to be able to train.
Runs on a Blackwell 6000 Max-Q, using 86GB of VRAM. Training supposedly takes 3h40m.
In my own very humble opinion, it becomes "Large" when it's beyond non-specialized hardware. So currently, a model that requires more than 32GB of VRAM is large (that's roughly where the high-end gaming GPUs cut off).
And btw, there is no way you can train a language model on a CPU, even with DDR5, unless you're prepared to wait a whole week for a single training run. Give it a go! I know I did; it's an order of magnitude away from being feasible.
And no one is stopping anyone from tweaking a few parameters in this repo to go above 10M parameters.