A hands-on workshop where you write every piece of a GPT training pipeline yourself, understanding what each component does and why.
Andrej Karpathy's nanoGPT was my first real exposure to LLMs and transformers. Seeing how a working language model could be built in a few hundred lines of PyTorch completely changed how I thought about AI and inspired me to go deeper into the space.
This workshop is my attempt to give others that same experience. nanoGPT targets reproducing GPT-2 (124M params) and covers a lot of ground. This project strips it down to the essentials and scales it to a ~10M param model that trains on a laptop in under an hour, designed to be completed in a single workshop session.
A working GPT model trained from scratch on your MacBook, capable of generating Shakespeare-like text. You'll write:
- a character-level tokenizer
- model.py: the full GPT architecture
- train.py: the complete training pipeline
- generate.py: inference and sampling
Training automatically uses the Apple Silicon GPU (MPS), an NVIDIA GPU (CUDA), or the CPU. Also works on Google Colab: upload the files and run with !python train.py.
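For reference, a minimal sketch of that kind of automatic device selection (pick_device is my name; train.py's actual logic may differ):

```python
import torch

def pick_device() -> str:
    # Prefer CUDA, then Apple's MPS backend, then fall back to CPU.
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()
print(f"training on {device}")
```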
Install uv if you don't have it:
```bash
# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```
Then set up the project:
```bash
uv sync
mkdir scratchpad && cd scratchpad
```
If you don't have a local setup, upload the repo to Colab and install dependencies:
```bash
!pip install torch numpy tqdm tiktoken
```
Upload data/shakespeare.txt to your Colab files, then write your code in notebook cells or upload .py files and run them with !python train.py.
Work through the docs in order. Each part walks you through writing a piece of the pipeline, explaining what each component does and why. By the end, you'll have a working model.py, train.py, and generate.py that you wrote yourself.
| Part | What You'll Write | Concepts |
|---|---|---|
| Part 1: Tokenization | Character-level tokenizer | Character encoding, vocabulary size, why BPE fails on small data |
| Part 2: The Transformer | Full GPT model architecture | Embeddings, self-attention, layer norm, MLP blocks |
| Part 3: The Training Loop | Complete training pipeline (sketched after this table) | Loss functions, AdamW, gradient clipping, LR scheduling |
| Part 4: Text Generation | Inference and sampling (sketched after this table) | Temperature, top-k, autoregressive decoding |
| Part 5: Putting It All Together | Train on real data, experiment | Loss curves, scaling experiments, next steps |
| Part 6: Competition | Train the best AI poet | Find datasets, scale up, submit your best poem |
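As a preview of Part 3, here's a hedged sketch of the training loop's shape: AdamW, gradient clipping, and a warmup-plus-cosine LR schedule. The toy model and random batches are stand-ins so the sketch runs on its own; in the workshop they come from your model.py and data loader:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins so the sketch runs end to end (not the workshop's real model).
vocab_size, block_size, batch_size, max_iters = 65, 32, 8, 200
model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))

def get_batch():
    x = torch.randint(vocab_size, (batch_size, block_size))
    y = torch.randint(vocab_size, (batch_size, block_size))
    return x, y

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

def lr_at(step, max_steps, max_lr=3e-4, min_lr=3e-5, warmup=20):
    # Linear warmup, then cosine decay (one common schedule).
    if step < warmup:
        return max_lr * (step + 1) / warmup
    t = (step - warmup) / max(1, max_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

for step in range(max_iters):
    for g in optimizer.param_groups:
        g["lr"] = lr_at(step, max_iters)            # LR scheduling
    xb, yb = get_batch()
    logits = model(xb)                              # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
    optimizer.step()
```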
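Likewise for Part 4, a hedged sketch of temperature and top-k sampling (sample_next is my name; the real generate.py may structure this differently):

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=0.8, top_k=40):
    # logits: (vocab_size,) raw scores for the next token.
    logits = logits / temperature              # <1 sharpens, >1 flattens
    if top_k is not None:
        v, _ = torch.topk(logits, top_k)
        # Drop everything below the k-th best score.
        logits = logits.masked_fill(logits < v[-1], float("-inf"))
    probs = F.softmax(logits, dim=-1)          # probability over next token
    return torch.multinomial(probs, num_samples=1)

next_id = sample_next(torch.randn(65))         # 65 = character vocab size
```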
```
Input Text
         │
         ▼
┌─────────────────┐
│    Tokenizer    │  "hello" → [20, 43, 50, 50, 53]  (character-level)
└────────┬────────┘
         ▼
┌─────────────────┐
│  Token Embed +  │  token IDs → vectors (n_embd dimensions)
│  Position Embed │  + positional information
└────────┬────────┘
         ▼
┌─────────────────┐
│  Transformer    │  × n_layer
│  Block:         │
│  ┌───────────┐  │
│  │ LayerNorm │  │
│  │ Self-Attn │  │  n_head parallel attention heads
│  │ + Residual│  │
│  ├───────────┤  │
│  │ LayerNorm │  │
│  │ MLP (FFN) │  │  expand 4x, GELU, project back
│  │ + Residual│  │
│  └───────────┘  │
└────────┬────────┘
         ▼
┌─────────────────┐
│    LayerNorm    │
│     Linear      │  logits → vocab_size outputs (probability over next token)
└─────────────────┘
```
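To connect the diagram to code, here's a hedged sketch of one pre-norm block. It leans on nn.MultiheadAttention as a shortcut (in Part 2 you write the attention math yourself), and Block is my naming, not necessarily model.py's:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm transformer block: LN -> self-attention -> residual,
    then LN -> MLP (expand 4x, GELU, project back) -> residual."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # expand 4x
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),   # project back
        )

    def forward(self, x):
        T = x.size(1)
        # Causal mask: position t may only attend to positions <= t.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                     # residual around attention
        x = x + self.mlp(self.ln2(x))        # residual around MLP
        return x

out = Block(n_embd=384, n_head=6)(torch.randn(1, 16, 384))  # (B, T, n_embd)
```

The pre-norm layout (LayerNorm before each sublayer, residual around it) is what the diagram shows and what GPT-2-style models use.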
| Config | Params | n_layer | n_head | n_embd | Train Time (M3 Pro) |
|---|---|---|---|---|---|
| Tiny | ~0.5M | 2 | 2 | 128 | ~5 min |
| Small | ~4M | 4 | 4 | 256 | ~20 min |
| Medium (default) | ~10M | 6 | 6 | 384 | ~45 min |
All configs use character-level tokenization (vocab_size=65) and block_size=256.
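For a rough sanity check on those counts, here's my back-of-the-envelope estimate (not the repo's code): each transformer block carries roughly 12·n_embd² weights (about 4·n_embd² in attention and 8·n_embd² in the MLP), plus the two embedding tables:

```python
def approx_params(n_layer, n_embd, vocab_size=65, block_size=256):
    blocks = 12 * n_layer * n_embd ** 2          # ~4x attn + ~8x MLP weights
    embeds = (vocab_size + block_size) * n_embd  # token + position tables
    return blocks + embeds

for name, (nl, ne) in {"Tiny": (2, 128), "Small": (4, 256), "Medium": (6, 384)}.items():
    print(f"{name}: ~{approx_params(nl, ne) / 1e6:.1f}M")
# Tiny: ~0.4M, Small: ~3.2M, Medium: ~10.7M. Biases, layernorms, and the
# output head nudge the real totals, so the table's figures differ slightly.
```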
This workshop uses character-level tokenization on Shakespeare. BPE tokenization (GPT-2's 50k vocab) doesn't work on small datasets: most token bigrams are too rare for the model to learn patterns from.
| Tokenizer | Vocab Size | Dataset Size Needed |
|---|---|---|
| Character-level | ~65 | Small (Shakespeare, ~1MB) |
| BPE (tiktoken) | 50,257 | Large (TinyStories+, 100MB+) |
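The character-level side is small enough to sketch whole. Something like what you'll build in Part 1 (variable names are mine):

```python
# Build the vocabulary from every distinct character in the corpus.
text = open("data/shakespeare.txt").read()
chars = sorted(set(text))                      # ~65 unique characters
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
itos = {i: ch for ch, i in stoi.items()}       # id -> char

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

print(len(chars))               # vocab_size
print(decode(encode("hello")))  # round-trips to "hello"
```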
Part 5 covers switching to BPE for larger datasets.
The engineering was horrible and very ad hoc, but I learned a lot. Results were ok-ish (I classified tweets), but it gave me a good perspective on the sheer GPU power (and engineering challenges) one would need to do this seriously. I didn't fully grasp the potential of generating output, but I spent quite some time chuckling at generated tweets (I was just curious to try it).
But that's just me. I think it's more useful to understand the how and why before training an LLM.
[0] https://github.com/rasbt/LLMs-from-scratch
[1] https://www.manning.com/books/build-a-large-language-model-f...
[2] https://magazine.sebastianraschka.com/p/coding-llms-from-the...
A series of Jupyter notebooks explaining the whole machine-learning machinery, from the beginning:
https://github.com/nickyreinert/DeepLearning-with-PyTorch-fr...
and, of course, also how to build an LLM from scratch:
https://github.com/nickyreinert/basic-llm-with-pytorch/blob/...
I doubt you have a machine big enough to make it "Large".
And it's paired with 48 processor cores! I mean, they don't even support AVX512 but they can do math!
I could totally train an LLM! Or at least my family could... might need my kid to pick up and carry on the project.
But in all seriousness... you either missed the point, are being needlessly pedantic, or are... wrong?
This is about learning concepts, and the rest of this is mostly moot.
On the pedantic-or-wrong note: what is the documented cutoff for a "large" language model? Because GPT-2 was and is described as a "large" language model, and it had 1.5B parameters. You can just about get a consumer GPU capable of training that for about $400 these days.
I'm not saying it's worth it, but you don't need to buy a GPU yourself to be able to train.
Runs on a Blackwell 6000 Max-Q, using 86GB of VRAM. Training supposedly takes 3h40m.
In my own very humble opinion, it becomes "Large" when it's beyond non-specialized hardware. So currently, a model that requires more than 32GB of VRAM is large (that's roughly where the high-end gaming GPUs cut off).
And btw, there is no way you can train a language model on a CPU, even with DDR5, unless you're prepared to wait a whole week for a single training run. Give it a go! I know I did; it's an order of magnitude away from being feasible.
And no one is stopping anyone from tweaking a few parameters in this repo to go above 10M parameters.