CS336: Language Modeling from Scratch

I’m intrigued by this course. However I’m also curious about its prerequisite:

> Machine Learning (e.g. CS221, CS229, CS230, CS124, CS224N) You should be comfortable with the basics of machine learning and deep learning.

Anyone have a good implementation-heavy self-study resource for those topics, or experience with the recorded lectures for those Stanford courses?

I have fond memories of cs224d [1] taught by richardsocher. It’s a bit dated at this point as it was created in the pre-transformer era, but it was a very cool introduction to applying deep learning to nlp at the time.

[1] https://cs224d.stanford.edu

> GPU compute for self-study

Those suggestions they make for a B200 start at $4.99 an hour.

Is that really required, for starting out? I've been tinkering with my own from-scratch LLM, but in the early phases I don't need anything more than a 4090 on Vast.ai

I brought a group together to do this class using the YouTube videos and course materials available online. It is challenging but rewarding. We tackled it one lecture video per week. We started with over 30 learners, and by the last session we were down to 8.

I wonder if people prefer to learn this on their own or if building a community around open learning is something that others are interested in

Thanks for releasing this again! What are this year's changes to prior offerings?

Are video lectures available online?

I recently started reading "build reasoning model from scratch", then I realized that I am not really interested in the building part and just want to understand the theory and practice behind it.

I want something like a casual LessWrong-style, from-the-ground-up explanation.

I’m intrigued by this course. However I’m also curious about its prerequisite:

> Machine Learning (e.g. CS221, CS229, CS230, CS124, CS224N) You should be comfortable with the basics of machine learning and deep learning.

Anyone have a good implementation-heavy self-study resource for those topics, or experience with the recorded lectures for those Stanford courses?

I found the 2024 Spring CS224N course sufficient for this prerequisite, coupled with the textbook (chapters 1-13). Like CS336, this one also has videos and assignments available, and its being from 2024 is not a problem since the basics are mostly the same as in recent years. Notably, this is not true for 336, which spends much more time discussing cutting-edge techniques, so the 2026 version there is essential.

Course: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1246...

Lecture videos: https://www.youtube.com/playlist?list=PLoROMvodv4rOaMFbaqxPD...

Textbook: https://web.stanford.edu/~jurafsky/slp3/

> GPU compute for self-study

Those suggestions they make for a B200 start at $4.99 an hour.

Is that really required, for starting out? I've been tinkering with my own from-scratch LLM, but in the early phases I don't need anything more than a 4090 on Vast.ai

I imagine it's a lot like FPGAs:

- the hardware you need for a production use-case is relatively small, because production {models, bitstreams} have been heavily size-optimized, stripping out everything not needed to get a good result for the target use-cases

- but the hardware you need when tinkering/learning how to design {compute kernels, IP blocks} in the first place, must be quite a bit more powerful / higher-capacity, because your experiments will intentionally be the opposite of optimized: they'll be built for legibility / introspectability / debuggability at every level, which massively inflates and de-optimizes the resulting {model, bitstream}.

(And, to be clear here, "running someone else's finished model, which was designed and optimized to be used on something like a 4090, against your own prompt" is a kind of experimenting, which is cheap, in the same way that "deploying someone else's pre-baked FPGA bitstream, that was designed and synthesized for a $20 target FPGA, onto your own instance of that $20 FPGA, and then feeding your own input signals to it" is cheap. But that's not the kind of experimenting you'd be doing in this course while learning to design your own models!)

TA here. Definitely not! In fact we explicitly added sections in the first assignment to allow for scaling down to even local compute (M-series GPUs). For assignment 2 there are a few regions that require Triton support for your GPU, but everything can be adapted for much cheaper GPUs.

We were lucky enough to get Blackwell GPUs for Stanford students this year, which is why the writeups are written mostly around them.

You're right to be sceptical. I have trained reasonably good SLMs for the TinyStories dataset on my 4060Ti (16GB) with no problems. You'll only encounter problems if you want to test whether your ideas scale up to models any bigger than "arguably tiny".

It seems strange that the required resources aren't provided by the educational institution?

You don't even need a GPU to train your own LLM.

I believe these are affordable enough for the intended audience (which is Stanford undergrad/master's students).

Thanks for releasing this again! What are this year's changes to prior offerings?

TA here. Biggest changes are in the second assignment (distributed) where we added a bunch of memory, profiling and distributed tasks, as well as in the fifth assignment (alignment), where most of the RL tasks are fresh this year. Assignment 3 (scaling laws) was also completely updated, but in a way that might be difficult to run without substantial resources. I'm working on a way for external students to be able to run simulated experiments for free!

Assignment 1 (basics) has the most hours of preparation invested in it, and only minor modernization/bug fixes were necessary this year.

[1] https://cs224d.stanford.edu

Similar thoughts here. That was when I realized the potential of the Internet: I didn't have to be a grad student at a tier 1 research university to learn about the frontier.

I wonder if people prefer to learn this on their own or if building a community around open learning is something that others are interested in

I'd be interested in joining a discord server.

Would be great to have a community to discuss the material - even if folks can't commit to the full course.

I recently started reading "build reasoning model from scratch", then I realized that I am not really interested in the building part and just want to understand the theory and practice behind it.

I want something like a casual LessWrong-style, from-the-ground-up explanation.

In that case I humbly suggest my talk from AI Engineer World's Fair https://www.youtube.com/watch?v=ZuiJjkbX0Og

Gives you the basics on LLM internals in about 90 minutes and includes an already built model in JavaScript that you can step through in browser devtools to get as detailed as you want.

It seems strange that the required resources aren't provided by the educational institution?

We do provide resources for enrolled students. The online suggestions are for external students or Stanford students who we weren't able to admit.

Two schools of thought. The first: people are paying 100K per year, so we should provide everything. The second: they are paying 100K per year, so do you think they will care about a couple of hundred more?

I believe these are affordable enough for the intended audience (which is Stanford undergrad/master's students).

For them, Modal is sponsoring the compute, as stated on the website. The prices are for remote followers.

Content

What is this course about?

Language models serve as the cornerstone of modern natural language processing (NLP) applications and open up a new paradigm of having a single general-purpose system address a range of downstream tasks. As the fields of artificial intelligence (AI), machine learning (ML), and NLP continue to grow, a deep understanding of language models becomes essential for scientists and engineers alike. This course is designed to provide students with a comprehensive understanding of language models by walking them through the entire process of developing their own. Drawing inspiration from operating systems courses that create an entire operating system from scratch, we will lead students through every aspect of language model creation, including data collection and cleaning for pre-training, transformer model construction, model training, and evaluation before deployment.

Prerequisites

Proficiency in Python

The majority of class assignments will be in Python. Unlike most other AI classes, students will be given minimal scaffolding. The amount of code you will write will be at least an order of magnitude greater than for other classes. Therefore, being proficient in Python and software engineering is paramount.

Experience with deep learning and systems optimization

A significant part of the course will involve making neural language models run quickly and efficiently on GPUs across multiple machines. We expect students to have a strong familiarity with PyTorch and know basic systems concepts like the memory hierarchy.

College Calculus, Linear Algebra (e.g. MATH 51, CME 100)

You should be comfortable understanding matrix/vector notation and operations.

Basic Probability and Statistics (e.g. CS 109 or equivalent)

You should know the basics of probabilities, Gaussian distributions, mean, standard deviation, etc.

Machine Learning (e.g. CS221, CS229, CS230, CS124, CS224N)

You should be comfortable with the basics of machine learning and deep learning.

Note that this is a 5-unit class. This is a very implementation-heavy class, so please allocate enough time for it.

Coursework

Assignments

Assignment 1: Basics
- Implement all of the components (tokenizer, model architecture, optimizer) necessary to train a standard Transformer language model.
- Train a minimal language model.
Assignment 2: Systems
- Profile and benchmark the model and layers from Assignment 1 using advanced tools, and optimize attention with your own Triton implementation of FlashAttention2.
- Build a memory-efficient, distributed version of the Assignment 1 model training code.
Assignment 3: Scaling
- Understand the function of each component of the Transformer.
- Query a training API to fit a scaling law to project model scaling.
Assignment 4: Data
- Convert raw Common Crawl dumps into usable pretraining data.
- Perform filtering and deduplication to improve model performance.
Assignment 5: Alignment and Reasoning RL
- Apply supervised finetuning and reinforcement learning to train LMs to reason when solving math problems.
- Optional Part 2: implement and apply safety alignment methods such as DPO.
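
To make "fit a scaling law" in Assignment 3 a little more concrete, here is a minimal sketch of one common approach: fitting a power law loss = a * size^(-b) by least squares in log-log space. This is not the assignment's training API or its prescribed method. The function names are hypothetical, and the data points are synthetic, generated from a known power law purely so the fit can be checked.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-b) by ordinary least squares on (log size, log loss)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predict(a, b, size):
    """Extrapolate the fitted law to a new model size."""
    return a * size ** (-b)


# Synthetic points from loss = 20 * N**(-0.1); NOT real training results.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [20.0 * n ** (-0.1) for n in sizes]

a, b = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, b = {b:.3f}")
print(f"projected loss at 1e10 parameters: {predict(a, b, 1e10):.3f}")
```

In practice, scaling-law fits often add an irreducible-loss term and vary compute or data as well as parameter count, which calls for a nonlinear fit rather than this linear one.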

All (currently tentative) deadlines are listed in the schedule.

GPU compute for self-study

If you are following along at home, you can access GPU compute from a cloud provider to complete the assignments.

Here are a few options (public pricing for a single B200 GPU on March 28, 2026):

Modal (sponsor): $6.25/hour. Offers $30 of free monthly compute. You are only charged for actual compute (no idle resources) and their UX makes switching between local dev and large-scale GPU experiments simple. (Modal Pricing)
Lambda Labs: $6.69/hour (Lambda Pricing)
RunPod: $4.99/hour (RunPod Pricing)
Nebius: $5.50/hour ($3.05/hour preemptible) (Nebius Pricing)
Together: $7.49/hour, minimum 8 GPUs, cheaper for longer commitments (Together Pricing)

For convenience and to save money, we recommend debugging correctness of your implementation on CPU first and then using GPU(s) (with the count recommended in the assignments) for completing training runs (A1, A4, A5) or benchmarking GPU operations (A2).
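
For a rough feel of what these rates mean for a budget, the arithmetic can be sketched in a few lines of Python. The hourly prices and the $30 credit are the ones listed above. The function names and the 10-hour run length are made up for illustration, and real bills depend on each provider's billing granularity and current pricing.

```python
# Hourly single-B200 prices copied from the list above (USD, March 28, 2026).
# Together is left out because its listed price has an 8-GPU minimum.
B200_PRICE_PER_HOUR = {
    "Modal": 6.25,
    "Lambda Labs": 6.69,
    "RunPod": 4.99,
    "Nebius": 5.50,
    "Nebius (preemptible)": 3.05,
}

MODAL_FREE_MONTHLY_CREDIT = 30.00  # USD, from the list above


def run_cost(provider: str, gpu_hours: float) -> float:
    """Cost in USD of `gpu_hours` on one B200 at the listed hourly rate."""
    return round(B200_PRICE_PER_HOUR[provider] * gpu_hours, 2)


def free_hours(credit: float, price_per_hour: float) -> float:
    """GPU-hours that a credit covers at a given hourly rate."""
    return credit / price_per_hour


if __name__ == "__main__":
    hours = 10  # hypothetical run length, not a course estimate
    for provider in sorted(B200_PRICE_PER_HOUR, key=B200_PRICE_PER_HOUR.get):
        print(f"{provider:22s} ${run_cost(provider, hours):7.2f} for {hours} GPU-hours")
    covered = free_hours(MODAL_FREE_MONTHLY_CREDIT, B200_PRICE_PER_HOUR["Modal"])
    print(f"Modal's free credit covers about {covered:.1f} B200-hours per month")
```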

Honor code

Like all other classes at Stanford, we take the student Honor Code seriously. Please respect the following policies:

Collaboration: Study groups are allowed, but students must understand and complete their own assignments, and hand in one assignment per student. If you worked in a group, please put the names of the members of your study group at the top of your assignment. Please ask if you have any questions about the collaboration policy.
AI tools: Prompting LLMs such as ChatGPT is permitted for low-level programming questions or high-level conceptual questions about language models, but using them directly to solve the problem is prohibited. We strongly encourage you to disable AI autocomplete (e.g., Cursor Tab, GitHub Copilot) in your IDE when completing assignments (though non-AI autocomplete, e.g., autocompleting function names, is totally fine). We have found that AI autocomplete makes it much harder to engage deeply with the content. See the AI policy (inspired by this).
Existing code: Implementations for many of the things you will implement exist online. The handouts we'll give will be self-contained, so that you will not need to consult third-party code for producing your own implementation. Thus, you should not look at any existing code unless otherwise specified in the handouts.

Submitting coursework

All coursework is submitted via Gradescope by the deadline. Do not submit your coursework via email.
If anything goes wrong, please ask a question in Slack or contact a course assistant.
You can submit as many times as you'd like until the deadline: we will only grade the last submission.
Partial work is better than not submitting any work.

Late days

Each student has 6 late days to use. A late day extends the deadline by 24 hours.
You can use up to 3 late days per assignment.
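
As one reading of the two rules above, a late-day request can be checked mechanically. The sketch below is illustrative only: the function name and the example deadline are hypothetical, and the course staff remain the authority on how late days are actually counted.

```python
from datetime import datetime, timedelta

TOTAL_LATE_DAYS = 6      # per student, from the policy above
MAX_PER_ASSIGNMENT = 3   # from the policy above


def extended_deadline(deadline: datetime, late_days: int, used_so_far: int) -> datetime:
    """Return the extended deadline, or raise ValueError if a cap is exceeded."""
    if late_days < 0 or late_days > MAX_PER_ASSIGNMENT:
        raise ValueError("between 0 and 3 late days per assignment")
    if used_so_far + late_days > TOTAL_LATE_DAYS:
        raise ValueError("only 6 late days in total")
    # Each late day extends the deadline by 24 hours.
    return deadline + timedelta(hours=24 * late_days)


# Hypothetical deadline for illustration; the real ones are in the schedule.
due = datetime(2026, 4, 15, 23, 59)
print(extended_deadline(due, late_days=2, used_so_far=0))
```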

Regrade requests

If you believe that the course staff made an objective error in grading, you may submit a regrade request on Gradescope within 3 days after the grades are released.

Sponsor

We would like to thank Modal for sponsoring compute for this class.

Schedule (YouTube playlist)

#	Date	Description	Course Materials	Deadlines
1	Mon March 30	Overview, tokenization [Percy]	lecture_01.py	Assignment 1 out [code] [preview]
2	Wed April 1	PyTorch (einops), resource accounting (FLOPs, memory, arithmetic intensity) [Percy]	lecture_02.py (recording version)
3	Mon April 6	Architectures, hyperparameters [Tatsu]	lecture 3.pdf
4	Wed April 8	Attention alternatives and mixture of experts [Tatsu]	lecture 4.pdf
5	Mon April 13	GPUs, TPUs [Tatsu]	lecture 5.pdf
6	Wed April 15	Kernels, Triton [Percy]	lecture_06.py	Assignment 1 due; Assignment 2 out [code] [preview]
7	Mon April 20	Parallelism [Percy]	lecture_07.py
8	Wed April 22	Parallelism [Tatsu]	lecture_08.pdf
9	Mon April 27	Scaling laws [Tatsu]	lecture_09.pdf
10	Wed April 29	Inference [Percy]	lecture_10.py	Assignment 2 due; Assignment 3 out [code] [preview]
11	Mon May 4	Scaling laws [Tatsu]	lecture_11.pdf
12	Wed May 6	Evaluation [Percy]	lecture_12.py	Assignment 3 due; Assignment 4 out [code] [preview]
13	Mon May 11	Data (sources, datasets) [Percy]	lecture_13.py
14	Wed May 13	Data (filtering, deduplication, mixing, synthetic data) [Percy]	lecture_14.py
15	Mon May 18	Mid/post-training (SFT/RLHF) [Tatsu]	lecture_15.pdf
16	Wed May 20	Post-training - RLVR [Tatsu]	lecture_16.pdf	Assignment 4 due; Assignment 5 out [code] [preview] [Optional Part 2]
	Mon May 25	No class (Memorial Day)
17	Wed May 27	Alignment - multimodality [Percy]	lecture_17.py
18	Mon June 1	Guest lecture: Daniel Selsam
19	Wed June 3	Guest lecture: Dan Fu		Assignment 5 due
