AI Agent Guidelines for CS336 at Stanford https://github.com/stanford-cs336/assignment1-basics/blob/ma... (https://news.ycombinator.com/item?id=48359232)
> Machine Learning (e.g. CS221, CS229, CS230, CS124, CS224N) You should be comfortable with the basics of machine learning and deep learning.
Anyone have a good implementation-heavy self-study resource for those topics, or experience with the recorded lectures for those Stanford courses?
Those suggestions they make for a B200 start at $4.99 an hour.
Is that really required for starting out? I've been tinkering with my own from-scratch LLM, but in the early phases I don't need anything more than a 4090 on Vast.ai.
I want something like a casual, LessWrong-style, from-the-ground-up explanation.
Assignment 1 (basics) has the most hours of preparation invested in it, and only minor modernization/bug fixes were necessary this year.
Course: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1246...
Lecture videos: https://www.youtube.com/playlist?list=PLoROMvodv4rOaMFbaqxPD...
Textbook: https://web.stanford.edu/~jurafsky/slp3/
We were lucky enough to get Blackwell GPUs for Stanford students this year, which is why the writeups are written mostly around them.
Would be great to have a community to discuss the material - even if folks can't commit to the full course.
- the hardware you need for a production use-case is relatively small, because production {models, bitstreams} have been heavily size-optimized, stripping out everything not needed to get a good result for the target use-cases
- but the hardware you need when tinkering/learning how to design {compute kernels, IP blocks} in the first place, must be quite a bit more powerful / higher-capacity, because your experiments will intentionally be the opposite of optimized: they'll be built for legibility / introspectability / debuggability at every level, which massively inflates and de-optimizes the resulting {model, bitstream}.
(And, to be clear here, "running someone else's finished model, which was designed and optimized to be used on something like a 4090, against your own prompt" is a kind of experimenting, which is cheap, in the same way that "deploying someone else's pre-baked FPGA bitstream, that was designed and synthesized for a $20 target FPGA, onto your own instance of that $20 FPGA, and then feeding your own input signals to it" is cheap. But that's not the kind of experimenting you'd be doing in this course while learning to design your own models!)
Gives you the basics on LLM internals in about 90 minutes and includes an already built model in JavaScript that you can step through in browser devtools to get as detailed as you want.
Language models serve as the cornerstone of modern natural language processing (NLP) applications and open up a new paradigm of having a single general-purpose system address a range of downstream tasks. As the fields of artificial intelligence (AI), machine learning (ML), and NLP continue to grow, a deep understanding of language models becomes essential for scientists and engineers alike. This course is designed to provide students with a comprehensive understanding of language models by walking them through the entire process of developing their own. Drawing inspiration from operating systems courses that create an entire operating system from scratch, we will lead students through every aspect of language model creation, including data collection and cleaning for pre-training, transformer model construction, model training, and evaluation before deployment.
Proficiency in Python
The majority of class assignments will be in Python. Unlike most other AI classes, students will be given minimal scaffolding. The amount of code you will write will be at least an order of magnitude greater than for other classes. Therefore, being proficient in Python and software engineering is paramount.
Experience with deep learning and systems optimization
A significant part of the course will involve making neural language models run quickly and efficiently on GPUs across multiple machines. We expect students to have strong familiarity with PyTorch and to know basic systems concepts like the memory hierarchy.
College Calculus, Linear Algebra (e.g. MATH 51, CME 100)
You should be comfortable understanding matrix/vector notation and operations.
Basic Probability and Statistics (e.g. CS 109 or equivalent)
You should know the basics of probabilities, Gaussian distributions, mean, standard deviation, etc.
Machine Learning (e.g. CS221, CS229, CS230, CS124, CS224N)
You should be comfortable with the basics of machine learning and deep learning.
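As a rough illustration of the kind of systems reasoning the prerequisites above refer to (the memory hierarchy), here is a minimal sketch that estimates the arithmetic intensity of a matrix multiply, i.e. FLOPs per byte moved to or from memory. The function name, the 2-byte element size, and the assumption that each matrix crosses the memory bus exactly once are all illustrative choices, not something taken from the course materials.

```python
def matmul_arithmetic_intensity(m: int, k: int, n: int, bytes_per_element: int = 2) -> float:
    """Idealized FLOPs-per-byte for C = A @ B with A of shape (m, k) and B of shape (k, n).

    Assumes (hypothetically) that A and B are each read once and C is written once,
    and that elements are 2 bytes wide (e.g. a 16-bit float format).
    """
    flops = 2 * m * k * n  # one multiply and one add per (i, j, l) triple
    bytes_moved = (m * k + k * n + m * n) * bytes_per_element
    return flops / bytes_moved


# A large square matmul does many FLOPs per byte moved.
square = matmul_arithmetic_intensity(4096, 4096, 4096)

# A matrix-vector-like product (m = 1) does roughly one FLOP per byte moved.
skinny = matmul_arithmetic_intensity(1, 4096, 4096)

print(f"square: {square:.1f} FLOPs/byte")
print(f"skinny: {skinny:.3f} FLOPs/byte")
```

Under these assumptions the square case is limited mostly by compute, while the skinny case is limited mostly by how fast memory can deliver the weights, which is why knowing where data lives matters when making models run fast.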
Note that this is a 5-unit class. This is a very implementation-heavy class, so please allocate enough time for it.
All (currently tentative) deadlines are listed in the schedule.
If you are following along at home, you can access GPU compute from a cloud provider to complete the assignments.
Here are a few options (public pricing for a single B200 GPU on March 28, 2026):
For convenience and to save money, we recommend debugging correctness of your implementation on CPU first and then using GPU(s) (with the count recommended in the assignments) for completing training runs (A1, A4, A5) or benchmarking GPU operations (A2).
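The savings from debugging on CPU first can be estimated with simple arithmetic. The sketch below uses the $4.99 per hour B200 figure quoted in the discussion; the hour counts and the helper name `gpu_cost` are hypothetical, chosen only to show the shape of the calculation.

```python
# Hourly rate quoted in the discussion for a single B200; actual prices vary by provider.
B200_USD_PER_HOUR = 4.99


def gpu_cost(hours: float, num_gpus: int = 1, usd_per_hour: float = B200_USD_PER_HOUR) -> float:
    """Rental cost in USD for `num_gpus` GPUs held for `hours` wall-clock hours."""
    return round(hours * num_gpus * usd_per_hour, 2)


# Hypothetical time budget (not from the course): 6 hours of debugging, 2 hours of training.
debug_hours, train_hours = 6.0, 2.0

debug_on_gpu = gpu_cost(debug_hours + train_hours)  # everything done on the rented GPU
debug_on_cpu = gpu_cost(train_hours)                # only the final run uses the GPU

print(f"debugging on the GPU: ${debug_on_gpu:.2f}")
print(f"debugging on CPU first: ${debug_on_cpu:.2f}")
```

With these made-up hours the bill is $39.92 versus $9.98; the gap grows with the GPU count an assignment recommends, since idle debugging time is billed on every GPU.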
Like all other classes at Stanford, we take the student Honor Code seriously. Please respect the following policies:
If you believe that the course staff made an objective error in grading, you may submit a regrade request on Gradescope within 3 days after the grades are released.
We would like to thank Modal for sponsoring compute for this class.
| # | Date | Description | Course Materials | Deadlines |
|---|---|---|---|---|
| 1 | Mon March 30 | Overview, tokenization [Percy] | lecture_01.py | Assignment 1 out [code] [preview] |
| 2 | Wed April 1 | PyTorch (einops), resource accounting (FLOPs, memory, arithmetic intensity) [Percy] | lecture_02.py (recording version) | |
| 3 | Mon April 6 | Architectures, hyperparameters [Tatsu] | lecture 3.pdf | |
| 4 | Wed April 8 | Attention alternatives and mixture of experts [Tatsu] | lecture 4.pdf | |
| 5 | Mon April 13 | GPUs, TPUs [Tatsu] | lecture 5.pdf | |
| 6 | Wed April 15 | Kernels, Triton [Percy] | lecture_06.py | Assignment 1 due; Assignment 2 out [code] [preview] |
| 7 | Mon April 20 | Parallelism [Percy] | lecture_07.py | |
| 8 | Wed April 22 | Parallelism [Tatsu] | lecture_08.pdf | |
| 9 | Mon April 27 | Scaling laws [Tatsu] | lecture_09.pdf | |
| 10 | Wed April 29 | Inference [Percy] | lecture_10.py | Assignment 2 due; Assignment 3 out [code] [preview] |
| 11 | Mon May 4 | Scaling laws [Tatsu] | lecture_11.pdf | |
| 12 | Wed May 6 | Evaluation [Percy] | lecture_12.py | Assignment 3 due; Assignment 4 out [code] [preview] |
| 13 | Mon May 11 | Data (sources, datasets) [Percy] | lecture_13.py | |
| 14 | Wed May 13 | Data (filtering, deduplication, mixing, synthetic data) [Percy] | lecture_14.py | |
| 15 | Mon May 18 | Mid/post-training (SFT/RLHF) [Tatsu] | lecture_15.pdf | |
| 16 | Wed May 20 | Post-training - RLVR [Tatsu] | lecture_16.pdf | Assignment 4 due; Assignment 5 out [code] [preview] [Optional Part 2] |
| | Mon May 25 | No class (Memorial Day) | | |
| 17 | Wed May 27 | Alignment - multimodality [Percy] | lecture_17.py | |
| 18 | Mon June 1 | Guest lecture: Daniel Selsam | ||
| 19 | Wed June 3 | Guest lecture: Dan Fu | | Assignment 5 due |