gk • Apr 9, 2026 • 6 min read
frontier reasoning models like opus 4.6, gpt 5.4, and gemini’s thinking series are now matching or beating humans on competition math and hard coding benchmarks. rl is what got them there, and grpo is the algorithm doing most of the heavy lifting.
rather than training a separate value model to estimate how good a response should be (which roughly doubles training compute), grpo samples a handful of responses to the same prompt and uses their average score as the baseline. it’s simpler, cheaper, and well suited to reasoning tasks with verifiable rewards. here’s how it works.
supervised fine-tuning teaches a model to copy examples. rl teaches it to optimize for an outcome.
the difference matters for tasks where the right answer is easy to verify but hard to demonstrate. math problems, code, logic puzzles. you don’t need to show the model how to solve a problem - you just need to reward it when it gets the answer right.
the model explores different reasoning paths and gradually learns which strategies actually work. this is what separates today’s reasoning models - things like claude and gpt - from earlier generations. they’ve been trained to search for good solutions, not just predict plausible continuations.
the naive update rule (“if the reward was good, do more of that”) doesn’t quite work on its own. the problem is that rewards are relative. if every response scores a 0.9, the model has no signal about which ones to boost.
to make updates meaningful, you subtract a baseline from each reward:
advantage = reward − baseline
the baseline represents “what we’d normally expect.” the harder question is: where does that baseline come from?
grpo (group relative policy optimization) solves this by using the model’s own outputs as the reference point: the baseline comes from the group.
for each training prompt, instead of generating one response, you generate a whole group - G responses - and score each one. the group mean becomes the baseline. no extra model needed.
the math is straightforward:
for a prompt q, sample G outputs o₁, o₂, …, oG
score each: r₁, r₂, …, rG
baseline = mean(r₁ … rG)
advantageᵢ = (rᵢ − baseline) / std(r₁ … rG)
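in code, that’s only a few lines. a minimal numpy sketch - the rewards below are made up for illustration, and the eps guard covers the all-tied case from earlier (std of zero, no signal):

```python
import numpy as np

def group_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """normalize each reward against its own group: (r_i - mean) / std."""
    baseline = rewards.mean()
    return (rewards - baseline) / (rewards.std() + eps)

# 4 responses to one prompt, scored 1.0 (correct) or 0.0 (wrong)
rewards = np.array([1.0, 0.0, 1.0, 0.0])
print(group_advantages(rewards))  # ≈ [1, -1, 1, -1]
```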
outputs that score above the group average get a positive advantage and become more likely. outputs below average get a negative advantage and become less likely. the model is always learning relative to itself, not some externally defined standard.
the reward function is where the real work happens. for reasoning tasks, a common setup uses a correctness check on the final answer (does it match the ground truth?) plus simple format rules (did the output follow the expected structure?).
no trained reward model required - just string matching and rule checks. this is part of why grpo has been so popular for math and code. competition math problems have unambiguous answers; code either passes the tests or it doesn’t. models trained this way have reached state-of-the-art on benchmarks like MATH and AIME that were out of reach for sft alone.
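as a concrete sketch, a math-answer reward can be this small. note that `extract_final_answer` is a hypothetical stand-in for whatever parsing your output format needs - pulling a \boxed{} span, a designated answer tag, or just the last line:

```python
def extract_final_answer(response: str) -> str:
    """hypothetical parser: real versions might pull the contents of
    \\boxed{...} or an answer tag; here we just take the last line."""
    return response.strip().splitlines()[-1].strip()

def reward(response: str, ground_truth: str) -> float:
    """1.0 if the final answer string-matches the ground truth, else 0.0.
    real setups often add numeric-equivalence checks or a format bonus."""
    return 1.0 if extract_final_answer(response) == ground_truth else 0.0
```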
compare this to rlhf: the traditional pipeline collects human preference data, trains a reward model on it, then optimizes the policy against that proxy. grpo-style training skips all of that when you have a ground truth to check against. the reward signal is the ground truth itself, not a learned approximation of it.
what you gain: no reward model to train, no reward model to overfit, no proxy misalignment to worry about. what you give up: it only works when correctness is cleanly verifiable. for tasks where quality is subjective or multi-dimensional, you’re back to needing a learned reward.
without guardrails, the model can drift far from its starting point chasing rewards. this produces degenerate outputs: reward hacking, repetition, bizarre formatting. we’ve seen this firsthand.
the fix is a kl divergence penalty that adds a cost for diverging from the reference model, a frozen copy of the policy kept from the start of training.
the base objective is simple: increase the log-probability of responses that scored above average, decrease the log-probability of responses that scored below, and weight each update by how large the advantage was. the kl term then acts as a regularizer, keeping the policy from drifting too far:
loss = −advantage × log_prob(response) + β × KL(policy ∥ reference)
β controls how tightly the model is leashed. too high and the model barely moves; too low and it drifts into the reward hacking described above. tuning β is one of the key levers for rl training stability.
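here’s a sequence-level pytorch sketch of that loss. real implementations work per-token and batch across many prompts; the kl term below uses a standard per-sample estimator (exp(ref − pol) − (ref − pol) − 1) rather than a literal kl call, and β = 0.04 is just an illustrative default:

```python
import torch

def grpo_loss(logprobs: torch.Tensor,      # [G] log-prob of each response under the policy
              ref_logprobs: torch.Tensor,  # [G] same responses under the frozen reference
              advantages: torch.Tensor,    # [G] group-relative advantages from above
              beta: float = 0.04) -> torch.Tensor:
    # policy-gradient term: push up above-average responses, push down the rest.
    # advantages are constants - no gradient flows through the baseline.
    pg = -(advantages.detach() * logprobs).mean()
    # per-sample kl estimate: non-negative, zero when policy and reference agree
    log_ratio = ref_logprobs - logprobs
    kl = (torch.exp(log_ratio) - log_ratio - 1).mean()
    return pg + beta * kl
```

raising β strengthens the pull back toward the reference model without touching the policy-gradient term, which is what makes it a clean stability knob.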
for verifiable tasks, grpo says you don’t need a learned critic or a learned reward model. sample a group of responses, score them against a rule-based check, and use the spread as your training signal. that’s the whole algorithm.
it’s worked well for math and code, where correctness is unambiguous and rewards are cheap to compute. the harder question is how far it extends to tasks where correctness is fuzzier: open-ended writing, multi-step reasoning, anything where a rule-based reward would miss important nuance. that’s where most of the current rl research is focused.
for now, grpo is the standard tool for training reasoning models, from open-source math specialists to the rl stages of frontier models like o3 and gemini.
we use it extensively at cgft - including in our unit test generation work and agentic rag training. if you’re working on a task with verifiable outputs and want to apply rl, grpo is usually where to start.
reach out if you want to get into specifics.
stay updated with new guides and product updates on rlft.