Understanding the Go Runtime: The Scheduler

I enjoyed both these GopherCon talks:

GopherCon 2018: The Scheduler Saga - Kavya Joshi https://www.youtube.com/watch?v=YHRO5WQGh0k

GopherCon 2017: Understanding Channels - Kavya Joshi https://www.youtube.com/watch?v=KBZlN0izeiY

My biggest issue with go is it’s incredibly unfair scheduler. No matter what load you have, P99 and especially P99.9 latency will be higher than any other language. The way that it steals work guarantees that requests “in the middle” will be served last.

It’s a problem that only go can solve, but that means giving up some of your speed that are currently handled immediately that shouldn’t be. So overall latency will go up and P99 will drop precipitously. Thus, they’ll probably never fix it.

If you have a system that requires predictable latency, go is not the right language for it.

> a goroutine’s state is surprisingly small. The mcall() assembly function only saves 3 values — the stack pointer, the program counter, and the base pointer — into a tiny gobuf struct. That’s it. Why so few? Because goroutine switches happen at function call boundaries, and at those points the compiler has already spilled any important registers to the stack following normal calling conventions.

Wouldn’t that mean go never uses registers to pass arguments to functions?

If so, that seems in conflict with https://go.dev/src/cmd/compile/abi-internal#function-call-ar..., which says “Because access to registers is generally faster than access to the stack, arguments and results are preferentially passed in registers”

Or does the compiler always Go’s stable ABI, known as ABI0 in functions where it inserts code to potentially context switch, and only uses the (potentially) faster ABI that passes arguments in registers elsewhere?

The unfair scheduling point resonates. I run a lot of concurrent HTTP workloads in Go (scraping, data pipelines) and the scheduler is honestly fine for throughput-oriented work where you don't care about tail latency. But the moment you need consistent response times under load it becomes a real problem. GOMAXPROCS tuning and runtime.LockOSThread help in narrow cases but they're band-aids. The lack of priority or fairness knobs is a deliberate design choice but it does push certain workloads toward other runtimes.

Isn't a dedicated worker pool with priority queues enough to get predictable P99 without leaving Go?

If you fix N workers and control dispatch order yourself, the scheduler barely gets involved — no stealing, no surprises.

The inter-goroutine handoff is ~50-100ns anyway.

Isn't the real issue using `go f()` per request rather than something in the language itself?

This is an excellent idea as a blog. Kudos!

Go missed a big opportunity to be Rust when we needed Rust more than anything. I have long since moved on from Go and C#/.NET is widely available nowadays and in many respects less held back by some strange political choices when it comes to DevEx (I am of course talking about generics).

I enjoyed both these GopherCon talks:

GopherCon 2018: The Scheduler Saga - Kavya Joshi https://www.youtube.com/watch?v=YHRO5WQGh0k

GopherCon 2017: Understanding Channels - Kavya Joshi https://www.youtube.com/watch?v=KBZlN0izeiY

https://m.youtube.com/watch?v=-K11rY57K7k - Dmitry Vyukov — Go scheduler: Implementing language with lightweight concurrency

This one notably also explains the design considerations for golangs M:N:P in comparison to other schemes and which specific challenges it tries to address.

Good videos, thanks for sharing!

Wouldn’t that mean go never uses registers to pass arguments to functions?

The compiler generates code to spill arguments to the stack at synchronous preemption points (function entry). Signal-based preemption has a spill path that saves the full ABI register set.

Isn't a dedicated worker pool with priority queues enough to get predictable P99 without leaving Go?

If you fix N workers and control dispatch order yourself, the scheduler barely gets involved — no stealing, no surprises.

The inter-goroutine handoff is ~50-100ns anyway.

Isn't the real issue using `go f()` per request rather than something in the language itself?

This is an excellent idea as a blog. Kudos!

No. Eventually the queues get full and go routines pause waiting to place the element onto the queue, landing you right back at unfair scheduling.

https://github.com/php/frankenphp/pull/2016 if you want to see a “correctly behaving” implementation that becomes 100% cpu usage under contention.

My usecase was building an append-only blob store with mandatory encryption, but using a semaphore + direct goroutine calls to limit background write concurrency instead of a channel + dedicated writer goroutines was a net win across a wide variety of write sizes and max concurrent inflight writes. It is interesting that frankenphp + caddy came up with almost the same conclusion despite vastly different work being done.

Rust is the older project of the two, kicking off in 2006. Go, which set sail in 2007, duplicating the work of Rust would have been pointless. We already had Rust.

Go's objective was to become a faster Python. Which was something we also desperately needed at the time, and it has well succeeded on that front. Go has largely replaced all the non-data science things people were earlier doing with Python.

If you have a system that requires predictable latency, go is not the right language for it.

> Thus, they’ll probably never fix it.

I'm sorry you had a bad experience with Go. What makes you say this? Have you filed an issue upstream yet? If not, I encourage you to do so. I can't promise it'll be fixed or delved into immediately, but filing detailed feedback like this is really helpful for prioritizing work.

> If you have a system that requires predictable latency, go is not the right language for it.

Having a garbage collector already make this the case, it is a known trade off.

“It’s a problem that only go can solve”

I had this discussion a decade ago and concluded that a reasonable fair scheduler could be built on top of the go runtime scheduler by gating the work presented. The case was be made that the application is the proper, if not only, place to do this. Other than performance, if you encountered a runtime limitation then filing an issue is how the Go community moves forward.

It misses having a custom scheduler option, like Java and .NET runtimes offer, unfortunely that is too many knobs for the usual Go approach to language design.

Having a interface for how it is supposed to behave, a runtime.SetScheduler() or something, but it won't happen.

> If you have a system that requires predictable latency, go is not the right language for it.

I presume that's by design, to trade off against other things google designed it for?

> No matter what load you have, P99 and especially P99.9 latency will be higher than any other language

I strongly call BS on that.

Strong claim and evidence seems to be a hallucination in your own head.

There are several writeups of large backends ported from node/python/ruby to Go which resulted in dramatic speedups, including drop in P99 and P99.9 latencies by 10x

That's empirical evidence your claim is BS.

What exactly is so unfair about Go scheduler and what do you compare it to?

Node's lack of multi-threading?

Python's and Ruby's GIL?

Just leaving this to OS thread scheduler which, unlike Go, has no idea about i/o and therefore cannot optimize for it?

Apparently the source of your claim is https://github.com/php/frankenphp/pull/2016

Which is optimizing for a very specific micro-benchmark of hammering std-lib http server with concurrent request. Which is not what 99% of go servers need to handle. And is exercising way more than a scheduler. And is not benchmarking against any other language, so the sweeping statement about "higher than any other language" is literally baseless.

And you were able to make a change that trades throughput for P99 latency without changing the scheduler, which kind of shows it wasn't the scheduler but an interaction between a specific implementation of HTTP server and Go scheduler.

And there are other HTTP servers in Go that focus on speed. It's just 99.9% of Go servers don't need any of that because the baseline is 10x faster than python/ruby/javascript and on-par with Java or C#.

> If you have a system, go is not the right language for it.

FTFY

https://m.youtube.com/watch?v=-K11rY57K7k - Dmitry Vyukov — Go scheduler: Implementing language with lightweight concurrency

This one notably also explains the design considerations for golangs M:N:P in comparison to other schemes and which specific challenges it tries to address.

And by Dmitry himself.

If the server cannot keep up with the given workload because of some bottleneck (CPU, network, disk IO), then it cannot guarantee any response times - incoming queries will be either rejected or queued in a long wait queue, which will lead to awfully big response times. This doesn't depend on the programming language or the framework the server written in.

If you want response time guarantees, make sure the server has enough free resources for processing the given workload.

The compiler generates code to spill arguments to the stack at synchronous preemption points (function entry). Signal-based preemption has a spill path that saves the full ABI register set.

Good videos, thanks for sharing!

> Thus, they’ll probably never fix it.

“It’s a problem that only go can solve”

> If you have a system that requires predictable latency, go is not the right language for it.

Having a garbage collector already make this the case, it is a known trade off.

This may have been practically true for a long time, but as Java's ZGC garbage collector proves, this is not a hard truth.

You can have world pauses that are independent of heap size, and thus predictable latency (of course, trading off some throughput, but that is almost fundamental)

Not really, it is a matter of having the right implementation.

- https://www.ptc.com/en/products/developer-tools/perc

- https://www.aicas.com/products-services/jamaicavm

- https://www.azul.com/products/prime

Not all GCs are born alike.

Nim's GC is deterministic when you need it.

It misses having a custom scheduler option, like Java and .NET runtimes offer, unfortunely that is too many knobs for the usual Go approach to language design.

Having a interface for how it is supposed to behave, a runtime.SetScheduler() or something, but it won't happen.

I find it hard to believe the people who built Go, coming from designing Plan 9 and Inferno, would build a language where it is difficult to swap out a component.

I have this feeling that in their quest to make Go simple, they added complexity in other areas. Then again, this was built at Google, not Bell Labs so the culture of building absurdly complex things likely influenced this.

Rust is the older project of the two, kicking off in 2006. Go, which set sail in 2007, duplicating the work of Rust would have been pointless. We already had Rust.

If you saw the early presentations, they complained about the slow compile times and high complexity of C++. It seems that they were targeting that, not Python.

No. Eventually the queues get full and go routines pause waiting to place the element onto the queue, landing you right back at unfair scheduling.

https://github.com/php/frankenphp/pull/2016 if you want to see a “correctly behaving” implementation that becomes 100% cpu usage under contention.

fair point on blocking sends — but that's an implementation detail, not a structural one.

From my pov, the worker pool's job isn't to absorb saturation. it's to make capacity explicit so the layer above can route around it. a bounded queue that returns ErrQueueFull immediately is a signal, not a failure — it tells the load balancer to try another instance.

saturation on a single instance isn't a scheduler problem, it's a provisioning signal. the fix is horizontal, not vertical. once you're running N instances behind something that understands queue depth, the "unfair scheduler under contention" scenario stops being reachable in production — by design, not by luck.

the FrankenPHP case looks like a single-instance stress test pushed to the limit, which is a valid benchmark but not how you'd architect for HA.

If you want response time guarantees, make sure the server has enough free resources for processing the given workload.

> If you have a system, go is not the right language for it.

FTFY

> If you have a system that requires predictable latency, go is not the right language for it.

I presume that's by design, to trade off against other things google designed it for?

No clue. All I know is that people complain about it every time they benchmark.

> No matter what load you have, P99 and especially P99.9 latency will be higher than any other language

I strongly call BS on that.

Strong claim and evidence seems to be a hallucination in your own head.

There are several writeups of large backends ported from node/python/ruby to Go which resulted in dramatic speedups, including drop in P99 and P99.9 latencies by 10x

That's empirical evidence your claim is BS.

What exactly is so unfair about Go scheduler and what do you compare it to?

Node's lack of multi-threading?

Python's and Ruby's GIL?

Just leaving this to OS thread scheduler which, unlike Go, has no idea about i/o and therefore cannot optimize for it?

Apparently the source of your claim is https://github.com/php/frankenphp/pull/2016

"There are several writeups of large backends ported from node/python/ruby to Go which resulted in dramatic speedups, including drop in P99 and P99.9 latencies by 10x"

But that's not comparing apples to apples. When you get a dramatic speedup, you will also see big drops in the P99 and P99.9 latencies because what stressed out the scripting language is a yawn to a compiled language. Just going from stressed->yawning will do wonders for all your latencies, tail latencies included.

That doesn't say anything about what will happen when the load increases enough to start stressing the compiled language.

Do I need to share the TLA+ spec that shows its unfair? Or do you have any actual proof to your claims?

this makes sense for your workload, but may the right primitive be a function of your payload profile and business constraints ?

in my case the problem doesn't arise because control plane and data plane are separated by design — metadata and signals never share a concurrency primitive with chunk writes. the data plane only sees chunks of similar order of magnitude, so a fixed worker pool doesn't overprovision on small payloads or stall on large ones.

curious whether your control and data plane are mixed on the same path, or whether the variance is purely in the blob sizes themselves.

if it's the latter: I wonder if batching sub-1MB payloads upstream would have given you the same result without changing the concurrency primitive. did you have constraints that made that impractical?

This may have been practically true for a long time, but as Java's ZGC garbage collector proves, this is not a hard truth.

You can have world pauses that are independent of heap size, and thus predictable latency (of course, trading off some throughput, but that is almost fundamental)

Not really, it is a matter of having the right implementation.

- https://www.ptc.com/en/products/developer-tools/perc

- https://www.aicas.com/products-services/jamaicavm

- https://www.azul.com/products/prime

Not all GCs are born alike.

And by Dmitry himself.

Nim's GC is deterministic when you need it.

I find it hard to believe the people who built Go, coming from designing Plan 9 and Inferno, would build a language where it is difficult to swap out a component.

The same people refused to support generics for several years, and the current design still has some issues to iron out.

Go also lacks some of Limbo features, e.g. plugin package is kind of abandoned. Thus even though dynamic loading is supported, it is hardly usable.

If you saw the early presentations, they complained about the slow compile times and high complexity of C++. It seems that they were targeting that, not Python.

I did see the early presentations. And since you did too, you will recall that one of the primary priorities was for it to "feel like a dynamically-typed language". You know, because it was trying to be a faster Python.

What you might be confusing that with is that their assumption was that Google services were written in C++ because those services needed C++ performance, not because the developers wanted to write code in C++, and that those C++ developers would jump at the chance to use a Python-like language that still satisfies them performance-wise. It turns out they were wrong — the developers actually did want to write C++ — but you can understand the thinking when Google was already using Python heavily in less performance-critical areas. Guido van Rossum himself was even on the payroll at the time.

For what it is worth, Google did create "Rust" after learning that a faster Python doesn't satisfy C++ developers. It's called Carbon. But it is telling that the earlier commenter has never heard of it, and it is unlikely it will ever leave the heap of esoteric languages because duplicating Rust was, and continues to be, pointless. We already had Rust.

fair point on blocking sends — but that's an implementation detail, not a structural one.

the FrankenPHP case looks like a single-instance stress test pushed to the limit, which is a valid benchmark but not how you'd architect for HA.

No clue. All I know is that people complain about it every time they benchmark.

"There are several writeups of large backends ported from node/python/ruby to Go which resulted in dramatic speedups, including drop in P99 and P99.9 latencies by 10x"

That doesn't say anything about what will happen when the load increases enough to start stressing the compiled language.

Do I need to share the TLA+ spec that shows its unfair? Or do you have any actual proof to your claims?

It would be helpful for you to share a link to the Github issue you created. If the TLA+ spec you no doubt put a lot of time into creating is contained there, that would be additionally amazing, but more relevant will be the responses from the maintainers so that we're not stuck with one side of the story.

Of course, expecting you to provide the link would be incredibly onerous. We can look it up ourselves just as easy as you can. Well, in theory we can. The only trouble is that I cannot find the issue you are talking about. I cannot find any issues in the Go issue tracker from your account.

So, in the interest of good faith, perhaps you can help us out this one time and point us in the right direction?

this makes sense for your workload, but may the right primitive be a function of your payload profile and business constraints ?

curious whether your control and data plane are mixed on the same path, or whether the variance is purely in the blob sizes themselves.

In my case, "background writes" literally means "do the io.WriteAt for this fixed-size buffer in another goroutine so that the one servicing the blob write can get on with encryption / CRC calculation / stuffing the resulting byte stream into fixed-size buffers". Handling it that way lets me keep the IO to the kernel as saturated as possible without the added schedule + mutex overhead sending stuff thru a channel incurs, while still keeping a hard upper bound on IO in flight (max semaphore weight) and write buffer allocations (sync.Pool). My fixed-size buffers are 32k, and it is a net win even there.

You can run it for fixed timeslices.

The same people refused to support generics for several years, and the current design still has some issues to iron out.

Go also lacks some of Limbo features, e.g. plugin package is kind of abandoned. Thus even though dynamic loading is supported, it is hardly usable.

So, in the interest of good faith, perhaps you can help us out this one time and point us in the right direction?

I’m not interested in contributing to go. I tried once, was basically ignored. I have contributed to issues there where it has impacted projects I’ve worked on. But even then, it didn’t feel collaborative; mostly felt like dealing with a tech support team instead of other developers.

That being said, I love studying go and learning how to use it to the best of my ability because I work on sub-ųs networking in go.

When I get home, I’ll dig it up. But if you think it’s a fair scheduler, I invite you to just think about it on a whiteboard for a few minutes. It’s nowhere near fair and should be self-evident from first principles alone.

right — no variance, question was off target. worth noting though: the sema-bounded WriteAt goroutines are structurally a fan-out over homogeneous units, even if the pipeline feels linear from the blob's perspective. that's probably why the channel adds nothing — no fan-in, no aggregation, just bounded fire-and-forget.

You can run it for fixed timeslices.

That being said, I love studying go and learning how to use it to the best of my ability because I work on sub-ųs networking in go.

Here’s a much better write up than I’m willing to do: https://www.cockroachlabs.com/blog/rubbing-control-theory/

There are also multiple issues about this on GitHub.

And an open issue that is basically been ignored. golang/go#51071

Like I said. Go won’t fix this because they’ve optimized for throughput at the expense of everything else, which means higher tail latencies. They’d have to give up throughput for lower latency.

Here’s a much better write up than I’m willing to do: https://www.cockroachlabs.com/blog/rubbing-control-theory/

There are also multiple issues about this on GitHub.

And an open issue that is basically been ignored. golang/go#51071

> And an open issue that is basically been ignored. golang/go#51071

It doesn't look ignored to me. It explains that the test coverage is currently poor, so they are in a terrible position of not being able to make changes until that is rectified.

The first step is to improve the test coverage. Are you volunteering? AI isn't at a point where it is going to magically do it on its own, so it is going to take a willing human hand. You do certainly appear to be the perfect candidate, both having the technical understanding and the need for it.

> And an open issue that is basically been ignored. golang/go#51071

It doesn't look ignored to me. It explains that the test coverage is currently poor, so they are in a terrible position of not being able to make changes until that is rectified.

Heh. I've had my fair share of mailing list drama. This is political AND technical. Someone saying "let’s cut throughput" is going to get shot down fast, no matter the technical merit. If someone with the political clout were to be willing to champion the work and guide the discussion appropriately while someone like me does the work, that's different. That's at least how things like this are done in other communities, unless go is different.

> If someone with the political clout were to be willing to champion the work and guide the discussion appropriately while someone like me does the work, that's different.

There is unlikely anyone on the Go team with more political clout in this particular area than the one who has already reached out to you. You obviously didn't respond to him publicly, but did he reject your offer in private? Or are you just imaging some kind of hypothetical scenario where they are refusing to talk to you, despite evidence to the contrary?

> If someone with the political clout were to be willing to champion the work and guide the discussion appropriately while someone like me does the work, that's different.

> You obviously didn't respond to him publicly, but did he reject your offer in private?

I literally have no idea what you're talking about here.

> You obviously didn't respond to him publicly, but did he reject your offer in private?

I literally have no idea what you're talking about here.

You must not have read all the comments yet? One of Go's key runtime maintainers sent you a message. Now is your opportunity to give him your plan so that he can give you the political support you seek.

In the previous article we explored how Go’s memory allocator manages heap memory — grabbing large arenas from the OS, dividing them into spans and size classes, and using a three-level hierarchy (mcache, mcentral, mheap) to make most allocations lock-free. A key detail was that each P (processor) gets its own memory cache. But we never really explained what a P is, or how the runtime decides which goroutine runs on which thread. That’s the scheduler’s job, and that’s what we’re exploring today.

The scheduler is the piece of the runtime that answers a deceptively simple question: which goroutine runs next? You might have hundreds, thousands, or even millions of goroutines in your program, but you only have a handful of CPU cores. The scheduler’s job is to multiplex all those goroutines onto a small number of OS threads, keeping every core busy while making sure no goroutine gets starved.

If you’ve ever used goroutines and channels, you’ve already benefited from the scheduler without knowing it. Every go statement, every channel send and receive, every time.Sleep—they all interact with the scheduler. Let’s see how it works.

Let’s start with the fundamental building blocks — the three structures that the entire scheduler is built around.

The GMP Model

The scheduler is built around three concepts, commonly called the GMP model: G (goroutine), M (machine/OS thread), and P (processor). We touched on these during the bootstrap article, but now let’s look at them properly.

Let’s look at each one.

G — Goroutine

A G is a goroutine — the Go runtime’s representation of a piece of concurrent work. Every time you write go f(), the runtime creates (or reuses) a G to track that function’s execution.

What does a G actually carry? The struct has a lot of fields, but the ones I think are most useful for understanding how it works are: a small stack (starting at just 2KB), some saved registers (stack pointer, program counter, etc.) so the scheduler can pause it and resume it later, a status field that tracks what the goroutine is doing (running, waiting, ready to run), and a pointer to the M currently running it. The full struct in src/runtime/runtime2.go has a lot more — fields for panic and defer handling, GC assist tracking, profiling labels, timers, and more.

Compare that to an OS thread, which typically starts with a 1–8MB stack and carries a lot of kernel state. A goroutine is dramatically lighter — that’s why you can have millions of them in a single program. An OS thread? You’ll start feeling the pressure at a few thousand.

So goroutines are the work. But someone has to actually execute that work — the CPU doesn’t know what a goroutine is. It only knows how to run threads.

M — Machine (OS Thread)

An M (defined in src/runtime/runtime2.go ) is an OS thread — the thing that actually executes code. The scheduler’s job is to put goroutines onto Ms so they can run.

Every M has two goroutine pointers that are worth knowing about. The first is curg — the user goroutine currently running on this thread. That’s your code. The second is g0 — and every M has its own. g0 is a special goroutine that’s reserved for the runtime’s own housekeeping — scheduling decisions, stack management, garbage collection bookkeeping. It has a much larger stack than regular goroutines: typically 16KB, though it can be 32KB or 48KB depending on the OS and whether the race detector is enabled. Unlike regular goroutines, the g0 stack doesn’t grow — it’s fixed at allocation time, so it has to be big enough upfront to handle whatever the runtime needs to do. When the scheduler needs to make a decision (which goroutine to run next, how to handle a blocking operation), it switches from your goroutine to this M’s g0 to do that work. Think of g0 as the M’s “manager mode” — it runs the scheduling logic, then hands control back to a user goroutine.

An M also has a pointer to the P it’s currently attached to. This is important: without a P, an M can’t run Go code. It’s just an idle OS thread sitting there doing nothing. Why does an M need a P at all?

P — Processor

This is the clever part of the design. A P (defined in src/runtime/runtime2.go ) is not a CPU core and it’s not a thread — it’s a scheduling context. Think of it as a workstation: it has everything a goroutine needs to run efficiently, and an M has to sit down at one before it can do any real work.

Why not just let Ms run goroutines directly? The problem is system calls. When an M enters the kernel, the entire OS thread blocks — and if all the scheduling resources were attached to the M, they’d be stuck too. The run queue, the memory cache, everything would be frozen until the syscall returns. By putting all of that on a separate P, the runtime can detach the P from a blocked M and hand it to a free one. The work keeps moving even when a thread is stuck.

So each P carries its own local run queue — a list of up to 256 goroutines that are ready to run. It also has a runnext slot, which is like a fast-pass for the very next goroutine to execute. There’s a gFree list where finished goroutines are kept around so they can be recycled instead of allocated from scratch. It even carries its own mcache — the per-P memory cache we saw in the memory allocator article. And because each P has its own copy of all this stuff, the threads using it don’t need to fight over shared locks all the time — that’s a nice bonus.

The number of Ps is controlled by GOMAXPROCS, which defaults to the number of CPU cores. So on an 8-core machine, you have 8 Ps, meaning at most 8 goroutines can truly run in parallel at any moment. But you can have far more Ms than Ps — some might be blocked in system calls while others are actively running goroutines. The key is that only GOMAXPROCS of them can be running Go code at any given time.

This decoupling is the heart of the scheduler’s design, and we’ll see why it matters so much as we go through the rest of the article.

So we have Gs, Ms, and Ps — but somebody needs to keep track of all of them. That’s the schedt struct.

The Scheduler State (schedt)

The schedt struct (defined in src/runtime/runtime2.go ) is the global scheduler state. There’s exactly one instance of it — a global variable called sched — and it holds everything that doesn’t belong to any specific P or M. Think of it as the shared bulletin board that the Ps and Ms check when they need to coordinate.

What lives there? First, the global run queue (runq) — a linked list of goroutines that aren’t in any P’s local queue. These are goroutines that overflowed from a full local queue, or that came back from a system call and couldn’t find a P. There’s also a global free list (gFree) of dead goroutines waiting to be recycled — when a P’s local free list runs out, it refills from here, and when a P has too many dead goroutines, it dumps some back. The same two-level pattern we saw in the memory allocator: local caches for the fast path, shared pool as backup.

Then there are the idle lists. When a P has no M running it, it goes on the pidle list. When an M has no work and no P, it goes on the midle list and sleeps. The scheduler also tracks how many Ms are currently spinning (looking for work) in nmspinning — we’ll explain what spinning means later in the article — and whether the GC is requesting a stop-the-world pause in gcwaiting. All of this shared state is protected by sched.lock — but the lock is designed to be held very briefly, because the hot path (picking a goroutine from a local queue) doesn’t touch schedt at all.

Beyond schedt, the runtime keeps master lists of every G, M, and P that has ever been created — the global variables allgs, allm, and allp. These aren’t used for scheduling decisions. They exist so the runtime can find everything when it needs to do something global, like scanning all goroutine stacks during garbage collection or checking for stuck system calls in sysmon.

Here’s the full picture:

Go Scheduler Diagram

Now that we’ve set the stage, it’s time to see the actors in action. Let’s follow a goroutine through its lifetime and see how it moves across this battlefield.

The Life of a Goroutine

Let’s follow the life of a goroutine from birth to death — and sometimes back again. The states are defined in src/runtime/runtime2.go , but rather than listing them, let’s walk through the story.

Birth: Creation and First Steps

It starts when you write go f(). The compiler turns this into a call to newproc() (in src/runtime/proc.go ), and the runtime needs a G struct to represent this new goroutine. But it doesn’t necessarily allocate one from scratch — first, it checks the current P’s local free list of dead goroutines. If there’s one available, it gets recycled, stack and all. If the local list is empty, it tries to grab a batch from the global free list in schedt. Only if both are empty does the runtime allocate a new G with a fresh 2KB stack. This reuse is why goroutine creation is so cheap — most of the time, it’s just pulling a G off a list and reinitializing a few fields.

If the G was recycled from the free list, it’s already in _Gdead state — that’s where goroutines go when they finish. If it was freshly allocated, it starts in _Gidle (a blank struct, never used before) and immediately transitions to _Gdead. Either way, the G is in _Gdead before setup begins. Wait — dead already? Yes, but only technically. _Gdead means “not in use by the scheduler” — it’s the state for goroutines that are either being set up or finished and waiting for reuse. The runtime uses it as a safe “parked” state while it configures the G’s internals.

During initialization, the runtime prepares the goroutine so it’s ready to run. It sets the stack pointer to the top of its stack, points the program counter at your function so it knows where to start executing, and places a return address pointing to goexit — the goroutine cleanup handler. This way, when your function finishes and returns, execution naturally lands in goexit without needing any special “is it done?” check.

Once setup is complete, the G moves to _Grunnable and goes into the current P’s runnext slot, displacing whatever was there before. This means the new goroutine will run very soon — right after the current goroutine yields.

Now the goroutine is alive — sitting on a run queue, ready to execute, just waiting for an M to pick it up.

Running

When the scheduler picks this G off the queue, it transitions to _Grunning. This is the active state — the goroutine is executing your code on an M, with a P. This is where it spends its productive time.

But goroutines rarely run straight through to completion. At some point, something will interrupt the flow, and what happens next depends on why the goroutine stopped. This is where the story branches.

Blocking and Unblocking

Maybe the goroutine tries to receive from an empty channel, or acquire a locked mutex, or sleep. Here’s a detail that might surprise you: there’s no external “scheduler thread” that swoops in and parks the goroutine. The goroutine parks itself.

Let’s say your goroutine does <-ch on an empty channel. The channel implementation sees there’s nothing to receive, so it calls gopark() to park the goroutine until a value arrives. The goroutine switches to the g0 stack, changes its own status to _Gwaiting, and adds itself to the channel’s wait queue. After that, it’s gone from the scheduler’s perspective — not on any run queue, just sitting on the channel’s internal wait list. The M doesn’t go to sleep though. It calls schedule() and picks up the next goroutine. From the M’s point of view, one goroutine parked and another one started running — the M stayed busy the whole time.

gopark() also records why the goroutine is blocking — channel receive, mutex lock, sleep, select, and so on. This is what shows up when you look at goroutine dumps or profiling data, so you can tell exactly what each goroutine is waiting for.

Now for the other side: what happens when the thing the goroutine was waiting for finally happens? Say another goroutine sends a value on that channel. The sender finds our goroutine on the channel’s wait queue, copies the value directly to it, and calls goready(). This changes the goroutine’s status back to _Grunnable and places it in the sender’s runnext slot — meaning it’ll run very soon, right after the sender yields. This runnext placement creates a tight back-and-forth between producer and consumer goroutines. G1 sends, G2 receives and runs immediately, G2 sends back, G1 receives and runs immediately — almost like coroutines handing off to each other, with minimal scheduling overhead.

System Calls

Blocking on channels and mutexes is one thing — the goroutine parks, but the M and P stay free. System calls are a different beast, because they block the entire OS thread.

When a goroutine makes a system call — reading a file, accepting a network connection, anything that enters the kernel — the entire OS thread blocks. Before entering the kernel, the goroutine calls entersyscall(), which saves its context and changes its status to _Gsyscall. But here’s an important detail: the M doesn’t give up its P. It keeps it. Why? Because most system calls are fast — a few microseconds — and the goroutine will come back and keep running on the same P as if nothing happened. No locks, no coordination, no overhead.

But as soon as the goroutine is in _Gsyscall, it’s in danger of losing its P. If the system call takes too long, sysmon can come along and retake the P — detach it from the blocked M and hand it to another thread so the goroutines in its run queue keep running. This is where the G-M-P decoupling really pays off: the thread is stuck in the kernel, but the work moves on.

When the system call finishes, the goroutine checks whether it still has its P. If it does — great, keep going. If sysmon took it, the goroutine tries to grab any idle P. And if there are no idle Ps at all, it puts itself on the global run queue and waits to be picked up. We’ll cover sysmon in more detail in a following article.

So far we’ve seen goroutines block voluntarily — on channels, mutexes, and system calls. But there’s something more subtle happening behind the scenes every time a goroutine calls a function.

Stack Growth

There’s another thing that can happen while a goroutine is running: it can run out of stack space. Go goroutines start with a tiny 2KB stack, and unlike OS threads, they don’t get a fixed-size stack upfront. Instead, the compiler inserts a small check called the stack growth prologue at the beginning of most functions. This check compares the current stack pointer against the stack limit — if there’s not enough room for the next function call, the runtime steps in.

When that happens, the runtime allocates a new, larger stack (typically double the size), copies the old stack contents over, adjusts all the pointers that reference stack addresses, and frees the old stack. The goroutine then continues running on its new, bigger stack as if nothing happened. This is what allows Go to run millions of goroutines — they start small and only grow when they actually need the space.

This stack check is worth mentioning here because, as we’ll see in the next section, the scheduler piggybacks on it for cooperative preemption.

Preemption

The goroutine might also be stopped involuntarily. Everything we’ve seen so far — blocking on channels, making system calls, finishing — involves the goroutine cooperating. But what if a goroutine never yields? A tight computational loop without any function calls, channel operations, or memory allocations would never give the scheduler a chance to run anything else on that P.

Go has two answers. The first is cooperative preemption: the compiler inserts a small check at the beginning of most functions that tests whether the goroutine has been asked to yield. When the runtime wants to preempt a goroutine, it flips a flag, and the next function call triggers the check and hands control back to the scheduler. This is cheap — it reuses the stack growth check that’s already there — but it only works at function calls.

The second is asynchronous preemption: for goroutines stuck in tight loops with no function calls, the runtime sends an OS signal (SIGURG on Unix) directly to the thread. The signal handler interrupts the goroutine, saves its context, and yields to the scheduler. This is the heavy hammer — it works even when cooperative preemption can’t.

In both cases, the preempted goroutine transitions directly to _Grunnable and goes back on a run queue — it’ll get another chance to run soon. There’s also a special _Gpreempted state, but that’s reserved for when the GC or debugger needs to fully suspend a goroutine via suspendG. In either case, it’s sysmon that detects goroutines running too long (more than 10ms) and triggers the preemption. We’ll explore the details in the system monitor article.

Death and Recycling

Finally, the goroutine’s function returns. Remember that the PC was set up to point at goexit during creation? So the return falls through to goexit0(), and the goroutine handles its own death. It changes its own status to _Gdead, cleans up its fields, drops the M association, and puts itself on the P’s free list. Then it calls schedule() to find the next goroutine for this M.

The G isn’t freed or garbage collected. It sits on the free list, stack and all, waiting to be recycled. This is a key optimization — allocating and setting up a new G is much more expensive than reinitializing a dead one. And this is where the story comes full circle: a new go statement might pull this same G off the free list, reinitialize it, and send it through the whole journey again.

The Self-Service Pattern

There’s a pattern running through all of these stages: the goroutine is always the one doing the work of its own state transitions. There’s no central scheduler thread pulling the strings — the goroutine parks itself, adds itself to wait queues, cleans itself up, and invokes the scheduler to pick the next G. The scheduler is really just a set of functions that goroutines call on themselves, using the M’s g0 stack to do the bookkeeping.

Most goroutines spend their lives bouncing between _Grunnable, _Grunning, and _Gwaiting — ready, running, waiting, ready, running, waiting — until they finally finish and return to _Gdead.

With the data structures and states in place, let’s look at the core algorithm — the loop that drives everything.

The Scheduling Loop

Now for the heart of the scheduler: the schedule() function (in src/runtime/proc.go ). This is a loop that runs on every M, on the g0 stack, and its job is simple: find a runnable goroutine and execute it. When the goroutine stops running (it blocks, finishes, or gets preempted), control returns to schedule(), and the loop starts again.

Here’s the rough shape:

Go Scheduler Loop

The goroutine runs until it yields control back to the scheduler—either voluntarily (by blocking on a channel, calling runtime.Gosched(), etc.) or involuntarily (via preemption). Then we’re back at schedule(), looking for the next goroutine.

The schedule() function itself is straightforward. It checks a few special cases (is this M locked to a specific goroutine?), and then calls findRunnable() to get the next goroutine. Once it has one, it calls execute() to run it.

The interesting part is findRunnable()—that’s where all the decisions happen. Let’s break down exactly how it searches for work.

Finding Work: The Search Order

findRunnable() (in src/runtime/proc.go ) is the function that answers “what should I run next?” It searches multiple sources in a specific order, and it keeps looking until it finds something — if there’s truly nothing to do, it parks the M to sleep until work appears, and then resumes the search.

Here’s the search order:

1. GC and Trace Work

Before looking for user goroutines, the scheduler checks if there’s runtime work to do. If the GC is active and needs a mark worker, that takes priority. If execution tracing is enabled and its reader goroutine is ready, that also takes priority. The runtime’s own needs come first.

2. The Global Queue Fairness Check

Every 61st schedule call, the scheduler grabs a single goroutine from the global run queue before looking at the local queue. Why 61? It’s a prime number, which helps avoid synchronization patterns where the check always lines up with the same goroutine. The point is to prevent starvation: if goroutines are constantly being added to local queues, the ones sitting in the global queue could wait forever without this check.

3. The Local Run Queue

This is the fast path, and where most goroutines come from. The scheduler first checks the runnext slot—a priority position that holds the single goroutine most likely to run next. If runnext is set, the goroutine gets it and inherits the current time slice, meaning it doesn’t reset the scheduling tick. This is an optimization for producer-consumer patterns: if G1 sends on a channel and wakes G2, G2 goes into runnext and runs immediately, almost like a direct handoff.

If runnext is empty, the scheduler takes from the ring buffer—a lock-free circular queue of up to 256 goroutines. Only the owning M writes to this queue (single producer), so no locks are needed for the common case.

4. The Global Run Queue (Again)

If the local queue is empty, check the global queue. This time, instead of grabbing just one goroutine, the scheduler grabs a batch. This amortizes the cost of acquiring the global lock (sched.lock). One lock acquisition, many goroutines.

5. Network Polling

Before resorting to stealing, the scheduler checks the netpoller to see if any network I/O is ready. If any goroutines were blocked waiting for network operations and those operations are now complete, those goroutines become runnable. We’ll talk about how the netpoller works in a future article.

6. Work Stealing

If all the above came up empty, it’s time to steal. The scheduler looks at other Ps’ local queues and takes half of their goroutines. This is the mechanism that keeps all cores busy even when work is unevenly distributed.

7. Last Resort: Park

If there’s truly nothing to do anywhere—no local work, no global work, no network I/O, nothing to steal—the M releases its P, puts it on the idle P list, and parks itself to sleep. It will be woken up later when new work appears.

But that “parking” decision isn’t as straightforward as it sounds. Should a thread go to sleep the moment it runs out of work, or should it hang around for a bit in case something shows up?

Spinning Threads

There’s a subtle balance to strike here. When a thread runs out of work — its local queue is empty, there’s nothing to steal — should it go to sleep immediately? If it does, and new work arrives a microsecond later, there’s nobody awake to pick it up. Another thread has to be woken from sleep, which costs time. On the other hand, if too many idle threads stay awake burning CPU cycles looking for work that isn’t there, that’s pure waste.

Go’s answer is spinning threads. When an M runs out of work, it doesn’t park right away. Instead, it enters a spinning state — actively checking queues and trying to steal — for a brief period before giving up and going to sleep. The runtime limits the number of spinners to at most half the number of busy Ps — so on an 8-core machine with 6 busy Ps, up to 3 threads can spin at once. Enough to be responsive, not so many that they waste CPU.

The other side of the coin is when new work appears — say a new goroutine is created or a channel unblocks. The runtime is even more conservative here: it only wakes up a sleeping thread if there are zero spinners. If there’s already a spinning thread out there, it’ll pick up the new work. The goal is simple: always have someone ready to grab new work, but not too many someones.

All of these mechanisms — blocking, unblocking, system calls, preemption — involve switching from one goroutine to another. Let’s look at what that switch actually costs.

Context Switching

Let’s talk briefly about what happens during a goroutine context switch, because it’s what makes the whole system fast.

When the scheduler switches from one goroutine to another, it needs to save where the current goroutine was and restore where the next one left off. The good news is that a goroutine’s state is surprisingly small. The mcall() assembly function only saves 3 values — the stack pointer, the program counter, and the base pointer — into a tiny gobuf struct. That’s it. Why so few? Because goroutine switches happen at function call boundaries, and at those points the compiler has already spilled any important registers to the stack following normal calling conventions. The switch only needs to save enough to find the stack again.

gogo() does the opposite: it restores those saved values and jumps right into the goroutine. Together, mcall() and gogo() are the mechanism behind every voluntary goroutine switch. For async preemption (where the goroutine is interrupted mid-execution by a signal), the full register set has to be saved — but that’s the exception, not the common path.

And it’s fast. A goroutine context switch takes roughly 50–100 nanoseconds — about 200 CPU cycles. Compare that to an OS thread context switch, which involves saving the full register set and switching kernel stacks — that costs 1–2 microseconds, 10 to 40 times slower. This is a big part of why goroutines scale so much better than threads.

Let’s wrap up what we’ve learned.

Summary

The Go scheduler multiplexes goroutines onto OS threads using the GMP model: Gs (goroutines) are the work, Ms (OS threads) provide the execution, and Ps (processors) carry the scheduling context — local run queues, memory caches, and everything needed to run goroutines efficiently. The global schedt struct ties it all together with shared state like the global run queue, idle lists, and the spinning thread count.

We followed a goroutine through its whole life — from creation (recycling dead Gs when possible), through running, blocking (where the goroutine parks itself), system calls (where the P detaches so other goroutines keep running), stack growth, and preemption (both cooperative and asynchronous). At the end, the goroutine cleans up after itself and goes back on the free list for reuse.

The scheduling loop in schedule() and findRunnable() drives it all — checking the local queue, the global queue for fairness every 61 ticks, the netpoller, and stealing from other Ps before giving up. Spinning threads keep the system responsive by staying awake briefly to catch new work, and context switching between goroutines costs only about 50–100 nanoseconds thanks to the small amount of state involved.

If you want to explore the implementation yourself, the main scheduler code lives in src/runtime/proc.go , with data structures in src/runtime/runtime2.go and assembly routines in src/runtime/asm_*.s.

In the next article, we’ll look at the garbage collector — how it tracks which objects are still alive and reclaims the rest, all while your program keeps running.

I still have no idea what you are talking about.

I thought it was a simple question. You don't know if you have read the comments or not?

Hacker Times

Hacker Times

Understanding the Go Runtime: The Scheduler

Discussion

Discussion

The GMP Model

G — Goroutine

M — Machine (OS Thread)

P — Processor

The Scheduler State (schedt)

The Life of a Goroutine

Birth: Creation and First Steps

Running

Blocking and Unblocking

System Calls

Stack Growth

Preemption

Death and Recycling

The Self-Service Pattern

The Scheduling Loop

Finding Work: The Search Order

1. GC and Trace Work

2. The Global Queue Fairness Check

3. The Local Run Queue

4. The Global Run Queue (Again)

5. Network Polling

6. Work Stealing

7. Last Resort: Park

Spinning Threads

Context Switching

Summary