Python: The Optimization Ladder

    CPython 3.13 went further with an experimental copy-and-patch JIT compiler -- a lightweight JIT that stitches together pre-compiled machine code templates instead of generating code from scratch. It's not a full optimizing JIT like V8's TurboFan or a tracing JIT like PyPy's;

Good news. Python 3.15 adapts Pypy tracing approach to JIT and there are real performance gains now:

https://github.com/python/cpython/issues/139109

https://doesjitgobrrr.com/?goals=5,10

Great writeup.

I've been in the pandas (and now polars world) for the past 15 years. Staying in the sandbox gets most folks good enough performance. (That's why Python is the language of data science and ML).

I generally teach my clients to reach for numba first. Potentially lots of bang for little buck.

One overlooked area in the article is running on GPUs. Some numpy and pandas (and polars) code can get a big speedup by using GPUs (same code with import change).

Significant AI smell in this write up. As a result, my current reflex is to immediately stop reading. Not judgement on the actual analysis and human effort which went in. It’s just that the other context is missing.

> The real story is that Python is designed to be maximally dynamic -- you can monkey-patch methods at runtime, replace builtins, change a class's inheritance chain while instances exist -- and that design makes it fundamentally hard to optimize. ...

> 4 bytes of number, 24 bytes of machinery to support dynamism. a + b means: dereference two heap pointers, look up type slots, dispatch to int.__add__, allocate a new PyObject for the result (unless it hits the small-integer cache), update reference counts.

Would Python be a lot less useful without being maximally dynamic everywhere? Are there domains/frameworks/packages that benefit from this where this is a good trade-off?

I can't think of cases in strong statically typed languages where I've wanted something like monkey patching, and when I see monkey patching elsewhere there's often some reasonable alternative or it only needs to be used very rarely.

the optimization ladder is just the five stages of grief but for python developers. denial ("it's fast enough"), anger ("why is this so slow"), bargaining ("maybe if I use numpy"), depression ("I should rewrite this in rust"), acceptance ("actually cython is fine")

Python is perfect as a "glue" language. "Inner Loops" that have to run efficiently is not where it shines, and I would write them in C or C++ and patch them with Python for access to the huge library base.

This is the "two language problem" ( I would like to hear from people who extensively used Julia by the way, which claims to solve this problem, does it really ?)

Kudos for going through all the existing JIT approaches, instead of reaching for rewrite into X straight away.

However if Rust with PyO3 is part of the alternatives, then Boost.Python, cppyy, and pybind11 should also be accounted for, given their use in HPC and HFT integrations.

Surprised Python is only 21x slower than C for tree traversal stuff. In my experience that's one of the most painful places to use Python. But maybe that's because I use numpy automatically when simple arrays are involved, and there's no easy path for trees.

I wish there were more details on this part.

> Missing @cython.cdivision(True) inserts a zero-division check before every floating-point divide in the inner loop. Millions of branches that are never taken.

I thought never taken branches were essentially free. Does this mean something in the loop is messing with the branch predictor?

Missing: write static python and transpile to rust pyO3 which is at the top of the ladder.

Some nuance: try transpiling to a garbage collected rust like language with fast compilation until you have millions of users.

Also use a combination of neural and deterministic methods to transpile depending on the complexity.

Missing Muna[0][1], I'm curious how it would compare on these benchmarks.

[0]: https://www.muna.ai/ [1]: https://docs.muna.ai/predictors/create

I love how in an article about making python faster, the fastest option is to simply write Rust, lol

>The usual suspects are the GIL, interpretation, and dynamic typing. All three matter, but none of them is the real story. The real story is that Python is designed to be maximally dynamic -- you can monkey-patch methods at runtime, replace builtins, change a class's inheritance chain while instances exist -- and that design makes it fundamentally hard to optimize.

ok I guess the harder question is. Why isn't python as fast as javascript?

json.loads is something you don't want to use in a loop if you care for performance at all. Just simple using orjson can give you 3x speed without the need to change anything.

A personal opinion: I would much prefer to read the rough, human version of this article than this AI-polished version. I'm interested in the content and the author clearly put thought and effort into it, but I'm constantly thrown out of it by the LLM smell. (I'm also a bit mad that `--` is now on the em dash treadmill and will soon be unusable.)

I'm not just saying this to vent. I honestly wonder if we could eventually move to a norm where people publish two versions of their writing and allow the reader to choose between them. Even when the original is just a set of notes, I would personally choose to make my own way through them.

Great post saved it for when I need to optimize my python code

The replacement of emdashes with double hyphens here is almost insulting. A look through the blog history suggests that the author has no issue writing in English normally, and nothing seems really off about the actual findings here (or even the speculation about causes etc.), so I really can't understand the motivation for LLM-generated prose. (The author's usual writing style appears to have some arguable LLM-isms, but they make a lot more sense in context and of course those patterns had to come from somewhere. The overall effect is quite different.)

Edit: it's strange to get downvoted while also getting replies that agree with me and don't seem to object.

(Also, I thought it wasn't supposed to be possible to edit after getting a reply?)

What a great article!

Shockingly good article — correct identification of the root cause of performance issues being excessive dynamism and ranking of the solutions based on the value/effort ratio. Excellent taste. Will keep this in my back pocket as a quick Python optimization reference.

It's just somewhat unfortunate that I have to question every number and fact presented since the writing was clearly at least somewhat AI-assisted with the author seemingly not being upfront about that at all.

I must admit that I'm amused by the people who find the writeup useful but are turned off by the AI "smell". And look forward to the day when all valued content reeks of said "smell"; let's see what detractors-for-no-good-reason do then (yes I'm a bit ticked by the attitude).

> language slow

> looks inside

> the reference implementation of language is slow

Despite its content, this blogpost also pushes this exact "language slow" thinking in its preamble. I don't think nearly enough people read past introductions for that to be a responsible choice or a good idea.

The only thing worse than this is when Python specifically is outright taught (!) as an "interpeted language", as if an implementation-detail like that was somehow a language property. So grating.

    CPython 3.13 went further with an experimental copy-and-patch JIT compiler -- a lightweight JIT that stitches together pre-compiled machine code templates instead of generating code from scratch. It's not a full optimizing JIT like V8's TurboFan or a tracing JIT like PyPy's;

Good news. Python 3.15 adapts Pypy tracing approach to JIT and there are real performance gains now:

https://github.com/python/cpython/issues/139109

https://doesjitgobrrr.com/?goals=5,10

Now this is great to know.

While this is great, I expected faster CPython to eventually culminate into what YJIT for Ruby is. I'm not sure the current approaches they are trying will get the ecosystem there.

Now this is great to know.

Missing Muna[0][1], I'm curious how it would compare on these benchmarks.

[0]: https://www.muna.ai/ [1]: https://docs.muna.ai/predictors/create

Kudos for going through all the existing JIT approaches, instead of reaching for rewrite into X straight away.

However if Rust with PyO3 is part of the alternatives, then Boost.Python, cppyy, and pybind11 should also be accounted for, given their use in HPC and HFT integrations.

What a great article!

Great post saved it for when I need to optimize my python code

So much "is real". It is ok to check your grammar, but this is slopabetes inducing.

Here's what gave it away for me

> The remaining difference is noise, not a fundamental language gap. The real Rust advantage isn't raw speed -- it's pipeline ownership.

The author is from Turkey (where I’m also originally from).

Believe it or not, when you write a blog post in a different language, it really helps to use an LLM, even just to fix your grammar mistakes etc.

I assume that’s most likely what happened here too.

I didn't notice any signs of AI writing until seeing this comment and re-reading (though I did notice it on the second pass).

That said, I think this article demonstrates that focusing on whether or not an article used AI might be focusing on the wrong “problem.” I appreciate being sensitive to the "smell" (the number of low-effort, AI posts flying around these days has made me sensitive too), but personally, I found this article both (1) easy to read and (2) insightful. I think the number of AI-written content lacking (2) is the problem.

If we only applied the same reflex to software, even when 100% human programmed.

“The numbers are real.” But the voice is not.

I also seem to be developing an immune response to several slopisms. But the actual content is useful for outlining tradeoffs if you’re needing to make your Python code go faster.

I got the same sense, but nowadays I can't be sure whether a text is AI or the writer's style has absorbed LLM tropes.

I don't think it should be conflated with auto generated AI slop. I see a lot of snippets which were clearly manually written. I'm assuming the author used AI in a supervised manner, to smooth out the writing process and improve coherency.

I love how in an article about making python faster, the fastest option is to simply write Rust, lol

That has been a thing forever, many "Python" libraries, are actually bindings to C, C++ and Fortran.

The culture of calling them "Python" is one reason why JITs are so hard to gain adoption in Python, the problem isn't the dynamism (see Smalltalk, SELF, Ruby,...), rather the culture to rewrite code in C, C++ and Fortran code and still call it Python.

There's no surprise that Rust is faster to run, but I don't think there are many who would claim that Rust is faster to write.

Would Python be a lot less useful without being maximally dynamic everywhere? Are there domains/frameworks/packages that benefit from this where this is a good trade-off?

I've always thought the flexibility should allow python to consume things like gRPC proto files or OpenAPI docs and auto-generate the classes/methods at runtime as opposed to using codegen tools. But as far as I know, there aren't any libraries out there actually doing that.

There are some use cases for very dynamic code, like ORMs; with descriptors you can add attributes + behavior at runtime and it's quite useful. Anyways, breaking metaprogramming and more dynamic features would mean python 4 and we know how 2 -> 3 went. I also don't think it's where the core developers are going. Also also, there are other things I'd change before going after monkey patching like some scoping rules, mutable defaults in function attributes, better async ergonomics, etc.

I've used a library that patches the zipfile module to add support for zstd compression in zipfiles.

In python3.14 the support is there, but 2 years ago you could just import this library and it would just work normally.

Edit: it's strange to get downvoted while also getting replies that agree with me and don't seem to object.

(Also, I thought it wasn't supposed to be possible to edit after getting a reply?)

Yea while reading, I just didn't understand how you end up LLM writing the article? Clearly, the data and writeup are real. But, was it "edited" with an LLM? It looks closer to ~the entire thing being LLM written. I finished reading because the topic is interesting, but the LLM writing style is difficult to bear.. and I agree with your point that trying to fool us that it's human with `--` is just absurd

Same problems, same Apple M4 Pro, real numbers.

ok I guess the harder question is. Why isn't python as fast as javascript?

> ok I guess the harder question is. Why isn't python as fast as javascript?

Actually there is a pretty easy answer: worldwide, the amount of javascript being evaluated every day is many orders of magnitude higher than the amount of python. The amount of money available for optimizing it has thus been many orders of magnitude higher as well.

Be careful with that, numpy arrays can be slower than Python tuples for some operations. The creation is always slower and the overhead has to be worth it.

You can turn trees into numpy-style matrix operations because graphs and matrices are two sides of the same coin. I don't see the code for the binary-tree benchmark in the repo to see how it's written, but there are libraries like graphblas that use the equivalence for optimization.

Missing: write static python and transpile to rust pyO3 which is at the top of the ladder.

Some nuance: try transpiling to a garbage collected rust like language with fast compilation until you have millions of users.

Also use a combination of neural and deterministic methods to transpile depending on the complexity.

> a garbage collected rust like language with fast compilation

I don't know what languages you might have in mind. "Rust-like" in what sense?

One thing with python is that usually I will use one of the many c based libraries to get reasonable speed and well thought out abstractions from the start. I architect around numpy, scipy, shapely, pandas/polars or whatever. So my code runs at reasonable speed from the start. But transpiling to rust then effectively means a complete redesign of the code, data structures, algorithms etc. And I have seen the AI tools really struggle to get it right, as my intent gets lost somewhere.

So what I do now (since Claude Code) is write really bare bones (and slow) pure python implementation (like I used to do for numba, pypy or cython ready code), with minimal dependencies. Then I use the REPL, notebooks and nice plotting tools to get a real understanding of the problem space and the intricacies of my algorithm/problem at hand. When done, I let Claude add tests and I ask it to transpile to equivalent Rust and boom! a flawless 1000x speed upgrade in a minutes.

The great thing is I don't need to do the mental gymnastics to vectorize code in a write only mode like I've had to do since my Matlab days. Instead I can write simple to read for loops that follow my intent much better, and result in much more legible code. So refreshing!

And with pyO3 i can still expose the Rust lib to python, and continue to use Python for glue and plotting

json.loads is something you don't want to use in a loop if you care for performance at all. Just simple using orjson can give you 3x speed without the need to change anything.

Being upfront about AI-assistance or no AI-assistance doesn't mean shit. Whether AI was involved is independent of what they state and there's no real way to fully prove otherwise.

I wish there were more details on this part.

> Missing @cython.cdivision(True) inserts a zero-division check before every floating-point divide in the inner loop. Millions of branches that are never taken.

I thought never taken branches were essentially free. Does this mean something in the loop is messing with the branch predictor?

They're cheap but not free, especially at the front end of the CPU where it's just a lot more instructions to churn through. What the branch predictor gets you is it turns branches, which would normally cause a pipeline bubble, to be executed like straightline code if they're predicted right. It's a bit like a tracing jit. But you will still have a bunch of extra instructions to, like, compute the branch predicate.

> language slow

> looks inside

> the reference implementation of language is slow

The only thing worse than this is when Python specifically is outright taught (!) as an "interpeted language", as if an implementation-detail like that was somehow a language property. So grating.

While I sympathize (and have said similar in the past), language design can (and in Python's case certainly does) hinder optimization quite a bit. The techniques that are purely "use a better implementation" get you not much further than PyPy. Further benefits come from cross-compilation that requires restricting access to language features (and a system that can statically be convinced that those features weren't used!), or indeed straight up using code written in a different language through an FFI.

But yes, the very terminology "interpreted language" was designed for a different era and is somewhere between misleading and incomprehensible in context. (Not unlike "pass by value".)

This is the "two language problem" ( I would like to hear from people who extensively used Julia by the way, which claims to solve this problem, does it really ?)

This problem has been solved already by Lisp, Scheme, Java, .NET, Eiffel, among others, with their pick and choose mix of JIT and AOT compiler toolchains and runtimes.

Great writeup.

I've been in the pandas (and now polars world) for the past 15 years. Staying in the sandbox gets most folks good enough performance. (That's why Python is the language of data science and ML).

I generally teach my clients to reach for numba first. Potentially lots of bang for little buck.

One overlooked area in the article is running on GPUs. Some numpy and pandas (and polars) code can get a big speedup by using GPUs (same code with import change).

Taichi, benchmarked in the article, claims to be able to outperform CUDA at some GPU tasks, although their benchmarks look to be a few years old:

https://github.com/taichi-dev/taichi_benchmark

Isn't this a depressing thought? Regardless of AI, to think that everything we read would come in the same literary style, conveying little of the author, giving no window through which to learn about who they are -- that would be a real loss.

Why is it amusing?

How can you suppose that this is not a good reason to object, especially days after https://news.ycombinator.com/item?id=47340079 ?

I find the style so reflexively grating that it's honestly hard for me to imagine others not being bothered by it, let alone being bothered by others being bothered.

Especially since I looked at previous posts on the blog and they didn't have the same problem.

The smell makes me suspicious because I don’t know how the author used AI.

If the author wrote a detailed rough draft, had AI edit, reviewed the output thoroughly, and has the domain knowledge to know if the AI is correct, then this could be a useful piece.

I suspect most authors _don’t_ fall in that bucket.

Yeah, while posting how they are using Claude to do something really amazing.

"I totally get a kick out of the peeps who find the writeup super helpful yet are totally put off by that distinct "AI smell"—it’s like they can't even! Just imagine when everything we value is woven into a tapestry of that same "smell"—where will all the naysayers retreat to then? It’s a little frustrating, honestly, and I’m just like, come on! Let’s delve into this new era of content and embrace the chaos!"

There, FTFY :D

While this is great, I expected faster CPython to eventually culminate into what YJIT for Ruby is. I'm not sure the current approaches they are trying will get the ecosystem there.

I implemented most of the tracing JIT frontend in Python 3.15, with help from Mark to clean up and fix my code. I also coordinated some of the community JIT optimizer effort in Python 3.15 (note: NOT the code generator/DSL/infra, that's Mark, Diego, Brandt and Savannah). So I think I'm able to answer this.

I can't speak for everyone on the team, but I did try the lazy basic block versioning in YJIT in a fork of CPython. The main problem is that the copy-and-patch backend we currently have in CPython is not too amenable to self-modifying machine code. This makes inter-block jumps/fallthroughs very inefficient. It can be done, it's just a little strange. Also for security reasons, we tried not to have self-modifying code in the original JIT and we're hoping to stick to that. Everything has their tradeoffs---design is hard! It's not too difficult to go from tracing to lazy basic blocks. Conceptually they're somewhat similar, as the original paper points out. The main thing we lack is the compact per-block type information that something like YJIT/Higgs has.

I guess while I'm here I might as well make the distinction:

- Tracing is the JIT frontend (region selection).

- Copy and Patch is the JIT backend (code generation).

We currently use both. PyPy uses meta-tracing. It traces the runtime itself rather than the user's code in CPython's tracing case. I did take a look at PyPy's code, and a lot of ideas in the improved JIT are actually imported from PyPy directly. So I have to thank them for their great ideas. I also talk to some of the PyPy devs.

Ending off: the team is extremely lean right now. Only 2 people were generously employed by ARM to work on this full time (thanks a lot to ARM too!). The rest of us are mostly volunteers, or have some bosses that like open source contributions and allow some free time. As for me, I'm unemployed at the moment and this is basically my passion project. I'm just happy the JIT is finally working now after spending 2-3 years of my life on it :). If you go to Savannah's website [1], the JIT is around 100% faster for toy programs like Richards, and even for big programs like tomli parsing, it's 28% faster on macOS AArch64. The JIT is very much a community effort right now.

[1]: https://doesjitgobrrr.com/?goals=5,10

PS: If you want to see how the work has progressed, click "all time" in that website, it's pretty cool to see (lower is faster). I have a blog explaining how we made the JIT faster here https://fidget-spinner.github.io/posts/faster-jit-plan.html.

I guess while I'm here I might as well make the distinction:

- Tracing is the JIT frontend (region selection).

- Copy and Patch is the JIT backend (code generation).

[1]: https://doesjitgobrrr.com/?goals=5,10

So much "is real". It is ok to check your grammar, but this is slopabetes inducing.

“The numbers are real.” But the voice is not.

If we only applied the same reflex to software, even when 100% human programmed.

I also seem to be developing an immune response to several slopisms. But the actual content is useful for outlining tradeoffs if you’re needing to make your Python code go faster.

I got the same sense, but nowadays I can't be sure whether a text is AI or the writer's style has absorbed LLM tropes.

That has been a thing forever, many "Python" libraries, are actually bindings to C, C++ and Fortran.

I've used a library that patches the zipfile module to add support for zstd compression in zipfiles.

In python3.14 the support is there, but 2 years ago you could just import this library and it would just work normally.

Same problems, same Apple M4 Pro, real numbers.

> ok I guess the harder question is. Why isn't python as fast as javascript?

Be careful with that, numpy arrays can be slower than Python tuples for some operations. The creation is always slower and the overhead has to be worth it.

The author is from Turkey (where I’m also originally from).

Believe it or not, when you write a blog post in a different language, it really helps to use an LLM, even just to fix your grammar mistakes etc.

I assume that’s most likely what happened here too.

IMO it would make sense to add a disclaimer then, e.g. “I wrote this myself but had AI edit”

I have no problem with people using AI, especially to close a language gap.

If you disclose your usage I have a _lot_ more trust that effort has been put into the writing despite the usage

Honestly I'd rather read imperfect english

Here's what gave it away for me

> The remaining difference is noise, not a fundamental language gap. The real Rust advantage isn't raw speed -- it's pipeline ownership.

There’s an unmistakable rhythm beginning with first paragraph. The trigger was “Same problems, same Apple M4 Pro, real numbers.” in third for me.

I’m scarred to detect these things by my own AI usage.

https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing

I didn't notice any signs of AI writing until seeing this comment and re-reading (though I did notice it on the second pass).

Your initial focus is to prioritize which content to consume.

There's no surprise that Rust is faster to run, but I don't think there are many who would claim that Rust is faster to write.

Maybe with LLM/Code Assistance this effort reduces? Since we're mostly talking mathematics here, you have well defined algorithms that don't need to be "vibed". The codegen, hopefully, is consistent.

Generating code at runtime is often an anti-goal because you can’t easily introspect it. “Build-time” generation gives you that, but print often choose to go further and check the generated code to source control to be able to see the change history.

But it's an fairly easy build if you want any of that.

> a garbage collected rust like language with fast compilation

I don't know what languages you might have in mind. "Rust-like" in what sense?

Probably OCaml, Standard ML, Haskell, MLton, F#, Scala,....

If going to complain about some of those being slow, remeber that they have various options between interpreter, bytecode, REPL, JIT and AOT.

Honestly I'd rather read imperfect english

IMO it would make sense to add a disclaimer then, e.g. “I wrote this myself but had AI edit”

I have no problem with people using AI, especially to close a language gap.

If you disclose your usage I have a _lot_ more trust that effort has been put into the writing despite the usage

There’s an unmistakable rhythm beginning with first paragraph. The trigger was “Same problems, same Apple M4 Pro, real numbers.” in third for me.

I’m scarred to detect these things by my own AI usage.

https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing

Maybe with LLM/Code Assistance this effort reduces? Since we're mostly talking mathematics here, you have well defined algorithms that don't need to be "vibed". The codegen, hopefully, is consistent.

Your initial focus is to prioritize which content to consume.

But it's an fairly easy build if you want any of that.

Probably OCaml, Standard ML, Haskell, MLton, F#, Scala,....

If going to complain about some of those being slow, remeber that they have various options between interpreter, bytecode, REPL, JIT and AOT.

Yeah, while posting how they are using Claude to do something really amazing.

And with pyO3 i can still expose the Rust lib to python, and continue to use Python for glue and plotting

Being upfront about AI-assistance or no AI-assistance doesn't mean shit. Whether AI was involved is independent of what they state and there's no real way to fully prove otherwise.

This problem has been solved already by Lisp, Scheme, Java, .NET, Eiffel, among others, with their pick and choose mix of JIT and AOT compiler toolchains and runtimes.

There, FTFY :D

The smell makes me suspicious because I don’t know how the author used AI.

If the author wrote a detailed rough draft, had AI edit, reviewed the output thoroughly, and has the domain knowledge to know if the AI is correct, then this could be a useful piece.

I suspect most authors _don’t_ fall in that bucket.

Why is it amusing?

How can you suppose that this is not a good reason to object, especially days after https://news.ycombinator.com/item?id=47340079 ?

I find the style so reflexively grating that it's honestly hard for me to imagine others not being bothered by it, let alone being bothered by others being bothered.

Especially since I looked at previous posts on the blog and they didn't have the same problem.

Ultimately it’s up to the author to make that explicit choice. I think that AI does and will enhance writing and depth and breadth of analysis one could perform. But, to be trustworthy, people will need to either lay out all cards on the table and/or work on other ways to gain trust over time. Maybe people need to provide some context to communicate what model was used and in which ways. What % of final output is AI vs author. I mean, if I see 100% composed by human author stated somewhere then there’s my cue to at the very least learn a little about the author. Certainly more complexity and discernment for readers. Depressing? In some ways maybe; but I’m kind of optimistic. Imagine what Tolkien could worldbuild armed with AI.. but then it wouldn’t be Tolkien.

But yes, the very terminology "interpreted language" was designed for a different era and is somewhere between misleading and incomprehensible in context. (Not unlike "pass by value".)

Absolutely, no doubt about that. I just find it a terrible way to approach from in general, as well as specifically in this case: swapping out CPython with PyPy, GraalPy, Taichi, etc. - as per the post - requires no code changes, yet results in leaps and bounds faster performance.

If switching runtimes yields, say, 10x perf, and switching languages yields, say, 100x, then the language on its own was "just" a 10x penalty. Yet the presentation is "language is 100x slower". That's my gripe. And these are apparently conservative estimates as per the tables in the OP.

Not that metering "language performance" with numbers would be a super meaningful exercise to begin with, but still. The fact that most people just go with CPython does not escape me either. I do wonder though if people would shop for alternative runtimes more if the common culture was more explicitly and dominantly concerned with the performance of implementations, rather than of languages.

Worse, IMO, is the never taken branch taking up space in branch prediction buffers. Which will cause worse predictions elsewhere (when this branch ip collides with a legitimate ip). Unless I missed a subtlety and never taken branches don’t get assigned any resources until they are taken (which would be pretty smart actually).

Taichi, benchmarked in the article, claims to be able to outperform CUDA at some GPU tasks, although their benchmarks look to be a few years old:

https://github.com/taichi-dev/taichi_benchmark

And doesn't account for cuTitle, NVidia's new API infrastructure that supports writing CUDA directly in Python via a JIT that is based on MLIR.

Every year, someone posts a benchmark showing Python is 100x slower than C. The same argument plays out: one side says "benchmarks don't matter, real apps are I/O bound," the other says "just use a real language." Both are wrong.

I took two of the most-cited Benchmarks Game problems -- n-body and spectral-norm -- reproduced them on my machine, and ran every optimization tool I could find. Then I added a third benchmark -- a JSON event pipeline -- to test something closer to real-world code.

Same problems, same Apple M4 Pro, real numbers. This is one developer's journey up the ladder -- not a definitive ranking. A dedicated expert could squeeze more out of any of these tools. The full code is at faster-python-bench.

Here's the starting point -- CPython 3.13 on the official Benchmarks Game run:

Benchmark	C gcc	CPython 3.13	Ratio
n-body (50M)	2.1s	372s	177x
spectral-norm (5500)	0.4s	350s	875x
fannkuch-redux (12)	2.1s	311s	145x
mandelbrot (16000)	1.3s	183s	142x
binary-trees (21)	1.6s	33s	21x

The question isn't whether Python is slow at computation. It is. The question is how much effort each fix costs and how far it gets you. That's the ladder.

Why Python Is Slow

The usual suspects are the GIL, interpretation, and dynamic typing. All three matter, but none of them is the real story. The real story is that Python is designed to be maximally dynamic -- you can monkey-patch methods at runtime, replace builtins, change a class's inheritance chain while instances exist -- and that design makes it fundamentally hard to optimize.

A C compiler sees a + b between two integers and emits one CPU instruction. The Python VM sees a + b and has to ask: what is a? What is b? Does a.__add__ exist? Has it been replaced since the last call? Is a actually a subclass of int that overrides __add__? Every operation goes through this dispatch because the language guarantees you can change anything at any time.

The object overhead is where this shows up concretely. In C, an integer is 4 bytes on the stack. In Python:

C int:        [    4 bytes    ]

Python int:   [ ob_refcnt  8B ]    reference count
              [ ob_type    8B ]    pointer to type object
              [ ob_size    8B ]    number of digits
              [ ob_digit   4B ]    the actual value
              ─────────────────
              = 28 bytes minimum

(Simplified -- CPython 3.12+ replaced ob_size with lv_tag in a restructured int layout. Total is still 28 bytes. See longintrepr.h.)

4 bytes of number, 24 bytes of machinery to support dynamism. a + b means: dereference two heap pointers, look up type slots, dispatch to int.__add__, allocate a new PyObject for the result (unless it hits the small-integer cache), update reference counts. CPython 3.11+ mitigates this with adaptive specialization -- hot bytecodes like BINARY_OP_ADD_INT skip the dispatch for known types -- but the overhead is still there for the general case. One number isn't slow. Millions in a loop are.

The GIL (Global Interpreter Lock) gets blamed a lot, but it has no impact on single-threaded performance -- it only matters when multiple CPU-bound threads compete for the interpreter. For the benchmarks in this post, the GIL is irrelevant. CPython 3.13 shipped experimental free-threaded mode (--disable-gil) -- still experimental in 3.14 -- but as we'll see, it actually makes single-threaded code slower because removing the GIL adds overhead to every reference count operation.

The interpretation overhead is real but is being actively addressed. CPython 3.11's Faster CPython project added adaptive specialization -- the VM detects "hot" bytecodes and replaces them with type-specialized versions, skipping some of the dispatch. It helped (~1.4x). CPython 3.13 went further with an experimental copy-and-patch JIT compiler -- a lightweight JIT that stitches together pre-compiled machine code templates instead of generating code from scratch. It's not a full optimizing JIT like V8's TurboFan or a tracing JIT like PyPy's; it's designed to be small and fast to start, avoiding the heavyweight JIT startup cost that has historically kept CPython from going this route. Early results in 3.13 show no improvement on most benchmarks, but the infrastructure is now in place for more aggressive optimizations in future releases. JavaScript's V8 achieves much better JIT results, but V8 also had a large dedicated team and a single-threaded JavaScript execution model that makes speculative optimization easier. (For more on the "why doesn't CPython JIT" question, see Anthony Shaw's "Why is Python so slow?".)

So the picture is: Python is slow because its dynamic design requires runtime dispatch on every operation. The GIL, the interpreter, the object model -- these are all consequences of that design choice. Each rung of the ladder removes some of this dispatch. The higher you climb, the more you bypass -- and the more effort it costs.

Rung 0: Upgrade CPython

Cost: changing your base image. Reward: up to 1.4x.

Version	N-body	vs 3.14	Spectral-norm	vs 3.14
CPython 3.10	1,663ms	0.75x	16,826ms	0.83x
CPython 3.11	1,200ms	1.04x	13,430ms	1.05x
CPython 3.13	1,134ms	1.10x	13,637ms	1.03x
CPython 3.14	1,242ms	1.0x	14,046ms	1.0x
CPython 3.14t (free-threaded)	1,513ms	0.82x	14,551ms	0.97x

The story is 3.10 to 3.11: a 1.39x speedup on n-body, for free. That's the Faster CPython project -- adaptive specialization of bytecodes, inline caching, zero-cost exceptions. 3.13 squeezed out a bit more. 3.14 gave some of it back -- a minor regression on these benchmarks.

Free-threaded Python (3.14t) is slower on single-threaded code. The GIL removal adds overhead to every reference count operation. Worth it only if you have genuinely parallel CPU-bound threads. (Full version comparison)

This rung costs nothing. If you're still on 3.10, upgrade.

Rung 1: Alternative Runtimes (PyPy, GraalPy)

Cost: switching interpreters. Reward: 6-66x.

	N-body	Spectral-norm
CPython 3.14	1,242ms	14,046ms
GraalPy	211ms (5.9x)	212ms (66x)
PyPy	98ms (13x)	1,065ms (13x)

Both are JIT-compiled runtimes that generate native machine code from your unmodified Python. Zero code changes. Just a different interpreter.

PyPy uses a tracing JIT -- it records hot loops and compiles them. GraalPy runs on GraalVM's Truffle framework with a method-based JIT. PyPy wins on n-body (13x vs 5.9x), but GraalPy dominates spectral-norm (66x vs 13x) -- the matrix-heavy inner loop plays to GraalVM's strengths. GraalPy also offers Java interop and is actively developed by Oracle.

The catch: ecosystem compatibility. Both support major packages, but C extensions run through compatibility layers that can be slower than on CPython. GraalPy is on Python 3.12 (no 3.14 yet) and has slow startup -- it's JVM-based, so the JIT needs warmup before reaching peak performance. For pure Python code with long-running hot loops -- these are free speed.

Rung 2: Mypyc

Cost: type annotations you probably already have. Reward: 2.4-14x.

	N-body	Spectral-norm
CPython 3.14	1,242ms	14,046ms
Mypyc	518ms (2.4x)	990ms (14x)

Mypyc compiles type-annotated Python to C extensions using the same type analysis as mypy. No new syntax, no new language -- just your existing typed Python, compiled ahead of time.

# Already valid typed Python -- mypyc compiles this to C
def advance(dt: float, n: int, bodies: list[Body], pairs: list[BodyPair]) -> None:
    dx: float
    dy: float
    dz: float
    dist_sq: float
    dist: float
    mag: float
    for _ in range(n):
        for (r1, v1, m1), (r2, v2, m2) in pairs:
            dx = r1[0] - r2[0]
            dy = r1[1] - r2[1]
            dz = r1[2] - r2[2]
            dist_sq = dx * dx + dy * dy + dz * dz
            dist = math.sqrt(dist_sq)
            mag = dt / (dist_sq * dist)
            # ...

The difference from the baseline: explicit type declarations on every local variable so mypyc can use C primitives instead of Python objects, and decomposing ** (-1.5) into sqrt() + arithmetic to avoid slow power dispatch. That's it -- no special decorators, no new build system beyond mypycify().

The mypy project itself -- ~100k+ lines of Python -- achieved a 4x end-to-end speedup by compiling with mypyc. The official docs say "1.5x to 5x" for existing annotated code, "5x to 10x" for code tuned for compilation. The spectral-norm result (14x) lands above that range because the inner loop is pure arithmetic that mypyc compiles directly to C. On our dict-heavy JSON pipeline, mypyc hit 2.3x on pre-parsed dicts -- closer to the expected floor.

The constraint: mypyc supports a subset of Python. Dynamic patterns like **kwargs, getattr tricks, and heavily duck-typed code will compile but won't be optimized -- they fall back to slow generic paths. But if your code already passes mypy strict mode, mypyc is the lowest-effort compilation rung on the ladder.

Rung 3: NumPy

Cost: knowing NumPy. Reward: up to 520x.

	Spectral-norm
CPython 3.14	14,046ms
NumPy	27ms (520x)

520x. Faster than our single-threaded Rust at 154x on the same problem -- though NumPy delegates to BLAS, which uses multiple cores.

Spectral-norm is matrix-vector multiplication. NumPy pre-computes the matrix once and delegates to BLAS (Apple Accelerate on macOS):

a = build_matrix(n)
for _ in range(10):
    v = a.T @ (a @ u)
    u = a.T @ (a @ v)

Each @ is a single call to hand-optimized BLAS with SIMD and multithreading. NumPy trades O(N) memory for O(N^2) memory -- it stores the full 2000x2000 matrix (30MB) -- but the computation is done in compiled C/C++ (Apple Accelerate on macOS, OpenBLAS or MKL on Linux), not Python.

This is the lesson people miss when they say "Python is slow." Python the loop runner is slow. Python the orchestrator of compiled libraries is as fast as anything.

The constraint: your problem must fit vectorized operations. Element-wise math, matrix algebra, reductions, conditionals (np.where computes both branches and masks the result -- redundant work, but still faster than a Python loop on large arrays) -- NumPy handles all of these. What it can't help with: sequential dependencies where each step feeds the next, recursive structures, and small arrays where NumPy's per-call overhead costs more than the computation itself.

Interlude: JAX

Cost: rewriting loops as jax.lax.fori_loop + array operations. Reward: 12-1,633x.

A Reddit commenter (justneurostuff) suggested testing JAX -- an array computing library that uses XLA JIT compilation. I expected it to land somewhere near NumPy. I was wrong.

	N-body	Spectral-norm
CPython 3.14	1,242ms	14,046ms
NumPy	--	27ms (520x)
JAX JIT	100ms (12.2x)	8.6ms (1,633x)

8.6ms on spectral-norm. That's 3x faster than NumPy and the fastest result in this entire post. On n-body, 12.2x -- between Mypyc and Numba. Both results match the CPython baseline to 9 decimal places. This is single-threaded -- forcing one thread gave 9.1ms vs 8.6ms on spectral-norm.

I don't know JAX well enough to explain exactly why it's 3x faster than NumPy on the same matrix multiplications. Both call BLAS under the hood. My best guess is that JAX's @jit compiles the entire function -- matrix build, loop, dot products -- so Python is never involved between operations, while NumPy returns to Python between each @ call. But I haven't verified that in detail. Might be time to learn.

The catch: JAX is a different programming model. Python loops become lax.fori_loop. Conditionals become lax.cond. You're writing functional array programs that happen to use Python syntax -- closer to a domain-specific language than a drop-in optimizer. But if your problem fits, the numbers speak for themselves. JAX isn't the only library that compiles array code -- PyTorch has torch.compile, for example. I only tested JAX, so I can't say whether others would produce similar results on these benchmarks.

Rung 4: Numba

Cost: @njit + restructuring data into NumPy arrays. Reward: 56-135x.

	N-body	Spectral-norm
CPython 3.14	1,242ms	14,046ms
Numba @njit	22ms (56x)	104ms (135x)

Numba JIT-compiles decorated functions to machine code via LLVM:

@njit(cache=True)
def advance(dt, n, pos, vel, mass):
    for i in range(n):
        for j in range(i + 1, n):
            dx = pos[i, 0] - pos[j, 0]
            dy = pos[i, 1] - pos[j, 1]
            dz = pos[i, 2] - pos[j, 2]
            dist = sqrt(dx * dx + dy * dy + dz * dz)
            mag = dt / (dist * dist * dist)
            vel[i, 0] -= dx * mag * mass[j]
            # ...

One decorator. Restructure data into NumPy arrays. The constraint: Numba works best with NumPy arrays and numeric types. It has limited support for typed dicts, typed lists, and @jitclass, but strings and general Python objects are largely out of reach. It's a scalpel, not a saw.

Rung 5: Cython

Cost: learning C's mental model, expressed in Python syntax. Reward: 99-124x.

	N-body	Spectral-norm
CPython 3.14	1,242ms	14,046ms
Cython	10ms (124x)	142ms (99x)

124x on n-body. Within 10% of Rust. But here's the thing about this rung:

My first Cython n-body got 10.5x. Same Cython, same compiler. The final version got 124x. The difference was three landmines, none of which produced warnings:

Cython's ** operator with float exponents. Even with typed doubles and -ffast-math, x ** 0.5 is 40x slower than sqrt(x) in Cython -- the operator goes through a slow dispatch path instead of compiling to C's sqrt(). The n-body baseline uses ** (-1.5), which can't be replaced with a single sqrt() call -- it required decomposing the formula into sqrt() + arithmetic. 7x penalty on the overall benchmark.
Precomputed pair index arrays prevent the C compiler from unrolling the nested loop. 2x penalty. The "clever" version is slower.
Missing @cython.cdivision(True) inserts a zero-division check before every floating-point divide in the inner loop. Millions of branches that are never taken.

Cython's promise is that it "makes writing C extensions for Python as easy as Python itself." In practice that means: learn C's mental model, express it in Python syntax, and use the annotation report (cython -a) to verify the compiler did what you think. The full story is in The Cython Minefield.

The reward is real -- 99-124x, matching compiled languages. But the failure mode is silent. All three landmines cost you silently, and the annotation report is the only way to catch them.

Rung 6: The New Wave

Cost: new toolchains, rough edges, ecosystem gaps. Reward: 26-198x.

Three tools promise to compile Python (or Python-like code) to native machine code. I tested all three.

	N-body	Speedup	Spectral-norm	Speedup	The catch
Codon 0.19	47ms	26x	99ms	142x	Own runtime, limited stdlib, limited CPython interop
Mojo nightly	16ms	78x	118ms	119x	New language (pre-1.0), full rewrite required
Taichi 1.7	16ms	78x	71ms	198x	Python 3.13 only (no 3.14 wheels)

The numbers are real. The developer experience is rough. Codon can't import your existing code. Mojo is a new language wearing Python's clothes. Taichi has the best spectral-norm result (198x) but doesn't ship wheels for Python 3.14 -- its numbers above were benchmarked on a separate Python 3.13 environment. That's the compromise with these tools: if your runtime doesn't keep up with CPython releases, you're stuck on an old version or juggling multiple environments. (Full deep dive with code and DX verdicts)

None are drop-in. All are worth watching.

Rung 7: Rust via PyO3

Cost: learning Rust. Reward: 113-154x.

	N-body	Spectral-norm
CPython 3.14	1,242ms	14,046ms
Rust (PyO3)	11ms (113x)	91ms (154x)

The top of the ladder. But notice: on n-body, Cython at 10ms vs Rust at 11ms -- they're essentially tied. Both compiled to native machine code. The remaining difference is noise, not a fundamental language gap.

The real Rust advantage isn't raw speed -- it's pipeline ownership. When Rust parses JSON directly with serde into typed structs, it never creates a Python dict. It bypasses the Python object system entirely. That matters more on the next benchmark.

The Ceiling

The Benchmarks Game problems are pure compute: tight loops, no I/O, no data structures beyond arrays. Most Python code looks nothing like that. So I built a third benchmark: load 100K JSON events, filter, transform, aggregate per user. Dicts, strings, datetime parsing -- the kind of code that makes Numba useless and makes Cython fight the Python object system.

First, every tool starts from pre-parsed Python dicts -- same input, same work:

Approach	Time	Speedup	What it costs you
CPython 3.14	48ms	1.0x	Nothing
Mypyc	21ms	2.3x	Type annotations
Cython (dict optimized)	12ms	4.1x	Days of annotation work

4.1x. Not 50x. The bottleneck is Python dict access. Even Cython's fully optimized version -- @cython.cclass, C arrays for counters, direct CPython C-API calls (PyList_GET_ITEM, PyDict_GetItem with borrowed refs) -- still reads input dicts through the Python C API.

But wait -- why are we feeding Cython Python dicts at all? json.loads() takes ~57ms to create those dicts. That's more than the entire baseline pipeline. What if Cython reads the raw bytes itself?

I wrote a second Cython pipeline that calls yyjson -- a general-purpose C JSON parser, comparable to Rust's serde_json. Both are schema-agnostic: they parse any valid JSON, not just our event format. Cython walks the parsed tree with C pointers, filters and aggregates into C structs, and builds Python dicts only for the final output. For Rust, idiomatic serde with zero-copy deserialization. Both own the data end-to-end:

Approach	Time	Speedup	What it costs you
CPython 3.14 (json.loads + pipeline)	105ms	1.0x	Nothing
Mypyc (json.loads + pipeline)	77ms	1.4x	Type annotations
Cython (json.loads + pipeline)	67ms	1.6x	C-API dict access
Rust (serde, from bytes)	21ms	5.0x	New language + bindings
Cython (yyjson, from bytes)	17ms	6.3x	C library + Cython declarations

6.3x for Cython, 5.0x for Rust. The ceiling was never the pipeline code -- it was json.loads(). Both approaches use general-purpose JSON parsers -- yyjson on the Cython side, serde on the Rust side -- and both avoid Python objects entirely in the hot loop: Cython walks yyjson's C tree into C structs, Rust deserializes into native structs via serde.

I'm not claiming Cython is faster than Rust or vice versa. A sufficiently motivated person could make either one faster -- swap parsers, tune allocators, restructure the pipeline. The point isn't which tool wins this specific benchmark. The point is how many rungs you're willing to climb. Both land in the same neighborhood once you bypass json.loads(). The code is at faster-python-bench.

The Full Report Card

N-body (500K iterations, tight floating-point loops)

Approach	Time	Speedup	What it costs you
CPython 3.10	1,663ms	0.75x	Old version
CPython 3.14	1,242ms	1.0x	Nothing
CPython 3.14t	1,513ms	0.82x	GIL-free but slower single-thread
Mypyc	518ms	2.4x	Type annotations
GraalPy	211ms	5.9x	Python 3.12 only, ecosystem compatibility
JAX JIT	100ms	12.2x	Rewrite loops as `lax.fori_loop`
PyPy	98ms	13x	Ecosystem compatibility
Codon	47ms	26x	Separate runtime, limited stdlib
Numba	22ms	56x	`@njit` + NumPy arrays
Taichi	16ms	78x	Python 3.13 only (no 3.14 wheels)
Mojo	16ms	78x	New language + toolchain
Cython	10ms	124x	C knowledge + landmines
Rust (PyO3)	11ms	113x	Learning Rust

Spectral-norm (N=2000, matrix-vector multiply)

Approach	Time	Speedup	What it costs you
CPython 3.10	16,826ms	0.83x	Old version
CPython 3.14	14,046ms	1.0x	Nothing
CPython 3.14t	14,551ms	0.97x	GIL-free but slower single-thread
Mypyc	990ms	14x	Type annotations
GraalPy	212ms	66x	Python 3.12 only, ecosystem compatibility
PyPy	1,065ms	13x	Ecosystem compatibility
Codon	99ms	142x	Separate runtime, limited stdlib
Numba	104ms	135x	`@njit` + NumPy arrays
Mojo	118ms	119x	New language + toolchain
Rust (PyO3)	91ms	154x	Learning Rust
Cython	142ms	99x	C knowledge + landmines
Taichi	71ms	198x	Python 3.13 only (no 3.14 wheels)
NumPy	27ms	520x	Knowing NumPy + O(N^2) memory
JAX JIT	8.6ms	1,633x	Rewrite loops as `lax.fori_loop`

JSON pipeline (100K events, end-to-end from raw bytes)

Approach	Time	Speedup	What it costs you
CPython 3.14 (json.loads + pipeline)	105ms	1.0x	Nothing
Mypyc (json.loads + pipeline)	77ms	1.4x	Type annotations
Cython (json.loads + pipeline)	67ms	1.6x	C-API dict access
Rust (serde, from bytes)	21ms	5.0x	New language + bindings
Cython (yyjson, from bytes)	17ms	6.3x	C library + Cython declarations

When to Stop Climbing

The effort curve is exponential. Mypyc (2.4-14x) costs type annotations. PyPy/GraalPy (6-66x) costs a binary swap. Numba (56-135x) costs a decorator and data restructuring. JAX (12-1,633x) costs rewriting your code functionally. Cython (99-124x) costs days and C knowledge. Rust (113-154x) costs learning a new language.

Upgrade first. 3.10 to 3.11 gives you 1.4x for free.

Mypyc for typed codebases. If your code already passes mypy strict, compile it. 2.4x on n-body, 14x on spectral-norm, for almost no work.

NumPy for vectorizable math. If your problem is matrix algebra or element-wise operations, NumPy gets you 520x with code you already know.

JAX if you can express it functionally. Same array paradigm as NumPy, but XLA whole-graph compilation took spectral-norm to 1,633x -- 3x faster than NumPy. The cost is rewriting loops as lax.fori_loop and conditionals as lax.cond. On problems that don't vectorize well (n-body with 5 bodies), JAX is 12x -- good but not exceptional.

Numba for numeric loops. @njit gives you 56-135x with one decorator and honest error messages.

Cython if you know C. 99-124x is real, but the failure mode is silent slowness.

Rust for pipeline ownership. On pure compute, Cython and Rust are neck and neck. The real advantage is when Rust owns the data flow end-to-end.

PyPy or GraalPy for pure Python. 6-66x for zero code changes is remarkable, if your dependencies support it. GraalPy's spectral-norm result (66x) rivals compiled solutions.

Most code doesn't need any of this. The pipeline benchmark -- the most realistic of the three -- topped out at 4.1x when starting from Python dicts. 6.3x when Cython called yyjson and owned the bytes. If your hot path is dict[str, Any], the answer might be "stop creating dicts," not "change the language." And if your code is I/O bound, none of this matters at all.

Profile before you optimize. cProfile to find the function. line_profiler to find the line. Then pick the right rung.

Not covered: Nuitka (Python-to-C compiler, mostly used for packaging -- speedups are in the Mypyc range), Pythran (NumPy-focused AOT compiler, niche), SPy (Antonio Cuni's static Python dialect -- not ready yet but worth watching), and CinderX (Meta's performance-oriented CPython fork -- not ready yet).

Found an error? Open a PR.

Edits

2026-03-10: Rewrote the NumPy constraints paragraph. The original listed "irregular access patterns, conditionals per element, recursive structures" as things NumPy can't handle. Two of those were wrong: NumPy fancy indexing handles irregular access fine (22x faster than Python on random gather), and np.where handles conditionals (2.8-15.5x faster on 1M elements, even though it computes both branches). Replaced with things NumPy actually can't help with: sequential dependencies (n-body with 5 bodies is 2.3x slower with NumPy), recursive structures, and small arrays (NumPy loses below ~50 elements due to per-call overhead).

2026-03-10: The original text said "Early results are modest (single-digit percent improvements)" -- implying the 3.13 JIT was already delivering gains. Changed to "Early results in 3.13 show no improvement on most benchmarks." Bad wording on my part -- 3.13 JIT shows no speedup (and can be slightly slower). The speedups are coming in 3.15: Savannah Ostrowski's preliminary FastAPI benchmarks show ~8% improvement on 3.15 (see also doesjitgobrrr.com). Thanks to Fidget-Spinner (CPython core developer working on the JIT) for the correction.

2026-03-11: Added JAX JIT benchmarks after a Reddit comment from justneurostuff suggested testing it. Results: 1,633x on spectral-norm (fastest in the post -- 3x faster than NumPy), 12.2x on n-body. Both match baseline to 9 decimal places. Added as an interlude between NumPy and Numba sections, and to both report card tables.

Hacker Times