any other resources like this?
Same for Java, I have yet to in my entire career see enterprise Java be performant and not memory intensive.
At the end of the day, if you care about performance at the app layer, you will use a language better suited to that.
In practice though, for most enterprise web services, a lot of real world performance comes down to how efficiently you are calling external services (including the database). Just converting a loop of queries into bulk ones can help loads (and then tweaking the query to make good use of indexes, doing upserts, removing unneeded data, etc.)
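For illustration, here's a hedged sketch of the bulk-query idea (table and column names are made up): building one multi-row INSERT so a single round trip replaces a loop of single-row statements.

```java
import java.util.Collections;
import java.util.List;

public class BulkSql {
    // Build "INSERT INTO t (a, b) VALUES (?, ?), (?, ?), ..." for rowCount rows,
    // so one round trip replaces rowCount separate statements.
    static String bulkInsertSql(String table, List<String> columns, int rowCount) {
        String row = "(" + String.join(", ", Collections.nCopies(columns.size(), "?")) + ")";
        return "INSERT INTO " + table
                + " (" + String.join(", ", columns) + ") VALUES "
                + String.join(", ", Collections.nCopies(rowCount, row));
    }

    public static void main(String[] args) {
        // One statement for two rows instead of two single-row statements
        System.out.println(bulkInsertSql("orders", List.of("id", "total"), 2));
        // INSERT INTO orders (id, total) VALUES (?, ?), (?, ?)
    }
}
```

(JDBC batching with `PreparedStatement.addBatch()` is the other common route to the same effect.)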
I'm hopeful that improvements in LLMs mean we can ditch ORMs (adopted under the guise that they're quicker for writing queries and the in-between mapping code) and instead make good use of SQL to harness the power that modern databases provide.
I wish Java had a proper compiler.
It doesn't excuse the "use exceptions for control flow" anti-pattern, but it is a quick patch.
And aside from algorithms, it usually comes down to avoiding memory allocations.
I have my go-to zero-alloc grpc and parquet and json and time libs etc and they make everything fast.
It’s mostly how idiomatic Java uses objects for everything that makes it slow overall.
But eventually after making a JVM app that keeps data in something like data frames etc and feels a long way from J2EE beans you can finally bump up against the limits that only c/c++/rust/etc can get you past.
This one is so prevalent that the JVM has an optimization where it gives up on filling in the stack trace for an exception that is thrown over and over from the exact same place (controlled by the -XX:-OmitStackTraceInFastThrow flag).
Java is only fast-ish even on its best day. The more typical performance is much worse because the culture around the language usually doesn't consider performance or efficiency to be a priority. Historically it was even a bit hostile to it.
I was listening to someone say they write fast code in Java by avoiding allocations with a PoolAllocator that would "cache" small objects with poolAllocator.alloc(), poolAllocator.release(). So just manual memory management with extra steps. At that point why not use a better language for the task?
Well, JS is fast and Go is faster, but Java is C++-fast.
It gets a reaction, though, so great for social media.
This is usually the first thing I look for when someone is complaining about speed. Developers often miss it because they are developing against a database on their local machine which removes any of the network latency that exists in deployed environments.
Also, before jsonb existed, you'd often run into big blobs of properties you don't care to split up into tables. Now it takes some discipline to avoid shoving things into jsonb that shouldn't be.
I recently fixed a treesitter perf issue (for myself) in neovim by just dfsing down the parse tree instead of what most textobject plugins do, which is:
-> walk the entire tree for all subtrees that match this metadata
-> now you have a list of matching subtrees, iterate through said subtree nodes, and see which ones are "close" to your cursor.
But in neovim, when I type "daf", I usually just want to delete the function right under my cursor. So you can just implement the same algorithm by just... dfsing down the parse tree (which has line numbers embedded per nodes) and detecting the matches yourself.
In school, when I did competitive programming and TCS, these gains often came from super clever invariants that you would just sit there for hours, days, weeks, just mulling it over. Then suddenly realize how to do it more cleverly and the entire problem falls away (and a bunch of smart people praise you for being smart :D). This was not one of them - it was just, "go bypass the API and do it faster, but possibly less maintainably".
In industry, it's often trying to manage the tradeoff between readability, maintainability, etc. I'm very much happy to just use some dumb n^2 pattern for n <= 10 in some loop that I don't really care much about, rather than start pulling out some clever state manipulation that could lead to pretty "menial" issues such as:
- accidental mutable variables and duplicating / reusing them later in the code
- when I look back in a week, "What the hell am I doing here?"
- or just tricky logic in general
I only noticed the treesitter textobject issue because I genuinely started working with 1MB autogen C files at work. So... yeah...
I could go and bug the maintainers to expose a "query over text range" API (they only have query, and node text range separately, I believe; at least from the minimal research I've done, I haven't kept up to date with it). But now that ties into considerations far beyond myself - does this expose state in a way that isn't intuitive? Are we adding composable primitives or just ad hoc adding features into the library to make it faster because of the tighter coupling? etc. etc.
I used to think of all of that as just kind of "bs accidentals" and "why shouldn't we just be able to write the best algorithms possible". As a maintainer of some systems now... nah, the architectural design is sometimes more fun!
I may not have these super clever flashes of insight anymore but I feel like my horizons have broadened (though part of it is because GPT Pro started 1 shotting my favorite competitive programming problems circa late 2025 D: )
Maven on the other hand, is just plain boring tech that works. There's plenty of documentation on how to use it properly for many different environments/scenarios, it's declarative while enabling plug-ins for bespoke customisations, it has cruft from its legacy but it's quite settled and it just works.
Could Maven be more modern if it was invented now? Yeah, sure, many other package managers were developed since its inception with newer/more polished concepts but it's dependable, well documented, and it just plain works.
Gradle does suck and maven is ok but a bit ugly.
Programming in Rust is a constant negotiation with the compiler. That isn't necessarily good or bad but I have far more control in Zig, and flexibility in Java.
Maybe we can ditch active models like those we see in sqlalchemy, but the typed query builders that come with ORMs are going to become more important, not less. Leveraging the compiler to catch bad queries is a huge win.
In my demo app, the CPU hotspots were entirely in application code, not I/O wait. And across a fleet, even "smaller" gains in CPU and heap compound into real cost and throughput differences. They're different problems, but your point is valid. The goal here is to get more folks thinking about other aspects of performance, especially when the software is running at scale.
The rest were all very familiar. Well, apart from the new stuff. I think most of my code was running in java 6...
I’ve heard about HFT people using Java for workloads where micro optimization is needed.
To be frank, I just never understood it. From what I've seen/heard, you have to write the code in such a way that it looks clumsy and is incompatible with pretty much any third-party dependency out there.
And at that point, why are you even using Java? Surely you could use C, C++, or any variety of popular or unpopular languages that would be more fitting and ergonomic (sorry, but as a language Java just feels inferior even to C#). The biggest selling point of Java is the ecosystem, and you can't even really use that.
Too many folks have the mindset that there is only one JVM, when that hasn't been the case since the 2000s, after Java for various reasons started popping up everywhere.
After all, even if one has some slow, beastly, unoptimized Spring Boot container that chews through RAM, it's not that expensive (in the grand scheme of things) to just replicate more instances of it.
In practice, for web applications, exposing some sort of `WarmupTask` abstraction in your service chassis that devs can implement will get you quite far. Just delay serving traffic on new deployments until all tasks complete; that way users will never hit a cold node.
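A minimal Java sketch of that idea (names like `WarmupTask` and `WarmupGate` are illustrative, not from any particular framework): readiness flips only after every registered task has completed.

```java
import java.util.List;

// Illustrative sketch: the service chassis runs every registered WarmupTask
// before the readiness probe reports healthy, so users never hit a cold node.
interface WarmupTask {
    void warmUp() throws Exception;
}

final class WarmupGate {
    private final List<WarmupTask> tasks;
    private volatile boolean ready = false;

    WarmupGate(List<WarmupTask> tasks) {
        this.tasks = tasks;
    }

    void runAll() {
        for (WarmupTask task : tasks) {
            try {
                task.warmUp(); // e.g. prime caches, exercise hot paths so the JIT warms up
            } catch (Exception e) {
                throw new IllegalStateException("Warmup failed", e);
            }
        }
        ready = true; // the readiness endpoint would consult this flag
    }

    boolean isReady() {
        return ready;
    }
}
```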
* Most mature Java projects have moved to Kotlin.
* The standard build system is Gradle, scripted in either Groovy or Kotlin, which gets compiled on the JVM so it can then compile your Java.
* Log4shell, amongst other vulnerabilities.
* Super slow to adopt features like async execution
* Standard repo usage is terrible.
There is no point in using Java anymore. I don't agree that Rust is a replacement, but between Python, Node, and C/C++ extensions to those, you can do everything you need.
https://gwern.net/doc/cs/2005-09-30-smith-whyihateframeworks...
My experience with something like the latest Claude Code models these days has been that they are pretty good at SQL. I think some combination of LLM review of SQL code with smoke tests would do the trick here.
But on Java specifically: every Java object still has a 24-byte overhead. How doesn't that thrash your cache?
The advice on avoiding allocations in Java also results in terrible code. For example, in math libraries, you'll often see void Add(Vector3 a, Vector3 b, Vector3 out) as opposed to the more natural Vector3 Add(Vector3 a, Vector3 b). There you go: function composition goes out the window, and the resulting code is garbage to read and write. Not even C is that bad; the compiler will optimize the temporaries away. So you end up with Java that is worse than a low-level imperative language.
And, as far as I know, the best GC for Java still incurs no less than 1ms pauses? I think the stock ones are as bad as 10ms. How anyone does low-latency anything in Java then boggles my mind.
https://foojay.io/today/how-is-leyden-improving-java-perform...
There's a balance with a DB. Doing 1- or 2-row queries 1,000 times is obviously inefficient, but making a 1M-row query can have its own set of problems all the same (even if you need that 1M).
It'll depend on the hardware, but you really want to make sure that anything you do with a DB allows for other instances of your application a chance to also interact with the DB. Nothing worse than finding out the 2 row insert is being blocked by a million row read for 20 seconds.
There's also a question of when you should and shouldn't join data. It's not always a black and white "just let the DB handle it". Sometimes the better route to go down is to make 2 queries rather than joining, particularly if it's something where the main table pulls in 1000 rows with only 10 unique rows pulled from the subtable. Of course, this all depends on how wide these things are as well.
But 100% agree, ORMs are the worst way to handle all these things. They very rarely do the right thing out of the box and to make them fast you ultimately end up needing to comprehend the SQL they are emitting in the first place and potentially you end up writing custom SQL anyways.
AOT options like GraalVM Native Image can help cold starts a lot, but then half your favorite frameworks break and you trade one set of hoops for another. Pick which pain you want.
I long ago concluded that Java was not a client or systems programming language because of the implementation priorities of the JVM maintainers. Note that I say priorities--they are extremely bright and capable engineers that focus on different use cases, and there isn't much money to be made from a client ecosystem.
The folks on embedded get to play with PTC and Aicas.
Android, even if not proper Java, has dex2oat.
I think Java (or other JVM languages) are then best positioned, because of jooq. Still the best SQL generation library I've used.
If it is valuable, I'd be surprised if you can't freeze/resume the state and use it for instantaneous workload-optimized startup.
There are JITs that use dynamic profile guided optimization which can adjust the emitted binary at runtime to adapt to the real world workload. You do not need to have a profile ahead of time like with ordinary PGO. Java doesn't have this yet (afaik), but .NET does and it's a huge deal for things like large scale web applications.
https://devblogs.microsoft.com/dotnet/bing-on-dotnet-8-the-i...
Apart from that my experience over the last 20 years was that a lot of performance is lost because of memory allocation (in GCed languages like Java or JavaScript). Removing allocation in hot loops really goes a long way and leads to 10 or 100 fold runtime improvements.
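A small, hedged illustration of that point in Java: reusing one buffer across iterations instead of allocating a fresh object each time (the row-formatting loop here is made up).

```java
import java.util.ArrayList;
import java.util.List;

public class BufferReuse {
    // Format rows using a single reused StringBuilder; setLength(0) resets the
    // buffer in place rather than allocating a new builder per iteration.
    static List<String> formatRows(int count) {
        List<String> rows = new ArrayList<>(count);
        StringBuilder buf = new StringBuilder();
        for (int i = 0; i < count; i++) {
            buf.setLength(0);             // reset, no new allocation
            buf.append("row-").append(i);
            rows.add(buf.toString());     // toString() still copies, but the builder is reused
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(formatRows(3)); // [row-0, row-1, row-2]
    }
}
```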
That said, the article does have the "LLM stank" on it, which is always offputting, but the content itself seems solid.
Well, the whole thing was standard Java OOP, except they also had a bunch of functional programming stuff on top of that. I can relate to that -- I think they were university students when they started, and I definitely had an OOP and FP phase. But then they just... kept it, 10+ years later.
So while it's true that you can write C in any language... those kind of folks don't tend to use Java in the first place ;)
--
(Except Notch? Well, his code looks like C, not sure if it's actually fast! I really enjoyed his 4 kilobyte java games back in the day, I think he published the source for each one too.)
EDIT: Found it!
https://web.archive.org/web/20120317121029/http://www.mojang...
Edit 2: This one has a download, still works!
https://web.archive.org/web/20120301015921/http://www.mojang...
Doing it to avoid memory pressure generally means you simply have a bad algorithm that needs to be tweaked. It's very rarely the right solution.
I've spent a fair few years developing lowish (10-20us wire to wire) latency trading systems and the majority of the code does not need to go fast. It's just wasted effort, a debugging headache, and technical debt. So the natural trade off is a bit of pain to make the hot path fast through spans, unsafe code, pre-allocated object pools, etc and in return you get to use a safe and easy programming language everywhere else.
In C# low latency dev is not even that painful, as there are a lot of tools available specifically for this purpose by the runtime.
This is actually the perfect situation: you are allowed to do it carefully and manually for 1% of code on the hot path, but you don't have to worry about it for the 99% of the code that's not.
A project might also grow into these requirements. I can easily imagine that something wasn't problematic for a long time but suddenly emerged as an issue over time. At that point you wouldn't want to migrate the whole codebase to a better language anymore.
Like I said, it's not hypermodern with batteries included, nor streamlined for the workflows that became common after it was created, but it doesn't need workarounds: defining a plugin to be called in one of the lifecycle steps isn't complicated, and that capability is provided as part of its plugin architecture.
I can understand spending many hours fighting Gradle; even I, with plenty of experience with Gradle (begrudgingly, I don't like it at all), still end up fighting its idiocies. But Maven... it's like any other tool: you need to learn the basics, but after that you will only fight it if you're veering away from the well-documented usage (of which there is plenty; it's been battle-tested for decades).
They store up conserved programming time and then spend it all at once when you hit the edge case.
If you never hit the case, it's great. As soon as you do, it's all returned with interest :)
Because in my experience as of 2026, Java programs are consistently among the most painful or unpleasant to interact with.
When using JDBC, I quickly found myself implementing a poor man's ORM.
I am talking about this bug. It looks like it is still unfixed, in the sense that there is a PR fixing it, but it wasn't merged. LOL.
Regardless of whether this specific bug would be caught by Rust compiler, Bun in general is notorious for crashing, just look at how many open issues there are, how many crashes.
Not saying that you cannot make a correct program in Zig, but I prefer having checks that Rust compiler does, to not having them.
It's cool when your tooling warns you about potential bugs or mistakes in implementation, but it's still your responsibility to write the correct code. If you pick up a hammer and hit your finger instead of the nail, then in most cases (though not always) it’s your own fault.
Oracle is a prime example of this. Stored procedures are the place to put all business logic according to Oracle documentation.
This caused a backlash from developers, who then declared that business logic should never live inside the database, to avoid vendor lock-in.
There's no ideal solution, just tradeoffs.
> But people just want to compare it to building a cli tool in go or rust.
This seems like the key. HN is definitely biased towards simpler, smaller tools. (And that's not a bad thing!). The most compelling JVM stories I hear are all from much larger scale enterprise settings.
Kafka being a good example. It's very good at what it does, but painful to manage and usually not worth the pain for anyone who's not in a mega enterprise.
I mean, that already happens. It's quite rare to see someone migrate from one database to another. Even if they stuck to pure SQL for everything, it's still a pretty daunting process, as the Postgres and MSSQL dialects of SQL aren't the same thing.
Part 1 of 3 in the Java Performance Optimization series. Parts 2 and 3 coming soon.
I built a Java order-processing app for a talk I gave at DevNexus a couple of weeks ago. The app worked. Tests passed. I ran a load test and collected a Java Flight Recording (JFR).
Before any changes: 1,198ms elapsed time, 85,000 orders per second, peak heap sitting at just over 1GB, 19 GC pauses.
After: 239ms. 419,000 orders per second. 139MB heap. 4 GC pauses.
Same app. Same tests. Same JDK. No architectural changes. And those numbers get a lot more meaningful when you consider that code like this doesn’t run on a single box in production. It runs across a fleet.
In Part 2 I’ll walk through the profiling data behind those numbers: the flame graph, which methods were actually hot, and what changed when we fixed them. Before we get there, you need to know what kinds of things we were actually fixing.
The problems were patterns that show up in real codebases. They compile fine, they sneak through code review, and they’re the kind of thing you could miss without profiling data telling you where to look. Here are eight of them.
TL;DR: Fixing anti-patterns like the ones below turned a Java app that took 1,198ms into one that took 239ms: 5x throughput, 87% less heap, 79% fewer GC pauses. Same app, same tests, same JDK.
String report = "";
for (String line : logLines) {
report = report + line + "\n";
}
This code looks good, right? The problem is what String immutability means in practice.
Every time you use +, Java creates a brand new String object, a full copy of all previous content with the new bit appended. The old one gets discarded. This happens every single iteration.
The characters being copied scale as O(n²). If you have 10,000 lines, iteration 1 copies roughly nothing, iteration 5,000 copies 5,000 characters worth of accumulated content, iteration 10,000 copies all of it. BellSoft ran JMH benchmarks on exactly this and showed that when n grows by 4x, the loop-concatenation version slows down by more than 7x, much worse than linear growth.
The fix:
StringBuilder sb = new StringBuilder();
for (String line : logLines) {
sb.append(line).append("\n");
}
String report = sb.toString();
StringBuilder works off a single mutable character buffer. One allocation. Every append writes into that buffer. One toString() at the end.
Note: Since JDK 9, the compiler is smart enough to optimize "Order: " + id + " total: " + amount on a single line. But that optimization doesn’t carry into loops. Inside a loop, you still get a new StringBuilder created and thrown away on every iteration. You have to declare it before the loop yourself, like the fix above shows.
for (Order order : orders) {
int hour = order.timestamp().atZone(ZoneId.systemDefault()).getHour();
long countForHour = orders.stream()
.filter(o -> o.timestamp().atZone(ZoneId.systemDefault()).getHour() == hour)
.count();
ordersByHour.put(hour, countForHour);
}
This looks reasonable. You’re grouping orders by hour. But look at what’s happening: for each order, you’re streaming over the entire list to count how many orders share that hour. If you have 10,000 orders, that’s 10,000 iterations times 10,000 stream elements. That’s 100 million comparisons for what should be a single pass.
In my demo app, this exact pattern was the single largest CPU hotspot. It accounted for nearly 71% of CPU stack samples in the JFR recording.
The fix:
for (Order order : orders) {
int hour = order.timestamp().atZone(ZoneId.systemDefault()).getHour();
ordersByHour.merge(hour, 1L, Long::sum);
}
One pass. O(n). Each order increments its hour’s count directly. You could also use Collectors.groupingBy(... Collectors.counting()) to do it in a single stream pipeline, but the merge approach is clear and avoids the overhead of creating a stream at all.
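For reference, here's a sketch of the single-pass stream alternative mentioned above, using a minimal stand-in for the article's Order type (the record definition and sample data are illustrative).

```java
import java.time.Instant;
import java.time.ZoneId;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupByHour {
    // Minimal stand-in for the article's Order type
    record Order(Instant timestamp) {}

    // One pass over the orders: group by hour, counting per group
    static Map<Integer, Long> ordersByHour(List<Order> orders, ZoneId zone) {
        return orders.stream()
                .collect(Collectors.groupingBy(
                        o -> o.timestamp().atZone(zone).getHour(),
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(
                new Order(Instant.parse("2024-01-01T09:15:00Z")),
                new Order(Instant.parse("2024-01-01T09:45:00Z")),
                new Order(Instant.parse("2024-01-01T10:05:00Z")));
        System.out.println(ordersByHour(orders, ZoneId.of("UTC")));
    }
}
```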
If you see a .stream() call inside a loop body, that’s a signal to pause and check whether you’re doing redundant work.
public String buildOrderSummary(String orderId, String customer, double amount) {
return String.format("Order %s for %s: $%.2f", orderId, customer, amount);
}
String.format() tends to get recommended as the clean, readable way to build strings. It is readable, but it's also the slowest string-building option in Java when you're calling it frequently.
Baeldung ran JMH benchmarks across every string concatenation approach in Java. String.format() came in last in every category. It has to parse the format string every call, run regex-based token matching, and dispatch through the full java.util.Formatter machinery. StringBuilder was consistently the fastest.
The fix:
return "Order " + orderId + " for " + customer + ": $" + String.format("%.2f", amount);
Use String.format() for the numeric formatting where you need it, and let the compiler optimize the rest. Or just use a StringBuilder if you need full control.
String.format() is fine for config loading, startup code, error messages, anywhere that runs infrequently. Move it out of anything your profiler says is hot.
Long sum = 0L;
for (Long value : values) {
sum += value;
}
What’s actually happening at the JVM level:
Long sum = Long.valueOf(0L);
for (Long value : values) {
sum = Long.valueOf(sum.longValue() + value.longValue());
}
Each iteration unboxes sum to get a long, adds, then boxes the result back into a new Long object. With a million elements, you’ve created a million Long objects that the GC has to clean up. Each Long on a 64-bit JVM takes roughly 16 bytes on the heap. That’s 16MB of heap churn for what should be a simple addition loop.
long sum = 0L; // primitive, not the wrapper
for (long value : values) {
sum += value;
}
Where this tends to sneak in: aggregation and processing loops. Summing metrics, accumulating counters, building stats. Boxed types creep in because someone used Long in a collection signature somewhere upstream and nobody thought about what it costs downstream in the loop. That can be legitimately easy to miss.
Watch for Integer, Long, or Double used as local loop variables or accumulators. Also watch for List<Long> and Map<String, Integer> in frequently-called code. Every .get() and .put() involves a box/unbox round trip that you’re paying for silently.
public int parseOrDefault(String value, int defaultValue) {
try {
return Integer.parseInt(value);
} catch (NumberFormatException e) {
return defaultValue;
}
}
If this method is called in a tight loop with a meaningful percentage of non-numeric inputs, you have a performance problem that might not look like one.
The expensive part is Throwable.fillInStackTrace(), which runs inside the Throwable constructor every time an exception is created. It walks the entire call stack via a native method and materializes it into StackTraceElement objects. The deeper your call stack, the more expensive this is. Imagine a situation in a framework like Spring where this can get very deep. Norman Maurer from the Netty project benchmarked this and the difference is significant. Baeldung’s JMH results show that throwing an exception makes a method run hundreds of times slower than a normal return path.
This isn’t theoretical. There’s a real production story of a Scala/JVM templating system that cut response time by 3x after discovering that a NumberFormatException was being thrown on every field of every template render. Every time a field name was being tested to see if it was a numeric index, it threw.
The fix:
public int parseOrDefault(String value, int defaultValue) {
    if (value == null || value.isBlank()) return defaultValue;
    for (int i = 0; i < value.length(); i++) {
        char c = value.charAt(i);
        if (i == 0 && c == '-') {
            if (value.length() == 1) return defaultValue; // a bare "-" is not a number
            continue;
        }
        if (!Character.isDigit(c)) return defaultValue;
    }
    try {
        return Integer.parseInt(value);
    } catch (NumberFormatException e) {
        return defaultValue; // now only reachable on overflow, e.g. "99999999999"
    }
}
Or use NumberUtils.isParsable() from Apache Commons Lang if it’s already on your classpath.
The principle: if invalid input is a routine case in your application, user-provided data, external feeds, anything you don’t fully control, pre-validate explicitly. Exceptions are for genuinely unexpected conditions, not for “this might be in the wrong format.”
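When throwing on a hot path is genuinely unavoidable, a complementary technique (not used in the demo app, just a sketch) is an exception type that opts out of stack-trace capture via the four-argument Throwable constructor:

```java
// Sketch: a control-flow exception that skips stack-trace capture entirely.
// The fourth constructor argument, writableStackTrace = false, means
// fillInStackTrace() is never called, avoiding the expensive stack walk.
class FastParseException extends RuntimeException {
    FastParseException(String message) {
        super(message, null, false, false); // no suppression, no writable stack trace
    }
}

public class StacklessDemo {
    public static void main(String[] args) {
        try {
            throw new FastParseException("not a number");
        } catch (FastParseException e) {
            // No stack walk happened at construction time
            System.out.println(e.getStackTrace().length); // prints 0
        }
    }
}
```

The trade-off is obvious: if such an exception ever escapes, you get no trace to debug with, so this belongs only on paths where the exception is part of expected control flow.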
public class MetricsCollector {
private final Map<String, Long> counts = new HashMap<>();
public synchronized void increment(String key) {
counts.merge(key, 1L, Long::sum);
}
public synchronized long getCount(String key) {
return counts.getOrDefault(key, 0L);
}
}
Shared mutable state needs protection. But synchronized on the whole method means only one thread can call either method at any given time. In a service handling real concurrency, every thread calling increment() queues up waiting for every other thread to finish. The lock itself becomes the bottleneck.
The fix:
private final ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();
public void increment(String key) {
counts.computeIfAbsent(key, k -> new LongAdder()).increment();
}
public long getCount(String key) {
LongAdder adder = counts.get(key);
return adder == null ? 0L : adder.sum();
}
ConcurrentHashMap handles concurrent reads and writes without locking the whole structure. LongAdder is purpose-built for high-concurrency incrementing. It distributes the counter across internal cells and outperforms AtomicLong under contention.
Worth calling out separately: Collections.synchronizedMap() wrappers have the same broad-lock problem, one lock for the entire map. ConcurrentHashMap is almost always the right replacement.
public String serializeOrder(Order order) throws JsonProcessingException {
return new ObjectMapper().writeValueAsString(order);
}
ObjectMapper is one of the most common examples of an object that looks cheap to create but isn’t. Constructing one involves module discovery, serializer cache initialization, and configuration loading. It’s real work happening on every call here.
Same pattern with DateTimeFormatter.ofPattern("..."), new Gson(), new XmlMapper(). They’re all designed to be constructed once and reused. Creating them in a hot method means paying that setup cost on every invocation.
The fix:
private static final ObjectMapper MAPPER = new ObjectMapper();
public String serializeOrder(Order order) throws JsonProcessingException {
return MAPPER.writeValueAsString(order);
}
ObjectMapper is thread-safe once configured, so sharing a static final instance is fine. The DateTimeFormatter built-ins like DateTimeFormatter.ISO_LOCAL_DATE are already singletons. If you’re calling DateTimeFormatter.ofPattern("...") in a hot method, move it to a constant.
The heuristic: if an object’s constructor does substantial setup work and the object is stateless (or safely shareable) after construction, it should be a field or a constant, not a local variable.
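Applying that heuristic to DateTimeFormatter (the pattern here is just an example):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class Timestamps {
    // Built once; DateTimeFormatter is immutable and thread-safe, so sharing is fine
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    static String format(LocalDateTime t) {
        return t.format(TS); // no per-call ofPattern() pattern parsing
    }

    public static void main(String[] args) {
        System.out.println(format(LocalDateTime.of(2024, 1, 2, 3, 4, 5)));
        // 2024-01-02 03:04:05
    }
}
```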
This one is worth including if you’ve started using virtual threads, introduced as a production feature in Java 21.
Virtual threads work by mounting onto a small pool of platform (OS) threads called carrier threads. When a virtual thread blocks, waiting on I/O for example, the scheduler unmounts it from the carrier, freeing that carrier to run something else. That’s the whole scalability story with virtual threads.
But there’s a catch. When a virtual thread enters a synchronized block and hits a blocking operation while inside it, it can’t be unmounted. It pins the carrier thread. That platform thread is now stuck waiting, unable to serve other virtual threads, for as long as the blocking operation takes.
// This pattern can pin a carrier thread on JDK 21
public synchronized String fetchData(String key) throws IOException {
return Files.readString(Path.of("/data/" + key)); // blocking I/O inside synchronized
}
If this happens frequently enough, all your carrier threads get pinned and your application stalls, even with thousands of virtual threads waiting to do work. Netflix ran into exactly this in production and wrote a post about debugging it.
JFR actually tells you when this is happening. The jdk.VirtualThreadPinned event fires whenever a virtual thread blocks while pinned, and by default it only triggers when the operation takes longer than 20ms, so it’s already filtered to the cases that actually matter.
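If you'd rather capture those events programmatically than via command-line flags, here's a hedged sketch using the jdk.jfr Recording API (the API is JDK 11+; the jdk.VirtualThreadPinned event itself only exists on JDK 21+, and enabling an unknown event name is harmless on older JDKs):

```java
import jdk.jfr.Recording;

import java.nio.file.Path;
import java.time.Duration;

public class PinningRecorder {
    public static void main(String[] args) throws Exception {
        try (Recording recording = new Recording()) {
            // Record only pinned-while-blocked episodes longer than 20ms
            recording.enable("jdk.VirtualThreadPinned").withThreshold(Duration.ofMillis(20));
            recording.start();
            // ... run the suspect workload here ...
            recording.stop();
            recording.dump(Path.of("pinning.jfr")); // inspect in JDK Mission Control
        }
    }
}
```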
The fix on JDK 21–23:
private final ReentrantLock lock = new ReentrantLock();
public String fetchData(String key) throws IOException {
lock.lock();
try {
return Files.readString(Path.of("/data/" + key));
} finally {
lock.unlock();
}
}
ReentrantLock doesn’t use OS-level object monitors, so the JVM can unmount the virtual thread normally when it blocks, instead of pinning it to the carrier.
JDK 24 note: JEP 491, shipped in Java 24, largely resolves this. synchronized no longer causes pinning in most cases on JDK 24+. If you’re still on 21, 22, or 23, this is still relevant and worth checking for with JFR. If you’re on 24, you mostly don’t have to worry about it for synchronized, though native method calls can still cause pinning.
None of these patterns crash your application. They don’t throw exceptions or produce wrong answers. They just make everything a bit slower, chew through more memory, and scale worse than they should.
What makes them hard to find without profiling is that any one of them might be completely harmless in your codebase. String concatenation in a loop that runs once at startup costs you nothing. String.format() in a utility class called twice a day is fine. The issue is when these patterns land in hot paths, code that runs on every request, every event, every iteration of your main processing loop.
In my demo app, patterns like these and others turned a 239ms operation into a 1,198ms one and pushed heap usage from 139MB to over 1GB. No single pattern was catastrophic in isolation. But fix the heap pressure and GC pauses dropped from 19 to 4. Fix the contention and now new hotspots become visible that were previously buried under the noise. The shape of the profile shifts.
And these improvements compound again beyond a single application. Some of these optimizations might seem trivial when you’re looking at a single instance or seeing small improvements in your test suite run time. But often real world Java code doesn’t run on one box. In production, there are apps that run across a fleet handling a large volume of real customer requests. An improvement that shaves a few milliseconds or reduces heap pressure on one host is happening across thousands of hosts simultaneously. At that scale, the aggregate difference is incredible. Cost impact can be significant when you consider throughput improvements and potential instance downsizing across a fleet.
That cascading effect is what I want to show in Part 2, directly in JDK Mission Control. You’ll see the flame graph before any changes, then what it looks like after the first round of fixes, and how the picture keeps changing. In Part 3, we’ll look at automating the process of identifying and implementing performance improvements.
If any of these look familiar, wait until you see what the flame graph looks like. I’m on LinkedIn. Part 2 coming soon: One Method Was Using 71% of CPU. Here’s the Flame Graph.