<link rel="stylesheet" href="main.css?hash=sha384-5rcfZgbOPW7..." integrity="sha384-5rcfZgbOPW7..."/>
ETag: "sha384-5rcfZgbOPW7..."
Cache-Control: max-age=31536000, immutable

How do you do version updates? Add a content hash to every file except the root index.html.
Cache everything forever, except for index.html.
To deploy a new version, upload all files, making sure index.html goes last.
Since all hashed paths are unique, the old version continues to be served until the new index.html lands.
No cache invalidation is required since all files have unique paths, except index.html, which was never cached.
You have to be absolutely sure you have proper content hashes for everything: images, CSS, JS. Everything.
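The scheme above amounts to a tiny build step. Here is a minimal sketch; the `hashedName` helper and the truncated sha384 digest are illustrative choices, not a prescribed implementation — any strong digest works:

```typescript
import { createHash } from "node:crypto";

// Derive a content-addressed filename for an asset, e.g. main.css -> main.<hash>.css.
// Any change to the file contents produces a new path, so old versions can be
// cached forever and a new deploy never collides with them.
function hashedName(filename: string, contents: string | Buffer): string {
  const hash = createHash("sha384").update(contents).digest("hex").slice(0, 16);
  const dot = filename.lastIndexOf(".");
  return `${filename.slice(0, dot)}.${hash}${filename.slice(dot)}`;
}
```

Rewriting references inside index.html to point at the hashed names is then the only per-deploy mutation.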
“Incremental Static Regeneration” is also one of the funniest things to come out of this tech cycle.
The build logic to decide which things to rebuild is of course probably the interesting bit, but we don't need all these services... </grey-beard-rant>
Edit: to be less ranty, they are more or less building static sites out of their Next.js codebase, but on-demand updated etc., which is indeed interesting. But none of this needs Cloudflare/hyperscaler tech.
Not sure how many customers/sites they have. Perhaps they don't want to spend CPU regenerating all sites on every deployment? They do describe a content-driven pre-warmer but I'm still unclear why this couldn't be a content-driven static site generator running on some build machine
I have come to conclude it is that way because they focus on optimizing for a demo case that presents well to non-technical stakeholders. Doing one particular thing that looks good at a glance gets the buy-in, and then those who bought in never have to deal with the consequences of the decision once it is time to build something other than the demo.
For example, with CloudFront and S3, you use If-None-Match when uploading to ensure the deploy fails on conflict.
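A minimal sketch of that conditional upload, assuming an S3-compatible endpoint that honors `If-None-Match` on PUT (auth and signing omitted): sending `If-None-Match: *` means "only create, never overwrite", so a concurrent deploy surfaces as a 412 Precondition Failed instead of silently clobbering the object.

```typescript
// Build a conditional PUT request: If-None-Match: * tells the server to
// reject the write with 412 if an object already exists at that key.
function conditionalPut(url: string, body: string): Request {
  return new Request(url, {
    method: "PUT",
    headers: { "If-None-Match": "*" },
    body,
  });
}
```

Since every hashed path should be written at most once, any 412 here signals a real conflict worth failing the deploy over.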
However, it's probably more inexperience than anything. Nobody senior was around to tell our founders that they should go for an SSG architecture when they started /shrug. It's mostly worked out anyways though haha.
I already had HAProxy set up, so I added a stale-while-revalidate-compatible header from HAProxy. Cloudflare handles the rest.
Edit: I am not using vercel. Self hosted using docker on EC2.
It obviously can be done, but it's clearly not the intended solution, which really bothers me.
Now that I have the proper header added by HAProxy, Cloudflare's cache rules for stale-while-revalidate work.
If anyone can reach Cloudflare: please let us forcefully use stale-while-revalidate even when the upstream server says otherwise.
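For reference, adding such a header from HAProxy is a one-liner with the `http-response set-header` directive; the specific max-age and stale-while-revalidate values below are placeholders, not a recommendation:

```
# haproxy.cfg (backend or frontend section): tell downstream caches such as
# Cloudflare they may serve a stale copy for up to a day while revalidating
http-response set-header Cache-Control "public, max-age=60, stale-while-revalidate=86400"
```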
Mintlify powers documentation for tens of thousands of developer sites, serving 72 million monthly page views. Every pageload matters when millions of developers and AI agents depend on your platform for technical information.
We had a problem. Nearly one in four visitors experienced slow cold starts when accessing documentation pages. Our existing Next.js ISR caching solution could not keep up with deployment velocity that kept climbing as our engineering team grew.
We ship code updates multiple times per day, and each deployment invalidated the entire cache across all customer sites. This post walks through how we architected a custom edge caching layer to decouple deployments from cache invalidation, bringing our cache hit rate from 76% to effectively 100%.
We achieved our goal of fully eliminating cold starts and used a veritable smorgasbord of Cloudflare products to get there.

| Component | Purpose |
|---|---|
| Workers | docs-proxy handles requests; revalidation-worker consumes the queue |
| KV | Store deployment configs, version IDs, connected domains |
| Durable Objects | Global singleton coordination for revalidation locks |
| Queues | Async message processing for cache warming |
| CDN Cache | Edge caching with custom cache keys via fetch with cf options |
| Zones/DNS | Route traffic to workers |
We could have built a similar system on any hyperscaler, but leaning on Cloudflare's CDN expertise, especially for configuring tiered cache, was a huge help.
It is important to understand the difference between two key terms used throughout the following solution explanation: prewarming, the proactive cache warming we trigger when a customer updates their documentation, and revalidation, the reactive warming we trigger when we detect a version mismatch. Both ultimately warm the cache by fetching pages, but they differ in when and why they're triggered. More on this in sections 2 through 4 below.
We placed a Cloudflare Worker in front of all traffic to Mintlify hosted sites. It proxies every request and contains business logic for both updating and using the associated cache. When a request comes in, the worker proceeds through the following steps.
First, it computes a cache key based on the path, deployment ID, and request type. Our cache key structure is shown below. The cachePrefix roughly maps to the name of a particular customer, deploymentId identifies which Vercel deployment to proxy to, path identifies the correct page to fetch, and contentType lets us store both HTML and RSC variants of every page.
`${cachePrefix}/${deploymentId}/${path}#${kind}:${contentType}`;
For example: acme/dpl_abc123/getting-started:html and acme/dpl_abc123/getting-started:rsc.
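As a sketch, the key construction is plain string assembly; `buildCacheKey` is a hypothetical helper name, and the values of the `kind` segment are not spelled out in the post, so the one used here is an assumption:

```typescript
// Assemble the edge cache key: customer prefix, deployment, path, and variant.
// The #kind segment's possible values are an assumption for illustration.
function buildCacheKey(
  cachePrefix: string,
  deploymentId: string,
  path: string,
  kind: string,
  contentType: "html" | "rsc"
): string {
  return `${cachePrefix}/${deploymentId}/${path}#${kind}:${contentType}`;
}
```

Because deploymentId is part of the key, two deployments of the same site can never collide in the cache, which is what makes atomic version switches possible.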
The most innovative aspect of our solution is automatic version mismatch detection.
When we deploy a new version of our Next.js client to production, Vercel sends a deployment.succeeded webhook. Our backend receives this and writes the new deployment ID to Cloudflare's KV.
KV.put(`DEPLOY:${projectId}:id`, deploymentId);
Then, when user requests come through the docs-proxy worker, it extracts version information from the origin response headers and compares it against the expected version in KV.
const gotVersion = originResponse.headers.get('x-version');
const projectId = originResponse.headers.get('x-vercel-project-id');
const wantVersion = await KV.get(`DEPLOY:${projectId}:id`);
const shouldRevalidate = wantVersion !== gotVersion;
When a version mismatch is detected, the worker automatically triggers revalidation in the background using ctx.waitUntil(). The user gets the previously cached stale version immediately. Meanwhile, cache warming of the new version happens asynchronously in the background.
We do not start serving the new version of pages until we have warmed all paths in the sitemap: once a user loads the new version of any page after an update, all subsequent navigations must fetch that same version. If you were on v2 and then randomly saw v1 designs when navigating to a new page, it would be jarring and worse than pages loading slowly.
Our first concern when triggering revalidations for sites was that we were going to create a race condition where we had multiple updates in parallel for a given customer and start serving traffic for both new and old versions at the same time.
We decided to use Cloudflare's Durable Objects (DO) as a lock around the update process to prevent this. We execute the following steps during every attempted revalidation trigger.
1. Check DO storage for any in-flight updates, and ignore the trigger if there is one.
2. Write to DO storage to track that we are starting an update and "lock".
3. Enqueue the cachePrefix, deploymentId, and host info for the revalidation worker to process.
4. On completion, clear the DO state and unlock.

We also added a failsafe where we automatically delete the DO's data and unlock in step 1 if the lock has been held for 30 minutes. We know from our analytics that no update should take that long, so it is a safe timeout.
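The locking behavior can be sketched independently of the Durable Objects API. The in-memory class below, with an injectable clock, illustrates the same acquire/ignore/timeout logic; it is not Cloudflare's actual DO interface, and the class and method names are invented for illustration:

```typescript
const LOCK_TIMEOUT_MS = 30 * 60 * 1000; // failsafe: reclaim stuck locks after 30 minutes

class RevalidationLock {
  private heldSince: number | null = null;

  // The clock is injected so the timeout path is testable.
  constructor(private now: () => number = Date.now) {}

  // Returns true if this trigger acquired the lock; false if an
  // update is already in flight (the trigger should be ignored).
  tryAcquire(): boolean {
    if (this.heldSince !== null && this.now() - this.heldSince < LOCK_TIMEOUT_MS) {
      return false; // in-flight update: ignore this trigger
    }
    this.heldSince = this.now(); // free or stale: (re)take the lock
    return true;
  }

  // Called when the revalidation worker reports completion.
  release(): void {
    this.heldSince = null;
  }
}
```

In the real system the "held since" state lives in DO storage, so the singleton guarantee comes from Durable Objects rather than process memory.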
Cloudflare Queues make it easy to attach a worker that can consume and process messages, so we have a dedicated revalidation worker that handles both prewarming (proactive) and version revalidation (reactive). Using a queue to control the rate of cache warming requests was mission critical since without it, we'd cause a thundering herd that takes down our own databases.
Each queue message contains the full context for a deployment: cachePrefix, deploymentId, and either a list of paths or enough info to fetch them from our sitemap API. The worker then warms all pages for that deployment before reporting completion.
// Get paths from the message, or fall back to the sitemap API
const paths = message.paths ?? await fetchSitemap(cachePrefix);

// Process in batches of 6 (Cloudflare's concurrent connection limit)
for (const batch of chunks(paths, 6)) {
  await Promise.all(
    batch.flatMap((path) =>
      // Warm both HTML and RSC variants of each page
      ["html", "rsc"].map((variant) => {
        const cacheKey = `${cachePrefix}/${deploymentId}/${path}#${variant}`;
        const headers = { "X-Cache-Key": cacheKey };
        if (variant === "rsc") headers["RSC"] = "1";
        return fetchWithRetry(originUrl, { headers });
      })
    )
  );
}
Once all paths are warmed, the worker reads the current doc version from the coordinator's DO storage to ensure we're not overwriting a newer version with an older one. If the version is still valid, it updates the DEPLOYMENT:{domain} key in KV for all connected domains and notifies the coordinator that cache warming is complete. The coordinator only unlocks after receiving this completion signal.
Beyond reactive revalidation, we also proactively prewarm caches when customers update their documentation. After processing a docs update, our backend calls the Cloudflare Worker's admin API to trigger prewarming:
POST /admin/prewarm HTTP/1.1
Host: workerUrl
Content-Type: application/json
{
"paths": ["/docs/intro", "/docs/quickstart", "..."],
"cachePrefix": "acme/42",
"deploymentId": "dpl_abc123",
"isPrewarm": true
}
The admin endpoint accepts batch prewarm requests and queues them for processing. It also updates the doc version in the coordinator's DO to prevent older versions from overwriting newer cached content.
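A sketch of how a backend might assemble that call; the endpoint path and body shape are taken from the request above, while `buildPrewarmRequest` itself is an illustrative helper, not our actual code:

```typescript
interface PrewarmRequest {
  paths: string[];
  cachePrefix: string;
  deploymentId: string;
  isPrewarm: true;
}

// Build the POST the backend sends to the worker's admin API after a docs update.
function buildPrewarmRequest(workerUrl: string, body: PrewarmRequest): Request {
  return new Request(`${workerUrl}/admin/prewarm`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
}
```

Batching all changed paths into one request keeps the queue as the single place where warming concurrency is controlled.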
This two-pronged approach ensures caches stay warm through both proactive prewarming, triggered by documentation content updates, and reactive revalidation, triggered by new codebase deployments.
We have successfully moved our cache hit rate to effectively 100%, based on monitoring logs from the Cloudflare proxy worker over the past two weeks.
Our system is also self-healing. If a revalidation fails, the next request will trigger it again. If a lock gets stuck, alarms clean it up automatically after 30 minutes. And because we cache at the edge with a 15-day TTL, even if the origin goes down, users still get fast responses from the cache. Improving reliability as well as speed!
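The edge-caching knobs described here map onto the `cf` options of the Workers `fetch` API. A sketch of the options object, using the 15-day TTL mentioned above; note that `cacheKey`, `cacheTtl`, and `cacheEverything` are real Workers fetch options, but custom cache keys may require an Enterprise zone:

```typescript
const FIFTEEN_DAYS_S = 15 * 24 * 60 * 60; // 1,296,000 seconds

// cf options for the proxied fetch: cache at the edge under our custom key.
function cacheOptions(cacheKey: string) {
  return {
    cf: {
      cacheKey,                  // the composite key from earlier
      cacheTtl: FIFTEEN_DAYS_S,  // long TTL so the edge can cover origin outages
      cacheEverything: true,     // cache HTML responses, not just static assets
    },
  };
}
```

Inside the worker this would be used roughly as `fetch(originUrl, cacheOptions(key))`, with invalidation handled by changing the key rather than purging.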
If you're running a dynamic site and chasing P99 latency at the origin, consider whether that's actually the right battle. We spent weeks trying to optimize ours (RSCs, multiple databases, signed S3 URLs) and the system was too complicated to debug meaningfully.
The breakthrough came when we stopped trying to make dynamic requests faster and instead made them not happen at all. Push your dynamic site towards being static wherever possible. Cache aggressively, prewarm proactively, and let the edge do what it's good at.