Similarly, you often have to remind devs that in many databases an UPDATE is just an INSERT + DELETE, with all of the scaling issues implied.
The better approach is either to change your storage engine (e.g. OrioleDB is working on adding the undo log to Pg), or to shard which distributes the vacuum load across multiple servers.
DELETE with well-tuned autovacuum works pretty well. Have seen it work at TBs scale with no hicuups. If DELETEs are large, we used to recommend customers to follow that with a manual VACUUM for table to reclaim space right away for future rows.
DROP TABLE can be risky, it requires an ACCESS EXCLUSIVE LOCK and if its waiting, it blocks all other statements following it, because of how lock queues work in Postgres. And you cannot keep doing high concurrent DROP TABLEs to run your large scale CRUD app.
I sincerely hope that Planetscale’s efforts succeed long-term to shift devs’ understanding and acceptance of RDBMS operations. Their blog posts and docs are generally quite good. IME, devs (and even ops-ish teams) simply do not care about all of this, and will create elaborate bespoke tooling to run DELETEs in bulk, because they either don’t understand the capabilities of the database, or don’t want to deal with the [minor] increased complexity that a partitioned schema brings, and will happily pay the extra cost / latency for deletions.
Don't get me wrong, I've definitely done it before, but it's in the same bucket as VACUUM for me... high impact interventions used to fix a mistake I made, not "course of business" actions.
- We wrote a cronjob to periodically DELETE for a retention policy on a table we'd just created. Most senior person on the team reviewed it, looked fine.
- Unusually for us, we prioritize QA'ing a different feature for release, delaying the release of this cronjob and a bunch of other code.
- During that delay, the new table accumulated many times more rows to be deleted than we'd expected during review.
- Release happens. All looks well since the initial delete wasn't a migration and cronjob hasn't run yet; engineer doing the release signs off.
- Cronjob runs, deleting hundreds of millions of rows quickly.
- Next day, replica lag's high and MySQL's transaction history is very high. MySQL keeps transaction history around until purge threads have visited all the affected pages on disk.
- The bad cluster conditions last for days and lead to other problems.
This omits detail and the 'noise' of everything else we were watching. But it gets across how the code and MySQL behaved.
Like most exciting events, it led to multiple changes to avoid a repeat. For retention policies, our new approach was one at the end of PlanetScale's post, to partition and drop old partitions. Transitioning to this from a huge unpartitioned table can be fun!
If a table is append-only and already huge, with lots of rows already past the retention threshold, you might only copy the rows to be kept to the new partitioned table: copy what you can, lock tables, do a last catch-up copy and swap tables. (Roughly the blog's 'performant one-off delete'.)
If the table's merely kind of big, gh-ost or such could allow you to ALTER without causing lag, locking, etc.
At a scale below that, you could run a slow incremental 'nibble' delete while watching server stats, and a step below that, plain ALTERs or DELETEs are fine.
Using partitioning has fun bits, too. In MySQL, the partition key has to be part of any unique index, understandably. But you have to keep that in mind when you're using INSERT..ON DUPLICATE KEY UPDATE and relying on uniqueness to trigger the update. Things stay interesting!
I hear Vitess shops like PlanetScale usually don't run multi-terabyte myqsld instances in the first place: even when physical nodes are big, they run many smaller mysqlds on them. That wouldn't make all this fully irrelevant--huge deletes would still sometimes be worse than copy-swap-drop--but it does seem real handy for taming issues that tend to worsen with mysqld size, like replication lag. All to say, little bit jelly of their setup over there!
> Counterintuitively, large DELETEs add work to the database.
There is nothing counterintuitive about this. It takes just as much work to delete a row as it takes to insert a row. Why wouldn't it? Obviously you have to do almost all the same operations: write a log, write the deletion, update indices, replicate it, etc.
And yes, it's a well-known trick for all major relational databases (not just Postgres) that if you want to delete 90% of rows from a large table, it's much faster to just copy the rows you want to keep to a new table, run DROP TABLE on the old table, and rename the new table to the old table. Since DROP TABLE is ~instantaneous, mainly involving table-level metadata.
DELETE scales just fine, in the sense that if you are constantly inserting and deleting individual rows, DELETE scales the same as INSERT.
Basic database functionality is designed around the assumption of lots of small transactions. Whenever you have to do something involving millions of rows at once, you generally need to investigate solutions that work well in "bulk". E.g. loading rows directly from a file rather than with SQL, adding indices only after the data has been loaded rather than before, disabling foreign key checks on large operations (if you know by design that the keys are valid)... and yes, taking advantage of DROP TABLE instead of DELETE. This doesn't mean small transactions aren't scalable, it just means bulk operations are qualitatively different and benefit from their own solutions. And DELETE is no different from INSERT in this regard.
> And you cannot keep doing high concurrent DROP TABLEs to run your large scale CRUD app
In this kind of use case/design, I would assume it would make use of partitions to make this more palatable in which case it would seem that you would bypass this issue of "high concurrent DROP TABLE". Large scale CRUD app just points to recent-ish partitions. Old partitions are either going to be low or on access and can be dropped easily or transformed/transferred into some long term/cold storage.Technically correct, but for a small table with a high churn rate, the performance characteristics may be surprising in that the "n" in most big-O calculations includes all inserts since the last VACUUM, not the actual number of resident rows.
been exploring clickhouse and while it is definitely not a general purpose DB, for time-series shaped data that can survive some insert latency, the automatic partition-based TTL is very nice and, at least so far, requires zero attention to maintain
which I guess is solved by `pg_partman` at the bottom of the post
It takes far more work to delete/update than insert. My recent example is updating ~2TB of text data was about 40x slower than inserting 12TB (was trying to correct some large text truncation that occurred during migration into PG, ended up being faster to redo).
Dumb question but why does the optimizer not just do that in secret then? Seems like something that should be detectable with some heuristics.
The entire premise here is really about time-series workloads where most operations are based on a timestamp. In those apps partition dropping has been a standard and recommended retention strategy for years. That's precisely why extensions like pg_partman and TimescaleDB exist. Given that context, the title feels more clickbaity, and could easily mislead readers into thinking this applies broadly to OLTP systems when it doesn't;
> disabling foreign key checks on large operations
And you have to know that, according to your business logic, what you're doing is safe.
Updating rows of text data is going to be more work, because variable-length text can't be updated in-place. So in terms of allocating space, it's more like a delete plus an insert. That's not surprising. (An in-place update that doesn't touch indices is generally going to be faster than an insert, though.)
I'm not aware of instances where a delete is "far more work" than an equivalent insert though. That's not the general case, and I'm having a hard time thinking of any situations where that would be true.
You can only do the DROP TABLE trick if you know nothing else is writing to the table at the same time. You know if that's the case, according to your business logic. The database has no idea.
The DROP TABLE trick effectively bypasses all the normal guarantees of data consistency. This is why it's so fast. But you have to know that that's a safe thing to do for your data.
Unless you're using zHeap, you have a narrow Heap-only-Tuples scenario where the indexes stay the same. TOAST kinda helps there, if the update is off the tuple area itself. The original zHeap docs have a lot of detail about why an UNDO log can help with long running transactions from the past etc.
That is a postgresql specific thing though. Mysql indexes were created with the idea of different storage engines in mind, so Mysql doesn't suffer from the index update ovehead on update/delete the same way.
Uber had a long blog post about switching to Mysql from Postgres for wide tables with hundreds of indexes. The HN entry is still there[1], but I can't read the original post now.
As a side note, I've used postgres partitions to the same effect to drop old data periodically - detach and then drop the partition instead of a direct DELETE (similar tricks in HBase existed).
Transactionally across related items with constraints it can explode fast.
If you've ever used FoundationDB this rapidly becomes the defining PITA due to the transaction size limits. Adding/inserting/updates are all far more predictably bounded.
The tricky part is that the latency characteristics of these operations would be pretty surprising and unintuitive. It has the same problems as virtual memory and mark/sweep GC; sometimes, depending on system state and things that other threads are doing, an unrelated operation might block for very long time periods and give you huge user-visible pauses. It's often better to force these expensive operations to be explicit so that the developer has to think through the latency & consistency implications and make the tradeoffs they want.
Tom Pang | June 11, 2026
Counterintuitively, large DELETEs add work to the database.
From experience we can plainly claim the following: the most scalable Postgres data-deletion strategies revolve around deleting entire tables.
Individual row DELETE is fine at a small scale. However, big batch DELETE operations don't immediately free up physical disk space, add write and replication overhead, and are ultimately not good for large scale row cleanup.
If your application needs to delete large amounts of data, even very rarely, we recommend moving towards schema designs that let you express that as a DROP TABLE or a TRUNCATE.
Let's study why this is by looking at how DELETE works in Postgres.
When rows mutate, Postgres can maintain multiple versions of the same row, so that different transactions can see row values as of the time they were queried. This is Postgres' implementation of "Multi-Version Concurrency Control" (MVCC) and a core principle of its design.
Postgres makes an intentional tradeoff here. It stores modified and deleted rows alongside current ones, relying on transaction IDs and visibility maps to skip over "dead tuples." Later on, a vacuum process comes along and says, "Hey, these bytes in this heap page are now free, you can overwrite them."

Deletes also need to be fully replicated; they are still a work of writes, which means large-scale DELETEs can impact other writers to your application and cause them to wait for the DELETE replication to finish (under synchronous and semi-synchronous replication).
It's worth noting here that DELETE or even autovacuum doesn't typically return data to the operating system; they only say "the space in those pages can be written over." This is an intentional choice by Postgres. It optimizes for the case where DELETE workloads are mixed with INSERT ones, and releasing space to the operating system and then asking for it back is relatively expensive and should be avoided. VACUUM FULL allows for this, but takes an expensive lock for a long time.
Another related tradeoff Postgres makes is that index data is not touched at all when issuing a DELETE; instead, readers reading the index have to resolve "is this tuple dead." There's also a best-effort optimization where an index scan that finds a dead row can mark the entry as dead itself.
Overall, DELETE is really "work added," not "work done." If you want more details on Postgres MVCC, see Keeping a Postgres queue healthy.
If you're running a DELETE over a large amount of data, you can imagine how it adds work to every read query and autovacuum. Be aware that using foreign keys and CASCADE for deletions can cause a single row delete to delete gigabytes of data, resulting in the same set of problems.
In contrast, DROP TABLE and TRUNCATE require a heavyweight AccessExclusiveLock on the table, but are loosely independent of data size. At the physical layer they remove files from the operating system directly, plus sweep the Postgres buffer cache to remove pages related to the table.
That sweep can be less trivial on databases with large shared buffers, but it is only a metadata sweep. Postgres keeps a small fixed-size header (a BufferDesc, padded to 64 bytes) for every 8KB buffer, and dropping a table scans those headers, not the pages themselves. At 64 bytes per 8KB page, that's 1/128th of the cache size: with 128GB of shared buffers, you are sweeping only ~1GB of memory, sequentially, which is very fast on modern hardware.
DROP TABLE and TRUNCATE scale much better than DELETE. They produce zero dead tuples, zero vacuum debt, zero work for readers. They immediately free up space for the operating system.
One common case where folks need to delete large amounts of data is "my table is full of junk due to a bug." We encountered this recently in an internal observability tool. A bug caused the tool to write millions of rows that we wanted to delete from the database. The bad rows had an old updated_at timestamp; anything with a recent one was designed to be kept. There were only a few hundred thousand rows to keep; most of the data was junk.
For this case, especially because "lock the database for minutes" was not an issue at all, we performed some surgery, leaning on Postgres' transactional DDL:
BEGIN
Explicit LOCK TABLE ... IN ACCESS EXCLUSIVE MODE on the table in question; this prevents other transactions from reading or writing the table, so we get consistent data.
Create a temporary table to hold just the kept data, like so:
CREATE TEMP TABLE temp_keep_big_table AS SELECT * FROM big_table WHERE updated_at >= '2026-04-01';
TRUNCATE big_table;
INSERT INTO big_table SELECT * FROM temp_keep_big_table;. In our example, this took a handful of minutes to process on a very small instance with hundreds of thousands of rows.

This worked very well for a one-off; the only data written to the Write Ahead Log (WAL) are the reinserted rows in the big_table.
If holding an AccessExclusiveLock on the table for minutes during TRUNCATE is unacceptable, use a trigger-based approach instead: mirror writes to a new table, then swap with an atomic rename.
You should also know that this more advanced maneuver is roughly what the Postgres extension pg_squeeze (a more modern version of pg_repack) does. pg_squeeze is for optimizing tables that already have significant bloat. This blog post is really about preventing bloat in the first place. By structuring your schema to avoid large bulk DELETE, pg_squeeze becomes less necessary.
In cases where the data to keep is much larger than the data to discard, but the data to discard is still substantial, the typical approach is to perform many isolated batched deletes in a loop, e.g., 10,000 rows at a time. This keeps transactions short, avoids lock pileups, and lets you pace things so that autovacuum keeps up.
Since version 10, Postgres has had great partitioning support. A "parent" table can have "child" tables, and queries can be automatically routed to them. Postgres supports a variety of partitioning schemes; one that is extremely useful is date-based partitioning, but many others are available.
Partitioning can transform a workload that does "lots of DELETE" into a workload that does "occasional DROP TABLE." For example, if you have historical data that needs to be aged out, you can have a child partition per day, and a periodic process that deletes older child partitions (or use the pg_partman extension).

You can go further still. Partitioning in Postgres is recursive, so you could partition the top level by LIST (e.g., a "visible" rows partition), then partition the "not visible" child table by RANGE to age out old data.
DROPStructuring your schema and application so that large-scale DELETE becomes DROP or TRUNCATE can dramatically improve your database. It helps reduce read query latency in some cases, mitigates replication lag spikes, and overall improves database health.