Based on my understanding, olap queries will go to the parquet files which are stored in a columnar fashion and oltp style queries will go to a caching layer that sits on top of those parquet files?
What's the special sauce here? Seems like they're just caching the data which, for all intents and purposes, seems like the same solution of storing another copy of the data which is what they say they're avoiding.
Part of the value of doing an ETL pipeline via streaming replication is you get the full history of data in a table. An SCD type 2 table where each row also has a valid_from and valid_to timestamp column.
How would someone do the same thing with this architecture?
For Lakebase and Neon, our architecture needs the caching layer regardless (what we call Pageservers). Performing reads from S3 directly is too slow so we reconstruct pages and keep them on an nvme server for faster querying. Changing the format on S3 to be Parquet effectively introduces no additional copies over our existing architecture
We ended up with 'hot' data in oltp and 'cold/archival' data in olap because the storage size of oltp has always been limited.
(1) Limited by computation - there's only so much data that we can store on disks and nvme
(2) Limited by wallet - disks and nvme are EXPENSIVE
Also, the tight coupling of compute and data didn't help. It limited the size of databases on the individual expensive compute nodes.
So, another question will be -
What's currently stopping me from keeping the scd history tables right in my oltp db? what's forcing me to copy state into my etl/elt pipeline and the process it into scd into a dedicated olap db?
To some extent,the answer is still the same - the oltp cannot scale for the storage size required for keeping historical data. So, I've had to take out the 'cold' historical data and keep it in my olap freezer.
Now, if oltp itself is scaling, I'm not gonna bother with the copying step. I'll just prefer to store the history in oltp itself.
In my perspective (majorly from handling IoT systems), I need olap for 2 reasons - (1) storage scalability, and (2) analytical processing speed
I now consider (1) to be a solved problem
As for (2), I'm still not sure how this architecture ends up matching the query processing speeds of column-oriented storages. But again, I need to study more.
The SCD pipeline still remains in some form. Either in the form of (1) scd rows that we currently keep (etl pipeline) , or (2) as older lsn rows that simply don't get deleted (existing db engine).
I've done quite a lot of experimentation with (2), and it is a pretty solid concept to work with.
I've spent quite a lot of years hammering my brain at databases and datastores in general. And I've now got a feeling that this is it. Finally.
For example, if you know your user can change emails, and there might be events from another source that is keyed by user email (e.g. marketing-related events), then naturally you will need some sort of email_history table that has historical mapping of user id to email (you probably need it for audit purposes too). Then in this case there is no need to build SCD type 2 table of user from CDC, it's already there.
SELECT count * FROM my_table AS OF "2025-01-01"
https://delta.io/blog/2023-02-01-delta-lake-time-travel/
https://iceberg.apache.org/docs/latest/spark-queries/#spark-...
Parquet files are smaller than row based storage in a database (but not those databases with focus on strong compression).
And for backup - the files are probably easier to just copy to multiple disks for redundancy, as opposed to database dumps and incremental backups which at the Petabyte scale will be a pain.
there is a reason why people develop for S3: a lot of enterprise data is there. people ingest there from various sources. and it's not just parquet usually, it's multivendor sources writing to an iceberg catalog.
nobody will run minio on AWS other than hobby projects and small demos.
I regularly work with iceberg datasets in the double digit TB range per dataset. keep that in mind when you think about sizes. databricks, snowflake, large enterprise vendors: they are targeting these sizes.
So Ceph/SeaweedFS/RustFS/Garage are the alternatives I think
You realise not every company uses AWS for any/all its needs?
There are datacenters around the world owned by individual companies or co-located. And many companies still have servers on prem.
Compute and disks are getting more dense & liquid cooled, so less rack space is needed for same power.
And Minio and others can handle Petabytes+
https://www.cisco.com/c/en/us/products/collateral/servers-un...
Backblaze, Cloudflare R2 and other cheaper S3 compatible competitors also exist.
At this point S3 is a standard interface. All sorts of cloud providers and open-source projects provide S3. If you're on AWS, price isn't the reason. You pick AWS because you don't see your company taking a risk with anything else.
S3 doesn't mean expensive. AWS does. But AWS users are fully locked-in, they'll pay whatever the price is.
I've yet to met a Fortune 100 who isn't mostly using either on prem or a large hyperscaler (S3/Azure/GCS).
Have you ever spoken to a CTO? They most certainly are.
Also many are Microsoft houses so using Azure blob plus one of the reasons for Kubernetes/Openshift adoption was to be cloud neutral
Price is not the reason people chose AWS. Some companies use Azure. The current startup at $WORK uses yet another smaller Cloud. And yet AWS sill has the clear lead in market share. That's because price is far from the only factor, and not even the main factor.
You also mention removing CDC pipelines. I’m curious if the materialization (conversion across formats) can catchup to an OLTP workload that is heavy (50K+ tps), which is pretty common these days. Also CDC if done right and with care can be magical for users and stays native to the OLTP/OLAP data-store.
Third, data Lakes and open formats are suitable for Data Warehousing / Data analyst use-cases than real-time customer facing apps. Sure, you might work on changing that, which is what you are upto, but you’ll always run into tradeoffs, which will make it hard to unleash the best performance, much needed for the latter category.
In regards to error prone and speed (lag, latency at real-world scale), I wish the blog went into more detail and gave evidence than talk theory.
This architecture is better for OLTP because all maintenance operations are moved to storage AND it has all other benefits such as LTAP that emerge from having a scalable storage.
Separately, I understand taking care of it at storage level, but still don’t get “unifying storage” or “zero copy”.
Anyways, I’ll stop now. Good to see all the innovation happening on converging OLTP/OLAP front. Each with a different approach and perspective. :)
When I started my PhD at UC Berkeley 16 years ago, my advisor told me: "OLTP databases are a solved problem. They work. Focus on analytics." We were at the early innings of being able to collect far more data, structured and unstructured, and apply machine learning (which we now call “AI”). So I took the advice and joined my cofounders on the research project that became Apache Spark, and later on we started Databricks.
As we built Databricks, we started using various databases out there, and we realized OLTP databases were far from a solved problem: they were clunky, difficult to scale, and incredibly fragile. We were frustrated enough at some point that we asked ourselves what an OLTP database would look like if we were to design it today. That question led to Lakebase, our serverless Postgres database.
This post takes a deep dive into the Lakebase OLTP architecture. We start at the storage layer of a traditional monolithic database to see where the pain comes from, then we look at how Lakebase rearranges those same pieces into independent, externalized services. Finally, we turn to LTAP, where that same architecture lets transactions and analytics run on a single copy of the data, in real time, without the delays and extra cost of CDC or "mirroring.”
The vast majority of databases running in the world today are monoliths. This includes MySQL, Postgres, classic Oracle. Lakebase is built on Postgres (as it happens, was also born at Berkeley), so we will be using Postgres as the primary example here, but most databases work similarly: You provision one machine that runs the database engine and the storage. In these database systems, there are two things on disk that matter the most: the write ahead log (WAL) and the data files.

When you commit a transaction, the database does not immediately go and rewrite the data files. That would be slow, because the rows you are touching are scattered across the file in places that require random I/O. Instead, the database first appends a description of the change to the WAL, which is a sequential log on disk. A transaction is considered committed the moment that log entry is durably written. Only later, asynchronously, does the database go back and update the actual data files to reflect the change.
One simple way to think about this: the WAL exists to make writes fast (and safe), and the data files exist to make reads fast. The log lets you commit a transaction with a single sequential append instead of a scattering of random I/O. The data files let you answer a query by reading the current state directly, instead of replaying the entire history of the database from the beginning of time. (If you want to understand all the intricate details of this design, read the 69-page long ARIES paper. Be warned that this is one of the most complex papers in computer science.)
As this design has become the foundation for virtually all databases out there, the monolithic architecture also creates a lot of challenges:
Data loss from misconfiguration. A commit is only as durable as the disk flush behind it. If the database, the operating system, or the storage layer is configured such that a write to the WAL is acknowledged to the client before it has actually been flushed to durable media, then a commit can vanish in a power loss or kernel panic. These settings are subtle, easy to get wrong, and the failure is often silent. The operating system might even decide to lie to you about flushing!
Data loss from node loss. Even with flushes configured correctly, the WAL and the data files live on one machine. If that machine's disk dies, the data on it dies too. Note that network attached storage or redundancy techniques like RAID-1/RAID-10 can improve durability but do not fundamentally solve this issue. If the storage mount dies, so does your data access.
Scaling reads requires a physical clone. When one box can no longer serve your traffic, the standard answer is to add a read replica. But a read replica is a full physical copy of the entire database, streaming the WAL from the primary and replaying it. Provisioning one means copying the whole dataset and then catching up on the log. For a large database, that is not a quick operation and might even bring down the database.
High availability also requires a physical clone. Surviving the loss of the primary means running at least one additional standby node, which is itself a complete physical copy of the database kept in sync from the WAL. You pay for at least twice the infrastructure, you wait a long time to bring a standby online, and you have to set up synchronous replication to avoid losing any data when the primary goes down. (In practice, many recommend 3 or more nodes.)
Analytics contend with your transactional traffic. A heavy analytical query runs against the same hardware resources as your latency-sensitive transactional workload. One large reporting query or one GDPR cleanup can degrade your main OLTP queries. You can run the analytical queries in a separate replica, but you end up paying for the replica and still don’t get optimal performance due to the row oriented nature of OLTP storage (analytics requires column-oriented storage for high performance).
Almost every one of these problems traces back to the same root cause from the monolithic architecture: the WAL and the data files are stored inside a single machine. Durability is tied to that machine's disk. Scaling and availability require physically cloning that machine. Workloads interfere because they share that machine.
If you were to redesign an OLTP database today, you’d start with the components of the modern cloud: cheap and highly durable cloud object storage paired with elastic compute. This is the path the Neon team took on and the foundation of what became Lakebase.
The core move is to make the Postgres compute instances stateless. We do this by externalizing the WAL and the data files on local disks into purpose-built, independently scalable services. The compute layer becomes a stateless Postgres engine that can be started, stopped, and replicated freely, because it no longer owns the data.
Let’s see how these two storage services can work together to solve the aforementioned challenges without sacrificing performance.
In a monolith, a write is made durable by flushing it to the local disk. In a Lakebase, the WAL is externalized to a distributed storage service called the SafeKeeper. Instead of relying on disk flush for durability, a commit is made durable by replicating the log record across a quorum of SafeKeeper nodes using Paxos-based network replication. There is no longer a disk whose failure loses your data, and there is no longer a misconfigured flush quietly undermining your durability guarantee.
It’s natural to ask at this point: does moving commits from WAL on local disk to WAL on SafeKeeper increase the write latency due to the extra network hop? The answer is no. For any serious Postgres deployment that cares about durability and availability, you’d have to set up synchronous replication which requires the extra network hop, so externalizing the WAL into SafeKeeper does not incur additional overhead. As a matter of fact, due to how Postgres works internally, the combination of SafeKeeper and PageServer can lead to 5X higher write throughput and 2X lower read latency.
The data files move to another distributed storage service called the PageServer. The WAL is streamed from the SafeKeeper into the PageServer, and the PageServer asynchronously applies those changes to its version of the data, materializing pages into low-cost cloud object storage (the lake). You can think of the PageServer as a write through cache for the underlying object storage.
This is similar to the WAL-then-data-files relationship from the monolith, except the two halves now live in separate, independently scalable services connected by the network instead of sitting on the same disk. When a page is requested from the PageServer, and if the PageServer does not yet have the latest version yet (keep in mind changes are written to the SafeKeeper first before making their way to the PageServer), the PageServer applies the logs from the SafeKeeper to reconstruct the latest state.
A similar question: does moving data files from local disks to PageServer increase the read latency due to the extra network hop? The answer is also no for all practical purposes. The system is designed to isolate and minimize the latency impact through aggressive, multi-layered caching. To fetch a page, Postgres first looks up its buffer pool, which is in the node’s local memory. When the page is not present, it looks up a local disk cache. It only needs to go to the PageServer if there is a cache miss. Because a compute node can be configured with local memory and disk capacities identical to a monolithic setup, your local cache hit rate remains unchanged. For the vast majority of operations, read latency is indistinguishable from a monolith, but you gain the benefit of decoupled, virtually infinite storage.

Once the WAL lives in the SafeKeeper and the data files live in the PageServer, a long list of capabilities that were hard or impossible in the monolith become natural consequences of the architecture. The following are already widely available as part of the Lakebase product on both Databricks and Neon:
Still Postgres. This is real Postgres, so the wire protocol, SQL, drivers, and extensions all work as-is.
Unlimited storage. Data lives in cloud object storage rather than on a provisioned local disk. You are no longer sizing a box to a capacity ceiling. Storage is, for practical purposes, infinite.
Serverless, elastic compute. Because compute is stateless, it can scale up instantly under load and scale all the way down to zero when idle. You stop paying for a large machine to sit there waiting for traffic.
Durable writes and zero data loss. A commit is durable once it is replicated across SafeKeeper nodes via Paxos, not when a single local disk claims to have flushed it. The loss of any individual node does not lose committed data.
Simpler high availability. In the monolith, HA meant maintaining a second full physical clone, paying twice, and still risking data loss at cutover. Here, the durable state already lives in a replicated storage layer that is independent of any single compute instance. Failing over no longer means promoting a separate physical copy of the database and hoping the last segment of the log made it across.
Instant branching, cloning, and recovery. This is my favorite. For code, creating a branch is a sub-second, fully isolated copy of the entire codebase, and we do it dozens of times a day without thinking about it. For a monolithic database, cloning means physically copying the whole dataset, which is slow, expensive, and risky to the production system. When the data lives in an externalized, versioned storage layer, a branch or a clone is a metadata operation rather than a physical copy. You can branch a large production database in seconds, run an experiment or a risky migration against the branch, and throw it away. Recovery to a point in time works the same way. The database finally moves as fast as your code.
Separating compute from storage is not itself new. The previous post discussed the generation 2 cloud databases that had done this. However, the key with Lakebase is that we store operational data on commodity object storage in an open format. With this, we open up the opportunities for other engines to read it directly, which leads to LTAP.
Everything so far has been about making a single operational database better: more durable, more elastic, cheaper to run, faster to branch. But once the data lives in an externalized storage layer, something more interesting becomes possible. We can stop treating the transactional database and the analytical system as two separate worlds.
Go back to the PageServer for a moment. It already takes the stream of changes from the WAL and asynchronously materializes pages into object storage. That materialization step, the moment data lands in the lake, turns out to be exactly the right place to solve a much older problem…
Even with a Lakebase, the data in object storage was still written in Postgres's native page format, laid out row by row. That format is great for transactions and poor for analytics, so any analytical engine that wanted to read it had to either pay a conversion cost on every read or, more commonly, rely on a separate copy of the data kept in sync by a pipeline. The pipeline can be brittle, and the two copies of the data can become a governance nightmare with diverged permissions.
We recently announced LTAP, for Lake Transactional/Analytical Processing, that removes the two-copies-of-data problem. The key idea is to unify the two worlds at the storage layer rather than at the engine layer. We do not try to build one engine that is somehow great at both transactions and analytics. We keep the best tool for each job: Postgres, with full ACID semantics for transactions, and the Lakehouse engines for analytics. What changes is the data underneath them. Instead of two copies in two formats, there is one durable copy, open columnar formats like Delta and Iceberg, stored as Parquet, that both sides read (and with various levels of caches for better performance).
Note: this section requires more Postgres internal knowledge to understand than other sections.
As the PageServer materializes pages into object storage, it transcodes Postgres data from a row format into Parquet's columnar layout as it lands in the lake. We preserve the exact Postgres representation of every value, down to the bits, so any Postgres-compatible engine can reinterpret it without losing information. This is different from CDC based approach as CDC ships a stream of logical change events into a foreign schema and leaves Postgres's physical and transactional semantics behind; here we keep them. With a hyperoptimized engine, the spare CPU in the PageServer layer does the row-to-columnar transcoding as part of materializing the data into object storage, so it adds no burden to the Postgres compute serving your transactions. To serve transactional reads efficiently, the PageServer still materializes traditional row-based pages in a local cache, but this is strictly a performance cache. The underlying durable store remains unified in the lake, accessible by both sides.
Preserving Postgres semantics in columnar form comes down to two things: the type system and multi-versioning.
Type system. The majority of Postgres types map directly onto native Parquet types. The handful of values with no lossless columnar counterpart, e.g. NaN and ±Infinity, NUMERICs beyond the decimal range, exotic or extension types, are not dropped or coerced. They are carried alongside the original columns in a structured overflow field within the same table, holding the canonical Postgres text for those values. That field is both directly queryable by any engine and sufficient to reconstruct the original Postgres bytes exactly on the way back.
Multi-versioning. In Postgres, every row version that some transaction could observe is retained, which is exactly what makes snapshot isolation and point-in-time recovery possible. In contrast, open table formats expose table-wide consistent snapshots without any intermediate row versions. We get the benefits of both approaches by separating durability from visibility. Every row materialized to columnar carries its physical heap address (block and offset), so heap pages remain fully reconstructable. The classic Postgres heap page becomes a cache that accelerates point reads, while the durable source of truth lives in the columnar files in object storage. Postgres indexes aren't transcoded into columns; they are served and rebuilt from that hot cache tier. Intermediate row versions are retained to preserve Postgres's MVCC semantics and PITR, but they are not visible to Iceberg/Delta readers and are eventually garbage-collected. The net result: analytical engines see clean, snapshot-consistent tables, while the Postgres system underneath still sees a full, time-travelable version history.
There is also a pleasant side effect. Columnar data compresses far better than row data, often by more than ten times, so converting to columnar storage substantially cuts the volume of data crossing the network between the caching layer and the object store to the point that it’s often negligible. The format that makes analytics fast also makes the storage path cheaper. We even take advantage of this to dual write both row format and columnar format in object stores for data verification during the transitional rollout stage of LTAP (since we want to be extremely careful with storage changes).
One big challenge is freshness. If analytics reads from a copy in the lake, how does it see data that was committed a moment ago and has not been materialized in the object store yet? This is the question that sinks most "just point analytics at the lake" designs, so it is worth walking through how LTAP answers it.
When an analytical query starts (e.g. from the Lakehouse//RT product we just announced), it first asks Postgres for the current LSN, the log sequence number that marks the exact position in the WAL to read as of. This is a cheap metadata lookup. With that LSN, the analytical engine reads the overwhelming majority of the data, including everything already materialized up to that point, directly from object storage. The only thing left is the small set of very recent changes that have not yet been materialized to the lake, and those it fetches from the PageServer and merges on top.

The result is a consistent, fully up-to-date read of your data as of that LSN. Almost all of the work lands on cheap, scalable object storage. And critically, Postgres itself serves none of the analytical read traffic other than returning a single number (LSN). Your transactional workload does not slow down because someone kicked off a large analytical query.
There is one practical optimization worth mentioning here: For very small tables, the ones holding a handful of rows, we do not bother converting them to columnar form and creating the associated Iceberg metadata. The bookkeeping would cost more than it saves, and a table that tiny has no measurable effect on analytical performance regardless of how it is laid out. Those tables are still present and still queryable as part of the single copy.
Because of how important this problem is, there has been lots of noise in the market about integrating OLTP and analytics. A classic approach is CDC, effectively replicating data from the OLTP storage into a separate analytics storage tier. You might’ve heard of its other names such as “mirroring” or “zero CDC” or “zero ETL”.
In CDC or “mirroring”, because the data replication pipeline costs something, it cannot be applied to all the tables. You’d have to explicitly select which tables you care about, and this replication typically comes with a delay.
LTAP has nothing to opt into. A table that exists is, by construction, already in the lake and already queryable. There is no list of replicated or mirrored tables, because there is no replication. There is a single governed copy of the data in open formats, with no ETL pipeline to build, monitor, or unbreak (either by our customers or us). The transactional and analytical engines scale independently, each sized to its own workload. And because there is no data movement and no second copy, the two views can never drift: analytics is always reading the same data the application just wrote.
For another look at how LTAP comes together, check out this demo from Data and AI Summit.
If you know the field, you have already noticed that LTAP is a deliberate play on HTAP: hybrid transactional/analytical processing. HTAP has been the holy grail of database engineering, focusing on creating a single engine that's capable of doing both transactional and analytical workloads.
In practice, there has not been a single widely adopted HTAP database system out there. Why is that the case? In my opinion, HTAP systems suffer from one or more of the following:
Incomplete feature set. Designing a new proprietary engine from scratch to do a single job is a multi-year investment. Trying to build a single engine that can do the job of multiple engines compounds the investment required to reach the feature set engineers take for granted in a mature database. These systems often lag on things people assume are always there, from the breadth of SQL support (e.g. foreign key support) to the maturity of the query optimizer.
No ecosystem. Postgres and Spark each sit at the center of a vast ecosystem: drivers, extensions, tools, and decades of accumulated operational knowledge. A brand-new engine starts outside all of it, and an engine is only as useful as the ecosystem a team can actually build on.
No performance isolation. Many HTAP systems run transactions and analytics on the same hardware, so the two workloads contend for the same CPU and memory. This is the same failure we started with in the monolith, with an analytical query starving the transactional workload.
All three trace back to the same decision to unify the two workloads into one engine. Lakebase and LTAP circumvents these challenges by unifying at the storage layer, while using different compute engines for the different workloads, tapping into their full feature sets and ecosystem support, with full performance isolation.
When we first put forward the Lakebase architecture last year, we already knew that it would unlock unlimited storage, elastic compute, durable writes, simpler HA, and instant branching, based on what we’ve seen with the Neon platform. Those followed almost mechanically once the WAL lived in the SafeKeeper and the data files lived in the PageServer.
The LTAP idea came later, after the Neon and Databricks teams came together to solve the decades-old problem of running analytics against the freshest transactional data. As we iron out the kinks of LTAP and roll it out in the coming months, all of your Lakebase tables will just be available for analytics as high performance as the Lakehouse data.
What excites me most is what’s ahead. While LTAP is a natural next step, the same design also opens up lots of optimization opportunities to separate other heavyweight maintenance operations and the core transactional workloads. We are just beginning to explore what this architecture makes possible, and we are looking forward to sharing what comes next.
Acknowledgement: I’d like to thank the Lakebase team for making everything we discussed in this blog real, reviewing this blog, and keeping me honest with the technical details.