It's great that I can run this locally in a Docker container; I'd love to be able to run a managed instance on AWS billed through our existing Snowflake account.
Under "Everything is a file", you can read or manipulate a wide variety of information via simple open/read/write() APIs. Linux exposes system settings through the filesystem. Get the screen brightness? `cat /sys/class/backlight/device0/brightness`. Update it? `echo 500 > /sys/class/backlight/device0/brightness`. No need for special APIs: just generic file operations, and the kernel handles everything.
FUSE (Filesystem in Userspace) provides even more flexibility by letting user-space programs build their own drivers that handle any data operation via the filesystem. You can mount remote systems (via SSH) or Google Drive, and copying files is as easy as `cp /mnt/remote/data/origin /mnt/googledrive/data/`. Or, with exotic FUSE filesystems like pgfuse and redisfs, updating a Redis value from Postgres data is just `cat /mnt/postgres/users/100/full_name > /mnt/redis/user_100_full_name`.
But filesystems are only good for hierarchical data, while a lot of real-world data is relational. Many FUSE filesystems try hard to represent inherently non-hierarchical data in a filesystem. Data lakes let you use SQL, the elegant abstraction for relational data, across different underlying data sources, which can be physically distant and structured differently. A lot of real-world applications are just CRUD on relational data; you can accomplish much more, much more easily, if that data is all in one big database.
And this is where postgres does not cut it.
You need more CPU and RAM than what you pay for in your Postgres instance, i.e. a distributed engine where you don't have to worry about how big your database instance is today.
(1) Are there any plans to make this compatible with the ducklake specification? Meaning: Instead of using Iceberg in the background, you would use ducklake with its SQL tables? My knowledge is very limited but to me, besides leveraging duckdb, another big point of ducklake is that it's using SQL for the catalog stuff instead of a confusing mixture of files, thereby offering a bunch of advantages like not having to care about number of snapshots and better concurrent writes.
(2) Might it be possible that pg_duckdb will achieve the same thing in some time or do things not work like that?
This announcement seems huge to me, no?!
Is this really an open source Snowflake covering most use cases?
We had this discussion like a week ago about how stupid it is to use the filesystem for this kind of data storage, and here we go again. I actually had to implement this "idea" in practice. What nonsense.
Your compute asks Postgres “what is the current data for these keys?” Or “what was the current data as of two weeks ago for these keys?” And your compute then downloads the relevant Parquet files and runs your analytics query directly against them.
(2) In principle, it's a bit easier for pg_duckdb to reuse the existing DuckLake implementation because DuckDB sits in every Postgres process and they can call into each other, but we feel that architecture is less appropriate in terms of resource management and stability.
Once those tables exist, queries against them can either be pushed down entirely to the remote tables, using a Custom Scan to execute and pull results back into Postgres, or we transform/extract the pieces that can be executed remotely using an FDW and then treat it as a tuple source.
In both cases, the user does not need to know any of the details and just runs queries inside postgres as they always have.
Or COPY table TO STDOUT WITH (format 'parquet') if you need it on the client side.
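From the command line that might look something like this (the connection string and table name are placeholders; this assumes a running pg_lake-enabled Postgres):

```
psql "$DATABASE_URL" \
  -c "COPY (SELECT * FROM mytable) TO STDOUT WITH (format 'parquet')" \
  > mytable.parquet
```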
I think this is a pretty big deal, though.
Snowflake does a lot more, though, especially around sharing data across company boundaries.
> This separation also avoids the threading and memory-safety limitations that would arise from embedding DuckDB directly inside the Postgres process, which is designed around process isolation rather than multi-threaded execution. Moreover, it lets us interact with the query engine directly by connecting to it using standard Postgres clients.
- Separation of concerns, since with a single external process we can share object store caches without complicated locking dances between multiple processes.
- Memory limits are easier to reason about with a single external process.
- Postgres backends end up being more robust, as you can restart the pgduck_server process separately.
> This separation also avoids the threading and memory-safety limitations that would arise from embedding DuckDB directly inside the Postgres process, which is designed around process isolation rather than multi-threaded execution. Moreover, it lets us interact with the query engine directly by connecting to it using standard Postgres clients.
I'll see if we can improve the docs or highlight that part better, if it is already documented—we did move some things around prior to release.
Data platforms like Snowflake are built as a central place to collect your organisation's data, do governance, large scale analytics, AI model training and inference, share data within and across orgs, build and deploy data products, etc. These are not jobs for a Postgres server.
Pg_lake foremost targets Postgres users who currently need complex ETL pipelines to get data in and out of Postgres, and accidental Postgres data warehouses where you ended up overloading your server with slow analytical queries, but you still want to keep using Postgres.
If it's anything like Supabase, you question the existence of God when trying to get it to work properly.
You pay them to make it work right.
[2] DuckLake - The SQL-Powered Lakehouse Format for the Rest of Us by Prof. Hannes Mühleisen: https://www.youtube.com/watch?v=YQEUkFWa69o
The configuration mainly involves just defining the default iceberg location for new tables, pointing it to the pgduck_server, and providing the appropriate auth/secrets for your bucket access.
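For reference, a minimal sketch of that configuration (the bucket name is a placeholder; the setting name appears in the pg_lake docs):

```sql
-- Default location for newly created iceberg tables
SET pg_lake_iceberg.default_location_prefix TO 's3://my-lake-bucket';
```

Credentials for the bucket are handled on the pgduck_server side via DuckDB's secrets manager, which follows the cloud credential chain by default.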
it’s ok in dev/test and for me as the person in the team who’s enamored with duckdb, but it’s made the team experience challenging and so i’ve just kinda reverted to hive partitioned parquet files with a duckdb file that has views created on top of the parquet. attach that file as read only and query away.
i may work up a full example to submit as an issue but up until now too many other things are dominating my time.
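For anyone wanting to replicate that parquet-plus-views setup, a minimal DuckDB sketch (paths and names are placeholders):

```sql
-- Run once against a writable DuckDB catalog file:
-- create views over hive-partitioned parquet
CREATE VIEW events AS
SELECT * FROM read_parquet('data/events/*/*.parquet', hive_partitioning = true);

-- Team members then attach that file read-only and query away
ATTACH 'catalog.duckdb' AS cat (READ_ONLY);
SELECT count(*) FROM cat.events;
```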
When people ask me what’s missing in the Postgres market, I used to tell them “open source Snowflake.”
Crunchy’s Postgres extension is by far the most advanced solution in the market.
Huge congrats to Snowflake and the Crunchy team on open sourcing this.
DuckLake can do things that pg_lake cannot do with Iceberg, and DuckDB can do things Postgres absolutely can't (e.g. query data frames). On the other hand, Postgres can do a lot of things that DuckDB cannot do. For instance, it can handle >100k single row inserts/sec.
Transactions don't come for free. Embedding the engine in the catalog rather than the catalog in the engine enables transactions across analytical and operational tables. That way you can do a very high rate of writes in a heap table, and transactionally move data into an Iceberg table.
Postgres also has a more natural persistence & continuous processing story, so you can set up pg_cron jobs and use PL/pgSQL (with heap tables for bookkeeping) to do orchestration.
There's also the interoperability aspect of Iceberg being supported by other query engines.
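The transactional heap-to-Iceberg move mentioned above can be sketched roughly like this (table names are hypothetical; the delete-returning/insert pattern mirrors the staging-table example in the pg_lake docs):

```sql
-- Fast single-row inserts land in an ordinary heap table
CREATE TABLE events_hot (id bigint, payload jsonb, ts timestamptz);
CREATE TABLE events_cold (id bigint, payload jsonb, ts timestamptz) USING iceberg;

-- Transactionally move older rows from heap into Iceberg
BEGIN;
WITH moved AS (
    DELETE FROM events_hot
    WHERE ts < now() - interval '1 day'
    RETURNING *
)
INSERT INTO events_cold SELECT * FROM moved;
COMMIT;
```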
Some service writes a lot of data in parquet files stored on S3 (e.g. logs), and now you want that data to be queryable from your application as if it was in postgres (e.g. near real-time analytics dashboard). pg_lake allows you to load these parquet files into postgres and query the data. You can also join that data with existing tables in postgres.
Video of their SVP of Product talking about it here: https://youtu.be/PERZMGLhnF8?si=DjS_OgbNeDpvLA04&t=1195
For the postgres grants themselves, we provide privs to allow read/write to the remote tables, which is done via granting the `pg_lake_read`, `pg_lake_write` or `pg_lake_read_write` grants. This is a blanket all-or-nothing grant, however, so would need some design work/patching to support per-relation grants, say.
(You could probably get away with making roles in postgres that have the appropriate read/write grant, then only granting those specific roles to a given relation, so it's probably doable though a little clunky at the moment.)
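A sketch of that clunky-but-doable variant (role and table names are hypothetical, and the exact interaction with the blanket grants would need testing):

```sql
-- Role that carries the blanket pg_lake read grant
CREATE ROLE lake_reader;
GRANT pg_lake_read TO lake_reader;

-- Hand that role only to users who should touch lake tables at all,
-- then use ordinary per-relation grants to narrow access further
GRANT lake_reader TO analyst;
GRANT SELECT ON sales_iceberg TO analyst;
```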
More likely, you don't need Snowflake to process queries from your BI tools (Mode, Tableau, Superset, etc.), but you do need it to prepare data for those BI tools. It's entirely possible that you have hundreds of terabytes, if not petabytes, of input data that you want to pare down to < 1 TB datasets for querying, and Snowflake can chew through those datasets. There are also third-party integrations and things like ML tooling that you need to consider.
You shouldn't really consider analytical systems the same as a database backing a service. Analytical systems are designed to funnel large datasets that cover the entire business (cross cutting services and any sharding you've done) into subsequently smaller datasets that are cheaper and faster to query. And you may be using different compute engines for different parts of these pipelines; there's a good chance you're not using only Snowflake but Snowflake and a bunch of different tools.
- https://github.com/smithclay/otlp2parquet (shameless plug, based on ClickHouse's OTel schema)
- https://github.com/Mooncake-Labs/moonlink (also has OTLP support)
- https://github.com/open-telemetry/otel-arrow (official community project under early dev)
That said, don't sleep on the "this is awesome" parts in this project... my personal favorite is the automatic schema detection:
```sql
CREATE TABLE my_iceberg_table ()
USING iceberg
WITH (definition_from = 's3://bucket/source_data.parquet');
```
https://youtu.be/HZArjlMB6W4?si=BWEfGjMaeVytW8M1
Also, nicer recording from POSETTE: https://youtu.be/tpq4nfEoioE?si=Qkmj8o990vkeRkUa
It comes down to the trade-offs made by operational and analytical query engines being fundamentally different at every level.
Additionally, the postgres extension system supports most of the current project, so wouldn't say it was forced in this case, it was a design decision. :)
With DuckLake, the query frontend and query engine are DuckDB, and Postgres is used as a catalog in the background.
With pg_lake, the query frontend and catalog are Postgres, and DuckDB is used as a query engine in the background.
Of course, they also use different table formats (though similar in data layer) with different pros and cons, and the query frontends differ in significant ways.
An interesting thing about pg_lake is that it is effectively standalone, no external catalog required. You can point Spark et al. directly to Postgres with pg_lake by using the Iceberg JDBC driver.
For instance, you could compute a `SELECT COUNT(*) FROM mytable WHERE first_name = 'David'` by querying all the rows from `mytable` on the DuckDB side, returning them to Postgres, and letting Postgres itself count the results, but this is extremely inefficient, since that same value can be computed remotely.
In a simple query like this with well-defined semantics that match between Postgres and DuckDB, you can run the query entirely on the remote side, just using Postgres as a go-between.
Not all functions and operators work in the same way between the two systems, so you cannot just push things down unconditionally; `pg_lake` does some analysis to see what can run on the DuckDB side and what needs to stick around on the Postgres side.
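One way to see the split is to EXPLAIN the query in Postgres; the exact plan text is illustrative and will vary, but fragments that pg_lake delegates to DuckDB show up under the Custom Scan node mentioned elsewhere in the thread, while anything that has to stay in Postgres appears as ordinary plan nodes above it:

```sql
EXPLAIN
SELECT count(*) FROM mytable WHERE first_name = 'David';
```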
There is only a single "executor" from the perspective of pg_lake, but the pgduck_server embeds a multi-threaded duckdb instance.
How DuckDB executes the portion of the query it gets is up to it; it often will involve parallelism, and it can use metadata about the files it is querying to speed up its own processing without even needing to visit every file. For instance, it can look at the `first_name` in the incoming query and just skip any files which do not have a min_value/max_value that would contain that.
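The file-skipping idea is simple enough to sketch in a few lines. This is not pg_lake's or DuckDB's code, just an illustration of pruning on per-file min/max statistics like those stored in Parquet footers and Iceberg manifests (all names here are made up):

```python
# Sketch of metadata-based file skipping: given per-file column statistics,
# keep only the files whose value range could contain the query's constant.
def prune_files(file_stats, value):
    """file_stats: list of (path, min_value, max_value) for one column."""
    return [
        path
        for path, lo, hi in file_stats
        if lo <= value <= hi  # the range could contain the value; must read it
    ]

stats = [
    ("part-00.parquet", "Aaron", "Carol"),
    ("part-01.parquet", "Dan", "Frank"),   # 'David' falls in this range
    ("part-02.parquet", "Grace", "Zane"),
]

print(prune_files(stats, "David"))  # only part-01 needs to be scanned
```

A real engine does the same comparison per row group and per page as well, which is why rich writer metadata matters so much (as discussed further down the thread).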
And a common permissioning/datasharing layer so I can share data to external and internal parties who can in turn bring their own compute to make their own latency choices.
The trap you end up in is that you have to pay Snowflake to access your data; Iceberg and other technologies help with the walled garden.
Not just snowflake, any pay on use provider.
(Context - have spent 5+ years working with Snowflake, it's great, have built drivers for various languages, etc).
I think pg_mooncake is still relatively early stage.
There's a degree of maturity to pg_lake resulting from our team's experience working on extensions like Citus, pg_documentdb, pg_cron, and many others in the past.
For instance, in pg_lake all SQL features and transactions just work, the hybrid query engine can delegate different fragments of the query into DuckDB if the whole query cannot be handled, and having a robust DuckDB integration with a single DuckDB instance (rather than 1 per session) in a separate server process helps make it production-ready. It is used in heavy production workloads already.
No compromise on Postgres features is especially hard to achieve, but after a decade of trying to get there with Citus, we knew we had to get that right from day 1.
Basically, we could speed run this thing into a comprehensive, production-ready solution. I think others will catch up, but we're not sitting still either. :)
pg_lake maps types to their Parquet equivalents and otherwise stores a text representation; there are a few limitations, like very large numerics.
https://github.com/Snowflake-Labs/pg_lake/blob/main/docs/ice...
DuckLake already has data-inlining for the DuckDB catalog; it seems this will be possible once it's supported in the pg catalog.
> Postgres also has a more natural persistence & continuous processing story, so you can set up pg_cron jobs and use PL/pgSQL (with heap tables for bookkeeping) to do orchestration.
This is true, but it's not clear where I'd use this in practice. e.g. if I need to run a complex ETL job, I probably wouldn't do it in pg_cron.
I use DuckDB today to query Iceberg tables. In some particularly gnarly queries (huge DISTINCTs, big sorts, even just selects that touch extremely heavy columns) I have sometimes run out of memory in that DuckDB instance.
I run on hosts without much memory because they are cheap and easy to launch, giving me isolated query parallelism, which is hard to achieve on a single giant host.
To the extent that it's possible, I dream of being able to spread those gnarly OOMing queries across multiple hosts; perhaps the DISTINCTs can be merged, for example. But this seems like a pretty complicated system that needs to be deeply aware of Iceberg partitioning ("hidden" in pg_lake's language), right?
Is there some component in the postgres world that can help here? I am happy to continue over email, if you prefer, by the way.
Think "tiered storage."
See the example under https://github.com/Snowflake-Labs/pg_lake/blob/main/docs/ice...:
```sql
select cron.schedule('flush-queue', '* * * * *', $$
  with new_rows as (
    delete from measurements_staging returning *
  )
  insert into measurements select * from new_rows;
$$);
```
The "continuous ETL" process the GP is talking about would be exactly this kind of thing, and just as trivial. (In fact it would be this exact same code, just with your mental model flipped around from "promoting data from a staging table into a canonical iceberg table" to "evicting data from a canonical table into a historical-archive table".)

As far as Iceberg is concerned, DuckDB has its own implementation, but we do not use that; pg_lake has its own Iceberg implementation. The partitioning is "hidden" because it is separated out from the schema definition itself and can be changed gradually, without the query engine needing to care about how things are partitioned at read time. (For writes, we respect the latest partitioning spec and always write according to that.)
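Concretely, the eviction-flavored variant of the cron job in the docs would look something like this (table names and the retention window are hypothetical):

```sql
select cron.schedule('evict-old', '0 3 * * *', $$
  with old_rows as (
    delete from measurements
    where ts < now() - interval '90 days'
    returning *
  )
  insert into measurements_archive select * from old_rows;
$$);
```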
> So is the magic here that it's Postgres? What makes being able to query something in Postgres special?
There are a bunch of pros and cons to using Postgres vs. DuckDB. The basic difference is OLTP vs. OLAP. It seems pg_lake aims to give you the best of both. You can combine analytics queries with transactional queries.
pg_lake also stores and manages the Iceberg catalog. If you use DuckDB you'll need to have an external catalog to get the same guarantees.
I think if you're someone who was happy using Postgres, but had to explore alternatives like DuckDB because Postgres couldn't meet your OLAP needs, a solution like pg_lake would make your life a lot simpler. Instead of deploying a whole new OLAP system, you basically just install this extension and create the tables you want OLAP performance from with `create table ... using iceberg`
> when we say it’s now queryable by Postgres, does this mean that it takes that data and stores it in your PG db?
Postgres basically stores pointers to the data in S3. These pointers are in the Iceberg catalog that pg_lake manages. The tables managed by pg_lake are special tables defined with `create table ... using iceberg` which stores the data in Iceberg/Parquet files on S3 and executes queries partially with the DuckDB engine and partially with the Postgres engine.
It looks like there is good support for copying between the Iceberg/DuckDB/Parquet world and the traditional Postgres world.
> Or it remains in S3 and this is a translation layer for querying with PG?
Yes I think that's right -- things stay in S3 and there is a translation layer so Postgres can use DuckDB to interact with the Iceberg tables on S3. If you're updating a table created with `create table ... using iceberg`, I think all the data remains in S3 and is stored in Parquet files, safely/transactionally managed via the Iceberg format.
https://github.com/Snowflake-Labs/pg_lake/blob/main/docs/ice...
pg_lake integrates Iceberg and data lake files into Postgres. With the pg_lake extensions, you can use Postgres as a stand-alone lakehouse system that supports transactions and fast queries on Iceberg tables, and can directly work with raw data files in object stores like S3.
At a high level, pg_lake lets you:
There are two ways to set up pg_lake:
Both approaches include the PostgreSQL extensions and the pgduck_server application, and involve setting up S3-compatible storage.
Follow the Docker README to set up and run pg_lake with Docker.
Once you’ve built and installed the required components, you can initialize pg_lake inside Postgres.
Create all required extensions at once using CASCADE:
```sql
CREATE EXTENSION pg_lake CASCADE;
NOTICE: installing required extension "pg_lake_table"
NOTICE: installing required extension "pg_lake_engine"
NOTICE: installing required extension "pg_extension_base"
NOTICE: installing required extension "pg_lake_iceberg"
NOTICE: installing required extension "pg_lake_copy"
CREATE EXTENSION
```
pgduck_server

pgduck_server is a standalone process that implements the Postgres wire protocol (locally) and underneath uses DuckDB to execute queries.
When you run pgduck_server, it starts listening on port 5332 on a Unix domain socket:

```
pgduck_server

LOG pgduck_server is listening on unix_socket_directory: /tmp with port 5332, max_clients allowed 10000
```
As pgduck_server implements the Postgres wire protocol, you can access it via psql on port 5332 with host /tmp and run commands via DuckDB.
For example, you can get the DuckDB version:
```
psql -p 5332 -h /tmp

select version() as duckdb_version;
 duckdb_version
----------------
 v1.3.2
(1 row)
```
You can also provide additional settings when starting the server; to see all of them:

```
pgduck_server --help
```
There are some important settings that should be adjusted, especially on production systems:
- `--memory_limit`: Optionally specify the maximum memory of pgduck_server, similar to DuckDB's memory_limit; the default is 80 percent of system memory
- `--init_file_path <path>`: Execute all statements in this file on start-up
- `--cache_dir`: Specify the directory to use to cache remote files (from S3)

Note that if you want to make adjustments to DuckDB settings, you can use the `--init_file_path` approach, or you can connect to the running pgduck_server and make changes. For example:
```
$ psql -h /tmp -p 5332
psql (17.5, server 16.4.DuckPG)
Type "help" for help.

postgres=> set global threads = 16;
SET
```
The connection above is to the pgduck_server on its port (default 5332), NOT to the postgres/pg_lake server.
Connecting pg_lake to S3 (or compatible)

pgduck_server relies on the DuckDB secrets manager for credentials and it follows the credentials chain by default for AWS and GCP. Make sure your cloud credentials are configured properly — for example, by setting them in ~/.aws/credentials.
Once you set up the credential chain, you should set the pg_lake_iceberg.default_location_prefix. This is the location where Iceberg tables are stored:
```sql
SET pg_lake_iceberg.default_location_prefix TO 's3://testbucketpglake';
```
You can also set the credentials on pgduck_server for local development with minio.
You can create Iceberg tables by adding USING iceberg to your CREATE TABLE statements.
```sql
CREATE TABLE iceberg_test USING iceberg
AS SELECT
    i as key, 'val_' || i as val
FROM
    generate_series(0, 99) i;
```
Then, query it:
```sql
SELECT count(*) FROM iceberg_test;
 count
-------
   100
(1 row)
```
You can then see the Iceberg metadata location:
```
SELECT table_name, metadata_location FROM iceberg_tables;

  table_name  | metadata_location
--------------+--------------------------------------------------------------------------------------------------------------------
 iceberg_test | s3://testbucketpglake/postgres/public/test/435029/metadata/00001-f0c6e20a-fd1c-4645-87c9-c0c64b92992b.metadata.json
```
You can import or export data directly using COPY in Parquet, CSV, or newline-delimited JSON formats. The format is automatically inferred from the file extension, or you can specify it explicitly with COPY options like WITH (format 'csv', compression 'gzip').
```sql
-- Copy data from Postgres to S3 with format parquet
-- Read from any data source, including iceberg tables, heap tables or any query results
COPY (SELECT * FROM iceberg_test) TO 's3://testbucketpglake/parquet_data/iceberg_test.parquet';

-- Copy back from S3 to any table in Postgres
-- This example copies into an iceberg table, but could be a heap table as well
COPY iceberg_test FROM 's3://testbucketpglake/parquet_data/iceberg_test.parquet';
```
You can create a foreign table directly from a file or set of files without having to specify column names or types.
```sql
-- use the files under the path, can use * for all files
CREATE FOREIGN TABLE parquet_table ()
SERVER pg_lake
OPTIONS (path 's3://testbucketpglake/parquet_data/*.parquet');
```

```
-- note that we infer the columns from the file
\d parquet_table
        Foreign table "public.parquet_table"
 Column |  Type   | Collation | Nullable | Default | FDW options
--------+---------+-----------+----------+---------+-------------
 key    | integer |           |          |         |
 val    | text    |           |          |         |
Server: pg_lake
FDW options: (path 's3://testbucketpglake/parquet_data/*.parquet')
```

```sql
-- and, query it
select count(*) from parquet_table;
 count
-------
   100
(1 row)
```
A pg_lake instance consists of two main components: PostgreSQL with the pg_lake extensions and pgduck_server.
Users connect to PostgreSQL to run SQL queries, and the pg_lake extensions integrate with Postgres’s hooks to handle query planning, transaction boundaries, and overall orchestration of execution.
Behind the scenes, parts of query execution are delegated to DuckDB through pgduck_server, a separate multi-threaded process that implements the PostgreSQL wire protocol (locally). This process runs DuckDB together with our duckdb_pglake extension, which adds PostgreSQL-compatible functions and behavior.
Users typically don’t need to be aware of pgduck_server; it operates transparently to improve performance. When appropriate, pg_lake delegates scanning of the data and the computation to DuckDB’s highly parallel, column-oriented execution engine.
This separation also avoids the threading and memory-safety limitations that would arise from embedding DuckDB directly inside the Postgres process, which is designed around process isolation rather than multi-threaded execution. Moreover, it lets us interact with the query engine directly by connecting to it using standard Postgres clients.

The team behind pg_lake has a lot of experience building Postgres extensions (e.g. Citus, pg_cron, pg_documentdb). Over time, we’ve learned that large, monolithic PostgreSQL extensions are harder to evolve and maintain.
pg_lake follows a modular design built around a set of interoperating components — mostly implemented as PostgreSQL extensions, others as supporting services or libraries.
Each part focuses on a well-defined layer, such as table and metadata management, catalog and object store integration, query execution, or data format handling. This approach makes it easier to extend, test, and evolve the system, while keeping it familiar to anyone with a PostgreSQL background.
The current set of components are:
pg_lake development started in early 2024 at Crunchy Data with the goal of bringing Iceberg to PostgreSQL. The first few months were focused on building a robust integration of an external query engine (DuckDB). To get to market early, we made the query/import/export features available to Crunchy Bridge customers as Crunchy Bridge for Analytics.
Next, we started building a comprehensive implementation of the Iceberg (v2) protocol with support for transactions and almost all PostgreSQL features. In November 2024, we relaunched Crunchy Bridge for Analytics as Crunchy Data Warehouse available on Crunchy Bridge and on-premises.
In June 2025, Crunchy Data was acquired by Snowflake. Following the acquisition, Snowflake decided to open source the project as pg_lake in November 2025. The initial version is 3.0 because of the two prior generations. If you’re currently a Crunchy Data Warehouse user there will be an automatic upgrade path, though some names will change.
Full project documentation can be found in the docs directory.
Copyright (c) Snowflake Inc. All rights reserved. Licensed under the Apache 2.0 license.
pg_lake is dependent on third-party projects Apache Avro and DuckDB. During build, pg_lake applies patches to Avro and certain DuckDB extensions in order to provide the pg_lake functionality. The source code associated with the Avro and DuckDB extensions is downloaded from the applicable upstream repos and the source code associated with those projects remains under the original licenses. If you are packaging or redistributing packages that include pg_lake, please note that you should review those upstream license terms.
If instead you can write to something like Parquet/Iceberg, you're not paying for access to your data.
Snowflake is great at aggregations and other stuff (seriously, huge fan of Snowflake's SQL capabilities), but let's say you have a visualisation tool: you're paying for pulling data out.
Instead, if you write data to something like S3, you can hook up your tools to that.
It's expensive to pull data out of Snowflake otherwise.
It’s 36x more expensive than equivalent EC2 compute.
Ok, so I build my data lake on S3 using all open tech. I’m still paying S3 for puts and reads and lists.
Ok, I put it on my own hardware, in my own colo. You’re still paying for electricity and other things. Everything is lock-in.
On top of that you’re beholden to an entire community of people and volunteers to make your tech work. Need a feature? Sponsor it. Or write it and fight to upstream it. On top of that if you do this at scale at a company what about the highly paid team of engineers you have to have to maintain all this?
With snowflake I alone could provide an entire production ready bi stack to a company. And I can do so and sleep well at night knowing it’s managed and taken care of and if it fails entire teams of people are working to fix it.
Are you going to build your own roads, your own power grid, your own police force?
Again my point remains. The vast majority of times people build on a vendor as a partner and then go on to build useful things.
Apple uses cloud vendors for iCloud storage. You think they couldn’t do it themselves? That they couldn’t build, pay for, and support all the tech on their own? Of course they could. But they have better things to do than reinvent the wheel, i.e. building value on top of dumb compute, and that’s iCloud.
It's not that deep
I think a hybrid approach works best (store on Snowflake native and iceberg/tables where needed), and allows you the benefit of Snowflake without paying the cost for certain workloads (which really adds up).
We're going to see more of this (either open or closed source), since Snowflake has acquired Crunchy Data, and the last major bastion is "traditional" database <> Snowflake.
Where pruning differences might arise for Iceberg tables is the structure of Parquet files and the availability of metadata. Both depend on the writer of the Parquet files. Metadata might be completely missing (e.g., no per column min/max), or partially missing (e.g., no page indexes), which will indeed impact the perf. This is why it's super important to choose a writer that produces rich metadata. The metadata can be backfilled / recomputed after the fact by the querying engine, but it comes at a cost.
Another aspect is storage optimization: The ability to skip / prune files is intrinsically tied to the storage optimization quality of the table. If the table is neither clustered nor partitioned, or if the table has sub-optimally sized files, then all of these things will severely impact any engine's ability to skip files or subsets thereof.
I would be very curious if you can find a query on an Iceberg table that shows a better partition elimination rate in a different system.