Reason 4 is probably an improvement, but it could likely be done with CH functions.
The problem with custom DSLs like this is that they trade off a massive ecosystem for very little benefit.
Agreed that the ecosystem cons get much heavier as you move outside the product's surface area.
The main advantages of a DSL are you can expose a nicer interface to users (table names, columns, virtual columns, automatic joins, query optimization).
We very intentionally kept the syntax as close to regular ClickHouse as possible but added some functions.
How do you enforce tenant isolation with that method, or prevent unbounded table reads?
We do something similar for our back office - just with the difference that it's Claude that has full freedom to write queries.
> Why call it DuckDB?
> Ducks are amazing animals. They can fly, walk and swim. They can also live off pretty much everything. They are quite resilient to environmental challenges. A duck's song will bring people back from the dead and inspires database research. They are thus the perfect mascot for a versatile and resilient data management system.
We trust that Amazon or Google or Microsoft are successful in protecting customer data for example. We trust that when you log into your bank account the money you see is yours, and when you deposit it we trust that the money goes into your account. But it's all just mostly logical separation.
The DSL approach has other advantages too, like rewriting queries so they don't expose underlying tables, doing automatic performance optimizations…
> every employee can access our main financial/back office SQL database
This means that there is no access gate other than RLS, even for financial data. That is a lot of pressure on one control.
Right but ideally more than one.
> But it's all just mostly logical separation.
Yes, ideally multiple layers of this. You don't all share one RDS instance and then get row level security.
For query operations I would try to find a way to solve this with tools like S3 and SQLite. There are a few VFS implementations for S3 and other CDNs.
Obviously it's not a silver bullet and the isolation can be confusing when debugging, but generally a single point for applying RBAC is a feature, not a shortcoming. The next level of security might be how you define your roles.
I actually believe the simplest, most secure client scenario is physical isolation, where you give the user/consumer only the data they are allowed to use and then don't try to control it (someone mentioned this above, using parquet & duckdb). There are downsides here too: it doesn't work for write scenarios, it can be resource-intensive or time-delayed, it doesn't handle chain of custody well, etc. You typically have two strategies:
1. pick the best approach for the specific situation.
2. pick your one tool as your hammer and be a d!ck about it.
Multi-database is generally more expensive, but it's a more brain-dead, guaranteed way to ensure users are properly segregated, and it's resilient across cloud/database/etc. software releases that may regress something in a multi-tenant setup.
With multi-tenant, you always run the risk of a software update, misconfiguration, or operational error exposing the existence of other users / their metadata / their data / their usage / etc. You also have a much bigger challenge engineering for resource contention.
The article also mentioned that they isolate by project_id. That implies one customer (assume a business) can isolate permissions more granularly.
We all know that authentication should have multiple factors. But that's a different problem. Fundamentally, at the point you're reading or writing data you're asking the question "does X have permission to read/write Y".
I don't see what you're getting at.
We (https://prequel.co) recently started offering this as a white-labeled capability so anyone can offer it without building it themselves. It's a newer capability in our export product: instead of sending the data to the tenant's data warehouse, we enable you to provision an S3/GCS/ABS/etc. bucket with the data formatted. Credential management, analytics, etc. are all batteries-included so you don't have to do that either. The initial interest from our customers was around BI integrations, but agent use is starting to pick up, which is kinda interesting to see.
How do you let users write arbitrary SQL against a shared multi-tenant analytical database without exposing other tenants' data or letting a rogue query take down the cluster?
That's the problem we needed to solve for Query & Dashboards. The answer is TRQL (Trigger Query Language), a SQL-style language that compiles to secure, tenant-isolated ClickHouse queries. Users write familiar SQL. TRQL handles the security, the abstraction, and the translation.
This post is a deep dive into how it all works. We'll cover the language design, the compilation pipeline, the schema system, and the features that make TRQL more than just a SQL passthrough.
A DSL (domain-specific language) is a language designed for a particular problem domain. CSS is a DSL for styling. SQL is a DSL for querying databases. TRQL is a DSL for querying Trigger.dev data.
We could have exposed raw ClickHouse SQL directly. But there are four reasons we didn't:
1. The language itself is a security boundary. By defining our own grammar, we control exactly what operations are possible. INSERT, UPDATE, DELETE, DROP, and any ClickHouse function we haven't explicitly allowed simply don't exist in the language. This isn't validation that rejects dangerous queries; the parser physically cannot produce them. We cover this in more detail in the ANTLR section below.
2. Tenant isolation must be compiler-enforced, not user-trusted. In a multi-tenant system, every query must be scoped to the requesting organization. If we relied on users including WHERE organization_id = '...' in their queries, a missing filter would leak data across tenants. TRQL injects these filters automatically during compilation. There's no way to opt out.
3. Internal database details should be hidden. Our ClickHouse tables have names like trigger_dev.task_runs_v2 and columns like cost_in_cents and base_cost_in_cents. Users shouldn't need to know any of that. TRQL lets them write SELECT total_cost FROM runs while the compiler handles the translation.
4. We need features that don't exist in ClickHouse. Virtual columns, automatic time bucketing, value transforms, and rendering metadata are all things we've built into TRQL's schema layer. A raw SQL passthrough couldn't provide any of this.
A big thanks to PostHog who pioneered this approach with HogQL, a SQL-like interface on top of ClickHouse. TRQL started as a TypeScript conversion of their Python implementation but evolved significantly during development to handle our specific use cases.
Before we get into the language itself, it helps to understand the target. We chose ClickHouse as the analytical backend because it excels at exactly this kind of workload:
- Columnar storage: a query that reads status and total_cost doesn't touch output, error, or any other column.

If you want to know more about how we run ClickHouse in production, we wrote a postmortem on a partitioning incident that goes into the internals.
TRQL is parsed using ANTLR, a parser generator that takes a formal grammar definition and produces a lexer and a parser. The lexer breaks the raw query text into tokens (keywords, identifiers, operators, string literals). The parser takes those tokens and arranges them into a structured tree based on the grammar rules. You write the grammar, ANTLR generates the code for both.
This is important for security. The grammar defines what the language can express. If DELETE, UPDATE, DROP, or SET aren't in the grammar, they can never appear in a parsed query. It's not that we validate and reject them. They literally don't exist in TRQL's syntax. This is security by construction, not by validation.
TRQL's grammar is a strict subset of SQL. If you've written SQL before, TRQL will feel completely familiar. SELECT, FROM, WHERE, GROUP BY, ORDER BY, LIMIT, and common aggregation functions all work as expected. But the grammar is physically incapable of expressing writes or administrative commands.
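To make that concrete, here's a small sketch. The module path is hypothetical; parseTSQLSelect is the parser entry point mentioned later in this post:

```ts
import { parseTSQLSelect } from "./trql"; // hypothetical module path

// SELECT-only grammar: this parses into an AST.
const ast = parseTSQLSelect("SELECT status FROM runs LIMIT 10");

// DELETE isn't a production in the grammar, so this fails at parse time.
// There's no validation step deciding to reject it; the parser simply
// has no rule that matches.
try {
  parseTSQLSelect("DELETE FROM runs");
} catch (err) {
  console.error("Syntax error:", err);
}
```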
Our ANTLR grammar targets TypeScript and produces a full abstract syntax tree (AST) for each query. The AST is a structured tree representation of the query that the compiler can inspect, validate, and transform. Every subsequent step in the pipeline operates on this AST rather than on raw text.
For example, the query SELECT task_identifier, SUM(total_cost) FROM runs WHERE status = 'Failed' produces this tree:
SelectStatement
├── SelectList
│   ├── SelectListItem
│   │   └── ColumnReference: task_identifier
│   └── SelectListItem
│       └── AggregateFunctionCall: SUM
│           └── ColumnReference: total_cost
├── FromClause
│   └── TableReference: runs
└── WhereClause
    └── ComparisonExpression (=)
        ├── ColumnReference: status
        └── StringLiteral: 'Failed'
Each node in the tree is something the compiler can reason about. It can check that runs is a valid table, that task_identifier and total_cost exist on that table, that SUM is an allowed function, and that 'Failed' is a valid value for the status column.
Once parsed, the AST goes through a series of transformations before it becomes executable ClickHouse SQL. Here's each step:
Parse: The TRQL query is parsed into an AST using ANTLR. Only constructs that exist in the grammar can make it this far. Anything else is a syntax error.
Schema validation: We walk the AST and check every identifier against the table schemas. Does the table exist? Do all the referenced columns exist on that table? Are the functions valid? Are the argument types correct? If you write WHERE status = 123 but status is a string column with allowed values, this step catches it.
Tenant isolation: We inject tenant-specific filters into the WHERE clause. At a minimum, every query gets an organization_id filter. Depending on the query scope, we also add project_id and environment_id filters. These are added to the AST itself, so they're baked into the query structure before any SQL is generated. Without this step, any user could read any other organization's data.
Time restrictions: We add time bounds to prevent unbounded scans. Without this, a simple SELECT * FROM runs would attempt to scan the entire table history. The maximum queryable time range varies by plan on Trigger.dev Cloud.
Parameterize values: All literal values in the query (strings, numbers, dates) are extracted from the AST and replaced with named parameters like {tsql_val_0: String}. The actual values are passed separately to ClickHouse rather than being interpolated into the SQL string. Combined with the grammar restrictions from the parsing step, this means the generated ClickHouse SQL is always structurally safe.
Generate ClickHouse SQL: The transformed AST is "printed" into ClickHouse-compatible SQL. This is where virtual columns are expanded to their real expressions, table names are translated, and TRQL-specific functions are compiled to their ClickHouse equivalents.
Execute: The generated SQL is executed against ClickHouse in read-only mode. On Trigger.dev Cloud, queries run against a dedicated read-only replica to avoid impacting write performance.
Return results: Results come back in JSON format, along with column metadata that tells the UI how to render each value.
Here's the full pipeline visualized:

[Figure: the TRQL compilation pipeline, from parsing through to returned results]
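To make the transformation steps mechanical rather than abstract, here's a toy model in TypeScript of steps 3 and 5 (tenant isolation and parameterization). It's illustrative only: the miniature AST and every helper below are invented for this sketch, not TRQL's actual internals.

```ts
// Toy model of tenant-filter injection and value parameterization.
type Expr =
  | { kind: "and"; left: Expr; right: Expr }
  | { kind: "eq"; column: string; value: string };

interface ToyQuery {
  table: string;
  where: Expr | null;
}

// Step 3: the compiler ANDs a tenant filter into the WHERE clause.
// Users can't opt out, because this happens after parsing.
function injectTenantFilter(q: ToyQuery, organizationId: string): ToyQuery {
  const filter: Expr = { kind: "eq", column: "organization_id", value: organizationId };
  return { ...q, where: q.where ? { kind: "and", left: filter, right: q.where } : filter };
}

// Step 5: literals are lifted into named parameters, so values are never
// interpolated into the generated SQL string.
function printWhere(e: Expr, params: Record<string, string>): string {
  if (e.kind === "and") {
    return `and(${printWhere(e.left, params)}, ${printWhere(e.right, params)})`;
  }
  const name = `tsql_val_${Object.keys(params).length}`;
  params[name] = e.value;
  return `equals(${e.column}, {${name}: String})`;
}

const query = injectTenantFilter(
  { table: "runs", where: { kind: "eq", column: "status", value: "Failed" } },
  "org_123"
);
const params: Record<string, string> = {};
console.log(printWhere(query.where!, params), params);
// and(equals(organization_id, {tsql_val_0: String}), equals(status, {tsql_val_1: String}))
// { tsql_val_0: 'org_123', tsql_val_1: 'Failed' }
```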
Let's make this concrete. Here's a simple TRQL query that finds the cost of each task:
SELECT
task_identifier,
SUM(total_cost) AS cost
FROM
runs
GROUP BY
task_identifier
And here's the parameterized ClickHouse SQL that TRQL generates:
SELECT
task_identifier,
-- `total_cost` is actually the sum of two columns and needs converting to dollars
sum(((cost_in_cents + base_cost_in_cents) / 100.0)) AS cost
-- Table names are translated and FINAL is used to avoid stale data
FROM trigger_dev.task_runs_v2 AS runs FINAL
WHERE
  and(
    -- Tenant isolation: organization
    equals(runs.organization_id, {tsql_val_0: String}),
    -- Time restriction
    greaterOrEquals(created_at, toDateTime64({tsql_val_1: String}, 3))
  )
GROUP BY task_identifier
-- We limit results to 10k rows (we return an extra so we can tell the user if there are more)
LIMIT 10001;
Every step from the pipeline is visible here:
- Virtual column expansion: total_cost is a virtual column. Users write SUM(total_cost) but TRQL expands it to sum(((cost_in_cents + base_cost_in_cents) / 100.0)). The user never needs to know that costs are stored as two separate cent values in ClickHouse.
- Table translation: TRQL maps runs to the actual trigger_dev.task_runs_v2 table. The FINAL keyword tells ClickHouse to read the latest merged data, which matters because ClickHouse uses a MergeTree engine that can have unmerged parts.
- Tenant isolation: equals(runs.organization_id, {tsql_val_0: String}). There's no way to query data from another organization because this filter is added by the compiler, not the user.
- Time restriction: greaterOrEquals(created_at, ...). Without this, the query would scan the entire history of the table.
- Parameterization: named parameters like {tsql_val_0: String} prevent SQL injection. The actual organization ID and timestamp are passed as separate parameters to ClickHouse, never interpolated into the query string.

The schema definition is where a lot of TRQL's power comes from. Each table is defined as a TypeScript object that describes not just the columns, but how they should be translated, validated, and rendered. Here's what's interesting about it.
TRQL currently exposes two tables:
- runs: Every task run, including status, timing, costs, machine type, tags, error data, and other metadata. This is the primary table for understanding what your tasks are doing.
- metrics: CPU utilization, memory usage, and any custom metrics you record via OpenTelemetry. Metrics are pre-aggregated into 10-second buckets for efficient querying.

Some of the most useful columns in TRQL don't exist in ClickHouse at all. They're defined as expressions that the compiler expands during query generation.
total_cost is a good example. In ClickHouse, costs are stored as two separate integer columns: cost_in_cents (compute cost) and base_cost_in_cents (invocation cost). The schema defines total_cost as:
total_cost: {
name: "total_cost",
expression: "(cost_in_cents + base_cost_in_cents) / 100.0",
// ...
}
When a user writes SELECT total_cost FROM runs, TRQL expands it to (cost_in_cents + base_cost_in_cents) / 100.0. The user gets a clean dollar amount without knowing about the internal storage format.
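As a toy illustration of that expansion (the Column shape and expandColumn helper are invented for this sketch, not the actual compiler code):

```ts
// Virtual columns print their expression; real columns print their name.
interface Column {
  name: string;
  expression?: string; // present only on virtual columns
}

function expandColumn(col: Column): string {
  return col.expression ?? col.name;
}

const totalCost: Column = {
  name: "total_cost",
  expression: "(cost_in_cents + base_cost_in_cents) / 100.0",
};

console.log(`SELECT ${expandColumn(totalCost)} AS total_cost FROM trigger_dev.task_runs_v2`);
// SELECT (cost_in_cents + base_cost_in_cents) / 100.0 AS total_cost FROM trigger_dev.task_runs_v2
```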
Other virtual columns follow the same pattern:
| User-facing column | Expression |
|---|---|
| `execution_duration` | `dateDiff('millisecond', executed_at, completed_at)` |
| `total_duration` | `dateDiff('millisecond', created_at, completed_at)` |
| `queued_duration` | `dateDiff('millisecond', queued_at, started_at)` |
| `is_finished` | `if(status IN ('COMPLETED_SUCCESSFULLY', ...), true, false)` |
| `is_root_run` | `if(depth = 0, true, false)` |
Users write WHERE execution_duration > 5000 and the compiler handles the rest.
ClickHouse column names are database artifacts. TRQL renames them to domain concepts:
| TRQL name | ClickHouse name |
|---|---|
| `run_id` | `friendly_id` |
| `triggered_at` | `created_at` |
| `machine` | `machine_preset` |
| `attempt_count` | `attempt` |
| `dequeued_at` | `started_at` |
This means we can refactor our ClickHouse schema without breaking user queries. The TRQL names are the stable public API.
Some columns need their values transformed at the boundary. For example, run IDs are stored in ClickHouse without a prefix, but users expect to write WHERE run_id = 'run_cm1a2b3c4d5e6f7g8h9i'. The schema defines a whereTransform that strips the run_ prefix before the value hits ClickHouse:
root_run_id: {
name: "root_run_id",
expression: "if(root_run_id = '', NULL, 'run_' || root_run_id)",
whereTransform: (value: string) => value.replace(/^run_/, ""),
// ...
}
The expression adds the prefix when reading (so results display run_...), and whereTransform strips it when filtering. Users never need to think about how IDs are stored internally. The same pattern applies to batch_id (stripping batch_) and parent_run_id.
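A minimal sketch of how that could be applied during compilation, assuming the schema shape shown above (the toParameterValue helper is hypothetical):

```ts
// Run the column's whereTransform on a filter literal before it becomes
// a bound query parameter.
interface FilterableColumn {
  whereTransform?: (value: string) => string;
}

function toParameterValue(col: FilterableColumn, value: string): string {
  return col.whereTransform ? col.whereTransform(value) : value;
}

const rootRunId: FilterableColumn = {
  whereTransform: (value) => value.replace(/^run_/, ""),
};

console.log(toParameterValue(rootRunId, "run_cm1a2b3c4d5e6f7g8h9i"));
// "cm1a2b3c4d5e6f7g8h9i", the un-prefixed form stored in ClickHouse
```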
Each column carries metadata that tells the UI how to display its values. The customRenderType field controls this:
| Render type | Behavior |
|---|---|
| `runId` | Displayed as a clickable link to the run |
| `duration` | Formatted as human-readable time (e.g. "3.5s") |
| `costInDollars` | Formatted as currency |
| `runStatus` | Rendered with colored status badges |
| `tags` | Displayed as tag chips |
| `environment` | Resolved to the environment slug |
This metadata is returned alongside query results, so the dashboard knows that 3500 in the execution_duration column should display as "3.5s", not as the raw number. The query engine isn't just returning data; it's returning instructions for how to present it.
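Here's roughly what that could look like in a response. The field names below are illustrative, not the actual API shape:

```ts
// A sketch of query results carrying rendering metadata per column.
interface ColumnMeta {
  name: string;
  customRenderType?: "runId" | "duration" | "costInDollars" | "runStatus" | "tags" | "environment";
}

interface QueryResult {
  columns: ColumnMeta[];
  rows: Record<string, unknown>[];
}

const result: QueryResult = {
  columns: [
    { name: "task_identifier" },
    { name: "execution_duration", customRenderType: "duration" },
  ],
  rows: [{ task_identifier: "send-email", execution_duration: 3500 }],
};
// The UI sees "duration" and renders 3500 as "3.5s" instead of a raw number.
```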
Columns like status, machine, and environment_type declare their valid values directly in the schema:
status: {
name: "status",
allowedValues: ["Completed", "Failed", "Crashed", "Queued", ...],
// ...
}
These allowed values serve multiple purposes: the query editor uses them for autocomplete suggestions, the AI assistant uses them to generate valid queries, and the schema validator rejects queries that filter on values that don't exist.
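The validator's side of this is a simple membership check. Here's a toy version, with an invented helper and a schema shape matching the status definition above:

```ts
// Reject filter values that aren't in the column's allowedValues list.
interface EnumColumn {
  name: string;
  allowedValues?: string[];
}

function checkAllowedValue(col: EnumColumn, value: string): void {
  if (col.allowedValues && !col.allowedValues.includes(value)) {
    throw new Error(
      `'${value}' is not a valid value for ${col.name}. ` +
        `Valid values: ${col.allowedValues.join(", ")}`
    );
  }
}

const status: EnumColumn = {
  name: "status",
  allowedValues: ["Completed", "Failed", "Crashed", "Queued"],
};

checkAllowedValue(status, "Failed"); // ok
checkAllowedValue(status, "Falied"); // throws, listing the valid values
```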
TRQL includes functions that don't exist in ClickHouse. These are expanded during compilation into their ClickHouse equivalents.
The most important custom function is timeBucket(). It automatically selects an appropriate time interval based on the query's time range. You use it like this:
SELECT
timeBucket(),
COUNT(*) as runs
FROM runs
GROUP BY timeBucket
ORDER BY timeBucket
The compiler looks at the time range of the query and chooses bucket sizes that balance detail with performance:
| Time range | Bucket size |
|---|---|
| Up to 3 hours | 10 seconds |
| Up to 12 hours | 1 minute |
| Up to 2 days | 5 minutes |
| Up to 7 days | 15 minutes |
| Up to 30 days | 1 hour |
| Up to 90 days | 6 hours |
| Up to 180 days | 1 day |
| Up to 1 year | 1 week |
This matters for three reasons. First, users don't need to think about granularity. A chart that covers the last hour gets 10-second resolution. The same query over 30 days automatically switches to hourly buckets. Second, it prevents queries from returning millions of rows. Without automatic bucketing, a time-series query over a year of data could try to return a row for every 10-second interval. Third, and possibly most importantly, when you add a chart to a dashboard and adjust the time range, the chart will automatically switch to the appropriate bucket size.
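A toy version of that selection logic, using the thresholds from the table above (the function itself is invented for illustration):

```ts
// Map a query's time range to a bucket size, per the table above.
function bucketSizeFor(rangeMs: number): string {
  const HOUR = 3_600_000;
  const DAY = 24 * HOUR;
  if (rangeMs <= 3 * HOUR) return "10 seconds";
  if (rangeMs <= 12 * HOUR) return "1 minute";
  if (rangeMs <= 2 * DAY) return "5 minutes";
  if (rangeMs <= 7 * DAY) return "15 minutes";
  if (rangeMs <= 30 * DAY) return "1 hour";
  if (rangeMs <= 90 * DAY) return "6 hours";
  if (rangeMs <= 180 * DAY) return "1 day";
  return "1 week";
}

console.log(bucketSizeFor(3_600_000));           // "10 seconds" (1-hour chart)
console.log(bucketSizeFor(30 * 24 * 3_600_000)); // "1 hour" (30-day chart)
```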
Schema columns carry rendering metadata automatically (a duration column knows it should display as "3.5s"). But what about computed expressions? If you write SUM(usage_duration), the result is just a raw number with no formatting hint.
prettyFormat() solves this. It takes two arguments: an expression and a format type. The expression is passed through to ClickHouse unchanged, but the format type is attached as column metadata in the response so the UI knows how to render the result.
SELECT
timeBucket(),
prettyFormat(avg(value), 'bytes') AS avg_memory
FROM metrics
WHERE metric_name = 'process.memory.usage'
GROUP BY timeBucket
ORDER BY timeBucket
LIMIT 1000
The available format types are:
| Format type | Renders as |
|---|---|
| `duration` | Milliseconds as human-readable time (e.g. "3.5s", "2m 15s") |
| `durationSeconds` | Seconds as human-readable time |
| `costInDollars` | Dollar formatting with appropriate precision |
| `cost` | Generic cost formatting |
| `bytes` | Byte counts with binary units (KiB, MiB, GiB) |
| `decimalBytes` | Byte counts with decimal units (KB, MB, GB) |
| `quantity` | Large numbers abbreviated (1.2M, 3.4K) |
| `percent` | Percentage formatting |
This is the same rendering system that powers the schema's customRenderType, but available for any expression you write. The dashboard widgets use it to display computed values with the right units.
TRQL doesn't try to reinvent standard analytical functions. ClickHouse aggregations like quantile(), countIf(), avg(), sum(), and round() are all available directly and passed through to ClickHouse unchanged. TRQL only adds custom functions when it needs behavior that ClickHouse can't provide on its own.
The query editor in the Trigger.dev dashboard is built on CodeMirror 6 and uses a dual-parser architecture.

Syntax highlighting and linting are handled by two completely different parsers:
For highlighting, we use CodeMirror's built-in Lezer grammar with the StandardSQL dialect. Lezer is an incremental parser, meaning it only re-parses the parts of the document that changed. This makes it fast enough to run on every keystroke without any perceptible lag. It tokenizes the text into syntax nodes (keywords, identifiers, strings, numbers, operators) and our custom theme maps these to colors.
For linting, we use the full ANTLR4-based TRQL parser. Every edit (debounced by 300ms) runs the complete TRQL pipeline: parseTSQLSelect() produces a full AST, then validateQuery(ast, schema) checks it against the table schemas. This catches unknown columns, invalid table names, and type mismatches and shows them as inline diagnostics.
Why two parsers? Lezer is fast but doesn't understand TRQL-specific semantics like virtual columns or allowed values. ANTLR understands everything but is too heavy to run on every keystroke for syntax coloring. Using both gives us the interactive responsiveness of Lezer with the correctness guarantees of ANTLR.
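Wiring the ANTLR pipeline into CodeMirror's lint extension could look roughly like this. It's a sketch: the module path and the error's position fields are assumptions, while parseTSQLSelect and validateQuery are the function names used above:

```ts
import { linter, type Diagnostic } from "@codemirror/lint";
import { parseTSQLSelect, validateQuery, tableSchemas } from "./trql"; // hypothetical module

// Run the full TRQL pipeline on each (debounced) edit and surface any
// parse or validation failure as an inline diagnostic.
const trqlLinter = linter(
  (view) => {
    const diagnostics: Diagnostic[] = [];
    try {
      const ast = parseTSQLSelect(view.state.doc.toString());
      validateQuery(ast, tableSchemas);
    } catch (err: any) {
      diagnostics.push({
        from: err.from ?? 0, // assumed error shape with source positions
        to: err.to ?? view.state.doc.length,
        severity: "error",
        message: err.message ?? String(err),
      });
    }
    return diagnostics;
  },
  { delay: 300 } // the 300ms debounce described above
);
```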
Autocompletion is entirely custom. We don't use CodeMirror's built-in SQL completion. Instead, the completion source analyzes the cursor position and the surrounding query text to determine context:
- After FROM or JOIN: show table names
- After SELECT, WHERE, GROUP BY, ORDER BY: show columns from the tables referenced in the query, plus functions
- After tableName.: show columns for that specific table
- After = or IN (: show allowedValues from that column's schema definition

This is where the schema really pays off. When you type WHERE status = ' the editor immediately suggests Completed, Failed, Crashed, and the other valid status values, because the schema declares them. The same allowedValues arrays that power validation also power autocomplete.

If you reference a column that doesn't exist, the linter catches it immediately and shows an inline error:

Every keystroke flows through three independent paths: Lezer for syntax highlighting, the ANTLR-based TRQL parser for linting, and the custom completion source for autocomplete.
The TableSchema type is the glue that connects all three. It defines table names for FROM suggestions, column definitions for column suggestions, allowed values for enum completion, and validation rules for the linter.
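Pulling the pieces of this post together, a rough shape for that type might look like this. It's inferred from the examples above, not the actual definition:

```ts
// An approximate TableSchema, assembled from the features described in
// this post: renames, virtual columns, transforms, render types, and
// allowed values.
interface ColumnSchema {
  name: string;                               // user-facing TRQL name
  clickhouseName?: string;                    // e.g. run_id -> friendly_id
  expression?: string;                        // virtual columns
  whereTransform?: (value: string) => string; // e.g. strip the run_ prefix
  customRenderType?: string;                  // e.g. "duration", "runStatus"
  allowedValues?: string[];                   // enum columns like status
}

interface TableSchema {
  name: string;             // e.g. "runs"
  clickhouseTable: string;  // e.g. "trigger_dev.task_runs_v2"
  columns: Record<string, ColumnSchema>;
}
```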
We enforce several limits to keep the system healthy for all users. Each one exists for a specific reason:
- Time range limits: without them, a simple SELECT * FROM runs would attempt a full table scan across the entire history. The allowed range varies by plan.
- Row limits: results are capped at 10,000 rows (we fetch one extra row so we can tell you when there are more).

TRQL is the foundation for everything we're building in observability. The same language powers the dashboard widgets, the SDK's query.execute() function, and the REST API. As we add more data to the system, we can expose it through new tables without changing the query language or the compilation pipeline.
If you haven't tried Query & Dashboards yet, every project already has a built-in dashboard waiting for you. Head to your Trigger.dev dashboard to try it out.
Read the companion post: Query & Dashboards: analytics for your Trigger.dev data