They are overoptimising for the simplest part of writing the application: the beginning. They've half-implemented an actual database, with none of the safety features. There are a lot of potential headaches this article avoids talking about, perhaps because they haven't experienced them yet.
See: https://danluu.com/file-consistency/
What happens when you need to start expanding the scope of this feature? Joining users on profiles, or users on orgs?
Ask yourself: how many shops have seriously written an application backed by files and stuck with it over the long-run? The answer is likely very few. Therefore, this is likely doubling up the work required.
There is a reason people reach for a database first. I'd strongly encourage anyone to avoid doing stuff like this.
Since they’re using Go to accept requests and forwarding them to their SQLite connection, it may have been worthwhile to produce the same interface with Rust to demonstrate whether or not SQLite itself was hitting its performance limit or if Go had some hand in that.
Other than that, it’s a good demonstration of how a custom solution for a lightweight task can pay off. Keep it simple but don’t reinvent the wheel if the needs are very general.
If you ever need to update a single byte in your data, please USE A PROPER DATABASE. Databases do a lot of fancy things to ensure you aren't going to corrupt or lose your data on disk, among other safety guarantees.
However the driving motivation for adding a database is not necessarily managing data, but the fact that the database system creates a nice abstraction layer around storing data of relational or non-relational form in non-volatile memory and controlling access to it while other systems are updating it. And because it's a nice abstraction, there are a lot of existing libraries that can take advantage of it in your language of choice without requiring you to completely invent all of that stuff over the top of the filesystem. That has knock-on effects when you're trying to add new functionality or new interaction patterns to an existing system.
And in cases where two or more processes need to communicate using the same data, a database gives you some good abstractions and synchronization primitives that make sense, whereas regular files or IPC require you to invent a lot of that stuff. You could use messaging to communicate updates to data but now you have two copies of everything, and you have to somehow atomize the updates so that either copy is consistent for a point in time. Why not use a database?
Knowing what I know today I would start with some kind of database abstraction even if it's not necessarily designed for transactional data, and I would make sure it handled the numerous concerns I have around data sharing, consistency, atomicity, and notification because if I don't have those things I eventually have to invent them to solve the reliability problems I otherwise run in to without them.
You'll likely end up quite a chump if you follow this logic.
SQLite has pretty strong durability and consistency mechanisms that their toy on-disk binary search doesn't have.
(And it is just a toy. It waves away the maintenance of the index, for god's sake, which is almost the entire issue with indexes!)
Typically, people need to change things over time as well, without losing all their data, so backwards compatibility and other aspects of flexibility that sqlite has are likely to matter too.
I think once you move beyond a single file read/written atomically, you might as well go straight to sqlite (or other db) rather than write your own really crappy db.
What the world needs is a hybrid - database ACID/transaction semantics with the ability to cd/mv/cp file-like objects.
Total hosting costs are £0 ($0) other than the domain name.
Also notable mention for JSON5 which supports comments!: https://json5.org/
I don't choose a DB over a flat file for its speed. I choose a DB for the consistent interface and redundancy.
Different languages and stdlib methods can often spend time doing unexpected things that make what look like apples-to-apples comparisons not quite equivalent.
It's the opposite. A file system is a database. And databases can recursively store their data within another database.
Then proceeds to (poorly) implement database on files.
Sure, a hash map taking ~400 MB in memory is going to offer you fast lookups. You can argue that some workloads will never reach that size, but what are you losing by using SQLite?
What happens when the service shuts down mid-write? Corruption that later results in a (poorly) implemented WAL being added?
SQLite also showed something important - it was consistent in all benchmarks regardless of dataset size.
So my opinion has thoroughly shifted to "start with a database, and if you _really_ don't need one it'll be obvious."
But you probably do.
I ended up just buying a VPS, putting openclaw on it, and letting it Postgres my app.
I feel like this article is outdated since the invention of OpenClaw/Claude Opus level AI Agents. The difficulty is no longer programming.
Nope. There are non-persistent in-memory databases too.
In fact, a database can be a plethora of things and the stuff they were building is just a subset of a subset (persistent, local relational database)
This is also why many databases have persistence issues and can easily corrupt on-disk data on crash. RocksDB on Windows a couple of years back is a simple example: it was regularly having corruption issues during development.
There is one conclusion that I do not agree with. Near the end, the author lists cases where you will outgrow flat files. He then says that "None of these constraints apply to a lot of applications."
One of the constraints is "Multiple processes need to write at the same time." It turns out many early stage products need crons and message queues that execute on a separate worker. These multiple processes often need to write at the same time. You could finagle it so that the main server is the only one writing, but you'd introduce architectural complexity.
So while from the pure scale perspective I agree with the author, if you take a wider perspective, it's best to go with a database. And sqlite is a very sane choice.
If you need scale, cache the most often accessed data in memory and you have the best of both worlds.
My winning combo is sqlite + in-memory cache.
For our use case — merge, split, compress — we went fully stateless. Files are processed in memory and never stored. No database needed at all.
The only time a database becomes necessary is when you need user accounts, history, or async jobs for large files. For simple tools, a database is often just added complexity.
The real question isn't "do you need a database" but "do you need state" — and often the answer is no.
As the years went by, I expected the client to move to something better, but he just stuck with it until he died about 20 years later; the family took over and had everything redone (it now runs WordPress).
The last time I checked, it had hundreds of thousands of orders and still had good performance. The evolution of hardware made this hack keep its performance well past what I had expected it to endure. I'm pretty sure SQLite would be just fine nowadays.
The very small bonus you get on small apps is hardly worth the time you spend redeveloping the wheel.
I can use databases just fine, but will never be able to make wise decisions about table layouts, ORMs, migrations, backups, scaling.
I don't understand the culture of "oh we need to use this tool because that's what professionals use" when the team doesn't have the knowledge or discipline to do it right and the scale doesn't justify the complexity.
a jsonl file and a single go binary will literally outlive most startup runways.
also, the irony of a database gui company writing a post about how you don't actually need a database is pretty based.
I think a better way to ask this question is "does this application and its constraints necessitate a database? And if so, which database is the correct tool for this context?"
That's why it could handle massive traffic with very little issues.
Memory of course, as you wrote, also seems reasonable in many cases.
Just have to use locks to be careful with writes.
I figured I'd migrate it to a database after maybe 10k users or so.
Overhead in any project is understanding it and onboarding new people to it. Keeping on "mainline" path is key to lower friction here. All 3 languages have well supported ORM that supports SQLite.
In my (hypothetical, 'cause I never actually sat down and wrote that) case, I wanted the personal transactions in a month, and I realized I could just keep one single file per month, and read the whole thing at once (also 'cause the application would display the whole month at once).
Filesystems can be considered a key-value (or key-document) database. The funny thing about the example used in the link is that one could simply create a structure like `user/[id]/info.json` and directly access the user by ID instead of scanning a file to find them -- again, given the examples used, search by name would be a pain, and that's one point where databases would handle things better.
Now that said, if there's value to the "database" being human readable/editable, json is still well worth a consideration. Dealing with even sqlite is a pain in the ass when you just need to tweak or read something, especially if you're not the dev.
> Doing atomic writes is extremely fragile if you are just on top of the filesystem.
This is not true, at least on Linux: pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC); the requirements being that the write must be block-aligned and no larger than the underlying FS's guaranteed atomic write size.

At the very least, write to a temp file (in the same filesystem), fsync the file and its directory, and rename it over the original.
There are also things besides databases that I'll DIY and then still wonder why so many people use a premade tool for it, like log4j
I suggest every developer learn how to replicate, backup and restore the very database they are excited about, from scratch at least once. I propose this will teach them what takes to build a production ready system and gain some appreciation for other ways of managing state.
But yeah, the page cache point is real and massively underappreciated. Modern infrastructure discourse skips past it almost entirely. A warm NVMe-backed file with the OS doing the caching is genuinely fast enough for most early-stage products.
I'm pretty sure most startups just use a quick and easy CRM that makes this process easy, and that tool will certainly use a database.
They were terrific reads; his writing on object-oriented databases was the most fun technical reading I did in grad school. And I even learned a lot!
In practice, the records themselves took no less than 30 joins for a flat view of the record data that was needed for a single view of what could/should have been one somewhat denormalized record. In the early 2010s that meant the main database was often kicked over under load, and it took a lot of effort to add in appropriate caching and the search db, which wound up handling most of the load on a smaller server.
And then another index. And at some point you want to ensure uniqueness or some other constraint.
And then you’re rewriting a half-complete and buggy SQLite. So I’ve come around to defaulting to SQLite/PostgresQL unless I have a compelling need otherwise. They’re usually the right long-term choice for my needs.
How? With SQL it's super easy to search, compare, and update data. That's what it's built for.
"Why use spray paint when you can achieve the same effect by ejecting paint from your mouth in a uniform high-velocity mist?" If you happen to have developed that particular weird skill, by all means use it, but if you haven't, don't start now.
That probably sounds soft and lazy. I should learn to use my operating system's filesystem APIs safely. It would make me a better person. But honestly, I think that's a very niche skill these days, and you should consider if you really need it now and if you'll ever benefit from it in the future.
Also, even if you do it right, the people who inherit your code probably won't develop the same skills. They'll tell their boss it's impossibly dangerous to make any changes, and they'll replace it with a database.
Doesn't scale at all, though - all of the data that needs to be self-consistent needs to be part of the same file, so unnecessary writes go through the roof if you're only doing small updates on a giant file. Still gotta handle locking if there is risk of a stray process messing it up. And doing this only handles part of ACID.
Edit: I just submitted a link to Joe Armstrong's Minimum Viable Programs article from 2014. If the response to my comment is about the enterprise and imaginary scaling problems, realize that those situations don't apply to some programming problems.
If that kind of filesystem traffic is unsuitable for your application then you will reinvent journaling or write-ahead logging. And if you want those to be fast you'll implement checkpointing and indexes.
This is by design, the idea is that scaling your application layer is easy but scaling your storage/db layer is not
Hence make the storage dumb and have the application do the joins and now your app scales right up
(But tbh I agree a lot of applications don’t reach the scale required to benefit from this)
check_empty_vhosts () {
# Check which vhost adapter doesn't have any VTD mapped
start_sqlite
tosql "SELECT l.vios_name,l.vadapter_name FROM vios_vadapter AS l
LEFT OUTER JOIN vios_wwn_disk_vadapter_vtd AS r
USING (vadapter_name,vios_name)
WHERE r.vadapter_name IS NULL AND
r.vios_name IS NULL AND
l.vadapter_name LIKE 'vhost%';"
endsql
getsql
stop_sqlite
}
check_empty_vhosts_sh () {
# same as above, but on the shell
join -v 1 -t , -1 1 -2 1 \
<(while IFS=, read vio host slot; do
if [[ $host == vhost* ]]; then
print ${vio}_$host,$slot
fi
done < $VIO_ADAPTER_SLOT | sort -t , -k 1)\
<(while IFS=, read vio vhost vtd disk; do
if [[ $vhost == vhost* ]]; then
print ${vio}_$vhost
fi
done < $VIO_VHOST_VTD_DISK | sort -t , -k 1)
}

We have a bunch of these applications and they are a joy to work with.
Funny enough, even if you have a database, if you wonder if you need caches to hold state in your application server, the answer is, kindly, fuck no. Really, really horrible scaling problems and bugs are down that path.
There are use cases to store expensive to compute state in varnish (HTTP caching), memcache/redis (expensive, complex datastructures like a friendship graph), elasticsearch/opensearch (aggregated, expensive full-text search), but caching SQL results in an application server because the database is "slow" beyond a single transaction brings nothing but pain in the future. I've spent so much energy working around decisions born out of simple bad schema design decisions and tuning...
During the Cretaceous, when dinosaurs were at their peak, sharks had already become very similar to the sharks of today; e.g., there were big sharks that differed very little from the white sharks and tiger sharks of today.
Then the dinosaurs disappeared, together with the pterosaurs and the mosasaurs, and were replaced by other animals, but the sharks have continued to live until today with little change, because they had already reached an optimized design that was hard to improve.
Besides the sharks, during the Cretaceous there already existed alongside the dinosaurs two other groups of big predators that have changed little since then: crocodiles and big constrictor snakes similar to the pythons of today.
Therefore all 3 (sharks, crocodiles, and big constrictor snakes) are examples of locally optimal designs that were reached more than 70 million years ago, without needing any major upgrades since.
In practice, I almost always separate the auth chain from the service chain(s) in that if auth gets kicked over under a DDoS, at least already authenticated users stand a chance of still being able to use the apps. I've also designed auth system essentially abstracted to key/value storage with adapters for differing databases (including SQLite) for deployments...
Would be interested to see how LevelDB might perform for your testing case, in that it seems to be a decent option for how your example is using data.
we've basically been brainwashed to think we need kubernetes and 3 different databases just to serve a few thousand users. gotta burn those startup cloud credits somehow i guess.
mad respect for the honesty though, actually makes me want to check out db pro when i finally outgrow my flat files.
Well, I guess that at least confirms Oracle on Itanium (!?) still supported RAW 5 years ago.
I'm guessing everyone's on ASM by now though, if they're still upgrading. I ran into a company not long ago with a huge oracle cluster that still employed physical database admins and logical database admins as separate roles...I would bet they're still paying millions for an out of date version of Oracle and using RAW.
That it's now in the box (node:sqlite) for Deno/TS makes it that much more of an easy button option.
Sadly no solution for non-rooted consoles.
One would think that for a startup of sorts, where things change fast and are unpredictable, NoSQL is the correct answer. And when things are stable and the shape of entities is known, going for SQL becomes a natural path.
There are also cases for having both, and there are cases for graph-oriented databases or even columnar ones such as DuckDB.
Seems to me, with my very limited experience of course, everything leads to the same boring fundamental issue: rarely does the issue lie in infrastructure; it's mostly bad design decisions and poor domain knowledge. Realistically, how often is the bottleneck really the type of database, versus the quality of the code and the implementation of the system design?
Why waste time screwing around with ad-hoc file reads, then?
I mean, what exactly are you buying by rolling your own?
Wait until you actually need it.
“Virding's First Rule of Programming: Any sufficiently complicated concurrent program in another language contains an ad hoc informally-specified bug-ridden slow implementation of half of Erlang.”
This is a solid takeaway and applies to a lot of domains. Great observation
NoSQL is the "correct" answer if your queries are KV oriented, while predictable performance and high availability are priority (true for most "control planes"). Don't think any well-designed system will usually need to "graduate" from NoSQL to SQL.
In this case, I feel like using the filesystem directly is the opposite: doing much more difficult programming and creating more complex code, in order to do less.
It depends on how you weigh the cost of the additional dependency that lets you write simpler code, of course, but I think in this case adding a SQLite dependency is a lower long-term maintenance burden than writing code to make atomic file writes.
The original post isn't about simplicity, though. It's about performance. They claim they achieved better performance by using the filesystem directly, which could (if they really need the extra performance) justify the extra challenge and code complexity.
Premature optimisation I believe that's called.
I've seen it play out many times in engineering over the years.
SELECT * FROM read_csv('example.csv');

Writing generally involves reading into an in-memory database, making whatever changes you want, then something like COPY new_table TO 'example.csv' (HEADER true, DELIMITER ',');

You should be squashing bugs related to your business logic, not core data storage. Local data storage on your one horizontally-scaling box is a solved problem using SQLite. Not to mention atomic backups?
I'm also a convert.
Similar sentiment.
I don't think it makes any sense to presume everyone around you is brainwashed and you are the only soul cursed with reasoning powers. Might it be possible that "we" are actually able to analyse tradeoffs and understand the value of, say, have complete control over deployments with out of the box support for things like deployment history, observability, rollback control, and infrastructure as code?
Or is it brainwashing?
Let's put your claim to the test. If you believe only brainwashed people could see value in things like SQLite or Kubernetes, what do you believe are reasonable choices for production environments?
However, if all your application state can be represented in a single JSON file of less than a dozen MB, then yes, a database can be overkill.
NoSQL gains you no speed at all in redesigning your system. Instead, you trade a few hard to do tasks in data migration into an unsurmountable mess of data inconsistency bugs that you'll never actually get into the end of.
> is mostly bad design decisions and poor domain knowledge
Yes, using NoSQL to avoid data migrations is a bad design decision. Usually created by poor general knowledge.
A lot of the bespoke NoSQL data stores really started to come to the forefront around 2010 or so. At that time, a high-end server had 8 CPU cores and 10k RPM SAS spinning drives. Today, we have well over 100 cores, TBs of RAM, and PCIe Gen 4/5 NVMe storage (u.x) that is thousands of times faster, at a total cost lower than those 2010-era servers, which your average laptop can outclass today.
You can vertically scale a traditional RDBMS like PostgreSQL to an extreme degree... Not to mention utilizing features like JSONB where you can have denormalized tables within a structured world. This makes it even harder to really justify using NoSQL/NewSQL databases. The main bottlenecks are easier to overcome if you relax normalization where necessary.
There's also the consideration of specialized databases or alternative databases where data is echo'd to for the purposes of logging, metrics or reporting. Not to mention, certain layers of appropriate caching, which can still be less complex than some multi-database approaches.
You could also consider renting an Oracle DB. Yep! Consider some unintuitive facts:
• It can be cheaper to use Oracle than MongoDB. There are companies that have migrated away from Mongo to Oracle to save money. This idea violates some of HN's most sacred memes, but there you go. Cloud databases are things you always pay for, even if they're based on open source code.
• Oracle supports NoSQL features including the MongoDB protocol. You can use the Mongo GUI tools to view and edit your data. Starting with NoSQL is very easy as a consequence.
• But... it also has "JSON duality views". You start with a collection of JSON documents and the database not only works out your JSON schemas through data entropy analysis, but can also refactor your documents into relational tables behind the scenes whilst preserving the JSON/REST oriented view e.g. with optimistic locking using etags. Queries on JSON DVs become SQL queries that join tables behind the scenes so you get the benefits of both NoSQL and SQL worlds (i.e. updating a sub-object in one place updates it in all places cheaply).
• If your startup has viral growth you won't have db scaling issues because Oracle DBs scale horizontally, and have a bunch of other neat performance tricks like automatically adding indexes you forgot you needed, you can materialize views, there are high performance transactional message queues etc.
So you get a nice smooth scale-up and transition from ad hoc "stuff some json into the db and hope for the best" to well typed data with schemas and properly normalized forms that benefit from all the features of SQL.
Never!
Battle-tested, extremely performant, easier to use than a homegrown alternative?
By all means, hack around and make your own pseudo-database file system. Sounds like a fun weekend project. It doesn't sound easier or better or less costly than using SQLite in a production app though.
What's special about SQLite is that it already solves most of the things you need for data persistence without adding the same kind of overhead or trade offs as Postgres or other persistence layers, and that it saves you from solving those problems yourself in your json text files...
Like by all means don't use SQLite in every project. I have projects where I just use files on the disk too. But it's kinda inane to pretend it's some kind of burdensome tool that adds so much overhead it's not worth it.
What is more likely, if you are making good decisions, is that you'll reach a point where the simple approach will fail to meet your needs. If you use the same attitude again and choose the simplest solution based on your _need_, you'll have concrete knowledge and constraints that you can redesign for.
Surely it does? Otherwise you cannot trust the interface point with SQLite and you're no further ahead. SQLite being flawless doesn't mean much if you screw things up before getting to it.
Why set up a Go binary and a JSON file? Just use Google Forms and move on, or pay someone for a dead simple form system so you can capture and communicate with customers.
People want to do the things that make them feel good - writing code to fit in just the right size, spending money to make themselves look cool, getting "the right setup for the future so we can scale to all the users in the world!" - most people don't consider the business case.
What they "need" is an interesting one because it requires a forecast of what the actual work to be done in the future is, and usually the head of any department pretends they do that when in reality they mostly manage a shared delusion about how great everything is going to go until reality hits.
I have worked for companies getting billions of hits a month and ones that I had to get the founder to admit there's maybe 10k users on earth for the product, and neither of them was good at planning based on "what they need".
Regardless of whether most apps have enough requests per second to "need" a database for performance reasons, these are extremely important topics for any app used by a real business.
my point is strictly about premature optimization. i've seen teams spend their first month writing helm charts and terraform before they even have a single paying user. if you have product-market fit and need zero-downtime rollbacks, absolutely use k8s. but if you're just validating an mvp, a vps and docker-compose (or sqlite) is usually enough to get off the ground.
its all about trade-offs tbh.
Stop and go ask more questions until you have a better understanding of the problem.
Even then, PostgreSQL and even MS-SQL are often decent alternatives for most use cases.
What overhead?
SQLite is literally more performant than fread/fwrite.
Yes agreed, but it's usually a lot easier to find the filename part, especially if the application follows XDG. Sqlite databases are usually buried somewhere because they aren't expected to be looked at.
It may not have a very rigid schema, you may later add several other optional fields.
You need very large scale (as in number of concurrent accesses), or you want to shard the data by e.g. location. But also, the data is not "critical": your high school not being visible temporarily for certain users is not an issue.
You mostly use the whole dataset "at the same time", you don't do a lot of WHERE, JOIN on some nested value.
In every other case I would rather reach for postgres with a JSONB column.
That said, I've leaned into avoiding breaking up a lot of microservices unless/until you need them... I'm also not opposed to combining CQRS style workflows if/when you do need micro services. Usually if you need them, you're either breaking off certain compute/logic workflows first where the async/queued nature lends itself to your needs. My limited experience with a heavy micro-service application combined with GraphQL was somewhat painful in that the infrastructure and orchestration weren't appropriately backed by dedicated teams leading to excess complexity and job duties for a project that would have scaled just fine in a more monolithic approach.
YMMV depending on your specific needs, of course. You can also have microservices call natural services that have better connection sharing heuristics depending again on your infrastructure and needs... I've got worker pools that mostly operate of a queue, perform heavy compute loads then interact with the same API service(s) as everything else.
e.g. worry about what makes your app unique. Data storage is not what makes your app unique. Outsource thinking about that to SQLite
If you have either of those problems, you will know it very clearly.
Also, ironically, Postgres became one of the most scalable NoSQL bases out there, and one of the most flexible to use unstructured data too.
So yeah running a relational DB used to be quite high effort but it got a lot better over time.
I should have charged him a percentage. Even if I had charged 0.5%, I would have made more money.
TBF, I haven't had to use Oracle in about a decade at this point... so I'm not sure how well it competes... My experiences with the corporate entity itself leave a lot to be desired, let alone just getting setup/started with local connectivity has always been what I considered extremely painful vs common alternatives. MS-SQL was always really nice to get setup, but more recently has had a lot of difficulties, in particular with docker/dev instances and more under arm (mac) than alternatives.
I'm a pretty big fan of PG, which is, again, very widely available and supported.
I think PG doesn't have most of the features I named, I'm pretty sure it doesn't have integrated queues for example (SELECT FOR UPDATE SKIP LOCKED isn't an MQ system), but also, bear in mind the "postgres" cloud vendors sell is often not actually Postgres. They've forked it and are exploiting the weak trademark protection, so people can end up more locked in than they think. In the past one cloud even shipped a transaction isolation bug in something they were calling managed Postgres, that didn't exist upstream! So then you're stuck with both a single DB and a single cloud.
Local dev is the same as other DBs:
docker run -d --name <oracle-db> container-registry.oracle.com/database/free:latest
See https://container-registry.oracle.com. Works on Intel and ARM. I develop on an ARM Mac without issue. It starts up in a few seconds.
Cost isn't necessarily much lower. At one point I specced out a DB equivalent to what a managed Postgres would cost for OpenAI's reported workload:
> I knocked up an estimate using Azure's pricing calculator and the numbers they provide, assuming 5TB of data (under-estimate) and HA option. Even with a 1 year reservation @40% discount they'd be paying (list price) around $350k/month. For that amount you can rent a dedicated Oracle/ExaData cluster with 192 cores! That's got all kinds of fancy hardware optimizations like a dedicated intra-cluster replication network, RDMA between nodes, predicate pushdown etc. It's going to perform better, and have way more features that would relieve their operational headache.
Yes I meant it in this sense: "If you knock something up, you make it or build it very quickly, using whatever materials are available."
https://www.collinsdictionary.com/dictionary/english/knock-u...
A database is just files. SQLite is a single file on disk. PostgreSQL is a directory of files with a process sitting in front of them. Every database you have ever used reads and writes to the filesystem, exactly like your code does when it calls open().
So the question is not whether to use files. You're always using files. The question is whether to use a database's files or your own. And for a lot of applications, especially early-stage ones, the answer might be: your own.
Now, obviously we love databases. We're building DB Pro, a database client for Mac, Windows, and Linux. But the honest answer to "do you need one?" depends on your scale, and most applications are smaller than people assume. We tested this. We built the same HTTP server in Go, Bun, and Rust, using two storage strategies, and hammered them with wrk. Here's what the numbers look like.
Three flat files: users.jsonl, products.jsonl, orders.jsonl. The format is newline-delimited JSON (JSONL): one record per line, appended on write. Each file holds one entity type.
Two HTTP endpoints: POST /users to create, GET /users/:id to fetch by ID. We benchmarked the GET path. Reads are where the strategies diverge.
The simplest thing you can do: when a request comes in for user abc-123, open the file, scan every line, parse each one as JSON, check the ID. Return when you find a match.
Go:
TypeScript (Bun):
Rust:
This is O(n). Every request reads the entire file from top to bottom, on average scanning half of it before finding the target. The larger the file, the slower every request gets.
On startup, read the entire file once and store every record in a hash map keyed by ID. Writes go to both the map and the file. Reads are a single map lookup.
The file is the durable backing store. The map is the index. If the process restarts, reload from the file.
Go:
TypeScript (Bun):
Rust:
Read path is now O(1) at any scale. The sync.RWMutex in Go and RwLock in Rust let multiple readers proceed in parallel, so concurrent requests don't block each other.
What if you need reads that don't load everything into RAM, but also don't scan the whole file? The middle ground: sort the data file by ID, build a fixed-width index alongside it, and binary search the index using ReadAt. Each lookup does O(log n) seeks (about 20 for 1M records), then reads exactly one record from the data file.
The index format is simple: one line per record, exactly 58 bytes: <36-char UUID>:<20-digit byte offset in data file>\n. Fixed width means you can jump to any entry with a single ReadAt(buf, entryIndex * 58).
The data file must be sorted by ID before building the index. We sort once at seed time, or as a one-time migration step on an existing JSONL file. Appending new records breaks the sort, so in a real system you'd rebuild the index periodically or keep an unsorted write-ahead buffer and merge it in. That merge pattern is what an LSM-tree does.
We seeded three datasets (10k, 100k, and 1M records) and used wrk to run 10 seconds of load against each server: 4 threads, 50 concurrent connections, random GET requests picking from a sampled list of real IDs.
All servers ran on the same machine (Apple M1 Mac mini, macOS 15). Go 1.26, Bun 1.3, Rust 1.94 (release build).
We also tested two more approaches in Go: a binary search against a sorted file on disk, and SQLite using modernc.org/sqlite (pure Go, no CGO). The binary search uses a fixed-width index file (58 bytes per entry: <uuid>:<offset>) to do O(log n) ReadAt calls, then seeks directly to the matching record. No data loaded into RAM.
Requests per second (higher is better)
| Approach | 10k records | 100k records | 1M records |
|---|---|---|---|
| Linear scan (Go) | 783 | 85 | 23 |
| Binary search (Go) | 45,742 | 41,661 | 38,866 |
| SQLite (Go) | 26,000 | 25,507 | 25,085 |
| In-memory map (Go) | 97,040 | 98,277 | 97,829 |
| Linear scan (Bun) | 469 | 61 | 19 |
| In-memory map (Bun) | 106,064 | 107,058 | 105,367 |
| Linear scan (Rust) | 2,883 | 251 | 52 |
| In-memory map (Rust) | 163,687 | 155,364 | 169,106 |
Average latency per request (lower is better)
| Approach | 10k records | 100k records | 1M records |
|---|---|---|---|
| Linear scan (Go) | 84ms | 552ms | 1,010ms |
| Binary search (Go) | 1.2ms | 1.4ms | 1.4ms |
| SQLite (Go) | 2.0ms | 2.0ms | 2.1ms |
| In-memory map (Go) | 497µs | 571µs | 584µs |
| Linear scan (Bun) | 101ms | 754ms | 1,060ms |
| In-memory map (Bun) | 449µs | 443µs | 463µs |
| Linear scan (Rust) | 17ms | 195ms | 753ms |
| In-memory map (Rust) | 231µs | 482µs | 221µs |
A few things worth pointing out.
Linear scan degrades linearly. At 1M records, Go is serving 23 requests per second and each Bun request takes over a second on average. At that point you're not tuning performance, you're explaining to users why the page won't load.
Binary search on disk is fast and flat. 45k req/s at 10k records, 38k req/s at 1M records. That's only a 15% drop as the dataset grows 100x. The OS page cache does a lot of work here: after a warmup period, the 566KB index file for 10k records fits entirely in cache. For 1M records the index is ~55MB, but the top levels of the binary search always hit the same offsets near the middle of the file, so those pages stay hot regardless of which key you're looking up. Each lookup does ~20 ReadAt calls on the index plus one Seek into the data file.
Binary search beats SQLite. This was unexpected. Plain sorted files with a hand-rolled index outperform SQLite's B-tree by about 1.7x at every scale. SQLite does more work per lookup than a hand-rolled binary search, even for a simple primary key read. That overhead is worth it when you need the features. For a pure ID lookup, you're paying for machinery you're not using.
SQLite is fast and consistent. 25,000 to 26,000 req/s regardless of dataset size, with 2ms average latency. The B-tree index means lookup time barely changes as records grow from 10k to 1M.
In-memory map is the ceiling. 97k req/s with sub-millisecond latency at every scale. If your dataset fits in RAM, nothing on disk will match it.
Bun (JavaScript) beats Go on the map approach. Bun's HTTP server averages around 106k req/s vs Go's 97k. Bun uses JavaScriptCore as its JS engine and implements its HTTP layer natively in Zig via uWebSockets, bypassing libuv entirely. The language matters less than the runtime.
Rust wins on linear scan by a wide margin. At 10k records, Rust does 2,883 req/s vs Go's 783 and Bun's 469. That's 3-6x faster for the naive file scan, likely a combination of lower I/O overhead and faster JSON deserialization via serde. For the map approach, Rust leads but the gap narrows considerably.
Pick by use case:
| Use case | Winner |
|---|---|
| Absolute fastest throughput | In-memory map (Rust) (163k+ req/s) |
| Fastest without loading everything into RAM | Binary search (Go) (~40k req/s, flat as data grows) |
| Need SQL queries later | SQLite (Go) (25k req/s, full SQL when you need it) |
| Quickest to ship | Linear scan (the simplest code; fine while files stay small) |
Before we talk about when you need a database, let's put these numbers in context. Because "25,000 requests per second" sounds like a lot, and it is, but it helps to think about what kind of product generates that kind of load.
Traffic is not uniform. Users are awake during the day and asleep at night. Capacity planning guides for web applications generally assume a peak-to-average ratio of around 1.5 to 2.0 for B2B and B2C products (ByteByteGo, Geek Culture). Let's use 2:1, which means a product averaging 12,500 req/s across the day might spike to 25,000 req/s during its busiest hour.
Now work backwards. Let's assume an active user triggers around 10 database lookups per hour — loading their profile, fetching their data, that kind of thing. That's a rough number; your app might be higher or lower. Let's also assume 10% of your daily active users are online at the same time during peak.
Peak req/s = DAU × 0.10 × (10 lookups/hr ÷ 3600 sec/hr) = DAU × 0.000278
Flip it around to find the DAU that saturates each approach:
| Approach | Peak capacity | DAU to saturate it |
|---|---|---|
| Linear scan (Go, 10k records) | 783 req/s | 2.8M users |
| Binary search (Go) | 40,000 req/s | 144M users |
| SQLite (Go) | 25,000 req/s | 90M users |
| In-memory map (Go) | 97,000 req/s | 349M users |
| In-memory map (Bun) | 106,000 req/s | 381M users |
| In-memory map (Rust) | 169,000 req/s | 608M users |
The linear scan breaks at a real product scale: around 3 million daily active users with a 10k record file. That's a meaningful number.
Everything else? You would need tens of millions of daily active users to push these approaches hard. Instagram was at 400 million daily active users and still running PostgreSQL as their primary data store (Instagram Engineering). Most products never get there.
To give a more grounded example: a SaaS with 10,000 paying customers where each customer uses the app once a day generates around 3 req/s peak under these assumptions. A consumer app with 100,000 DAU generates around 30 req/s at peak under these assumptions. Neither comes close to any of the approaches we tested.
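The arithmetic behind these figures is easy to sanity-check. A small sketch under the same assumptions (10% of DAU online at peak, 10 lookups per user per hour; function names are ours):

```go
// Capacity arithmetic from the section above.
package main

const (
	onlineFraction = 0.10 // share of DAU online at peak
	lookupsPerHour = 10.0 // lookups per active user per hour
)

// peakRPS converts daily active users to peak requests per second.
func peakRPS(dau float64) float64 {
	return dau * onlineFraction * lookupsPerHour / 3600
}

// saturationDAU inverts it: the DAU a given req/s capacity supports.
func saturationDAU(rps float64) float64 {
	return rps / (onlineFraction * lookupsPerHour / 3600)
}
```

For example, `peakRPS(100_000)` comes out just under 28 req/s, and `saturationDAU(25_000)` comes out at 90 million, matching the table.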
The honest answer to "do you need a database?" is: probably not yet. And when you do, SQLite running from a flat file handles 90 million daily active users on a single server.
For lookup by ID: the in-memory map handles ~97k req/s, binary search on disk handles ~40k req/s, and SQLite handles ~25k req/s. All three are well above what most applications will ever see from a single server.
The cases where you'll outgrow flat files:
Your dataset doesn't fit in RAM. The in-memory map approach requires loading everything at startup. At a few million small records that's fine. At tens of millions, you're looking at gigabytes of RAM just for the index. You need a way to page data in and out. A database does this for you.
You need to query by more than one field. Right now, the only fast operation is "find by ID." If you need "find all orders for user X" or "find all products with price under $50," you'd need to scan the file or maintain additional maps. Once you have three or four of those, you've built a query engine.
You need joins. Combining orders with users and products in a single response means loading from three files and assembling the result in application code. SQL does this more efficiently.
Multiple processes need to write at the same time. The RwLock in these servers protects concurrent access within one process. As soon as you run two instances of the server, both with their own in-memory map, they diverge. You need an external source of truth. That's what a database is.
You need atomic writes across entities. Creating an order while decrementing inventory needs to either both succeed or both fail. With separate files, you'd have to implement a transaction log yourself. Databases solve this with ACID guarantees.
None of these constraints apply to a lot of applications. Plenty of internal tools, side projects, and early-stage products will never have a dataset that doesn't fit in a single server's RAM, never need to join across tables under heavy load, and never run more than one instance. For those applications, this approach works.
The file is still there if you need to migrate later. JSONL is trivially importable into any database. You're not locked in.
The server code for all three languages is embedded above. The seed script, benchmark runner, and wrk Lua script aren't shown inline — download the full project to run it yourself:
Download the benchmark code (.zip)
The zip contains go-server/, bun-server/, rust-server/, seed.ts, and run_bench.sh. The benchmark script seeds data at three scales, generates a Lua script with sampled real IDs, starts each server, runs wrk, and tears it down.
Quick start: