The Third Hard Problem

I have always called this the “one true taxonomy” problem, because whenever you sit with multiple stakeholders in a room talking about a taxonomy, you can never get to agreement, because there is no such thing as the “one true taxonomy”.

Any hierarchical taxonomy classifies on one dimension at each taxonomic level. Invariably someone wants to classify on one criterion when someone else wants to classify on another. Taxonomies that humans use aren’t multi-dimensional. So if there is a disagreement, someone wins and someone (or several someones) has to lose.

No one is wrong; they just have different priorities or preferences or goals.

So now as an architect I never argue (and seldom discuss) taxonomies. I make two points and then bow out:

1. Whatever your taxonomy is, you need a rubric for each level. You need a procedure or set of questions that unambiguously map any $THING you encounter into exactly one bucket. Validate that competent people with no specific domain knowledge can properly classify things with your rubric; it must be repeatable by amateurs, not just experts (software is dumb).

2. Existence trumps theory. If there exists a taxonomy and rubric for what you’re classifying, you need to provide a $DARN_GOOD_REASON why this wheel needs reinventing. Personal preference and your 1% edge case probably don’t justify all the work to reinvent everything.
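To make point 1 concrete, here is a minimal sketch of a rubric as an ordered set of yes/no questions, with a check that exactly one bucket matches. The wheel-count rubric and the names `classify` and `WHEEL_RUBRIC` are invented for illustration; a real rubric would come from the domain.

```python
def classify(thing, rubric):
    """Return the single bucket whose question answers True for `thing`.

    Raises ValueError when zero or several buckets match, which is the
    signal that the rubric is incomplete or ambiguous at this level.
    """
    matches = [bucket for bucket, question in rubric if question(thing)]
    if len(matches) != 1:
        raise ValueError(f"rubric is ambiguous or incomplete for {thing!r}: {matches}")
    return matches[0]


# Hypothetical rubric for one taxonomic level: classify by wheel count.
WHEEL_RUBRIC = [
    ("two-wheeled", lambda v: v["wheels"] == 2),
    ("four-wheeled", lambda v: v["wheels"] == 4),
    ("other", lambda v: v["wheels"] not in (2, 4)),
]

print(classify({"wheels": 4}, WHEEL_RUBRIC))
```

Running every $THING you already have through a check like this is a cheap way to find the overlaps and gaps before people start arguing about them.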

Then, I go back to the implementers and tell them to design in a tagging system, which is a DIY taxonomy, and except in ridiculous use cases, I can make indexes make it fast enough to let everyone overlay their own classification system.
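As a rough sketch of what "a tagging system plus indexes" can look like, assuming an in-memory inverted index (the class `TagIndex` and the file names are hypothetical; a real system would more likely use database indexes):

```python
from collections import defaultdict


class TagIndex:
    """Inverted index from tag to the set of items carrying it."""

    def __init__(self):
        self._items_by_tag = defaultdict(set)
        self._tags_by_item = defaultdict(set)

    def tag(self, item, *tags):
        """Attach any number of tags to an item."""
        for t in tags:
            self._items_by_tag[t].add(item)
            self._tags_by_item[item].add(t)

    def find(self, *tags):
        """Return the items that carry every one of the given tags."""
        if not tags:
            return set()
        # Intersect starting from the smallest set to keep the work low.
        sets = sorted((self._items_by_tag.get(t, set()) for t in tags), key=len)
        result = set(sets[0])
        for s in sets[1:]:
            result &= s
        return result

    def tags_of(self, item):
        """Return a copy of the tags attached to an item."""
        return set(self._tags_by_item.get(item, set()))


index = TagIndex()
index.tag("invoice-17.pdf", "finance", "2023", "client:acme")
index.tag("logo.svg", "design", "client:acme")
index.tag("budget.xlsx", "finance", "2024")

print(index.find("client:acme", "finance"))
```

Each stakeholder can then treat their own tag prefix (for example `client:`) as their private classification scheme without anyone having to win the argument.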

I have a deep distrust of hierarchies, because they keep you trapped in a single model that keeps extending its authority, usually without anyone explicitly deciding that it should do so. For example, the file system: once hierarchy was deemed the main metaphor for navigation, the structure persisted and was reused for organisation, ownership, access control and governance. It became infrastructure we cannot easily remove before we could even question whether it was right or not. And once it dominated, non-hierarchical things were retrofitted as glue, e.g.: symlinks, aliases, shortcuts... Also, when's the last time you used a tag?

Webs are so much more malleable, but they're also not free. All the 'good enough's that a hierarchy was taking care of implicitly are now your responsibility to model precisely, and you have to make sure they're performant as well. Look at ReBAC, for example. It gives you expressive power, but it also forces you to reason precisely about relationships, graph traversal, consistency, and cost. GraphQL is strikingly similar.
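A toy illustration of the ReBAC point, assuming relationship tuples of the form (subject, relation, object) and a made-up rule that "viewer" is inherited through group membership and through parent folders. Every name here is invented, and this is not how any particular ReBAC product models things:

```python
from collections import deque

# Relationship tuples (subject, relation, object). All names are invented.
TUPLES = {
    ("alice", "member", "team:eng"),
    ("team:eng", "viewer", "folder:specs"),
    ("folder:specs", "parent", "doc:design"),
    ("bob", "viewer", "doc:design"),
}


def reachable(start, relation, tuples, forward=True):
    """Breadth-first closure of `start` over one relation.

    forward=True follows subject -> object; forward=False follows
    object -> subject. Each step scans every tuple, so the cost of the
    traversal is explicit rather than hidden.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for subj, rel, obj in tuples:
            if rel != relation:
                continue
            if forward and subj == node:
                nxt = obj
            elif not forward and obj == node:
                nxt = subj
            else:
                continue
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def can_view(user, resource, tuples=TUPLES):
    """True if the user, or a group they belong to, is a viewer of the
    resource or of any of its ancestor folders."""
    subjects = reachable(user, "member", tuples, forward=True)
    resources = reachable(resource, "parent", tuples, forward=False)
    return any((s, "viewer", r) in tuples for s in subjects for r in resources)


print(can_view("alice", "doc:design"))
```

In a folder-only permission model the "inherit from parent" rule comes for free; here it is one of two traversals you had to decide on, write down, and pay for on every check.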

Interestingly, source code is hierarchical, but compiles almost immediately to a graph IR and most analysis and optimisations happen there. But almost nobody looks at a CFG/SSA graph directly. You author in a hierarchical manner, yet the operational substrate is a malleable graph.

I doubt a perfect taxonomy for everything could exist. Taxonomies are subjective to individuals but somehow can be (ahem! AI) mapped relative to someone else’s preferred taxonomy. The efficiency and meaning each individual (or organization or community) derives from one would be completely different.

The problem with trees is that they are a dimensional reduction, an aggregation: taking a problem without directionality and applying a useful/functional hierarchy.

And that's a problem because Aggregability is NP-Hard: https://dl.acm.org/doi/abs/10.1145/1165555.1165556

So a tree is a way to take a high dimensionality graph and make it usefully lower dimensionality, but, given the aforementioned proof, that reduction is going to go from being a lossless compression to a heuristic. So any interesting problem (at least, any problem interesting to me) is only going to be aided (read: not solved exhaustively) by that hierarchy.

I'm okay with this. Being okay with this has been one of the most freeing things over the last 20 years of my career. Accept inaccuracy, and find usefulness in your data structures.

Well elucidated. This problem has irked me for years in the form of multiple inheritance. When it's disallowed (as in Java, unfortunately), having to reduce a directed graph structure to a single dominant hierarchy is quite the bothersome choice.
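For contrast, a small sketch in Python, which does allow multiple inheritance. The class names are hypothetical. Even here the language has to flatten the inheritance graph into a single order for method lookup (the method resolution order), so the reduction does not disappear; it just moves into the language.

```python
class Document:
    def describe(self):
        return "document"


class Shareable(Document):
    def describe(self):
        return "shareable " + super().describe()


class Versioned(Document):
    def describe(self):
        return "versioned " + super().describe()


# A diamond: WikiPage has two parents that share one grandparent.
class WikiPage(Shareable, Versioned):
    pass


# The linearized lookup order Python derives from the graph.
mro_names = [cls.__name__ for cls in WikiPage.__mro__]
print(mro_names)
print(WikiPage().describe())
```

Swapping the base order to `WikiPage(Versioned, Shareable)` changes the linearization, which is the same "who wins" question in a different place.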

I think all three problems are really one problem under the hood:

Are these two things actually the same thing, or are they separate?

For me, the canonical example is organising images in folders vs tags.

I think I've always called this "Ontology is hard". It's genuinely useful when it's used as a tool for clarification. It's constraining when it's used as a tool for modeling.

One nice tool for analyzing maps as a tree is the dominator tree. I wrote a bit about it here: https://neugierig.org/software/blog/2023/07/dominator.html
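For readers who haven't met the idea: node D dominates node N if every path from the root to N passes through D, and each node's immediate dominator becomes its parent in the dominator tree. Below is a minimal sketch using the simple iterative set-intersection method, not the faster algorithms used in practice and not necessarily what the linked post does. It assumes every node is reachable from the root; the graph is made up.

```python
def dominators(graph, root):
    """Return {node: set of nodes that dominate it}.

    `graph` maps each node to a list of successors. Assumes every node
    is reachable from `root`.
    """
    nodes = set(graph) | {s for succs in graph.values() for s in succs}
    preds = {n: set() for n in nodes}
    for n, succs in graph.items():
        for s in succs:
            preds[s].add(n)

    # Start from "everything dominates everything" and shrink to a fixpoint.
    dom = {n: set(nodes) for n in nodes}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for n in nodes - {root}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | set.intersection(*incoming) if incoming else {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def immediate_dominators(graph, root):
    """Return {node: parent in the dominator tree} for every non-root node."""
    dom = dominators(graph, root)
    idom = {}
    for n, ds in dom.items():
        if n == root:
            continue
        strict = ds - {n}
        # Strict dominators form a chain; the closest one to n is the one
        # with the most dominators of its own.
        idom[n] = max(strict, key=lambda d: len(dom[d]))
    return idom


# A diamond: two routes from "main" to "c", then a single edge to "d".
GRAPH = {"main": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"], "d": []}
print(immediate_dominators(GRAPH, "main"))
```

Note how "c" hangs directly off "main" in the resulting tree: neither "a" nor "b" is on every path to it, so the tree records only what is unavoidable and drops the rest of the graph.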

I thought the two hard problems were naming things, cache invalidation, and off-by-one errors?

Every few years I watch, with amusement, our management restructuring the organizational hierarchy, allegedly because the old one didn't work.

There are only two hard problems in computer programming:

1. Naming things

2. Cache invalidation

3. Off-by-one errors

I thought it was timezones.

Putting objects into trees is basically a caching problem.

I wonder whether the author deliberately avoided ontology? That's what comes to mind when I read this. The age-old debate between taxonomy and ontology.

The article veers from saying computers are different, to saying they should be different but maybe aren't, back to how special they are:

> The next time you sit down to an empty design doc and don’t know where to start, be kind to yourself. You’re solving a hard problem.

This supposed hard problem in computing has always been with us, in real life. Which he admits multiple times, e.g.:

> Yet Victorian-era gentlemen might have pondered the same questions while sorting letters as we do while sorting virtual paper.

He appears to claim that the sole organizing principle in real life is the hierarchy, but, of course, that computers and ideas are different:

> Hierarchies are so natural to us that they ... [work] for physical objects that can be in only one place at a time. Ideas and information, however, resist taxonomies. They form intricate webs that penetrate rigid boundaries.

This distinction of physical vs. virtual requirements doesn't hold up under any sort of rigorous analysis. As he admits, hierarchies are not always ideal in physical space -- do we organize parts and supplies separate from tools, or place them next to their probable job sites?

And of course, the "in only one place at a time" is certainly true for any given group of atoms, but we have become adept at making fungible copies of atoms for many things. I might have drywall screws or 33 ohm resistors in multiple cached locations, and I have soldering irons and screwdrivers and pliers on more than one workbench.

One thing that is true is that we can usually add non-hierarchical groupings to information more easily than we can to groupings of atoms.

Another thing that is true is that we already often do so whenever the convenience outweighs the various costs.

And the third thing that is true is that this, also, is not much different than the physical world, where we routinely both break our hierarchies and create copies of things when needed.

The first chapter of this waves away the fact that hierarchical filesystems are now useless, but it is still a fact. There is no more reason to organize your files than there is to drive around in a chariot. It is hard to map one domain to the other, but it is also not necessary. With AI indexing and recall it's less necessary than it has ever been.

This is more true as stated than people want to give credit for, usually.