I've got scraps of paper and post-it notes, and I just throw them away after they've been sitting around for a while and I've forgotten what they're about.
The premise of Atomic, the knowledge base project I'm currently working on, is that there is still significant value in vectors, even in an agentic context. https://github.com/kenforthewin/atomic
Isn’t the biggest benefit of graph databases the indexing and additional query constructs they support, like shortest path finding and whatnot?
I've started to think a fine-tuned model may be needed, specifically for something like "journal data retrieval". Is anyone aware of any existing models for this? I'd do it myself, but since I'm unwilling to send larger parts of my data to third parties, I'm struggling to collect actual data I could use for fine-tuning, leaving me in a bit of a catch-22.
For some client projects I've experimented with the same idea too, with fewer restrictions, and I guess one valuable lesson is that letting LLMs write docs and add them to a "knowledge repository" tends to end up with a mess. The best success we've had is limiting the LLM's job to organizing and moving things around, never adding its own written text. Quality seems to slowly degrade as the context fills up with the model's own text, compared to when it relies only on human-written notes.
Clicked the link expecting to see some tool or method that actually allows graph-like queries and traversals on files in a file system, all I found was some rant about someone on the internet being wrong.
Waste of time.
It can only handle three-way cross references by using 2 folders and a file for now (meta), and it's very verbose on disk (needs type=small, otherwise inodes run out before disk space)... but it's incredibly fast and practically unstoppable in read uptime!
Also, the simplicity of using text and the file system sort of guarantees longevity and stability, even if most people like the monolithic garbled mess that is relational databases' binary table formats...
Anyway, why care how the data is stored? You need a catalog. You need an index. You need automation. It helps keep order and helps with inevitable changes and flips and pivots and whims and trends and moods and backups and restoration and snapshots and history and versioning and moon travels and collaboration and compatibility and long summer evening walks and portability.
Folders give you hierarchical categories.
You still want tags for horizontal grouping. And links and references for precise edges.
But that gives you a really nice foundation that should get you pretty damn far.
I also now tell the LLM to add a summary as the first section if the file is longer.
In somewhat of an inversion, I've been getting the initial naming done by an LLM (well, I was, until CoPilot imposed file upload limits and the new VPN blocked access to it) --- for want of that, I just name each scan by Invoice ID, then use a .bat file made by concatenating columns in a spreadsheet to rename them to the initial state ready for entry.
I'm currently experimenting with Tobi's QMD (https://github.com/tobi/qmd) to see how it performs with local models only on my Obsidian vault.
> I'm struggling collecting actual data I could use for fine-tuning myself,
Journalling or otherwise writing is by far the best way to do this IMO but it doesn't take very much audio to accurately do a voice-clone. The hard thing about journalling is that it can actually be really biased away from the actual "distribution" of you, whether it's more aspirational or emotional or less rigorous/precise with language.
What I'm starting to do is save as many of my prompts as possible, because I realized a lot of my professional writing was there and it was actually pretty valuable data (especially paired with outputs and knowledge of what went well and what didn't) for fine-tuning on my own workloads. Second is assembling/curating a collection of tools and products that I can drop into each new context with LLMs and also use for fine-tuning them on my own needs. Unlike "knowledge repositories", these both accurately model my actual needs and work, and don't really require me to do anything unnatural.
The other thing I'm about to start doing is "natural" in a certain sense but kinda weird, basically recording myself talking to my computer (verbalizing my thoughts more so it can be embedded alongside my actions, which may be much sparser from the computer's perspective) / screen recordings of my session as I work with it. This is something I've had to look into building more specialized tools for, because it creates too much data to save all of it. But basically there are small models, transcoding libraries, and pipelines you can use for audio/temporal/visual segmentation and transcription to compress the data back down into tokens and normal-sized images.
This is basically creating a semantic search engine of yourself as you work, kinda weird, but IMO it's just much weirder that your computer can actually talk back and learn about you now. With 96GB you can definitely do it BTW. I successfully finetuned an audio workload on gemma 4 2b yesterday on a 16GB mac mini. With 96GB you could do a lot.
> letting LLMs write docs and add them to a "knowledge repository"
I think what you actually want them to do is send them to go looking for stuff for you, or actively seeking out "learning" about something like that for their own role/purposes, so they can embed the useful information and better retrieve it when they need it, or produce traces grounded in positive signals (eg having access to this piece of information or tool, or applying this technique or pattern, measurably improves performance at something in-distribution to whatever you have them working on) they can use in fine-tuning themselves.
https://neo4j.com/docs/graph-data-science/current/algorithms...
My “knowledge” is spread out on various SaaS (Google, slack, linear, notion, etc). I don’t see how I can centralize my “knowledge” without a lot of manual labour.
Deep inside a project dir, it feels like some of the ease of LLMs is just not having to cd into the correct directory, but you shouldn't need an LLM for that. I'm gonna try setting up some aliases like "auto cd to wherever foo/main.py is" and see how that goes.
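For the curious, here's a minimal sketch of that "auto cd to wherever foo/main.py is" idea. The script and alias names are made up for illustration; it just prints the directory of the first file whose path ends with the given suffix:

```python
#!/usr/bin/env python3
"""Hypothetical helper: print the directory containing the first file
whose relative path ends with the given suffix (e.g. foo/main.py)."""
import os
import sys

def dir_of(path_suffix, root="."):
    # Walk the tree; return the directory of the first matching file.
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if os.path.join(dirpath, name).endswith(path_suffix):
                return dirpath
    return None

if __name__ == "__main__" and len(sys.argv) > 1:
    target = dir_of(sys.argv[1])
    if target is None:
        sys.exit(1)
    print(target)
```

Paired with a shell function like `cdto() { cd "$(python3 cdto.py "$1")"; }` (again, hypothetical names), that gets you most of the way without an LLM in the loop.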
Maybe that's why mind maps never spoke to me. I felt that a tree structure (or even planar graphs) was not enough to cover any sufficiently complex topic.
1. Why does AI need that folder structure? Why not a flat list of files and let the AI agent explore with BM25 / grep, etc.
2. pre-compute compression vs compute at query time.
Karpathy (and you) are recommending pre-compressing and sorting the data into human-friendly buckets and language, based on hard-coded human opinions about abstractions that may or may not match how the data will actually be queried.
Why not just let the AI calculate this at run time? Many of these use cases have very few files and for a low traffic knowledge store, it probably costs less tokens if you only tokenize the files you need.
Now just need to find a good way to maintain the order...
You may want to do as described and link to Slack messages (etc), but just remember any external link should be treated as ephemeral. You may not have access to the Slack anymore, for example. That may mean you don't need that note either, or it may mean you lost access to a node on your knowledge graph, you have to determine whether that matters.
By starting now, at least everything going forward is captured in a way you can both own and utilize it. Then it may be a bit of a pain and some manual work to get existing notes into your tool of choice, but you can determine what needs to be in there from other tools as you go forward.
Progressive disclosure: the same reason you don't get assaulted with all the information a website has to offer at once, or handed a SQL console and told to figure it out, and instead see a portion of the information in a way that naturally leads you to the next and next bits of information you're looking for.
> use cases
This is essentially just where you're moving the hierarchy/compression, but at least for me these are not very disjoint and separable. I think what I actually want are adaptable LoRAs that loosely correspond to these use cases, but where a dense discriminator or other system is able to adapt and stay in sync with them too. Also, tool-calling + SQL/vector embeddings, so that you can actually get good filesystem search without it feeling like work, and let the model filter out the junk.
> let the AI calculate this at run time?
You still do want to let it do agentic RAG but I think more tools are better. We're using sqlite-vec, generating multimodal and single-mode embeddings, and trying to make everything typed into a walkable graph of entity types, because that makes it much easier to efficiently walk/retrieve the "semantic space" in a way that generalizes. A small local model needs at least enough structure to know these are the X ways available to look for something and they are organized in Y ways, oriented towards Z and A things.
Especially on-device, telling them to "just figure it out" is like dropping a toddler or autonomous vehicle into a dark room and telling them to build you a search engine lol. They need some help and also quite literally to be taught what a search engine means for these purposes. Also, if you just let them explore or write things without any kind of grounding in what you need/any kind of positive signals, they're just going to be making a mess on your computer.
Two reasons I think:
Coding agents simulate similar things to what they have been trained on. Familiarity matters.
And they tend to do much better the more obvious and clear a task is. The more they have to use tools or "thinking", the less reliable they get.
Today, I can use even the small models from OpenAI and Anthropic to get valuable sessions, but if I wanted to actually use those for fine-tuning a local model, I'd need to start sending the data I want to use for fine-tuning to OpenAI and Anthropic, and considering it's private data I'm not willing to share, that's a hard no.
So then my options are basically using stronger local models so I get valuable sessions I can use for fine-tuning a smaller model. But if those "stronger local models" actually worked in practice to give me those good sessions, then I'd just use those, but I'm unable to get anything good enough to serve as a basis for fine-tuning even from the biggest ones I can run.
I'd love to "send them to go looking for stuff for you", but local models aren't great at this today, even with beefy hardware, and since that's about my only option, that leaves me unable to get sessions to use for the fine-tuning in the first place.
Which is great, but on all major OSes you'd eventually hit performance issues with flat directories like this. Might not be an issue in month one, or even year one, but after 10 years of note taking/journaling, that approach will run into the problems of large flat directories.
So eventually you'd need to shard it somehow, so you might as well start categorizing/sorting things from the get-go, at least into some broad major categories, because doing it once you already have 10K entries in a directory sucks big time.
It doesn't. The human creating the files needs it, to make it easier to traverse in future as the file count grows. At 52k files, that's a horrendous list to scroll through to find the thing you're looking for. Meanwhile, an AI can just `find . -type f -exec whatever {} \;` and be able to process it however it needs. Human doesn't need to change the way they work to appease the magic rock in the box under the desk.
Do you still have your prompt, by chance, and are you willing to share it? I took a stab at this and it didn't want to make many changes. I think I need to be more specific, but I'm not sure how to do that in a general way.
If real organization is needed, it seems like that'd be easier in hindsight than with foresight.
and then worked from there, giving feedback on the proposed folder structure, until I was happy
Basically you need a squad of specialized models to do this in a mostly-structured way that ends up looking kind of like a crawling or scraping/search operation. I can share a stack of about 5-6 that are working for us directly if you want, I want to keep the exact stack on the DL for now but you can check my company's recent github activity to get an idea of it. It's basically a "browser agent" where gemma or qwen guide the general navigation/summarization but mostly focus on information extraction and normalization.
The other thing I've done, which obviously not everybody is going to want to do, is create emails and browser profiles for the browser agent (since they basically work when I'm not on the computer, but need identity to navigate the web) and run them on devices that don't have the keys to the kingdom. I also give them my phone number and their own (via an endpoint they can only call me from). That way if they run into something they have a way to escalate it, and I can do limited steering out of the loop. Obviously this is way more work than is reasonable for most people right now though so I'm hoping to show people a proper batteries-included setup for it soon.
Edit: Based on your other comment, I think maybe what you're really looking for most are "personal traces". Right now that's something we're working on with https://github.com/accretional/chromerpc (which uses the lower-level Chrome DevTools Protocol rather than Puppeteer to basically fully automate web navigation, either through an LLM or prescriptive workflows). It would be very simple to set up automation to take a screenshot and save it locally every Xm or in response to certain events, and generate traces for yourself that way, if you want. That alone provides a pretty strong base for a personal dataset.
Symbolic links can form a graph, and you can process them as needed using readlink etc. to traverse the graph, but they'll still be considered broken if they form a cycle.
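A cycle-safe traversal along those lines might look like this. A sketch only, using Python's `os.readlink` to follow the link chain with a visited set, since the OS itself would report a cyclic chain as broken (ELOOP):

```python
import os

def walk_link_graph(start, max_hops=64):
    """Follow a chain of symlinks from `start`, collecting each node.
    A visited set makes traversal cycle-safe even though open() on a
    cyclic link chain would fail with ELOOP."""
    path = os.path.abspath(start)
    seen = {path}
    chain = [path]
    while os.path.islink(path) and len(chain) <= max_hops:
        target = os.readlink(path)
        # readlink may return a relative target; resolve against the link's dir
        path = os.path.abspath(os.path.join(os.path.dirname(path), target))
        if path in seen:
            break  # cycle detected: stop instead of erroring like open() would
        seen.add(path)
        chain.append(path)
    return chain
```

The same visited-set idea generalizes to wikilink graphs or any other edges you layer on top of the file system.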
why? The human would just talk to the AI agent. Why would they need to scroll through that many files?
I made a similar system with 232k files (one file might be a Slack message, GitLab comment, etc). It does a decent job at answering questions with only keyword search, but I think I can get better results with RAG+BM25.
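For anyone wanting to try the BM25 half, the scoring function itself is only a few lines. This is a plain Okapi BM25 sketch (with the usual k1=1.5, b=0.75 defaults), scoring tokenized docs against query terms:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25: score each doc (a list of tokens) against query terms."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency per term
    df = Counter()
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

In practice you'd pair this with an inverted index rather than scoring every file, but at a few hundred thousand short docs even the naive version is workable.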
Just because AI exists doesn't mean we can neglect basic design principles.
If we throw everything out the window, why don't we just name every file as a hash of its content? Why bother with ASCII names at all?
Fundamentally, it's the human that needs to maintain the system and fix it when it breaks, and that becomes significantly easier if it's designed in a way a human would interact with it. Take the AI away, and you still have a perfectly reasonable data store that a human can continue using.
Sure, but what I'm talking about is that the current SOTA local models are terrible even for specialized small use cases like what you describe, so you can't just throw a local model at that task and get useful sessions out of it that you can use for fine-tuning. If you want distilled data or similar, you (obviously) need to use a better model, but currently there is none that provides the privacy guarantees I need, as described earlier.
All of those things come once you have something suitable for the individual pieces, but I'm trying to say that none of the current local models come close to solving the individual pieces, so all that other stuff is just distraction before you have that in place.
But again, you're missing my point :) I cannot, since the models I could generate useful traces from are run by platforms I'm not willing to hand over very private data to, and local models that I could use I cannot get useful traces from.
And I'm not holding out hope for agent orchestration; people haven't even figured out how to reliably get high-quality results from one agent yet, even less so from a fleet of them. Better to realistically temper your expectations a bit :)
Karpathy recently posted about using LLMs to build personal knowledge bases — collecting raw sources into a directory, having an LLM “compile” them into a wiki of interlinked markdown files, and viewing the whole thing in Obsidian. He followed it up with an “idea file,” a gist you can hand to your agent so it builds the system for you.
This is a great idea, and I’ve been doing some form of it for over a decade. My Staff Eng co-host @davidnoelromas reached out after the tweet to ask for more details on how I’ve been using Obsidian and AI. This is an expanded version of what I told him.
I’ve collected possibly too many markdown files.
```
find . -type f | wc -l
52447
```
That’s my obsidian vault, and I use it with AI everyday without a special database, or a vector store, or a RAG pipeline. It’s merely files on disk.
Think about the context you carry around in your head for your job. The history of decisions on a project. What you discussed with your manager three months ago. The Slack thread where the team landed on an approach. The Google Doc someone shared in a meeting you half-remember. The slowly evolving understanding of how a system works that lives across fifteen people’s heads and nowhere else.
Now think about what happens when you need to produce something from all that context. A design doc. A perf packet. A project handoff. An onboarding guide for a new team member. You spend hours reassembling context from Slack, docs, emails, your own memory, and you still miss things.
The knowledge base turns this into a system instead of a scramble.
A file system with markdown and wikilinks is already a graph database. Files are nodes. Wikilinks are semantic edges. Folders introduce taxonomy. You don’t need a special MCP server or plugin. The file system abstraction is the interface, and LLMs are surprisingly good at navigating it.
I use a structure borrowed from Tiago Forte’s Building a Second Brain, with the PARA taxonomy as a starting point, extended with categories that match how I actually work:
```
/projects/{name}
/areas/{topics}
/people/{slack_handle}
/daily/{year}/{month}/{day}/
/meetings/{year}/{month}/{day}/
```
Markdown files are nodes, wikilinks ([[target]]) are edges, the folder taxonomy is the schema, and the LLM is the query engine. A graph database with a natural language query interface. No infrastructure required.
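To make "files are nodes, wikilinks are edges" concrete, here's a minimal sketch that extracts the edge list from a vault. The regex and function name are my own, not part of Obsidian or any tool mentioned here; it handles the `[[target]]` and `[[target|label]]` forms:

```python
import os
import re

# Match the target of [[target]] or [[target|label]] wikilinks.
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def build_graph(vault_root):
    """Map each markdown file (node) to the wikilink targets (edges) it references."""
    graph = {}
    for dirpath, _dirs, files in os.walk(vault_root):
        for name in files:
            if not name.endswith(".md"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as f:
                text = f.read()
            node = os.path.relpath(path, vault_root)
            graph[node] = [m.strip() for m in WIKILINK.findall(text)]
    return graph
```

Once you have that adjacency map, backlinks, orphan detection, and simple traversals are dictionary operations, no graph database required.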
After every meeting, the agent creates a note in daily/{year}/{month}/{day}/, downloads any attached Google Docs, and links everything to the long-running notes I keep for each person I interact with regularly. A note from a 1:1 with my boss JP gets a wikilink to [[/people/jp|jp]] and to whatever projects we discussed.
Over months, each person’s note becomes a timeline of every conversation, decision, and open thread. Each project folder accumulates every relevant artifact. You don’t have to remember where things are. The graph remembers.
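A rough sketch of what that note-creation step could look like, assuming a vault layout like the one above. The function name, filename scheme, and note template are my own invention, not the author's actual tooling:

```python
import datetime
import os

def create_meeting_note(vault, title, people, projects, body=""):
    """Create today's meeting note under daily/{year}/{month}/{day}/ and
    wikilink it to the relevant person and project notes."""
    today = datetime.date.today()
    folder = os.path.join(vault, "daily", f"{today:%Y}", f"{today:%m}", f"{today:%d}")
    os.makedirs(folder, exist_ok=True)
    links = [f"[[/people/{p}|{p}]]" for p in people]
    links += [f"[[/projects/{p}|{p}]]" for p in projects]
    note = f"# {title}\n\n{' '.join(links)}\n\n{body}\n"
    path = os.path.join(folder, f"{title.lower().replace(' ', '-')}.md")
    with open(path, "w", encoding="utf-8") as f:
        f.write(note)
    return path
```

The point isn't the specific template; it's that every note lands in a predictable place and carries its edges from day one, so the person and project timelines accumulate for free.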
For a work project, I can point the agent at a starting doc and say: “Spider through every tool you have access to and pull down all the related context.” It grabs Slack threads, Google Docs, web resources, all rendered as markdown inside the project folder. From that assembled context, the agent can draft design docs, product vision statements, problem/solution analyses. The output is better than prompting cold because the LLM is working with the real history of the project, not your summary of it.
This is the part Karpathy’s tweet hints at but doesn’t fully spell out: the knowledge base isn’t just for research. It’s a context engineering system. You’re building the exact input your LLM needs to do useful work.
You might be thinking: I already ask Claude to help me write a design doc. True. But there’s a real difference between prompting “help me write a design doc for a rate limiting service” and prompting an LLM that has access to your project folder with six months of meeting notes, three prior design docs, the Slack thread where the team debated the approach, and your notes on the existing architecture.
The knowledge base is a context engineering system. You’re not building a wiki for the sake of having a wiki. You’re building the input layer that makes every future LLM interaction better. Every meeting note, every linked decision, every filed artifact improves the quality of every query that follows.
The piece I haven’t cracked is automated inbox processing. The idea is straightforward: web clippings, meeting notes, Slack saves, and random captures all land in an inbox folder. The agent processes everything new, applies progressive summarization, breaks content into atomic pieces, correlates each piece with the right project, area, or person.
I have a graveyard of experiments here. The LLM is good at summarizing and categorizing. The hard part is defining what “processed” means in a way that’s consistent enough to be useful six months later but flexible enough to handle the variety of stuff that lands in an inbox. Every attempt has been either too rigid (everything gets the same treatment) or too loose (the vault drifts into chaos).
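For reference, the "too rigid" end of that spectrum is easy to sketch: a keyword router that files anything matching a pattern and leaves the rest in the inbox for a human (or a smarter model) to triage. The rule list and folder names here are hypothetical:

```python
import os
import re
import shutil

def route_inbox(vault, rules):
    """Move each inbox markdown file to the first destination whose pattern
    matches its text. `rules` is an ordered list of (regex, dest-folder)
    pairs; anything unmatched stays in the inbox for manual triage."""
    inbox = os.path.join(vault, "inbox")
    moved = {}
    for name in sorted(os.listdir(inbox)):
        if not name.endswith(".md"):
            continue
        src = os.path.join(inbox, name)
        with open(src, encoding="utf-8") as f:
            text = f.read()
        for pattern, dest in rules:
            if re.search(pattern, text, re.IGNORECASE):
                dest_dir = os.path.join(vault, dest)
                os.makedirs(dest_dir, exist_ok=True)
                shutil.move(src, os.path.join(dest_dir, name))
                moved[name] = dest
                break
    return moved
```

This is exactly the "everything gets the same treatment" failure mode described above, but it makes a useful baseline: the interesting open problem is what replaces the static rule list.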
If you’ve solved this, I’d genuinely like to hear about it.
You don’t need 52,000 files to get value from this. Start with three things:
One: Create the folder structure. Projects, areas, people, daily. Even empty, the taxonomy gives you and the LLM a schema.
Two: After your next meeting, have the agent create a note and link it to the relevant person and project. Do this for a week. Watch the graph start to form.
Three: The next time you need to write something (a design doc, a status update, a perf self-review), point the agent at the relevant folders and ask it to draft from what’s there.
The difference is noticeable right away. Not because the LLM is smarter, but because it finally has the context to be useful.
Your work compounds. That’s the thing that feels genuinely new.