It's not tested properly after I genericized it. Will try to go through it properly and add more updates.
Two big things on my TODO: 1) Make use of this indexing and using Claude's help, make video editing faster with Davinci Resolve (now that I have a good index of all the content)
2) I currently did this for videos, but I want to add more things to this for my thousands of still images of my camera - need to make sense of them. So I'll be working on this as well.
Not gonna lie, llama.cpp had the fans spinning at max speed. But it worked and I got the job done.
When your Claude wrote this post they might not have selected the right URL to share, unless your home folder is exposed. Care to share the skill files?
1. What is the search index?
2. The "description.md" example has things like "faces -> cluster_id". Is this from Davinci Resolve's face index? Things like faces+names and locations are really important with photo collections, but general LLMs don't handle them so well.
I am pretty sure that the vast majority of Airbnb hosts would not agree with you.
> equals TripAdvisor crucifixion
I have no idea how the Airbnb hosts with fake listings survive, really.
This is an excellent thing to do. Especially that LLMs excel at batching thus you can index multiple photos and videos in parallel for no performance penalty.
The idea of capable local models could be a huge unlock here if they are able to do the bottom-up context collection research / tagging / etc. at scale.
We ban accounts that do this and I don't want to ban you, so please write everything that you post to HN by hand.
Of course, it's impossible to know for sure what was LLM processed or not, but we're getting complaints about some of your posts and, upon inspection, the complaints seem justified.
But on the other hand, genuine videos do take time and slows down the process.
I have a Claude max sub and plenty of OpenRouter credit, but I donāt feel good about uploading my familyās private videos
This always confuses me - don't people want their computations to run as fast as possible and thus inevitably produce more heat that needs to be vented?
I suppose sometimes it is just an analogy for "its utilizing 100% of my resources" (which I'm guessing it is here), but I've definitely had people say it as an actual complaint in different contexts
I was vaguely aware of all these pieces existing (except for running a facial recognition database at home o_o), but it's really neat to put them all together like that.
Something which I can query later - Like when brainstorming with Claude "I wanna make some videos of the Luxury rooms in the lodge" and it knows what all videos could help here (going through the files).
There's also a folder root level files that aggregates the text descriptions to make it easier to find.
I've just attached an image in the blog showing an example - https://blog.simbastack.com/_media/gvcycx2n.png
2) No - nothing from DaVinci Resolve. Framedex is a standalone pipeline. Resolve isn't involved.
Faces come from insightface (the open-source buffalo_l pack - RetinaFace for detection), running locally on CPU. For each clip it detects faces in the sampled frames, embeds them, and writes rows to ~/.framedex/faces.db.
Tbh, this part I know it's building up in my local DB but I haven't tested how good is it. Will check them out properly soon.
But yeah, on your broader point that's why framedex deliberately does not ask the LLM to handle faces or locations.
----
Faces ā insightface / ArcFace embeddings. Deterministic, comparable across clips. The vision model only contributes a rough people_count; it never tries to identify anyone.
Locations ā EXIF GPS via exiftool, reverse-geocoded through Nominatim/OpenStreetMap. Hard metadata, not a guess.
The LLM only does what it's good at: scene description, mood, shot type, keywords, keep/review/cull rating (this last part is also debatable though).
(Also email is in my profile).
So if you give it a bunch of screenshots it will try and intelligently name them based upon what is in the screenshot. Same for videos, PDFs, etc.
But to your point I haven't even tried charging money as it feels like something Apple is just going to bake in as a feature.
But I can tell it's only a matter of time before agents become smart enough to let my non-tech friends be able to just say "Make sense of all these videos in my folder" and it just does it.
But because of the fear of non-perfection, I used to put away things like creating this article or even posting it anywhere. And I do think the article has real value that HN would appreciate (I am myself an HN-enthusiast).
I'll try more. Someone else shared this project which would be really helpful - https://github.com/blader/humanizer
Also a side note, the blog is posted on my self-created Slopit.io platform which is purely meant for your personal agents (working along with you) to post content - I recommend trying it out. https://blog.slopit.io/this-blog-post-is-slop/
I know, things are getting difficult with all the slop around, but my personal opinion is, as the agents get better at writing, the "annoying-ness" factor reduces and pieces of substance will still be appreciated, even if it was written by agents. This and the fact that agents aren't going away.
If I've automated a lot of my coding, I feel like engineers like me would naturally progress to also taking agents' help to write useful content.
PS - this comment was 100% hand-typed.
PS - I just put this together in the last few mins, removed my personal files and references. So it's not tested properly, please let me know if any issues.
It's still an early hack, but I have thousands of still images as well from my camera which I've not processed and I need to do the same analysis for those.
So I'll continue working on it, but happy to receive any PRs if anyone finds any use for it.
I'm tired of having a backlog of thousands of images and videos, leaving it for later.
Bear in mind that ttft on MLX is much much faster on M5 Pro as compared to M4 Pro.
Also bear in mind that those figures are with NO optimizations whatsoever: no MCP, no DFlash. I am waiting for both to be released for the Qwen models.
Using API to analyze even a subset of this would've been painful imo.
The tells were unmistakable but it still had a human touch, so I for one am glad you published anyway.
Still blows my mind I can do all this from my 2021 MBP.
I'll try to do a post once I have the next steps working (helping with planning and editing videos with Davinci Resolve).
https://github.com/blader/humanizer
You get a pass here because you're doing really cool stuff but it's kinda tough to read past the AI nonsense, and it's relatively easy to screen out "it's not x it's y" kind of things and the bolded bullet points.
Great job. Long live the M1 Max!
Hiding these clues by another AI pass doesn't solve the core problem. Now you just end up with content that camouflaged better but is still equally low in nutritional value.
Tbh, I have a lot of thoughts and ideas and things to share and I do spend time and effort trying to de-AI-ing it but this should help a lot.
I'll try it out.
In fact, I was expecting getting shit on by HN readers for this but was pleasantly surprised that readers moved past it.
Llama is about 1/3 slower on Apple Silicon.
Although knowing how good these local models are getting, I am now eyeing the upcoming M5 Ultra Mac Studio (256gigs perhaps). But knowing how crazy the market is, it might be a year before I get the chance to get my hands on it. If it even launches by WWDC.
> I also use a lot of AI but you really have to demand quality from it, whether it's writing, media, or code. It's clear you've got the taste from your media work, and we're all still learning as we go...
Their use of AI for "media work" has shown a taste but their writing usage still needs to equal that.
I'm more hot about it because it's frustrating having so many HN posts be a place for people to work out first drafts, especially when the first piece of feedback is "hey, uh, you clearly used AI and it's horrible to read as a result." So easy to avoid...good on you for being kinder.
(part of my frustration is I was excited because I write an local LLM client and thought I missed Gemma 4 has streaming video input support, but after reading through the slop it turns out its just the ol' "extract frames" workflow. tbf that would have happened AI or not, but put me in a mood)
May 21, 2026
I'm in the Maasai Mara about half the year, in three-month stretches. Animals out the front of the lodge, motorcycles, friends in the Maasai villages, kids who think a drone is the funniest thing they have ever seen. That's one half of my year. The other half is sixteen-hour days in front of a terminal, Silicon Valley hacker brain on Africa time. Both real, both consuming attention.
The first half is a constant flood of footage from the iPhone, the DJI Pocket, the drone, the Nikon Z8, and lately the Ray-Ban Metas too. There's always something being recorded. Every photographer or videographer I know is sitting on the same problem: an archive that grows faster than they can edit it. The second half is why mine never gets touched.
Airport security somewhere between Nairobi and Spain. Two trays of cameras, headphones, drone bits, batteries, SSDs, more cables than anyone needs. Most of it records something. Almost none of what they record gets touched again any time soon.
Three months ago the lodge's social channels went dark. Not for lack of content; the lodge has years of raw footage across multiple SSDs. The bottleneck was editing time, and my time disappeared. Claude Code with Opus 4.5 (and then 4.6) hit the point in February where you could leave agents running for hours and come back to merged PRs. KaribuKit was going live with its first paying property in the same window. I stopped sleeping properly, started running three or four agents in parallel in the background, and the months when I would have cut reels turned into months when I shipped software instead.
So one weekend I sat down to fix it. The first thing I tried was wrong.
The initial pitch (to myself, after about an hour of research) was a SaaS stack: Eddie AI for iterative editing, Higgsfield MCP for generative B-roll, Submagic for captions, Buffer for cross-posting. About $140 a month, slick on paper.
Two problems showed up before I ran any of it.
First, generative AI video has no place on a real travel brand. Guests pay $300 a night and up to see the actual place, and mislabeled AI shots equals TripAdvisor crucifixion. Higgsfield out.
Second, 3-5 posts a week was aggressive for me, and the realistic floor was more like 2-3. The pitch was optimistic in a way that would have me failing by week two.
Then I remembered I already own DaVinci Resolve Studio, and Resolve 21 ships IntelliSearch (semantic clip search), Smart Bins (auto-organizing folders), and Voice to Subtitle that produces 90-95% accurate captions on the timeline. That's roughly 70% of what Eddie sells, so Eddie was out too.
What I was left with was Claude Code driving Resolve via the open-source DaVinci Resolve MCP, with ElevenLabs handling voiceover on informational clips where it earned its place, and the cost had dropped from $140 a month to $22.
But the deeper thing only landed once I tried to actually use any of this. Every AI video editor on the market assumes your footage is already labeled. Mine is IMG_*.mov and DJI_*.mp4 across folders with names like Mara june 2024 backup final FINAL. Eddie can search by transcript, but none of these tools can find "the elephant on the hill at golden hour" against an unlabeled archive.
The AI editor is solving the wrong problem. Or more precisely, it's solving the second problem; the first problem is the index.
I asked it out loud: how does the agent know what's in each clip?
There's no answer for an unlabeled archive. You can throw transcripts at it, GPS coordinates, filenames, parent folders. None of that gives you "the wide shot at sunrise with the giraffe in the frame" unless something has actually looked at the pixels.
The leverage is upstream. Build the index first, make the archive queryable in English, and the editor on top becomes a thin layer doing what it was designed to do.
So I built the index, locally.
This is the kind of AI-native build I do for clients at SimbaStack, except I was both the client and the engineer this time, which made the decision tree a lot shorter.
Four constraints set the shape:
.description.md per clip, living right next to it, plain text and grep-able. It survives if my indexer breaks tomorrow, and it travels with the data when files move between drives.The per-clip pipeline:
ffprobe for metadata.exiftool for GPS lat/lon/altitude. Works on iPhone, DJI Pocket, drone footage, all the same.ffmpeg extracts five evenly-spaced frames at 1920px.insightface detects faces and stores 512-dim ArcFace embeddings in a centralized SQLite face DB for cross-archive person queries later.Here's what that looks like on a real clip from the Mara Hilltop archive.
One frame from IMG_1103.MOV. Ellie on the deck of one of the luxury tents at the lodge, midday. None of that context lives in the filename.
The sidecar Gemma wrote for the same clip. YAML on top (lighting enum, time-of-day enum, color palette, face embeddings, GPS), prose ## Description below. It picked up the safari-tent setting, the camera pan from interior to savanna, the shot type, and suggested two use cases (marketing reels and travel-vlog B-roll). The filename had IMG_1103.MOV; the sidecar has the rest of what I needed to find it again.
A real Mara Hilltop archive folder after the indexer has run through it. Every clip has a .description.md sidecar next to it; the _INDEX.json and _INDEX.md at the top are folder-level rollups for fast grep and LLM-friendly handoff.
The whole thing is a Claude Code skill, about 1,400 lines of Python. Claude Code wrote almost all of it. My work was the architecture, the prompts, the schema design, and the bug triage when things went wrong.
This is the part that actually surprised me.
I bought a 16-inch MacBook Pro M1 Max with 64GB of RAM in 2021, and the reason had nothing to do with LLMs. I'd been hitting 32GB limits on my previous machine for a while. A messy hacker brain running hundreds of Chrome tabs alongside DaVinci Resolve, Slack, Discord, and Drive was too much for pre-unified-memory hardware to handle without paging constantly. I maxed out the RAM on the new M1 Max because the old one wouldn't stop killing my workflow and I had the money to fix it.
Five years later, that same laptop is running Gemma 4 31B Q4 in LM Studio against a year of video footage.
LM Studio with Gemma 4 31B Q4 loaded. 28.40 GB of model in memory, REST API at 127.0.0.1:1234. The bottom panel is the server log during a real bulk run, encoding frames one clip at a time.
The bulk run pushed the laptop past where 64GB of RAM alone would carry it. Activity Monitor reported 50.89 GB of swap at the peak.
64 GB of physical RAM, 50.89 GB of swap used. Memory pressure in the yellow band, the kind of state you absolutely should not run on a normal Tuesday. Apple's swap is designed for it, and the fans were loud.
I Googled whether that would damage the SSD, and apparently for a day or two it's fine. Don't make it your normal operating state, but a weekend of pushing the machine hard is well within tolerance. My laptop ran hot, the fans spun up, and it kept producing sidecars while I worked on other things.
The M1 Max 16-inch is, honestly, legendary. People in the Mac community talk about it that way for good reason: five years on, it's running 31B-parameter models at usable speed with the kind of headroom that should not exist on hardware this old. I expect another three to five years out of this thing, comfortably, because local LLMs only get more efficient and the hardware is the floor, not the ceiling.
I bought it for Chrome. It's running a model that didn't exist when I bought it.
The build was mostly Claude Code holding the pen. The interesting work was the four times it almost shipped something wrong.
WhisperX 3.8 broke its diarization API between when I last touched it and now. Two breaking changes had landed: whisperx.DiarizationPipeline moved to the whisperx.diarize submodule, and the constructor kwarg use_auth_token was renamed to token (inherited from pyannote 3.x). The fix was signature introspection: the script tries token= first and falls back to use_auth_token= if the constructor raises a TypeError, so it survives the next API shuffle automatically. When you're shelling out to AI libraries that move this fast, defensive constructor calls are cheap insurance.
The Claude CLI returns permission errors as successful responses. On the first test of the CLI backend, all four sidecars came back identical with the text "I need permission to read the image frames...", and the script's success check passed because exit code was 0 and the output wasn't empty. The cause was that in non-interactive mode without --permission-mode bypassPermissions, the CLI returns the permission-denial text as the response body instead of prompting, which means the failure mode looks exactly like success unless you string-match for it. The fix was adding the flag plus a defensive check that flags any short response containing "I need permission" as an error rather than a description. When you script AI tools, the non-interactive permission flow is where the silent failures hide.
Gemma returned people_count: "many" instead of an integer. My vision prompt literally said integer or the string "many" if >10. Gemma followed instructions correctly; the bug was schema design. The fix was a stricter prompt (integer 0-99 with explicit guidance to estimate) plus a coercion in the parser for the legacy "many" responses. Don't union-type schema fields. Pick always-int or always-string, never "int or this one specific string," because every downstream consumer pays for the choice.
Then there was the motorcycle clip that shouldn't have been culled. My initial cull prompt was photographer-portfolio-shaped: heavy motion blur, soft focus, and jittery stability got rated cull. Technically correct. Then I tested it on a handheld nighttime motorcycle clip from a Spain trip and it culled it. I caught it: that's a fun memory, the blur is the vibe. I reframed the cull criteria to "not a real recording" only (lens cap, pocket footage, two-second test clips, fully clipped exposure), not "imperfect capture." Photo archives cull aggressively; video memories cull permissively. Same schema, different criteria, and you have to be explicit about which mode you're in.
Three things I now believe more strongly than I did a week ago.
Enum constraints beat instructions for confabulation prevention. I tested Gemma 4 E4B on a coworking-space photo I'd taken at night, and it described the scene as "brightly lit, abundant natural light, floor-to-ceiling windows," except the windows were pitch black outside, because it was night. Then I tested 31B with a structured schema prompt that forces the model to pick from golden_hour | bright_daylight | overcast | dim_interior | nighttime | mixed | unclear, and both thinking-off and thinking-on recovered nighttime correctly. A model can lie about open-ended prose, but it can only mis-pick from an enum, never invent a new value. Use schemas, not instructions.
Local 31B with structured prompts closes most of the gap to cloud. Gemma 4 31B Q4 thinking-off against a structured schema produces output that's hard to distinguish from Sonnet 4.6 on most of my test clips. The cloud premium earns its keep on the hard 10-20%. Bulk indexing at scale (thousands of clips overnight) should run local; cloud is the re-rate pass on clips local flagged as review. That two-tier setup is the one that scales.
AI video editors are pitched one layer too high. The valuable layer is the index. Once your archive is queryable in plain English ("show me handheld interior clips from Mara, golden hour, with people, longer than 8 seconds"), the editor on top is straightforward. Most of the AI-editor space is competing for the surface above an index that doesn't exist, and the index is the prerequisite they're all skipping past.
Looking back, time wasn't really what kept this from getting fixed sooner. I had every AI superpower currently available pointed at the work side of my life: Claude Code refactoring codebases overnight, Codex writing most of my pull requests, the agentic stack I'd just spent three months using to ship KaribuKit. On the editing side, I was using none of it. The not-getting-to-it had become its own small, low-grade frustration that lived in the back of my head all year, the kind of thing you notice every time you open a folder on the SSD and close it again without doing anything. What clicked one Saturday was that the editing backlog was a tooling problem, and tooling is the one kind of problem I happen to be well-equipped to fix right now.
This weekend I'm building the editor: Claude Code as the orchestrator, DaVinci Resolve MCP for the cuts, ElevenLabs for voiceover on informational clips. There's one hard rule baked into the tooling: the voice clone is for utility content only. Directions, room descriptions, multilingual versions, factual stuff I'd say in person anyway. Never for testimonials or founder messages. Disclosure laws are real in 2026, and trust in a hospitality brand is too easy to lose.
The index makes all of that tractable. Without it, I would still be scrubbing through 47GB of DJI Pocket footage looking for the sunrise wide.
For now: a year of Mara Hilltop footage is queryable in English on a five-year-old laptop. Cost was a weekend of my time and 50GB of swap. The remaining years across older SSDs are next.
A fair check on all of this: Mara Hilltop's social channels are still dead today. The indexer solves only half the problem (finding the right clip); the editor that turns those clips into finished reels is the other half, and that's the part I'm building this weekend. If it works, the channels light back up and I write part two. If it doesn't, I write about why.
In all honesty, the right answer here might be to hire someone. Finding an editor with the right sensibility for Mara Hilltop (warm, observational, no over-cut MTV-energy reels) is harder than writing another skill. If you know someone who works in that register, send them my way.
Edit: code at github.com/Simbastack-hq/framedex. Thanks to the HN commenter who flagged that the original local-path reference wasn't useful. If you're working on something similar (indexing personal archives, getting a local model to do real archival work, building agents that drive editing tools), I'd be glad to compare notes.
ā NJ
Building KaribuKit (AI-native PMS for hospitality), running Mara Hilltop (eco-lodge in the Maasai Mara), and consulting through SimbaStack.
#local-llm#claude-code#video-archive#mara-hilltop#simbastack