Monday May 11 2026

Hacker Times

Local AI needs to be the norm

Discussion

One of the current trends in modern software is for developers to slap an API call to OpenAI or Anthropic for features within their app. Reasonable people can quibble with whether those features are actually bringing value to users, but what I want to discuss is the fundamental concept of taking on a dependency to a cloud hosted AI model for applications.

This laziness is creating a generation of software that is fragile, invades your privacy, and fundamentally broken. We are building applications that stop working the moment the server crashes or a credit card expires.

We need to return to a habit of building software where our local devices do the work. The silicon in our pocket is mind bogglingly faster than what was available a decade ago. It has a dedicated Neural Engine sitting there, mostly idle, while we wait for a JSON response from a server farm in Virginia. That’s ridiculous.

Even if your intentions are pure, the moment you stream user content to a third party AI provider, you’ve changed the nature of your product. You now have data retention questions and all the baggage that comes with that (consent, audit, breach, government request, training, etc.)

On top of that you also substantially complicated your stack because your feature now depends on network conditions, external vendor uptime, rate limits, account billing, and your own backend health.

Congratulations! You took a UX feature and turned it into a distributed system that costs you money.

If the feature can be done locally, opting into this mess is self inflicted damage.

“AI everywhere” is not the goal. Useful software is the goal.

Concrete Example: Brutalist Report’s On-Device Summaries

Years ago I launched a fun side project named The Brutalist Report, a news aggregator service inspired by the 1990s style web.

Recently, I decided to build a native iOS client for it with the design goal of ensuring it would remain a high-density news reading experience. Headlines in a stark list, a reader mode that strips the cancer that has overtaken the web, and (optionally) an “intelligence” view that generates a summary of the article.

Here’s the key point though: the summary is generated on-device using Apple’s local model APIs. No server detours. No prompt or user logs. No vendor account. No “we store your content for 30 days” footnotes needed.

It has become so normal for folks that any AI use is happening server-side. We have a lot of work to do to turn this around as an industry.

It’s not lost on me that sometimes the use-cases you have will demand the intelligence that only a cloud hosted model can provide, but that’s not the case with every use-case you’re trying to solve. We need to be thoughtful here.

Available Tooling

I can only speak on the tooling available within the Apple ecosystem since that’s what I focused initial development efforts on. In the last year, Apple has invested heavily here to allow developers to make use of a built-in local AI model easily.

The core flow looks roughly like this:

import FoundationModels

let model = SystemLanguageModel.default
guard model.availability == .available else { return }

let session = LanguageModelSession {
  """
  Provide a brutalist, information-dense summary in Markdown format.
  - Use **bold** for key concepts.
  - Use bullet points for facts.
  - No fluff. Just facts.
  """
}

let response = try await session.respond(options: .init(maximumResponseTokens: 1_000)) {
  articleText
}

let markdown = response.content

And for longer content, we can chunk the plain text (around 10k characters per chunk), produce concise “facts only” notes per chunk, then runs a second pass to combine them into a final summary.

This is the kind of work local models are perfect for. The input data is already on the device (because the user is reading it). The output is lightweight. It’s fast and private. It’s okay if it’s not a superhuman PhD level intelligence because it’s summarizing the page you just loaded, not inventing world knowledge.

Local AI shines when the model’s job is transforming user-owned data, not acting as a search engine for the universe.

There are plenty of AI features that people want but don’t trust. Summarizing emails, extract action items from notes, categorize this document, etc.

The usual cloud approach turns every one of those into a trust exercise. “Please send your data to our servers. We promise to be cool about it.”

Local AI changes that. Your device already has the data. We’ll do the work right here.

You don’t build trust with your users by writing a 2,000 word privacy policy. You build trust by not needing one to begin with.

The tooling available on the platform goes even further.

One of the best moves Apple has made recently is pushing “AI output” away from unstructured blobs of text and toward typed data.

Instead of “ask the model for JSON and pray”, the newer and better pattern is to define a Swift struct that represents the thing you want. Give the model guidance for each field in natural language. Ask the model to generate an instance of that type.

That’s it.

Conceptually, it looks like this:

import FoundationModels

@Generable
struct ArticleIntel {
  @Guide(description: "One sentence. No hype.") var tldr: String
  @Guide(description: "3–7 bullets. Facts only.") var bullets: [String]
  @Guide(description: "Comma-separated keywords.") var keywords: [String]
}

let session = LanguageModelSession()
let response = try await session.respond(
  to: "Extract structured notes from the article.",
  generating: ArticleIntel.self
) {
  articleText
}

let intel = response.content

Now your UI doesn’t have to scrape bullet points out of Markdown or hope the model remembered your JSON schema. You get a real type with real fields, and you can render it consistently. It produces structured output your app can actually use. And it’s all running locally!

This isn’t just nicer ergonomics. It’s an engineering improvement.

And if you’re building a local first app, this is the difference between “AI as novelty” and “AI as a trustworthy subsystem”.

“But Local Models Aren’t As Smart”

Correct.

But also so what?

Most app features don’t need a model that can write Shakespeare, explain quantum mechanics, and pass the bar exam. They need a model that can do one of these reliably: summarize, classify, extract, rewrite, or normalize.

And for those tasks, local models can be truly excellent.

If you try to use a local model as a replacement for the entire internet, you will be disappointed. If you use it as a “data transformer” sitting inside your app, you’ll wonder why you ever sent this stuff to a server.

Use cloud models only when they’re genuinely necessary. Keep the user’s data where it belongs. And when you do use AI, don’t just glue it as a chat box. Use it as a real subsystem with typed outputs and predictable behavior.

Stop shipping distributed systems when you meant to ship a feature.

nullc•about 14 hours ago

Here is an example-- I'm running hermes + qwen3.6-27b on a workstation GPU (an older RTX A6000 which gets 55tok/s, though people run this model on more limited hardware).

A friend an I had previously worked on an entropy extraction scheme and he recently got around to making a writeup about our work: https://wuille.net/posts/binomial-randomness-extractors/

I instructed the agent to read the URL, implement the technique in C++ for 32-bit registers, then make a SIMD version that interleaves several extractors in parallel for better performance. It implemented it (not hard since there was an implementation there that it read), then wrote more extensive tests. Then it vectorized it. It got confused a few times during debugging because the algorithm uses some number theory tricks so that overflows of intermediate products don't matter and it was obviously trained a lot on ordinary code were such overflows are usually fatal. I instructed it to comment the code explaining why the overflows are fine and had it continue which mostly solved its confusion.

It successfully got the initial 12MB/s scalar implementation to about 48MB/s. Then I told it to keep optimizing until it reaches 100MB/s. I came back the next day and it had stopped after 6 hours when it achieved just over 100MB/s. Reading what it did: it went off looking at disassembly, figured out what hardware it was running on, and reading microarch timing tables online and made some better decisions, tried a lot of things that didn't work, etc. (And of course, the implementation is correct).

I'm pretty skeptical about AI and borderline hateful of many people who (ab)use it and are deluded by it-- but I think this experience shows that a small local model can be objectively useful.

(oh and this experience was also while I only had the model running at 19tok/s)

Running the model in a loop where it can get feedback from actually testing stuff allows you to make progress in spite of making many mistakes.

I could have done this work myself but I didn't have to and I certainly spent less time checking in and prodding it than it would have taken me to do it. In my case I wondered how much faster parallel extractors using SIMD might be-- an idle curiosity that would have gone unanswered if not for the AI.

avazhi•about 4 hours ago

> LLMs in 2022 were very impressive.

No they weren't. They were a gimmick - it is only in the past 6 or so months that frontier models have started to do stuff beyond mere gimmicks when it comes to coding, and you could make the argument that Mythos has been the first 'Holy shit' moment that we've had that has stepped us beyond 'Yeah that's really neat but...'

> Those models are absolutely runnable on consumer hardware now,

A sub 50B model is awful and can't even write proper English sentences half the time, to say nothing of how bad its world knowledge is. Try the 32B Gemma 4 local model for a week and then go back to Claude and then get back to me.

> We're long past "danger". If what we have is the best we'll ever have open source, we're already in an excellent position.

Not sure what to tell you other than that you and I have very different standards. What we have locally right now is barely more than a glorified autocomplete, and it feels worse than using ChatGPT 2 years ago because the context window is less and it doesn't have good webhooks on consumer setups. Another thing I'd say is that you clearly have no clue what 'consumer hardware' means, or what consumers that can even get this stuff running locally would have to do to get it to even rival the frontier models in terms of their usability (most consumers are't going to just boot into Ubuntu and run this thing from a command line) flow, to say nothing of the hardware requirements. I'd love to never use Claude or Gemini or ChatGPT again for both privacy and money reasons, but the quality of outputs and depth of thinking and writing ability between even the very best local models you can run right now is many orders of magnitude less than what you get using distributed frontier models, and those 'very best' local models require a top of the line machine that 99.9999% of consumers don't have and would never consider buying. The cloud models all have like a trillion(!) parameters now. It isn't even close.

I sure hope the local side of things massively improves over the next 2-3 years, but based on how this has gone my guess is that in 3 years you'll be lucky, if you have very top of the line hardware, to get benchmark performance that we had 6 months ago with the frontier models. The distributed hardware/memory gap is just too big.

majormajor•about 10 hours ago

Storage has moved back and forth but I don't thnk compute has ever really gone back to thin client. Even Gmail, Google Docs, etc are running a buttload of javascript on the user device. Various attempts at avoiding that (remote .NET or JVM stuff on early "smart-ish" phones) crashed and burned.

Video game streaming is the closest thing, and it's never really taken off. (And this, IMO, is a good comparison because it's a pretty similar magnitude up-front-cost, $500-$4000.)

Once the local-AI-is-good-enough (Sonnet level for a lot of basic tasks, say) for a $1k up-front investment the appeal of having something that can chew on various tasks 24/7 w/o rate limits, API token budget charge concerns, etc, is going to unlock a lot of new approaches to problems. Essentially more fully-baked line-of-business OpenClaw-type things. Or the smart home automation bot of Siri's dreams. You can more easily make that all private and secure when all the compute is local: don't give any outside network access. Push data into the sandbox periodically via boring old scripts-on-cronjobs, vs giving any sort of "agentic" harness external access. Have extremely limited data structures for getting output/instructions back out. I'd never want to pass info about my personal finances into a third party remote model; but I'd let a local one crunch numbers on it.

Even if you need Opus/Mythos/whatever level for certain tasks, if 95% of everything else you'd pay Anthropic or OpenAI for can now be done on things you own w/o third party risk... what does that do to the investment appeal of building better AI appliances to sell end users vs building better centralized models?

I think "what if today's LLM performance, but running entirely under your control and your own hardware" opens up a LOT of interesting functionality. Crowdsource the whole world's creativity to figure out what to do with it, vs waiting for product managers and engineers at 3 individual companies to release features.