The thing I've found with MLX vs llama.cpp is that the memory efficiency story is much better on unified memory machines. With llama.cpp you're fighting the CPU/GPU split; with MLX it just uses the whole pool. Made a meaningful difference for us running 4B models alongside an Electron app.
Curious whether the Ollama MLX backend exposes any controls for cache management or whether it's abstracted away entirely. That's been one of the trickier parts of tuning for our use case.
> The use of NVFP4 results in a 3.5x reduction in model memory footprint relative to FP16 and a 1.8x reduction compared to FP8, while maintaining model accuracy with less than 1% degradation on key language modeling tasks for some models.
Lemonade or even llama.cpp are much better optimised and arguably just as easy to use.
Yeah, I can still save money by buying a cheaper device with less RAM and just paying my PPQ.AI or OpenRouter.com fees.
For example, right now a lot of work is being done on improving tool calling and agentic workflows; tool calling first started popping up for local LLMs around the end of 2023.
This is putting aside the standard benchmarks which get "benchmaxxed" by local LLMs and show impressive numbers, but when used with OpenCode rarely meet expectations. In theory Qwen3.5-397B-A17B should be nearly a Sonnet 4.6 model but it is not.
The 64GB model is 2240€ base and the 128GB is 3069€ base + all the stuff you need to add to make it an actual computer.
As a comparison the 64GB Mac Mini is 2499€ here and a 128GB Mac Studio is 4274€.
The lack of proper support for SSD offload (via mmap or otherwise) is really the worst part about this. There's no underlying reason why a 3B-active model shouldn't be able to run, however slowly, on a cheap 8GB MacBook Neo with active weights being streamed in from SSD and cached. (This seems to be in the works for GGML/GGUF as part of upgrading to newer upstream versions; no idea whether MLX inference can also support this easily.)
any plans for providing it through brew for easy installation?
Looks like they have pivoted completely over to Gemini, thank god.
There are some surprisingly decent models that happily fit even into a mere 16 gigs of RAM. The recent Qwen 3.5 9B model is pretty good, though it did trip all over itself to avoid telling me what happened on Tiananmen Square in 1989. (But then I tried something called "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive", which veers so hard the other way that it will happily write up a detailed plan for your upcoming invasion of Belgium, so I guess it all balances out?)
The devex is great and familiar to folks who have used Docker. Reading through the Lemonade documentation, it seems like a natural migration, but we're talking about two steps for getting started versus just one. So I'd need a reason to make that much of a change when I'm happy enough with Ollama.
That's not a surprising result given the pretty ambiguous query, hence all the thinking. Asking "write a simple hello world program in python3" results in a much faster response for me (m4 base w/ 24gb, using qwen3.6:9b).
ollama run $model "calculate fibonacci numbers in a one-line bash script" --verbose
Model                    Prompt eval (tok/s)   Eval (tok/s)
-----------------------------------------------------------
qwen3.5:35b-a3b-q4_K_M          6.6               30.0
qwen3.5:35b-a3b-nvfp4          13.2               66.5
qwen3.5:35b-a3b-int4           59.4               84.4
I can't comment on the quality differences (if any) between these three.

What I'm waiting for next is MLX-supported speech recognition directly from Ollama. I don’t understand why it should be a separate thing entirely.
Local LLMs don't make sense for most people compared to "cloud" services, even more so for coding.
cmd(){ local x c r a; while [[ $1 == -* ]]; do case $1 in -x)x=1;shift;; -c)c=1;shift;; *)break;; esac; done; r=$(apfel -q -s 'Output only a shell command.' "$*" | sed '/^```/d;/^#/d;s/^[[:space:]]*//;/^$/d' | head -1); [[ $r ]] || { echo "no command generated"; return 1; }; printf '\e[32m$\e[0m %s\n' "$r"; [[ $c ]] && printf %s "$r" | pbcopy && echo "(copied)"; [[ $x ]] && { printf 'Run? [y/N] '; read -r a; [[ $a == y ]] && eval "$r"; }; return 0; }
cmd find all swift files larger than 1MB
cmd -c show disk usage sorted by size
cmd -x what process is using port 3000
cmd list all git branches merged into main
cmd count lines of code by language
without calling home or downloading extra local models
and well, maybe one day their local models will get more powerful, "less afraid", and gain a much bigger context window.
Recently I built a graphRAG app with Qwen 3.5 4b for small tasks like classifying what type of question I am asking or the entity extraction process itself, as graphRAG depends on extracted triplets (entity1, relationship_to, entity2). I used Qwen 3.5 27b for actually answering my questions.
It works pretty well. I have to be a bit patient but that’s it. So in that particular use case, I would agree.
I used MLX and my M1 64GB device. I found that MLX definitely works faster when it comes to extracting entities and triplets in batches.
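As a sketch of the kind of prompt that drives the triplet-extraction step described above (the prompt wording, JSON shape, and model tag here are my own illustrative assumptions, not from any particular graphRAG framework):

```shell
# Build an entity/triplet extraction prompt for a small local model.
# The wording and the one-JSON-object-per-line output format are
# illustrative assumptions; tune both for your own notes.
build_triplet_prompt() {
  local text="$1"
  cat <<EOF
Extract (entity1, relationship_to, entity2) triplets from the text below.
Respond with one JSON object per line: {"e1": "...", "rel": "...", "e2": "..."}.

Text:
$text
EOF
}

# Usage (model tag is an example; any small instruct model works):
#   build_triplet_prompt "$(cat note.md)" | ollama run qwen3.5:4b
```

Keeping the prompt-building pure like this makes it easy to batch: you can generate prompts for every note up front and stream them through the small model.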
For example, last week I built a real-time voice AI running locally on iPhone 15.
One use case is for people learning speaking english. The STT is quite good and the small LLM is enough for basic conversation.
I have run LM Studio for a while, but I don't really use local models that much other than to mess about.
Qwen thinking likes to second-guess itself a LOT when faced with simple/vague prompts like that. (I'll answer it this way. Generating output. Wait, I'll answer it that way. Generating output. Wait, I'll answer it this way... lather, rinse, repeat.) I suppose this is their version of "super smart fancy thinking mode". Try something more complex instead.
Second, for the best performance on a Mac you want to use an MLX model.
(regarding mlx, there were toolkits built on mlx that supported qlora fine tuning and inference, but also produced a bunch of heat)
But a local model + good harness with a robust toolset will work for people more often than not.
The model itself doesn't need to know who was the president of Zambia in 1968, because it has a tool it can use to check it from Wikipedia.
ChatGPT free falls back to GPT-5.2 Mini after a few interactions.
Most users are fixing grammar/spelling, summarising/converting/rewriting text, creating funny icons, and looking up simple facts; all of this is far from requiring frontier-model performance.
I've a feeling that if/when Apple release their onboard LLM/Siri improvements that can call out if needed, the vast majority of people will be happy with what they get for free that's running on their phone.
I think what "need" you speak of is a bit of a colored statement.
I am using the model they recommended in the blog post - which I assumed was using MLX?
When you're not making network calls, you stop thinking in "loading states" and start thinking in "local state machines." The UX design space opens up completely. Interactions that felt too fast to justify a server round-trip are suddenly viable.
The backporting issue is painful though. I've been shipping features wrapped in #available(iOS 26, *) and the fallback UX is basically a different product. It forces you to essentially maintain two app experiences.
Still think this is the right direction — especially for junior devs just learning to ship. Fewer moving parts, less infrastructure to debug.

Today, we’re previewing the fastest way to run Ollama on Apple silicon, powered by MLX, Apple’s machine learning framework.
This unlocks new performance to accelerate your most demanding work on macOS:
Personal assistants like OpenClaw
Coding agents like Claude Code, OpenCode, or Codex
Ollama on Apple silicon is now built on top of Apple’s machine learning framework, MLX, to take advantage of its unified memory architecture.
This results in a large speedup of Ollama on all Apple Silicon devices. On Apple’s M5, M5 Pro and M5 Max chips, Ollama leverages the new GPU Neural Accelerators to accelerate both time to first token (TTFT) and generation speed (tokens per second).
Prefill performance: 1810 tokens/s (Ollama 0.19) vs 1154 tokens/s (Ollama 0.18)
Decode performance: 112 tokens/s (Ollama 0.19) vs 58 tokens/s (Ollama 0.18)
Testing was conducted on March 29, 2026, using Alibaba’s Qwen3.5-35B-A3B model quantized to `NVFP4` and Ollama’s previous implementation quantized to `Q4_K_M` using Ollama 0.18. Ollama 0.19 will see even higher performance (1851 token/s prefill and 134 token/s decode when running with `int4`).
Ollama now leverages NVIDIA’s NVFP4 format to maintain model accuracy while reducing memory bandwidth and storage requirements for inference workloads.
As more inference providers scale up using the NVFP4 format, Ollama users see the same results they would in a production environment.
It also enables Ollama to run models optimized with NVIDIA’s Model Optimizer. Other precisions will be made available based on the design and usage intent of Ollama’s research and hardware partners.
Ollama’s cache has been upgraded to make coding and agentic tasks more efficient.
Lower memory utilization: Ollama now reuses its cache across conversations, meaning lower memory use and more cache hits when branching conversations that share a system prompt, as with tools like Claude Code.
Intelligent checkpoints: Ollama will now store snapshots of its cache at intelligent locations in the prompt, resulting in less prompt processing and faster responses.
Smarter eviction: shared prefixes survive longer even when older branches are dropped.
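Some cache behavior can also be tuned from the outside, via the server's environment rather than per request. The variable names below are from Ollama's FAQ; whether the new MLX backend honors all of them is an assumption on my part:

```shell
# How long a model (and its cache) stays resident after the last request
export OLLAMA_KEEP_ALIVE=30m
# How many models may be loaded at once, and parallel requests per model
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_NUM_PARALLEL=4
# Restart the server so the settings take effect:
#   ollama serve
```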
This preview release of Ollama accelerates the new Qwen3.5-35B-A3B model, with sampling parameters tuned for coding tasks.
Please make sure you have a Mac with more than 32GB of unified memory.
Claude Code:
ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4
OpenClaw:
ollama launch openclaw --model qwen3.5:35b-a3b-coding-nvfp4
Chat with the model:
ollama run qwen3.5:35b-a3b-coding-nvfp4
We are actively working to support future models. For users with custom models fine-tuned on supported architectures, we will introduce an easier way to import models into Ollama. In the meantime, we will expand the list of supported architectures.
Thank you to:
In part of it, one group tries to take control of a huge ship from another group, partly by trying to bypass all the cybersecurity. But in those far-future days, you don't interface with all the aeons of layers of command protocols anymore; you just query an AI who does it for you. So this group has a few tech guys who attempt the bypass by using the old command protocols directly (much like the iOS exploit that used a vulnerability in a PostScript font library from the 90s).
Imagine being used to LLM prompting + responses, and suddenly you have to deal with something like
sed '/^```/d;/^#/d;s/^[[:space:]]\*//;/^$/d' | head -1); [[ $r ]]
and generally obtuse terminal output and man pages. :)
(offtopic: name your variables, don't do local x c r a;. Readability is king, and a few hundred thousand years from now some poor Qeng Ho fellow might thank his lucky stars you did).
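In that spirit, here is a more readable equivalent of the helper's cleanup pipeline, with named pieces. The behavior is intended to match the one-liner above, and `apfel` is the same hypothetical CLI:

```shell
# Strip model chatter down to a single runnable command line:
# drop code fences, comment lines, leading whitespace, and blank
# lines, then keep only the first remaining line.
clean_command() {
  sed -e '/^```/d' -e '/^#/d' -e 's/^[[:space:]]*//' -e '/^$/d' | head -n 1
}

# Usage with the same hypothetical CLI as above:
#   raw="$(apfel -q -s 'Output only a shell command.' "$*")"
#   command_line="$(printf '%s\n' "$raw" | clean_command)"
```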
----
Full reaction:
Yes but perhaps not in a way you might expect. Qwen's reasoning ability isn't exactly groundbreaking. But it's good enough to weave a story, provided it has some solid facts or notes. GraphRAG is definitely a good way to get some good facts, provided your notes are valuable to you and/or contain some good facts.
So the added value is that you now have a super charged information retrieval system on your notes with an LLM that can stitch loose facts reasonably well together, like a librarian would. It's also very easy to see hallucinations, if you recognize your own writing well, which I do.
The second thing is that I have a hard time rereading all my notes. I write a lot of notes, and don't have the time to reread any of them. So oftentimes I forget my own advice. Now that I have a super charged information retrieval system on my notes, whenever I ask a question: the graphRAG + LLM search for the most relevant notes related to my question. I've found that 20% of what I wrote is incredibly useful and is stuff that I forgot.
And there are nuggets of wisdom in there that are quite nuanced. For me specifically, I've seen insights in how I relate to work that I should do more with. I'll probably forget most things again but I can reuse my system and at some point I'll remember what I actually need to remember. For example, one thing I read was that work doesn't feel like work for me if I get to dive in, zoom out, dive in, zoom out. Because in the way I work as a person: that means I'm always resting and always have energy for the task that I'm doing. Another thing that it got me to do was to reboot a small meditation practice by using implementation intentions (e.g. "if I wake up then I meditate for at least a brief amount of time").
What also helps is to have a bit of a back and forth with your notes and then copy/paste the whole conversation in Claude to see if Claude has anything in its training data that might give some extra insight. It could also be that it just helps with firing off 10 search queries and finds a blog post that is useful to the conversation that you've had with your local LLM.
could also be considered a triage layer
Maybe in the distant future when device compute capacity has increased by multiples and efficiency improvements have made smaller LLMs better.
The current data center buildouts are using GPU clusters and hybrid compute servers that are so much more powerful than anything you can run at home that they’re not in the same league. Even among the open models that you can run at home if you’re willing to spend $40K on hardware, the prefill and token generation speeds are so slow compared to SOTA served models that you really have to be dedicated to avoiding the cloud to run these.
We won’t be in a data center crunch forever. I would not be surprised if we have a period of data center oversupply after this rush to build out capacity.
However at the current rate of progress I don’t see local compute catching up to hosted models in quality and usability (speed) before data center capacity catches up to demand. This is coming from someone who spends more than is reasonable on local compute hardware.
But it's still the easiest and cleanest way to get decent local AI speeds on a non-Mac.
This might be the cost of privacy, and it might be worth paying, unless cloud models reach an inflection point that make local models archaic.
No. It runs on macOS but uses Metal instead of MLX.
They've usually been intended for ereader/off-grid/post-zombie-apocalypse situations but I'd guess someone is working on an llm friendly way to install it already.
It'd be interesting to know the tradeoffs. The Tiananmen Square example suggests why you'd maybe want the knowledge facts to come from a separate source.
> solves the problem of too much demand for inference
False, it creates consumer demand for inference chips, which will be badly utilised.
> also would use less electricity
What makes you think that? (MAYBE you can save power on cooling. But not if the data center is close to a natural heat sink)
> It's just a matter of getting the performance good enough.
The performance limitations are inherent to the limited compute and memory.
> Most users don't need frontier model performance.
What makes you think that?
I was looking for details about cars and it started interjecting how the safety would affect my children by name in a conversation where I never mention my children. I was asking details about Thunderbolt and modern Ryzen processors and a fresh Gemini chat brought in details about a completely unrelated project I work on. I’ve always thought local LLMs would be important, but whatever Google did in the past few weeks has made that even more clear.
I am wondering how people rave so much about local "small devices" LLM vs what codex or Claude code are capable of.
Sadly there is too much hype around local LLMs; they look great for 5-minute tests and that's it.
These models are dumber and slower than API SoTA models and always will be.
My time and sanity is much more expensive than insurance against any risk of sending my garbage code to companies worth hundreds of billions of dollars.
For most, using local models is a downgrade on multiple fronts: total cost of ownership, software maintenance, electricity bills, lost performance on the machine doing the inference, more hallucinations/bugs/lower-quality code, and slower iteration speed.
I would expect consumer inference ASIC chips will emerge when model developments start plateauing, and "baking" a highly capable and dense model to a chip makes economic sense.
Users really don’t matter at all. The revenue for AI companies will be B2B where the user is not the customer - including coding agents. Most people don’t even use computers as their primary “computing device” and most people are buying crappy low end Android phones - no I’m not saying all Android phones are crappy. But that’s what most people are buying with the average selling price of an Android phone being $300.
Worth adding that I had reasoning on for the Tiananmen question, so I could see the prep for the answer, and it had a pretty strong current of "This is a sensitive question to PRC authorities and I must not answer, or even hint at an answer". I'm not sure if a research tool would be sufficient to overcome that censorship, though I guess I'll find out!
Actual consumers not only don't care, they will not even be aware of the difference.
I have an M4 MacBook Air with 24 GB RAM and it doesn’t feel sufficient to run a substantial coding model (in addition to all my desktop apps). I’m thinking about upgrading to an M5 MacBook Pro with much more RAM, but I think the capabilities of cloud-hosted models will always run ahead of local models and it might never be that useful to do local inference. In the cloud you can run multiple models in parallel (e.g. to work on different problems in parallel) but locally you only have a fixed amount of memory bandwidth so running multiple model instances in parallel is slower.
[1] https://9to5mac.com/2026/03/03/apple-macbook-price-increase-...
It's not 100% offline, but there is a dramatic drop in token usage, as long as you can put up with the speed.
Mac Studio, Mac Mini, MacBook Pro, you can find even some used ones with enough RAM that will run models like Qwen reasonably well.
I'm using a M1 Max MacBook Pro and it runs Qwen 3.5 on Ollama (without MLX) at a decent speed.
You can use models like qwen3.5 running on local hardware in ollama and redirect Claude to use the local ollama API endpoint instead of Anthropic’s servers.
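A minimal sketch of that redirect, under some assumptions: `ANTHROPIC_BASE_URL` and `ANTHROPIC_MODEL` are Claude Code's own documented overrides, but whether your Ollama version can serve the Anthropic-style API directly, or needs a translation proxy in front, depends on the version you're running.

```shell
# Point Claude Code at a local Ollama endpoint instead of
# Anthropic's servers. The port is Ollama's default; the model tag
# is the one from the announcement above.
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_MODEL="qwen3.5:35b-a3b-coding-nvfp4"
# Then start Claude Code in this same shell:
#   claude
```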
'Dense' models of yesteryear (e.g. llama:70b, gemma2/3:27b) tend to be significantly slower by comparison; therefore, your hardware spends a lot more time 'maxed out' for a given prompt.
brew install Arthur-Ficial/tap/apfel

Why waste time with subpar AI?
What are you doing with these local models that run at x tokens/sec.
Do you have the equivalent of ChatGPT running entirely locally? What do you do with it? Why? I honestly don’t understand the point or use case.
The instruction to the AI was to create _a_ shell command. So it's a random shell command generator (maybe).
It needs to be just smart enough to use the tools and distill the responses into something usable. And one of the tools can be "ask claude/codex/gemini" so the local model itself doesn't actually need to do much.
Have you tried the free version of ChatGPT? It is positively appalling. It’s like GPT 3.5 but prompted to write three times as much as necessary to seem useful. I wonder how many people have embarrassed themselves, lost their jobs, and been critically misinformed. All easy with state-of-the-art models but seemingly a guarantee with the bottom sub-slop tier.
Is the average person just talking to it about their day or something?
I think the opposite is true. Local inference doesn't have to go over the wire and through a bunch of firewalls and what have you. The performance from just regular consumer hardware with local, smaller models is already decent. You're utilizing the hardware you already have.
> The performance limitations are inherent to the limited compute and memory.
When you plug in a local LLM and inference engine into an agent that is built around the assumption of using a cloud/frontier model then that's true.
But agents can be built around local assumptions and more specific workflows and problems. That also includes the model orchestration and model choice per task (or even tool).
The Jevons Paradox comes into play with using cloud models. But when you have less resources you are forced to move into more deterministic workflows. That includes tighter control over what the agent can do at any point in time, but also per project/session workflows where you generate intermediate programs/scripts instead of letting the agent just do what ever it wants.
I give you an example:
When you ask a cloud based agent to do something and it wants more information, it will often do a series of tool calls to gather what it thinks it needs before proceeding. Very often you can front load that part, by first writing a testable program that gathers most of the necessary information up front and only then moving into an agentic workflow.
This approach can produce a bunch of .json, .md files or it can move things into a structured database or you can use embeddings or what have you.
This can save you a lot of inference, make things more reusable and you don't need a model that is as capable if its context is already available and tailored to a specific task.
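A minimal sketch of that front-loading step for a code repo. The file names, the exact set of commands, and the choice of what to gather are all arbitrary illustrations, not a prescribed workflow:

```shell
# Gather repo facts once, up front, so the agent starts with
# tailored context instead of burning inference on discovery.
gather_context() {
  local out="${1:-context.md}"
  {
    echo "# Repo context"
    echo "## Directory layout"
    find . -maxdepth 2 -type d -not -path '*/.git*' | sort
    echo "## Largest Python files (lines)"
    find . -type f -name '*.py' -not -path '*/.git*' \
      -exec wc -l {} + 2>/dev/null | sort -rn | head -n 10
  } > "$out"
}
```

The resulting file is testable on its own, and can be handed to a much smaller model than would be needed if the agent had to discover all of this itself.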
There are so many underutilized CPUs, GPUs, RAM sticks and SSDs. I have some in my closet doing 5% load at peak times. Why would inference chips be special once they become commodity hardware?
The fact that today's and yesterday's models are quite capable of handling mundane tasks, and even companies behind frontier models are investing heavily in strategies to manage context instead of blindly plowing through problems with brute-force generalist models.
But let's flip this around: what on earth even suggests to you that most users need frontier models?
You can try asking it the same question as Claude and compare the answers. I can guarantee you that the ChatGPT answer won't fit on a single screen on a 32" 4k monitor.
Claude's will.
That doesn't fix the "you don't know what you don't know" problem which is huge with smaller models. A bigger model with more world knowledge really is a lot smarter in practice, though at a huge cost in efficiency.
Sure but you're paying per-token costs on the SoTA models that are roughly an order of magnitude higher than third-party inference on the locally available models. So when you account for per-token cost, the math skews the other way.
2. They aren't harvesting your data for government files or training purposes
3. They won't be altered overnight to push advertising or a political agenda
4. They won't have their pricing raised at will
5. They won't disappear as soon as their host wants you to switch
This is all on top of the (to me) insufferable tone of the non-thinking models, but that might well be how most users prefer to be talked to, and whether that's how these models should accordingly talk is a much more nuanced question.
Regardless of that, everybody deserves correct answers, even users on the free tier. If this makes the free tier uneconomical to serve for hours on end per user per day, then I'd much rather they limit the number of turns than dial down the quality like that.
"when hostapd initializes 80211 iface over nl80211, what attributes correspond to selected standard version like ax or be?"
It works fine, and avoids falling into the trap set by the misleading question. It probably works even better for more popular technologies. Yeah, it has higher failure rates, but that's not a dealbreaker for non-autonomous use cases.

I'll just say, if not a joke, the bit is appreciated either way!
"AI change to the home directory. Make it snappy!"
Picking a model that's juuust smart enough to know it doesn't know is the key.
None of them are as good as the big hosted models, but you might be surprised at how capable they are. I like running things locally when I can, and I also like not worrying about accidentally burning through tokens.
I think the future is multiple locally run models that call out to hosted models when necessary. I can imagine every device coming with a base model and using LoRAs to learn about the user's needs, with companies and maybe even households having their own shared models that do heavier lifting, while companies like OpenAI and Anthropic continue to host the most powerful and expensive options.
MLX is faster because it has better integration with Apple hardware. On the other hand GGUF is a far more popular format so there will be more programs and model variety.
So it's kinda like having a very specific diet that you swear is better for you, but you can only order food from a few restaurants.
Having access to a model that is drawing from good sources and takes time to think instead of hallucinating a response is important in many domains of life.
I could be wrong because I'm not following this too closely, but the open weights future of both Llama and Qwen looks tenuous to me. Yes, there are others, but I don't understand the business model.
Cost is a pretty big reason.
Getting the local weather using a free API like met.no is a good first tool to use.
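A sketch of the request that such a tool would make. The endpoint path is met.no's documented compact location forecast; note that met.no requires an identifying User-Agent header, so set one when you actually call it:

```shell
# Build the met.no locationforecast URL for a given lat/lon.
weather_url() {
  printf 'https://api.met.no/weatherapi/locationforecast/2.0/compact?lat=%s&lon=%s' "$1" "$2"
}

# Usage (network call; the JSON result is what the model summarizes):
#   curl -s -H 'User-Agent: my-local-agent/0.1' "$(weather_url 59.91 10.75)"
```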
I worked for a research focused AI startup that had a strict "no external LLM" policy for code touching our core research.
You're right that the average consumer doesn't care about privacy, but there are many, many users who do. The average consumer also doesn't have a desktop with a GPU or a high-end Mac Studio, but that doesn't mean there aren't many people working with AI who do have these things.
If we continue to see improvements in running local models, and RAM prices continue to fall as they have in the last month, then suddenly you don't have to worry about token counts any more and can be much more trusting of your agents since they are fully under your control.
No? What? Oh, you can't?
Neither can consumers. Most consumers are very aware of the lack of privacy, the manipulation, and have very cynical feelings about Facebook and similar companies. But it's where their friends and family are.
For most people the web is a mine field maze where basic things they want are compromised everywhere. And they are routinely creeped out by ads that reveal they know them far too personally.
You are mistaking network capture for preference.
Another telling example. Lots of privacy valuing technical people, who would never have a Facebook account, send unencrypted text emails.
It is network capture, not preference.
Oh, and wait till ad companies start selling your healthcare data; then you will see how fast things turn 'given a choice'.
Is there already some research or experimentation done into this area?
The world is not moving back to on prem.
They are choosing to give Facebook info.
Every company has dozens of SaaS products that store their business critical information. Amazon installs Office on each computer, Slack (they were moving away from Chime when I left), and the sales department uses SalesForce - SA’s and Professional Services (former employee).
The addressable market of even companies that care about privacy is not large. How long will it be before computers become cheap enough that they can run even GPT-4-level LLMs, and companies will give them to all of their developers?
Yeah but if they can rake in 100x as much by making products for people who don't care about privacy, then why spend time developing stuff for people who care?
There is still a small market left, of course, but that market will not have the billions of R&D behind it.
Meaning: these 5000 tokens consume tiny amounts of energy being moved around from the data center to your PC, but enormous amounts of energy to generate in the first place. An equivalent webpage with the same amount of text as these tokens would be perceived as instant in any network configuration: just some kilobytes of text, much smaller than most background graphics. The two things can't be compared at all.
However, just last week there have been huge improvements on the hardware required to run some particular models, thanks to some very clever quantisation. This lowers the memory required 6x in our home hardware, which is great.
In the end, we spent more energy playing videogames during the last two decades, than all this AI craze, and it was never a problem. We surely can run models locally, and heat our homes in winter.
As someone who has hardware in that price range and plays with local LLMs: The gap between Opus or GPT and the local models is still very large for work beyond simple queries.
Self-hosted also starts making my office hot due to all of the power consumption when I use it for anything more than short queries. If you haven't heard your Mac's fans spin up much yet, running local LLMs will get you acquainted with the sound of their cooling systems at full blast.
Lol, you should tell my customers (that are moving back on prem) that!
You should also tell Microsoft, who just yesterday said they are going back to focusing on local apps.
Yes, they do. That is exactly the phenomenon my comment addressed.
But the way you wrote that implies an improbable motivation or choice framing.
Perhaps their real motive/choice is to share with other people on the site.
It is called a network effect.
If (1) Facebook had been the surveillance/manipulation capital of the world from inception, (2) an equally inviting privacy protecting site took off at the same time, and (3) everyone chose Facebook over E2EE anyway, then sure, we could throw up our hands! Those silly users!
The term I have for when people discuss choices involving many-dimensional criteria, as if the choice involved just one or two selected dimensions, is "dimension blindness". It happens in a lot of heated discussions about phone choices too.
People have said this since Pytorch was published and it's not any more true now than it was 10 years ago.
They are. The majority aren't doing inference on a Mac Mini, but instead using it as a local host for cloud-based inference. You could have the same general experience on a $200 Chromebook or $300 Windows box.
They aren’t buying high end $2000+ Mac Minis.
This is clearly true. There is an implied point here but I am not sure what.
They share in their profile what they want other people to see. And often choose to not fill out everything. Nobody signs up to share with Meta, Inc.
Most people would love a "[ ] Do not share with Facebook".
People choosing an imperfect option, from imperfect options, are not demonstrating evidence they don't care about the imperfections.
What, would be helpful to make clearer?
--
Stated one way:
Precisely because Facebook does not offer users any choice to separate sharing with peers from sharing with Facebook Inc., a user's desire to share with peers on Facebook is not evidence that they are happy for Facebook Inc. to use their information.
Stated another way:
Person U(ser) doesn't like person F(Facebook), but wants to be around person P(eer) and all their mutual friends.
When F has a party, P is always there with their friends. U showing up isn't evidence they wanted to interact with F. Even though they are choosing to go somewhere they have to.