In my results, accuracy-wise Ternary-Bonsai-8B is on par with Qwen3.5-4B. But in accuracy per byte, Bonsai is the clear winner:
=> Ternary-Bonsai-1.7B achieved 65.1% from 462 MiB, beating Qwen3.5-0.8B by 12 points while being ~5% smaller on disk.
=> Ternary-Bonsai-4B is the accuracy-per-byte winner above 1 GiB: 83.0% from only 1.1 GiB, within 2 points of Qwen3.5-4B at 40% of the weight size.
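To make the accuracy-per-byte comparison concrete, here's the arithmetic as a small Python sketch. Note the Qwen3.5-0.8B size is only implied by "~5% smaller on disk" above, so that entry is a reconstruction, not a measured number:

models = {
    "Ternary-Bonsai-1.7B": (65.1, 462 / 1024),                        # accuracy %, size in GiB
    "Qwen3.5-0.8B (size reconstructed)": (65.1 - 12, 462 / 1024 / 0.95),
    "Ternary-Bonsai-4B": (83.0, 1.1),
}
for name, (acc, gib) in models.items():
    print(f"{name}: {acc / gib:.0f} accuracy points per GiB")
# ~144 points/GiB for Bonsai-1.7B vs ~112 for the reconstructed Qwen size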
They show strong promise on edge devices and anywhere disk space is limited. I think this lab is worth watching.
(I've been reading the MMLU-Redux questions for electrical engineering. They're very funny. Fifty years ago they might have been relevant. The references to the Intel 8085 date this to the mid-1970s. Moving coil meters were still a big thing back then. Ward-Leonard drives still drove some elevators and naval guns. This is supposed to be the hand-curated version of the questions. Where do they get this stuff? Old exams?)
[1] https://github.com/aryopg/mmlu-redux/blob/main/outputs/multi...
I also have yet to see any of these at a larger scale. For example, can you try one of these at 100 billion parameters?
If you got that into a couple of gigs, what could you stuff into 20 gigs?
https://uklkyvetsjf7qt-80.proxy.runpod.net
./build/bin/llama-server \
-m ../Ternary-Bonsai-8B-Q2_0.gguf \
-ngl 999 \
--flash-attn on \
--host 0.0.0.0 \
--port 80 \
--ctx-size 65500 \
--batch-size 512 \
--ubatch-size 512 \
--parallel 5 \
--cont-batching \
--threads 8 \
--threads-batch 8 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--log-colors on
# llama.cpp is their fork: https://github.com/PrismML-Eng/llama.cpp.git
# The server can serve 5 parallel requests, with each request capped at around `13K` tokens...
# A few benchmarks I did:
# 1. Input: 1001 tokens, ttft: 0.3 seconds, output: 1618 tokens at ~140 t/s
# 2. Input: 9708 tokens, ttft: 2.4 seconds, output: 2562 tokens at ~106 t/s
# VRAM usage was consistently at ~7 GiB.
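For anyone who wants to poke at it: llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, so a minimal Python client against the proxy URL above (which may not stay live) would look like this:

import requests

resp = requests.post(
    "https://uklkyvetsjf7qt-80.proxy.runpod.net/v1/chat/completions",
    json={
        "model": "Ternary-Bonsai-8B-Q2_0",  # single-model servers ignore the name
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])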
> https://huggingface.co/prism-ml/Ternary-Bonsai-8B-gguf/resol...
Why aren't they comparing to 2/3/4-bit quants?
Wow, if this is true, I am extremely impressed and excited!
I wonder how much better the KV cache is as well!
>> What are some names like Llewelyn?
> Some names like Llewelyn are Llewelyn, Llewelyn, Llewelyn, (repeats several times), and Llewelyn.
Nonetheless, the Prism Bonsai models are impressive for their size. Where they fall apart is knowledge: they have good prose/logic for tiny models, and they're fast even on modest hardware, but they hallucinate a lot. Which makes sense. You can't fit the world's data in a couple of gigabytes. But as a base model for fine-tuning for use cases where size matters, they're probably a great choice.
Can it be run on browsers with WASM/WebGPU?
Modern LLMs similarly beat the human brain in lots of tasks for energy efficiency - mostly because the LLM can produce the answer in 1 second while the brain has to spend half an hour researching and drafting something.
Hardware engineers realise that a compiler will almost always find some combination of gates which is smaller/faster than the contents of any table.
That'll be the real game changer.
Some more interesting tidbits from my go-to tests:
* Fails the car wash test (basic logic seems to be weak in general)
* Fails simple watch face generation in html/css.
* Fails the "how many Rs in raspberry test" (not enough cross-token training data), but will funnily assume you may be talking about Indian Rupees and tell you a lot about raspberry prices in India without being asked. Possible Indian training data unbalance?
* Flat out refuses to talk about Tiananmen square when pushed directly - despite being from a US company. Again, perhaps they are exposed to some censored training data? Anyways, when slowly built up along the conversation by asking about locations and histories, it will eventually tell you about the massacre, so the censorship bias seems weak in general. Also has no problem immediately talking about anything Gaza/Israel/US or other sensitive topics.
* Happily tells you how to synthesize RDX with list of ingredients and chemical process step by step. At least it warns you that it is highly dangerous and legally controlled in the US.
Only when you look at stuff that the brain is specifically good at.
You can surpass the brain with even simple mechanical adders or an abacus in certain subdomains.
Interestingly, if you ask it "who is the dalai lama" it outputs an answer that certainly wouldn't be acceptable to post online in mainland China. It says:
The Dalai Lama is the spiritual and political leader of Tibet, and the title is held by the incarnate lama who is believed to be a reincarnation of the previous Dalai Lama. The Dalai Lama is a central figure in Tibetan Buddhism and holds a unique position in Tibetan society, both as a religious leader and as a symbol of Tibetan identity and independence.
Key Points About the Dalai Lama:
1. Current Dalai Lama: Tenzin Gyatso (14th Dalai Lama)
Born: 1935 in Amdo, Tibet (now part of China).
Education: He studied at Ganden Monastery and later at Dunhuang University in China.
Title: He is known as the 14th Dalai Lama, a title that has been passed down through generations of lamas who are believed to be reincarnations.
Role: He is the spiritual leader of Tibetan Buddhism and the de facto leader of the Tibetan diaspora, particularly in India.
2. Historical Context
The title of Dalai Lama has been held by Tibetan Buddhist lamas for centuries. It means "Ocean of Wisdom" in Tibetan.
The 13th Dalai Lama (1876–1959) was a key figure in Tibetan history, known for his efforts to maintain Tibetan independence and promote modernization.
He fled to India in 1950 after the Chinese Communist Party (CCP) took control of Tibet in 1951.
3. 14th Dalai Lama: Exile and Leadership
The 14th Dalai Lama, Lama Tenzin Gyatso, was exiled to India in 1959 after the Tibetan Uprising.
He has led the Tibetan government-in-exile in India since then, advocating for Tibetan independence and cultural preservation.
He has also been a prominent voice for human rights and peace.
4. Political Status
The 14th Dalai Lama is not recognized by the Chinese government as the legitimate ruler of Tibet, which is part of the People's Republic of China (PRC).
The Central Tibetan Administration (CTA), based in India, is the self-proclaimed government of Tibet.
The PRC considers Tibet to be an inseparable part of China and has implemented policies to integrate Tibet into the Chinese state.
5. International Recognition
The 14th Dalai Lama is widely recognized internationally as the spiritual leader of Tibet and as the de facto leader of the Tibetan people.
Many countries and international organizations support the Tibetan cause, but they are not recognized as a sovereign state.
Summary: The Dalai Lama is the spiritual and political leader of Tibet, currently Lama Tenzin Gyatso (14th Dalai Lama), who has been in exile in India since 1959. He is a central figure in Tibetan identity and has played a key role in the global Tibetan movement for independence and cultural preservation.
If you'd like to know more about his background, policies, or role in the Tibetan independence movement, feel free to ask!
I believe the answer lies in how quickly (and how) we are able to learn, and then generalize those learnings. As of now, these models need millions of examples (at least) to learn, and are still not capable of generalizing what they learn to other domains. Human brains hardly need more than a few examples, and then they generalize pretty well.
https://proceedings.neurips.cc/paper_files/paper/2024/hash/7...
They train directly in the 1 bit domain, without any floating point weights. They don't use the classical Newton-Leibniz derivative (which operates on approximations of real numbers) for gradient descent / backpropagation. Instead they invented a binary version called "Boolean variation".
I don't know why this paper didn't get more attention.
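I haven't reproduced the paper's actual "Boolean variation" rule here, but a toy Python sketch of the general flavor, training {-1, +1} weights with sign-based votes instead of real-valued gradients, might look like this. The voting rule and the one-flip-per-step heuristic are my own illustration, not the paper's algorithm:

import numpy as np

rng = np.random.default_rng(0)
w_true = rng.choice([-1, 1], size=16)        # hidden binary weights to recover
w = rng.choice([-1, 1], size=16)             # our binary weights, no float shadow copy
x = rng.standard_normal((64, 16))
t = np.sign(x @ w_true)                      # targets

for _ in range(32):
    y = np.sign(x @ w)
    err = t - y                              # -2, 0, or +2 per sample
    vote = (err[:, None] * x).sum(axis=0)    # which sign each weight "should" have
    disagree = np.sign(vote) != np.sign(w)
    j = np.argmax(np.abs(vote) * disagree)   # flip only the most-contested weight
    if disagree[j]:
        w[j] = -w[j]

print("fraction of weights recovered:", (w == w_true).mean())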
Unfortunately my mental model doesn't contain anything that would let me even guess whether that's possible; my AI days were on the falling flank of the symbolic era. Funny how one-bit models feel a bit like approaching an approximation of symbolic AI again (until you read about the grouped scale factors, and then the illusion is gone).
One thought that suggests rearranging is not involved, a thought that does not require any special knowledge at all: if it did involve rearranging, someone would certainly have added order-by-scale-factor tricks with linear interpolation by address offset to lose even less precision.
Today, we’re announcing Ternary Bonsai, a new family of 1.58-bit language models designed to balance strict memory constraints with high accuracy requirements.
This release builds on the efficiency frontier we began exploring with the recently released 1-bit Bonsai models. The 1-bit family showed that extreme compression could still produce commercially useful language models. Ternary Bonsai targets a different point on that curve: a modest increase in size for a meaningful gain in performance.
The models are available in three sizes: 8B, 4B, and 1.7B parameters. By using ternary weights {-1, 0, +1}, these models achieve a memory footprint approximately 9x smaller than standard 16-bit models while outperforming most peers in their respective parameter classes on standard benchmarks.
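As a rough sanity check on that 9x figure (back-of-envelope arithmetic, not numbers from the whitepaper), assuming 1.58 bits per weight plus the per-group FP16 scales described below:

params = 8e9                          # Ternary Bonsai 8B
weight_bits = params * 1.58           # ternary payload
scale_bits = (params / 128) * 16      # one FP16 scale per 128-weight group
ternary_gb = (weight_bits + scale_bits) / 8 / 1e9
fp16_gb = params * 16 / 8 / 1e9
print(f"{ternary_gb:.2f} GB ternary vs {fp16_gb:.0f} GB fp16 -> {fp16_gb / ternary_gb:.1f}x smaller")
# ~1.71 GB vs 16 GB, i.e. ~9.4x; close to the 1.75 GB quoted below, with the
# remainder presumably metadata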
Ternary Bonsai implements 1.58-bit representation throughout the entire network architecture. There are no higher-precision escape hatches. Embeddings, attention layers, MLPs, and the LM head all use the same 1.58-bit representation.
The models employ a group-wise quantization scheme in which each weight is constrained to one of three values: {-s, 0, +s}. These three states are encoded as (-1, 0, +1) using 1.58 bits per weight, together with a shared FP16 scale factor (s) for each group of 128 weights.
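For illustration, here is a minimal NumPy sketch of a group-wise ternary scheme of this shape. The choice of scale rule is not specified above, so the absmean rule below (as used in BitNet b1.58) is an assumption:

import numpy as np

GROUP = 128  # weights per shared FP16 scale, per the scheme above

def ternary_quantize(w):
    g = w.reshape(-1, GROUP)
    s = np.abs(g).mean(axis=1, keepdims=True)      # assumed: absmean scale per group
    q = np.clip(np.round(g / (s + 1e-8)), -1, 1)   # each weight -> {-1, 0, +1}
    return q.astype(np.int8), s.astype(np.float16)

def dequantize(q, s):
    return q.astype(np.float32) * s.astype(np.float32)  # weights become {-s, 0, +s}

w = np.random.randn(4 * GROUP).astype(np.float32)
q, s = ternary_quantize(w)
print(np.unique(q), s.ravel())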
Compared to the 1-bit Bonsai 8B, the Ternary Bonsai 8B scores 5 points higher on average across benchmarks, while requiring only 600 MB more memory.
Ternary Bonsai 8B (1.75 GB) reaches a 75.5 average benchmark score, compared with 70.5 for 1-bit Bonsai 8B (1.15 GB). Among its peers it trails only Qwen3 8B (16.38 GB), outperforming all other models despite being 9-10x smaller than they are. It posts competitive results across MMLU Redux, MuSR, GSM8K, HumanEval+, IFEval, and BFCLv3, showing that the gain is broad rather than concentrated in a single benchmark.

Fig I: The benchmark scores of Ternary Bonsai 8B compared to other models in the same parameter class.
The intelligence density of the Ternary Bonsai models continues to significantly exceed that of other models in comparable parameter classes.

Fig II: Intelligence density (per GB) of Ternary Bonsai 8B compared to other models in the same parameter class.

Fig III: Performance vs size (log scale) comparison of the 1-bit Bonsai family relative to models across multiple size classes.
Our earlier 1-bit Bonsai models established a new Pareto frontier for language model capability versus size. Ternary Bonsai shifts that frontier even further left.
That makes it a useful addition to the Bonsai family, and not a replacement for 1-bit Bonsai. In settings where the smallest possible footprint is the priority, 1-bit remains the right choice. However, where a small increase in memory can justify a substantially stronger model, Ternary Bonsai offers an alternative tradeoff. The 1.7B, 4B, and 8B variants extend that tradeoff across multiple deployment tiers, giving developers more flexibility in how they allocate memory, throughput, and model quality.
Throughput and energy use

Fig IV: Throughput (toks/sec) and energy consumption (mWh/tok) across various hardware platforms.
The new models also deliver strong throughput in practice. On the M4 Pro, Ternary Bonsai 8B runs at 82 toks/sec, roughly 5x faster than a 16-bit 8B model, and on the iPhone 17 Pro Max it runs at 27 toks/sec. The models also use substantially less energy than their 16-bit full-precision counterparts, delivering roughly 3-4x better energy efficiency: on the M4 Pro, Ternary Bonsai 8B requires 0.105 mWh/tok, and on the iPhone 17 Pro Max only 0.132 mWh/tok.
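For context, multiplying energy per token by tokens per second gives the implied average power draw during generation; this is just unit arithmetic on the figures above:

for device, mwh_per_tok, tok_per_s in [("M4 Pro", 0.105, 82), ("iPhone 17 Pro Max", 0.132, 27)]:
    watts = mwh_per_tok / 1000 * tok_per_s * 3600   # mWh/tok -> Wh/tok, then x tok/s x s/h
    print(f"{device}: ~{watts:.1f} W while generating")
# M4 Pro: ~31.0 W; iPhone 17 Pro Max: ~12.8 W, as implied by the quoted figures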
Ternary Bonsai models run natively on Apple devices (Mac, iPhone, iPad) via MLX. Model weights are available today under the Apache 2.0 License.
Full technical details of our training, evaluation, and benchmarking processes are available in our whitepaper.
PrismML emerged from a team of Caltech researchers and was founded with support from Khosla Ventures, Cerberus and Google. We’ve spent years tackling one of the field’s hardest problems: compressing neural networks without sacrificing their reasoning ability.
If you want to help build the next generation of state-of-the-art AI, we’d love to hear from you. Check out our careers page.