Seems like that could end up as a situation where a fractional number of bits or bytes per parameter makes sense, particularly with adverbs, adjectives, and negators.
The engineering/optimization work is nice, but it's not what people have been waiting for; the real question is whether the BitNet idea, which seemed so promising, can actually deliver in a competitive way.
With how much RAM? How much storage does it require?
There's a lot that you can do when the model size is that small, yet still powerful.
Our next step is to put up a content distribution network for it where people can also share the diffs for their own fine-tuned models. I'll post the project once we finish all the parts.
[1] https://www.youtube.com/live/x791YvPIhFo?is=NfuDFTm9HjvA3nzN
My disappointment is immeasurable and my day is ruined.
One bit or one trit? I am confused!
demo shows a huge love for water, this AI knows its home
``` Ecosystem Services and their impact on the Ecosystem
Ecosystem services refer to the services provided by ecosystems to the human society. These services include water, air, energy, nutrients, and soil (Jenkins, 2010). For instance, water is the most important service provided by an ecosystem and it helps in the conservation of water, irrigation and sanitation (Jenkins, 2010). On the other hand, air provides the oxygen needed for life.
The water cycle is a significant ecosystem service because it involves the cycling of water among the different parts of an ecosystem. It also involves the movement of water through the atmosphere, from one place to another. It is also the process of evaporation and condensation of water from the atmosphere. It also involves the movement of water from the air to the soil and water into the oceans.
The water cycle is a significant ecosystem service because it involves the cycling of water among the different parts of an ecosystem. It also involves the movement of water through the atmosphere, from one place to another. It is also the process of evaporation and condensation of water from the atmosphere. It also involves the movement of water from the air to the soil and water into the oceans. ```
It seems to keep repeating that the water cycle is the main source of energy for all living things on the planet and then citing Jenkins 2010. There are also a ton of sentences beginning with “It also…”
I don’t even think it’s correct. The sun is the main source of energy for most living things but there’s also life near hydrothermal vents etc.
I don’t know who Jenkins is, but this model appears to be very fond of them and the particular fact about water.
I suppose fast and inaccurate is better than slow and inaccurate.
The "new" on huggingface banner has weights that were uploaded 11 months ago, and it's 2B params. Work on this in the repo is 2 years old.
The amount of publicity compared to the anemic delivery for BitNet is impressive.
If Microslop aren't gonna train the model themselves to prove their own thesis, why would others? They've had 2 years (I think?) to prove BitNet in at least some way, are you really saying they haven't tried so far?
Personally, that makes me slightly wary of taking what they say at face value. Why wouldn't they train and publish a model themselves if this actually led to worthwhile results?
So it's not an inference framework for 1-bit models (two states per parameter) but for 1.58-bit models (three states per parameter). Annoying that they try to mix up the two.
I'm hoping that today's complaints are tomorrow's innovations. Back when a 1 MB hard drive was $100,000, or when Gates supposedly said 640 KB was enough.
Perhaps some 'in the (chip) industry' can comment on what RAM manufacturers are doing at the moment - better, faster, larger? Or is there not much headroom left and it's down to MOBO manufacturers, and volume?
You'd still need full GPUs for training, but for inference the hardware would be orders of magnitude simpler than what Nvidia is making
A claude 4.6 they are most certainly not, but if you get through the janky AF software ecosystem they can run small LLMs reasonably well with basically zero CPU/GPU usage
I was under the impression that they were primarily designed for low power use.
I happened to "live" on 7.0-7.5 tok/sec output speed for a while, and it is an annoying experience. It is the equivalent of walking behind someone slightly slower on a sidewalk. I dealt with this by deliberately looking away for a minute until output was "buffered" and only then started reading.
For any local setup I'd try to reach for 10 tok/sec. Sacrifice some kv cache and shove a few more layers on your GPU, it's worth it.
How? They can validate thousands if not millions of queries, but nothing prevents the million-and-first from being a hallucination. People who would then pay extra for an "Encyclopedia Britannica validated LLM" would then, rightfully so IMHO, complain that "it" suggested they cook with a dangerous mushroom.
But it seems the biggest model available is 10B? Somewhat unusual and does make me wonder just how challenging it will be to train any model in the 100B order of magnitude.
In what way? On modern processors, a Fused Multiply-Add (FMA) instruction generally has the exact same execution throughput as a basic addition instruction
For something more practical, you can pack five three-state values within a byte because 3^5 = 243, which is smaller than 256. To unpack, you divide and modulo by 3 five separate times. This encodes data in bytes at 1.6 bits per symbol.
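As a sketch of that pack/unpack scheme (function names are my own):

```python
def pack5(trits):
    """Pack five values from {-1, 0, +1} into one byte, base 3 (3^5 = 243 <= 256)."""
    b = 0
    for t in reversed(trits):
        b = b * 3 + (t + 1)  # map -1/0/+1 to base-3 digits 0/1/2
    return b

def unpack5(b):
    """Recover the five trits with five divide-and-modulo-by-3 steps."""
    trits = []
    for _ in range(5):
        trits.append(b % 3 - 1)
        b //= 3
    return trits
```

Round-tripping any five-trit group returns the original values; e.g. `unpack5(pack5([-1, 0, 1, 1, -1]))` gives `[-1, 0, 1, 1, -1]`.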
But the packing of 5 symbols into a byte was not done here. Instead, they packed 4 symbols into a byte to reduce computational complexity (no unpacking needed)
But that doesn't mean the idea is worthless.
You could have said the same about Transformers: Google released the idea but didn't push it forward, and it turned out to be a great one.
Interestingly, a trit x float multiplier is cheaper than a trit x integer multiplier in hardware if you're willing to ignore things like NaNs.
0 and 1 are trivial: just a mux for identity and zero. But because floats are sign-magnitude, multiplying by -1 is just an inverter on the sign bit, whereas for integers you need a bitwise inverter and a full incrementer.
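To illustrate the bit-level point (a Python sketch, not hardware): negating an IEEE-754 float is a single sign-bit flip, while negating a two's-complement integer is invert-then-increment.

```python
import struct

def f32_negate_via_sign_bit(x):
    """Negate a float32 by XOR-ing only its sign bit (sign-magnitude layout)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits ^ 0x80000000))[0]

def i32_negate_via_invert_increment(x):
    """Negate a 32-bit two's-complement integer: bitwise invert, then add 1."""
    return ((~x) + 1) & 0xFFFFFFFF
```

For example, `f32_negate_via_sign_bit(3.5)` returns `-3.5`, and `i32_negate_via_invert_increment(42)` equals `(-42) & 0xFFFFFFFF`.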
(only suggesting that it's intentional because it's been there so long)
And I don't think that LLM could just Google or check Wikipedia.
But I do agree that this architecture makes a lot of sense. I assume it will become the norm to use such edge LLMs.
If you have an existing network, making an int4 quant is the better tradeoff. 1.58-bit quants only become interesting when you train the model specifically for them.
On the other hand maybe it works much better than expected because llama3 is just a terrible baseline
The key insight of the BitNet paper was that using their custom BitLinear layer instead of normal Linear layers (along with some further training and architecture changes) leads to much, much better results than quantizing an existing model down to 1.58 bits. So you end up doing a full training run in bf16 precision using the specially adapted model architecture.
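As a rough sketch of the training-time weight quantization involved (the b1.58 paper uses absmean scaling; this is a simplified, forward-pass-only illustration with no straight-through estimator):

```python
import numpy as np

def absmean_quantize(W, eps=1e-6):
    """Quantize a weight matrix to scale * {-1, 0, +1}, BitNet b1.58 style."""
    scale = np.abs(W).mean() + eps            # per-tensor absmean scale
    Wq = np.clip(np.round(W / scale), -1, 1)  # ternary weights
    return Wq, scale

# During training, the forward pass uses (Wq * scale) @ x while gradients
# flow to the latent full-precision W via a straight-through estimator.
```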
Try it out via this demo, or build and run it on your own CPU or GPU.
bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU and GPU (NPU support is coming next).
The first release of bitnet.cpp supports inference on CPUs. bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models experiencing greater performance gains. Additionally, it reduces energy consumption by 55.4% to 70.0%, further boosting overall efficiency. On x86 CPUs, speedups range from 2.37x to 6.17x with energy reductions between 71.9% and 82.2%. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. Please refer to the technical report for more details.
Latest optimization introduces parallel kernel implementations with configurable tiling and embedding quantization support, achieving 1.15x to 2.1x additional speedup over the original implementation across different hardware platforms and workloads. For detailed technical information, see the optimization guide.
A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2:
https://github.com/user-attachments/assets/7f46b736-edec-4828-b809-4be780a3e5b1
This project is based on the llama.cpp framework. We would like to thank all the authors for their contributions to the open-source community. Also, bitnet.cpp's kernels are built on top of the Lookup Table methodologies pioneered in T-MAC. For inference of general low-bit LLMs beyond ternary models, we recommend using T-MAC.
| Model | Parameters | CPU | I2_S | TL1 | TL2 |
|---|---|---|---|---|---|
| BitNet-b1.58-2B-4T | 2.4B | x86 | ✅ | ❌ | ✅ |
| | | ARM | ✅ | ✅ | ❌ |
❗️We use existing 1-bit LLMs available on Hugging Face to demonstrate the inference capabilities of bitnet.cpp. We hope the release of bitnet.cpp will inspire the development of 1-bit LLMs in large-scale settings in terms of model size and training tokens.
| Model | Parameters | CPU | I2_S | TL1 | TL2 |
|---|---|---|---|---|---|
| bitnet_b1_58-large | 0.7B | x86 | ✅ | ❌ | ✅ |
| | | ARM | ✅ | ✅ | ❌ |
| bitnet_b1_58-3B | 3.3B | x86 | ❌ | ❌ | ✅ |
| | | ARM | ❌ | ✅ | ❌ |
| Llama3-8B-1.58-100B-tokens | 8.0B | x86 | ✅ | ❌ | ✅ |
| | | ARM | ✅ | ✅ | ❌ |
| Falcon3 Family | 1B-10B | x86 | ✅ | ❌ | ✅ |
| | | ARM | ✅ | ✅ | ❌ |
| Falcon-E Family | 1B-3B | x86 | ✅ | ❌ | ✅ |
| | | ARM | ✅ | ✅ | ❌ |
For Windows users, install Visual Studio 2022. In the installer, toggle on at least the following options (this also automatically installs the required additional tools like CMake):
For Debian/Ubuntu users, you can install with the automatic installation script:
bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"
[!IMPORTANT] If you are using Windows, please remember to always use a Developer Command Prompt / PowerShell for VS2022 for the following commands. Please refer to the FAQs below if you see any issues.
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
# (Recommended) Create a new conda environment
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp
pip install -r requirements.txt
# Manually download the model and run with local path
huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
usage: setup_env.py [-h] [--hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}] [--model-dir MODEL_DIR] [--log-dir LOG_DIR] [--quant-type {i2_s,tl1}] [--quant-embd]
[--use-pretuned]
Setup the environment for running inference
optional arguments:
-h, --help show this help message and exit
--hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}, -hr {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}
Model used for inference
--model-dir MODEL_DIR, -md MODEL_DIR
Directory to save/load the model
--log-dir LOG_DIR, -ld LOG_DIR
Directory to save the logging info
--quant-type {i2_s,tl1}, -q {i2_s,tl1}
Quantization type
--quant-embd Quantize the embeddings to f16
--use-pretuned, -p Use the pretuned kernel parameters
# Run inference with the quantized model
python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv
usage: run_inference.py [-h] [-m MODEL] [-n N_PREDICT] -p PROMPT [-t THREADS] [-c CTX_SIZE] [-temp TEMPERATURE] [-cnv]
Run inference
optional arguments:
-h, --help show this help message and exit
-m MODEL, --model MODEL
Path to model file
-n N_PREDICT, --n-predict N_PREDICT
Number of tokens to predict when generating text
-p PROMPT, --prompt PROMPT
Prompt to generate text from
-t THREADS, --threads THREADS
Number of threads to use
-c CTX_SIZE, --ctx-size CTX_SIZE
Size of the prompt context
-temp TEMPERATURE, --temperature TEMPERATURE
Temperature, a hyperparameter that controls the randomness of the generated text
-cnv, --conversation Whether to enable chat mode or not (for instruct models.)
(When this option is turned on, the prompt specified by -p will be used as the system prompt.)
We provide scripts to run the inference benchmark with a given model.
usage: e2e_benchmark.py -m MODEL [-n N_TOKEN] [-p N_PROMPT] [-t THREADS]
Setup the environment for running the inference
required arguments:
-m MODEL, --model MODEL
Path to the model file.
optional arguments:
-h, --help
Show this help message and exit.
-n N_TOKEN, --n-token N_TOKEN
Number of generated tokens.
-p N_PROMPT, --n-prompt N_PROMPT
Prompt to generate text from.
-t THREADS, --threads THREADS
Number of threads to use.
Here's a brief explanation of each argument:
-m, --model: The path to the model file. This is a required argument that must be provided when running the script.
-n, --n-token: The number of tokens to generate during the inference. It is an optional argument with a default value of 128.
-p, --n-prompt: The number of prompt tokens to use for generating text. This is an optional argument with a default value of 512.
-t, --threads: The number of threads to use for running the inference. It is an optional argument with a default value of 2.
-h, --help: Show the help message and exit. Use this argument to display usage information.

For example:
python utils/e2e_benchmark.py -m /path/to/model -n 200 -p 256 -t 4
This command would run the inference benchmark using the model located at /path/to/model, generating 200 tokens from a 256 token prompt, utilizing 4 threads.
For model layouts not supported by any public model, we provide scripts to generate a dummy model with the given layout and run the benchmark on your machine:
python utils/generate-dummy-bitnet-model.py models/bitnet_b1_58-large --outfile models/dummy-bitnet-125m.tl1.gguf --outtype tl1 --model-size 125M
# Run benchmark with the generated model; use -m to specify the model path, -p to specify the number of prompt tokens to process, -n to specify the number of tokens to generate
python utils/e2e_benchmark.py -m models/dummy-bitnet-125m.tl1.gguf -p 512 -n 128
.safetensors Checkpoints

# Prepare the .safetensors model file
huggingface-cli download microsoft/bitnet-b1.58-2B-4T-bf16 --local-dir ./models/bitnet-b1.58-2B-4T-bf16
# Convert to gguf model
python ./utils/convert-helper-bitnet.py ./models/bitnet-b1.58-2B-4T-bf16
A: This is an issue introduced in a recent version of llama.cpp. Please refer to this commit in the discussion to fix the issue.
A: Before building the project, verify your clang installation and access to Visual Studio tools by running:
clang -v
This command checks that you are using the correct version of clang and that the Visual Studio tools are available. If you see an error message such as:
'clang' is not recognized as an internal or external command, operable program or batch file.
It indicates that your command line window is not properly initialized for Visual Studio tools.
• If you are using Command Prompt, run:
"C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\Tools\VsDevCmd.bat" -startdir=none -arch=x64 -host_arch=x64
• If you are using Windows PowerShell, run the following commands:
Import-Module "C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\Tools\Microsoft.VisualStudio.DevShell.dll"
Enter-VsDevShell 3f0e31ad -SkipAutomaticLocation -DevCmdArguments "-arch=x64 -host_arch=x64"
These steps will initialize your environment and allow you to use the correct Visual Studio tools.
However, this user uses — in almost all his posts, and he was posting at a rate of about one comment per minute across multiple different topics.
It's not a question of if there are other bots out there, but only what % of comments on HN right now and elsewhere are bot generated. That number is only going to increase if nothing is done.
A successful ternary model would basically erase all that value overnight. In fact, the entire stock market could crash!
Think about it: This is Microsoft we're talking about! They're a convicted monopolist that has a history of manipulating the market for IT goods and services. I wouldn't put it past them to refuse to invest in training a ternary model or going so far as to buy up ternary startups just to shut them down.
Want to make some easy money: Start a business training a ternary model and make an offer to Microsoft. I bet they'll buy you out for at least a few million even if you don't have a product yet!
So it's not that individual ops are faster — it's that the packed representation lets each instruction do more useful work, and you're moving far less data from memory to do it.
>packed 4 symbols into a byte
microslop, typical bunch of two-bit frauds!
While I understand some of the fundamental thoughts behind that comparison, it's slightly wonky... I'm not asking "compress wikipedia really well", but instead "can a 'model' reason its way through wikipedia" (and what does that reasoning look like?).
Theoretically with wikipedia-multi-lang you should be able to reasonably nail machine-translation, but if everyone is starting with "only wikipedia" then how well can they keep up with the wild-web-trained models on similar bar chart per task performance?
If your particular training technique (using only wikipedia) can go from 60% of SOTA to 80% of SOTA on "Explain why 6-degrees of Kevin Bacon is relevant for tensor operations" (which is interesting to plug into Google's AI => Dive Deeper...), then that's a clue that it's not just throwing piles of data at the problem, but instead getting closer to extracting the deeper meaning (and/or reasoning!) that the data enables.
Maybe not crawl the web, but hit a service with pre-hosted, precurated content it can digest (and cache) that doesn't necessarily change often. You aren't necessarily using it for the latest news, and programming is mostly static knowledge, as a good example.
Edit: oh, just recalled dang restricted Show HNs the other day to only non-new users (possibly with some other thresholds). I wonder if word got out and some are filling accounts with activity.
Typically, for 1-bit matmul, you can get away with XORs and popcounts, which should have a better throughput profile than FMA when taking into account the SIMD nature of the inputs/outputs.
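A toy sketch of that trick (my own encoding: bit i set means element i is +1, clear means -1). The dot product of two ±1 vectors is `n - 2 * popcount(a XOR b)`, since XOR marks exactly the positions that disagree:

```python
def pm1_dot(a_bits, b_bits, n):
    """Dot product of two length-n vectors over {-1, +1}, packed one bit per element."""
    mismatches = bin((a_bits ^ b_bits) & ((1 << n) - 1)).count("1")  # popcount
    return n - 2 * mismatches  # agreements contribute +1, disagreements -1

# [+1, -1, +1, +1] -> 0b1101, [+1, +1, -1, +1] -> 0b1011 (bit 0 = element 0)
print(pm1_dot(0b1101, 0b1011, 4))  # 0
```

In real SIMD kernels the same identity is applied 128 or 256 bits at a time with vector XOR and population-count instructions.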
I don't think you can, Google looked at the research results, and continued researching Transformers and related technologies, because they saw the value for it particularly in translations. It's part of the original paper, what direction to take, give it a read, it's relatively approachable for being a machine learning paper :)
Sure, it took OpenAI to make it into an "assistant" that answered questions, but it's not like Google was completely sleeping on the Transformer, they just had other research directions to go into first.
> But it doesn't mean, idea is worthless.
I agree, they aren't; hope that wasn't what my message read as :) But ideas that don't actually pan out in reality are slightly less useful than ideas that do pan out once put into practice. The root commenter seems to be saying "This is a great idea, it's all ready, the only missing piece is for someone to do the training and it'll pan out!", which I'm a bit skeptical about, since it's been two years since they introduced the idea.
> We evaluated bitnet.cpp in terms of both inference speed and energy cost. Comprehensive tests were conducted on models with various parameter sizes, ranging from 125M to 100B. Specific configurations for each model are detailed in Appendix A.
Except on GSM8K and math...
I suspect that they are trying to fake engagement prior to making their first "show" post as well.
Residential Treatment Facility for Adults? Red Tail Flight Academy?
And I did not speak out
Because I was not using em dashes
Then they claimed that if you're crammar is to gud you r not hmuan
And I did not spek aut
Because mi gramar sukcs
Then they claimed that if you actually read the article that you are trying to discuss you are not human...
The core insight necessary for chatgpt was not scaling (that was already widely accepted): the insight was that instead of finetuning for each individual task, you can finetune once for the meta-task of instruction following, which brings a problem specification directly into the data stream.
Not only are we losing the ability to communicate clearly without the assistance of computers, those who can are being punished for it.
The resulting confusion and frustration will make it much harder for most people to separate signal from noise.
Boy would I love to give my agent access to my Quickbooks. They pushed out an incomplete MCP and haven't touched it since.
actv = A[_:1] & B[_:1]   # bitplane 1: both operands nonzero
sign = A[_:0] ^ B[_:0]   # bitplane 0: product negative iff signs differ
dot = pop_count(actv & ~sign) - pop_count(actv & sign)
It can probably be made more efficient by taking a column-first format.
Since we are in CPU land, we mostly deal with dot products that fit the cache size; I don't assume we have a tiled matmul instruction, and one would be unlikely to support this weird 1-bit format anyway.
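The bitplane scheme sketched above can be made concrete in Python (my own encoding helpers; `_:1` is the "nonzero" plane, `_:0` the sign plane):

```python
def encode(v):
    """Split a {-1, 0, +1} vector into two bitplanes: (nonzero, sign)."""
    nz = sign = 0
    for i, t in enumerate(v):
        if t != 0:
            nz |= 1 << i
        if t < 0:
            sign |= 1 << i
    return nz, sign

def ternary_dot(a, b):
    """Popcount-based dot product of two ternary vectors."""
    mask = (1 << len(a)) - 1
    (a_nz, a_s), (b_nz, b_s) = encode(a), encode(b)
    actv = a_nz & b_nz          # positions where both operands are nonzero
    sign = (a_s ^ b_s) & mask   # product negative iff exactly one sign bit set
    pos = bin(actv & ~sign & mask).count("1")
    neg = bin(actv & sign).count("1")
    return pos - neg
```

For example, `ternary_dot([1, -1, 0, 1], [-1, -1, 1, 1])` returns `1` (elementwise products -1, +1, 0, +1).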