To add some numbers, on MBP M1 64GB with ggml-org/gemma-3-4b-it-GGUF I get

  25t/s prompt processing 
  63t/s token generation

Overall processing time per image is ~15secs, no matter what size the image is. The small 4B has already very decent output, describing different images pretty well.

Steps to reproduce:

  git clone https://github.com/ggml-org/llama.cpp.git
  cmake -B build
  cmake --build build --config Release -j 12 --clean-first
  # download model and mmproj files...
  build/bin/llama-server \
    --model gemma-3-4b-it-Q4_K_M.gguf \
    --mmproj mmproj-model-f16.gguf

Then open http://127.0.0.1:8080/ for the web interface

Note: if you are not using -hf, you must include the --mmproj switch or otherwise the web interface gives an error message that multimodal is not supported by the model.

I have used the official ggml-org/gemma-3-4b-it-GGUF quants, I expect the unsloth quants from danielhanchen to be a bit faster.

To add some numbers, on MBP M1 64GB with ggml-org/gemma-3-4b-it-GGUF I get

  25t/s prompt processing 
  63t/s token generation

Overall processing time per image is ~15secs, no matter what size the image is. The small 4B has already very decent output, describing different images pretty well.

Steps to reproduce:

  git clone https://github.com/ggml-org/llama.cpp.git
  cmake -B build
  cmake --build build --config Release -j 12 --clean-first
  # download model and mmproj files...
  build/bin/llama-server \
    --model gemma-3-4b-it-Q4_K_M.gguf \
    --mmproj mmproj-model-f16.gguf

Then open http://127.0.0.1:8080/ for the web interface

Note: if you are not using -hf, you must include the --mmproj switch or otherwise the web interface gives an error message that multimodal is not supported by the model.

I have used the official ggml-org/gemma-3-4b-it-GGUF quants, I expect the unsloth quants from danielhanchen to be a bit faster.

It works super well!

You'll have to compile llama.cpp from source, and you should get a llama-mtmd-cli program.

I made some quants with vision support - literally run:

./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl -1

./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-12b-it-GGUF:Q4_K_XL -ngl -1

./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-27b-it-GGUF:Q4_K_XL -ngl -1

./llama.cpp/llama-mtmd-cli -hf unsloth/unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_XL -ngl -1

Then load the image with /image image.png inside the chat, and chat away!

EDIT: -ngl -1 is not needed anymore for Metal backends (CUDA still yes) (llama.cpp will auto offload to the GPU by default!). -1 means all GPU layers offloaded to the GPU.

We also support SmolVLM series which delivers light-speed response thanks to its mini size!

This is perfect for real-time home video surveillance system. That's one of the ideas for my next hobby project!

    llama-server -hf ggml-org/SmolVLM-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM-256M-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM-500M-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF

This is the most useful documentation I've found so far to help understand how this works: https://github.com/ggml-org/llama.cpp/tree/master/tools/mtmd...

I used this to create keywords and descriptions on a bunch of photos from a trip recently using Gemma3 4b. Works impressively well, including going doing basic OCR to give me summaries of photos of text, and picking up context clues to figure out where many of the pictures were taken.

Very nice for something that's self hosted.

llama.cpp offers compiled releases for multiple platforms. This release has the new vision features: https://github.com/ggml-org/llama.cpp/releases/tag/b5332

On macOS I downloaded the llama-b5332-bin-macos-arm64.zip file and then had to run this to get it to work:

  unzip llama-b5332-bin-macos-arm64.zip
  cd build/bin
  sudo xattr -rd com.apple.quarantine llama-server llama-mtmd-cli *.dylib

Then I could run the interactive terminal (with a 3.2GB model download) like this (borrowing from https://news.ycombinator.com/item?id=43943370R)

  ./llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99

Or start the localhost 8080 web server (with a UI and API) like this:

  ./llama-server -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99

I wrote up some more detailed notes here: https://simonwillison.net/2025/May/10/llama-cpp-vision/

What has changed in laymans terms? I tried llama.cpp a few months ago and it could already do image description etc?

How does this compare to using a multimodal model like gemma3 via ollama?

Any benefit on a Mac with apple silicon? Any experiences someone could share?

Are there any tools that leverage vision for UI development?

Use case: I am working on a hobby project that uses TS/React as frontend. I can use local or cloud LLMs in VSCode but even those with vision require that I take a screenshot and paste it to a chat. Ideally, I would want it all automated until some stop criterion is met (even if only n-iterations). But even an extension that would screenshot a preview and paste it to chat (triggered by a keyboard shortcut) would be a big time-saver.

This is excellent. I've been pulling and rebuilding periodically, and watching the commit notes as they (mostly ngxson, I think) first added more vision models, each with their own CLI program, then unified those under a single CLI program and deprecated the standalone one, while bug fixing and improving the image processing. I'd been hoping that meant they'd eventually add support to the server again, and now it's here! Thanks!

Seems like another step change. The first time I ran a local LLM on my phone and carried on a fairly coherent conversation, I imagined edge inference would take off really quickly at least with e.g. personal assistant/"digital waifu" business cases. I wonder what the next wave of apps built on Llama.cpp and its downstream technologies will do to the global economy in the next three months.

Finally! Open source multimodal is so far behind closed source options that people don’t even try to benchmark

They’re still doing text and math tests on every new model because it’s so bad

didn't llama.cpp use to have vision support last year or so?

Is it possible to run multimodal LLMs using their Vulkan backend? I have a ton of 4gb gpus laying around that only support vulkan.

It was really sad when vision was removed back a while ago. It's great to see it restored. Many thanks to everyone involved!

so image processing there but image generation isn't ?

just trying to understand, awesome work so far.

great news ! sidenote : Does vision include the ability to read a pdf ?

Didn't we already have vision via llava?

finally! very important use-case! glad they added it!

Someone ELI5 please or tldr

Man, the ngl abbreviation gets me every time too. Kinda cool seeing all the tweaks folks do to make this stuff run faster on their Macs. You think models hitting these speed boosts will mean more people start playing with vision stuff at home?

For every image I try, I get the same response:

> This image shows a diverse group of people in various poses, including a man wearing a hat, a woman in a wheelchair, a child with a large head, a man in a suit, and a woman in a hat.

No, none of these things are in the images.

I don't even know how to begin debugging that.

hmm, I'm getting the same results - but I see on M1 with a 7b model we should expect ~10x faster prompt processing

https://github.com/ggml-org/llama.cpp/discussions/4167

I wonder if it's the encoder that isn't optimized?

Are those numbers for the 4/8 bit quants or the full fp16?

do you have any example images it generated based on your prompts?

want to have a look before I try

What has changed in laymans terms? I tried llama.cpp a few months ago and it could already do image description etc?

Are there any tools that leverage vision for UI development?

For every image I try, I get the same response:

> This image shows a diverse group of people in various poses, including a man wearing a hat, a woman in a wheelchair, a child with a large head, a man in a suit, and a woman in a hat.

No, none of these things are in the images.

I don't even know how to begin debugging that.

I get the same as well, instead I get this message, no matter which image I upload: "This is a humorous meme that uses the phrase "one does not get it" in a mocking way. It's a joke about people getting frustrated when they don’t understand the context of a joke or meme."

Not sure why it's not working

Means it can't see the actual image. It's not loading for some reason.

It works super well!

You'll have to compile llama.cpp from source, and you should get a llama-mtmd-cli program.

I made some quants with vision support - literally run:

./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl -1

./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-12b-it-GGUF:Q4_K_XL -ngl -1

./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-27b-it-GGUF:Q4_K_XL -ngl -1

./llama.cpp/llama-mtmd-cli -hf unsloth/unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_XL -ngl -1

Then load the image with /image image.png inside the chat, and chat away!

EDIT: -ngl -1 is not needed anymore for Metal backends (CUDA still yes) (llama.cpp will auto offload to the GPU by default!). -1 means all GPU layers offloaded to the GPU.

If it helps, I updated https://docs.unsloth.ai/basics/gemma-3-how-to-run-and-fine-t... to show you can use llama-mtmd-cli directly - it should work for Mistral Small as well

If you install llama.cpp via Homebrew, llama-mtmd-cli is already included. So you can simply run `llama-mtmd-cli <args>`

Ok it's actually better to use -ngl 99 and not -ngl -1. -1 might or might not work!

I can't see the letters "ngl" anymore without wanting to punch something.

We also support SmolVLM series which delivers light-speed response thanks to its mini size!

This is perfect for real-time home video surveillance system. That's one of the ideas for my next hobby project!

    llama-server -hf ggml-org/SmolVLM-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM-256M-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM-500M-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF

I've been noticing your commits as I skim the latest git commit notes whenever I periodically pull and rebuild. Thank you for all your work on this (and llama.cpp in general)!

Thanks for landing the mtmd functionality in the server. Like the other commenter I kept poring over commits in anticipation.

This is the most useful documentation I've found so far to help understand how this works: https://github.com/ggml-org/llama.cpp/tree/master/tools/mtmd...

Very nice for something that's self hosted.

llama.cpp offers compiled releases for multiple platforms. This release has the new vision features: https://github.com/ggml-org/llama.cpp/releases/tag/b5332

On macOS I downloaded the llama-b5332-bin-macos-arm64.zip file and then had to run this to get it to work:

  unzip llama-b5332-bin-macos-arm64.zip
  cd build/bin
  sudo xattr -rd com.apple.quarantine llama-server llama-mtmd-cli *.dylib

Then I could run the interactive terminal (with a 3.2GB model download) like this (borrowing from https://news.ycombinator.com/item?id=43943370R)

  ./llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99

Or start the localhost 8080 web server (with a UI and API) like this:

  ./llama-server -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99

I wrote up some more detailed notes here: https://simonwillison.net/2025/May/10/llama-cpp-vision/

How does this compare to using a multimodal model like gemma3 via ollama?

Any benefit on a Mac with apple silicon? Any experiences someone could share?

Finally! Open source multimodal is so far behind closed source options that people don’t even try to benchmark

They’re still doing text and math tests on every new model because it’s so bad

It was really sad when vision was removed back a while ago. It's great to see it restored. Many thanks to everyone involved!

finally! very important use-case! glad they added it!

Someone ELI5 please or tldr

hmm, I'm getting the same results - but I see on M1 with a 7b model we should expect ~10x faster prompt processing

https://github.com/ggml-org/llama.cpp/discussions/4167

I wonder if it's the encoder that isn't optimized?

If it helps, I updated https://docs.unsloth.ai/basics/gemma-3-how-to-run-and-fine-t... to show you can use llama-mtmd-cli directly - it should work for Mistral Small as well

Is there a simple GUI available for running LLaMA on my desktop that I can access from my laptop?

If you install llama.cpp via Homebrew, llama-mtmd-cli is already included. So you can simply run `llama-mtmd-cli <args>`

Ok it's actually better to use -ngl 99 and not -ngl -1. -1 might or might not work!

I can't see the letters "ngl" anymore without wanting to punch something.

That's your problem. Hope you do something about that pent up aggressivity.

Oh it's shorthand for number of layers to offload to the GPU for faster inference :) but yes it's probs not the best abbreviation.

I've been noticing your commits as I skim the latest git commit notes whenever I periodically pull and rebuild. Thank you for all your work on this (and llama.cpp in general)!

Thanks for landing the mtmd functionality in the server. Like the other commenter I kept poring over commits in anticipation.

Ok but what's the quality of the high speed response? Can the sub-2.2B ones output a coherent sentence?

It’s interesting that they decided to move all of the architecture-specific image-to-embedding preprocessing into a separate library.

Similar to how we ended up with the huggingface/tokenizers library for text-only Tranformers.

That's pretty neat. Do you essentially loop over a list of images and run the prompt for each, then store the result somewhere (metadata, sqlite)?

is gemma 4b good enough for this? I was playing with larger versions of gemma because I didn't think 4b would be any good.

For brew users, you can specify --HEAD when installing the package. This way, brew will automatically build the latest master branch.

Btw, the brew version will be updated in the next few hours, so after that you will be able to simply "brew upgrade llama.cpp" and you will be good to go!

I'm also extremely pleased with convert_hf_to_gguf.py --mmproj - it makes quant making much simpler for any vision model!

Llama-server allowing vision support is definitely super cool - was waiting for it for a while!

And btw, -ngl is automatically set to max value now, you don't need to -ngl 99 anymore!

Edit: sorry this is only true on Metal. For CUDA or other GPU backends, you still need to manually specify -ngl

Two things:

1. Because the support in llama.cpp is horizontal integrated within ggml ecosystem, we can optimize it to run even faster than ollama.

For example, pixtral/mistral small 3.1 model has some 2D-RoPE trick that use less memory than ollama's implementation. Same for flash attention (which will be added very soon), it will allow vision encoder to run faster while using less memory.

2. llama.cpp simply support more models than ollama. For example, ollama does not support either pixtral or smolvlm

The “global economy in three month is writing some checks that I don’t know all of the recent AI craze has been able to cash in three years.

didn't llama.cpp use to have vision support last year or so?

Yes they always did, but they moved it all into 1 umbrella called "llama-mtmd-cli"!

Yes, but this is generalized so it was able to be added to the llama-server GUI as well.

Is it possible to run multimodal LLMs using their Vulkan backend? I have a ton of 4gb gpus laying around that only support vulkan.

Yes, llama.cpp has very good Vulkan support.

so image processing there but image generation isn't ?

just trying to understand, awesome work so far.

As far as I'm aware there are no open source LLMs that can generate images. There's image generation models like Stable Diffusion but those are not transformer language models so they'd be out of scope for the project

Do the underlying models support generation? If the support isn't there to begin with, the llama.cpp folks can't do anything about that.

Generating images using chat seems cumbersome when you can do it directly with something like stable diffusion

great news ! sidenote : Does vision include the ability to read a pdf ?

Vision = visual, while PDF is a container of sorts, usually containing images and text. So I guess the short answer is: 50% yes, the other part you can use any LLM for.

Didn't we already have vision via llava?

no, it did not work in llama.cpp

For sure! Llama.cpp runs great on my 10 year old pc and m1 mac!

Not sure why it's not working

Ok, following the following comment in this thread fixed the issue: https://news.ycombinator.com/item?id=43943624

Are those numbers for the 4/8 bit quants or the full fp16?

It is a 4-bit quant gemma-3-4b-it-Q4_K_M.gguf. I just use "describe" as prompt or "short description" if I want less verbose output.

As you are a photographer, using a picture from your website gemma 4b produces the following:

"A stylish woman stands in the shade of a rustic wooden structure, overlooking a landscape of rolling hills and distant mountains. She is wearing a flowing, patterned maxi dress with a knotted waist and strappy sandals. The overall aesthetic is warm, summery, and evokes a sense of relaxed elegance."

This description is pretty spot on.

The picture I used is from the series L'Officiel.02 (L-officel_lanz_08_1369.jpg) from zamadatix' website.

n.b. the image processing is by a separate model, basically has to load the image and generate ~1000 tokens

(source: vision was available in llama.cpp but Very Hard, been maintaining an implementation)

(n.b. it's great work, extremely welcome, and new in that the vision code badly needed a rebase and refactoring after a year or two of each model adding in more stuff)

do you have any example images it generated based on your prompts?

want to have a look before I try

To be clear, this model isn't generating images, it's describing images that are sent to it.

Yes they always did, but they moved it all into 1 umbrella called "llama-mtmd-cli"!

Yes, but this is generalized so it was able to be added to the llama-server GUI as well.

Yes, llama.cpp has very good Vulkan support.

Do the underlying models support generation? If the support isn't there to begin with, the llama.cpp folks can't do anything about that.

Generating images using chat seems cumbersome when you can do it directly with something like stable diffusion

For sure! Llama.cpp runs great on my 10 year old pc and m1 mac!

Ok, following the following comment in this thread fixed the issue: https://news.ycombinator.com/item?id=43943624

Multimodal

llama.cpp supports multimodal input via libmtmd. Currently, there are 2 tools support this feature:

llama-mtmd-cli
llama-server via OpenAI-compatible /chat/completions API

Currently, we support image and audio input. Audio is highly experimental and may have reduced quality.

To enable it, you can use one of the 2 methods below:

Use -hf option with a supported model (see a list of pre-quantized model below)
- To load a model using -hf while disabling multimodal, use --no-mmproj
- To load a model using -hf while using a custom mmproj file, use --mmproj local_file.gguf
Use -m model.gguf option with --mmproj file.gguf to specify text and multimodal projector respectively

By default, multimodal projector will be offloaded to GPU. To disable this, add --no-mmproj-offload

For example:

# simple usage with CLI
llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF

# simple usage with server
llama-server -hf ggml-org/gemma-3-4b-it-GGUF

# using local file
llama-server -m gemma-3-4b-it-Q4_K_M.gguf --mmproj mmproj-gemma-3-4b-it-Q4_K_M.gguf

# no GPU offload
llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj-offload

[!IMPORTANT]

OCR models are trained with specific prompt and input structure, please refer to these discussions for more info:

PaddleOCR-VL: https://github.com/ggml-org/llama.cpp/pull/18825

GLM-OCR: https://github.com/ggml-org/llama.cpp/pull/19677

Deepseek-OCR: https://github.com/ggml-org/llama.cpp/pull/17400

Dots.OCR: https://github.com/ggml-org/llama.cpp/pull/17575

HunyuanOCR: https://github.com/ggml-org/llama.cpp/pull/21395

Pre-quantized models

These are ready-to-use models, most of them come with Q4_K_M quantization by default. They can be found at the Hugging Face page of the ggml-org: https://huggingface.co/collections/ggml-org/multimodal-ggufs-68244e01ff1f39e5bebeeedc

Replaces the (tool_name) with the name of binary you want to use. For example, llama-mtmd-cli or llama-server

NOTE: some models may require large context window, for example: -c 8192

Vision models:

# Gemma 3
(tool_name) -hf ggml-org/gemma-3-4b-it-GGUF
(tool_name) -hf ggml-org/gemma-3-12b-it-GGUF
(tool_name) -hf ggml-org/gemma-3-27b-it-GGUF

# SmolVLM
(tool_name) -hf ggml-org/SmolVLM-Instruct-GGUF
(tool_name) -hf ggml-org/SmolVLM-256M-Instruct-GGUF
(tool_name) -hf ggml-org/SmolVLM-500M-Instruct-GGUF
(tool_name) -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
(tool_name) -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
(tool_name) -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF

# Pixtral 12B
(tool_name) -hf ggml-org/pixtral-12b-GGUF

# Qwen 2 VL
(tool_name) -hf ggml-org/Qwen2-VL-2B-Instruct-GGUF
(tool_name) -hf ggml-org/Qwen2-VL-7B-Instruct-GGUF

# Qwen 2.5 VL
(tool_name) -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF
(tool_name) -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF
(tool_name) -hf ggml-org/Qwen2.5-VL-32B-Instruct-GGUF
(tool_name) -hf ggml-org/Qwen2.5-VL-72B-Instruct-GGUF

# Mistral Small 3.1 24B (IQ2_M quantization)
(tool_name) -hf ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF

# InternVL 2.5 and 3
(tool_name) -hf ggml-org/InternVL2_5-1B-GGUF
(tool_name) -hf ggml-org/InternVL2_5-4B-GGUF
(tool_name) -hf ggml-org/InternVL3-1B-Instruct-GGUF
(tool_name) -hf ggml-org/InternVL3-2B-Instruct-GGUF
(tool_name) -hf ggml-org/InternVL3-8B-Instruct-GGUF
(tool_name) -hf ggml-org/InternVL3-14B-Instruct-GGUF

# Llama 4 Scout
(tool_name) -hf ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF

# Moondream2 20250414 version
(tool_name) -hf ggml-org/moondream2-20250414-GGUF

# Gemma 4
(tool_name) -hf ggml-org/gemma-4-E2B-it-GGUF
(tool_name) -hf ggml-org/gemma-4-E4B-it-GGUF
(tool_name) -hf ggml-org/gemma-4-26B-A4B-it-GGUF
(tool_name) -hf ggml-org/gemma-4-31B-it-GGUF

Audio models:

# Ultravox 0.5
(tool_name) -hf ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF
(tool_name) -hf ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF

# Qwen2-Audio and SeaLLM-Audio
# note: no pre-quantized GGUF this model, as they have very poor result
# ref: https://github.com/ggml-org/llama.cpp/pull/13760

# Mistral's Voxtral
(tool_name) -hf ggml-org/Voxtral-Mini-3B-2507-GGUF

# Qwen3-ASR
(tool_name) -hf ggml-org/Qwen3-ASR-0.6B-GGUF
(tool_name) -hf ggml-org/Qwen3-ASR-1.7B-GGUF

Mixed modalities:

# Qwen2.5 Omni
# Capabilities: audio input, vision input
(tool_name) -hf ggml-org/Qwen2.5-Omni-3B-GGUF
(tool_name) -hf ggml-org/Qwen2.5-Omni-7B-GGUF

# Qwen3 Omni
# Capabilities: audio input, vision input
(tool_name) -hf ggml-org/Qwen3-Omni-30B-A3B-Instruct-GGUF
(tool_name) -hf ggml-org/Qwen3-Omni-30B-A3B-Thinking-GGUF

# Gemma 4
# Capabilities: audio input, vision input
(tool_name) -hf ggml-org/gemma-4-E2B-it-GGUF
(tool_name) -hf ggml-org/gemma-4-E4B-it-GGUF

Finding more models:

GGUF models on Huggingface with vision capabilities can be found here: https://huggingface.co/models?pipeline_tag=image-text-to-text&sort=trending&search=gguf

It is a 4-bit quant gemma-3-4b-it-Q4_K_M.gguf. I just use "describe" as prompt or "short description" if I want less verbose output.

As you are a photographer, using a picture from your website gemma 4b produces the following:

This description is pretty spot on.

The picture I used is from the series L'Officiel.02 (L-officel_lanz_08_1369.jpg) from zamadatix' website.

I'm can neither claim to be a photographer nor that https://www.dansmithphotography.com/ my website, but I appreciate the example! The specific photo for other's reference, based on the filename: https://payload.cargocollective.com/1/15/509333/14386490/L-o...

That said I'm not as impressed of the description. The structure has some wood but it's certainly not just wooden, there are distant mountains but not much in the way of rolling hills to speak of. The dress is flowing but the waist is not knotted - the more striking note might have been the sleeves.

For 4 GB of model I'm not going to ding it too badly though. The question on which quant was mainly around the tokens/second angle (q4 requires 1/4th the memory bandwidth as the full model would) rather than quality angle. As a note: a larger multimodal model gets all of these points accurately (e.g. "wooden and stone rustic structure"), they aren't just things I noted myself.

n.b. the image processing is by a separate model, basically has to load the image and generate ~1000 tokens

(source: vision was available in llama.cpp but Very Hard, been maintaining an implementation)

(n.b. it's great work, extremely welcome, and new in that the vision code badly needed a rebase and refactoring after a year or two of each model adding in more stuff)

wait sorry, can you explain how this works? I thought gemma3 used siglip, which can output all 256 embeddings in parallel

(also, would you mind sharing a code pointer if you have any handy? I found this https://github.com/ggml-org/llama.cpp/blob/master/tools/mtmd... but not sure if that's the codepath taken)

To be clear, this model isn't generating images, it's describing images that are sent to it.

Is there a simple GUI available for running LLaMA on my desktop that I can access from my laptop?

If you are on a Mac, give https://recurse.chat/ a try. As simple as download the model and start chatting. Just added the new multimodal support in LLaMA.cpp.

Give https://docs.openwebui.com/ a look, you'll be able to access it by using your desktops IP while on your laptop (providing you're on the same network).

isnt that ollama + any client supporting it?

using tailscale for the internal network works really well

That's your problem. Hope you do something about that pent up aggressivity.

Oh it's shorthand for number of layers to offload to the GPU for faster inference :) but yes it's probs not the best abbreviation.

It probably isn't, not gonna lie.

The “global economy in three month is writing some checks that I don’t know all of the recent AI craze has been able to cash in three years.

AI is fundamentally learning the entire conditional probability distribution of our collective knowledge; but sampling it over and over is not going to fundamentally enhance it, except to, perhaps, reinforce a mean, or surface places we have insufficiently sampled. For me, even the deep research agents aren't the best when it comes to surfacing truth, because the nuance of that is lost on the distribution.

I think that if we're realistic with ourselves, AI will become exponentially more expensive to train, but without additional high quality data (not you, synthetic data), we're back to 1980s era AI (expert systems), just with enhanced fossil fuel usage to keep up with the TPUs. What's old is new again, I suppose!

I sincerely hope to be proven wrong, of course, but I think recent AI innovation has stagnated in terms of new things it can do. It's a great tool, when you use it to leverage that distribution (eg, semantic search), but it might not fundamentally be the approach to AGI (unless your goal is to replicate what we can, but less spikey)

Vision = visual, while PDF is a container of sorts, usually containing images and text. So I guess the short answer is: 50% yes, the other part you can use any LLM for.

i'm asking because openai api has a special endpoint to deal with pdf, different from images.

Which part of a pdf file can you use LLMs for ? Pdf is a binary format..

no, it did not work in llama.cpp

Slight correction: It worked in llama.cpp via the CLI tools, but not in the llama-server (OpenAI API compatible interface).

I remember it distinctly working.

Means it can't see the actual image. It's not loading for some reason.

I’m having a hard time imagining how failure to see an image would result in such a misleadingly specific wrong output instead of e.g. “nothing” or “it’s nonsense with no significant visual interpretation”. That sounds awful to work with.