Note that this is not the only way to run Qwen 3.5 397B on consumer devices, there are excellent ~2.5 BPW quants available that make it viable for 128G devices.

I've had great success (~20 t/s) running it on a M1 Ultra with room for 256k context. Here are some lm-evaluation-harness results I ran against it:

    mmlu: 87.86%

    gpqa diamond: 82.32%

    gsm8k: 86.43%

    ifeval: 75.90%

More details of my experience:

- https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discu...

- https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8...

Overall an excellent model to have for offline inference.

Note that this is not the only way to run Qwen 3.5 397B on consumer devices, there are excellent ~2.5 BPW quants available that make it viable for 128G devices.

I've had great success (~20 t/s) running it on a M1 Ultra with room for 256k context. Here are some lm-evaluation-harness results I ran against it:

    mmlu: 87.86%

    gpqa diamond: 82.32%

    gsm8k: 86.43%

    ifeval: 75.90%

More details of my experience:

- https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discu...

- https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8...

Overall an excellent model to have for offline inference.

Reading the details, he is using 2-bit quantization and reduced the number of experts per token from 10 down to 4 to get 5 tokens/sec. Cool proof of concept but it’s far from the quality and performance of the 397B model as normally used. Dropping the number of experts is particularly misleading.

This is some interesting work, but applying such extreme measures to LLMs to get them to run severely degrades quality. I know he claims negligible quality loss, but in my experience 2-bit quantizations are completely useless for real work. You can get them to respond to prompts, but they lose their intelligence and will go around in circles.

He also shows 5-6 tokens per second. Again that’s impressive for a large model on limited hardware but it’s very slow. Between the severely degraded model abilities and the extremely slow output the 397B result should be considered an attempt at proving something can technically run, not evidence that it can run well and produce output you’d expect from a 397B model.

It seem strange to me that the only way to use an llm is to fit it entirely in volatile memory from the get go.

To render movies we happily wait for the computer to calculate how lights bounce around, for hours even days.

So why not do the same with AIs? Ask big question to big models and get the answer to the universe tomorrow?

> Metal Compute Shaders — Hand-written Metal kernels

Hand written... by GPT? ;)

The github page mentions that a naïve mmap approach is bottlenecked by per-page overhead. Can this be mitigated by setting up explicit "huge" pages? (2M using the CONT PTE feature if the "native" page size is 16k; 32M using a PMD level block mapping; or 1G using the CONT PMD feature.) Does macOS support this out of the box? Alternatively, one might use a simple mmap and then something like posix_fadvise to set up prefetching of the data.

The quality degradation at 2-bit is a real issue. For actual work tasks, a well-tuned 30B at 4-bit usually outperforms a 70B+ at 2-bit in my experience. The expert reduction on top of that compounds things - you're essentially running a fairly different model. Still interesting to see the upper bound of what consumer hardware can attempt, even if the result isn't production-ready.

TLDR I took a stab at leveraging Dan's work and making it more practical:

https://github.com/matt-k-wong/mlx-flash

2 bit quantization lobotomizes the model but is impressive nonetheless! Maybe one day we'll be able to have intelligent 2 bit quants... I wonder.

my version supports - 4bit quantization, hybrid streaming (Disk + ram), arbitrary model compatibility, tested on Mamba2, and lets up the framework for LM Studio integration

I leveraged this work (Credit to Danveloper) and am in the middle of making this work on more practical models and quants. It still uses flash streaming, but done so with a control knob so you can choose how much ram and how little ram to use. In the craziest case, it uses as little ram as possible but is very slow, however, in the balanced case you use some ram and it's much faster.

I designed it around the intelligence dense Nemotron 3 Nano 30B and Nemotron Cascade 2 30B models (which are smaller, more intelligence density) and can run on low end 16GB machines, though you can run arbitrarily large models on larger machines (designed for very low end, but capable of high end).

Everyone is focused on the bad 2 bit result but who cares? He says don’t use it because it’s bad.

This is a very impressive result. If I understand correctly the bottleneck is the SSD in this architecture - the author seems to get almost 15GB/s - but I seem to remember the max b/w was about 8GB/s. What am I missing?

Very impressive! I wonder if there is a similar path for Linux using system memory instead of SSD? Hell, maybe even a case for the return of some kind of ROMs of weights?

Really interesting approach. Curious how the 2-bit quantization affects the model's reasoning ability on longer chains of thought vs shorter prompts. The benchmarkslook solid but real-world usage seems like a different story based on the comments here.

As frontier models get closer and closer to consumer hardware, what’s the most for the API-driven $trillion labs?

Can you add a license to the report? Legally we couldn't run any code without a license attached to it.

yeah 4tok/s is kinda unusable though

Does this mean that it should be possible to load up a system with ~10 (seems to me at least the number of active experts) SSDs to get 40 tok/s even on truly gigantic models?

How large is the KV cache?

impressive, i wish someone takes a stab at using this technique on mobile gpu's even if it does not use storage it would still be a win. I am running llama.cpp on adreno 830 with oepncl and i am getting pathetic 2-3t/s for output tokens

seems promising , this is the way , can someone benchmark this

The technical write up is great, but Mac users should not get too excited just yet on running 300B+ parameter models locally as the TPS isn't that good.

>...at 4.4+ tokens/second

That is even when it is using 4-bit quantization and it is still at that speed.

> The entire 209GB model streams from SSD through a custom Metal compute pipeline.

This is my main problem.

If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.

Can't imagine using this in the long term right now, but improvements will follow. Still a great write up anyways.

As frontier models get closer and closer to consumer hardware, what’s the most for the API-driven $trillion labs?

Can you add a license to the report? Legally we couldn't run any code without a license attached to it.

PCIe 5 doubles the maximum throughout. That’s why the numbers for newer SSDs are about double what you recall for the old maximum.

IO is very bursty in these setups. When the router results are in you can start loading experts from SSD. In this brief moment the SSD is saturated.

Outside of that the SSD is idling.

Table 3 shows for K=4 experts an IO of 943 MB/Tok at 3.15 Tok/s giving an average IO of 2970 MB/s far below what the SSD could do.

I'm not sure, but not all expert weights are used immediately. Maybe they could do async reads for the down tensors parallelizing compute with IO.

Not sure if this works on Mac, I only tested my larger than RAM setup on Linux with io_uring O_DIRECT reads and I saw that about 20% of total reads do finish while my fused upgate matmul is already running.

Edit: Typos

MacBook Pro M5 Pro and M5 Max have such SSD speed

Very impressive! I wonder if there is a similar path for Linux using system memory instead of SSD? Hell, maybe even a case for the return of some kind of ROMs of weights?

Most definitely - the popular engines have extensive support for doing this and controlling exactly which weights end up where (llama.cpp: https://github.com/ggml-org/llama.cpp/blob/master/tools/cli/... , vllm: https://docs.vllm.ai/en/stable/configuration/engine_args/#of... , sglang (haven't tried this): https://docs.sglang.io/advanced_features/server_arguments.ht...).

Even with a MoE model, which has to move a relatively small portion of the weights around, you do end up quite bandwidth constrained though.

Using system memory and CPU compute for some of the layers that don’t fit into GPU memory is already supported by common tools.

It’s workable for mixture of experts models but the performance falls off a cliff as soon as the model overflows out of the GPU and into system RAM. There is another performance cliff when the model has to be fetched from disk on every pass.

Loading experts to system memory is supported by most local-AI frameworks. But you do not gain much by running that part of the decode on GPU, since decode is not compute-limited and the CPU-GPU transfer involves overhead. It's best to use the GPU for speeding up the shared part of the model.

My thoughts exactly. Something like this could make it so that modest GPU capacity, like a pair of 3090s , and lots of RAM could make big inference more practical for personal labs

Does this mean that it should be possible to load up a system with ~10 (seems to me at least the number of active experts) SSDs to get 40 tok/s even on truly gigantic models?

SSD bandwidth will ultimately be limited by the amount of PCIe lanes you have available (for something other than the Apple Silicon internal storage). So the approach has inherent limitations. You can of course scale out to multiple systems to get more throughput.

You can use this approach with Intel Optane, which is wearout-resistant unlike NAND and can thus substitute for RAM. Last I checked, it was available quite cheap on the secondary market, ~$1/GB as opposed to ~$15/GB or more for DRAM. (Of course that's nowhere near as cheap as NAND, which is around ~$0.1/GB but quite wearout-prone with heavy writes.)

How large is the KV cache?

0.1 GB per full-attention layer and "The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention." So, 1.5 GB.

seems promising , this is the way , can someone benchmark this

I'm getting 6.55t/s using the Qwen3.5-397B-A17B-4bit model with the command: ./infer --prompt "Explain quantum computing" --tokens 100

MacBook Pro M5 Pro (64GB RAM)

Oh Mac, unified. Sometimes it takes a downvote

The technical write up is great, but Mac users should not get too excited just yet on running 300B+ parameter models locally as the TPS isn't that good.

>...at 4.4+ tokens/second

That is even when it is using 4-bit quantization and it is still at that speed.

> The entire 209GB model streams from SSD through a custom Metal compute pipeline.

This is my main problem.

If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.

Can't imagine using this in the long term right now, but improvements will follow. Still a great write up anyways.

Does an SSD meaningfully degrade by read only workloads?

> If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.

How sure are you about that? I've never looked closer at how a large LLM with mixture of experts architecture switches between expert modules, but staying on roughly the same topic for the use (as it often would when editing the same codebase), I wouldn't be surprised to see the switches of composition are fairly rare, fairly small, and to the extent it happens it's repeated reads from the flash disk rather than writes it tends to cause.

Eh. I mean, 4 tokens a second works fine if you're patient. Go do something else while you wait.

I feel like whenever I'm trying to find information on which local models will work on my hardware, I have to overestimate because people don't know how to wait for things.

Also, reading data doesn't cause SSD wear.

If you want decent throughput and do not care about burning SSD write cycles on a box that was never meant to act like a tiny inference server, a used server with actual RAM is still the cheaper and less silly option. I woudn't expect Apple's warranty team to be much help.

PCIe 5 doubles the maximum throughout. That’s why the numbers for newer SSDs are about double what you recall for the old maximum.

Even with a MoE model, which has to move a relatively small portion of the weights around, you do end up quite bandwidth constrained though.

Using system memory and CPU compute for some of the layers that don’t fit into GPU memory is already supported by common tools.

My thoughts exactly. Something like this could make it so that modest GPU capacity, like a pair of 3090s , and lots of RAM could make big inference more practical for personal labs

0.1 GB per full-attention layer and "The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention." So, 1.5 GB.

Oh Mac, unified. Sometimes it takes a downvote

Does an SSD meaningfully degrade by read only workloads?

Nope, reads don’t cause wear

> If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.

Eh. I mean, 4 tokens a second works fine if you're patient. Go do something else while you wait.

I feel like whenever I'm trying to find information on which local models will work on my hardware, I have to overestimate because people don't know how to wait for things.

Also, reading data doesn't cause SSD wear.

IO is very bursty in these setups. When the router results are in you can start loading experts from SSD. In this brief moment the SSD is saturated.

Outside of that the SSD is idling.

Table 3 shows for K=4 experts an IO of 943 MB/Tok at 3.15 Tok/s giving an average IO of 2970 MB/s far below what the SSD could do.

I'm not sure, but not all expert weights are used immediately. Maybe they could do async reads for the down tensors parallelizing compute with IO.

Edit: Typos

The github page mentions that you can't overlap SSD traffic and GPU compute on Apple Silicon, you get heavy contention for the shared hardware resources.

MacBook Pro M5 Pro and M5 Max have such SSD speed

I have an MBP M4 Pro and a WD Black SN850x in an external TB5 enclosure and I easily get 6-7 GB/s

I'm getting 6.55t/s using the Qwen3.5-397B-A17B-4bit model with the command: ./infer --prompt "Explain quantum computing" --tokens 100

MacBook Pro M5 Pro (64GB RAM)

Appreciate the data point. M5 Max would also be interesting to see once available in desktop form.

can you post the final result (or as far as you got before you killed it) to show us how cohesive and good it is? I'd like to see an example of the output of this.

I have an MBP M4 Pro and a WD Black SN850x in an external TB5 enclosure and I easily get 6-7 GB/s

Appreciate the data point. M5 Max would also be interesting to see once available in desktop form.

Nope, reads don’t cause wear

Afaik the experts are not usually very interpretable, and generally would be surprised if at least one does not change every token. I don't know what happens in practice, but I know at least during training, nothing is done to minimize the number of expert switches between tokens.

Is it doing a bunch of ssd writes?

The github page mentions that you can't overlap SSD traffic and GPU compute on Apple Silicon, you get heavy contention for the shared hardware resources.

can you post the final result (or as far as you got before you killed it) to show us how cohesive and good it is? I'd like to see an example of the output of this.

Since the output is quite long, here is a link: https://pastebin.com/k76wiVGP

Is it doing a bunch of ssd writes?

TLDR I took a stab at leveraging Dan's work and making it more practical:

https://github.com/matt-k-wong/mlx-flash

2 bit quantization lobotomizes the model but is impressive nonetheless! Maybe one day we'll be able to have intelligent 2 bit quants... I wonder.

my version supports - 4bit quantization, hybrid streaming (Disk + ram), arbitrary model compatibility, tested on Mamba2, and lets up the framework for LM Studio integration

No appreciable wear of course, but read disturb (requiring occasional rewrites) becomes more of an issue as NAND fabrication advances.

> Metal Compute Shaders — Hand-written Metal kernels

Hand written... by GPT? ;)

He’s very clear that it was written by AI.

yeah 4tok/s is kinda unusable though

Everyone is focused on the bad 2 bit result but who cares? He says don’t use it because it’s bad.

If you don’t care about the output, why not reduce to 1-bit and only 1 active expert? It will be completely useless but it will be faster!

It seem strange to me that the only way to use an llm is to fit it entirely in volatile memory from the get go.

To render movies we happily wait for the computer to calculate how lights bounce around, for hours even days.

So why not do the same with AIs? Ask big question to big models and get the answer to the universe tomorrow?

If you don’t care about turnaround time you can do that.

Most LLM use cases are about accelerating workflows. If you have to wait all night for a response and then possibly discover that it took the wrong direction, misunderstood your intent, or your prompt was missing some key information then you have to start over.

I don’t let LLMs write my code but I do a lot of codebase exploration, review, and throwaway prototyping. I have hundreds to maybe thousands of turns in the LLM conservation each day. If I had to wait 10X or 100X as long then it wouldn’t be useful. I’d be more productive ignoring a slow LLM and doing it all myself.

There's definitely use cases for this for long running tasks, like doing research, but for typical use cases they require way too much constant supervision and interaction

Since the output is quite long, here is a link: https://pastebin.com/k76wiVGP

Why does this G character appear to prefix most of the output? ("Ġlike")

No appreciable wear of course, but read disturb (requiring occasional rewrites) becomes more of an issue as NAND fabrication advances.

Why does this G character appear to prefix most of the output? ("Ġlike")

It is a tokenizer artifact most likely (https://github.com/huggingface/transformers/issues/4786). So the output is not properly decoded in this case, it should just be a space.

The original tokens have Ġ instead of space. I had this issue too when writing an inference engine for Qwen. You have to "normalize" those special characters.

It is a tokenizer artifact most likely (https://github.com/huggingface/transformers/issues/4786). So the output is not properly decoded in this case, it should just be a space.

The original tokens have Ġ instead of space. I had this issue too when writing an inference engine for Qwen. You have to "normalize" those special characters.

Flash-MoE: Running a 397B Parameter Model on a Laptop

Read the paper — Full technical details, 90+ experiments, and the story of how an AI and a human built this in 24 hours.

Pure C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397 billion parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second with production-quality output including tool calling.

The entire 209GB model streams from SSD through a custom Metal compute pipeline. No Python. No frameworks. Just C, Objective-C, and hand-tuned Metal shaders.

Results

Progress

Configuration	tok/s	Quality	Notes
4-bit experts, FMA kernel	4.36	Excellent	Current best. Full tool calling. 209GB on disk.
4-bit experts, baseline	3.90	Excellent	Before FMA kernel optimization.
2-bit experts, trust OS	5.74	Good*	120GB on disk. *Breaks JSON/tool calling.
2-bit peak single token	7.05	Good*	Warm cache burst. *Not suitable for tool use.

*2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable. 4-bit is the production configuration.

Hardware

Machine: MacBook Pro, Apple M3 Max
Chip: 16-core CPU (12P + 4E), 40-core GPU, 16-core ANE
Memory: 48 GB unified (~400 GB/s bandwidth)
SSD: 1TB Apple Fabric, 17.5 GB/s sequential read (measured)
macOS: 26.2 (Darwin 25.2.0)

Architecture

The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention. Each layer has 512 experts, of which K=4 are activated per token (plus one shared expert). Hidden dimension is 4096.

Key Techniques

SSD Expert Streaming — Expert weights (209GB at 4-bit) are read from NVMe SSD on demand via parallel pread() with GCD dispatch groups. Only the K=4 active experts per layer are loaded (~6.75MB each). The OS page cache manages caching — no custom cache needed ("Trust the OS" principle). Inspired by Apple's "LLM in a Flash" paper.
FMA-Optimized Dequant Kernel — The inner loop of the 4-bit dequantized matrix-vector multiply rearranges the math from (nibble * scale + bias) * x to fma(nibble, scale*x, bias*x). Pre-computing scale*x and bias*x lets the GPU fused multiply-add unit do dequant+multiply in one instruction. 12% faster than the naive formulation.
Metal Compute Shaders — Hand-written Metal kernels for:
- 4-bit and 2-bit dequantized matrix-vector multiply (tiled, SIMD-reduced, shared input cache, FMA-optimized)
- Fused SwiGLU activation
- RMS normalization (two-pass: sum-of-squares reduction + apply)
- Batched GPU attention (Q@K^T, softmax, scores@V) for full attention layers
- GPU RoPE (fused with Q deinterleave and K normalization)
- MoE combine + residual + sigmoid gate (fused kernel)
Deferred GPU Expert Compute — CMD3 (expert forward pass) is submitted without waiting. The GPU executes it while the CPU prepares the next layer. The combine + residual + norm are also on GPU, feeding directly into the next layer's attention projections.
Accelerate BLAS for Linear Attention — The GatedDeltaNet recurrence uses cblas_sscal, cblas_sgemv, and cblas_sger for the 64-head × 128×128 state matrix update. 64% faster than scalar code.
Trust the OS — No custom expert cache. The OS page cache (~35GB) manages expert data caching via standard LRU. Every custom caching approach we tested (Metal LRU, malloc cache, LZ4 compressed cache) was slower due to GPU memory pressure or overhead. The page cache achieves ~71% hit rate naturally.

Pipeline Per Layer (4.28ms average at 4-bit)

CMD3(prev) → CMD1: attention projections + delta-net  [1.22ms GPU]
           → CPU: flush results                       [0.01ms CPU]
           → CMD2: o_proj + norm + routing + shared    [0.55ms GPU]
           → CPU: softmax + topK routing               [0.003ms]
           → I/O: parallel pread K=4 experts           [2.41ms SSD]
           → CMD3: expert forward + combine + norm     [0.04ms encode, DEFERRED]

Unified Memory Constraint

On Apple Silicon, SSD DMA and GPU compute share the same memory controller and cannot be profitably overlapped. The GPU's dequant kernels are bandwidth-saturated at ~418 GiB/s. Even small background SSD DMA causes disproportionate GPU latency spikes through memory controller arbitration. The serial pipeline (GPU → SSD → GPU) is hardware-optimal.

Quick Start

cd metal_infer
make
# 4-bit inference (needs packed_experts/ directory)
./infer --prompt "Explain quantum computing" --tokens 100

# 2-bit inference (faster but breaks tool calling)
./infer --prompt "Explain quantum computing" --tokens 100 --2bit

# Interactive chat with tool calling
./chat

# Per-layer timing breakdown
./infer --prompt "Hello" --tokens 20 --timing

Project Structure

metal_infer/
  infer.m              # Complete inference engine (~7000 lines)
  shaders.metal        # Metal compute kernels (~1200 lines)
  chat.m               # Interactive chat TUI with tool calling
  tokenizer.h          # C BPE tokenizer (single-header, 449 lines)
  main.m               # MoE-only benchmark
  Makefile             # Build system
  extract_weights.py   # Creates model_weights.bin from safetensors
  repack_experts_2bit.py  # 4-bit → 2-bit expert requantization
  train_predictor.py   # Expert routing prediction analysis
  model_weights.bin    # Non-expert weights (5.5GB, mmap'd)
  model_weights.json   # Tensor manifest
  vocab.bin            # Vocabulary for token decoding
  tokenizer.bin        # Pre-exported BPE tokenizer data

repack_experts.py      # 4-bit expert packing from safetensors
progress.py            # Results visualization (Q2/Q4 tracks)
results.tsv            # Experiment log (58 experiments)

What We Tried (and What Worked)

Kept

Approach	Result	Impact
FMA dequant kernel	GPU compute -12%	+12% tok/s
Trust OS page cache	Deleted Metal LRU → +38%	Foundational
GPU combine+norm in CMD3	Eliminates CPU round-trip	Pipeline
BLAS delta-net (Accelerate)	cpu_attn 0.78→0.28ms	+64% attn
F_NOCACHE for 2-bit	+3% from avoiding page thrash	2-bit only
GPU fused attention (RoPE)	+2% for full-attn layers	Small
C BPE tokenizer	180ms vs 3500ms startup	20x startup
Deferred CMD3 execution	GPU/CPU overlap	Pipeline

Discarded (58 experiments, highlights)

Approach	Result	Why
LZ4 expert compression	-13%	Decompress overhead > warm cache savings
F_RDADVISE prefetch	net 0%	Unified memory: SSD DMA slows GPU -73%
Temporal expert prediction	-18%	25% hit rate, SSD bandwidth waste
MLP routing predictor	31% accuracy	Worse than temporal baseline
GPU LUT dequant kernel	-2%	Indirect register access serializes
GPU private buffer compression	-20% pipeline	Blit cost 4×7MB > matvec savings
Spin-poll GPU wait	-23%	CPU thermal competes with GPU
Expert file clustering	0%	NVMe ignores scatter at 7MB granularity
dispatch_io	-70%	dispatch_data management overhead
mmap expert files	-5x	Per-page fault overhead on cold data
Speculative early routing	-38%	Cache pollution + overhead
MTP speculative decoding	break-even	MoE I/O scales per-token (unlike dense)

Safety

This is a primary development machine. The engine explicitly controls memory:

Non-expert weights: 5.5GB (mmap'd, read-only)
Metal scratch buffers: ~200MB
Total: ~6GB, leaving 42GB for OS + page cache
No OOM risk. Expert data streams from SSD on demand.
No custom caches. Trust the OS.

If you don’t care about turnaround time you can do that.

> If you have to wait all night for a response and then possibly discover that it took the wrong direction, misunderstood your intent, or your prompt was missing some key information then you have to start over.

If you have to wait overnight because the model is offloading to disk, that's a model you wouldn't have been able to run otherwise without very expensive hardware. You haven't really lost anything. If anything, it's even easier to check on what a model is doing during a partial inference or agentic workload if the inference process is slower.

"If you have to wait all night for a response and then possibly discover that it took the wrong direction, misunderstood your intent, or your prompt was missing some key information then you have to start over."

This exact problem exist for rendering, when you realize that after a long render an object was missing in the background and the costly frame is now useless. To counter that you make multiple "draft" renders first to make sure everything is in the frame and your parameters are properly tuned.

If you don’t care about the output, why not reduce to 1-bit and only 1 active expert? It will be completely useless but it will be faster!

He’s very clear that it was written by AI.

There's definitely use cases for this for long running tasks, like doing research, but for typical use cases they require way too much constant supervision and interaction

Hacker Times

Hacker Times

Flash-MoE: Running a 397B Parameter Model on a Laptop

Discussion

Discussion

Flash-MoE: Running a 397B Parameter Model on a Laptop

Results

Hardware

Architecture

Key Techniques

Pipeline Per Layer (4.28ms average at 4-bit)

Unified Memory Constraint

Quick Start

Project Structure

What We Tried (and What Worked)

Kept

Discarded (58 experiments, highlights)

Safety