I play with local LLMs a lot. I've spent more on hardware than I should. I'm friends with a local group of people who have spent a lot more than I have.

The warning I would have for everyone is to temper your expectations and read the fine print carefully. The big build in the article starts off with a $40K budget and then includes 4 GPUs that are $12K each. For those doing the math, this build is going to cost more like $50-55K.

Local setups also often rely on quantization and techniques like REAP to fit the models on their hardware. You will read a lot of claims that 4-bit quantization is lossless, but those claims come from KL divergence measurements on a small corpus. Use one of these 4-bit models on long-context coding tasks and the quality will be noticeably worse. Even for non-coding tasks like dataset analysis, I can measure a substantial quality difference between 4-bit models, 8-bit quants, and sometimes even the full 16-bit source.

This article is also encouraging the use of a REAP model, which means someone has cut out some of the weights to make it smaller. The idea is to remove weights that are less useful for certain tasks, but again this is going to reduce the overall quality of the output.

The trap is that people say "I'm running GLM-5.2 locally!" and it sounds amazing when you look at the GLM-5.2 benchmarks. However they're not actually running GLM-5.2, they're running a model derived from GLM-5.2 that discards most of the bits and drops some of the experts. It does not perform the same as what you see in the benchmarks. In my experience, the divergence between a quantized/REAP model and the parent model is unnoticeable when you try it on very small tasks or chat, but becomes painful when you start trying to use it on long-horizon tasks where little errors start compounding.

Then you get into the slippery slope of thinking you're $50K deep into this project, but what you really need is just one or two more of those $12K GPUs to use the next level of quantization that might improve the quality a little more and make your investment worthwhile...

For qwen3.6-27b you can also run the q4 variant with full ~250K context on one 3090. It's fast enough to not be frustrating so the speed gains with 2x 3090s wouldn't be worth it to me. Running a q6 on 2x 3090s at half the speed with a smaller context is an option, but you're really not going to compete with SOTA models there anyway so unless you already have 2x 3090s, I would say 1 is the best investment given current prices. It's good enough to do a lot, especially with a well-configured harness.

"A great way to go is 2x RTX 3090s for a total of 48GB VRAM total. You can then run Qwen3.6-27B, which is an awesome model."

Just want to note that for $3k you can get an M5 MacBook Pro with 48GB of shared memory, and it will not be a giant box. Also, consider committing to spending that money on a cloud hosting provider instead, which will be at least somewhat cheaper, if not significantly cheaper. It is awesome being able to run models locally, though.

I am somewhere in the middle, where I want something with more than 48GB/$2k of VRAM, but less than 384GB/$40k.

I'm curious if GMKtec's EVO-X2, with ~96GB of usable VRAM, is still a good solution for something like this for $3,399.

>$40k gets you almost-Opus

GLM 5.2 is "almost Opus," and it needs at least 8xH200s for comfortable inference (so it's closer to $400k than $40k).

They suggest using this modified model:

>A REAP-pruned (≈22% of experts removed), Int8-mix NVFP4 quantized version of GLM-5.2, ≈594B parameters.

I wonder how it behaves in practice outside of benchmarks. Qwen3.6, even at 6-bit quantization, often gets stuck in loops while reasoning. And here they've also removed some experts. I mean, sometimes an 8-bit or 16-bit small model can be smarter than a lobotomized large model. I heard the consensus is you shouldn't go below 8 bit for coding.

Also, it's not clear how much context is left once you fit a lobotomized model into 4 RTX 6000s. Anything below 100k is barely usable because it often hits compaction before it's able to gather the necessary context. P.S. Found it in the repo: 240k context.

You can get amazing local STT using Parakeet, which can use as little as 600MB of VRAM. It's as good as or better than Whisper large-v3.

I agree that local LLMs are the likely future and worth investing in… but at $40k for possible-SOTA right now, this isn’t worth it for the average consumer.

I’m pretty bullish that Apple will deliver something very competitive for the average consumer in the next couple years.

Apple M series chips deserve a mention as another option, especially since you get a whole Mac laptop or desktop workstation too.

They have unified memory and respectable inference performance, and for some variations can be cheaper than video cards, especially if you get an older-gen high-end M series with a lot of RAM used or refurbished.

I've read that Apple has plans once the RAM bottleneck passes to offer more RAM in all their models, and that future M series GPUs and NPUs will be even better for local inference, so in the future I expect Apple to be a serious offering for local inference and AI research workstations.

And what about AMD and Intel Arc GPUs? They don't get as much love but I've heard they can be compelling for certain shapes of a local LLM configuration.

At this point though, I think we may be in a "renters market" for LLM compute. If you want privacy it might be better to rent GPU time in raw form or use spot pricing at various providers. It probably only makes sense to build if you have extreme privacy/security needs or just want to do it cause it's cool.

did he call Qwen a SOTA model?

I picked up the 128GB version of the EVO-X2 when it was $2,199, and it runs Qwen 3.6 reasonably well with a 128K context. Not very useful for complex tasks, but it can handle some web stuff.

It has lower memory bandwidth than most comparable Macs.

Well you could make a REAP with better input prompts on longer context then. It’ll improve the REAP quality

All very true. Right now, running GLM 5.2 at full BF16 precision needs 1.5 TB of VRAM. You can't run this locally at a usable speed for less than $250K or so, and frankly I'd be surprised if it could be done for less than $500K.

The best NVFP4 quant for 5.2 appears to be lukealonso's at https://huggingface.co/lukealonso/GLM-5.2-NVFP4, and it is capable of good throughput (75-100 tps) without losing much reasoning performance. Allowing for overhead for the KV cache and other requirements, this quant will (barely) run in 8-way tensor-parallel mode on 8x RTX 6000 cards. Not too long ago it was possible to put an 8x machine together for less than $100K USD, but that's probably not true now, assuming you buy all-new components.

It'll almost certainly be worth it, given the abusive behavior we've seen and will continue to see from the major closed-model providers. If I hadn't already put a similar rig together, I'd be kicking myself. But getting it running well is by no means as simple as buying a bunch of RTX6K cards and calling it a day, and people need to know what they're getting into.

Local AI is in its Altair and IMSAI days. There's no turnkey Apple II or C64 on the market yet, much less an IBM PC. Hardware, yes -- you can buy a capable box off the shelf from various vendors -- but you have to be prepared to take up a whole new hobby when it comes to getting a complete system working well.

How does this work with scaling?

I assume you can then somehow run several hundreds of prompts concurrently?

> once the RAM bottleneck passes

Do we have evidence that this will actually happen? Maybe the belief that it won't pass is what requires evidence, but I think there's a widespread feeling right now that things are just getting permanently worse and this is one example.

No, he’s running GLM 5.2, which is closer to SOTA.

That math (250k context, Q4 model, 24GB VRAM) only checks out at q4 quant for the K/V cache, which is probably not the best idea.
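For anyone who wants to check this kind of math, a minimal sketch of the usual KV cache size estimate follows. It assumes every layer uses standard attention with a K and a V tensor per token. The layer count, KV head count, and head dimension below are placeholders picked for illustration, not Qwen3.6-27B's actual configuration, and `kv_cache_mib` is a made-up helper name.

```shell
#!/usr/bin/env bash
# KV cache size in MiB for a plain transformer:
#   2 (K and V) x layers x kv_heads x head_dim x tokens x bits / 8
kv_cache_mib() {
  local layers="$1" kv_heads="$2" head_dim="$3" tokens="$4" bits="$5"
  echo $(( 2 * layers * kv_heads * head_dim * tokens * bits / 8 / 1024 / 1024 ))
}

# PLACEHOLDER architecture values, not the real model config.
layers=64
kv_heads=8
head_dim=128
tokens=250000

for bits in 16 8 4; do
  echo "${bits}-bit KV cache: $(kv_cache_mib "$layers" "$kv_heads" "$head_dim" "$tokens" "$bits") MiB"
done
```

The absolute numbers depend entirely on the placeholder values. The useful part is the ratio: going from a 16-bit to a 4-bit cache cuts the footprint by 4x, which is the kind of saving needed to squeeze a long context next to the weights on one card.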

"A great way to go is 2x RTX 3090s for a total of 48GB VRAM total. You can then run Qwen3.6-27B, which is an awesome model."

I’m an idiot who is unable to project myself into situations I’ve never experienced before.

So, I always thought local LLMs were toys not worth pursuing.

Only once I tried something decent like Gemma 4 31B and Qwen 3.6 27B did I realize how incredibly useful they are.

You stop fearing you are sharing sensitive information.

You stop fearing you will run out of tokens.

You stop worrying about the availability of the remote AI.

Local LLMs are extremely valuable.

I'm running Qwen3.6-27B on a single 24GB GPU at 80 tok/s, you don't even need 2 of them

That's a reasonable option, just be aware that you get about 1/3 as much memory bandwidth with the M5 Pro, or 2/3 with the M5 Max [now you're at $4100 for the lowest-end]. So both your prefill (flops-bound, M5 has a lot less) and decode (bw-bound) will be slower.

I have an M5 MacBook Pro and I also have a separate GPU setup for running models. The difference in speed is significant. It's not just token generation speed, but time to first token (prompt processing).

The M5 hardware is amazing for what it is, but GPUs are still so much faster.

Running the models on the GPU box also means I can use the laptop on my lap instead of turning it into a hot plate.

The standalone Mac mini/Studio is better if you don't want a constantly hot laptop.

Get a regular laptop and use the network to access the LLM

You can also buy a Jetson Orin with 64GB of unified memory.

> It'll almost certainly be worth it, given the abusive behavior we've seen and will continue to see from the major closed-model providers.

The proper financial comparison for GLM-5.2 would be one of the providers on OpenRouter or renting a server as needed. Compare apples to apples.

You will almost certainly never break even compared to paying per token.

Local LLMs at this scale are only worth it if you have extremely strict requirements that data not leave the premises.

jamesob's guide to running SOTA LLMs locally

Note: nothing in this README aside from the tables was written by AI.

Have $2k burning a hole in your pocket and want some local, state-of-the-art machine intelligence?

How about $40k?

If Dario and Altman are giving you heartburn (they should be), read on to figure out how to run this new kind of computing locally.

In this repo you'll find:

the hardware I use to run SOTA locally,
why I bought what and little-known secrets for configuring it,
how I run speech-to-text (STT) locally,
ready-to-run configuration for running models I think are good within Docker containers.

Section	TL;DR
How much are you willing to spend?	$2k gets you Qwen and good STT (pretty far!); $40k gets you almost-Opus
Base system	Last-gen EPYC + eBay DDR4 for $5.6k
GPUs	4× RTX PRO 6000, 384GB VRAM (where the money went)
c-payne switch sub-BOM	Indie PCIe switching from c-payne.com so GPUs talk peer-to-peer
GPU mount	A day of carpentry
Making the switch behave	BIOS bifurcation, link speed, ASPM
Kernel / GRUB params	`iommu=off` or NCCL hangs
ACS disable	Keep P2P traffic inside the switch fabric
GPU power limiting	Running $46k of silicon on a 110V circuit
Result	Gen4 line rate: 27.5/50.4 GB/s, sub-µs latency
`runners/`	Ready-to-run serving configs: GLM-5.2-594B: vLLM docker-compose, DCP4+MTP5, ~80 t/s @ 240k ctx
`runners/stt`	Ready-to-run speech-to-text config with `whisper-large-v3`
`tools/`	`measure-gpu-speed.sh`: P2P bandwidth/latency benchmark
Resources	rtx6kpro repo, c-payne

My setup

I was lucky/dumb enough to buy 4x RTX Pro 6000s back when they were cheaper. Because RAM is now so expensive, I opted to build a last-gen DDR4 system to host these cards, the parts for which I got off eBay. This allowed me to keep base system cost reasonable while still getting a lot of VRAM.

my rig

Another somewhat unusual thing I did was to use PCIe4 switches (from c-payne.com). This allows the GPUs to communicate to one another "directly" at wire speeds during the allreduce step in tensor parallelism, rather than having to send all data through the PCI root complex. The upshot of this is reduced latency between the cards with less of a need for expensive PCIe5 hardware.

switch

Consequently, I'm spending money on VRAM (where it counts) rather than on a PCIe5/DDR5 base system, which is terrifically expensive as of July 2026.

My particular BOM is detailed below.

How much are you willing to spend?

~$2k

A great way to go is 2x RTX 3090s for a total of 48GB of VRAM. You can then run Qwen3.6-27B, which is an awesome model.

You can also run SOTA speech-to-text (STT) with whisper-large-v3, which I find very useful. That's the model - you'd then access it with my cross-platform stt harness.

I've found local STT surprisingly useful - and I feel comfortable using it, unlike a hosted equivalent. You can find a ready-to-run config in ./runners/stt that only assumes the presence of ~11GB of VRAM on an Nvidia GPU.

~$40k

At this price level, you get the next step up in model intelligence. Something pretty close to Claude Opus.

You'd buy 4x RTX 6000 Pros for a total of 384GB of VRAM.

Current best models for 4x RTX6kPRO

Date	Best model	My config
2026-07	`GLM-5.2-Int8Mix-NVFP4-REAP-594B`	Runner config
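A rough way to see why a model of this size fits on four cards: at about 4 bits per weight, the weights alone take roughly half a byte per parameter. The sketch below does that arithmetic. It deliberately ignores the Int8 portions of the mix and the quantization scale factors, so treat the result as a lower bound; `weights_gb` is a made-up helper name.

```shell
#!/usr/bin/env bash
# Rough lower bound on weight memory: parameters x bits per weight / 8.
# Ignores Int8 layers and per-block scale factors, both of which add to this.
weights_gb() {
  local params_billions="$1"   # e.g. 594 for a ~594B-parameter model
  local bits_per_weight="$2"   # e.g. 4 for a 4-bit format
  echo $(( params_billions * bits_per_weight / 8 ))
}

total_vram_gb=$(( 4 * 96 ))    # 4x 96GB cards
weights=$(weights_gb 594 4)

echo "Weights:  ~${weights} GB"
echo "Headroom: ~$(( total_vram_gb - weights )) GB for KV cache and everything else"
```

With these inputs the weights come to about 297 GB, leaving about 87 GB of the 384GB for the KV cache, activations, and runtime overhead. The real headroom will be smaller because of the parts this estimate leaves out.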

Other approaches

Note: these are my recommendations, but there are other completely valid ways to spend your money. For example, there's probably also some regime where rather than getting 4 rtx6kpros, you allocate most of your money to building out a linked 4x DGX Spark cluster for a total of 512GB VRAM and use that as the slow, big brain to drive Qwen3.7-27b to do the rote tasks quickly.

Hardware

Here's the hardware I wound up purchasing for the 4x RTX 6000 pro machine.

enclosure

Base system

A modest, last-gen EPYC system purchased in parts almost entirely from eBay.

Component	Spec	Price
Motherboard	ASRock Rack ROMED8-2T (SP3, 7× PCIe 4.0 x16, dual 10GbE)	$715
CPU	AMD EPYC Milan 7313P (16-core 3.0GHz)	$504
RAM	8× 16GB Crucial CT16G4RFD4213 DDR4 ECC RDIMM (128GB total, eBay)	$642
CPU Cooler	Dynatron T17 SP3 tower, 280W TDP	$40
Case	AAAWave Sluice V2 open frame	$100
PSUs	2× Super Flower 1700W	$750
PCIe Switch	c-payne Microchip Switchtec PM40100 Gen4 (see sub-BOM below)	~$1,330
Boot NVMe	4TB M.2	$291
Storage NVMe	(2x) 8TB M.2 (model weights)	$1,200
Fans	3× 120mm PWM	$15
Total		$5,587

GPUs

Component	Spec	Price
GPUs	4× NVIDIA RTX PRO 6000 Blackwell Workstation (96GB each, 384GB VRAM total)	~$46,000
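As a sanity check on the budget, the line items from the base system and GPU tables can simply be added up. This is a small sketch in shell integer arithmetic; the prices are the ones listed in those tables, and the variable names are my own.

```shell
#!/usr/bin/env bash
# Prices in USD, copied from the base system table (motherboard through fans).
base_total=0
for price in 715 504 642 40 100 750 1330 291 1200 15; do
  base_total=$(( base_total + price ))
done

gpu_cost=46000   # 4x RTX PRO 6000, approximate figure from the GPU table
build_total=$(( base_total + gpu_cost ))

echo "Base system: \$${base_total}"
echo "GPUs:        \$${gpu_cost}"
echo "Whole build: \$${build_total}"
```

The base system sums to $5,587, matching the table, and the whole build comes to about $51.6k at the listed prices.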

c-payne PCIe Gen4 Switch Sub-BOM (c-payne.com)

Part	Qty	Unit (€)	Notes
PCIe gen4 Switch 5× x16 — Microchip Switchtec PM40100	1	1.050	2× SlimSAS 8i upstream, 5× x16 quad-width-spaced downstream, aux x4 SlimSAS, 3× 8-pin EPS power
SlimSAS PCIe gen4 Host Adapter x16 — REDRIVER AIC (DS160PR810)	1	140	Plugs into ROMED8-2T x16 slot, feeds switch upstream
SlimSAS SFF-8654 8i cable — PCIe gen4	2	~30	Each carries x8; pair = x16 upstream
Total			~€1,220 (~$1,330 USD)

GPU mount

I had to custom fabricate a wood enclosure for the PCI switch and GPUs, which took about a day.

carpentry

I found the PCI switch's built-in fan very loud and seemingly useless, so I simply unplugged it from the board.

Hoarding model weights

I save all model weights locally on a ZFS filesystem that's replicated across the two 8TB drives, which is mounted at ~/storage.

For any model I want to run, I first download the model using

hf download <model-name> --local-dir ~/storage/<model-name>

Running models

Once the model weights are cached locally, I have a specific directory for each model that contains a docker-compose.yml file, which cordons off each model in its own Docker container.

You can find these configurations in ./runners/.

Each container mounts in ~/storage/models in read-only mode to obtain the weights that I've cached locally.

I then use opencode hosted on a VM on another machine to access the models once they're serving on http://clank.j.co:5000.

I use a network-internal DNS server to point clank.j.co to the LLM machine, but you could simply do http://<llm-machine-ip>:5000 too.

The harness itself

I created a VM and clanked up an application that basically just creates a tmux session for each directory within the VM's ~/src tree, which then runs an opencode instance that backs up to the inference machine's HTTP API (http://clank.j.co:5000).

clankhouse

One key to making the open-source models good is tooling them properly; a summary of my skills/ is:

camofox, kagi.com API key, and searXNG for web browsing and search,
Telegram bot for communication and alerting,
a local private Gitea instance for collaborating on source code.

The clanker will either work with me interactively in a session, or can be farmed off to work on Gitea issues and file PRs there.

All this happens in a sandboxed VM where the only communication back to the host system happens via a shared filesystem mount, so the thing can go ham and install whatever it wants.

Getting the PCI switches to work properly

There was a lot of fiddling with the BIOS in order to make sure the motherboard wasn't downregulating the PCI switch speeds.

BIOS Configuration (ROMED8-2T)

Setting	Value	Why
`Chipset Configuration → AMD PCIE Link Width` (switch slot)	x16 (was x8/x8)	Bifurcation was splitting the slot; upstream link trained at Gen4 x8. Requires both SlimSAS 8i cables connected (each carries x8).
PCIe Link Speed (switch slot)	Gen4 (not Auto)	Blackwell Gen5 devices auto-negotiating down through the Gen4 switch could fail training and fall to Gen1. Forcing Gen4 stabilizes it.
ASPM	Disabled	ASPM L1 drops idle links to 2.5GT/s. This turned out to be the explanation for the "Gen1 downgraded" lspci readings — links were actually running Gen4 under load (verified via p2pBandwidthLatencyTest), but disabling ASPM removes the cosmetic scare and any re-train latency.
Re-Size BAR	Enabled	Required for full 96GB VRAM BAR exposure and GPU P2P.
SR-IOV	Disabled	Bare-metal inference; avoids IOMMU overhead and P2P interference.
Preferred IO	Auto	Optionally set Manual → bus `81` (the c-payne switch) for marginal latency gains, but left at Auto — it's a squeeze-more optimization, not a fix, and bus numbers shift after BIOS changes.

Kernel / GRUB Parameters

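# /etc/default/grub
GRUB_CMDLINE_LINUX="iommu=off amd_iommu=off nomodeset"

# Regenerate the GRUB config so the new parameters apply on the next boot
sudo update-grub

# nvidia_uvm P2P fix
echo 'options nvidia_uvm uvm_disable_hmm=1' | sudo tee /etc/modprobe.d/uvm.conf
# Rebuild the initramfs so the module option is picked up at boot
sudo update-initramfs -u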

Without iommu=off, NCCL hangs on multi-GPU P2P.
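After rebooting, it is worth confirming that the parameters actually reached the running kernel. Here is a minimal sketch of such a check; `missing_params` is a hypothetical helper of mine, not part of this repo, and the sample command line in the last call is invented for illustration.

```shell
#!/usr/bin/env bash
# Print the expected kernel parameters that are missing from a command line
# string. Prints nothing and returns 0 when both are present.
missing_params() {
  local cmdline="$1"
  local missing=""
  local param
  for param in iommu=off amd_iommu=off; do
    # Pad with spaces so only whole parameters match:
    # "amd_iommu=off" on its own must not count as "iommu=off".
    case " ${cmdline} " in
      *" ${param} "*) ;;
      *) missing="${missing:+$missing }${param}" ;;
    esac
  done
  if [ -n "$missing" ]; then
    echo "$missing"
    return 1
  fi
  return 0
}

# On the real machine:
#   missing_params "$(cat /proc/cmdline)" || echo "re-check GRUB and reboot"
# Invented sample command line:
missing_params "BOOT_IMAGE=/vmlinuz root=/dev/sda1 iommu=off amd_iommu=off nomodeset" \
  && echo "all parameters present"
```

Taking the command line as an argument rather than reading /proc/cmdline inside the function keeps the check easy to try out on any machine.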

ACS Disable (critical for switch P2P)

With ACS enabled (default), P2P traffic gets bounced through the CPU root port instead of staying inside the switch fabric, negating the switch entirely. pcie_acs_override requires a patched kernel, so we disable via setpci at runtime.

#!/bin/bash
# /usr/local/bin/disable-acs.sh
# Clears the ACS control register on every PCI device that has one.
if [ "$EUID" -ne 0 ]; then
  echo "ERROR: must be run as root" >&2
  exit 1
fi

for BDF in $(lspci -d "*:*:*" | awk '{print $1}'); do
  # Skip devices that don't expose the ACS extended capability.
  if ! setpci -v -s "${BDF}" ECAP_ACS+0x6.w > /dev/null 2>&1; then
    continue
  fi
  echo "Disabling ACS on $(lspci -s "${BDF}")"
  setpci -v -s "${BDF}" ECAP_ACS+0x6.w=0000
done

Run on every boot via systemd oneshot:

# /etc/systemd/system/disable-acs.service
[Unit]
Description=Disable PCIe ACS for GPU P2P
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/disable-acs.sh

[Install]
WantedBy=multi-user.target

Verify: sudo lspci -vvv | grep ACSCtl should show all minus signs (root is needed to read the capability registers), and nvidia-smi topo -m should show PIX between all four GPUs (not PHB/NODE).

Use ./tools/measure-gpu-speed.sh to measure this easily.
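The "all minus signs" check can also be automated. The sketch below counts ACSCtl lines that still have at least one flag enabled; `acs_enabled_count` is a hypothetical helper of mine, separate from the script in tools/, and the two sample lines are invented input in the shape lspci prints.

```shell
#!/usr/bin/env bash
# Count ACSCtl lines that still have at least one ACS feature enabled.
# Reads `lspci -vvv` output on stdin. Enabled flags end in "+" (SrcValid+),
# disabled ones end in "-".
acs_enabled_count() {
  # grep -c exits non-zero when the count is 0, so don't treat that as failure.
  grep 'ACSCtl:' | grep -cF '+' || true
}

# On the real machine:
#   sudo lspci -vvv | acs_enabled_count    # want: 0
# Invented sample input: one device with ACS off, one with it still on.
printf '%s\n' \
  'ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-' \
  'ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-' \
  | acs_enabled_count
```

A result of 0 on the real machine means the setpci loop took effect everywhere; anything else means some device still has ACS turned on.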

GPU Power Limiting

In order to avoid installing a 220V circuit, I (probably unwisely) run this rig on a single 110V circuit, but I power regulate the cards.

Persistence mode + power cap applied at boot via systemd (install-gpu-power-limit.sh):

sudo nvidia-smi -pm 1
sudo nvidia-smi -pl 350    # 350W per GPU (default 600W)

350W/GPU = 1,400W GPU load, sized for the PSU budget. During the interim single-1700W-PSU phase (before the 240V circuit), cards ran at ~260W (4×260 = 1,040W GPUs + ~280W system ≈ 1,320W total).

Verify: nvidia-smi --query-gpu=index,power.limit,power.draw --format=csv
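The power budget above is simple enough to recompute for other caps. A small sketch, using the ~280W rest-of-system figure quoted above; `rig_watts` is a made-up helper name, and the result is an estimate of load, not a measurement at the wall.

```shell
#!/usr/bin/env bash
# Estimated total draw: GPUs at their power cap plus the rest of the system.
rig_watts() {
  local gpu_count="$1" gpu_cap_watts="$2" system_watts="$3"
  echo $(( gpu_count * gpu_cap_watts + system_watts ))
}

system_watts=280   # rest-of-system figure quoted above

echo "4 GPUs @ 260W: $(rig_watts 4 260 "$system_watts") W total"
echo "4 GPUs @ 350W: $(rig_watts 4 350 "$system_watts") W total"
echo "4 GPUs @ 600W: $(rig_watts 4 600 "$system_watts") W total (uncapped default)"
```

Compare the totals against the continuous rating of your own circuit and PSUs before picking a cap.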

Result

Upstream: Gen4 x16 (~30 GB/s to CPU). P2P through switch: 27.5 GB/s unidirectional / 50.4 GB/s bidirectional, 0.37–0.45 µs latency, i.e. Gen4 line rate. Note: lspci may still show downstream GPU links as "2.5GT/s (downgraded)" at idle if ASPM is active anywhere; this is cosmetic. Links retrain to Gen4 under load.

Resources

A frequently updated repo on getting the most out of 4, 6, or 8 RTX 6000 Pro cards: https://github.com/local-inference-lab/rtx6kpro
Indie PCI switches that I use: https://c-payne.com

What is your GPU setup?

Or if you want to hedge against the various tail risks of third-party providers raising prices or denying you service or somehow abusing your data...

Also agreed, it's definitely a sucker's game to run a high-end model locally, by any objective measure.

Still... if it's not your weights, running on your box, you're always going to be behind somebody else's 8-ball. Everybody has to decide for themselves where their priorities lie.

> hedge against the various tail risks of third-party providers raising prices

They could 10X the prices and you’d still be better off. It’s also unlikely that prices go up enough to warrant a $100K local investment to prevent paying a couple bucks per million tokens.

> or denying you service

I guess you’re not familiar with OpenRouter? There are many providers there. There are providers outside of OpenRouter. There will always be someone to take your business.

> or somehow abusing your data...

If data security is your concern, then you’re still better off renting a server as needed.

If you cannot tolerate any data leaving, then local models are the only way. You pay a high premium for it!
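To put rough numbers on the break-even argument, here is a back-of-the-envelope sketch. It assumes $2 per million tokens as a reading of "a couple bucks", the ~80 t/s single-stream speed quoted in the README, and it ignores electricity, resale value, and batching (serving many prompts at once would raise throughput a lot). `breakeven_days` is a made-up helper name.

```shell
#!/usr/bin/env bash
# Days of non-stop, single-stream generation before a hardware purchase
# costs less than paying per token. Integer math, rounds down.
breakeven_days() {
  local hardware_usd="$1"     # up-front hardware cost
  local usd_per_mtok="$2"     # hosted price per million tokens (assumed)
  local tokens_per_sec="$3"   # local generation speed
  local tokens=$(( hardware_usd * 1000000 / usd_per_mtok ))
  echo $(( tokens / tokens_per_sec / 86400 ))
}

# $100K rig, assumed $2 per million tokens, ~80 tokens/s
echo "Break-even after ~$(breakeven_days 100000 2 80) days of continuous output"
```

Under those assumptions a $100K rig needs 50 billion generated tokens to pay for itself, which is about 7,233 days, or close to 20 years, of uninterrupted single-stream output.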

Raising prices is not a tail risk. Anything a local LLM setup can do for you can be done by any cloud provider with the same capex as yours (or less). There is no moat here, so the market is highly price-competitive and will remain so. If you want to speculate on hardware shortages, that is a different business altogether, and you need no janky garage setup to profit.

Hacker Times

Discussion
