I have two 128gb Strix Halos and have been extremely excited about Antirez's (Redis author) work on DS4, especially with 4bit quant using two machines: https://github.com/antirez/ds4

Right now the speed isn't good for GLM 5.2, but DeepSeek V4 Flash speed is okay for me (I'm actually reading the output) and quite usable. See kyuz0's great recent video here: https://www.youtube.com/watch?v=PkKXm_mKCCM

With a bit more speed and model improvements, local AI becomes a reasonable practical thing! The biggest problem is all the tech companies making consumer hardware completely unaffordable, and I don't think this is accidental. Look at Micron's profits and share price lately...

I got my Strix machines for ~2k eur each, best computers this 90s kid has ever owned, but those days are gone :(

This is amazing!

I'm working on a three node strix halo agentic OS factory designed to be maintained by local agents: https://github.com/projectbluefin/testing-lab

This memory bandwidth combo is amazing for homelabbers. kyuz0's work on these containers has made the investment in this kit so valuable. I hope Framework is sending you hardware!

https://projectbluefin.io/server/ is what I'm hoping to ship. It's designed to ship setups like this out of the box, and things like this would be so much harder without kyuz0!

(Note: The 64GB ones are going for $1700-ish empty; the prices on the 128s are outrageous. We can just keep making the labs more deterministic over time!)

This is amazing work - RDMA on these smaller unified memory boxes (somewhat) bridges the gap for consumers from the ~24GB 3090/4090/7900 cards that are around to 128GB/256GB! Still not cheap, especially now, but... obtainable?

I do hope that Apple opens up RDMA for their TB4 machines... ds4 using TB5 Macs works great, but there are a lot of capable TB4 (M2/M1) machines out there, and afaik there's no hardware limitation preventing RDMA from working (at lower bandwidth, but with the latency gains!) on the older stuff.

Benchmarks are here: https://kyuz0.github.io/amd-strix-halo-vllm-toolboxes/

Would love to see DeepSeek V4 Flash/Pro and MiniMax M3 benchmarks, but these are already pretty impressive. First Strix Halo setup I've seen with some serious performance.

EDIT: Apologies - I think I misunderstood these benchmarks. It seems this is actually very slow compared to an M4 or M5 chip with a good amount of memory. Looking at the creator's video here: https://youtu.be/Cfl3TS7ME5s?t=734 -- the performance of Strix Halo seems much slower than what I get on my M4 MBP, which gets ~400 prefill and ~20 tok/s generation.

Hmm, going PCIe -> NIC -> NIC -> PCIe seems a bit silly; couldn't both devices communicate directly over PCIe?

What's the advantage of using ConnectX-5 Ex VPI NICs instead of much cheaper ConnectX-3 VPI NICs to connect two machines directly, other than PCIe 4.0 instead of PCIe 3.0? Can they offload more tasks when doing RDMA? Solid information is hard to come by.

This is exactly the type of technical depth that makes a difference. I've been following all the work you have been doing.

So this is kind of fascinating. The main hardware costs here seem to be:

- 2x Framework Desktop AI Mainboards with 128GB of RAM for $3150 each

- 2x 100G Ethernet controllers for ~$500 each

So the Framework board has a single PCI-e 4.0 x4 slot, which amounts to 8GB/s or 64Gbps theoretical so you're not getting 100G. Also, the 100G cards all seem to be PCI-e x16 slots for obvious reasons so you need a riser or an adapter or something to even get them to work.

I don't know how hot a 100GbE copper NIC runs but, from experience, 10GbE NICs have basically been giant heatsinks. So fiber might be advisable, and I expect short fiber cables here probably aren't cost-prohibitive given everything else.

As an aside, if you are using Ethernet for clustering and you're clustering 2 devices, in an ideal world you'd be using simplex Ethernet but that's not an option here.

I wonder if the author considered USB 4.0 for clustering? I ask because I know people who have clustered Mac Studios over TB5 and that bandwidth is up to 120Gbps. The version of USB4 on the Ryzen AI 395 seems to be 40Gbps, which isn't that far off 8GB/s over PCI-e 4.0 x4.

But the limiting factor with Strix Halo (and DGX Spark for that matter) is memory bandwidth, both under 300GB/s. The obvious comparison is to the Mac Studio. Unfortunately the largest spec they currently sell is 96GB. It had been as high as 512GB. And 96GB is $6700+ but you're also getting way better performance AFAICT eg [1]. The M3 Ultra has ~900GB/s memory bandwidth.

You can alternatively buy a Macbook Pro with M5 Max and 128GB of RAM (now $8000, was $5500-6000 a few days ago) but that tops out at ~600GB/s, which is still double these mini AI boxes.

Oh and if you don't want to go the way of these Framework motherboards, you can buy a whole 128GB Strix Halo PC for $3k or less.

I think the main point here though is we're only a few years away from running 300B+ (or even 1T+) param models at useful speeds on enthusiast hardware.

[1]: https://www.reddit.com/r/LocalLLaMA/comments/1u5mfaq/you_can...

I had a Strix Halo laptop with 128GB which unfortunately died last week. I paid 2800 euro for it. If I buy the same machine today, the sticker price is 7899.

The device was not perfect by any means, but the ability to run fairly large models is some kind of magic.

I was hoping to buy a competent local model machine later this year, but given the prices I’m shelving that for now: the frontier models are very cheap relative to the cost of building my own setup, and AI-specialised hardware and processors are improving very fast, meaning hardware we buy now will become obsolete for this use case much faster than for traditional computer use cases.

In 1-3 years the hardware crunch will be over, local distilled models will provide Opus 4.8 like intelligence, and the hardware will exist to provide usable performance.

Last year you could buy a AI Max 395+ with 128G for 2.5k, now it's almost $4k.

Or maybe you're right, I originally remembered 2k as well. I wanted to wait for the AI Max 395+ upgrade of my laptop, and now it makes no sense to upgrade.

What's the advantage of ds4 over llama.cpp, esp if down the line they upstream his forked kernels?

>The biggest problem is all the tech companies making consumer hardware completely unaffordable, and I don't think this is accidental. Look at Micron's profits and share price lately...

You realize "tech companies" isn't a monolith? Micron charging inflated prices doesn't magically benefit OpenAI. The "high prices keep out competitors" theory doesn't make much sense either. It's like saying Dennys benefits from higher egg prices because it makes cooking eggs at home more expensive.

>sticker price is 7899.

It's not even worth it at that point.

You can get a used enterprise-grade SXM baseboard with 4-8 V100/A100 GPUs off eBay at a similar price. That will even get you actual HBM and NVLink, along with 10x the AI performance, assuming you don't care about your electricity bill of course.

Everything Strix Halo went 2-3x bananas, the same ballpark figures as Apple hardware now, and lead times on all of those are in months. Ridiculous where we ended up.

Yep, nice write-up, seems we are all doing this. It's as close as you can get to provider level on essentially prosumer hardware. I'll share what I've got with this running under k0s and the NPU work.

They are heavily bogged down by bandwidth unfortunately. The macs are on another level. If Apple decides to release AI dedicated hardware, it would dominate this space (consumer AI).

The pp (prompt processing) speeds are really slow (50); I think there's still room for improvement.

The machine only has PCIe 4.0 x4, so 50Gb of bandwidth; PCIe 3.0 would halve that to 25Gb.

That's the problem with these AMD laptop-class cores: they have very little I/O. They have been saying they will release in a desktop form factor, but then it probably won't have such good memory bandwidth...

The Nvidia boxes have 200Gb Ethernet, which is much more useful for clustering.

Yes CX5 can offload more. I believe CX4 has similar offloading capabilities as CX3, except that it supports 100G.

Another note: In my experience, RoCE works much better on the CX4+ generations. CX3 is best with InfiniBand. I think some firmware on the CX3 generation has a messed-up config for RoCE. But running InfiniBand is not a complex task; it's way easier than people think, like 10x easier and faster to set up than Ethernet.

No reason to use fiber on short runs like that. DAC cables are cheap and better in pretty much every way over short distances. You're probably thinking of RJ-45 NICs and SFP modules which are known to run pretty hot.

The 512GB Mac Studio is going for around 30K used.

I ran MS-01s with 100GbE and copper DACs in my Kubernetes cluster. It killed the NVMe drives in that tiny box. I'd bet on the same issue doing this with the Framework boards. And I wasn't even pushing 100GbE very hard at all; it was mostly for fun.

AI + 100GbE (under load) + tiny box = unreliable and dead very quickly.

He did cover the Tb/USB4 ;)

What is simplex ethernet?

Ignoring the fact that one would need a bit of a different setup (chassis, PSU) to run it, I casually looked and there's nothing below $25-50k euros for such a board decked out, depending on the config. TBH even that doesn't sound bad, but I wouldn't even know where to start with running it.

> Last year you could buy a AI Max 395+ with 128G for 2.5k, now it's almost $4

Only if you pay the Framework premium.

https://www.bosgamepc.com/products/bosgame-m5-ai-mini-deskto...

I don't have access to the USD price, but it's 2500€ (tax included), up from 1600€ in November when I ordered mine.

I think people buying laptops for AI use are, sorry, just plain crazy. You overpay for the screen and keyboard and battery and whatever, plus you get much worse thermal performance because of basic physics (area vs volume). My Framework Desktop has a Noctua cooler which works really well.

[Tangent: all my life I've been downvoted into a smoking hole in the ground, particularly on reddit r/hardware, for questioning the wisdom of laptops for high performance computing, including gaming. Everyone insists they need the mobility, and then just leave it plugged in the whole time, absolutely refusing to admit it's about aesthetic preference.]

The cheapest ones with 128GB were 1580€/$1840 as late as mid December.

Currently, llama.cpp clusters don't support tensor parallelism. Have a look at Donato Capitella's detailed report: https://m.youtube.com/watch?v=PkKXm_mKCCM He also provides ROCm toolboxes for Strix Halo: https://strix-halo-toolboxes.com/#about

IIRC llama.cpp doesn't implement DSv4's compressed attention mechanism, and while ds4 does use (credited) parts of llama.cpp, it's focused on this great model for now. Much of this is covered better in the repo's README.

I think it's mainly that he can move much faster with specific improvements targeting DeepSeek on systems with unified memory (Mac or Strix). It's a lot easier to optimize if you don't need to worry about all the other architectures. So optimize he did, and it's just a lot faster than llama.cpp for DeepSeek V4 Pro and Flash. Interesting features are also more doable, like SSD streaming, which makes it possible to load MoE weights for a model larger than your VRAM. I don't see that landing in llama.cpp anytime soon.

You got it wrong. Use appliances instead of eggs. If getting an oven gets more expensive, I'd rather keep going to Denny's.

It’s classic capex vs opex. I’d keep paying my OpenAI subscription instead of dropping $3k to run a subpar model. If the thing cost $1k I would consider it.

OpenAI etc. are going to have higher utilisation of the hardware, so they can afford it more than small companies/people. Efficient resource use matters more when resources are expensive.

Ah yeah, after watching one of the creator's YouTube videos I realize these benchmarks combine prefill and decode, which isn't super helpful. It seems this struggles with the same bottleneck as all Strix Halo setups: memory bandwidth. It still seems significantly slower than Mac hardware with equivalent memory.

+1 fiber over short distance just adds power/heat and latency compared to DAC - fiber is nice for ease of cabling and airflow, but not performance or cost when below a few meters.

I generally agree for everything except MacBook Pros, which outperform most available desktop setups for AI tasks - but they are also now out of reach for most people after the price hikes (6.7k now for 128GB; I got mine for 4.7k just about a year ago).

Honestly I think this is just a bad time to be buying hardware - everything is marked up an insane amount that doesn't really make sense.

I’m mostly with you but there are some people who like to use one machine for both laptop and AI work, and it’s much cheaper than buying two separate devices.

For me the smaller footprint, lower power consumption and portability (admittedly between desks only) are the three advantages of using a laptop over a desktop for these purposes.

The repo README and antirez's Reddit comments also express a willingness to upstream.

How do the memory bandwidth specs of MacBooks compare to this?

How many MS-01s did you have clustered?

And could you not use something like an N5 + iSCSI for storage?

Imagine two computers A and B. A has two NICs, A1 and A2. B has B1 and B2. So 4 NICs total. You connect directly A1 to B1 and A2 to B2 with crossover cables. You then route all the traffic from A to B over A1 to B1 and all the traffic from B to A over B2 to A2.

Why do you do all this? To avoid collisions and the loss of effective bandwidth from back-offs.

It only really works with 2 computers because if you add a 3rd, now you need 12 NICs instead of 4 for unidirectional point-to-point connections.

Looks like there is currently no RDMA support for Thunderbolt, so it's a much higher-latency connection. Apple has RDMA over Thunderbolt working, so I wonder if it's possible on Strix Halo.

The Strix Halo mini PCs use the exact same chip, and have a much smaller footprint than any laptop. Have you seen the size of these machines? I can and have easily popped my daily driver computer into my very small backpack to attend a demoparty for example.

With the laptop you probably won't get silent operation at the peak 100-140w, i.e. you've now massively overpaid for lower performance.

The Apple silicon chips basically beat everything in bandwidth: the highest number of memory controllers (i.e. channels) for a given capacity. That's the main party trick.

But why would you? You don't have collisions since the introduction of full duplex ethernet on both copper/fiber. Kinda sounds like you're confusing half duplex with simplex, or maybe bidi? As a network engineer I've never seen someone ever refer to "simplex ethernet".

You need Non-Transparent Bridging (NTB) for that; I don't know if AMD has it.

Can you get these from vendors like Asus and Lenovo these days?

The ones I've seen on AliExpress are from unknown Chinese vendors.

AMD Strix Halo RDMA Cluster Setup Guide

This guide details how to configure a two-node AMD Strix Halo cluster linked via Intel E810 (RoCE v2) for distributed vLLM inference using Tensor Parallelism.

1. TL;DR (Quick Start)
2. Concepts & Architecture
3. Hardware Prerequisites
4. Host Configuration (Fedora)
5. Toolbox Installation & Network Verification
6. Running the Cluster
   - 6.1 Setup & Verify
   - 6.2 Launching vLLM
7. Troubleshooting
8. References & Acknowledgements
9. Alternative: Thunderbolt Networking

1. TL;DR (Quick Start)

On Both Nodes:

1. Preparation:
   - Install/update Fedora 43 and install the E810 NICs (check firmware: ethtool -i <iface>).
   - BIOS/Kernel: Set iGPU to 512MB and apply kernel params (iommu=pt, pci=realloc, etc.).
   - SSH: Configure passwordless SSH between nodes.
2. Networking: Assign static IPs (192.168.100.1 & .2), set MTU 9000, and trust the interface in the firewall.
3. Install Toolbox: Run ./refresh_toolbox.sh (this automatically installs the container with RDMA support and the custom librccl.so patch).
4. Run Cluster:
   - Run start-vllm-cluster.
   - Select "2. Start Ray Cluster" (follow the prompts in the TUI).
   - Select "4. Launch VLLM Serve" and choose your model. (Export HF_TOKEN first for gated models!)

Key Note: The refresh_toolbox.sh script detects your Infiniband/RDMA devices and automatically configures the container to expose them.

2. Concepts & Architecture

To fully utilize the Strix Halo cluster, it is helpful to understand the technologies involved:

vLLM: A high-performance inference engine. To run models larger than a single GPU (or APU) can handle, it splits the model using Tensor Parallelism (TP).
Ray: A distributed computing framework. vLLM uses Ray to orchestrate the cluster, manage the "worker" processes on each node, and ensure they start up correctly. Ray handles the control plane (issuing commands).
RCCL (ROCm Collective Communication Library): The AMD equivalent of NVIDIA's NCCL. This library handles the data plane—specifically, the extremely fast synchronization of tensor data between GPUs. When TP=2, the two nodes must exchange partial results after every single layer of the neural network. This happens thousands of times per second.
RoCE v2 (RDMA over Converged Ethernet): The protocol that allows RCCL to write data directly from one Node's memory to the other Node's memory, bypassing the CPU and OS kernel.
- Without RDMA: Latency is ~70-100µs (TCP/IP overhead).
- With RDMA: Latency is ~5µs.
- Why it matters: For interactive token generation, high latency kills performance. RoCE makes the two nodes feel like a single machine.
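
To get a feel for the latency figures above, here is a rough back-of-the-envelope sketch of the time one generated token spends waiting on cross-node synchronization. The layer count (60) and the two synchronizations per layer are illustrative assumptions, not measurements from this setup, and `sync_overhead_ms` is a hypothetical helper name.

```bash
# Estimate per-token time (ms) spent waiting on cross-node synchronization.
# Assumptions (not from this guide): layer count and syncs per layer are
# illustrative inputs; real models and vLLM versions will differ.
sync_overhead_ms() {
    local layers=$1 syncs_per_layer=$2 latency_us=$3
    awk -v l="$layers" -v s="$syncs_per_layer" -v t="$latency_us" \
        'BEGIN { printf "%.2f\n", l * s * t / 1000 }'
}

# Hypothetical 60-layer model, 2 synchronizations per layer:
# sync_overhead_ms 60 2 70   # TCP-like latency  -> 8.40
# sync_overhead_ms 60 2 5    # RDMA-like latency -> 0.60
```

Under these assumptions, the fixed per-token cost drops from several milliseconds to well under one, before any payload transfer time is counted.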

3. Hardware Prerequisites

Nodes: 2x Framework Desktop Mainboards with AMD Ryzen AI MAX+ "Strix Halo", 128GB of Unified Memory.
Network Cards: Intel Ethernet Controller E810-CQDA1 (or similar 100GbE QSFP28).
Connection: Direct Attach Copper (DAC) cable (e.g., QSFPTEK 100G QSFP28 DAC). No switch required for 2 nodes.
PCIe Note: The Framework motherboard PCIe slot is physically x4, so a riser is required to plug in an x16 card (e.g., CY PCI-E Express 4x to 16x Extender).
Test Setup Note: One of the boards in this setup has a modified PCIe slot (cut by Framework using an ultrasonic knife) to accept x16 cards directly. This is not recommended for users. Risers are the cheaper, safer, and easier solution. Performance is identical (~50Gbps bandwidth, ~5µs latency).
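
As a sanity check on the ~50Gbps figure, the theoretical ceiling of the x4 slot can be worked out from the per-lane transfer rate and the 128b/130b line encoding used by PCIe 3.0 and later. This is a minimal sketch; `pcie_gbps` is a hypothetical helper, and the result is a raw link ceiling that ignores packet and protocol overhead.

```bash
# Theoretical one-direction PCIe throughput in Gbps.
# Arguments: transfer rate per lane in GT/s, number of lanes.
# PCIe 3.0 (8 GT/s) and 4.0 (16 GT/s) both use 128b/130b encoding.
pcie_gbps() {
    local gt_per_lane=$1 lanes=$2
    awk -v g="$gt_per_lane" -v n="$lanes" \
        'BEGIN { printf "%.1f\n", g * n * 128 / 130 }'
}

# pcie_gbps 16 4   # PCIe 4.0 x4 -> 63.0
# pcie_gbps 8 4    # PCIe 3.0 x4 -> 31.5
```

The measured ~50Gbps sits below the 63.0 Gbps ceiling, which is presumably down to protocol overhead; either way a 100GbE card cannot reach line rate in this slot.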

4. Host Configuration (Fedora)

Perform these steps on the Host OS (Fedora 43) of both nodes.

Tested Host Configuration:

| Node | Kernel | OS | IP (RDMA Interface) |
|---|---|---|---|
| Node 1 | `6.18.5-200.fc43.x86_64` | Fedora Linux 43 | `192.168.100.1/30` |
| Node 2 | `6.18.6-200.fc43.x86_64` | Fedora Linux 43 | `192.168.100.2/30` |

Note: These specific kernel versions were verified to work. Fedora 43 is recommended.

4.1 Install Packages

Install the core RDMA userspace tools. You do not need proprietary Intel drivers; the in-kernel drivers work perfectly.

Ethernet Driver: ice
RDMA Driver: irdma (Unified driver for RoCE v2 & iWARP)

sudo dnf install rdma-core libibverbs-utils perftest

rdma-core: The userspace components for the RDMA subsystem (libraries, daemons, and configuration tools).
libibverbs-utils: Utilities for querying RDMA devices (e.g., ibv_devinfo).
perftest: A suite of benchmarks (e.g., ib_write_bw, ib_send_lat) to verify RDMA bandwidth and latency.
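
After installing the packages, it can help to confirm that the two in-kernel drivers named above are actually loaded. The sketch below is one possible check; `missing_modules` is a hypothetical helper, and it takes the module list file as an argument so it can also be run against saved output.

```bash
# Print each expected kernel module that is NOT listed in a modules file.
# /proc/modules lists the module name in the first column.
missing_modules() {
    local file=$1; shift
    local mod
    for mod in "$@"; do
        awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$file" \
            || echo "$mod"
    done
}

# On the host (prints nothing when both drivers are loaded):
# missing_modules /proc/modules ice irdma
# List the RDMA devices exposed by the loaded driver:
# ibv_devinfo
```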

4.2 Check Native Firmware

Use ethtool to check the current firmware version of your Intel E810 card.

ethtool -i enp194s0np0

Recommended Firmware: Ensure your firmware is at least as new as the version shown below (Firmware 4.91...). If your firmware is older, please update it using the Intel® Ethernet NVM Update Tool for E810 Series.

Example Output:

driver: ice
version: 6.18.5-200.fc43.x86_64
firmware-version: 4.91 0x800214b5 1.3909.0
expansion-rom-version: 
bus-info: 0000:c2:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
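
If you want to script the firmware check, a small filter over the `ethtool -i` output can compare the reported version against a minimum. This is a sketch under the assumption that the first field after `firmware-version:` is the comparable version number (4.91 in the example above); `firmware_ok` is a hypothetical helper, and `sort -V` requires GNU coreutils.

```bash
# Read `ethtool -i` output on stdin and report whether the firmware version
# is at least the minimum given as $1 (default: 4.91, as shown above).
firmware_ok() {
    local min=${1:-4.91} fw
    # Assumption: the first field after the label is the version to compare.
    fw=$(awk '/^firmware-version:/ { print $2 }')
    [ -n "$fw" ] || { echo "no firmware-version line found"; return 2; }
    # sort -V orders version strings; if the minimum sorts first, fw >= min.
    if [ "$(printf '%s\n%s\n' "$min" "$fw" | sort -V | head -n1)" = "$min" ]; then
        echo "OK: firmware $fw"
    else
        echo "OUTDATED: firmware $fw (want >= $min)"
        return 1
    fi
}

# ethtool -i enp194s0np0 | firmware_ok
```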

4.3 Network Configuration

This guide assumes a subnet of 192.168.100.0/30.

Identify your interface: Run ip link to find your 100GbE card (e.g., enp194s0np0).

Node 1 (Head - 192.168.100.1):

# Bring link up
sudo ip link set enp194s0np0 up

# Assign IP
sudo ip addr add 192.168.100.1/30 dev enp194s0np0

# Set MTU (Jumbo Frames); assumes a NetworkManager connection profile named "rdma0" already exists for this interface
sudo nmcli connection modify "rdma0" ethernet.mtu 9000
sudo nmcli connection up "rdma0"

Node 2 (Worker - 192.168.100.2):

# Bring link up
sudo ip link set enp194s0np0 up

# Assign IP
sudo ip addr add 192.168.100.2/30 dev enp194s0np0

# Set MTU; assumes a NetworkManager connection profile named "rdma0" already exists for this interface
sudo nmcli connection modify "rdma0" ethernet.mtu 9000
sudo nmcli connection up "rdma0"

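Verify Routing: Assigning the address normally creates the subnet route automatically. Check with ip route on both nodes; if the 192.168.100.0/30 route is missing, add it: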

sudo ip route add 192.168.100.0/30 dev enp194s0np0

Verify Link:

rdma link
# Output should show: state ACTIVE physical_state LINK_UP used_usec X ...
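
One way to confirm that jumbo frames really pass end to end is to ping with fragmentation prohibited and the largest payload that fits in a single packet. The sketch below computes that payload for IPv4; `max_ping_payload` is a hypothetical helper, and the `-M do` option assumes the iputils version of `ping` shipped with Fedora.

```bash
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus the 20-byte IPv4 header and the 8-byte ICMP header.
max_ping_payload() {
    local mtu=$1
    echo $(( mtu - 20 - 8 ))
}

# From Node 1, with fragmentation prohibited (-M do):
# ping -M do -c 3 -s "$(max_ping_payload 9000)" 192.168.100.2
```

If the MTU is still 1500 on either side, this ping fails instead of silently fragmenting.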

4.4 BIOS & Kernel Configuration

1. BIOS Settings: Set the iGPU Memory Allocation to the minimum possible (512MB). We will use the GTT (Graphics Translation Table) to dynamically allocate system memory as "Unified Memory" for the GPU.

2. Kernel Parameters: Update GRUB to enable unified memory, optimize RDMA performance, and fix PCI resource allocation.

Edit /etc/default/grub and append to GRUB_CMDLINE_LINUX:

iommu=pt pci=realloc pcie_aspm=off amdgpu.gttsize=126976 ttm.pages_limit=32505856

Explanation of Parameters:

iommu=pt: Sets IOMMU to "Pass-Through" mode. This is critical for performance, reducing overhead for both the RDMA NIC and the iGPU unified memory access.
pci=realloc: Reallocates PCI BARs. Often needed on consumer platforms to properly map large address spaces for devices like the E810 or Strix Halo.
pcie_aspm=off: Disables PCIe Active State Power Management. Prevents latency spikes and link negotiation issues on the 100GbE connection.
amdgpu.gttsize=126976: Caps the GPU GTT size to ~124GiB (126976MB). This defines how much system RAM the GPU can address as its own "VRAM".
ttm.pages_limit=32505856: Limits the Translation Table Manager to ~124GiB (in 4KB pages), matching the GTT size.
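
The two memory values are the same size expressed in different units, so they can be derived from a single number. The sketch below reproduces the values above from 124 GiB; `gtt_params` is a hypothetical helper, and how much headroom to leave for the OS on other memory sizes is a judgment call this guide does not cover.

```bash
# Derive both kernel parameters from one target size in GiB.
# amdgpu.gttsize is given in MiB; ttm.pages_limit in 4 KiB pages.
gtt_params() {
    local gib=$1
    local mib=$(( gib * 1024 ))
    local pages=$(( gib * 1024 * 1024 / 4 ))
    echo "amdgpu.gttsize=${mib} ttm.pages_limit=${pages}"
}

# gtt_params 124   # -> amdgpu.gttsize=126976 ttm.pages_limit=32505856
```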

3. Apply Changes:

sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
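
After the reboot, it is worth checking that the parameters actually reached the running kernel. A minimal sketch, with `missing_params` as a hypothetical helper that compares a command-line string against the expected tokens:

```bash
# Print every expected parameter that is missing from a kernel command line.
# Usage: missing_params "<cmdline string>" param...
missing_params() {
    local cmdline=$1; shift
    local p
    for p in "$@"; do
        # Pad with spaces so only whole, space-separated tokens match.
        case " $cmdline " in
            *" $p "*) ;;
            *) echo "$p" ;;
        esac
    done
}

# Prints nothing when all parameters are active:
# missing_params "$(cat /proc/cmdline)" iommu=pt pci=realloc pcie_aspm=off \
#     amdgpu.gttsize=126976 ttm.pages_limit=32505856
```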

4.5 Firewall Rules

Applications like Ray and RCCL use random high ports. It is easiest to trust the internal RDMA interface completely.

# Assign the interface to the trusted zone permanently
sudo firewall-cmd --permanent --zone=trusted --add-interface=enp194s0np0

# Reload firewall
sudo firewall-cmd --reload

5. Toolbox Installation & Network Verification

5.1 Prerequisites: Passwordless SSH

The cluster management and verification scripts rely on SSH to execute commands on remote nodes. You must configure passwordless SSH between both nodes (root or sudo-enabled user).

Guide: How to Set Up SSH Keys on Linux (DigitalOcean)
Quick Check: Run ssh <other-node-ip> date from each node. It should print the date without asking for a password.
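
The quick check can be made non-interactive so that it fails instead of prompting. The sketch below uses the standard OpenSSH `BatchMode` and `ConnectTimeout` options; `check_passwordless_ssh` is a hypothetical helper, and the key type in the commented setup lines is just one common choice.

```bash
# Succeed only if SSH to the host works without any password prompt.
# BatchMode=yes makes ssh fail rather than ask for a password.
check_passwordless_ssh() {
    local host=$1
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
        echo "OK: $host"
    else
        echo "FAIL: $host (run ssh-copy-id $host first?)"
        return 1
    fi
}

# One-time setup, run on each node against the other:
# ssh-keygen -t ed25519
# ssh-copy-id 192.168.100.2        # from Node 1; use 192.168.100.1 from Node 2
# check_passwordless_ssh 192.168.100.2
```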

5.2 Installation

The toolbox container provided in this repo includes a critical patch: a custom-built librccl.so that enables gfx1151 (Strix Halo) support for RDMA (https://github.com/kyuz0/rocm-systems/tree/gfx1151-rccl), which is currently missing in upstream ROCm packages. This library is automatically compiled using the build-rccl GitHub Action in this repository, which generates the artifact that is then bundled into the Docker container.

To install the toolbox on both nodes, run:

./refresh_toolbox.sh

What this does:

Pulls the latest kyuz0/vllm-therock-gfx1151 image.
Detects if /dev/infiniband exists on your host.
Creates the toolbox with flags to expose:
- iGPU Access: /dev/dri, /dev/kfd (Required for ROCm)
- RDMA Access: /dev/infiniband, --group-add rdma
- Memory Pinning: --ulimit memlock=-1 (Required for DMA)
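
If the toolbox comes up without GPU or RDMA access, a first step is to check that the device nodes listed above exist on the host. A minimal sketch, with `missing_devices` as a hypothetical helper:

```bash
# Print each required device path that does not exist.
missing_devices() {
    local path
    for path in "$@"; do
        [ -e "$path" ] || echo "$path"
    done
}

# On the host (prints nothing when everything is present):
# missing_devices /dev/dri /dev/kfd /dev/infiniband
# Inside the toolbox, the memlock limit should report "unlimited":
# ulimit -l
```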

5.3 Verify RDMA Connection

Before proceeding to run the cluster, verify that RDMA is active and providing low latency (~5µs vs ~70µs for Ethernet).

Run the provided verification script from the Head Node:

# Inside toolbox
/opt/compare_eth_vs_rdma.sh

Expected Results:

Path                 Latency      Bandwidth   
------------------------------------------------
Ethernet (1G LAN)    0.074 ms     0.94 Gbps   
Ethernet (RoCE NIC)  0.068 ms     55.70 Gbps  
RDMA (RoCE)          5.23 us      50.64 Gbps

Note the roughly 13x latency drop (from ~68 µs to ~5 µs) for RDMA over the same NIC.

6. Running the Cluster

A TUI utility, start-vllm-cluster, is provided to manage the Ray cluster and vLLM.

6.1 Setup & Verify

Enter the toolbox:
```
toolbox enter vllm
```
Run the Cluster Manager:
```
start-vllm-cluster
```
Configure IPs (Option 1):
- Ensure Head is 192.168.100.1 and Worker is 192.168.100.2.

Start Ray Cluster (Option 2):

On Node 1: Select "Head" when prompted.
On Node 2: Select "Worker" when prompted.

The script effectively runs:

# Head
export NCCL_SOCKET_IFNAME=<rdma_iface>
ray start --head --node-ip-address=192.168.100.1 ...

# Worker
ray start --address=192.168.100.1:6379 ...

Check Status (Option 3):
- Ensure you see 2 nodes and adequate GPU resources (e.g., 2.0 GPU).

6.2 Launching vLLM

Once the cluster is active (checked via Option 3):

Select "4. Launch VLLM Serve" in the TUI.
Choose a model (e.g., Meta-Llama-3.1-8B-Instruct).
Configuration Menu:
- Tensor Parallelism: Set to 2 (one GPU per node).
- Context Length: Auto or custom (e.g., 131072).
- Erase vLLM Cache: Select YES if you are restarting after a crash.
- Force Eager Mode: Select YES.
  - Why? CUDA Graphs can be unstable on distributed APU clusters and cause deadlocks. Eager mode is safer. Turning eager mode off (which re-enables graph capture) might gain 1-3% performance, at the risk of hangs.
Launch: Select "LAUNCH SERVER".
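
For reference, the menu choices above map onto standard vllm serve flags. The sketch below only prints a command and does not run it. The TUI's real invocation is not shown in this guide, so treat the helper name vllm_serve_cmd, the argument order, and the full model ID as assumptions.

```shell
# Hypothetical helper: print (not run) a vllm serve command matching the
# menu choices above.
vllm_serve_cmd() {
  model="$1"; tp="$2"; ctx="$3"; eager="$4"
  cmd="vllm serve ${model} --tensor-parallel-size ${tp}"
  cmd="${cmd} --distributed-executor-backend ray"
  # "auto" leaves the context length to vLLM's default for the model
  if [ "$ctx" != "auto" ]; then
    cmd="${cmd} --max-model-len ${ctx}"
  fi
  # Force Eager Mode = YES disables graph capture
  if [ "$eager" = "yes" ]; then
    cmd="${cmd} --enforce-eager"
  fi
  printf '%s\n' "$cmd"
}

vllm_serve_cmd meta-llama/Meta-Llama-3.1-8B-Instruct 2 131072 yes
```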

Important Gotchas:

First Run Download: When running a model for the first time, each node in the cluster must download the weights independently. This may take some time depending on your internet connection.
Gated Models (e.g., Gemma):
- Models like google/gemma-2-27b-it are "gated" and require you to request access on Hugging Face.
- You must export your Hugging Face token before running the cluster script:
```
export HF_TOKEN=your_token_here
start-vllm-cluster
```
- If you don't provide a token or haven't accepted the license on Hugging Face, the download will fail.
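
A small pre-flight check can catch a missing token before each node starts downloading. This is a sketch only; check_hf_token is a hypothetical helper and is not part of start-vllm-cluster.

```shell
# Fail early when HF_TOKEN is missing, before any node starts downloading.
check_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; gated model downloads will fail" >&2
    return 1
  fi
  echo "HF_TOKEN is set"
}

check_hf_token || echo "export HF_TOKEN before running start-vllm-cluster"
```

The check only confirms that the variable is non-empty. It cannot tell whether the license has been accepted on Hugging Face.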

7. Troubleshooting

vLLM Deadlocks / Hangs

Cause: CUDA Graph capture can freeze on distributed APU nodes.
Fix: Enable "Force Eager Mode" in the start menu.

Firmware

If you see link issues, make sure your Intel E810 firmware is up to date, using Intel's standard update tools.

8. References & Acknowledgements

Reddit - Strix Halo Batching with Tensor Parallel: Thread by Hungry_Elk_3276
- Special thanks to user Hungry_Elk_3276 for their initial experiments with vLLM RDMA, which highlighted the missing gfx1151 support in upstream RCCL.

9. Alternative: Thunderbolt Networking

If you do not have dedicated 100GbE RDMA network cards, you can directly connect the two nodes using a high-quality Thunderbolt 4 / USB4 cable. This will create a thunderbolt0 network interface.

While it lacks the microsecond-level latency of RDMA, it provides significantly more bandwidth than standard 1GbE/5GbE Ethernet and is easier to configure.

Note: thunderbolt-net relies on standard OS kernel TCP/IP stacks.

9.1 Thunderbolt Configuration

1. Establish Connection: Connect the nodes directly using a certified Thunderbolt 4 or USB4 cable. Verify the link is active:

ip link show thunderbolt0

2. Network Configuration (Head - Node 1): Configure a persistent connection using nmcli with a static IP and Jumbo Frames (reduces CPU overhead). Note: Jumbo Frames may be unsupported on some Thunderbolt host controllers.

sudo nmcli connection add type ethernet ifname thunderbolt0 con-name thunderbolt0 ipv4.method manual ipv4.addresses 192.168.2.1/24 mtu 9000
sudo nmcli connection up thunderbolt0

3. Network Configuration (Worker - Node 2):

sudo nmcli connection add type ethernet ifname thunderbolt0 con-name thunderbolt0 ipv4.method manual ipv4.addresses 192.168.2.2/24 mtu 9000
sudo nmcli connection up thunderbolt0

4. Firewall Rules: To ensure Ray and NCCL can communicate freely over this link:

# Assign the interface to the trusted zone permanently
sudo firewall-cmd --permanent --zone=trusted --add-interface=thunderbolt0
sudo firewall-cmd --reload
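
Because Jumbo Frames may not be supported on every controller, it can help to confirm that the MTU actually took effect. The sketch below reads the MTU from ip link show style text and derives the largest ICMP payload that fits in one unfragmented frame (MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header). The helper names and the sample line are illustrative assumptions.

```shell
# Extract the MTU from `ip link show <iface>`-style text on stdin.
link_mtu() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "mtu") { print $(i + 1); exit } }'
}

# Largest ICMP payload that fits in one frame: MTU - 20 (IPv4) - 8 (ICMP).
jumbo_ping_payload() {
  echo $(( $1 - 28 ))
}

# Illustrative sample standing in for `ip link show thunderbolt0`
sample='3: thunderbolt0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UP'
mtu=$(printf '%s\n' "$sample" | link_mtu)

# Print a ping command that forbids fragmentation (-M do) at that size
echo "ping -M do -s $(jumbo_ping_payload "$mtu") -c 3 192.168.2.2"
```

If the printed ping command fails from the Head while a plain ping succeeds, the link is probably not passing 9000-byte frames end to end.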

9.2 Running vLLM over Thunderbolt

Our cluster scripts dynamically detect the network interface based on the provided IPs. There is no need to manually export environment variables!

Open the Toolbox: toolbox enter vllm
Launch the cluster manager: start-vllm-cluster
Select Option 1 (Configure IPs).
Set the Head IP explicitly to 192.168.2.1 and the Worker IP to 192.168.2.2.
Start the cluster normally (Option 2). The script will automatically discover and utilize thunderbolt0 as the backend network for Ray orchestration and GPU synchronization.
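
The detection step can be pictured as a lookup from IP address to interface name. The following is a sketch of one way to do that, not the cluster script's actual code: iface_for_ip is a hypothetical helper that parses ip -o -4 addr show output, and the sample text is illustrative.

```shell
# Print the interface that carries the given IPv4 address, reading
# `ip -o -4 addr show` output from stdin.
iface_for_ip() {
  awk -v ip="$1" '{ split($4, a, "/"); if (a[1] == ip) { print $2; exit } }'
}

# Illustrative sample standing in for real `ip -o -4 addr show` output
sample='1: lo    inet 127.0.0.1/8 scope host lo
4: thunderbolt0    inet 192.168.2.1/24 brd 192.168.2.255 scope global thunderbolt0'
printf '%s\n' "$sample" | iface_for_ip 192.168.2.1
```

On a live node the result could feed the same variable shown in section 6.1, for example: export NCCL_SOCKET_IFNAME="$(ip -o -4 addr show | iface_for_ip 192.168.2.1)"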

9.3 Validating the Link

I have added Thunderbolt support to the compare_eth_vs_rdma.sh script. Run it from inside the toolbox to see the latency and bandwidth of your Thunderbolt link compared to your other network interfaces.

You can use the -t flag to ONLY benchmark the Thunderbolt connection (or -e, -r, -i for the others):

/opt/compare_eth_vs_rdma.sh -t

You need Non-Transparent Bridging (NTB) for that; I don't know if AMD has it.

Can you get these from vendors like Asus and Lenovo these days?

The ones I've seen on AliExpress are from unknown Chinese vendors.

I have a Framework Desktop as primary PC (great cooling, beautiful case with handle) and the Bosgame M5 dedicated for AI use.

I was also a bit wary about Bosgame but TBH they've been great and the machine is rock solid, if a little noisier than and not as pretty as the FD. You can just buy from them directly and be fine, best computer deal out there by a mile.

Hacker Times

AMD Strix Halo RDMA Cluster Setup Guide

Discussion
