For a lot of research questions 6 GPUs is even overkill.
It’s one of the reasons I’m skeptical of the “trillion dollar supercluster” idea [0]. I think what we need is more reasonably smart people investigating medium-sized problems. A “GPU middle class” you might say.
[0] https://situational-awareness.ai/racing-to-the-trillion-doll...
- https://www.williamangel.net/blog/2026/05/17/offline-llm-ene... - Discussion: https://news.ycombinator.com/item?id=48168198
In comparison to just spending for tokens, the tokens would have been much cheaper and much much faster. I've been running against Gemma4:31b, Qwen3.5 and 3.6, and getting local LLMs to solve AMC 8/10 math questions and it's about 10-100x slower than just doing it online. When I tried it with ChatGPT late last year, it took about one night and $25 to solve about 1000 questions. Using my RTX 6000 and M3 Ultra and Gemma4:31b on both, it answered about 40 questions in 7 hours and I haven't checked how good the answer is yet. At 800 watts (600 for RTX and 200 for M3 Ultra) and running for 7 hours, it solved around 40 questions.
At the very least I'm going to try to sell my M3 Ultra if I can find a reliable place to sell it without getting ripped off by scammers.
It just scares me to own a box that is $48K in my house, especially if it breaks, or gets stolen.
Genuine question; would anyone here recommend any specific motherboard to best utilize these cards?
But yes, for pure inference, the M5 Max Macbook Pros probably aren't there yet. They have other utility though of course. And you can get 64GB and 128GB MBPs at a discount. Micro Center currently will let you buy a 64GB M5 Max MBP for under $4k currently, for example.
Cloud is optimized for development velocity, but it is a high-margin business, which eventually makes on-prem more attractive.
It could be too late, but it might be worth looking into tax savings if you have a business. Depreciation of an asset counts as a loss and may reduce your taxable income. (I'm NOT a tax expert)
:( you paid a professional pc builder and you weren't told this?
The Ada has a memory bandwidth of 960GB/s. The Pro has 1.8TB/s and about 40-50% better performance so is at least equivalent in processing power, much better in memory bandwidth (important for inference) and can hold larger models on a single card.
I've considered buying a rig with 1-2 6000 Pros for similar reasons but I want to see what happens with this year's Mac Studios with a likely M5 Ultra. Macs have a shared memory architecture whereas NVidia segments the market based on max memory where the biggest consumer card (RTX 5090) has 32GB of VRAM but still excellent memory bandwidth (1.8TB/s). A RTX 5090 rig will still trounce a Mac Studio seems to be the conventional wisdom. Despite being able to hold larger models and being able to chain Mac Studios on TB5, their lower memory bandwidth (~900GB/s) and lower overall GFLOPS mean they still come out behind.
That being said, the current Mac Studios are relatively long in the tooth, being released in 2024.
I'm still not sure any of this is really worth it because things are still changing so fast. I think there's a decent chance of a number of large AI companies going bust in the next 2-3 years such that you'll be able to buy enterprise AI hardware at cents on the dollar, a bit like how Google bought data centers in the post-dot-com crash.
But anyway, nowadays I'd be looking at the RTX 6000 Pro as the sweet spot, having anywhere from 1-4 in a single server.
The electrical issues the author mentions are interesting. I hadn't really thought about the max amperage on a residential circuit. In a DC, these would typically operate on three phase power and much higher overall amperage. I wonder if there's a device you can buy that can combine multiple residential circuits into a single power source for a server this power hungry?
But the trend here is interesting. I think by 2030 you'll be able to buy fairly cheap hardware that is currently $10k+. I don't know what this does to the trillions invested in AI data centers because the next NVidia architecture after Blackwell will essentially halve the value of purchased cards overnight.
I'm not convinced Apple has yet pivoted the Mac Studio line towards this market and the expected M5 Ultras in Q3 2026 will likely be an incremental improvement rather than big leap forward but I'd like to be proven wrong.
"I spent a long time trying high risk/high reward experiments and failing. But now I have something good. I’ve solved a major problem with LLMs. And I’m launching next Monday so we will soon see if it’s actually a breakthrough or just LLM psychosis "
Maybe ai companies today have some bounty program?
"If I were to do this again, I wouldn’t do a custom build like this. I would buy a standard datacenter server and rent space in a colocation center"
I'm sure there are use cases when renting makes sense, but it can get crazy expensive really fast if you're not careful.
Or, for a person who did have a great way to monetize the same workload they’d probably find a lot of value in reading this post.
I myself run with gigabyte trx40 aorus xtreme, but since it's regular threadripper (not pro) with 4 GPUs 2 of them will run at x16 and two of them at x8 speeds
(I would assume they haven't made a lot of $ off of this, if nothing else because they've only just put out that post and demo. They do seem to have produced a model that doesn't sound very LLM-y to my ear, though it also seems rather weak for its size.)
While I'm skeptical that there is much of a moat, at least for the large players, it should at least hopefully set rosmine up for the next job :)
It does seem to fix the current biggest issues with using LLMs for writing at various publishers. If you're The Economist, you have a very specific house style and you have a decent corpus of articles written in that style. At least on my reading of it, rosmine can use DFT to get a model to closely match its outputs, in terms of the language quirks that are generated, to that of the corpus it is fine tuned on. ie it will very much match the house style, particularly as it is used in writing, vs giving a system prompt to an LLM that has some Economist articles in its vast training set, and telling it to write in that style- it will do an ok job, but still exhibit LLM language quirks despite itself. Even if you feed it the specific "style guide" that they give their authors, I dare say the reality of their writing is the best place to learn, and it sounds like DFT can ground the writing of a model in a specific corpus like that.
[1]: https://rosmine.ai/2026/05/18/fixing-llm-writing-with-distri...
As the author notes, there are also electrical/wiring issues that cap how much compute gear you can run in a space not designed for it. I suspect a standard 20A 110V circuit can probably handle 2x RTX 6000 Pros. 15A probably can but that requires more research. Anything more than that and you're using multiple circuits, which has issues, or you need an upgraded circuit (eg 40A 240V) with all that entails (eg heavier duty cables, custom plug, etc).
Because that wasn't what they claimed to research?
>> for inference it's definitely not worth it.
It's entirely fine if you enjoy local LLMs on your computer, there are people doing horribly inefficient inference on smartphones now. But for pure inference tasks, it's pretty obvious why M5s and Mac Studios aren't replacing TPUs and GPUs.
Cynical take: They made an LLM that can bypass existing AI slop detectors.
Realistic take: They found a research problem they found interesting, dumped a bunch of capital and sweat equity into and (claimed to have, at least) found a solution. Neat!
They did not. That's a mining rig not a workstation. It's visible from the photo and the chart showing multiple failures over a short period of time including the risers -- which are visibly very low quality -- failing twice.
You have 50K, you call a real expert like Puget Systems or Digital Storm.
I feel that the open weight models pale in comparison to the frontier models, and I believe that if the gap closes quickly, that the open weight vendors will stop releasing it for free.
There are no specs in this blog post regarding CPU/motherboard choice, but if you go with Threadripper Pro, they have had 128 PCIe lanes for some time now, so using all GPUs at full speed shouldn't be a problem
I don't think anything compares to the nVidia chips at all.
Edit: I now see the author was in an apartment and couldn't do this, so I concede this is not responsive here.
Is this the best general-purpose choice as of 2026 with $50k for training, fine-tuning and running large open models?
https://rosmine.ai/2026/05/18/fixing-llm-writing-with-distri...
The idea is obviously to be running the LLM on your work laptop. As a developer I'd need a laptop with 24GB of RAM for work anyway, and 48GB, which is enough for a very good quant of Gemini, is just $400 extra.
edit: Hm, finding mixed information online on whether that's still supported or not. Apparently it was removed in workstation GPUs.
After my last run, I'm going to wait for the new case I ordered to come in and cannibalize my kid's PC that we built beginning of this year to form an entirely separate computer. And then figure out better ways to deal with the heat, especially with summer coming up. I'll have to play around with undervolting and running vents directly outside my house to see if that helps.
At the time he put this rig together, there weren't a lot of open-weight LLMs that could run well on 6x48=288 GB, so it probably wasn't a huge loss. There still aren't, really.
Right now I'm in the process of cramming Blackwell cards into an old DDR4-based Milan server, where the important thing is to be able to run large models at all. The GPU fans alone burn over 400 watts at full throttle.
They do it well enough that it'd take really good output to beat.
"If I were to do this again, I wouldn’t do a custom build like this. I would buy a standard datacenter server and rent space in a colocation center. But then I would miss saying Hi to grumbl once in a while."
I saw your heat comments about the RTX 6000 Pro as well. I bought a few of them recently and I'm running 2 of them in a 2U case in a colo. You need a lot of active airflow to keep them cool. Mine range from 23 C to 80 C.
Yes this is exactly what I'm doing. I isolated the actual math question, and then sent it to my two servers to process and that's what's taking 10m+ to return. I'm asking them to solve the question and return the full answer along with their steps. I care about correctness so taking time is okay but I can't use 10m per solution.
If your goal is to say, write science fiction, their reversion to classic LLM-isms, is really distracting and is what makes people say from a glance that it was written by an LLM. You basically can't use them at the moment in any real "natural" long-form writing. Everyone will call "slop" pretty quickly on the current frontier models.
rosmine's DFT paper is worth a read.
The server is going to live in the garage, so I'm not that concerned with noise. But I had no idea what to expect when I flipped the switch for the first time. It sounds like something out of the Book of Revelation. No way, no how could something like this be used in an inhabited area.

In 2024 I quit my FAANG job to become an independent researcher. To do this I needed GPUs, so I built “grumbl”, a 6x 6000 Ada GPU server.
This blog describes the build, some of the issues I faced, and answers the question “was it worth it to build the server myself, or should I have rented cloud GPUs?”
(It’s called “grumbl” because apparently I cannot spell “GPUs”)

This rig cost me $48K. That sounds expensive, but it’s way less expensive than quitting my job. Because of the loss of income, if more powerful GPUs could help me make my work successful just 2 months earlier than a smaller machine would have, then buying a more powerful server would be worth it. So I decided to buy the most powerful server that I could run in my apartment.
I found Tim Dettmers’ guide to choosing a GPU helpful. From that I narrowed it down to A100’s, H100’s or RTX 6000 Ada. A100’s don’t support FP8 and have slower inference performance than the newer GPUs, and I’m going to be doing a lot of inference (RL), so narrowed it down to 6000 Ada vs H100. Looking at the price/throughput ratios of 6000 Ada vs H100 vs A100, I went with the 6000 Ada GPUs.
I live in an apartment and don’t have the option to upgrade my electrical circuits to support standard datacenter servers. 6 GPUs requires too much power for a single apartment circuit to handle, so I had to get 2 power supplies, and plug the power supplies into 2 outlets in separate circuits.
If you google “plugging a PC into multiple outlets”, you get lots of warnings that if you even consider such a setup you will instantly burst into flames. So I hired a professional PC builder to make sure it was safe. This is more expensive than doing everything myself, but it’s less expensive than doing something wrong and burning down my apartment.
Ironically, after designing the entire build around apartment power constraints, I ended up moving grumbl to my parents’ basement, where I could upgrade the circuits anyway.

Is it better to buy my own GPUs or should I have rented from a cloud provider? I decided to measure this by calculating how much I used the GPUs, and comparing that to how much it would’ve cost to rent equivalent compute in the cloud.
In 2024 I calculated that, at the then-current GPU rental rates, it would take me about a year at 85%+ utilization to match cloud rental costs. That should be easy to do, but for a full analysis, I also need to account for electricity and the fact that as more powerful GPUs become available, the cost to rent equivalent compute will decrease.
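A rough sketch of that break-even estimate, for anyone who wants to plug in their own numbers. The $1.00 per GPU-hour rate below is a hypothetical placeholder, not the rate I actually used, and the function name is made up for illustration. It ignores electricity and falling rental prices, which are handled separately.

```python
def breakeven_days(hardware_cost, n_gpus, hourly_rate, utilization):
    """Days of use needed before avoided rental fees equal the hardware cost.

    hourly_rate is the on-demand price per GPU-hour (an assumed input),
    utilization is the fraction of GPU-hours actually used (0.0 to 1.0).
    """
    # Rental fees avoided per day at the given utilization.
    daily_rental = n_gpus * 24 * utilization * hourly_rate
    return hardware_cost / daily_rental


# Hypothetical example: $48K build, 6 GPUs, $1.00/GPU-hour, 85% utilization.
days = breakeven_days(48_000, 6, 1.00, 0.85)
print(f"Break-even after about {days:.0f} days")
```

With those placeholder inputs the break-even lands a little over a year out; a lower rental rate or lower utilization pushes it further.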
To be thorough, I wrote a script that would log the usage of each gpu every minute. I also logged the power usage in watts so I could calculate how much I spent on electricity.
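This is not my actual script, but a minimal sketch of what such a logger can look like. It assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output; the file name `gpu_usage.csv` and the function names are placeholders.

```python
import csv
import subprocess
import time
from datetime import datetime, timezone

QUERY = "index,utilization.gpu,power.draw"


def parse_nvidia_smi(output):
    """Parse 'index, util, power' CSV rows into a list of dicts."""
    rows = []
    for line in output.strip().splitlines():
        idx, util, power = [field.strip() for field in line.split(",")]
        rows.append({"gpu": int(idx), "util_pct": float(util), "power_w": float(power)})
    return rows


def sample():
    """Take one reading per GPU (utilization in %, power draw in watts)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_nvidia_smi(out)


def log_forever(path="gpu_usage.csv", interval_s=60):
    """Append one row per GPU to a CSV file every interval_s seconds."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            now = datetime.now(timezone.utc).isoformat()
            for row in sample():
                writer.writerow([now, row["gpu"], row["util_pct"], row["power_w"]])
            f.flush()
            time.sleep(interval_s)

# To start logging once a minute: log_forever()
```

The sketch does not handle GPUs that report a non-numeric power reading; a real logger would want to skip or flag those rows.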
In this analysis, I only compared against on-demand pricing. There are also payment plans where you reserve the instance for 6-12 months, but those seemed not worth it to me, since they were only a little cheaper than buying the server itself, and this way I got to keep the gpus.

Using grumbl without a monitor is wasting its potential, since it has ports for up to 24 monitors. I could make my own mini vegas sphere
To measure GPU usage, for each GPU I counted the number of hours each day where I used that GPU at least once. This seemed a fair comparison against rental since I wouldn’t stop and restart a cloud server if it was only going to be idle for less than an hour.
This comparison is generous to cloud renting, because it assumes I could stop and start each GPU independently. Much of my idle time came when I was running multiple experiments in parallel and one finished or failed while the others kept going; I wouldn’t have stopped the server if I was renting.
Note: This is meant to be a measure of how much I use the gpus, not training efficiency, so a GPU with 10% utilization would still count as active for the hour. (My code would be equally inefficient running in the cloud)
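The counting rule can be sketched roughly like this. It assumes per-minute samples of `(timestamp, gpu index, utilization %)` and treats any utilization above zero as "used"; both the input format and that threshold are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime


def active_gpu_hours(samples, min_util=0.0):
    """Count, per day, the (GPU, hour) slots with at least one active sample.

    samples: iterable of (timestamp: datetime, gpu: int, util_pct: float).
    Returns a dict mapping date -> number of active GPU-hours.
    """
    active = set()
    for ts, gpu, util in samples:
        if util > min_util:
            # One hit anywhere in the hour marks the whole hour as used.
            active.add((ts.date(), ts.hour, gpu))

    per_day = defaultdict(int)
    for day, _hour, _gpu in active:
        per_day[day] += 1
    return dict(per_day)


samples = [
    (datetime(2025, 6, 1, 10, 0), 0, 10.0),   # GPU 0 busy at 10:00
    (datetime(2025, 6, 1, 10, 30), 0, 50.0),  # same hour, counted once
    (datetime(2025, 6, 1, 11, 5), 0, 0.0),    # idle sample, not counted
    (datetime(2025, 6, 1, 10, 15), 1, 5.0),   # GPU 1, low utilization still counts
]
print(active_gpu_hours(samples))
```

Dividing a day's count by 6 GPUs x 24 hours gives that day's utilization as a fraction.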
Here is the graph of use over time:

You can see 3 separate times the server was down for maintenance. This is quite stressful because you don’t know if the server isn’t booting because a single PCIe riser failed, or because something went catastrophically wrong and fried all the GPUs.
In June 2025 you can see a clear increase in usage. Before that I was doing smaller experiments where dev time was comparable to experiment time, so there was more downtime between experiments while I was implementing. After June 2025, I had a project that required more compute, so I always had most GPUs continuously running experiments, and only 1-2 GPUs for dev.
From the graph, the total average use was 76%. If you calculate since 1/1/25, utilization is 85%. I have to admit, I’m a little disappointed in that. I’m running experiments 24/7, and always have a queue of more experiments to run once they finish. I thought it would easily be 95+%
To calculate money saved, the first step is to use the rental price for each day, and multiply that by the number of GPU hours used for that day, and add it all up. I didn’t have historical provider API logs, so I estimated historical pricing from timestamped references online.
Based on the Wattage records that I had logged, I calculated the electricity cost to be ~$3000, or about $125 per month.
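For reference, turning per-minute wattage samples into a dollar figure looks roughly like this. The $0.15/kWh price is a hypothetical placeholder, not my actual rate, and the function names are made up for the sketch.

```python
def energy_kwh(power_samples_w, interval_s=60):
    """Approximate energy used, treating each sample as constant for interval_s."""
    # watts * seconds = joules; 3600 J per Wh, 1000 Wh per kWh.
    return sum(power_samples_w) * interval_s / 3600 / 1000


def electricity_cost(power_samples_w, price_per_kwh, interval_s=60):
    """Cost of the logged energy at a flat price per kWh (an assumed input)."""
    return energy_kwh(power_samples_w, interval_s) * price_per_kwh


# Hypothetical example: one hour of per-minute samples at a steady 1000 W.
hour_of_samples = [1000.0] * 60
print(energy_kwh(hour_of_samples))              # kWh used in that hour
print(electricity_cost(hour_of_samples, 0.15))  # dollars at the assumed rate
```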
Putting this all together, as of 3/13/26, I calculated rental fees for equivalent compute would have cost $68,000, so after the $48K build cost and ~$3,000 of electricity I have saved a total of $17,000 so far.
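The arithmetic, as a small sketch. The per-day dictionaries are a hypothetical data layout, not my actual records; the totals plugged in at the end are the rounded figures from this post.

```python
def rental_equivalent(daily_gpu_hours, daily_rate):
    """Sum over days of (GPU-hours used that day) x (rental price that day)."""
    return sum(hours * daily_rate[day] for day, hours in daily_gpu_hours.items())


def net_savings(rental_cost, hardware_cost, electricity_cost):
    """What renting would have cost, minus what owning actually cost."""
    return rental_cost - hardware_cost - electricity_cost


# Toy two-day example with made-up hours and rates.
toy_rental = rental_equivalent(
    {"2025-06-01": 100, "2025-06-02": 120},
    {"2025-06-01": 1.00, "2025-06-02": 0.50},
)
print(toy_rental)

# Rounded totals from the post: $68,000 rental, $48,000 build, ~$3,000 electricity.
print(net_savings(68_000, 48_000, 3_000))
```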
Now the GPUs have paid for themselves, and based on current market rate I’m saving $90-$105 every day after this.
The point of buying the server wasn’t to save money, it was to build something cool. I spent a long time trying high risk/high reward experiments and failing. But now I have something good. I’ve solved a major problem with LLMs. And I’m launching next Monday so we will soon see if it’s actually a breakthrough or just LLM psychosis 🙂
Questions? Comments? DM me on X or E-mail me at hello@rosmine.ai
Thanks to @algomancer for sponsoring this and other work