The Azure UI feels like a janky mess, barely held together. The documentation is obviously entirely written by AI and is constantly out of date or wrong. They offer such a huge volume of services that it's nearly impossible to figure out which service you actually want or need without consultants, and when you finally get the services up, who knows if they actually work as advertised.
I'm honestly shocked anything manages to stay working at all.
> Microsoft, meanwhile, conducted major layoffs—approximately 15,000 roles across waves in May and July 2025 —most likely to compensate for the immediate losses to CoreWeave ahead of the next earnings calls.
This is what people should know when seeing massive layoffs due to AI.
12 years ago I had to choose whether to specialize in AWS, GCP, or Azure, and from my very brief foray with Azure I could see it was an absolute mess of broken, slow, click-ops methodology. This article confirms my suspicions from that time, and my colleagues' experiences.
The second thing is that this series of blog posts (whether or not it's entirely true, it's believable) provides a good introduction to vibe coders. These are people who have not written a single line of code themselves and have not worked on any system at scale, yet believe that coding is somehow magically "solved" thanks to LLMs.
Writing the actual code itself (fully or partially)? Maybe, yes. But understanding the complexity of the system, and working within the organisational structures that support it, is a completely different ball game.
And I've worked at other places that had problems similar to the core problems described here, not quite as severe, and not at the same scale, but bad enough to doom them (IMO) to a death loop they won't recover from.
> On January 7, 2025… I sent a more concise executive summary to the CEO. … When those communications produced no acknowledgment, I took the customary step of writing to the Board through the corporate secretary.
Why is that customary? I have not come across it, and though I have seen situations of some concern in the past, I previously had little experience with US corporate norms. What is normal here for such a level of concern?
Moreover, why is this a public post and not a court case for wrongful termination?
Is Azure really this unreliable? There are concrete numbers in this blog. For those who use Azure, does it match your experience from the outside?
> In that context, hosting a web service that is directly reachable from any guest VM and running it on the secure host side created a significantly larger attack surface than I expected.
That is quite scary
From another former Az eng, now elsewhere and still working on big systems: the post gets way, way more boring when you realize that things like "Principal Group Manager" is just an M2, and Principal in general is an L6 (maybe even L5) Google equivalent. Similarly, Sev2 is hardly notable for anyone actually working on the foundational infra. There are certainly problems in Azure, but it's huge and rough edges are to be expected. It mostly marches on. IMO maturity is realizing this and working within the system to improve it rather than trying to air all the dirty laundry to an Internet audience that will undoubtedly lap it up and happily cry Microslop.
Last thing: the final part 6 comes off as really childish. Risks to national security and letters to the board, really? Azure is apparently still chugging along despite everything being mentioned. People come in all the time crying that everything is broken and needs to be scrapped and rewritten, but it's hardly ever true.
It didn’t get any better.
Who are the customers? Who is buying this shit?
and
"I also see I have 2 instances of Outlook, and neither of those are working." -Artemis II astronaut
> Worse, early prototypes already pulled in nearly a thousand third-party Rust crates, many of which were transitive dependencies and largely unvetted, posing potential supply-chain risks.
Rust is really going for the Node ecosystem's crown in dependency-count bloat.
Google’s Cloud feels like the best engineered one, though lack of proper human support is worrying there compared to AWS.
Microsoft should have promoted this guy instead of laying him off.
Did Microsoft really lose OpenAI as a customer?
IME, yes.
I'm currently working as an SRE supporting a large environment across AWS, Azure, and GCP. In terms of issues or incidents we deal with that are directly caused by cloud provider problems, I'd estimate that 80-90% come from Azure. And we're _really_ not doing anything that complicated in terms of cloud infrastructure; just VMs, load balancers, some blob storage, some k8s clusters.
Stuff on Azure just breaks constantly, and when it does break it's very obvious that Azure:
1. Does not know when they're having problems (it can take weeks/months for Azure to admit they had an outage that impacted us)
2. Does not know why they had problems (RCAs we're given are basically just "something broke")
3. Does not care that they had problems
Everyone I work with who interacts with Azure at all absolutely loathes it.
This story is more interesting, in my opinion, in how quickly things devolved and also how unwilling the more senior layers of the org were to address it. At a whole company level, the rot really sets in when you start to lose the key people that built and know the system. That seems to be what’s happening here, and it does not bode well for MS in the medium term.
I'm really struck that they have such junior people in charge of key systems like that.
I guessed that from the title on the main hn page. Glad to see it confirmed.
Or… you’ve just normalised the deviation.
One of the few reliable barometers of an organisation (or their products) is the wtf/day exclaimed by new hires.
After about three or four weeks everyone adapts, learns what they can and can’t criticise without fallout, and settles into the mud to wallow with everyone else that has become accustomed to the filth.
As an Azure user I can tell you that it’s blindingly obvious even from the outside that the engineering quality is rock bottom. Throwing features over the fence as fast as possible to catch up to AWS was clearly the only priority for over a decade and has resulted in a giant ball of mud that now they can’t change because published APIs and offered products must continue to have support for years. Those rushed decisions have painted Azure into a corner.
You may puff your chest out, and even take legitimate pride in building the second largest public cloud in the world, but please don’t fool yourself that the quality of this edifice is anything other than rickety and falling apart at the seams.
Remind me: can I use IPv6 safely yet? Does it still break Postgres in other networks? Can azcopy actually move files yet, like every other bulk copy tool ever made by man? Can I upgrade a VM in-place to a new SKU without deleting and recreating it to work around your internal Hyper-V cluster API limitations? Premium SSDv2 disks for boot disks… when? Etc…
You may list excuses for these quality gaps, but these kinds of things just weren’t an issue anywhere else I’ve worked as far back as twenty years ago! Heck, I built a natively “all IPv6” VMware ESXi cluster over a decade ago!
Start with tokio. Please vend one batteries-included dependency, and vendor in/internalize everything, thanks.
Microsoft is the go to solution for every government agency, FEDRAMP / CMMC environments, etc.
> People come in all the time crying that everything is broken and needs to be scrapped and rewritten but it's hardly ever true.
This I'm more sympathetic to. I really don't think his approach of "here's what a rewrite would look like" was ever going to work and it makes me think that there's another side to this story. Thinking that the solution is a full reset is not necessarily wrong but it's a bit of a red flag.
Really. Apparently the Secretary of War agrees with him.
Also, after this:
https://news.ycombinator.com/item?id=20341022
You continued to work at Microsoft and now there is this takedown?
I'm no friend of MS (to put it very mildly), but it seems to me your story is a bit inconsistent, as is the 7-year break between postings.
But in this case the SECWAR has been properly advised. If anything it's astonishing that a program whereby China-based Microsoft engineers telling U.S.-based Microsoft engineers specific commands to type in ever made it off the proposal page inside Microsoft, accelerated time-to-market or not.
It defeats the entire purpose of many of the NIST security controls that demand things like U.S.-cleared personnel for government networks, and Microsoft knew those were a thing because that was the whole point to the "digital escort" (a U.S. person who was supposed to vet the Chinese engineer's technical work despite apparently being not technical enough to have just done it themselves).
Some ideas "sell themselves", ideas like these do the opposite.
There is definitely manual access of data - it requires what was termed “break glass” similar to the JIT mechanism described by the author. However, it wasn’t quite so loose; there were eventually a lot of restrictions on who could approve what, what access you got after approval, and how that was audited.
It was difficult to get into the highest sensitivity data; humans reviewed your request and would reject it without a clear reason. And you could be 100% sure humans would review your session afterwards to look for bad behavior.
I once had to compile a large list of IP addresses that accessed a particular piece of data to fulfill a court order. It took me days of effort to get and maintain the elevated access necessary to do this.
I have a lot of respect for GCP as an engineering artifact, but a significantly less rosy opinion of GCP as an organization and bureaucratic entity. The amount of wasted effort expended on engaging with and navigating the bureaucracy is truly mind-boggling, and is the reason why a tiny feature that took a day to code could take months to release.
"The practical strategy I suggested was incremental improvement... This strategy goes a long way toward modernizing a running system with minimal disruption and offers gradual, consistent improvements. It uses small, reliable components that can be easily tested separately and solidified before integration into the main platform at scale." [1]
[1] https://isolveproblems.substack.com/p/how-microsoft-vaporize...
When you submit a link to HN, there is an entry field for a comment.
It does not really describe what the comment is used for. For links, it simply gets added as the first comment.
Someone who is unfamiliar with the submission process may assume this comment should describe what they are submitting, and not format it like a regular user comment.
Then it gets posted as the first comment and tons of people downvote it, jumping to the conclusion that the weird summary comment is from an AI, and not the submitter describing their own submission.
This is the first of a series of articles in which you will learn about what may be one of the silliest, most preventable, and most costly mishaps of the 21st century, where Microsoft all but lost OpenAI, its largest customer, and the trust of the US government.
I joined Azure Core on the dull Monday morning of May 1st, 2023, as a senior member of the Overlake R&D team, the folks behind the Azure Boost offload card and network accelerator.
I wasn’t new to Azure, having run what is likely the longest-running production subscription of this cloud service, which launched in February 2010 as Windows Azure.
I wasn’t new to Microsoft either. I had been part of the Windows team since 1/1/2013, later helped migrate SharePoint Online to Azure, and then joined the Core OS team as a kernel engineer. There I helped improve the kernel and helped invent and deliver the container platform that supports Docker, Azure Kubernetes, Azure Container Instances, Azure App Services, and Windows Sandbox, all shipping technologies that resulted in multiple granted patents.
Furthermore, I contributed to brainstorming the early Overlake cards in 2020-2021, drafting a proposal for a Host OS <-> Accelerator Card communication protocol and network stack, when all we had was a debugger’s serial connection. I also served as a Core OS specialist, helping Azure Core engineers diagnose deep OS issues.
I rejoined in 2023 as an Azure expert on day one, having contributed to the development of some of the technologies on which Azure relies and having used the platform for more than a decade, both outside and inside Microsoft at a global scale.
As a returning employee, I skipped the New Employee Orientation and had my Global Security invite for 12 noon to pick up my badge, but my future manager asked if I could come in earlier, as the team had their monthly planning meeting that morning.
I, of course, agreed and arrived a few minutes before 10 am at the entrance of the Studio X building, not far from The Commons on the West Campus in Redmond. A man showed up in the lobby and opened the door for me. I followed him to a meeting room through a labyrinth of corridors.
The room was chock-full, with more people on a live conference call. The dev manager, the leads, the architects, the principal and senior engineers shared the space with what appeared to be new hires and junior personnel.
The screen projected a slide where I recognized a number of familiar acronyms, like COM, WMI, perf counters, VHDX, NTFS, ETW, and a dozen others, mixed with new Azure-related ones, in an imbroglio of boxes linked by arrows.
I sat quietly at the back while a man was walking the room through a big porting plan of their current stack to the Overlake accelerator. As I listened, it was not immediately clear what that series of boxes with Windows user-mode and kernel components had to do with that plan.
After a few minutes, I risked a question: Are you planning to port those Windows features to Overlake? The answer was yes, or at least they were looking into it. The dev manager showed some doubt, and the man replied that they could at least “ask a couple of junior devs to look into it.”
The room remained silent for an instant. I had seen the hardware specs for the SoC on the Overlake card in my previous tenure: the RAM capacity and the power budget, which was just a tiny fraction of the TDP you can expect from a regular server CPU.
The hardware folks I had spoken with told me they could only spare 4KB of dual-ported memory on the FPGA for my doorbell shared-memory communication protocol.
Everything was nimble, efficient, and power-savvy, and the team I had joined 10 minutes earlier was seriously considering porting half of Windows to that tiny, fanless, Linux-running chip the size of a fingernail.
That felt like Elon talking about colonizing Mars: just nuke the poles, then grow an atmosphere! Easier said than done, huh?
That entire 122-strong org was knee-deep in impossible ruminations involving porting Windows to Linux to support their existing VM management agents.
The man was a Principal Group Engineering Manager overseeing a chunk of the software running on each Azure node; his boss, a Partner Engineering Manager, was in the room with us, and they really contemplated porting Windows to Linux to support their current software.
At first, I questioned my understanding. Was that serious? The rest of the talk left no doubt: the plan was outlined, and the dev leads were tasked with contributing people to the effort. It was immediately clear to me that this plan would never succeed and that the org needed a lot of help.
That first hour in the new role left me with a mix of strange feelings, stupefaction, and incredulity.
I later learned the stack was hitting its scaling limits on a 400-watt Xeon at just a few dozen VMs per node, a far cry from the 1,024-VM limit I knew the hypervisor was capable of. It was also a noisy neighbor, consuming so many resources that it caused jitter observable from the customer VMs.
There is no dimension in the universe where this stack would fit on a tiny ARM SoC and scale up by many factors. It was not going to happen.
I have seen a lot in my decades of industry (and Microsoft) experience, but I had never seen an organization so far from reality. My day-one problem was therefore not to ramp up on new technology, but rather to convince an entire org, up to my skip-skip-level, that they were on a death march.
Deep down, I knew it was going to be a fierce uphill battle. As you will later learn, it didn’t go well.
I spent the next few days reading more about the plans, studying the current systems, and visiting old friends in Core OS, my alma mater. I was lost away from home in a bizarre territory where people made plans that didn’t make sense with the aplomb of a drunk LLM.
I notably spent more than 90 minutes chatting in person with the head of the Linux System Group, a solid scholar with a PhD from INRIA, who was among the folks who hired me on the kernel team years earlier.
His org is responsible for delivering Mariner Linux (now Azure Linux) and the trimmed-down distro running on the Overlake / Azure Boost card. He kindly answered all my questions, and I learned that they had identified 173 agents (one hundred seventy-three) as candidates for porting to Overlake.
I later researched this further and found that no one at Microsoft, not a single soul, could articulate why up to 173 agents were needed to manage an Azure node, what they all did, how they interacted with one another, what their feature set was, or even why they existed in the first place.
Azure sells VMs, networking, and storage at the core. Add observability and servicing, and you should be good. Everything else, SQL, K8s, AI workloads, and whatnot all build on VMs with xPU, networking, and storage, and the heavy lifting to make the magic happen is done by the good Core OS folks and the hypervisor.
How the Azure folks came up with 173 agents will probably remain a mystery, but it takes a serious amount of misunderstanding to get there, and this is also how disasters are built.
Now, fathom for a second that this pile of uncontrolled “stuff” is orchestrating the VMs running Anthropic’s Claude, what’s left of OpenAI’s APIs on Azure, SharePoint Online, the government clouds and other mission-critical infrastructure, and you’ll be close to understanding how a grain of sand in that fragile pileup can cause a global collapse, with serious National Security implications as well as potential business-ending consequences for Microsoft.
We are still far from the vaporized trillion in market cap, my letters to the CEO, to the Microsoft Board of Directors, and to the Cloud + AI EVP and their total silence, the quasi-loss of OpenAI, the breach of trust with the US government as publicly stated by the Secretary of Defense, the wasted engineering efforts, the Rust mandate, my stint on the OpenAI bare-metal team in Azure Core, the escort sessions from China and elsewhere, and the delayed features publicly implied as shipping since 2023, before the work even began.
If you’re running production workloads on Azure or relying on it for mission-critical systems, this story matters more than you think.