> GitHub Will Prioritize Migrating to Azure Over Feature Development - GitHub is working on migrating all of its infrastructure to Azure, even though this means it'll have to delay some feature development.
> In a message to GitHub’s staff, CTO Vladimir Fedorov notes that GitHub is constrained on capacity in its Virginia data center. “It’s existential for us to keep up with the demands of AI and Copilot, which are changing how people use GitHub,” he writes.
https://thenewstack.io/github-will-prioritize-migrating-to-a...
So the already-delayed feature development is now going to be delayed even further, yet almost every week we see new features and changes; just the other day the single-issue view was changed, as just one example. And it was "existential" 6 months ago, yet they keep stumbling over the exact same issue today?
Even if they're focused exclusively on reliability and uptime, we get the experience we have today. It's kind of incredible how a company with the resources of Microsoft seemingly can't stop continuously shooting itself in the foot. It's kind of impressive, actually. As icing on the cake, they've decided to buy up all the popular developer services and then migrate them all to the same platform, another great idea.
GitHub instability has started way before that. I understand it’s too much to ask of a trillion-dollar corporation to consider the impact of their own actions, but perhaps they should’ve thought of that before forcing LLM development down everyone’s throats.
On another note: is the exponential growth from 'agentic' workflows actually resulting in productive software in the wild, or is it just noise? On my end I haven't seen the software I use getting better.
Wild
Stop subsidizing tokens now that we've extracted enough training data from you and have enough agentic-junkie business to keep the flywheel going, and cut the loss leaders. [0]
> availability first, then capacity, then new features.
I'd love to experience first-hand a leadership team who says, "stop accepting new paying customers until we've got availability sorted out!"
That's a delayed April Fools' joke, right?
In seriousness, looking at their scale, this is an insane engineering challenge.
Especially if they’re moving databases, not easy ever, and certainly not at that scale
Leopard, meet face.
Too little too late, yesterday was the straw that broke the camel’s back for us and we’ve started a migration to a self-hosted GitLab.
Looking at the commit graph: why do commits have big steps followed by slow rolloffs? Why do the steps not happen at uniform points? Why do larger steps sometimes have a shallower slope than smaller steps, but not all the time?
Then looking at the other graphs, there are completely different effects going on.
I think I found the issue.
The unlabelled graph with big numbers on top, the priorities that don't match with what we're experiencing, and a list of things that they're doing without a real acknowledgement of the _dire_ uptime over the last 12 months....
I understand the rapid growth (because of AI agents), but if such a critical software service becomes unstable, then it's time to migrate. Thinking about self-hosting GitLab.
GitHub is claiming they require 30x scale due to the giant increase in repository creation, PRs, commits, etc.
I have not seen a single product increase in features or quality as an end user, nor have any significant new products come out in this period (other than the LLMs themselves).
Where is all this code going?
Is this Microsoft stating that they aren't able to get acceptable reliability from Azure? (I mean, I think a lot of us have heard that, but it's interesting to hear it from Microsoft themselves.)
Status page is also still doing that thing where every component is green but in practice clone is hanging, push is timing out, actions are stuck. Per-service uptime is a managed number. The user-experience number is the one that matters and it's not in the post-mortem.
The unlabeled graphs don't help the credibility case. When you are already in the hole on trust, shipping a post that requires readers to assume favorable baselines is exactly the wrong move.
Got a good chuckle out of this post; it's crazy that neither Atlassian (Bitbucket) nor GitLab is capturing value from this same agentic coding boom. I wish GitHub were separately publicly traded outside of Microsoft.
Nowhere to get exposure to this
Amazing on one hand, quite scary on the other for GitHub and all other forges if this continues, and there is no reason why it wouldn't.
I feel like this would have negative impacts (lots of interesting historical archives on Github) but maybe if a project hasn't been touched, or cloned, in some time, it just gets deleted with some notice.
> New sign-ups for GitHub Copilot Pro, Pro+, and Student plans are paused. Pausing sign-ups allows us to serve existing customers more effectively.
They did that as a panic mode hack to mitigate performance: https://news.ycombinator.com/item?id=47912521
Global indices for this should be trivial to spin up so availability is never a concern (we're working towards this!).
* we had to resolve a variety of bottlenecks that appeared faster than expected, from moving webhooks to a different backend (out of MySQL)
* redesigning the user session cache to redoing authentication and authorization flows to substantially reduce database load
* we accelerated parts of migrating performance- or scale-sensitive code out of the Ruby monolith into Go
I'd like to know what database backend they migrated to. I was also surprised to read that the migration from Ruby to a more performant language had not already been completed. I assume this is because it is a large code base with many moving parts, etc.
are there big conceptual serialisations that I've missed? is it just not well factored? was the move to Azure just a catastrophically bad idea? some other thing?
If you multiply all current numbers together (as of Apr 28), you find out that GitHub has a 97.26% uptime.
One ... single ... 9.
They can do better.
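The arithmetic behind that figure is just multiplying per-component availabilities, since a request that touches every component succeeds only when all of them are up. A quick sketch with made-up component numbers (the real per-service figures are on GitHub's status page):

```python
# Hypothetical per-component uptimes; a request touching all of them
# succeeds only if every component is up, so availabilities multiply.
components = [0.9990, 0.9950, 0.9920, 0.9970, 0.9900]

combined = 1.0
for u in components:
    combined *= u
print(f"combined availability: {combined:.2%}")  # roughly 97.3%

# What a 97.26% overall figure implies in wall-clock terms:
down = 1 - 0.9726
print(f"~{down * 24 * 60:.0f} minutes of downtime per day")
print(f"~{down * 365:.1f} days of downtime per year")
```

Five components at three-nines-or-so each is all it takes to land at a single nine overall, which is why the per-component greens on the status page can coexist with a rough user experience.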
and that Azure cannot scale fast enough to handle the load, so they're embracing multi-cloud as a company... owned by Microsoft?
Whoa. What am I reading?
Since yesterday, me and several colleagues noticed that the pull request lists on the website are incomplete, across many repositories. For example, on https://github.com/gap-system/gap/pulls it says "Pull requests 78" in the "tab list", but the PR list view reports "35 open" (the number 78 is correct, and confirmed by e.g. `gh pr list`)
And that despite <https://www.githubstatus.com> reporting "all systems operational".
Right way to think about this:
> If things we need/see as critical for our work are hosted on a platform with really bad reliability, it's time for us to migrate
My internet connection at home is really shit, and almost every week there is a multi-hour downtime for some reason, not to mention that when La Liga games are on TV, anything using Cloudflare is unavailable, so I've had to spend extra energy and time to set things up in a way that I can still work whenever this happens.
What I’m not seeing here, but am seeing with the Linux kernel, is that most of the automatically submitted code is irrelevant or not useful.
(Maybe that’s what you were getting at, apologies)
They started the trend with Copilot.
> If they weren't letting folks use it directly
There is a chasm of difference between “letting you use it” and “forcing it down your throat”. Microsoft is doing the latter, not the former. Copilot is annoyingly present by default at every step on GitHub.
What's the question here: that you don't believe growth is currently exponential, or that you think it shouldn't be hard to scale when 10x YoY is not enough?
You don't need to know the bottom left axis number. We do have to assume the graph is linear, and not some kind of negative exponent log graph. But given the rest of the content, I think that is safe to assume.
Any company that experiences significantly more growth than they were planning for will have capacity issues.
The priorities are mostly in line with that. They are way beyond the point where they can just add more hardware; they need to make the backend more efficient, and all the stated goals are about helping there.
Half of my friends are vibe-coding something, but they can barely get the rest of the group chat to use it once.
In companies, I see people vibe-coding "miracle apps" that fall apart under the smallest amount of scrutiny.
Basically people are doing the same developers do when they say "I can do this in a weekend", which is getting a prototype sort of running and then immediately losing energy (or in this case lacking ability) to push it forward.
If I could get the same bells and whistles by wiring up another forge, so long as it offered a decent API and/or sent events over a webhook, I'd have everything self-hosted.
The agents would need to expose an interface on their own end, but as long as you implemented it with a plugin, it'd remove the dependency on GitHub, and you could use MCP or skills for the rest of it.
Even as recently as 18 months ago, Lovable appeared, seemingly overnight, and caused huge problems for GitHub because they were creating repositories on GitHub for every single Lovable project, offloading the very high cost onto GitHub, hundreds of thousands of repositories. A couple of years before that, Homebrew used GitHub as a de facto CDN and that was a huge problem, too.
Nowadays it is easy to imagine how we can scale out a service like Twitter or YouTube or Facebook because everything has been done before, but that's not true of Git. Git has never scaled like this before; there are very few examples of services with GitHub's characteristics.
I was thinking of maybe doing a proper write up about how to host your own Forgejo + Action runners on Linux, Windows and macOS, not sure if there is enough interest. What would people for sure want to know in a guide/explanation of this?
> you find out that GitHub has a 97.26% uptime
Converting that to downtime per day, you get ~40 minutes of downtime per day, about 10 days per year. Crazy stuff for something as essential as this.
The user (and not a big tech monopoly) answer to scaling issues is almost always to stop scaling and start federating and interoperating.
The only repos I left on GitHub are forks and one with a bit of public engagement.
No, they're completely useless. Using the "New repos per month" as an example, if the bottom left is 1m, then that's a 20x increase in 2 years which is a lot. If the bottom left is 19m, it's a 5% increase in 2 years which is nothing.
The massive surge on their labelled X axis starts in 2026, and these issues have been going on for a lot longer than that. GHA has been borderline unusable for a year at this point, if not longer.
> But given the rest of the content, I think that is safe to assume.
The rest of the content is "we're working on it", and "here's two outages in the last 14 days, one of which caused actual data loss"
> To summarize, for every v1 diff line there would be:
> - Minimum of 10-15 DOM tree elements
> - Minimum of 8-13 React Components
> - Minimum of 20 React Event Handlers
> - Lots of small re-usable React Components
https://github.blog/engineering/architecture-optimization/th...
I recently migrated to codeberg because I'm okay with self-hosting big runners, while using codeberg's available runners for smaller cron-based things (they even have lazy runners for this).
We very much do. The graph suggests an insane growth in PRs from almost zero to 90M. Now compare this misleading graph with this much clearer one, which shows that the growth over the last three years has been less than 80%: https://github.blog/wp-content/uploads/2025/10/octoverse-202...
> What's the question here, you don't believe growth is currently exponential, or do you think it shouldn't be hard to scale
I think you're putting words in my mouth here; I didn't say either of those things. I'm saying that this blog post is a meaningless platitude when the github stability issues predate this, and that all this post says is "we hear you're having issues".
Surely a scaling hack where they use "estimation" queries that return "kind of right" results instead of 100% correct data, as it's less load on the infrastructure. Not necessarily a bug so much as a shit choice from a product perspective.
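To illustrate the idea (this is a guess at the mechanism, not anything GitHub has confirmed): a tab badge could come from a cheap sampled or cached estimate while the list view does an exact filtered count, and the two can disagree:

```python
import random

random.seed(42)
# 100k fake PRs, roughly 40% of them open
prs = [random.random() < 0.4 for _ in range(100_000)]

exact = sum(prs)  # what an exact filtered query would return

# Cheap stand-in for an "estimated" count: extrapolate from a 1% sample
sample = random.sample(prs, 1_000)
estimate = round(sum(sample) / len(sample) * len(prs))

print(f"exact: {exact}, estimate: {estimate}")
```

The estimate is close in aggregate but rarely equal, which is fine for a dashboard and jarring when two numbers on the same page disagree.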
Some people I know can't even explain what they are trying to create.
I’m sure they’re experiencing scaling issues across the platform, but it’s unacceptable for that to have a negative impact on us when we're sending them $250/dev/yr for (what is in all honesty) hosting a bunch of static text files.
There's no intrinsic reason they should be vulnerable to themselves.
Which is to say, this is perfect for agents given they don't need any bespoke SDK from us: simply write Tangled records for issues, pulls, whatever to your PDS and it'll show up on Tangled. We plan to start working on some first-party exemplar agents that would 1. enhance Tangled itself, and 2. showcase cool things you can do with an open data firehose.
Disclaimer: the author is a colleague of mine
Though to be fair, what the parent meant by federated forges is different than this approach.
[0] https://docs.codeberg.org/getting-started/faq/#how-about-pri...
Sorry, but I don't think there is any way this can be classified as "not actually a bug"
You know, you can just host your own code forge. Or you can just drop gitolite on a server. Or pull directly from each others' dev machines on a LAN.
GitHub is not git.
But if anything, their post and your reply are precisely an endorsement of usage based billing.
The bit that's growing 13x YoY (and which they expect will easily blow past that) is unmetered - commits. The bit that is metered (for some, not all folks) - action minutes, grew only 2x YoY.
GitHub was not built to limit the number of commits, checkouts, forks, issues, PRs, etc. - nor do we want them to - but that's what's growing ridiculously as people unleash hordes of busy-beaver agents on GitHub, because they're either free or unlimited.
Where there are limits - or usage based billing - people add guardrails and find optimizations.
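Those guardrails are usually nothing fancier than client-side rate limiting. A minimal token-bucket sketch (the rate and burst numbers are arbitrary) that an agent harness could wrap around its API calls:

```python
import time

class TokenBucket:
    """Cap bursty agent traffic to a steady sustained rate."""

    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s
        self.burst = float(burst)
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should back off, not retry immediately

# e.g. allow bursts of 10 calls, 5 calls/s sustained (made-up numbers)
bucket = TokenBucket(rate_per_s=5, burst=10)
```

An agent loop that checks `bucket.allow()` before each API call turns an unbounded flood into a predictable load, which is exactly the optimization that metering nudges people toward.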
Because for all the talk, agents don't bring a 10x value increase; otherwise, they'd justify a 10x cost increase.
Besides, other forges are having issues too. Even running your own. We have Anubis everywhere protecting them for a reason.
so start a GitHub competitor which bills $50/dev/yr for solving this easy problem and make a lot of money?
But GitHub doesn't have that rationale.
Then it's up to Azure how they will manage this
I guess most people at GitHub knew exactly that it made no sense, but they didn't really have a choice. Maybe some voiced their concerns, got "we hear you" in response, and were told to proceed anyway.
Prime video does use some AWS services, but live and on-demand are two entirely different beasts.
I just think their charts, taken at face value, show substantially the same thing (for PRs, commits, new repos).
Either those charts are a bald-faced lie (the tweet could be as well) or there is no way for that chart to be something else.
The only way to fake exponential growth like that would be to use an inverse log scale (which would be a bald-faced lie).
It doesn't even really matter what's the y-axis baseline, unless we really think growth was huge in 2020, then cratered to zero by 2023, now back to the previous normal.
As for the rest of the post, I do think it's panic mode platitudes. But I honestly don't know what I'd write instead that's better.
You can already see people complaining loudly where they instead of "we'll do better" decided to limit usage.
I'd say we have emails, mailing lists and bug trackers. Or maybe: what is the missing killer feature that needs federation?
https://stackoverflow.com/questions/849308/how-can-i-pull-pu...
Microsoft Execs: Everyone needs to move to Azure!
GitHub developers: But Azure is not gonna be able to handle our load, we literally have our own data centers!
Microsoft Execs: Sure, but you're Microsoft now, please publish blog post about how in half a year you'll be 100% on Azure.
A few months later...
GitHub Developer: We've tried our best, users are leaving in droves and Azure can't keep up!
Microsoft Execs: Ok fine, you can use something else too, but only if you mainly use Azure and continue publishing blog posts about how great Azure is.
Issues, pull requests, collaboration/permissions/access, "starring"/"favoriting", etc.
I think ultimately the goal is that people can run their own forges, yet still collaborate on repositories hosted in other forges, leveraging your existing authentication so you no longer need to sign up individually for each forge.
> I just think their charts, taken at face value, show substantially the same thing (for PRs, commits, new repos).
The problem is that these charts show the massive exponential growth starting in 2026. But this didn't start in 2026; this has been going on since early last year. My team had more build failures in 2025 due to Actions outages or "degraded performance" than _any other reason_, and that includes PRs that failed linting or tests that developers were working on.
> As for the rest of the post, I do think it's panic mode platitudes. But I honestly don't know what I'd write instead that's better.
IMO, this needed to be written 6 months ago (around the time the memo about them prioritising the migration to Azure was released), and then this post should have been "We're still struggling, this isn't good enough. Here's the amount of growth, here's what we've done to try and fix it, and here's what we're planning over the next 3-6 months", instead of "Our priorities are clear: availability first, then capacity, then new features" and "We are committed to improving availability, increasing resilience, scaling for the future of software development, and communicating more transparently along the way." This isn't transparency (yet).
I wanted to give an update on GitHub’s availability in light of two recent incidents. Both of those incidents are not acceptable, and we are sorry for the impact they had on you. I wanted to share some details on them, as well as explain what we’ve done and what we’re doing to improve our reliability.
We started executing our plan to increase GitHub’s capacity by 10X in October 2025 with a goal of substantially improving reliability and failover. By February 2026, it was clear that we needed to design for a future that requires 30X today’s scale.
The main driver is a rapid change in how software is being built. Since the second half of December 2025, agentic development workflows have accelerated sharply. By nearly every measure, the direction is already clear: repository creation, pull request activity, API usage, automation, and large-repository workloads are all growing quickly.

This exponential growth does not stress one system at a time. A pull request can touch Git storage, mergeability checks, branch protection, GitHub Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. At high scale, small inefficiencies compound: queues deepen, cache misses become database load, indexes fall behind, retries amplify traffic, and one slow dependency can affect several product experiences.
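To make the retry-amplification point concrete: if each layer in a call chain independently retries a failing dependency, the attempts multiply per layer, so the slowest subsystem gets hammered hardest exactly when it can least afford it. An illustrative calculation with made-up depths and retry counts:

```python
def worst_case_attempts(tries_per_hop: int, depth: int) -> int:
    """Each layer makes `tries_per_hop` attempts, and each attempt
    drives `tries_per_hop` attempts at the layer below, and so on."""
    return tries_per_hop ** depth

# One user request through a 3-deep chain where every hop tries 3 times:
print(worst_case_attempts(3, 3))  # 27 hits on the bottom dependency

for tries in (1, 2, 3, 4):
    print(tries, "tries/hop ->", worst_case_attempts(tries, 3), "attempts")
```

This is why retry budgets and exponential backoff with jitter matter at this scale: naive retries convert a partial slowdown into a traffic multiplier.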
Our priorities are clear: availability first, then capacity, then new features. We are reducing unnecessary work, improving caching, isolating critical services, removing single points of failure, and moving performance-sensitive paths into systems designed for these workloads. This is distributed systems work: reducing hidden coupling, limiting blast radius, and making GitHub degrade gracefully when one subsystem is under pressure. We’re making progress quickly, but these incidents are examples of where there’s still work to do.
Short term, we had to resolve a variety of bottlenecks that appeared faster than expected: from moving webhooks to a different backend (out of MySQL) and redesigning the user session cache, to redoing authentication and authorization flows to substantially reduce database load. We also leveraged our migration to Azure to stand up a lot more compute.
Next we focused on isolating critical services like Git and GitHub Actions from other workloads, and on minimizing the blast radius by removing single points of failure. This work started with careful analysis of dependencies and the different tiers of traffic to understand what needs to be pulled apart and how we can minimize the impact on legitimate traffic from various attacks. Then we addressed those in order of risk. Similarly, we accelerated parts of migrating performance- or scale-sensitive code out of the Ruby monolith into Go.
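The isolation work described here is essentially the bulkhead pattern: cap how much capacity any one dependency can consume so a slow subsystem fails fast instead of tying up every worker serving unrelated traffic. A minimal illustrative sketch, not GitHub's actual implementation:

```python
import threading

class Bulkhead:
    """Limit concurrent calls into one dependency so a slow or failing
    subsystem can't absorb every worker in the process."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        if not self._slots.acquire(blocking=False):
            # Fail fast: shed load for this dependency, keep serving others
            raise RuntimeError("dependency at capacity")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()

# e.g. cap concurrent search-backend calls (number is arbitrary)
search_bulkhead = Bulkhead(max_concurrent=50)
```

With a bulkhead per dependency, a hung search cluster produces fast, explicit errors in the search-backed corners of the UI while Git operations keep their workers.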
While we were already in the process of migrating out of our smaller custom data centers into the public cloud, we started working on a path to multi-cloud. This longer-term measure is necessary to achieve the level of resilience, low latency, and flexibility that will be needed in the future.
The number of repositories on GitHub is growing faster than ever, but a much harder scaling challenge is the rise of large monorepos. For the last three months, we’ve been investing heavily in response to this trend, both within the Git system and in the pull request experience.
We will have a separate blog post soon describing extensive work we’ve done and the new upcoming API design for greater efficiency and scale. As part of this work, we have invested in optimizing merge queue operations, since that is key for repos that have many thousands of pull requests a day.
The two recent incidents were different in cause and impact, but both reflect why we are increasing our focus on availability, isolation, and blast-radius reduction.
On April 23, pull requests experienced a regression affecting merge queue operations.
Pull requests merged through merge queue using the squash merge method produced incorrect merge commits when a merge group contained more than one pull request. In affected cases, changes from previously merged pull requests and prior commits were inadvertently reverted by subsequent merges.
During the impact window, 230 repositories and 2,092 pull requests were affected. We initially shared slightly higher numbers because our first assessment was intentionally conservative. The issue did not affect pull requests merged outside merge queue, nor did it affect merge queue groups using merge or rebase methods.
There was no data loss: all commits remained stored in Git. However, the state of affected default branches was incorrect, and we could not safely repair every repository automatically. More details are available in the incident root cause analysis.
This incident exposed multiple process failures, and we are changing those processes to prevent this class of issue from recurring.
On April 27, an incident affected our Elasticsearch subsystem, which powers several search-backed experiences across GitHub, including parts of pull requests, issues, and projects.
We are still completing the root cause analysis and will publish it shortly. What we know now is that the cluster became overloaded (likely due to a botnet attack) and stopped returning search results. There was no data loss, and Git operations and APIs were not impacted. However, parts of the UI that depended on search showed no results, which caused a significant disruption.
This is one of the systems we had not yet fully isolated to eliminate as a single point of failure, because other areas had been higher in our risk-prioritized reliability work. That impact is unacceptable, and we are using the same dependency and blast-radius analysis described above to reduce the likelihood and impact of this type of failure in the future.
We have also heard clear feedback that customers need greater transparency during incidents.
We recently updated the GitHub status page to include availability numbers. We have also committed to statusing incidents both large and small, so you do not have to guess whether an issue is on your side or ours.
We are continuing to improve how we categorize incidents so that the scale and scope are easier to understand. We are also working on better ways for customers to report incidents and share signals with us during disruptions.
GitHub’s role has always been to support developers on an open and extensible platform.
The team at GitHub is incredibly passionate about our work. We hear the pain you’re experiencing. We read every email, social post, support ticket, and we take it all to heart. We’re sorry.
We are committed to improving availability, increasing resilience, scaling for the future of software development, and communicating more transparently along the way.
Vladimir Fedorov is GitHub's Chief Technology Officer, bringing decades of experience in engineering leadership and innovation. A passionate advocate for developer productivity, Vlad is leading GitHub’s engineering team to shape the future of developer tools and innovation with a developer-first mindset.
Before joining GitHub, Vlad co-founded UserClouds, a startup specializing in data governance and privacy. He spent 12 years at Facebook, now Meta, as Senior Vice President, leading engineering teams of over 2,000 across Privacy, Ads, and Platform. Earlier in his career, Vlad worked at Microsoft and earned both his BS and MS in Computer Science from Caltech. He currently serves on the board of Codepath.org, an organization dedicated to reprogramming higher education to create the first AI-native generation of engineers, CTOs, and founders.
Vlad lives in the Bay Area and when not working enjoys spending time outside and on the water with his family.
But a VPS isn't actually infrastructure you control; you essentially have as much control over it as over "cloud", so I don't think that would count as "sovereign", would it?