OpenAI shows exactly how well that works and what that kind of governance does to a company and to its support of science and the commons.
TL;DR, it's fucked.
I am wary of that. IMO the business model is damaged therein. You can say in 2022 we had 27; bankrupt in 2030.
The vacuum that arXiv originally filled was one of a glorified PDF hosting service with just enough of a reputation to allow some preprints to be cited in a formally published paper, and with just enough moderation to not devolve into spam and chaos. It has also been instrumental in pushing publishers towards open access (i.e., to finally give up).
Unfortunately, over the years, arXiv has become something like a "venue" in its own right, particularly in ML, with some decently cited papers never formally published and "preprints" being cited left and right. Consider the impression you get when seeing a reference to an arXiv preprint vs. a link to an author's institutional website.
In my view, arXiv fulfills its function better the less power it has as an institution, and I thus have exactly zero trust that the split from Cornell is driven by that function. We've seen the kind of appeasement prose from their statement and FAQ [1] countless times before, and it's now time for the usual routine of snapshotting the site to watch the inevitable amendments to the mission statement.
"What positive changes should users expect to see?" - I guess the negative ones we'll have to see for ourselves.
Is a mid-to-high engineering salary outlandish for a CEO of what is likely to be a fairly major non-profit? Even non-profits have to be somewhat competitive when it comes to salary, and the ideal candidate is likely someone who would be balancing this against a tenured position at a major university
A setup as a US-based "non-profit" is worrisome, if only because 300K is an obscene salary even in a for-profit setting. That the US-based posters can't see this is evidence of the basic problem which is that the US, both left and right, has been taken over by a neoliberal feudal antidemocratic nativist mindset that is anathema to the sort of free interchange of ideas that underlay the ArXiv's development in the hands of mathematicians and physicists now swept aside and ignored by machine learning grifters and technicians who program computers.
Any change to the basic premise will be a negative step.
They should just be boring, quiet, unopinionated, neutral background infrastructure.
Could they not have made it into some legal structure that puts universities at the top? Say, with a bunch of universities owning shares that comprise the entirety of the ownership of arXiv, but that would allow arXiv to independently raise funds?
You need your favourite academic gatekeeper (= thesis advisor) to vouch for you in order to be allowed to upload.
Then AI slop gets flagged and the shame spreads through the graph. And flaggings need to have evidence attached that can again be flagged.
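A minimal sketch of how that kind of propagation could work, assuming each author has exactly one endorser and that the blame halves with every hop up the chain. The function name, the `decay` factor, and the toy author names are all made up for illustration; none of this reflects how arXiv's endorsement system actually works, and the "flaggings need evidence that can itself be flagged" part is not modeled.

```python
def spread_penalty(endorsed_by, flagged_author, penalty=1.0, decay=0.5):
    """Walk up the endorsement chain from a flagged author.

    endorsed_by maps each author to the person who vouched for them,
    or None for someone at the root of the chain (hypothetical model).
    Returns a dict of author -> penalty received.
    """
    penalties = {}
    seen = set()  # guards against accidental cycles in the graph
    current = flagged_author
    while current is not None and current not in seen:
        seen.add(current)
        penalties[current] = penalty
        penalty *= decay  # each hop up the chain carries less blame
        current = endorsed_by.get(current)
    return penalties


# Toy chain: student <- advisor <- senior_prof
endorsed_by = {
    "student": "advisor",
    "advisor": "senior_prof",
    "senior_prof": None,
}
penalties = spread_penalty(endorsed_by, "student")
```

Whether the decay should be steep (endorsers barely suffer) or shallow (endorsers become very careful) is exactly the kind of policy question a scheme like this would have to settle.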
I had to tell my AI to set up an MCP for "fetch while bypassing arXiv's rate limit" so that it doesn't burn 40k tokens looking for workarounds every time it wants to look at a paper and gets hit with a "sorry, meatbags only" wall.
Very annoying, given how relevant arXiv papers are for ML specifically, and how many papers there are. Can't "human flesh search" through all of them to pick the relevant ones for your work, and they just had to insist on making it harder for AIs to do it too.
arXiv has become a target for grifters in other domains like health and supplements. I've seen several small scale health influencers who ChatGPT some "papers" and then upload them to arXiv, then cite arXiv as proof of their "published research". It's not fooling anyone who knows how research works but it's very convincing to an average person who thinks that they're doing the right thing when they follow sources that have done academic research.
I've been surprised at how bad and obviously grifty some of the documents I've seen on arXiv have become lately. Is there any moderation, or is it a free for all as long as you can get an invite?
All the Mozilla executives have done for the last 15+ years is
* lay off developers
* spend lots of money on stupid side projects nobody asked for or wants
* increase their own salaries
and all that with the backdrop of falling quality, market share, and relevance.
I would happily donate to Firefox, but this fucked up organization will never see a single cent from me. They will spend it on anything but Firefox, which is the only thing anybody wants them to spend it on.
It might already be too late, and we will be left with a browser monopoly.
Exactly. It should be a utility. Not quite dumb pipe, but not too far either.
The article says that "it will become an independent nonprofit corporation", and as OpenAI's failed attempt showed, converting a non-profit to a for-profit organization is either really hard or impossible.
> Could they not have made it into some legal structure that puts universities at the top?
As a corporation (even a non-profit one), it will have a board of directors. I have no idea what their charter will look like, but I would be surprised if at least one seat wasn't reserved for a university representative, and more than that seems quite likely as well.
> arXiv requires that users be endorsed before submitting their first paper to arXiv or a new category.
It's probably not perfect but in practice, it seems to have been enough to get rid of the worst crackpotty spam.
another will need to rise to take its place.
People keep falling into the same trap. They love monopolies, then are shocked when those monopolies jerk them around.
That is, it's not readily parseable, it really gives an insider term vibe - like this isn't for you if you don't already know what it means or how you should read or say it. It sort of reminds me of the overuse of Latin and Latinate terms generally in the old professions and, well, the academy.
Just always struck me as being somewhat at odds with the goal.
I think both sides could learn from the other. In the case of ML, I understand the desire to move fast and that average time to publication of 250-300 days in some of the top-tier journals can feel like an unnecessary burden. But having been on both sides of peer review, there is value to the system and it has made for better work.
Not doing any of it follows the same spirit as not benchmarking your approach against more than maybe one alternative and that already as an after-thought. Or benchmaxxing but not exploring the actual real-world consequences, time and cost trade offs, etc.
Now, is academic publishing perfect? Of course not, very very far from it. It desperately needs to be reformed to keep it economically accessible, time efficient for both authors, editors and peer reviewers and to prevent the "hot topic of the day" from dominating journals and making sure that peer review aligns with the needs of the community and actually improves the quality of the work, rather than having "malicious peer review" to get some citations or pet peeves in.
Given the power that the ML field holds and the interesting experiments with open review, I would wish for the field to engage more with the scientific system at large and perhaps try to drive reforms and improve it, rather than completely abandoning it and treating a PDF hosting service as a journal (ofc, preprints would still be desirable and are important, but they can not carry the entire field alone).
In my experience as a publishing scientist, this is partly because publishing with "reputable" journals is an increasingly onerous process, with exorbitant fees, enshittified UIs, and useless reviews. The alternative is to upload to arXiv and move on with your life.
This just isn't true. arXiv is not a venue. There's no place that gives you credit for arXiv papers. No one cares if you cite an arXiv paper or some random website. The vast vast majority of papers that have any kind of attention or citations are published in another venue.
It is an interesting instance of the rule of least power, https://en.wikipedia.org/wiki/Rule_of_least_power.
arXiv does not need to and should not optimize for "shareholder value", which is at least nominally the justification for outlandish CEO pay packages.
And, FWIW, I do think that arXiv truly has a vast potential to be improved. It is currently in the position to change the whole process of how research results are shared, yet it is still, as others have said, only a PDF hosting service. And since the universities couldn't break out of the whole Elsevier & co. scam despite the internet existing for 30 years, to me, breaking free from the university affiliation sounds like a good thing.
But, of course, I am talking only about the possibilities being out there. I know nothing about the people in charge of the whole endeavor, and ultimately it depends only on them whether it sails or sinks.
I read a dozen papers a month, typically on arxiv, never from paywalled journals. I find the quality on par. But maybe I'm missing something.
It's especially problematic because while ArXiv loves to claim to be working for open science, they don't default to open licensing. Many of the publications they host are not Open Access, and are read access only. So there is definitely the potential to close things off at some point in the future, when some CEO needs to increase value.
Oh, wait.
Sure, but the cost of living there is significantly higher as well. Anyway, I can hardly even comprehend these kinds of sums, though I am a bit of an outlier, as I earn around $27,700 as an SWE in Europe, which is low even by the standards of companies in my own country.
You get ahead as an academic computer scientist, for instance, by writing papers, not by writing software. Now there really are brilliant software developers in academic CS but most researchers write something that kinda works and give a conference talk about it -- and that's OK because the work to make something you can give a talk about is probably 20% of the work it would take to make something you can put in front of customers.
Because of that there are certain things academic researchers really can't do.
As I see it my experience in getting a PhD and my experience in startups is essentially the same: "how do you make doing things nobody has ever done before routine?" Talk to people in either culture and you see the PhD students are thinking about either working in academia or a very short list of big prestigious companies and people at startups are sure the PhDs are too pedantic about everything.
It took me a long time of looking at other people's side projects that are usually "I want to learn programming language X", "I want to rewrite something from Software Tools in Rust" to realize just how foreign that kind of creative thinking is to people -- I've seen it for a long time that a side project is not worth doing unless: (1) I really need the product or (2) I can show people something they've never seen before or better yet both. These sound different, but if something doesn't satisfy (2) you can usually satisfy (1) off the shelf. It just amazes me how many type (2) things stay novel even after 20 years of waiting.
Google "sorted out" a messy web with PageRank. Academic papers link to each other. What prevents us from building a ranking from there?
I'm conscious I might be over-simplifying things, but curious to see what I am missing.
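For what it's worth, the basic idea fits in a few lines. The following is a toy sketch of PageRank by power iteration over a citation graph, not a claim about how any real ranking service works; the paper IDs, the damping factor of 0.85, and the fixed iteration count are assumptions picked for illustration.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Toy PageRank over a citation graph.

    citations maps each paper to the list of papers it cites.
    Returns a dict of paper -> score; scores sum to 1.
    """
    papers = set(citations)
    for cited in citations.values():
        papers.update(cited)
    papers = sorted(papers)
    n = len(papers)

    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        # Every paper gets a small baseline share ("random jump").
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for p in papers:
            cited = citations.get(p, [])
            if cited:
                # Split this paper's rank among the papers it cites.
                share = damping * rank[p] / len(cited)
                for q in cited:
                    new_rank[q] += share
            else:
                # A paper that cites nothing spreads its rank evenly.
                share = damping * rank[p] / n
                for q in papers:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical four-paper graph: everyone cites C, D also cites A.
toy_citations = {
    "A": ["C"],
    "B": ["C"],
    "C": [],
    "D": ["A", "C"],
}
scores = pagerank(toy_citations)
ranking = sorted(scores, key=scores.get, reverse=True)
```

One thing a toy like this glosses over: citations only point backwards in time, so a brand-new paper starts at the bottom no matter how good it is, and the graph says nothing about why a paper was cited.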
arXiv is doomed. It was nice while it lasted.
The current balance where people write a paper with reviewers in mind, upload it to Arxiv before the review concludes and keep it on Arxiv even if rejected is a nice balance. People get to form their own opinion on it but there is also enough self-imposed quality control on it just due to wanting it to pass peer review, that even if it doesn't pass peer review, it is still better than if people write it in a way that doesn't care or anticipate peer review. And this works because people are somewhat incentivized to get peer reviewed official publications too. But being rejected is not the end of the world either because people can already read it and build on it based on Arxiv.
And while academic salaries are generally not great, tenured professors at big universities tend to make a fair bit (plus a lot more vacation time and perks than is normal in the US)
Though, saying that, I suppose all the reputation data is kind of public. Apart from emails/accounts.
The paper you link to counts as a publication, but its reputation stands on its own, it has nothing to do with arXiv as a venue. Ideally, that's how it is for all papers, but it isn't, just by publishing in certain venues your paper automatically gets a certain amount of reputation depending on the venue.
The problem is that "optimizing for peer-review" is not the same thing as optimizing for quality. E.g., I like to add a few tongue-in-cheeks to entertain the reader. But then I have to worry endlessly about anal-retentive reviewers who refuse to see the big picture.
People think, for instance, that RDFS and OWL are meant to SHACL people into bad and over-engineered ontologies. The problem is these standards add facts and don't subtract facts. At risk of sounding like ChatGPT: it's a data transformation system not a validation system.
That is, you're supposed to use RDFS to say something like
?s :myTermForLength ?o -> ?s :yourTermForLength ?o .
The point of the namespace system is not to harass you, it is to be able to suck in data from unlimited sources and transform it. Trouble is it can't do the simple math required to do that for real, like ?s :lengthInFeet ?o -> ?s :lengthInInches 12*?o .
Because if you were trying OWL-style reasoning over arithmetic you would run into Kurt Gödel kinds of problems. Meanwhile you can't subtract facts that fail validation, you can't subtract facts that you just don't need in the next round of processing. It would have made sense to promote SHACL first instead of OWL because garbage-in-garbage out, you are not going to reason successfully unless you have clean data... but what the hell do I know, I'm just an applications programmer who models business processes enough to automate them.
Similarly the problem of ordered collections has never been dealt with properly in that world. PostgreSQL, N1QL and other post-relational and document DB languages can write queries involving ordered collections easily. I can write rather unobvious queries by hand to handle a lot of cases (wrote a paper about it) but I can't cover all the cases and I know back in the day I could write SPARQL queries much better than the average RDF postdoc or professor.
As for underengineering, Dublin Core came out when I worked at a research library and it just doesn't come close in capability to MARC from 1970. Larry Masinter over at Adobe had to hack the standard to handle ordered collections because... the authors of a paper sure as hell care what order you write their names in. And it is all like that: RDF standards neglect basic requirements that they need to be useful and then all the complex/complicated stuff really stands out. If you could get the basics done maybe people would use them but they don't.
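To make the two rules above concrete, here is a rough sketch in plain Python rather than any real RDF toolkit. Triples are plain tuples, and `apply_rules`, the rule format, and the `:plank` / `:rope` facts are all invented for illustration. It does a single pass (no fixpoint) and only ever adds triples, which is the "adds facts, doesn't subtract facts" behaviour being described.

```python
def apply_rules(triples, rules):
    """Return the input triples plus every triple the rules derive.

    Each rule is (source_predicate, target_predicate, convert), where
    convert maps the old object value to the new one. Facts are only
    added, never removed.
    """
    derived = set(triples)
    for subject, predicate, obj in triples:
        for source_pred, target_pred, convert in rules:
            if predicate == source_pred:
                derived.add((subject, target_pred, convert(obj)))
    return derived


rules = [
    # Pure renaming: ?s :myTermForLength ?o -> ?s :yourTermForLength ?o
    (":myTermForLength", ":yourTermForLength", lambda value: value),
    # Renaming plus arithmetic: ?s :lengthInFeet ?o -> ?s :lengthInInches 12*?o
    (":lengthInFeet", ":lengthInInches", lambda value: 12 * value),
]

facts = {
    (":plank", ":lengthInFeet", 3),
    (":rope", ":myTermForLength", 10),
}
result = apply_rules(facts, rules)
```

The second rule is trivial in a general-purpose language precisely because nothing here attempts OWL-style reasoning over the arithmetic; it is a one-way data transformation.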
It's still the land of opportunities. It's easier to find ways to reduce your living costs than ways to increase your salary.
Non-profits aren't maximizing stock value, but they do need to optimize for stakeholder value - you want to maximize the amount of money being donated in and you want to make the most of the donations you receive, both to advance the primary mission of the non-profit and to instill confidence in donors. This demands competent leadership. The idea that just because something is not being done for profit means the value of the person's contributions is worth less is absurd.
In reality you could host the entire thing for well under $50k/year in hardware and storage if someone else is providing a free CDN. Their costs could be incredibly low.
But just like Wikipedia I see them very likely very quickly becoming a money hole that pretends to barely be kept afloat from donations. All when in reality what's actually happening is that a ridiculous number of rent seekers managed to ride the coattails of being the de facto preprint server for AI papers to land themselves cushy jobs at a place that spends 90+% of its money on flights and hotels and wages for its staff.
I'm already expecting their financial reports to look ridiculously headcount heavy with Personnel Expenses, Meetings and Travel blowing up. As well as the classic Wikipedia-style "we spend a ton of money on unclear costs" [1].
What's already sad is they stopped having a real broken-down report that actually showed things. Like look at this beautiful screenshot of an Excel sheet. Imagine if Wikipedia produced anything this clear. [2]
[0] https://blog.arxiv.org/2023/12/18/faster-arxiv-with-fastly/
[1] https://info.arxiv.org/about/reports/FY26_Budget_Public.pdf
[2] https://info.arxiv.org/about/reports/2020_arXiv_Budget.pdf
Ladybird continues to have the appearance of making progress, fwiw:
"oh no, you see we are not a preprint server host anymore, our mission is a values driven blablabla to make a meaningful change in the blablabla, we have spent X dollars to promote the blablabla, take me seriously please I'm also fancy like you! "
can you think of a better one?
To this end, they added an endorsement requirement this year: https://blog.arxiv.org/2026/01/21/attention-authors-updated-...
Now, honestly, I have no idea why one would spend resources on uploading terabytes of LLM garbage to arXiv, but they sure can. Even if some crazy person is publishing like 2 nonsense papers daily, it is no harm and, if anything, valid data for psychology research. But if somebody actually floods it with non-human-generated content, well, I suppose it isn't even that expensive to make ArXiv totally unusable (and perhaps even unfeasible to host). So there has to be some filtering. But only to prevent the abuse.
Otherwise, I indeed think that proper ranking, linking and user-driven moderation (again, not to prevent anybody from posting anything, but to label papers as more interesting for the specific community) is the only right way to go.
But yes it's a people problem, not an arxiv problem.
It's even less. I would bet if it's not now, for the vast majority of its life it was a machine at someone's desk at Cornell.
I don't see much of a monopoly, nor any "moat" apart from it being recognised. You can already post preprints on a personal website or on github, and there are "alternatives" such as researchgate that can also host preprints, or zenodo. There are also some lesser known alternatives even. I do not see anything special in hosting preprints online apart from the convenience of being able to have a centralised place to place them and search for them (which you call "monopoly"). If anything, the recognisability and centrality of arxiv helped a lot in the old, darker days to establish open access to papers. There was a time when many journals would not let you publish a preprint, or have all kinds of weird rules when you can and when you can't. Probably still to some degree.
Mozilla certainly won't spend it on Firefox, because the structure of the organization legally prohibits them from spending any of their donation money on Firefox. The "side projects" are, at least officially, the real purpose of Mozilla.
So if this is correct, then even in Switzerland, it seems like $300,000 per year would be an obscenely high salary for a senior developer.
Isn't that actually kind of a good brand signal for a repo of very specialized papers? "Fun with learning" in Comic Sans wouldn't help credibility.
But yeah, this is just how it works. Things can't stay good for too long. One must always be on the lookout for the new small thing that's not yet corrupted. Stay with it for a while until it rots, then jump to the next replacement.
Even if we scope it to SWE, I don't think that's far off the US percentiles.
In London I imagine the top 10% SWE is not even 100k GBP. In Germany even worse.
It is actually quite common to come across HAL in subfields of mathematics in my experience.
To me it's just a way to get out your work fast, so that there is already a trace of it on the Internets - nothing more and nothing less.
> That is, it's not readily parseable, it really gives an insider term vibe...
Isn't that normal with highly specialized research fields? I agree many papers could benefit from clearer wording, but working in a niche means you sometimes don't reach a broader audience
The original service didn't even have a name, only a description, and it was amusingly hosted at xxx.lanl.gov. But LANL wasn't really interested in it, and the founder eventually left for Cornell. At that point, the service needed a domain name, but archive.org was already taken.
And besides, the name has Ancient Greek influences. A similar Latinate term might be something like "archive".
The reason is that arxiv is growing significantly, leading to a 297,000 deficit in operating costs for 2025 alone. Cornell has helped with donations, along with other organizations that pay membership fees.
As a result, donors + leaders of arxiv think it's best to spin off to increase funding.
Personally I think this resource mismatch can help drive creative choice of research problems that don't require massive resources. To misquote Feynman, there's plenty of room at the bottom.
I wonder when they will introduce the algorithmic feed and the social network features.
Dollars? So 300 people's cable bill? That's basically nothing. They're spending too much, and it's still nothing, and the solution is going to be to privatize it and eventually loot it.
You can't hand out a collection plate and get $300K for Arxiv? Your local neighborhood church can. Civilization is obviously collapsing.
Can you elaborate on that?
So if OpenAI with billions of dollars only partially succeeded at converting to a for-profit business, then that suggests that organizations with fewer resources (like arXiv) have much worse odds.
But I did justify and maybe to reword slightly, surely if one of the main drivers is opening up research, the brand name should be something that's less obscure and more accessible / understandable as to what it is on first sight?
Maybe arXiv evoking the word 'archive' with an ancient Greek twist does that for some, but it's clearly a bit cryptic for many, and if the point is to open up probably the brand should just be something much plainer.
Good riddance! But not relevant in the least.
I can not imagine what one could possibly need $300,000 per year for unless an apartment costs like $200,000 per year.
When I used to visit the Meta campus in Menlo Park, the QA folk I worked with were commuting 2 hours each way just to be able to afford housing. I've no idea how far away the janitorial staff must have lived to do the same
Everything published on arXiv could also be published on Zenodo, but not the other way around.
Being able to afford unpredictable expenses and not have it bankrupt you. In the US, that would include healthcare. Everywhere in the world, that would be useful if you were laid off.
Not really a tenable long-term situation for a senior employee with plans to start a family. Family homes of decent size and area are literally millions of dollars.
Besides, I did already say that everyone else was underpaid relative to costs. But that's not unique to the Bay Area. Cost of housing relative to income is terrible in almost all of the major European cities too.
Once cities become wealthy enough to develop a home owning class, they seem to cease being able to provision adequate housing supply in general.
I've contracted into some consultancy teams which you could uncharitably describe as "15 people and $4mn/yr to create one PDF per month".
Bigger problem in the SF area is that a bunch of folks who owned property before the gold rush have ended up real-estate-rich, and formed a voting bloc that actively prevents the construction of new housing (on the basis that it might devalue their accidental real estate investment).
Also, the "human review" is a simple moderation process [1]. It usually does not dig into the submission's scientific merits.
Using a brand as a filter where you have to already know what it means to get it is exactly the opposite of what it's supposed to achieve.
Consider the most exclusive (successful) brands that exist. Even there, where exclusivity is a brand goal, none of them have this property of being obscure on first contact.
It's reasonable to have a tradeoff here to avoid cranks and now AI psychosis slop. You can still post on ResearchGate and academia.edu or your own GitHub page or webhosting.
Most people I talk to hate that pipeline and spend a lot of debug hours on it when Arxiv can't compile what overleaf and your local latex install can.
The reason authors like and use arxiv is that it gives 1) a timestamp, 2) a standardized citable ID, and 3) stable hosting of the pdf. And readers like the no-nonsense single click download of the pdf and a barebones consistent website look.
All else is a side show.
Spinning the service off forces the labor out onto other universities rather than leaving it solely to Cornell.
Arxiv doesn't need moderation. Nobody is asking for Arxiv moderation. It needs minimal checks to remove overtly illegal content.
Seems like a lot of people are asking for moderation. And moderation is a pretty big part of the existing offering[1].
No. Around half the cost is infrastructure. The other half of the cost is people. i.e. engineers to maintain infra and build mod tools for moderators to operate.
> Arxiv doesn't need moderation. Nobody is asking for Arxiv moderation.
This is just not true. Tons of people ask for arxiv to have moderation. Especially since covid, etc when antivaxxers and alternative medicine peddlers started trying to pump the medical categories of arxiv with quack science preprints and then go on to use the arxiv preprint and its DOI to take advantage of non academics who don't really understand what arxiv is other than it looks vaguely like a journal.
And doubly so now that people keep submitting AI generated slop papers to the service trying to flood the different categories so they can pad their resumes or CVs. And on top of that people who don't actually understand the fields they are trying to write papers in using AI to generate "innovative papers" that are completely nonsensical but vaguely parroting the terms of art.
The only reason you don't see more people calling for arxiv moderation is because they already spend so much time on it. If they were to stop moderating the site it would overflow into an absolute nightmare of garbage near overnight. And people wouldn't be upset with the users uploading this of course, they'd be upset with arxiv for failing to take action.
Moderation is inherently unappreciated because in the ideal form it should be effectively invisible (which arxiv's mostly is).
If you want to see the type of stuff that arxiv keeps out, go over to viXra [1] or you can watch k-theory's video [2] having fun digging through some of the quality posts that live over on that site.
I fundamentally dislike these busybodies for whom the already existing endorsement gatekeeping isn't enough. Science is not an institution and not a community but an attitude towards understanding reality.
I dislike quacks and cranks. But I don't mind if they can host their nonsense online alongside real papers on an explicitly unreviewed site. It's a bad precedent and will just ratchet more and more close to reviewing except not by blinded peers but by the Arxiv overlords based on vibes and political valence.