Any idea what's happening? I thought PG published public domain books...
> 23644 downloads in the last 30 days.
I wonder if this is bot behavior? 23k downloads feels like a lot?
[0] https://www.gutenberg.org/browse/scores/top [1] https://www.gutenberg.org/ebooks/24855
https://dave.autonoma.ca/blog/2020/04/11/project-gutenberg-p...
https://projekt-gutenberg.org/authors/johann-wolfgang-von-go...
I just looked at the history (https://www.gutenberg.org/cache/epub/60600/pg60600-images.ht...) and it dates back to the 70s. There was me thinking it was some newfangled web thing.
https://www-gutenberg-org.translate.goog/cache/epub/64099/pg...
Good job
https://www.fadedpage.com/ from Canada I think
https://runeberg.org/ from Sweden
I had some small e-correspondence with Michael S. Hart back in the 90's as well, and made a few modest contributions to the project, which made my English major undergraduate heart swell with pride and joy.
I guess this is only to say that PG is special to me for these reasons, and I am glad to see it still thriving. <3
> Michael S. Hart began Project Gutenberg in 1971 with the digitization of the United States Declaration of Independence.[5] Hart, a student at the University of Illinois, obtained access to a Xerox Sigma V mainframe computer in the university's Materials Research Lab. […] This computer was one of the 15 nodes on ARPANET, the computer network that would become the Internet. Hart believed one day the general public would be able to access computers and decided to make works of literature available in electronic form for free. […]
And yes, the text needed a lot of processing to make it right.
Now, in my early fifties and with declining eyesight, that's out of reach.
Thanks for sticking with the project!
https://www.gutenberg.org/ebooks/feeds.html
Every day you'll get much more than you're bargaining for, right into your feed or inbox. Easily download the books you're interested in and put them on your Kindle.
I like a styled formatted book—would prefer PDFs. (I know, not a popular format apparently.)
I like the idea of Project Gutenberg but guess I found book scans on archive.org my preference.
My go-to example is Lewis Carroll's "Through the Looking Glass" with the fantastic art of John Tenniel and Carroll's sometimes creative formatting of the prose…
I see they (Project Gutenberg) have ePub now, which can be good if well done.
(If not well done it can be a kind of mess. Re-flowable "HTML", paginated… Anyone ever try to print a long web page and did you enjoy the result? Perhaps that is as much on the ePub reader though.)
Thanks for all the effort put into the site!
Keep up the good work!
I have very mixed feelings about Standard Ebooks and would much prefer being able to use Project Gutenberg directly, but one good thing Standard Ebooks does is that every book has an associated git repository (on GitHub), so it's (in principle) possible to see a history of fixes to the text over time.
Edit: welcome to your first comment after 9 years on HN btw, nice to have you here!
but yes, generally I agree with your point. Library of 75k books seems pretty valuable to have direct access to.
https://www.gutenberg.org/policy/license.html
[Way back in the early days of the iPhone, I sold a book reading app which was backed directly by Project Gutenberg texts, called “Eucalyptus”. I sent 20% of the gross profits to PG - which was never less than very supportive of the app - and felt good about doing so.]
Full story (in Italian) at https://www.wired.it/internet/web/2020/06/30/progetto-gutenb...
apparently this situation hasn't been resolved yet
Technically, I can also just directly pull the epub from Project Gutenberg, but sometimes the formatting leaves a lot to be desired.
Once you get an e-reader that runs a semi-capable OS (ex - stock android, even an older version), it's hard to go back to something like a kindle.
e-book app Gutebooks (in addition to their audio app), but it seems to have been deprecated (I'm no longer able to connect to the server on my copy, which I only got 'cause there was an in-app purchase to fund Project Librivox).
FWIW, Barnes & Noble has been plundering the public domain using a book composition/keying house in the Philippines to make their public domain books which they make available in their stores --- Amazon apparently has a similar setup for the Kindle Store:
https://www.amazon.com/Public-Domain-Books-Kindle-Store/s?k=...
Rather a shame that PG didn't monetize by putting their books up there pre-emptively.
site:gutenberg.org "it was the best of times"
Use ⌘ + + until you get the line length you like.
I was visiting the ruins of a monastery the other day, and one of the texts listed that it had a library of 320ish books.
I chuckled because I have almost 200 books in my personal Kindle library, but I was wrong. I actually have 75000+ books, thanks to Project Gutenberg.
I just haven't downloaded them all yet.
I suppose a printed book, black ink on paper, is "brutalist" and unpleasant to look at?
The text of a book shouldn't be encrusted with format, your reader or browser should contain the presentation that you want to see, find appealing, or need (accessibility).
I was unable to load it initially (got an error from Firefox) and had to re-attempt. Still slow if one forces a reload (shift-r, etc, to not use local cache).
Why?
You can download books in most browsers. I know Amazon have done things to make life difficult for other stores in the past.
How best to perform construction work and what it will cost for materials, labor, plant and general expenses are matters of vital interest to engineers and contractors. This book is a treatise on the methods and cost of concrete construction. No attempt has been made to present the subject of cement testing which is already covered by Mr. W. Purves Taylor's excellent book, nor to discuss the physical properties of cements and concrete, as they are discussed by Falk and by Sabin, nor to consider reinforced concrete design as do Turneaure and Maurer or Buel and Hill, nor to present a general treatise on cements, mortars and concrete construction like that of Reid or of Taylor and Thompson. On the contrary, the authors have handled the subject of concrete construction solely from the viewpoint of the builder of concrete structures. By doing this they have been able to crowd a great amount of detailed information on methods and costs of concrete construction into a volume of moderate size.
Why is it 'plundering' for B&N to print physical books, transport them to their brick-and-mortar stores to sell? There are real costs associated to doing so. It would not have zero cost for me to print and bind a copy myself at home.
Please don't do this.
Quote an authoritative source, not some AI bot known for ~~hallucinating~~ bullshitting.
Happy to make other updates! Writing specific notes on the talk page is helpful.
https://www.gutenberg.org/about/background/history_and_philo...
And as another person noted, the vast majority of books have HTML, EPUB, Mobi formats. We are also looking at both KEPUB (Kobo) and PDF which will probably come in the future.
https://www.gutenberg.org/cache/epub/1513/pg1513-images.html
https://standardebooks.org/ebooks/william-shakespeare/romeo-...
Each has its particular advantages relative to the other ...
The previous version of the site had two major flaws:
1. The search bar had been removed from the top of the page, and hidden behind a "Click here to search" (or similar) link partway down the page
2. Once you opened that page, the coloring of the site was so washed out on e-ink that the text input was hard to find.
Thanks for fixing it!
I've read more (meaningful) text on PG than any other digital platform. Huge fan. Thanks for all the work and for keeping it clean and free
Author dates are a much smaller data set, can generally be supplemented from public MARC records (VIAF, LoC, etc - I don't do that, but it's an option) and at least provide basic filtering / sorting.
I got it to a prototype level but then shelved it after having difficulty getting good results with various test datasets. Probably would make a fantastic ereader though
• On the one hand, E Ink devices have a fairly known set of limitations, and it would be ridiculous for me to expect them to render the whole web well.
• On the other hand, it's good for website designs to consider the kind of devices employed by their users. Using a Kindle to access Gutenberg is likely less of an edge case than it would be for other sites, so it's worth the extra design work.
(Keep in mind that -- given my sibling comment -- this is all theoretical. The latest iteration of Gutenberg's site is much better than the previous version)
Furthermore:
* Make sure that all books are downloadable in bulk as torrents.
* Every day, generate a CSV file of all available books and their metadata. Distribute this so that bots and user clients can run queries locally, instead of using your search engine.
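To make the CSV suggestion concrete, here is a rough sketch of what a client-side query could look like. The column names (`id`, `title`, `author`, `language`) and the placeholder rows are assumptions for illustration, not the layout of any real catalog file; only the 1513 row reflects a book linked elsewhere in this thread.

```python
import csv
import io

# Hypothetical excerpt of a daily catalog dump. Column names are assumptions,
# and the 9000xx rows are placeholders rather than real catalog entries.
SAMPLE_CATALOG = """id,title,author,language
1513,Romeo and Juliet,"Shakespeare, William",en
900001,Placeholder Title A,"Placeholder, Author",en
900002,Placeholder Title B,"Placeholder, Author",de
"""


def search_catalog(csv_text, **filters):
    """Return rows where every filtered field contains the given value (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row
        for row in reader
        if all(value.lower() in row[field].lower() for field, value in filters.items())
    ]


# Runs entirely on the local copy, so no request reaches the site's search engine.
matches = search_catalog(SAMPLE_CATALOG, author="shakespeare", language="en")
```

A real client would read the downloaded file from disk instead of a string, but the point stands: once the metadata is local, repeated queries cost the server nothing.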
I've since discontinued hosting it, but happy to add you all and merge into an official PG offering: https://www.reddit.com/r/SideProject/s/VtYKxjrMme
There are lots of reasons it could be preferable to centralize. OTOH their mission is limited and some competition is healthy, if only to explore alternative ways to do things.
I also have many positive things to say about Standard Ebooks, but I don't think you were asking about those. :)
----
Edit: Without going into what I think are the most egregious sort of changes they introduce (which I think will require a longer post) and limiting myself to ones easy to find immediately:
See the earlier discussion (linked in a sibling comment here) where the editor-in-chief says it's ok to change punctuation because "The sounds out of his mouth do not include an apostrophe whether it's there in the spelling or not." (a very American view IMO): https://news.ycombinator.com/item?id=16956931
And looking at a recent commit on one of their books, here's a recent (https://github.com/standardebooks/agatha-christie_the-secret...) revert of one of their aggressive "modernizations" from 2024 (https://github.com/standardebooks/agatha-christie_the-secret...), that had, in line with their usual practice, changed "every one" to "everyone" (in one place even when referring to "a good many risks"), and the same commit made other changes (including one still present) like "they ought to have it lithographed. It must be a frightful nuisance doing every one separately." having the last four words turned into "doing everyone separately."!
When I read an old novel, written two centuries ago in England, the little differences to modern English are part of the charm, and I certainly don't want any Americanism mixed in. For one of my favorite novels, The Forsyte Saga, the author deliberately used some rare forms of words, which SE replaced with the mainstream forms.
The ISP actually knows which subscriber is on that line, can send them notices, block them, terminate them... loads of things that you simply cannot do because you have no relation to this person. And frankly I wouldn't want to need to have a personal relation with every website that I visit; my ISP can reach me if there is anything relevant to continued use of the internet. From personal experience, when I was a teenager, the ISP cutting our household off after an abuse report was an effective way of stopping what I was doing
I have about 50k of the books, I would have used a torrent of just the txt files if it was prominent.
You could have seen it on some websites already
(I worked on iBooks for the Mac like 15 years ago—it's where I got to dive into the ePub format. A lot has changed in the standard since I am sure.)
EDIT: looks like EPUB3 has a "paginated" mode as well as more sophisticated layout tags.
Also appears to have support for ruby and vertical writing modes. This was not yet supported in WebKit when I worked on iBooks. Somehow, this white guy from Kansas (who knows no language other than English) got tapped to implement the vertical TOC for Asian languages. Also tasked with annotating the ePUB pages to display (also vertical) ruby text…
If you mean epub reader software, Calibre and a bunch of others have existed since pretty much the beginning of epub
(iPhones 15 Pro, 11 Pro, SE-2nd; and an iPad of some kind)
I've heard good things. Also - Sherlock Holmes :)
https://play.google.com/store/apps/details?id=biz.bookdesign...
should ~~be~~ EDIT have been ENDEDIT opensource --- it does at least work to support Project Librivox (or at least that's my understanding)
Seems to no longer be available (see below)
If Amazon is going to sell public domain texts, then it would make sense to source them from PG and send some money from those sales to the non-profit. Similarly, they could then funnel reports of typos to PG for review and correction (it was a bit of a struggle the last time I tried to get a text corrected, and the project founder/director actually stepped in on my behalf).
Also one should probably compare the former to the single-page version on standardebooks: https://standardebooks.org/ebooks/william-shakespeare/romeo-...
PG focuses on an accurate digital translation of the source material, sometimes hosting multiple different versions of the same text, and doing things like putting work into recreating the adverts at the back of some novels.
SE focuses less on preservation and more on making readers’ versions of the texts, like other publishing imprints. So there’s typography standardisation, a light-touch modernisation of hyphenation and soundalike spelling, and things like author-wide collections of short fiction and poetry even if it didn’t previously exist.
Both are valuable, but they serve different segments.
s2 < <3
I personally worked on The Forsyte Saga. If you think something was done in error, please let us know and we'll be happy to fix it.
Bot traffic comes from machines that usually have a lot of idle CPU (since they're largely blocked on network IO as they scrape a bunch of sites in parallel), so they can trivially solve the Anubis "proof of work" challenge, save the cookie, and then not solve it again for that site.
The only reason scrapers don't solve it is if the developers were too lazy to implement it... and modern scrapers do. Codeberg stopped using Anubis because modern scrapers were updated to solve it.
The "proof of work" has to be easy or else people on old cell phones couldn't access your site (since an old Android phone would start to overheat and throttle trying to solve a challenge that would take even a modern server several seconds), and it also drains your cell-phone users' batteries, which are a really precious resource for them compared to the idle CPU on a server.
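For anyone who hasn't seen one, a hash-based proof of work looks roughly like the sketch below. This is a generic hashcash-style illustration, not Anubis's actual scheme; the challenge string, the difficulty value, and the function names are all made up for the example.

```python
import hashlib
from itertools import count


def solve(challenge: str, difficulty: int) -> int:
    """Return the smallest nonce whose SHA-256 hex digest starts with `difficulty` zeros."""
    prefix = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Checking a solution costs a single hash, however long the search took."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


# Hypothetical challenge; each extra hex digit of difficulty multiplies the
# expected number of hashes by 16 for every client, phone or server alike.
nonce = solve("example-challenge", 3)
```

The difficulty is the only knob, and it applies to every client equally, which is why it cannot be set high enough to hurt a server without locking out slow phones first.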
Occasionally, you misclassify a real user as a bot, and then your reputation is ruined forever.
The official Polish train schedules website did this recently, feeding incorrect departure and arrival times to IP addresses known for aggressive scraping, without taking CGNAT into account. People... have noticed[1].
[1] (Polish) https://zaufanatrzeciastrona.pl/post/kto-i-dlaczego-losuje-w...
Every other system I've run into has constant false positives, e.g. Google captchas will sometimes say I've failed and make me do the hardest level (if it wasn't giving me that already), Cloudflare regularly thinks I'm a bot, Codeberg blocked me before, Github signup captchas used to take ~15 minutes to complete and then still said "well you failed, try again", Github's general rate limiting has false positives (some days I browse a lot, other days little, and on the little days it'll sometimes go "slow down" with no recourse whatsoever, you're just blocked for an indeterminate amount of time), OpenStreetMap blocks my browser at work because I'm using Firefox ESR instead of latest stable and it finds that user agent string to be implausible, whatever the German railway operator has been using for the past few days is triggering on me constantly, etc., etc., etc. Constant blocks everywhere.
With Anubis, my understanding is that you do the proof of work (with whatever implementation you like, it doesn't have to be the JavaScript one that they provide) and you can move on without ever doing any task yourself. The power consumption is a shame, but so long as attackers aren't even doing this much, the couple of joules it takes doesn't seem to be an issue
Of course, the attackers will evolve, but for now...
One of the things I give duckduckgo a lot of credit for is that while they're quick to interrupt me for a bot check (sometimes multiple times in a span of minutes) they'll let me identify ducks even on the most locked down browsers I use.
At least for the first few pages of content that I looked at on both versions.
More broadly, the position of Standard Ebooks is that a modern reader would be distracted by spellings like "some one" and "every thing", and by time written like "2.30" instead of "2:30", and that books in British quotation style must be converted to American quotation style. I think most readers can in fact tolerate such small differences, and this position is frankly insulting — the punctuation and spelling of works are part of their character, and if anything, I'm more distracted by such anachronisms in style introduced as part of the Standard Ebooks process.
https://news.ycombinator.com/item?id=16957359
The edit is still in place, and I still maintain that changing 'phone to phone in dialogue changes the meaning.
The only thing they are is truly, truly wonderful.
Curious. Why even bother?
If they can't keep their ranges clean to a reasonable degree, their customers will need to move if they want to access your part of the internet. New sign-ups will always be hard, so some amount of abuse is expected, but if it's the same abuse traffic for weeks after you've notified them, well, it stops being your problem at some point
Show people a useful error, such as "You are using [ISP name] which sends large volumes of abusive traffic (think of spam and DDoS). They allow the attackers to hop around points across their entire network so we cannot block the abusers more selectively. Despite our attempts to contact them, the abuse continues in volumes which we do not see from other ISPs. To access our corner of the internet, use a different ISP. You could try mobile data instead of Wi-Fi or vice versa.", and they can make their own choices about staying with this ISP if more and more websites show this sort of error
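As a rough sketch of the range check behind such an error page: the network below is a reserved documentation range and "Example ISP" is a placeholder, so none of this refers to a real provider.

```python
import ipaddress

# Hypothetical block list mapping address ranges to the ISP that owns them.
# 198.51.100.0/24 is reserved for documentation, so no real network is named.
BLOCKED_RANGES = {
    ipaddress.ip_network("198.51.100.0/24"): "Example ISP",
}


def blocked_isp(client_ip):
    """Return the ISP name if the address falls in a blocked range, else None."""
    addr = ipaddress.ip_address(client_ip)
    for network, isp in BLOCKED_RANGES.items():
        if addr in network:
            return isp
    return None


def error_message(client_ip):
    """Build a user-facing explanation, or None when the client is not blocked."""
    isp = blocked_isp(client_ip)
    if isp is None:
        return None
    return (
        f"You are using {isp}, which sends large volumes of abusive traffic. "
        "To access this site, use a different ISP."
    )
```

The message names the ISP rather than accusing the visitor, which matches the idea of letting customers decide for themselves whether to stay with that provider.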
If everyone tries to identify people piecemeal, we all need to implement ~200 different identification systems (assuming each country has a central system that everyone is signed up to in the first place), or rely on algorithms to tell who is a bot (I'm currently being misidentified on a daily basis and I'm, eh, not a bot. Trying to buy public transport tickets is currently difficult, for example, because the monopolist in my country blocks me after a few route queries when using a Google browser, and 0 queries from Firefox)
But nearly all print publishers also do what SE does. Why do you think they do, when it costs additional money and time to do that? A reasonable answer is that some, or a majority of, people prefer it.
To the ISPs? Each IP range has an abuse email address registered and this is specifically exempt from rate limiting at RIPE's WHOIS server. Not sure how it is in other RIRs but I just happen to know of this policy
You can automate the whole thing, provided that you have a reliable way of identifying the undesired traffic which you need anyway for being able to block it by any means. The trouble is in user identification (they'll just use a new IP address from that ISP or hosting provider if you don't tell the provider about the problematic user)
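The lookup half of that automation could be sketched like this, assuming you already have the WHOIS response text for the offending range. The sample response is illustrative only (real responses vary by registry), and the addresses use reserved example values.

```python
import re

# Illustrative WHOIS-style text; the exact layout of a real response
# depends on the registry, so treat this sample as an assumption.
SAMPLE_RESPONSE = """\
inetnum:        192.0.2.0 - 192.0.2.255
netname:        EXAMPLE-NET
role:           Example Abuse Desk
abuse-mailbox:  abuse@example.net
"""


def abuse_contacts(whois_text):
    """Collect addresses from `abuse-mailbox:` attributes, in order, without duplicates."""
    found = re.findall(
        r"^abuse-mailbox:[ \t]*(\S+)[ \t]*$",
        whois_text,
        flags=re.MULTILINE | re.IGNORECASE,
    )
    return list(dict.fromkeys(found))


contacts = abuse_contacts(SAMPLE_RESPONSE)
```

Fetching the response and sending the report are left out on purpose; those parts depend on the registry and on how you batch reports so that a provider gets one message per incident rather than one per request.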
Do they? To check, I tried to find a recent publication of Agatha Christie, and found the collection “Country Christie: Twelve Devonshire Mysteries” which says “First published by HarperCollins Publishers Ltd 2025”. It still has British-style punctuation (throughout the book), and times like “1.30”, “9.30”, “11.30”, “7.30 a.m.”, “12.30 p.m.”, and “8.30”. I checked a couple of other recent publications and admittedly they do modernize (though not in phrases like “every one of you”), but again I found the collection “The Last Seance: Haunting Tales from the Queen of Mystery” (2019) which does not. So it seems mixed.
In any case, I think it's fine to do what Standard Ebooks does, and if it were instead called something like “Modernized Ebooks with American punctuation”—if readers would know before picking one up—it would be totally unobjectionable. The name “Standard” gives the wrong impression. It's a bit like colorizing old black-and-white movies (or dubbing foreign-language movies instead of subtitling them): yes possibly even a majority of people may prefer it, but IMO it would be good to be more explicit what has been done.
The Bible is I daresay the most famous of these. Translations aside, even the English versions have had significant alterations done to wording, spelling, and meaning depending on the version.
There's also the Great Illustrated Classics imprint for certain classic novels like H.G. Wells's The Invisible Man. (I read that one like 10 times as a kid and it's what got me into sci-fi as a whole I'd argue. Haha.)
Whether these alternate versions are good or bad is obviously up for debate and depends on the person, but I'm just saying that what SE does is hardly new in the publishing world.
Others online have been writing about their own experience with the same stuff; it's not unique to PG at all, it's everywhere. Talk to anyone that runs a web server and they'll have these stories...
You also don't have to send out 1k support requests per hour. Could trial it with some hosting provider that you expect is responsive and see how it works out
edit: like, I just don't see another solution short of banning being anonymous online. Each site would have to know who you are. Someone has to be able to track it back to a person that is doing the abuse or there can't be any rules that we can apply. Imo it's better if that's the ISP (or VPN provider, say) who already has this information anyway