Gwtar: A static efficient single-file HTML format

TIL about window.stop() - the key to this entire thing working, it's causes the browser to stop loading any more assets: https://developer.mozilla.org/en-US/docs/Web/API/Window/stop

Apparently every important browser has supported it for well over a decade: https://caniuse.com/mdn-api_window_stop

Here's a screenshot illustrating how window.stop() is used - https://gist.github.com/simonw/7bf5912f3520a1a9ad294cd747b85... - everything after <!-- GWTAR END is tar compressed data.

I was on board until I saw that those can't easily be opened from a local file. Seems like local access is one of the main use case for archival formats.

I would like to know why ZIP/HTML polyglot format produced by SingleFile [1] and mentioned in the article "achieve static, single, but not efficiency". What's not efficient compared to the gwtar format?

[1] https://github.com/gildas-lormeau/Polyglot-HTML-ZIP-PNG

The author dismisses WARC, but I don't see why. To me, Gwtar seems more complicated than a WARC, while being less flexible and while also being yet another new format thrown onto the pile.

I agree with the motivation and I really like the idea of a transparent format, but the first example link doesn’t work at all for me in Safari.

Very cool idea. I think single-file HTML web apps are the most durable form of computer software. A few examples of Single-File Web Apps that I wrote are: https://fuzzygraph.com and https://hypervault.github.io/.

Pretty cool. I made something similar (much more hacky) a while ago: https://github.com/AdrianVollmer/Zundler

Works locally, but it does need to decompress everything first thing.

It's fairly common for archivers (including archive.org) to inject some extra scripts/headers into archived pages or otherwise modify the content slightly (e.g. fixing up relative links). If this happens, will it mess up the offsets used for range requests?

Hmm, so this is essentially the appimage concept applied to web pages, namely:

- an executable header

- which then fuse mounts an embedded read-only heavily compressed filesystem

- whose contents are delivered when requested (the entire dwarf/squashfs isn't uncompressed at once)

- allowing you to pack as many of the dependencies as you wish to carry in your archive (so, just like an appimage, any dependency which isn't packed can be found "live"

- and doesn't require any additional, custom infrastructure to run/serve

Neat!

I really don't understand why a zip file isn't a good solution here. Just because is requires "special" zip software on the server?

I gave up a long time ago and started using the "Save as..." on browsers again. At the end of the day, I am interested in the actual content and not the look/feel of the page.

I find it easier to just mass delete assets I don't want from the "pageTitle_files/" directory (js, images, google-analytics.js, etc).

Does this verify and/or rewrite the SRI integrity hashes when it inlines resources?

Would W3C Web Bundles and HTTP SXG Signed Exchanges solve for this use case?

WICG/webpackage: https://github.com/WICG/webpackage#packaging-tools

"Use Cases and Requirements for Web Packages" https://datatracker.ietf.org/doc/html/draft-yasskin-wpack-us...

The example link doesn't work for me at all in iOS safari?

https://gwern.net/doc/philosophy/religion/2010-02-brianmoria...

I will try on Chrome tomorrow.

Hmm, I’m interested in this, especially since it applies no compression delta encoding might be feasible for daily scans of the data but for whatever reason my Brave mobile on iOS displays a blank page for the example page. Hmm, perhaps it’s a mobile rendering issue because Chrome and Safari on iOS can’t do it either https://gwern.net/doc/philosophy/religion/2010-02-brianmoria...

Gwtar seems like a good solution to a problem nobody seemed to want to fix. However, this website is... something else. It's full of inflated self impprtantance, overly bountiful prose, and feels like someone never learned to put in the time to write a shorter essay. Even the about page contains a description of the about page.

I don't know if anyone else gets "unemployed megalomaniacal lunatic" vibes, but I sure do.

TIL about window.stop() - the key to this entire thing working, it's causes the browser to stop loading any more assets: https://developer.mozilla.org/en-US/docs/Web/API/Window/stop

Apparently every important browser has supported it for well over a decade: https://caniuse.com/mdn-api_window_stop

Here's a screenshot illustrating how window.stop() is used - https://gist.github.com/simonw/7bf5912f3520a1a9ad294cd747b85... - everything after <!-- GWTAR END is tar compressed data.

Neat! I didn't know about this either.

Php has a similar feature called __halt_compiler() which I've used for a similar purpose. Or sometimes just to put documentation at the end of a file without needing a comment block.

The author dismisses WARC, but I don't see why. To me, Gwtar seems more complicated than a WARC, while being less flexible and while also being yet another new format thrown onto the pile.

WARC is mentioned with very specific reason not being good enough: "WARCs/WACZs achieve static and efficient, but not single (because while the WARC is a single file, it relies on a complex software installation like WebRecorder/Replay Webpage to display)."

I don't think you can provide a URL to a WARC that can be clicked to view its content directly in your browser.

Neat! I didn't know about this either.

Php has a similar feature called __halt_compiler() which I've used for a similar purpose. Or sometimes just to put documentation at the end of a file without needing a comment block.

I don't think you can provide a URL to a WARC that can be clicked to view its content directly in your browser.

At the very least, WARC could have been used as the container ("tar") format after the preamble of Gwtar. But even there, given that this format doesn't work without a web server (unlike SingleFile, mentioned in the article), I feel like there's a lot to gain by separating the "viewer" (Gwtar's javascript) from the content, such that the viewer can be updated over time without changing the archives.

I certainly could be missing something (I've thought about this problem for all of a few minutes here), but surely you could host "warcviewer.html" and "warcviewer.js" next to "mycoolwarc.warc" "mycoolwrc.cdx" with little to no loss of convenience, and call it a day?

I agree with the motivation and I really like the idea of a transparent format, but the first example link doesn’t work at all for me in Safari.

[1] https://github.com/gildas-lormeau/Polyglot-HTML-ZIP-PNG

'efficiency' is downloading only the assets needed to render the current view. How does it implement range requests and avoid downloading the entire SingleFileZ when a web browser requests the URL?

Pretty cool. I made something similar (much more hacky) a while ago: https://github.com/AdrianVollmer/Zundler

Works locally, but it does need to decompress everything first thing.

So this is like SingleFileZ in that it's a single static inefficient HTML archive, but it can easily be viewed locally as well?

How does it bypass the security restrictions which break SingleFileZ/Gwtar in local viewing mode? It's complex enough I'm not following where the trick is and you only mention single-origin with regard to a minor detail (forms).

I was on board until I saw that those can't easily be opened from a local file. Seems like local access is one of the main use case for archival formats.

Hmm, so this is essentially the appimage concept applied to web pages, namely:

- an executable header

- which then fuse mounts an embedded read-only heavily compressed filesystem

- whose contents are delivered when requested (the entire dwarf/squashfs isn't uncompressed at once)

- allowing you to pack as many of the dependencies as you wish to carry in your archive (so, just like an appimage, any dependency which isn't packed can be found "live"

- and doesn't require any additional, custom infrastructure to run/serve

Neat!

Agreed, I was thinking it's like asm.js where it can "backdoor pilot" [1] an interesting use case into the browser by making it already supported by default.

But not being able to "just" load the file into a browser locally seems to defeat a lot of the point.

[1] https://en.wikipedia.org/wiki/Television_pilot#Backdoor_pilo...

I really don't understand why a zip file isn't a good solution here. Just because is requires "special" zip software on the server?

> Just because is requires "special" zip software on the server?

Yes. A web browser can't just read a .zip file as a web page. (Even if a web browser decided to try to download, and decompress, and open a GUI file browser, you still just get a list of files to click.) Therefore, far from satisfying the trilemma, it just doesn't work.

And if you fix that, you still generally have a choice between either no longer being single-file or efficiency. (You can just serve a split-up HTML from a single ZIP file with some server-side software, which gets you efficiency, but now it's no longer single-file; and vice-versa. Because if it's a ZIP, how does it stop downloading and only download the parts you need?)

Zip stores its central directory at the end of the file. To find what's inside and where each entry starts, you need to read the tail first. That rules out issuing a single Range request to grab one specific asset.

Tar is sequential. Each entry header sits right before its data. If the JSON manifest in the Gwtar preamble says an asset lives at byte offset N with size M, the browser fires one Range request and gets exactly those bytes.

The other problem is decompression. Zip entries are individually deflate-compressed, so you'd need a JS inflate library in the self-extracting header. Tar entries are raw bytes, so the header script just slices at known offsets. No decompression code keeps the preamble small.

I gave up a long time ago and started using the "Save as..." on browsers again. At the end of the day, I am interested in the actual content and not the look/feel of the page.

I find it easier to just mass delete assets I don't want from the "pageTitle_files/" directory (js, images, google-analytics.js, etc).

I don't know if anyone else gets "unemployed megalomaniacal lunatic" vibes, but I sure do.

Does this verify and/or rewrite the SRI integrity hashes when it inlines resources?

Would W3C Web Bundles and HTTP SXG Signed Exchanges solve for this use case?

WICG/webpackage: https://github.com/WICG/webpackage#packaging-tools

"Use Cases and Requirements for Web Packages" https://datatracker.ietf.org/doc/html/draft-yasskin-wpack-us...

The example link doesn't work for me at all in iOS safari?

https://gwern.net/doc/philosophy/religion/2010-02-brianmoria...

I will try on Chrome tomorrow.

Have you https://addons.mozilla.org/firefox/addon/single-file/?

If you really just want the text content you could just save markdown using something like https://addons.mozilla.org/firefox/addon/llmfeeder/.

I find that 'save as' horribly breaks a lot of web pages. There's no choice these days but to load pages with JS and serialize out the final quiescent DOM. I also spend a lot of time with uBlock Origin and AlwaysKillSticky and NoScript wrangling my archive snapshots into readability.

Save as doesn't work on sites that lazy load.

The range requests are to offsets in the original file, so I would think that most cases of 'live' injection do not necessarily break it. If you download the page and the server injects a bunch of JS into the 'header' on the fly and the header is now 10,000 bytes longer, then it doesn't matter, since all of the ranges and offsets in the original file remain valid: the first JPG is still located starting at offset byte #123,456 in $URL, the second one is located starting at byte #456,789 etc, no matter how much spam got injected into it.

Beyond that, depending on how badly the server is tampering with stuff, of course it could break the Gwtar, but then, that is true of any web page whatsoever (never mind archiving), and why they should be very careful when doing so, and generally shouldn't.

Now you might wonder about 're-archiving': if the IA serves a Gwtar (perhaps archived from Gwern.net), and it injects its header with the metadata and timeline snapshot etc, is this IA Gwtar now broken? If you use a SingleFile-like approach to load it, properly force all references to be static and loaded, and serialize out the final quiescent DOM, then it should not be broken and it should look like you simply archived a normal IA-archived web page. (And then you might turn it back into a Gwtar, just now with a bunch of little additional IA-related snippets.) Also, note that the IA, specifically, does provide endpoints which do not include the wrapper, like APIs or, IIRC, the 'if_/' fragment. (Besides getting a clean copy to mirror, it's useful if you'd like to pop up an IA snapshot in an iframe without the header taking up a lot of space.)

gwern is a legendary blogger (although blogger feels underselling it… “publisher”?) and has earned the right to self-aggrandize about solving a problem he has a vested interest in. Maybe he’s a megalomaniac and/or unemployed and/or writing too many words but after contributing so much, he has earned it.

Wow, thats one hell of a reaction to someone's blog post introducing their new project.

Its almost as if someone charged you $$ for the privilege of reading it, and you now feel scammed, or something?

Perhaps you can request a refund. Would that help?

What's up with the non-stop knee-jerk bullshit ad hom on HN lately?

> Does this verify and/or rewrite the SRI integrity hashes when it inlines resources?

As far as I know, we do not have any hash verification beyond that built into TCP/IP or HTTPS etc. I included SHA hashes just to be safe and forward compatible, but they are not checked.

There's something of a question here of what hashes are buying you here and what the threat model is. In terms of archiving, we're often dealing with half-broken web pages (any of whose contents may themselves be broken) which may have gone through a chain of a dozen owners, where we have no possible web of trust to the original creator, assuming there is even one in any meaningful sense, and where our major failure modes tend to be total file loss or partial corruption somewhere during storage. A random JPG flipping a bit during the HTTPS range request download from the most recent server is in many ways the least of our problems in terms of availability and integrity.

This is why I spent a lot more time thinking about how to build FEC in, like with appending PAR2. I'm vastly more concerned about files being corrupted during storage or the chain of transmission or damaged by a server rewriting stuff, and how to recover from that instead of simply saying 'at least one bit changed somewhere along the way; good luck!'. If your connection is flaky and a JPEG doesn't look right, refresh the page. If the only Gwtar of a page that disappeared 20 years ago is missing half a file because a disk sector went bad in a hobbyist's PC 3 mirrors ago, you're SOL without FEC. (And even if you can find another good mirror... Where's your hash for that?)

> Would W3C Web Bundles and HTTP SXG Signed Exchanges solve for this use case?

No idea. It sounds like you know more about them than I do. What threat do they protect against, exactly?

It also doesn't work on desktop Safari 26.2 (or perhaps it does, but not to the extent intended -- it appears to be trying to download the entire response before any kind of content painting.)

'efficiency' is downloading only the assets needed to render the current view. How does it implement range requests and avoid downloading the entire SingleFileZ when a web browser requests the URL?

I haven't looked closely, but I get the impression that this is an implementation detail which is not really related to the format. In this case, a polyglot zip/html file could also interrupt page loading via a window.stop() call and rely on range requests (zip.js supports them) to unzip and display the page. This could also be transparent for the user, depending on whether the file is served via HTTP or not. However, I admit that I haven't implemented this mechanism yet.

So this is like SingleFileZ in that it's a single static inefficient HTML archive, but it can easily be viewed locally as well?

The content is in an iframe, my code is outside of it, and the two frames are passing messages back and forth. Also I'm monkey patching `fetch` and a few other things.

Agreed, I was thinking it's like asm.js where it can "backdoor pilot" [1] an interesting use case into the browser by making it already supported by default.

But not being able to "just" load the file into a browser locally seems to defeat a lot of the point.

[1] https://en.wikipedia.org/wiki/Television_pilot#Backdoor_pilo...

> Just because is requires "special" zip software on the server?

We're talking about servers here - the article specifically said that one of the requirements was no special _server_ software, and a web server almost certainly has zip (or tar) installed. These gwtar files don't work without a server apparently either.

You can also read a zip sequentially like a tar file. Some info is in the directory only but just for getting file data you can read the file records sequentially. There are caveats about when files appear multiple times but those caveats also apply to processing tar streams.

Have you https://addons.mozilla.org/firefox/addon/single-file/?

If you really just want the text content you could just save markdown using something like https://addons.mozilla.org/firefox/addon/llmfeeder/.

On the subject of SingleFile there is also WebScrapBook: https://github.com/danny0838/webscrapbook

I prefer it because it can save without packing the assets into one HTML file. Then it's easy to delete or hardlink common assets.

Save as doesn't work on sites that lazy load.

> that this is an implementation detail which is not really related to the format. In this case, a polyglot zip/html file could also interrupt page loading via a window.stop() call...However, I admit that I haven't implemented this mechanism yet.

Well, yes. That's why we created Gwtar and I didn't just use SingleFileZ. We would have preferred to not go to all this trouble and use someone else's maintained tool, but if it's not implemented, then I can't use it.

(Also, if it had been obvious to you how to do this window.stop+range-request trick beforehand, and you just hadn't gotten around to implementing it, it would have been nice if you had written it up somewhere more prominent; I was unable to find any prior art or discussion.)

I was more willing to accept gwern’s eccentricities in the past but as we learn more about MIRI and its questionable funding resources, one wonders how much he’s tied up in it.

The Lighthaven retreat in particular was exceptionally shady, possibly even scam-adjacent; I was shocked that he participated in it.

The content is in an iframe, my code is outside of it, and the two frames are passing messages back and forth. Also I'm monkey patching `fetch` and a few other things.

> Does this verify and/or rewrite the SRI integrity hashes when it inlines resources?

As far as I know, we do not have any hash verification beyond that built into TCP/IP or HTTPS etc. I included SHA hashes just to be safe and forward compatible, but they are not checked.

> Would W3C Web Bundles and HTTP SXG Signed Exchanges solve for this use case?

No idea. It sounds like you know more about them than I do. What threat do they protect against, exactly?

Wow, thats one hell of a reaction to someone's blog post introducing their new project.

Its almost as if someone charged you $$ for the privilege of reading it, and you now feel scammed, or something?

Perhaps you can request a refund. Would that help?

It also doesn't work on desktop Safari 26.2 (or perhaps it does, but not to the extent intended -- it appears to be trying to download the entire response before any kind of content painting.)

OK, but how does that get you 'efficiency' if you're doing this weird thing where you serialize the entire page into some JSON blob and pass it in to an iframe or whatever? That would seem to destroy the 'efficiency' property of the trilemma. How do you get the full set of single-file, static, and efficient, while still working locally?

What's up with the non-stop knee-jerk bullshit ad hom on HN lately?

The earth is falling out from under a lot of people, and they're trying to justify their position on the trash heap as the water level continues to rise around it. It's a scary time.

Technically it’s only an ad hominem when you’re using the insult as a component in a fallacious argument; the parent comment is merely stating an aesthetic opinion with more force than is typically acceptable here.

On the subject of SingleFile there is also WebScrapBook: https://github.com/danny0838/webscrapbook

I prefer it because it can save without packing the assets into one HTML file. Then it's easy to delete or hardlink common assets.

The reason I did not implement the innovative mechanism you describe is because, in my case, all the technical effort was/is focused on reading the archive from the filesystem. No one has suggested it either.

Edit: Actually, SingleFile already calls window.stop() when displaying a zip/html file from HTTP, see https://github.com/gildas-lormeau/single-file-core/blob/22fc...

I was more willing to accept gwern’s eccentricities in the past but as we learn more about MIRI and its questionable funding resources, one wonders how much he’s tied up in it.

The Lighthaven retreat in particular was exceptionally shady, possibly even scam-adjacent; I was shocked that he participated in it.

What does any of that have to do with the value of what’s presented in the article?

The earth is falling out from under a lot of people, and they're trying to justify their position on the trash heap as the water level continues to rise around it. It's a scary time.

What does any of that have to do with the value of what’s presented in the article?

I read your BRILLIANT synopsis in the tone of Sir Humphrey (the civil servant) from "Yes Minister". Fits perfectly. Take a bow, good sir ...

Edit: Actually, SingleFile already calls window.stop() when displaying a zip/html file from HTTP, see https://github.com/gildas-lormeau/single-file-core/blob/22fc...

I read your BRILLIANT synopsis in the tone of Sir Humphrey (the civil servant) from "Yes Minister". Fits perfectly. Take a bow, good sir ...

The call to window.stop() stops HTML parsing/rendering, which is unnecessary since the script has downloaded the page via HTTP and will decompress it as-is as a binary file (zip.js supports concatenated payloads before and after the zip data). However, in my case, the call to window.stop() is executed asynchronously once the binary has been downloaded, and therefore may be too late. This is probably less effective than in your case with gtwar.

I implemented this in the simplest way possible because if the zip file is read from the filesystem, window.stop() must not be called immediately because the file must be parsed entirely. In my case, it would require slightly more complex logic to call window.stop() as early as possible.

Edit: maybe it's totally useless though, as documented here [1]: "Because of how scripts are executed, this method cannot interrupt its parent document's loading, but it will stop its images, new windows, and other still-loading objects." (you mentioned it in the article)

[1] https://developer.mozilla.org/en-US/docs/Web/API/Window/stop

Gwtar is a new polyglot HTML archival format which provides a single, self-contained, HTML file which still can be efficiently lazy-loaded by a web browser. This is done by a header’s JavaScript making HTTP range requests. It is used on Gwern.net to serve large HTML archives.

Archiving HTML files faces a trilemma: it is easy to create an archival format which is any two of static (self-contained ie. all assets included, no special software or server support), a single file (when stored on disk), and efficient (lazy-loads assets only as necessary to display to a user), but no known format allows all 3 simultaneously.

We introduce a new format, Gwtar (logo; pronounced “guitar”, .gw⁠tar.html extension), which achieves all 3 properties simultaneously. A Gwtar is a classic fully-inlined HTML file, which is then processed into a self-extracting concatenated file of an HTML + JavaScript header followed by a tarball of the original HTML and assets. The HTML header’s JS stops web browsers from loading the rest of the file, loads just the original HTML, and then hooks requests and turns them into range requests into the tarball part of the file.

Thus, a regular web browser loads what seems to be a normal HTML file, and all assets download only when they need to. In this way, a static HTML page can inline anything—such as gigabyte-size media files—but those will not be downloaded until necessary, even while the server sees just a single large HTML file it serves as normal. And because it is self-contained in this way, it is forwards-compatible: no future user or host of a Gwtar file needs to treat it specially, as all functionality required is old standardized web browser/server functionality.

Gwtar allows us to easily and reliably archive even the largest HTML pages, while still being user-friendly to read.

Example pages: “The Secret of Psalm 46” (vs original SingleFile archive—warning: 286MB download).

Background

Linkrot is one of the biggest challenges for long-term websites. Gwern.net makes heavy use of web page archiving to solve this; and due to quality problems and long-term reliability concerns, simply linking to the Internet Archive is not enough, so I try to create & host my own web page archives of everything I link.

There are 3 major properties we would like of an HTML archive format, beyond the basics of actually capturing a page in the first place: it should not depend in any way on the original web page, because then it is not an archive and will inevitably break; it should be easy to manage and store, so you can scalably create them and store them for the long run; and it should be efficient, which for HTML largely means that readers should be able to download only the parts they need in order to view the current page.

HTML Trilemma

No current format achieves all 3. The built-in web browser save-as-HTML format achieves single and efficient, but not static; save-as-HTML-with-directory achieves static partially and efficient, but not single; MHTML, MAFF, SingleFile, & SingleFileZ (a ZIP-compressed variant) achieve static, single, but not efficiency; WARCs/WACZs achieve static and efficient, but not single (because while the WARC is a single file, it relies on a complex software installation like WebRecorder/Replay Webpage to display).

An ordinary ‘save as page HTML’ browser command doesn’t work because “Web Page, HTML Only” leaves out most of a web page; even “Web Page, Complete” is inadequate because a lot of assets are dynamic and only appear when you interact with the page—especially images. If you want a static HTML archive, one which has no dependency on the original web page or domain, you have to use a tool specifically designed for this. I usually use SingleFile. SingleFile produces a static snapshot of the live web page, while making sure that lazy-loaded images are first loaded, so they are included in the snapshot.

SingleFile often produces a useful static snapshot. It also achieves another nice property: the snapshot is a single file, just a simple single .html file, which makes life so much easier in terms of organizing and hosting. Want to mirror a web page? SingleFile it, and upload the resulting single file to a convenient directory somewhere, boom—done forever. Being a single file is important on Gwern.net, where I must host so many files, and I run so many lints and checks and automated tools and track metadata etc. and where other people may rehost my archives.

However, a user of SingleFile quickly runs into a nasty drawback: snapshots can be surprisingly large. In fact, some snapshots on Gwern.net are over half a gigabyte! For example, the homepage for the research project “PaintsUndo: A Base Model of Drawing Behaviors in Digital Paintings” is 485MB after size optimization, while the raw HTML is 0.6MB. It is common for an ordinary somewhat-fancy Web 2.0 blog post like a Medium.com post to be >20MB once fully archived. This is because such web pages wind up importing a lot of fonts, JS, widgets and icons etc., all of which assets must be saved to ensure it is fully static; and then there is additional wasted space overhead due to converting assets from their original binary encoding into Base64 text which can be interleaved with the original HTML.

This is especially bad because, unlike the original web page, anyone viewing a snapshot must download the entire thing. That 500MB web page is possibly OK because a reader only downloads the images that they are looking at; but the archived version must download everything. A web browser has to download the entire page, after all, to display it properly; and there is no lazy-loading or ability to optionally load ‘other’ files—there are no other files ‘elsewhere’, that was the whole point of using SingleFile!

Hence, a SingleFile archive is static, and a single file, but it is not efficient: viewing it requires downloading unnecessary assets.

So, for some archives, we ‘split’ or ‘deconstruct’ the static snapshot back into a normal HTML file and a directory of asset files, using deconstruct_singlefile.php (which incidentally makes it easy to re-compress all the images, which produces large savings as many websites are surprisingly bad at basic stuff like PNG/JPG/GIF compression); then we are back to a static, efficient, but not single file, archive.

This is fine for our auto-generated local archives because they are stored in their own directory tree which is off-limits to most Gwern.net infrastructure (and off-limits to search engines & agents or off-site hotlinking), and it doesn’t matter too much if they litter tens of thousands of directories and files. It is not fine for HTML archives I would like to host as first-class citizens, and expose to Google, and hope people will rehost someday when Gwern.net inevitably dies.

So, we could either host a regular SingleFile archive, which is static, single, and inefficient; or a deconstructed archive, which is static, multiple, and efficient, but not all 3 properties.

This issue came to a head in January 2026 when I was archiving the Internet Archive snapshots of Brian Moriarty’s famous lectures “Who Buried Paul?” and “The Secret of Psalm 46”, since I noticed while writing an essay drawing on them that his whole website had sadly gone down. I admire them and wanted to host them properly so people could easily find my fast reliable mirrors (unlike the slow, hard-to-find, unreliable IA versions), but realized I was running into our long-standing dilemma: they would be efficient in the local archive system after being split, but unfindable; or if findable, inefficiently large and reader-unfriendly. Specifically, the video of “Who Buried Paul?” was not a problem because it had been linked as a separate file, so I simply converted it to MP4 and edited the link; but “The Secret of Psalm 46” turned out to inline the OGG/MP3 recordings of the lecture and abruptly increased from <1MB to 286MB.

I discussed it with Said Achmiz, and he began developing a fix.

Trisecting

To achieve all 3, we need some way to download only part of a file, and selectively download the rest. This lets us have a single static archive of potentially arbitrarily large size, which can safely store every asset which might be required.

HTTP already easily supports selective downloading via the ancient HTTP Range query feature, which allows one to query for a precise range of bytes inside a URL. This is mostly used to do things like resume downloads, but you can also do interesting things like run databases in reverse: a web browser client can run a database application locally which reads a database file stored on a server, because Range queries let the client download only the exact parts of the database file it needs at any given moment, as opposed to the entire thing (which might be terabytes in size).

This is how formats like WARC can render efficiently: host a WARC as a normal file, and then simply range-query the parts displayed at any moment.

The challenge is the first part: how do we download only the original HTML and subsequently only the displayed assets? If we have a single HTML file and then a separate giant archive file, we could easily just rewrite the HTML using JS to point to the equivalent ranges in the archive file (or do something server-side), but that would achieve only static and efficiency, not single file. If we combine them, like SingleFile, we are back to static and single file, but not efficiency.

The simplest solution here would be to decide to complicate the server itself and do the equivalent of deconstruct_singlefile.php on the fly. HTML requests, perhaps detecting some magic string in the URL like .singlefile.html, is handed to a CGI proxy process, which splits the original single HTML file into a normal HTML file with lazy-loaded references. The client browser sees a normal multiple efficient HTML, while everything on server sees a static single inefficient HTML. (A possible example is WWZ.)

While this solves the immediate Gwern.net problem, it does so at the permanent cost of server complexity, and does not do much to help anyone else. (It is unrealistic to expect more than a handful of people to modify their servers this invasively.) I also considered taking the WARC red pill and going full WebRecorder, but quailed.

Download Stopping Mechanisms

How can we trick an HTML file into acting like a tarball or ZIP file, with partial random access?

Our initial approach was to ship an HTML + JS header with an appended archive, where the JS would do HTTP Range queries into the appended binary archive; the challenge, however, was to stop the file from downloading past the header. To do this, we considered some approaches ‘outside’ the page, like encoding the archive index into the filename/URL itself (ie. foo.gwtar-$N.html) and requiring the server to parse $N out and slice the archive down to just the header, which then handled the range requests; this minimized how much special handling the server did, while being backwards/forwards-compatible with non-compliant servers (who would ignore the index and simply return the entire file, and be no worse than before). This worked in our prototypes, but required at least some server-side support and also required that the header be fixed-length (because any changes would in length would invalidate the index).

Eventually, Achmiz realized that you can stop downloading from within an HTML page, using the JS command window.stop()! MDN (>96% support, spec):

The window.stop() stops further resource loading in the current browsing context, equivalent to the stop button in the browser.

Because of how scripts are executed, this method cannot interrupt its parent document’s loading, but it will stop its images, new windows, and other still-loading objects.

This is precisely what we need, and the design falls into place.

Concatenated Archive Design

A Gwtar is an HTML file with a HTML + JS + JSON header followed by a tarball and possibly further assets. (A Gwtar could be seen as almost a polyglot file is a file valid as more than one format—in this case, a .html file that is also a .tar archive, and possibly .par2. But strictly speaking, it is not.)

Creation

We provide a reference PHP script, deconstruct_singlefile.php, which creates Gwtars from SingleFile HTML snapshots.

It additionally tries to recompress JPG/PNG/GIFs before storing in the Gwtar, and then appends PAR2 FEC.

Example command to replace the original 2010-02-brianmoriarty-thesecretofpsalm46.html by 2010-02-brianmoriarty-thesecretofpsalm46.gwtar.html with PAR2 FEC:

php ./static/build/deconstruct_singlefile.php --create-gwtar --add-fec-data \
    2010-02-brianmoriarty-thesecretofpsalm46.html

Implementation

Details

The simple approach is to download the binary assets, encode them into Base64 text, and inject them into the HTML DOM. This is inefficient in both compute and RAM because the web browser must immediately reverse this to get a binary to work with. So we actually use the browser optimization of blobs to just pass the binary asset straight to the browser.

A tricky bit is that inline JS can depend on “previously loaded” JS files, which may not have actually loaded yet because the first attempt failed (of course) and the real Range request is still racing. We currently solve this by just downloading all JS before rendering the HTML, at some cost to responsiveness.

So, a web browser will load a normal web page; the JS will halt its loading; a new page loads, and all of its requests initially fail but get repeated immediately and work the second time; the entire archive never gets downloaded unless required. All assets are provided, there is a single Gwtar file, it is efficient; it doesn’t require JS for archival integrity, as just the entire archive downloads if the JS is not executed; and it is cross-platform and standards-compliant, requires no server-side support or future users/hosts to do anything whatsoever, and is a transparent, self-documenting file format which can be easily converted back to a ‘normal’ multiple-file HTML (cat foo.gwtar.html | perl -ne'print $_ if $x; $x=1 if /<!-- GWTAR END/' | tar xf -) or a user can just re-archive it normally with tools like SingleFile.

Fallback

In the event of JS problems, a <noscript> message explains what the Gwtar format is and why it requires JS, and links to this page for more details.

It also detects whether range requests are supported or downgraded to requesting the entire file. If the latter, it will start rendering it.

This is not as slow as it seems because we can benefit from connection level compression like gzip or Brotli compression. And because our preprocessing linearize the assets in dependency order, we receive the bytes in order of page appearance, and so in this mode, the “above the fold” images and stuff will still load first and quickly. (This in comparison to the usual SingleFile, where you have to receive every single asset before you’re done, and which may be slower.)

Compression

Gwtar does not directly support deduplication or compression.

Gwtars may overlap and have redundant copies of assets, but because they will be stored bit-identical inside the tarballs, a de-duplicating filesystem can transparently remove most of that redundancy.

Media assets like MP3 or JPEG are already compressed, and can be compressed during the build phase by a gwtar implementation.

The HTML text itself could be compressed; it is currently unclear to me how Gwtar’s range requests interact with transparent negotiated compression like Brotli compression (which for Gwern.net was as easy as enabling one option in Cloudflare). RFC 7233 doesn’t seem to give a clear answer about this, and the cursory and unhelpful discussion here seems to indicate that the range requests would have to be interpreted relative to the compressed version rather than the original, which is useful for the core use-case of resuming downloads but not for our use-case. So I suspect that probably Cloudflare would either disable Brotli, or downgrade to sending the entire file instead. It is possible that “transfer-encoding” solves this, but as of 2018, Cloudflare didn’t support it, making it useless for us and suggesting little support in the wild.

If this is a serious problem, it may be possible to compress the HTML during the Gwtar generation phase and adjust the JS.

Limitations

Local Viewing

Strangely, the biggest drawback of Gwtar turns out to be local viewing of HTML archives. SingleFileZ encounters the same issue: in the name of security (origin/CORS/sandboxing), browsers will not execute certain requests in local HTML pages, so it will break, as it is no longer able to request from itself.

We regard this as unfortunate, but an acceptable tradeoff, as for local browsing, the file can be easily converted back to the non-JS dependent multiple/single-file HTML formats.

Range Request Support

Range requests are old, standardized, and important for resuming downloads or viewing large media files like video, and every web server should, in theory, support it by default. In practice, there may be glitches, and one should check.

An example curl command which should return a HTTP 206 (not 200) request if range requests are correctly working:

curl --head --header "Range: bytes=0-99" 'https://gwern.net/doc/philosophy/religion/1999-03-17-brianmoriarty-whoburiedpaul.gwtar.html'
# HTTP/2 206
# date: Sun, 25 Jan 2026 22:20:57 GMT
# content-type: x-gwtar
# content-length: 100
# server: cloudflare
# last-modified: Sun, 25 Jan 2026 07:08:33 GMT
# etag: "6975c171-7aeb5c"
# age: 733
# cache-control: max-age=77760000, public, immutable
# content-disposition: inline
# content-range: bytes 0-99/8055644
# cf-cache-status: HIT
# ...

Servers should serve Gwtar files as text/html if possible. This may require some configuration (eg. in nginx 2), but should be straightforward.

Cloudflare Is Broken

However, Cloudflare has an undocumented, hardwired behavior: its proxy (not cache) will strip Range request headers for text/html responses regardless of cache settings. This does not break Gwtar rendering, of course, but it does break efficiency and defeats the point of Gwtar for Gwern.net

As a workaround, we serve Gwtars with the MIME type x-gwtar—web browsers like Firefox & Chromium will content-sniff the opening <html> tag and render correctly, while Cloudflare passes Range requests through for unrecognized types. (This is not ideal, but a more conventional MIME type like application/... results in web browsers downloading the file without trying to render it at all; and using a MIME type trick is better than alternatives like trying to serve Gwtars as MP4s, using a special-case subdomain just to bypass Cloudflare completely, using complex tools like Service Workers to try to undo the removal, etc.)

Accessing Binary Assets

Because a Gwtar can store large binary assets without burdening the viewer and is an archive format, it may be useful for reproducible science/statistics: include datasets, such as Sqlite3 databases, and do computation on them like visualization or analysis. The question is, how do we ensure that assets get referenced in a way that SingleFile can “see” them and include them inline (to be stored in the final Gwtar as split-out objects), and then addressed and loaded by simple user JS, in a way which still works without Gwtar?

A potential approach in Gwtar v1 would be to reference all such assets using the <object> tag 3, and then the user JS adds a simple listener hook to the load event, which will fire either when the browser loads the asset normally (multi-file) or when Gwtar completes its range-fetch rewrite, and then kicks off the actual userland work. This does not require any unusual or contorted user JS, appears to be backwards/forwards compatible, and to satisfy all our desiderata.

Untested pseudo-code:

<object id="dataset" data="dataset.sqlite3" type="application/x-sqlite3" width="0" height="0"></object>

<script>
document.getElementById('dataset').addEventListener('load', function () {
    fetch(this.data)
        .then(function (r) { return r.arrayBuffer(); })
        .then(function (buf) {
            // `buf` is the raw .sqlite3 bytes.
            // Hand off to whatever SQL-in-JS library you're using.
        });
});
</script>

Optional Trailing Data

The appended tarball can itself be followed by additional arbitrary binary assets, which can be large since they will usually not be downloaded. (While the exact format of each appended file is up to the users, it’s a good idea to wrap them in tarballs if you can.)

This flexibility is intended primarily for allowing ad hoc metadata extensions like cryptographic signatures or forward error correction (FEC).

FEC

The Gwern.net generation script uses this feature to add par2 FEC in an additional tarball.4 This allows recovery of the original Gwtar if it has been partially corrupted or lost. (It cannot recover loss of the file as a whole, which is why FEC is ideally done over large corpuses, and not individual files, but this is better than nothing, and gives us free integrity checking as well.)

PAR2 can find its FEC data even in corrupted files by scanning for FEC data (“packets”) it recognizes, while tar ignores appended data; so adding, say, 25% par2 FEC is as simple as running par2create -r25 -n1 foo.gwtar.html && tar cf. - foo.gwtar.html.par2 foo.gwtar.html.vol*.par2 >> foo.gwtar.html && rm foo.gwtar.html*.par2, and repairing a corrupted file is as simple as ln --symbolic broken.gwtar.html broken.gwtar.html.par2 && par2repair broken.gwtar.html.par2 broken.gwtar.html.5

This yields the original foo.gwtar.html without any FEC. A repaired Gwtar file can then have fresh FEC added to be just like the old Gwtar + FEC archive, or be integrated in some broader system which achieves long-term protection some other way.

Signing

A simple form of cryptographic signing would be to use GPG to sign it as a normal, separate, signature file (creates foo.gwtar.html.sig): gpg --detach-sign --armor foo.gwtar.html.

And we could also append an ASCII ‘armored’ GPG signature, as it won’t confuse tar, like gpg --detach-sign --armor foo.gwtar.html >> foo.gwtar.html. Since GPG won’t munge a file like PAR2 will, an adhoc format would be to wrap it in tar to assist extracting:

gpg --detach-sign --armor foo.gwtar.html
tar cf. - foo.gwtar.html.sig >> foo.gwtar.html
rm foo.gwtar.html.sig

or in magic text, like a HTML comment:

# sign and append
FILE="foo.gwtar.html"
gpg --detach-sign --armor -o "$FILE".asc "$FILE"
echo '<!-- GWTAR-GPG-SIG' >> "$FILE"
cat "$FILE".asc >> "$FILE"
echo '-->' >> "$FILE"
rm "$FILE".asc

# Extract and verify:
SIG=$(mktemp XXXXXX.asc)
CONTENT=$(mktemp)
sed --quiet '/<!-- GWTAR-GPG-SIG/,/-->$/p' "$FILE" |
    grep -Ev 'GWTAR-GPG-SIG|-->' > "$SIG"
sed '/<!-- GWTAR-GPG-SIG/,$d' "$FILE" > "$CONTENT"
gpg --verify "$SIG" "$CONTENT"
rm "$SIG" "$CONTENT"

Validation tool
Checking of hashsums when rendering (possibly async or deferred)
More aggressive prefetching of assets
Integration into SingleFile (possibly as a “SingleFileZ2” forma?)
Testing: corpus of edge-case test files (inline SVG, srcset, CSS @import chains, web fonts, data URIs in CSS…)

A Gwtar v2 could add breaking changes like:

format provides more rigorous validation/checking of HTML & assets; require HTML & asset validity, assets all decode successfully, etc.
standardize appending formats
require FEC
built-in compression with Brotli/gzip for formats not already compressed
multi-page support

One would try to replace MAFF’s capability of creating sets of documents which are convenient to link/archive and can automatically share assets for de-duplication (eg. page selected by a built-in widget, or perhaps by a hash-anchor like archive.gwtar.html#page=foo.html? Can an initial web page open new tabs of all the other web pages in the archive?)
Better de-duplication, eg. content-addressed asset names (hash-based) enabling deduplication across multiple gwtars

Bibliography

[Bibliography of links/references used in page]

Hacker Times