Monolith – CLI tool for saving complete web pages as a single HTML file (github.com)

772 points by iscream26 2 years ago | 151 comments

simonw 2 years ago |

Well this is fun... from the README here I learned I can do this on macOS:

    /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
    --headless --incognito --dump-dom https://github.com > /tmp/github.html

And get an HTML file for a page after the JavaScript has been executed.

Wrote up a TIL about this with more details: https://til.simonwillison.net/chrome/headless

My own https://shot-scraper.datasette.io/ tool (which uses headless Playwright Chromium under the hood) has a command for this too:

    shot-scraper html https://github.com/ > /tmp/github.html

But it's neat that you can do it with just Google Chrome installed and nothing else.

mkl 2 years ago | |

Can shot-scraper load a bunch of content on an "infinite scroll" page before saving? I'm guessing Monolith can't as it has no JS. The most effective way I've found to work through the history of a big YouTube channel is to hold page-down for a while then save to a static "Web Page, Complete" HTML file, but it's a bit clunky.

simonw 2 years ago | | |

shot-scraper has a feature that's meant to help with this: you can inject additional JavaScript into the page that can then act before the screenshot is taken.

I use that for things like accepting cookie banners, but using it to scroll down to trigger additional loading should work too.

There's also a --wait-for option which takes a JavaScript expression and polls until it's true before taking the shot - useful for if there's custom loading behavior you need to wait for.

Documentation here: https://shot-scraper.datasette.io/en/stable/screenshots.html
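A rough sketch of the scroll-then-capture idea described above, not a tested recipe. It assumes shot-scraper waits for a Promise returned by the `--javascript` expression before taking the shot; `scroll_then_shot`, the number of scroll rounds, and the `SHOT_SCRAPER` override variable are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: scroll to the bottom a few times, then take the shot.
# Assumption: shot-scraper awaits the Promise returned by --javascript.
scroll_then_shot() {
  local url="$1" out="$2" rounds="${3:-5}"
  # Scroll once per second so lazy-loaded content has time to arrive.
  local js="new Promise(done => {
    let n = 0;
    const timer = setInterval(() => {
      window.scrollTo(0, document.body.scrollHeight);
      if (++n >= ${rounds}) { clearInterval(timer); done(); }
    }, 1000);
  })"
  # SHOT_SCRAPER is an illustrative override so the command can be dry-run.
  ${SHOT_SCRAPER:-shot-scraper} "$url" --javascript "$js" --wait 2000 -o "$out"
}

# Usage: scroll_then_shot 'https://example.com/feed' /tmp/feed.png 10
```

A fixed number of rounds is the crudest possible stopping rule; a `--wait-for` expression that checks for a specific element would be more precise where the page allows it.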

seanwilson 2 years ago | | |

I found other problems in this area when trying to do this, e.g. a lot of landing pages have hidden content that only animates in when you scroll down, subscribe/cookie overlays/modals covering content, hero headers that take the height of the viewport ("height: 100vh") so if you make the page height large for taking a screenshot the header will cover all of it, and sticky headers that get in the way if you want to scroll while taking multiple screenshots that are combined at the end.

You can come up with workarounds for each, but it's still hacky and there's always going to be other pages that need special treatment.

simpaticoder 2 years ago | |

Yes, it's a neat thing. I use a node script[1] that wraps the chrome invocation to do CLI-driven acceptance testing, with a node script that loads the site acceptance tests[2]. I adopted the simple convention of removing all body elements on successful completion and checking the output string, but I've also considered other methods like embedding a JSON string and parsing it back out.

1 - https://simpatico.io/acceptance.js

2 - https://simpatico.io/acceptance

bonestamp2 2 years ago | |

I guess while we're talking about useful CLI options for chrome, developers and hackers might enjoy this one... You can disable CORS in Chrome if you launch it from the command line with this switch: --disable-web-security

That's handy for when you're developing a front end and IT/devops hasn't approved/enabled the CORS settings on the backend yet, or if you're just hacking around and want to get data from somewhere that doesn't allow cross domain requests.
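A minimal sketch of launching Chrome that way, for local development only. The binary path is the macOS one quoted earlier in the thread; `launch_insecure_chrome` and the `CHROME` override variable are hypothetical, and pairing the switch with a separate `--user-data-dir` is an assumption based on reports that recent Chrome versions ignore it otherwise.

```shell
#!/usr/bin/env bash
# Hypothetical helper: start a throwaway Chrome with web security disabled.
launch_insecure_chrome() {
  local url="${1:-about:blank}"
  # macOS path from the example earlier in the thread; adjust for other systems.
  local chrome="${CHROME:-/Applications/Google Chrome.app/Contents/MacOS/Google Chrome}"
  # Use a fresh profile directory so the normal browsing profile never runs
  # with the same-origin policy switched off.
  local profile
  profile="$(mktemp -d)"
  "$chrome" --disable-web-security --user-data-dir="$profile" "$url"
}

# Usage: launch_insecure_chrome http://localhost:3000
```

Keeping this in a disposable profile limits the blast radius: nothing you are logged into in your everyday profile is exposed to pages opened in the insecure window.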

wodenokoto 2 years ago | |

Can Firefox do the same?

jaimex2 2 years ago | |

Does shot-scraper have a work around for sites that detect headless chrome? ie. news.com.au , nowsecure.nl

simonw 2 years ago | | |

No, nothing like that. I wonder how that detection works?

I tried this and it took a shot of a "bot detected" screen:

    shot-scraper https://news.com.au/

But when I used interactive mode I could take the screenshot - run this:

    shot-scraper -i https://news.com.au/

It opens a Chrome window. Then hit "enter" in the CLI tool to take the screenshot.

msmagh 2 years ago | |

Thanks for sharing, a similar command is available on Windows as well.

I tested the command below in PowerShell:

    & 'C:\Program Files\Google\Chrome\Application\chrome.exe' `
      --headless `
      --print-to-pdf="$env:USERPROFILE\Downloads\page.pdf" `
      '<url>'

aidenn0 2 years ago | |

I wonder if there's an option to wait for a certain amount of time, or a particular event or something. I was trying to capture a page a few different ways, and most of them ended up with the Cloudflare "checking your browser" page.

dotancohen 2 years ago | |

Thank you for shot-scraper! I've tested it in the past, but something severely missing from all screenshot tools, shot-scraper included, is a way to avoid screenshoting popups. For instance, newsletter or login popups, GDPR popups, etc. If shot-scraper has a reliable way of screenshoting websites while avoiding these popups, I would love to know.

I'm on mobile so don't have access to my notes, but I'm pretty sure that a year ago when I tried there was no reliable way to screenshot e.g. the BBC news website without getting the popups.

Again, thank you.

simonw 2 years ago | | |

Try this:

    shot-scraper -h 800 'https://www.spiegel.de/international/' \
      --wait-for "() => {
        const div = document.querySelector('[id^=\"sp_message_container\"]');
        if (div) {
          div.remove();
          return true;
        }
      }"

shot-scraper runs that --wait-for script until it returns true. In this case we're waiting for the cookie consent overlay div to show up and then removing it before we take the screenshot.

Screenshots here: https://gist.github.com/simonw/de75355c39025f9a64548aa3366b1...

cjr 2 years ago | | |

I work on a paid screenshot api[0] where we have features to either hide these banners and overlays using css, or alternatively we run some javascript to send a click event to what we detect as the 'accept' button in order to dismiss the popups.

It's quite a painful problem and we screenshot many millions of sites a day, our success rate at detecting these is high but still not 100%.

We have gotten quite far with heuristics and are exploring whether we can get better results by training a model.

[0]:https://urlbox.com

grey8 2 years ago | | |

Just a thought, but what happens if you orchestrate a browser instance with an installed ad blocker like uBlock Origin?

DANmode 2 years ago | | |

Screenshot the archive.org render?

Exuma 2 years ago | |

Mmmm…. This is clever

samstave 2 years ago | |

Yay! I love shot-scraper - I wish you had made it a decade ago!

Thanks for shot-scraper.

Off the top of your head, what would be the easiest command to have shot-scraper barf out a directory of shot-scraper HTMLs each day from my daily browsing history?

This would be interesting if I have a browsing session for learning something and I am researching across a bunch of sites - roll it all up into a Digi-ography of the sites used in learning that topic?

---

I've always been baffled that this isn't an innate functionality in any app/OS - it's a damn computer - I should have a great ability to recall what it displays and what you have been doing.

Heck - we need our machines to write us a daily status report for what we did at the end of each day.

Surely that would change productivity, if you were forced to do a self-digital-confession and stare your ADHD and procrastination right in the face.

simonw 2 years ago | | |

Yeah, things like Archive Box are probably a better bet there. But... you could write a script that queries the SQLite database of your history, figures out the pages you visited then loops through and runs `shot-scraper html ... > ...html` against each one.

I just wasted a few minutes trying to get Claude 3 Opus to write me a script - nearly got there but Firefox had my SQLite database locked and I lost interest. My conversation so far is at: https://gist.github.com/simonw/9f20a02f35f7a129b9850988117c0...
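A sketch of what such a script might look like, under a few assumptions: Firefox history lives in `places.sqlite` in the `moz_places` table with `last_visit_date` in microseconds since the Unix epoch, and the `sqlite3` CLI is installed. Copying the database first sidesteps the lock mentioned above, although very recent visits may still be missing from the copy. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a URL into a safe file name ending in .html.
url_to_filename() {
  printf '%s\n' "$1" |
    sed -E 's#^https?://##; s#[^A-Za-z0-9._-]+#_#g; s#_+$##; s#$#.html#'
}

# Save every page visited in the last 24 hours as rendered HTML.
# $1: path to Firefox's places.sqlite, $2: output directory.
archive_todays_history() {
  local db="$1" outdir="$2" copy
  copy="$(mktemp)"
  cp "$db" "$copy"   # query a copy, since Firefox keeps the live file locked
  mkdir -p "$outdir"
  # last_visit_date is stored in microseconds since the Unix epoch.
  sqlite3 "$copy" "SELECT url FROM moz_places
    WHERE last_visit_date > (strftime('%s','now') - 86400) * 1000000
      AND url LIKE 'http%'" |
  while IFS= read -r url; do
    shot-scraper html "$url" > "$outdir/$(url_to_filename "$url")"
  done
  rm -f "$copy"
}

# Usage: archive_todays_history ~/path/to/places.sqlite ~/history/"$(date +%F)"
```

Two URLs that differ only in punctuation can map to the same file name here; a hash suffix would avoid that at the cost of readability.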

genewitch 2 years ago | | |

This used to be fairly simple to do before https everywhere, just install squid (or whatever) and cron the cache folder to a zip file once a day or whatever.

There's paid solutions that kinda do what you want, but they capture all text on your screen and OCR it to make it searchable, which at least lets you backtrack and has the added advantage that it will make pdfs, meme images, etc searchable, too. last i heard it was mac only but a few folks mentioned some windows software that does it too.

as an aside i don't consider reading/learning nearly all day to be a net negative, even if ADD is to blame. (i haven't had the "H" since i was a child.) A status report wouldn't "stare" me in the face; in fact, it would be nice to have some language model take the daily report and over time suggest other things to read or possible contradictions to link to.

jimmySixDOF 2 years ago | | |

look at ArchiveBox from the comments below

Hendrikto 2 years ago | | |

> Heck - we need our machines to write us a daily status report for what we did at the end of each day.

I am sure Trump, Xi, Putin, etc. would like that very much.

russellbeattie 2 years ago |

If anyone is interested, I wrote a long blog post where I analyzed all the various ways of saving HTML pages into a single file, starting back in the 90s. It'll answer a lot of questions asked in this thread (MHTML, SingleFile, web archive, etc.)

https://www.russellbeattie.com/notes/posts/the-decades-long-...

rnewme 2 years ago | |

Cool post. You should make an HN entry.

andai 2 years ago |

I always ship single file pages whenever possible. My original reasoning for this was that you should be able to press view source and see everything. (It follows that pages should be reasonably small and readable.)

An unexpected side effect is that they are self contained. You can download pages, drag them onto a browser to use them offline, or reupload them.

I used to author the whole HTML file at once, but lately I am fond of TypeScript, and made a simple build system to let me write games in TS and have them built to one HTML file. (The sprites are base64 encoded.)

On that note, it seems (there is a proposal) that browsers will eventually get support for TypeScript syntax, at which point I won't need a compiler / build step anymore. (Sadly they won't do type checking, but hey... baby steps!)

lopkeny12ko 2 years ago |

How does this compare to SingleFile?

https://www.npmjs.com/package/single-file-cli

gildas 2 years ago | |

Author of SingleFile here, one of the major differences is that monolith doesn't use a web browser to take page captures. As a result, it doesn't support JavaScript, for example. SingleFile, on the other hand, requires a Chromium-based browser to be installed. It should also produce smaller pages and is capable of generating ZIP or self-extracting ZIP files. However, it will take longer to capture a page. Note that since version 2, it is now possible to download executable files of the CLI tool [1].

[1] https://github.com/gildas-lormeau/single-file-cli/releases

darkteflon 2 years ago | | |

SingleFile is amazing - use it tens of times every day across desktop and mobile. Can’t recall a single instance of it breaking. Thank you sincerely for your excellent work.

n8henrie 2 years ago | | |

> requires a Chromium-based browser to be installed

Not to try to correct the author here, but it supports geckobrowser as well (not just chromium-based), right?

I'm currently trying to package for nixpkgs[0] and am using Firefox for the checkPhase.

[0]: https://github.com/NixOS/nixpkgs/pull/283878

codazoda 2 years ago | | |

What does SingleFile do? The intro tells you how to run it, but not what it does.

mikae1 2 years ago | | |

> SingleFile, on the other hand, requires a Chromium-based browser to be installed.

I'm using it as a Firefox extension. Am I missing something?

Capricorn2481 2 years ago | | |

On the front page, Monolith says it embeds javascript. Are you saying it doesn't use this javascript to render the page before taking a snap shot?

DavideNL 2 years ago | | |

@gildas Curious, is there any specific reason why singlefile-cli is not available in Homebrew on macOS ?

PS. I use SingleFile a lot, it's great... Thank you!

tiagod 2 years ago | | |

I've been using SingleFile for ages now... it's my favorite browser extension after uBlock, thank you for your great tool! :)

jchook 2 years ago |

Hm, very interesting, especially for bookmarking/archiving.

I'm curious, why not use the MHTML standard for this?

- AFAIK data URIs have practical length limits that vary per browser. MHTML would enable bundling larger files such as video.

- MHTML would avoid transforming meaningful relative URLs into opaque data URIs in the HTML attributes.

- MHTML is supported by most major browsers in some way (either natively in Chrome or with an extension in Safari, etc).

- MIME defines a standard for putting pure binary data into document parts, so it could avoid the 33% size inflation from base64 encoding. That said, I do not know if the `binary` Content-Transfer-Encoding is widely supported.
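The 33% figure in the last point comes from base64 emitting 4 output characters for every 3 input bytes, so n bytes encode to 4 * ceil(n / 3) characters. A small sketch to check that arithmetic against the real `base64` tool (the function names are made up for illustration):

```shell
#!/usr/bin/env bash
# Predicted base64 length for n input bytes: 4 * ceil(n / 3).
base64_len() {
  local n="$1"
  echo $(( (n + 2) / 3 * 4 ))
}

# Measured length: encode n zero bytes and count characters, ignoring the
# line breaks that some base64 implementations insert.
measured_len() {
  head -c "$1" /dev/zero | base64 | tr -d '\n' | wc -c | tr -d ' '
}

# Usage: base64_len 3000; measured_len 3000
```

For a 3000-byte asset both functions report 4000, i.e. one third larger, before any line wrapping or data-URI prefix is added.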

snshn 2 years ago | |

MHTML support is planned, there's a couple of other problems that need to be resolved first, but it's a good format for archiving, been requested many times

jchook 2 years ago | | |

Thanks for the reply. Very exciting. I would love to see MHTML support on this.

Hamuko 2 years ago | |

>MHTML is supported by most major browsers in some way

Firefox? What about mobile versions of browsers?

jchook 2 years ago | | |

We should submit a PR

keyle 2 years ago |

I am really loving these 'new' pure rust tools that are super fast and efficient, with lovely API/doco. Ah, it feels like the 90s again... Minus 50% bugs probably.

snshn 2 years ago | |

Hey, at least no memory leaks this time! Ü

al_borland 2 years ago |

I use read-it-later type services a lot, and save more than I read. On many occasions I've gone back to finally read things and find that the pages no longer exist. I'm thinking moving to some kind of offline archival version would be a better option.

andai 2 years ago |

Does anyone know how an entire website can be restored from Wayback Machine? A beloved website of mine had its database deleted. Everything's on Internet Archive, but I think I'd have to

(1) scrape it manually (they don't seem to let you download an entire site?),

(2) write some python magic to fix the css URLs etc so the site can be reuploaded (and maybe add .html to the URLs? Or just make everything a folder with index.html...)

It seems like a fairly common use case but I barely found functional scrapers, let alone anything designed to restore the original content in a useful form.
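For step (2), one possible starting point is stripping the Wayback Machine prefix from rewritten links. This is a sketch that assumes archived links look like `https://web.archive.org/web/<timestamp><optional two-letter modifier>_/<original URL>` or the same path without the host; `strip_wayback_prefix` is a hypothetical name, and the pattern would also hit any original path that happens to look like `/web/<digits>/`.

```shell
#!/usr/bin/env bash
# Hypothetical filter: remove Wayback Machine prefixes from links on stdin,
# leaving the original URL that followed the timestamp.
strip_wayback_prefix() {
  sed -E 's#(https?:)?(//web\.archive\.org)?/web/[0-9]{4,14}([a-z]{2}_)?/##g'
}

# Usage: strip_wayback_prefix < saved-page.html > restored-page.html
```

Turning the resulting absolute URLs into site-relative ones is a separate pass that depends on the original domain.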

gildas 2 years ago | |

It's documented here: https://wiki.archiveteam.org/index.php?title=Restoring

belthesar 2 years ago | |

I bet the ArchiveTeam might be able to help you out with this. They were quite helpful when I wanted to make sure a site was preserved, and might be able to help you as well, or at least point you in the right direction. https://wiki.archiveteam.org/

joeyhage 2 years ago |

It would be awesome to see support for following links to a specified depth, similar to [Httrack](https://www.httrack.com/)

codetrotter 2 years ago | |

I made a basic crawler using Firefox, thirtyfour https://docs.rs/thirtyfour/latest/thirtyfour/ and squid

Basically, I took a start URL for the crawl, and my program would load the page in Firefox using thirtyfour, and then extract all links from the page and use some basic rules for keeping track of which ones to visit and in which order. I had Squid proxy configured to save all traffic that passed through it.

It worked ok-ish. I only really stopped that project because of a hardware malfunction.

The main annoyance that I didn’t get around to solving was being smarter about not trying to load non-HTML content that was already loaded anyway as part of the page, because the way I extracted links from the page also picked up the URLs of JS, CSS etc. that were referenced.

gildas 2 years ago | |

You can have a look at the last 2 examples here [1].

[1] https://github.com/gildas-lormeau/single-file-cli?tab=readme...

arp242 2 years ago |

I wrote something very similar a few years ago – https://github.com/arp242/singlepage

I mostly use it for a few Go programs where I generate HTML; I can "just" use links to external stylesheets and JavaScript because that's more convenient to work with, and then process it to produce a single HTML file.

fagrobot 2 years ago |

https://github.com/gildas-lormeau/SingleFile

pbnjeh 2 years ago |

Does anyone remember the Firefox extension Scrapbook, from "back in the day"? I used to use it a lot.

Look "back" 5 - 10 years, or more, and it's striking how many web resources are no longer available. A local copy is your only insurance. And even then, having it in an open, standards compliant format is important (e.g. a file you can load into a browser -- I guess either a current browser or a containerized/emulated one from the era of the archived resource).

Something that concerns me about JavaScript-ed resources and the like. Potentially unlimited complexity making local copies more challenging and perhaps untenable.

toomuchtodo 2 years ago |

Show HN: CLI tool for saving web pages as a single file - https://news.ycombinator.com/item?id=20774322 - August 2019 (209 comments)

lagt_t 2 years ago |

I remember IE5 was able to do this lol. It fell out of vogue for some reason, glad to see the concept is still alive.

phrz 2 years ago | |

Safari does this with .webarchive files

berkes 2 years ago | |

Firefox can still do it.

Hamuko 2 years ago | | |

Can it? I'm only having Firefox save a bunch of files.

thrdbndndn 2 years ago | | |

Chrome can too

max_ 2 years ago |

It still blows my mind that browsers don't provide features like this out of the box.

Gormo 2 years ago | |

The MHTML format [1] has been around for 25 years and was natively supported by multiple browsers for decades. Modern browsers have regressed in functionality.

[1]: https://en.wikipedia.org/wiki/MHTML

hu3 2 years ago | |

Chrome does support it:

https://i.imgur.com/HF7GXEI.png

Alifatisk 2 years ago | |

I think they do? Have you tried hitting cmd+s or ctrl+s? You can save webpages like that.

But I don’t know if they can compress everything into a single html file though.

vanderZwan 2 years ago | | |

Last time I tried that it saved a static version of the current DOM, instead of the page source. I'm assuming that the reasoning behind that is that most people want to save a snapshot of what they are currently seeing, and that this is the easiest way to have somewhat reliable results for that.

max_ 2 years ago | | |

A lot of the CSS & JavaScript is usually broken with ctrl+s.

A great option used to be the MHTML format in Chrome. (It had to be enabled in chrome flags.)

But MHTML seems to have been removed from Chrome recently.

snshn 2 years ago | |

So true. Monolith is using libraries made by Mozilla for their Rust-driven browser engine (which, I believe, never came to be). I really would love for it to be a part of some browser one day, the demand is clearly there. Nobody likes to have a file+folder abomination on their drive, or some shady formats like .webarchive

publius_0xf3 2 years ago |

Awesome tool. A note to the devs: the latest version on winget is v2.7.0, which is several months behind the latest version.

k1ck4ss 2 years ago |

How would I archive an on-prem hosted redmine solution (https://www.redmine.org/)? It is many, many years old and I want to abandon it for good but save everything and archive it. Is that possible with monolith?

planb 2 years ago | |

You're probably better off with a recursive wget here. IIRC redmine was not really javascript heavy and monolith looks to me like it only saves one page.
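A hedged sketch of such a recursive wget run, using only standard wget options. `mirror_site`, the one-second delay, and the `WGET` override variable are illustrative choices, and a Redmine instance that requires login would additionally need session cookies passed to wget.

```shell
#!/usr/bin/env bash
# Hypothetical helper: mirror a site for offline browsing.
mirror_site() {
  local url="$1"
  # --mirror            recursive download with timestamping
  # --page-requisites   also fetch the CSS, images and scripts pages need
  # --convert-links     rewrite links to point at the local copies
  # --adjust-extension  add .html to pages served without an extension
  # --no-parent         never climb above the starting directory
  # --wait=1            pause a second between requests to go easy on the server
  ${WGET:-wget} --mirror --page-requisites --convert-links \
    --adjust-extension --no-parent --wait=1 "$url"
}

# Usage: mirror_site https://redmine.example.org/
```

Unlike monolith, this produces a directory tree rather than one file per page, which suits a whole-site archive better.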

farzadmf 2 years ago |

Ironically, I decided to try with the repo's own Github page, and when I open the resulting HTML file in Chrome, it's all errors in the console, and I don't see the `README` or anything

yencabulator 2 years ago | |

Github is a pile of Javascript that adds things to the DOM browser-side, the monolith README specifically says it does not run Javascript, and shows you a workaround for when that matters.

stringtoint 2 years ago |

Nice! Reminds me of the time I was working on a browser extension to do this.

causality0 2 years ago |

How's this better than the MHTML functionality built into my browser?

gildas 2 years ago | |

You can find a comparison of file formats here: https://github.com/gildas-lormeau/SingleFile?tab=readme-ov-f...

dosourcenotcode 2 years ago |

A cool tool to be sure.

However I feel this tool is a crutch for the stupid way browsers handle web pages and shouldn't be necessary in a sane world.

Instead of the bullshit browsers do where they save a page as "blah.html" file + "blah_files" folder they should instead wrap both in a folder that can then later be moved/copied as one unit and still benefit from its subcomponents being easily accessed / picked apart as desired.

genewitch 2 years ago | |

"save as [single] html" or whatever hasn't worked reliably in over a decade. I wrote a snapshotter that i could post in a slack alternative "!screenshot <URL>" and it would respond (eventually) with an inline jpeg and a .png link of that URL. As i mentioned upthread, this worked for a couple of years (2017-2020 or so) and then it became unreliable on some sites as well. as an example, old.reddit.com hellthread pages would only render blank white after the first couple dozen comments.

I haven't had the heart to try it with singlefile, but now that there's at least 3 tools that claim to do this correctly, i might try again. This tool, singlefile (which i already use but haven't tested on reddit yet) and archivebox. 4 tools, if you count the WARC stuff from archive.org

fs111 2 years ago |

https://en.wikipedia.org/wiki/WARC_(file_format)

victorbjorklund 2 years ago |

This is great. I have wished for something like this.

AdmiralAsshat 2 years ago |

So what happens if the page is behind a paywall and the embedded Javascript stores some authentication or phone-home code? Does that end up getting invoked on the monolith copy HTML?

I'm wondering how this would work if I wanted to use it to, say, save a quiz from Udemy for offline review.

hollander 2 years ago | |

yt-dlp can use browser cookies to get access to Facebook videos. This should have something similar.

    yt-dlp --cookies-from-browser firefox https://www.facebook.com/1234videos/5678/

dohello1 2 years ago |

and I thought my code pages were long haha

sunshine202022 2 years ago |

fun

ethanpil 2 years ago |

Nice. My next step: Figure out how to make a web extension 1 click button. Tab to Monolith to Joplin with a tag.

gildas 2 years ago | |

You could download SingleFile [1], configure a WebDAV server in the options page (cf. "Destination" section), and set up Joplin to synchronize with the server.

[1] https://github.com/gildas-lormeau/SingleFile

AdieuToLogic 2 years ago |

Or perhaps wget[0] as described here[1] and documented here[2] could do the trick.

0 - https://www.gnu.org/software/wget/

1 - https://tinkerlog.dev/journal/downloading-a-webpage-and-all-...

2 - https://www.gnu.org/software/wget/manual/wget.html

mattsan 2 years ago | |

This is addressed in the README and a comparison is given

AdieuToLogic 2 years ago | | |

> This is addressed in the README and a comparison is given

The only mention of wget in the README reads thusly:

  If compared to saving websites with wget -mpk, this tool
  embeds all assets as data URLs and therefore lets browsers
  render the saved page exactly the way it was on the
  Internet, even when no network connection is available.

This is not the only way to invoke wget in order to download a web page along with its assets. Should the introduction article I referenced above be deemed insufficient, consider this[0] as well.

0 - https://simpleit.rocks/linux/how-to-download-a-website-with-...