0 - https://www.gnu.org/software/wget/
1 - https://tinkerlog.dev/journal/downloading-a-webpage-and-all-...
The only mention of wget in the README reads thusly:
If compared to saving websites with wget -mpk, this tool
embeds all assets as data URLs and therefore lets browsers
render the saved page exactly the way it was on the
Internet, even when no network connection is available.
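For anyone wondering what "embeds all assets as data URLs" amounts to, here is a rough Python sketch. It is not the tool's actual code; `to_data_url` and `inline_image` are names made up for illustration, and the naive string replace stands in for real HTML rewriting.

```python
import base64

def to_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw asset bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

def inline_image(html: str, src: str, data: bytes, mime_type: str) -> str:
    """Swap one src="..." reference for its data URL (naive string replace)."""
    return html.replace(f'src="{src}"', f'src="{to_data_url(data, mime_type)}"')

# Hypothetical page with a single image reference; the bytes are a stand-in.
page = '<img src="logo.png">'
print(inline_image(page, "logo.png", b"\x89PNG", "image/png"))
```

Once every asset is rewritten this way, the HTML file no longer points at anything outside itself, which is why it renders without a network connection.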
This is not the only way to invoke wget in order to download a web page along with its assets. Should the introduction article I referenced above be deemed insufficient, consider this[0] as well.
0 - https://simpleit.rocks/linux/how-to-download-a-website-with-...
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
--headless --incognito --dump-dom https://github.com > /tmp/github.html
And get an HTML file for a page after the JavaScript has been executed.
Wrote up a TIL about this with more details: https://til.simonwillison.net/chrome/headless
My own https://shot-scraper.datasette.io/ tool (which uses headless Playwright Chromium under the hood) has a command for this too:
shot-scraper html https://github.com/ > /tmp/github.html
But it's neat that you can do it with just Google Chrome installed and nothing else.
I use that for things like accepting cookie banners, but using it to scroll down to trigger additional loading should work too.
There's also a --wait-for option which takes a JavaScript expression and polls until it's true before taking the shot - useful if there's custom loading behavior you need to wait for.
Documentation here: https://shot-scraper.datasette.io/en/stable/screenshots.html
You can come up with workarounds for each, but it's still hacky and there's always going to be other pages that need special treatment.
That's handy for when you're developing a front end and IT/devops hasn't approved/enabled the CORS settings on the backend yet, or if you're just hacking around and want to get data from somewhere that doesn't allow cross-domain requests.
I tried this and it took a shot of a "bot detected" screen:
shot-scraper https://news.com.au/
But when I used interactive mode I could take the screenshot - run this: shot-scraper -i https://news.com.au/
It opens a Chrome window. Then hit "enter" in the CLI tool to take the screenshot.
I tested the command below in PowerShell:
& 'C:\Program Files\Google\Chrome\Application\chrome.exe' `
--headless `
--print-to-pdf="$env:USERPROFILE\Downloads\page.pdf" `
'<url>'
I'm on mobile so don't have access to my notes, but I'm pretty sure that a year ago when I tried there was no reliable way to screenshot e.g. the BBC news website without getting the popups.
Again, thank you.
shot-scraper -h 800 'https://www.spiegel.de/international/' \
  --wait-for "() => {
    const div = document.querySelector('[id^=\"sp_message_container\"]');
    if (div) {
      div.remove();
      return true;
    }
  }"
shot-scraper runs that --wait-for script until it returns true. In this case we're waiting for the cookie consent overlay div to show up and then removing it before we take the screenshot.
Screenshots here: https://gist.github.com/simonw/de75355c39025f9a64548aa3366b1...
It's quite a painful problem and we screenshot many millions of sites a day, our success rate at detecting these is high but still not 100%.
We have gotten quite far with heuristics and are exploring whether we can get better results by training a model.
Thanks for shot scraper.
Off the top of your head, what would be the easiest command to have shot-scraper barf a directory of shot-scraper HTMLs each day from my daily browsing history?
This would be interesting if I have a browsing session for learning something and I am researching across a bunch of sites - roll it all up into a Digi-ography of the sites used in learning that topic?
---
I've always been baffled that this isn't an innate functionality in any app/OS - it's a damn computer - I should have a great ability to recall what it displays and what I have been doing.
Heck - we need our machines to write us a daily status report for what we did at the end of each day.
Surely that would change productivity, if you were forced to do a self-digital-confession and stare your ADHD and procrastination right in the face.
I just wasted a few minutes trying to get Claude 3 Opus to write me a script - nearly got there but Firefox had my SQLite database locked and I lost interest. My conversation so far is at: https://gist.github.com/simonw/9f20a02f35f7a129b9850988117c0...
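One common way around the locked database is to read a copy instead of the live file. A minimal sketch, assuming the usual `moz_places` table with `url` and `last_visit_date` columns; `recent_history_urls` is a made-up helper name, and copying only the main file may miss very recent visits that still sit in Firefox's write-ahead log.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path

def recent_history_urls(places_db: Path, limit: int = 50) -> list[str]:
    """Return the most recently visited URLs, read from a copy of places.sqlite."""
    with tempfile.TemporaryDirectory() as tmp:
        copy_path = Path(tmp) / "places-copy.sqlite"
        # Query the copy so the lock Firefox holds on the original is irrelevant.
        shutil.copy2(places_db, copy_path)
        conn = sqlite3.connect(copy_path)
        try:
            rows = conn.execute(
                "SELECT url FROM moz_places "
                "WHERE last_visit_date IS NOT NULL "
                "ORDER BY last_visit_date DESC LIMIT ?",
                (limit,),
            ).fetchall()
        finally:
            conn.close()
    return [row[0] for row in rows]
```

Each returned URL could then be handed to `shot-scraper html <url>` in a loop, writing one file per page into a dated directory.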
There's paid solutions that kinda do what you want, but they capture all text on your screen and OCR it to make it searchable, which at least lets you backtrack and has the added advantage that it will make pdfs, meme images, etc searchable, too. last i heard it was mac only but a few folks mentioned some windows software that does it too.
as an aside i don't consider reading/learning nearly all day to be a net negative, even if ADD is to blame. (i haven't had the "H" since i was a child.) A status report wouldn't "stare" me in the face; in fact, it would be nice to have some language model take the daily report and over time suggest other things to read or possible contradictions to link to.
I am sure Trump, Xi, Putin, etc. would like that very much.
https://www.russellbeattie.com/notes/posts/the-decades-long-...
An unexpected side effect is that they are self contained. You can download pages, drag them onto a browser to use them offline, or reupload them.
I used to author the whole HTML file at once, but lately I am fond of TypeScript, and made a simple build system to let me write games in TS and have them built to one HTML file. (The sprites are base64 encoded.)
On that note, it seems (there is a proposal) that browsers will eventually get support for TypeScript syntax, at which point I won't need a compiler / build step anymore. (Sadly they won't do type checking, but hey... baby steps!)
[1] https://github.com/gildas-lormeau/single-file-cli/releases
Not to try to correct the author here, but it supports Gecko-based browsers as well (not just Chromium-based ones), right?
I'm currently trying to package for nixpkgs[0] and am using Firefox for the checkPhase.
I'm using it as a Firefox extension. Am I missing something?
PS. I use SingleFile a lot, it's great... Thank you!
I'm curious, why not use the MHTML standard for this?
- AFAIK data URIs have practical length limits that vary per browser. MHTML would enable bundling larger files such as video.
- MHTML would avoid transforming meaningful relative URLs into opaque data URIs in the HTML attributes.
- MHTML is supported by most major browsers in some way (either natively in Chrome or with an extension in Safari, etc).
- MIME defines a standard for putting pure binary data into document parts, so it could avoid the 33% size inflation from base64 encoding. That said, I do not know if the `binary` Content-Transfer-Encoding is widely supported.
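The 33% figure in the last point falls out of base64 mapping every 3 input bytes to 4 output characters. A quick Python check of that arithmetic (the 3 MB payload is just a stand-in for a large asset, and `base64_length` is a helper named here for illustration):

```python
import base64
import math

def base64_length(n_bytes: int) -> int:
    """Length of padded base64 output for n_bytes of input."""
    return 4 * math.ceil(n_bytes / 3)

payload = bytes(3_000_000)  # stand-in for a 3 MB video or image
encoded = base64.b64encode(payload)

# The library output matches the 4-characters-per-3-bytes formula.
assert len(encoded) == base64_length(len(payload))
print(len(encoded) / len(payload))  # ratio of encoded size to raw size
```

The ratio comes out at roughly 1.33, which is the overhead a data URL pays and a binary MIME part would avoid.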
(1) scrape it manually (they don't seem to let you download an entire site?),
(2) write some python magic to fix the css URLs etc so the site can be reuploaded (and maybe add .html to the URLs? Or just make everything a folder with index.html...)
It seems like a fairly common use case but I barely found functional scrapers, let alone anything designed to restore the original content in a useful form.
Basically, I took a start URL for the crawl, and my program would load the page in Firefox using thirtyfour, and then extract all links from the page and use some basic rules for keeping track of which ones to visit and in which order. I had Squid proxy configured to save all traffic that passed through it.
It worked ok-ish. I only really stopped that project because of a hardware malfunction.
The main annoyance that I didn’t get around to solving was being smarter about not trying to load non-HTML content that had already been loaded anyway as part of the page. Because of the way I extracted links from the page, I also extracted the URLs of JS, CSS, etc. that were referenced.
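One way to approach that annoyance is to sort URLs by the tag they came from, so only anchors feed the crawl queue. A rough Python sketch using only the standard library (the original used Firefox via thirtyfour, so this is a swapped-in technique, and `LinkCollector` is a made-up name):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect page links (<a href>) separately from asset references."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.pages: set[str] = set()
        self.assets: set[str] = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("href"):
            self.pages.add(urljoin(self.base_url, attr["href"]))
        elif tag in ("script", "img") and attr.get("src"):
            self.assets.add(urljoin(self.base_url, attr["src"]))
        elif tag == "link" and attr.get("href"):
            self.assets.add(urljoin(self.base_url, attr["href"]))

# Hypothetical page fragment for illustration.
collector = LinkCollector("https://example.com/blog/")
collector.feed(
    '<a href="post-1.html">x</a><script src="/app.js"></script>'
    '<link rel="stylesheet" href="style.css">'
)
# Only anchors go in the crawl queue; assets were already fetched with the page.
to_visit = collector.pages - collector.assets
```

This only covers static markup; URLs that JavaScript requests at runtime would still need to be picked up from the proxy's log.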
[1] https://github.com/gildas-lormeau/single-file-cli?tab=readme...
I mostly use it for a few Go programs where I generate HTML; I can "just" use links to external stylesheets and JavaScript because that's more convenient to work with, and then process it to produce a single HTML file.
Look "back" 5 - 10 years, or more, and it's striking how many web resources are no longer available. A local copy is your only insurance. And even then, having it in an open, standards compliant format is important (e.g. a file you can load into a browser -- I guess either a current browser or a containerized/emulated one from the era of the archived resource).
Something that concerns me about JavaScript-ed resources and the like. Potentially unlimited complexity making local copies more challenging and perhaps untenable.
Show HN: CLI tool for saving web pages as a single file - https://news.ycombinator.com/item?id=20774322 - August 2019 (209 comments)
But I don’t know if they can compress everything into a single html file though.
A great option used to be the MHTML format in Chrome. (It had to be enabled in Chrome flags.)
But MHTML seems to have been removed from Chrome recently.
However I feel this tool is a crutch for the stupid way browsers handle web pages and shouldn't be necessary in a sane world.
Instead of the bullshit browsers do where they save a page as a "blah.html" file + "blah_files" folder, they should instead wrap both in a folder that can then later be moved/copied as one unit and still benefit from its subcomponents being easily accessed / picked apart as desired.
I haven't had the heart to try it with singlefile, but now that there's at least 3 tools that claim to do this correctly, i might try again. This tool, singlefile (which i already use but haven't tested on reddit yet) and archivebox. 4 tools, if you count the WARC stuff from archive.org
I'm wondering how this would work if I wanted to use it to, say, save a quiz from Udemy for offline review.
yt-dlp --cookies-from-browser firefox https://www.facebook.com/1234videos/5678/
Careful what you wish for. Assuming the browser is able to run original TS without processing and you want type checking, then that also seems to effectively lock the type-checking abilities of TS to their current level. Even without type checking it would already hinder the ability to add new syntax or standard types to TS.
Given TS is made for providing expressive typing over JS instead of constructively coming up with a type system with a language, there's still a lot of ground to cover, as can be seen by the improvements made in every TS release.
So the types will actually be able to be anything. It can be a completely different type checking superset language than Typescript even! Nothing will be locked at the current level.
It's a frikkin magical proposal.
Edit: wording.
Archive.org and wayback machine should ask for people to submit snaps of pages using this tool directly into the archive - especially during world events.
This would allow digital archeologists to grok the sentiment of the world during that era...
(aside: when I interviewed at twitter they asked me what I thought twitter was, and I said I thought it was a global sentiment engine...)
But kudos to the world for having us now in the AI birth onto the global internet, as a wayback machine, coupled with AIs and LLMs and this tool - will allow one to ask questions about history in ways that will be very interesting.
--
"What was the general media coverage of [topic] in [decade] with respect to how we currently look at it - and are there articles covering [SUBJECT] in this topic for that time period?"
etc...
Last week I started organizing them a bit, and it's shocking how much is a 404. Even from major newspapers and such. I have no idea why anyone would take down old content (outside of some specific and rare reasons). Some are on neither the Internet Archive nor archive.today.
It snapshots a web page to a single html file. At least that's what i use it for. I use it to both archive stuff and to have proof that some site published something.
The next order up would be archivebox or whatever archive.org uses (the name escapes me) - which is a very heavy caching proxy that can save entire websites into a single directory in a way that wget/curl and all the other crawlers cannot.
If you care that the exact layout and everything is perfect, right now i think singlefile is aces.
I use it to export an HTML file that I can stick in my logseq archive for later. So much better than just printing to a PDF!
shot-scraper https://news.com.au/ \
--init-script 'delete Object.getPrototypeOf(navigator).webdriver' \
--user-agent 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:124.0) Gecko/20100101 Firefox/124.0'
Code and screenshots and prototype --init-script feature here: https://github.com/simonw/shot-scraper/issues/147#issuecomme...
I don't usually have more than a few lines.
Perhaps my use case is unusual though, I work on simple web apps, games, interactive simulations. I'm about to get into writing, and I expect a small amount of CSS will be sufficient for that too. Though that would probably expand over time, heh. You want to add a quote, and then a floating image...
I've saved shopping carts and logged-in pages regularly, so the markdown reader version in the apps should definitely be independent of the article/page itself being up.
Your comment, to me, implies that the 404 links' content still exists but is not at a canonical URI anymore. I'm assuming converting stuff like /2018/08/foo.html to /newscheme/fetch?foo or whatever isn't that difficult? This whole thing is one of the reasons i haven't ever set up a blog or even a website that has dynamic content, because i can't be assed to decide on a URI scheme that will "just work" with any future engine.
Someone has to have written converters, right? I know you can import some blogs to wordpress (and vice versa, export WP to other engines...)
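For what it's worth, the mapping itself can be tiny when the old scheme is regular. A Python sketch using the hypothetical paths from the comment above (`/2018/08/foo.html` to `/newscheme/fetch?foo`); both the pattern and the target format are assumptions, not any real engine's layout:

```python
import re
from typing import Optional

# Old scheme assumed to be /YYYY/MM/slug.html
OLD_POST = re.compile(r"^/(\d{4})/(\d{2})/([\w-]+)\.html$")

def new_location(old_path: str) -> Optional[str]:
    """Map an old-style post path to the new scheme, or None if it doesn't match."""
    match = OLD_POST.match(old_path)
    if match is None:
        return None
    slug = match.group(3)
    return f"/newscheme/fetch?{slug}"

print(new_location("/2018/08/foo.html"))
```

A server would answer the old path with a permanent redirect to whatever this returns; the hard part is usually that the old slugs were never recorded, not the rewrite rule.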
Can the API provide a custom tag, comment or ID which will then be inserted in the output? Like in JPEG EXIF, PNG also knows Metadata, PDF description, HTML meta tag?
You could append a new meta tag by running some custom JS to add it, but we don't modify exif, metadata or pdf description at the moment.
Though, weren't there some exceptions to that?
It seems though the syntactical structures that are chosen to be ignored need to be listed in the proposal, making the support in browser non-trivial and still hindering the future extensions of TS and similar languages, because all future constructs would need to be supersets of this proposal—or whatever version is practically supported by current browsers. If a language brings up a new construct all the users of that construct need to revert back from shipping their source-code as is, increasing the cost of introducing such things in the future.
Personally I don't see great benefits in having straight up TS work as-is in the browsers as you still need to run type checking phase locally, but I do see that some would like to see that happen and that it would simplify some release processes.
It would not simplify the release process of folks that want to minify and obfuscate their sources, but it's probably fine to make that comparatively even harder ;).
When I say nothing will be locked down, that only means the type checking itself will not be locked down. Indeed it will not be specified at all. Buuuuut the location of the types in the JS syntax will 100% be locked down and specified. That's what the proposal is. So there will be limitations on coming up with novel ways to integrate type syntax with the JS syntax. But you will of course still be able to make your own compiler if you want this.
<svg role="img" aria-label="[title + description]">
<title>[title]</title>
<desc>[long description]</desc>
...
</svg>
If there was a KOReader integration it would be amazing.
But if it's self-hosted, then that integration could simply be an SFTP / SSH server that accesses the files.
https://github.com/jjjake/internetarchive
The Internet Archive cannot trust arbitrary content previously archived, so it is better to have whatever archival tools or operations you’re performing make a request to Wayback to take a snapshot at the same time.
If you’re bookmarking something, archive it too!
SingleFile just makes this one really complex, really important thing trivially easy, and in a portable format. For anyone curating a knowledge base it’s an absolute godsend.
I didn’t see any donation instructions on your GitHub - I for one would certainly love to chip in if you could point me in the right direction?
[1] https://addons.mozilla.org/android/addon/single-file/
[2] https://apps.apple.com/app/singlefile-for-safari/id644432254...
[3] https://play.google.com/store/apps/details?id=com.kiwibrowse...
That said, Monolith's approach of not requiring a web browser could be a game changer for simpler projects or where installing a Chromium-based browser isn't viable. It strikes me as a more straightforward, lightweight solution, albeit with the clear trade-off of not supporting JavaScript.
Has anyone run into situations where one tool clearly outperformed the other in real-world usage? I'm particularly curious about the impact on performance and convenience when choosing between these two, especially for mobile use. Also, kudos to the authors and contributors of these tools. The tech community benefits greatly from such innovations that help preserve and share knowledge.
What do?
I use SingleFile to save a copy of every article / post / SO & forum discussion I find interesting or useful. I sort them into two buckets: work, and not-work.
I’ve been doing this for 10+ years (before SingleFile I used things like .pdf, plain .html, .webarchive files - although these all have drawbacks).
In the pre-LLM era, I would then interface with these almost exclusively through a search front-end. I use Houdahspot on Mac and easySearch on iOS. That lets me see everything interesting I’ve read on a particular subject just by typing it in (with the usual caveats that apply to basic keyword search - although in practice that alone has proven very effective). Because it’s just a folder of essentially zipped .html files, there’s no lock-in.
Now that we’ve got LLMs, I plug those 10+ years of files straight into my RAG pipeline using llama-index. It’s quite nice :)
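The keyword-search half of that workflow is easy to approximate without any dedicated app. A minimal Python sketch, assuming the archive is a folder of plain `.html` files (not the zipped variant); `TextExtractor` and `search_archive` are names made up for illustration:

```python
from html.parser import HTMLParser
from pathlib import Path

class TextExtractor(HTMLParser):
    """Accumulate visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def search_archive(folder: Path, keyword: str) -> list[Path]:
    """Return saved pages whose visible text contains keyword (case-insensitive)."""
    hits = []
    for path in sorted(folder.rglob("*.html")):
        parser = TextExtractor()
        parser.feed(path.read_text(encoding="utf-8", errors="replace"))
        if keyword.lower() in " ".join(parser.chunks).lower():
            hits.append(path)
    return hits
```

Calling `search_archive(Path("work"), "sqlite")` on a hypothetical `work` folder would list every saved page mentioning that word; the same extracted text is also a reasonable input for a RAG indexer.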
Also, how is the quality of the output compared to a .pdf? I'm used to printing PDFs from Chrome for articles that I want to save, but the layout can become awkward sometimes, and navigation bars can appear several times and hide portions of the text.
I like this feature from chrome, but it's not consistently reliable.
The output compared to PDF is like night and day. It is high fidelity versus low fidelity. At this point, I only use PDF if for some reason I need it.
In most cases SingleFile output looks identical to the real thing. Though I generally only use it on simpler sites such as recipes and technical blogs.
Anyway, to answer your question, lots of pages need JS to work correctly, so using Singlefile is the better option.