It's still fun that it's there, but it's not as big a deal as it sounds from the tweet.
The specific site modules in youtube-dl will take care to extract the bare minimum necessary to solve whatever challenge.
but then this could be turned into a commandline browser that is able to interpret a whole web-page and save the resulting html structure instead of the source as curl/wget would do.
I bet someone's already started a YouTube downloader that uses a headless browser
But now you have another problem. Your simple script goes from being a small, simple, self-contained, and elegant gem to requiring a full browser, specialized drivers, and/or daemons running just to work. If you're using something like Python, you frankly don't have very good packaging, so it's hard to string all that together into a solution and have it magically work for everyone. What YouTube-dl have done is good engineering. Even though it's not a full JS interpreter, they've kept their software lean, self-contained, and easier to use.
You probably have to emulate some of the DOM, but you can interact directly with whatever obfuscated/packed scripts in a more lightweight and secure way than driving an entire browser.
Do you guys use an extension to process it or something?
(Same issue with Reddit of course)
The currently obfuscated javascript media players will try to break yt-dlp by leveraging the complexity and size of those scripted web engines. They will put them out of reach of small teams or individuals and, even "better", force people to use the Apple or Google web engine, killing any attempt to provide a real alternative.
A standalone javascript interpreter is actually some work, but seems to stay in the "reasonable" realm: look at quickjs from M. Bellard and friends (the guy who created qemu, ffmpeg, tinycc, etc): plain and simple C (no need for a C++ compiler), doing the job more than well enough.
That's why noscript/basic (x)html is so important.
> M. Bellard and friends
Choose one; that dude is a wizard wielding C like a brain surgeon wields a scalpel.
I also agree with the idea that these sites will probably be able to/want to create JS that breaks these small/lightweight engines requiring constant work :-/
This final point I disagree with entirely. You can't point to Bellard doing something as evidence that it's reasonable. This is a guy that wrote a program that generated a TV signal via a VGA card. :D
Is the M key next to the F key on your particular keyboard by chance? Because I've always called him "Fabrice."
Ex. It's got a hard-coded list of methods for String, and it doesn't respect prototypes. It only supports creating Date instances, and won't work if you override the global Date. It parses with regexes and implements all operators with Python's operator module (which has the wrong type semantics), etc. Nearly none of the semantics of JS are implemented.
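A rough sketch of the operator point, not taken from youtube-dl's code: mapping JS `+` straight onto Python's `operator.add` gives Python's type rules rather than JS's. `js_add` and `js_str` are hypothetical helpers that only cover strings and plain numbers.

```python
import operator


def python_add(a, b):
    """Apply Python's own '+' semantics, as a naive interpreter might."""
    return operator.add(a, b)


def js_str(value):
    """Format a number roughly as JS would: 2.0 becomes '2'."""
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)


def js_add(a, b):
    """Hypothetical JS-like '+' for str/int/float operands only.

    In JS, if either operand is a string, the other is converted to a
    string and the two are concatenated.
    """
    if isinstance(a, str) or isinstance(b, str):
        return js_str(a) + js_str(b)
    return a + b


if __name__ == "__main__":
    try:
        python_add("1", 2)
    except TypeError as exc:
        print("Python semantics raise:", exc)
    print("JS-like semantics give:", js_add("1", 2))
```

In JS, `"1" + 2` is `"12"`; with Python's operator module the same expression is a TypeError, which is the kind of mismatch meant above.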
It's sort of the sandwich categorization problem:
If I write a C# "interpreter" in perl that's only 200 lines and just handles string.Join, string.Concat and Console.WriteLine, and it doesn't actually try to implement C# syntax or semantics at all and just uses perl semantics for those operations, is it actually C#? :P
I say "not a sandwich".
Interesting to see the diffcheck between the two https://www.diffchecker.com/8EJGN27K
https://github.com/kristopolous/tube-get
It too deals with this problem but does so in a way that'd be easy to maliciously sabotage
Look right about here https://github.com/kristopolous/tube-get/blob/master/tube-ge...
As to why this program exists: it was originally written between about 2010 and 2015 or so, so it technically predates the yt-* ecosystem.
The tool still works fine, and because it takes a different approach it's not a strict subset of yt-dlp or YouTube-dl. Although its overall site coverage is smaller, I've used it as a "second try" system when yt-* fails, and it comes up with success maybe about half the time.
PS: I found it quite easy to contribute to yt-dlp and the reviewers are ultra-helpful and kind, you might want to migrate all of your extractors there.
2. They're fundamentally not compatible approaches. This is worthless to them
This video goes into some of the design and tradeoffs: https://www.youtube.com/watch?v=Jc_L6UffFOs
TL;DW: they optimized for fast creation/destruction of low-footprint VMs with no JIT or garbage collection.
Some of the stuff is kind of questionable to me in the sense that I could believe you could probably make some kind of sufficiently wonky JS that this would do the "wrong" thing.
But it's super cool that they are able to do this, as I think it shows that claims of JS complexity based on the size of JS engines are overlooking just how much of that size/complexity comes from the "make it fast" drive vs. what the language requires. Here you have a <1000LoC implementation of the core of the JS language, removed from things like regex engines, GCs, etc.
Mad props to them for even attempting it as well - it simply would not have ever occurred to me to say "let's just write a small JS engine" and I would have spent stupid amounts of time attempting to use JSC* from python instead.
[* JSC appears to be the only JS engine with a pure C API, and the API and ABI are stable, so on iOS/macOS at least you can just use the system one, which reduces binary size+build annoyance. The downside is that C is terrible, and C++ (differently terrible? :D) APIs make for much more pleasant interfaces to the VM - constructors+destructors mean that you get automatic lifetime management so handles to objects aren't miserable, and you can have templates that allow your API to provide handles that have real type information. JSC only has JSValueRef and JSObjectRef, and as a JSObjectRef is a JSValueRef it's actually just a typedef to const JSValueRef :D OTOH I do think JSC's partially conservative GC is superior to Handles for stack/temporary variables for the most part, but it's also absolutely necessary to have an API that isn't absolutely wretched. The real problem with JSC's API is that it has not got any love for many many many .... many years, so it doesn't have any way to handle or interact with many modern features without some kludgy wrappers where you push your API objects into JS and have the JS code wrap them up. The API objects are also super slow, as they basically get treated as "oh ffs" objects that obey no rules. I really do wish it would get updated to something more pleasant and really usable.]
Or is their goal just to make youtube-dl not 100% reliable? Or to be able to say "look, you are running our code in a way we did not intend, you can't do this because you are breaking the EULA"?
https://github.com/yt-dlp/yt-dlp/issues/4635#issuecomment-12...
Earlier this year I enrolled in an online class called "Building a Programming Language" taught by Roberto Ierusalimschy (creator of Lua) and Gustavo Pezzi (creator of pikuma.com). We created a toy language interpreter/VM and the final code was around 1,800 lines of Lua. Keeping things as simple (and sometimes naive) as possible was definitely the right choice for me to really wrap my head around the basic theory and connect the dots.
Thanks for the link.
> Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp. [1]
And here we have a complicated Python program with a partial JS implementation in it.
xhtml has been dead for a decade
Why do I need a full XML parser when I can just extract what I need with regex?
And:
All that RPC IDL stuff is overcomplicated, REST is so much easier because I can just write the client by hand.
Submitted title was "YouTube-dl has a JavaScript interpreter written in 870 lines of Python".
The number of high-engagement, just plain wrong tweets out there is just sad.
Edit: You misunderstood baobabKoodaa in the same way. Nobody is arguing about what constitutes an interpreter, except you. The question is what language is being interpreted.
Before accusing someone of pedantry, it would first be good not to completely misread them.
It is technically wrong - it isn't a sufficiently rich and powerful approach to handle all JS (HTML) that you might throw at it. It'll work for a while until it eventually barfs when you least expect it.
EXCEPT that if the inputs you are giving it come from some understood source(s) that aren't likely to change, then a simpler approach than the "all singing all dancing" correct one may be appropriate and justified, e.g. because it might be easier to write, easier to maintain and/or have less attack surface.
Does that apply to YouTube? Or any of the other hundreds of supported sites?
It's definitely neat, but not especially useful outside of the confines of its current application, and the security concerns of such a tiny subset will be minimal.
It's even very sensitive to white space.
yt-dlp seems to support running javascript in a full javascript interpreter/headless browser called phantomjs though. Running javascript in a full interpreter like this is a lot more scary from a security standpoint. I am not sure whether phantomjs sandboxes the javascript evaluation from the rest of the system, and if it does, whether the sandbox actually works properly at all. It looks like the project is not being maintained which is another bad sign.
Big projects with lots of manpower behind them such as chromium have trouble keeping javascript evaluation safe, so I would really suggest not trusting phantomjs on untrusted input.
This isn't something YouTube particularly enjoys. They would rather you keep coming back -- every visit is more ad revenue for them. If you have an offline copy, you don't need to visit YouTube anymore.
YouTube has an incentive, therefore, to make it more difficult to download (or "scrape") their content.
I'm not particularly sure of the specific details, but apparently YouTube has added JavaScript (a programming language that executes in the browser) as a hurdle to jump over. A simple python script doesn't have enough brains to execute JavaScript, only enough to realize that it exists. (Clearly, youtube-dl is sophisticated enough to have jumped over it.)
These are the conclusions I come to, having written software for about a decade.
1) Once you give information to someone, be it text, pictures, sound, or video -- they will do whatever they want with it, and you have no control. Oh, yes -- it may be illegal. Maybe unethical. But the fact of the matter is you do not have control over information once it leaves your hands.
2) Adding hurdles to make it harder to access the information does little to stop someone who is dedicated to accessing it.
3) Implementing a subset of JavaScript in such an elegant and tiny manner is quite impressive.
How you interpret these facts depends on your worldviews. If you are a media and content creator, you will view these facts differently than a politician, and a teenager.
As an engineer and amateur philosopher, I certainly support the rights of content creators to be paid for their work. And yet, I fear that more and more, content creators want to lease me a right to listen to their music, instead of letting me own a copy of it.
I used to own CDs, DVDs, movies, and books. What happens if Amazon or YouTube decides to not serve me anymore? Anything I've "purchased" from them, I lose access to.
Furthermore, if I create a song, I used to be able to burn copies onto CDs and distribute them on street corners. Now, you have to sign up to stream on Spotify. This is a double-edged sword -- I get a wide audience, but Spotify will do whatever they want with me.
This troubles me.
Usually in a virtual machine.
The browser is client-facing and everything there is possible to reverse engineer and figure out. So if you design a web-based application, and are depending on client-side Javascript for any security or distribution enforcement, it can be helpful, but can ultimately be unwound and cracked even if obfuscated, etc.
> Be impressed at what was achieved here?
Yes. Try to download a YouTube video without it, or without an online service which is probably using it internally.
I also assume you mean mainstream JS engines, but Duktape, JerryScript and QuickJS all have C APIs.
They probably could have used ex. https://github.com/PetterS/quickjs instead of the hacks in the OP linked file.
You are correct though that I was only thinking of the big engines - bias on my part alas.
For your suggested alternate engines, JerryScript and QuickJS seem more complete than Duktape but I can't quite work out the GC strategy of JerryScript. Bellard says QuickJS has a cycle detector but I'm generally dubious of them based on prior experience.
If I was shipping software that had to actually include a JS engine, if perf was not an issue I would probably use JerryScript or QuickJS as binary size I think would be a more critical component.
Edit: it's also required to download music, otherwise it will just fail
Source:
- https://github.com/ytdl-org/youtube-dl/issues/29326#issuecom...
- https://github.com/ytdl-org/youtube-dl/blob/d619dd712f63aab1...
- https://github.com/ytdl-org/youtube-dl/commit/cf001636600430...
Overview of the control flow (already known):
The Youtube API provides you with n - your video access token
If their new changes apply to your client (they do for "web") then it is expected your client will modify n based on internal logic. This logic is inside player...base.js
n is modified by a cryptic function
Modified n is sent back to server as proof that we're an official client. If you send n unmodified, the server will eventually throttle you.
So they can always change the function to keep you on your toes, hence you need to be able to run semi-arbitrary JS in order to keep using the API. Waste of human brainpower, but I guess that energy is better spent imagining a world where Google isn't in charge instead of kvetching about what they're doing with their influence.
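A minimal sketch of the flow described above, under loud assumptions: `stand_in_transform` is a placeholder (reversing the string is NOT what YouTube does, and the real function changes over time), `rewrite_n` is a hypothetical helper, and the URL is made up. It only shows the shape of "take n, run it through whatever the player script defines, send the result back".

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def stand_in_transform(n):
    """Placeholder for the cryptic function in the player script.

    Reversing the string is purely illustrative; the real logic would
    have to be extracted from the player JS and interpreted.
    """
    return n[::-1]


def rewrite_n(url, transform):
    """Return url with its 'n' query parameter passed through transform."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    if "n" in query:
        query["n"] = [transform(value) for value in query["n"]]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    print(rewrite_n("https://example.com/videoplayback?id=1&n=abc",
                    stand_in_transform))
```

The point of keeping `transform` as a parameter is that it is the only part that has to change whenever the player script does.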
We have a custom-made Discord music bot on our server which uses ytdl to stream songs so we can listen together, and at one point we were listening and suddenly got some obscure JavaScript error.
We began joking that there's some bug in the code which breaks it after 6PM, but later found out that Google had changed some of the obfuscated JS and this basically broke this part of code, which prevented us from fetching the song information.
It's kinda annoying if you have a lot of youtube tabs open for a long time and come back to them.
I believe YouTube limits your bitrate if you don't pass a specific calculated value; it's possible youtube-dl has to parse and eval JS to get it.
It's starting to become Widevine bullshit all over again.
Since the calculation of the response is done in JS, and they occasionally change the formula, some download programs are moving towards running the JS rather than trying to keep up with the changes.
It’s really just bullshit to make people’s lives harder.
For reddit use old.reddit.com instead of www.reddit.com. Reddit is Fun is a great native app for android and on iOS there's Apollo.
Both sites are laser-focused on driving conversions and engagement which means forcing you into an account and native apps (specifically their shitty native apps), and undoubtedly they'll start breaking the workarounds and third-party clients for realsies at some point.
But I mean, if users don't even have an account and native app install, how can they possibly get you doomscrolling all day? It's 2022, it's all about the engagement metrics, fuck user experience.
”Your simple script goes from being small, simple, self-contained, and elegant gem, to requiring a full browser, specialized drivers, and/or daemons running just to work”
Complex problems cannot be solved by simple scripts, but they can be abstracted away to vendor libraries when/if they are well maintained, such as in this case. While it can break with time, at least someone else fixes it for you.
> Youtube now throttles requests of more than 10MB at a time, yt-dlp works around it by making many requests of 10MB using Range HTTP headers (yt-dlp calls it the http-chunk-size), but ffmpeg which does the downloading for mpv doesn't support that yet.
I want to change mpv or yt-dlp to support range-based video URLs (eg. appending &range=333999644-335298975&rn=5&rbuf=0 to URLs) which speed up stream seeking and probably eliminate throttling altogether, but I haven't taken the time to look into how to achieve it. For anyone interested, I have an open bug report at https://github.com/mpv-player/mpv/issues/10601, and have found https://satadalsengupta.github.io/docs/papers/2017_nossdav_y... describing these parameters.
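For anyone wondering what the chunked workaround amounts to, here is a rough sketch of splitting a download into Range requests. It is not yt-dlp's code; `chunk_ranges` and `range_header` are hypothetical names, and treating "10MB" as 10 * 1024 * 1024 bytes is my assumption.

```python
def chunk_ranges(total_size, chunk_size=10 * 1024 * 1024):
    """Yield inclusive (start, end) byte offsets covering total_size bytes."""
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        yield start, end
        start = end + 1


def range_header(start, end):
    """Format an HTTP Range header value for an inclusive byte range."""
    return f"bytes={start}-{end}"


if __name__ == "__main__":
    # A 25-byte "file" in 10-byte chunks, to keep the output readable.
    for start, end in chunk_ranges(25, chunk_size=10):
        print({"Range": range_header(start, end)})
```

Each request then carries one of these headers, so no single request asks for more than the chunk size.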
Open source is not enough anymore, "lean" open source is the way now, SDK included.
That said I think a decent Python-native JS interpreter isn't that bad of an idea, it definitely needs a separate project and a more sophisticated architecture but it's an attainable goal.
This interpreter is built around matching specific Regex patterns and then immediately executing hard-coded behaviors with a few slots for parameters. It's missing a whole lot of the skeletal structure that would be necessary to "scale" it up to support a generally useful subset of JS, much less the entire language. Without the necessary structure, it would be a buggy mess that's impossible to maintain, and you can't just take the existing code and structure it better: it's built on the wrong foundation. That's what I mean when I say it won't scale.
That's not a judgement against the team that wrote it! It meets their needs fine, and choosing a minimal solution is great engineering that takes a lot of guts in the current software culture. I like what they've done here. Just don't take it and try to scale it up into a full JS runtime.
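To make the "regex patterns plus hard-coded behaviors" description concrete, here is a toy in the same spirit. It is my own illustration, not the youtube-dl source; the three supported statements and all names (`RULES`, `run`, the handlers) are made up for the example.

```python
import re


def do_reverse(env, match):
    env[match.group(1)].reverse()


def do_drop_front(env, match):
    # Mirrors a.splice(0, N): remove the first N items in place.
    del env[match.group(1)][: int(match.group(2))]


def do_join(env, match):
    env[match.group(1)] = "".join(env[match.group(2)])


# Each rule pairs a pattern with a fixed behaviour; the capture groups
# are the only "slots" for parameters.
RULES = [
    (re.compile(r"(\w+)\.reverse\(\)"), do_reverse),
    (re.compile(r"(\w+)\.splice\(0,\s*(\d+)\)"), do_drop_front),
    (re.compile(r'(\w+)\s*=\s*(\w+)\.join\(""\)'), do_join),
]


def run(source, env):
    """Execute ';'-separated statements by first-match regex dispatch."""
    for statement in filter(None, (s.strip() for s in source.split(";"))):
        for pattern, handler in RULES:
            match = pattern.fullmatch(statement)
            if match:
                handler(env, match)
                break
        else:
            raise NotImplementedError(f"unsupported statement: {statement!r}")
    return env


if __name__ == "__main__":
    env = run('a.reverse(); a.splice(0, 2); s = a.join("")',
              {"a": list("abcdef")})
    print(env["s"])
```

There is no tokenizer, AST, or scope handling here, so every new piece of syntax means another regex and another special case, which is the scaling problem described above.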
Engineers: make half arsed attempt.
It's not pedantry (or I'm pedantic). It's a reaction to the title that can lead people to believe that a complete JavaScript interpreter has been written in less than a thousand lines of Python. This reaction is perfectly understandable.
It's not a pedantic argument. Based on the title I thought that somebody wrote something akin to V8 in 800 lines of Python. After reading the comments I realized those 800 lines just interpret a particular JavaScript function written by Youtube. Those things are different. Pointing out the fact that they are different is not pedantry. The title is misleading and the comments pointing that out are helpful.
most of us know that a thousand or so lines of code is not a full JavaScript interpreter and cannot be the real thing.
there is no argument or conversation to have about it.
(If you disagree, this may be the one time I will actively ask you to flag this post, so a mod can respond to this point)
Instead we could be considering if we're meant to read the Twitter conversation as well, or sharing a laugh about the link in the tweet author's bio. Or maybe the sharer didn't feel comfortable enough seeming like they made the claim but still wanted to share it because it's kind of cool.
AFAIK there's no junior HN mod of the year award.
Hell, how is the Creative Commons licence, which they totally give you the option to select, supposed to work for videos that can't be downloaded in any way?
For ballpark numbers, youtube dedicates 1200kbps to 1080p videos in VP9. Let's say we have a 10 minute video with an RPM of $3.
We can arrange a CDN to deliver files at $0.005 per GB without even putting effort into it. And that's at a super low scale. The price drops a lot from there as things get bigger. So I'll use that number, and note that it's being generous to google.
So that's 0.3 cents of revenue per watch, which is 90MB of data that would cost .045 cents to deliver.
One view would pay for about 7 downloads. And how many downloads are we likely to see? Probably under 10% of viewers.
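The ballpark above can be reproduced with a few lines, assuming decimal units (1 MB = 1000 kB, 1 GB = 1000 MB) and the same inputs; `download_economics` is just a name I picked for the sketch.

```python
def download_economics(bitrate_kbps=1200, duration_s=10 * 60,
                       rpm_usd=3.0, cdn_usd_per_gb=0.005):
    """Return (size_mb, revenue_cents, delivery_cents, downloads_per_view).

    rpm_usd is revenue per 1000 views; sizes use decimal units.
    """
    size_mb = bitrate_kbps * duration_s / 8 / 1000      # kilobits -> MB
    revenue_cents = rpm_usd / 1000 * 100                # per single view
    delivery_cents = size_mb / 1000 * cdn_usd_per_gb * 100
    return size_mb, revenue_cents, delivery_cents, revenue_cents / delivery_cents


if __name__ == "__main__":
    size, revenue, delivery, ratio = download_economics()
    print(f"{size:.0f} MB, {revenue:.3f} cents revenue, "
          f"{delivery:.3f} cents delivery, {ratio:.1f} downloads per view")
```

With the defaults this works out to about 90 MB, 0.3 cents of revenue, 0.045 cents of delivery cost, and a bit under 7 downloads paid for per view.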
I'd turn that option on.
Even after I fixed hardware acceleration, playing a 1080p YouTube video in Firefox using hardware H.264 decoding took more CPU energy (40% of a core) than playing the same video in mpv using software H.264 decoding (20% of a core). Web browsers are just horrifically complex, intractable to understand, and inefficient.
Most likely the reason is that they keep the botguard system for the stuff that matters to them a lot more like account signups and click fraud, and don't want to incentivize the ytdl guys to break it on behalf of spammers/clickfraudsters.
Now if they do it right and only embed some bare JS interpreter, it's still way harder to audit than these < 900 lines, for which it is quite easy to convince oneself that the interpreted script cannot do much.
Having full control like this +simple code is probably lower risk and more maintainable, even if there's the challenge of expanding feature set if scripts change.
The alternative would be a console js shell, but those are very different from browsers so that poses its own challenges.
https://github.com/PetterS/quickjs
https://github.com/stefano/pyduktape
https://github.com/amol-/dukpy
I can't speak to the quality of those bindings, but they do seem maintained.
Cue libv8-node+mini_racer from which PyMiniRacer was born. It is non-trivial but not as hard as one might think.
The most painful part is the libv8 build system and Google-centric tooling (depot tools!), which makes it an absolute PITA for libv8 consumers that are not Google/Chrome.
This is why the libv8 gem was atrocious to keep up to date and to build for several platforms, and why libv8-node was born: the node build system and source distribution are actually sane (props to their relentless work, on which we piggyback).
Disclaimer: worked at Sqreen, now maintainer of libv8-node and collaborator of mini_racer
https://github.com/sqreen/PyMiniRacer
Another option is to use node, but it also has weird limitations/behaviors when running code.