YouTube-dl has an interpreter for a subset of JavaScript in 870 lines of Python

YouTube-dl has an interpreter for a subset of JavaScript in 870 lines of Python (twitter.com)

473 points by yuuta 3 years ago | 155 comments

lolinder 3 years ago |

To be clear, this is an extremely tiny subset of JS. It looks like they only implemented the features needed to run a very specific function. For example, the only symbol allowed after "new" is "Date", everything else throws an exception.

It's still fun that it's there, but it's not as big a deal as it sounds from the tweet.

krab 3 years ago | |

It will only grow - as new scripts will need to be interpreted, new features will be added.

lolinder 3 years ago | | |

I would be horrified if this grew much further. It's perfectly fine for its current scope, but the architecture would not scale at all to a full interpreter without essentially starting from scratch.

mid-kid 3 years ago | |

Yeah, it's essentially used as a javascript expression solver. You can see the full extent of its capabilities in the testsuite: https://github.com/ytdl-org/youtube-dl/blob/master/test/test...

The specific site modules in youtube-dl will take care to extract the bare minimum necessary to solve whatever challenge.

em-bee 3 years ago | |

if it's going to need much more than that then it probably would make more sense to port the whole application to javascript instead.

but then this could be turned into a commandline browser that is able to interpret a whole web-page and save the resulting html structure instead of the source as curl/wget would do.

pvillano 3 years ago | | |

Eventually, YouTube-dl might have to simulate an entire browser and human user to fool Google. Until then, the usefulness of YouTube-dl is that it's less heavy than a full browser.

I bet someone's already started a YouTube downloader that uses a headless browser

Uptrenda 3 years ago |

Anyone who has ever pulled a website from a script knows the pain that is Javascript. Normally you want to just get some text and work out the API actions but a lot of sites use horribly obfuscated Javascript -- either because that's what modern web development is (lolz) -- or because it's part of their 'security.' That means if you want to write browser-based bots properly -- you ought to use a browser. There are special browsers that run 'headlessly' or are designed mostly for bot use. Like https://www.selenium.dev/ which plugs into a few different 'browser engines.'

But now you have another problem. Your simple script goes from being a small, simple, self-contained, and elegant gem, to requiring a full browser, specialized drivers, and/or daemons running just to work. If you're using something like Python you just frankly don't have very good packaging. So it's hard to string together all that into a solution and have it magically work for everyone. What YouTube-dl have done is good engineering. Even though it's not a full JS interpreter: they've kept their software lean, self-contained, and easier to use.

Scaevolus 3 years ago | |

Embedding V8 can work quite well: https://github.com/sqreen/PyMiniRacer

You probably have to emulate some of the DOM, but you can interact directly with whatever obfuscated/packed scripts in a more lightweight and secure way than driving an entire browser.

hansvm 3 years ago | |

I use pyminiracer to great effect for that sort of scraping.

eurasiantiger 3 years ago | |

Just npm install puppeteer.

lolinder 3 years ago | | |

Puppeteer is cool, but it's exactly what OP is warning against: it's a full browser that is downloaded and run through npm. It's remarkably well packaged, but still far more error prone than a simple HTTP request, and far more likely to break on its own just with the passage of time.

ciupicri 3 years ago | | |

By the way there is also Playwright [1] and it has Python bindings too [2].

[1]: https://playwright.dev/

[2]: https://playwright.dev/python/docs/intro

delusional 3 years ago |

Can we stop the trend of linking to tweets that just contain another link to the content? what's the point? Wouldn't this be 10x better if it was a link directly to the github?

derangedHorse 3 years ago | |

I like the Twitter linking since it's almost like the OP is giving credit to where they found the information from.

plaguepilled 3 years ago | | |

Agreed. If you only know this from someone else's observation, you should link the observation.

kelnos 3 years ago | |

I was thinking the same thing; link to the file on Github, with the same title text as is there now, and it saves me an extra click. And any time I don't have to visit Twitter, I consider that a win.

caned 3 years ago | |

I often share links to HN instead of the referred link. Many times the comments are as interesting as the content. This applies to sharing Twitter or Reddit links, too, albeit with a lower S/N ratio.

Firmwarrior 3 years ago | | |

Is there some trick to actually being able to see information on Twitter? When I click a tweet, I get the tweet, then a random smattering of 2-3 semi-related tweets, and then a login popup that breaks the page

Do you guys use an extension to process it or something?

(Same issue with Reddit of course)

sylware 3 years ago |

Nowadays "javascript" refers to the scriptable, grotesquely and absurdly complex and massive web engines, aka Google-financed Blink and Gecko, then Apple-financed WebKit, and all that with their SDKs.

The currently obfuscated javascript media players will try to break yt-dlp by leveraging the complexity and size of those scripted web engines. They will make them out of reach to small teams or individuals and it is even "better": it will force ppl to use Apple's or Google's web engine, killing any attempt to provide a real alternative.

A standalone javascript interpreter is actually some work, but seems to stay in the "reasonable" realm: look at quickjs from M. Bellard and friends (the guy who created qemu, ffmpeg, tinycc, etc): plain and simple C (no need for a C++ compiler), doing the job more than well enough.

That's why noscript/basic (x)html is so important.

dtx1 3 years ago | |

> but seems to stay in the "reasonable" realm

> M. Bellard and friends

Choose one. That dude is a wizard wielding C like a brain surgeon wields a scalpel.

olliej 3 years ago | |

Yeah I agree with almost all of this - the massive size and complexity of commercial engines makes it seem like JS the language must also be complex.

I also agree with the idea that these sites will probably be able to/want to create JS that breaks these small/lightweight engines requiring constant work :-/

This final point I disagree with entirely. You can't point to Bellard doing something as evidence that it's reasonable. This is a guy that wrote a program that generated a TV signal via a VGA card. :D

axiolite 3 years ago | |

> quickjs from M. Bellard and friends

Is the M key next to the F key on your particular keyboard by chance? Because I've always called him "Fabrice."

https://en.wikipedia.org/wiki/Fabrice_Bellard

a_e_k 3 years ago | | |

Could just be the usual abbreviation for Monsieur.

ganjatech 3 years ago | | |

Monsieur Bellard - M. Bellard

esprehn 3 years ago |

This isn't really JS, it's a purpose built evaluator that's only for evaluating a particular script on YouTube, assuming a huge list of things are true about how YouTube JS is written.

Ex. It's got a hard coded list of methods for String, and it doesn't respect prototypes. It only supports creating Date instances, and won't work if you override the global Date. It parses with regexes and implements all operators with Python's operator module (which is the wrong type semantics) etc. Nearly none of the semantics of JS are implemented.

It's sort of the sandwich categorization problem:

If I write a C# "interpreter" in Perl that's only 200 lines and just handles string.Join, string.Concat and Console.WriteLine, and it doesn't actually try to implement C# syntax or semantics at all and just uses Perl semantics for those operations, is it actually C#? :P

I say "not a sandwich".
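To make the "wrong type semantics" point concrete, here is a minimal sketch in Python. It is not youtube-dl's code: `js_add` and `python_add` are hypothetical names, and `js_add` only approximates the JS `+` operator for strings, numbers, booleans and null. It just shows how mapping `+` straight onto `operator.add` diverges from JS as soon as a string meets a number.

```python
import operator


def js_to_string(value):
    """Rough stand-in for JS string conversion of a few primitive types."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, float) and value.is_integer():
        return str(int(value))  # JS prints 1.0 as "1"
    return str(value)


def js_add(left, right):
    """Approximate JS `+`: concatenate if either side is a string, else add."""
    if isinstance(left, str) or isinstance(right, str):
        return js_to_string(left) + js_to_string(right)
    return operator.add(left, right)


def python_add(left, right):
    """What a naive evaluator gets by mapping `+` directly to operator.add."""
    return operator.add(left, right)
```

With these definitions, `js_add(1, "2")` returns `"12"` as JS would, while `python_add(1, "2")` raises `TypeError`. An evaluator that only ever sees the operand types one specific script produces can get away with the shortcut.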

haunter 3 years ago |

The same in yt-dlp https://github.com/yt-dlp/yt-dlp/blob/master/yt_dlp/jsinterp...

Interesting to see the diffcheck between the two https://www.diffchecker.com/8EJGN27K

cheschire 3 years ago | |

Is yt-dlp's implementation being better the reason why I have fewer throttling issues than with youtube-dl?

LeoPanthera 3 years ago | | |

Maybe this isn't true anymore, but for a while they would hit different APIs. yt-dlp was using the Android YouTube API because it had no throttling.

kristopolous 3 years ago |

To understand why, I have a far simpler tool that focuses on a subset of sites (adult content video aggregators)

https://github.com/kristopolous/tube-get

It too deals with this problem but does so in a way that'd be easy to maliciously sabotage

Look right about here https://github.com/kristopolous/tube-get/blob/master/tube-ge...

As to why this program exists, this was originally written between about 2010 and 2015 or so, so it technically predates the yt-* ecosystem.

The tool still works fine, and it's not a strict subset of yt-dlp or YouTube-dl because it takes a different approach. Although its overall site coverage is smaller, I've had it be a "second try" system when yt-* fails, and it comes up with success maybe about half the time.

pabs3 3 years ago | |

Would you mind switching to subprocess with shell=False? os.popen is obsolete and insecure because it passes the command through the shell.

PS: I found it quite easy to contribute to yt-dlp and the reviewers are ultra-helpful and kind, you might want to migrate all of your extractors there.
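For anyone wondering what the subprocess suggestion looks like in practice, here is a minimal sketch. The helper name `run_without_shell` and the example URL are made up for illustration, and the child process is just the Python interpreter echoing its argument back. Nothing here is taken from tube-get.

```python
import subprocess
import sys


def run_without_shell(argv):
    """Run a command given as an argument list; no shell parses the arguments."""
    completed = subprocess.run(
        argv,
        shell=False,          # the default, spelled out for emphasis
        capture_output=True,
        text=True,
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return completed.stdout


# A hostile-looking "URL". Concatenated into a shell command line, the part
# after ';' would run as a second command. As a list element it stays one
# literal argument.
untrusted = "http://example.invalid/video; echo injected"
output = run_without_shell(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted]
)
```

Here `output` is the untrusted string followed by a newline, which shows the child received it as a single argument rather than as shell syntax.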

kristopolous 3 years ago | | |

1. It's ancient code but sure

2. They're fundamentally not compatible approaches. This is worthless to them

aeyes 3 years ago |

They just don't want to use any external dependencies... There is also an AES implementation: https://github.com/ytdl-org/youtube-dl/blob/master/youtube_d...

M30 3 years ago |

How should a programming noob interpret this? Be impressed at what was achieved here? Be concerned about security implications using the tool? Something else entirely?

lewisl9029 3 years ago |

Another really cool JS dialect I recently learned about is njs from the nginx team: https://github.com/nginx/njs

This video goes into some of the design and tradeoffs: https://www.youtube.com/watch?v=Jc_L6UffFOs

TL;DW: they optimized for fast creation/destruction of low-footprint VMs with no JIT or garbage collection.

homarp 3 years ago |

the tests for it: https://github.com/ytdl-org/youtube-dl/blob/master/test/test...

olliej 3 years ago |

This is super cool.

Some of the stuff is kind of questionable to me in the sense that I could believe you could probably make some kind of sufficiently wonky JS that this would do the "wrong" thing.

But it's super cool that they are able to do this as I think it shows that claims of JS complexity based on the size of JS engines is overlooking just how much of that size/complexity comes from the "make it fast" drive vs. what the language requires. Here you have a <1000LoC implementation of the core of the JS language, removed from things like regex engines, GCs, etc.

Mad props to them for even attempting it as well - it simply would not have ever occurred to me to say "let's just write a small JS engine" and I would have spent stupid amounts of time attempting to use JSC* from python instead.

[* JSC appears to be the only JS engine with a pure C API, and the API and ABI are stable so on iOS/macOS at least you can just use the system one which reduces binary size+build annoyance. The downside is that C is terrible, and C++ (differently terrible? :D) APIs make for much more pleasant interfaces to the VM - constructors+destructors mean that you get automatic lifetime management so handles to objects aren't miserable, you can have templates that allow your API to provide handles that have real type information. JSC only has JSValueRef and JSObjectRef, and as a JSObjectRef is a JSValueRef it's actually just a typedef to const JSValueRef :D OTOH I do think JSC's partially conservative GC for stack/temporary variables is superior to Handles for the most part, but it's also absolutely necessary to have an API that isn't absolutely wretched. The real problem with JSC's API is that it has not got any love for many many many .... many years so it doesn't have any way to handle or interact with many modern features without some kludgy wrappers where you push your API objects into JS and have the JS code wrap them up. The API objects are also super slow, as they basically get treated as "oh ffs" objects that obey no rules. I really do wish it would get updated to something more pleasant and really usable.]

jraph 3 years ago |

I do wonder why YouTube does not try harder to make it difficult to do this computation meant to prove you are a legit YouTube web client. Providing an easy-to-find, simple JS function interpretable with 900 lines of Python is like they don't try at all. They might as well do nothing.

Or is their goal just to make youtube-dl not 100% reliable? Or to be able to say "look, you are running our code in a way we did not intend, you can't do this because you are breaking the EULA"?

zuminator 3 years ago | |

I'd guess that their efforts to make it harder are limited by the fact that they want YouTube to be able to play on thousands of different low powered set top boxes and cheap phones. So whatever obfuscated code they use has to be simple enough to be run and periodically updated by all these different devices, and that same simplicity makes it emulable.

Arnavion 3 years ago | |

They do make it harder from time to time. In fact yt-dlp's interpreter has been broken for a month or so now and the devs finally gave up and told users to just install PhantomJS (which itself hasn't been updated since 2016 and probably has bugs / vulns of its own, but whatever).

https://github.com/yt-dlp/yt-dlp/issues/4635#issuecomment-12...

whywhywhywhy 3 years ago | | |

I mean if this is the direction it’s heading it makes more sense to port yt-dlp to node. It’s already dependent on a scripting language, it may as well be the one YouTube speaks.

Cthulhu_ 3 years ago | |

I'm guessing the number of people using it is low enough to not bother with mitigation. Then again, there's a LOT of YT videos that take clips from other videos (which in most cases falls under fair use), which I can imagine would use this tool.

mdaniel 3 years ago |

I was expecting this to be about Duktape <https://github.com/svaarala/duktape>, but heh, for sure no. I'd bet $1 there's no way youtube-dl would switch, but I wonder if yt-dlp would?

rcarmo 3 years ago |

Awesome. Even if it's likely incomplete, it might come in really handy for some scraping I need to do...

Too 3 years ago |

They must have been inspired by this PyCon presentation, where David Beazley live codes a fully working webassembly interpreter, in under one hour. https://youtu.be/VUT386_GKI8

atan2 3 years ago |

This seems to be a pretty small subset of JavaScript, but I personally love small projects like this for educational purposes. Removing the noise and keeping things minimal helps my brain reason about things.

Earlier this year I enrolled in an online class called "Building a Programming Language" taught by Roberto Ierusalimschy (creator of Lua) and Gustavo Pezzi (creator of pikuma.com). We created a toy language interpreter/VM and the final code was around 1,800 lines of Lua code. Keeping things as simple (and sometimes naive) as possible was definitely the right choice for me to really wrap my head around the basic theory and connect the dots.

Thanks for the link.

Tao3300 3 years ago |

Greenspun's Tenth Rule:

> Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp. [1]

And here we have a complicated Python program with a partial JS implementation in it.

[1] https://en.wikipedia.org/wiki/Greenspun's_tenth_rule

anony23 3 years ago |

What purpose does it serve?

tonetheman 3 years ago |

If this got much bigger I would switch it to quickjs

Overview of the control flow (already known):

- The Youtube API provides you with n - your video access token.
- If their new changes apply to your client (they do for "web") then it is expected your client will modify n based on internal logic. This logic is inside player...base.js
- n is modified by a cryptic function.
- Modified n is sent back to server as proof that we're an official client.
- If you send n unmodified, the server will eventually throttle you.
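The flow above can be sketched roughly as follows. Everything in this snippet is an assumption made for illustration: it assumes n travels as a URL query parameter, `toy_transform` is a made-up placeholder rather than the real cryptic function from the player script, and `rewrite_n` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def toy_transform(n):
    """Hypothetical stand-in for the player's cryptic function (NOT the real one)."""
    return n[::-1]


def rewrite_n(url, transform=toy_transform):
    """Return url with its `n` query parameter replaced by transform(n)."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    if "n" not in query:
        return url  # no challenge value in this URL, leave it untouched
    query["n"] = [transform(query["n"][0])]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))
```

In a real client the hard part is `transform`: it has to reproduce whatever the current player script computes, which is what the small JS interpreter is for. The URL rewriting around it is the easy part.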