RLHF a LLM in <50 lines of Python (datadreamer.dev)
When you get so far as abstracting every step into a one-liner loaded from huggingface, including downloading a prepared dataset, with no example of doing the same on a custom local dataset, you've abstracted too far to be useful for anyone other than the first user.
However, there is a lot of documentation on the site to help guide users. This documentation page shows you can load in data via local datasets as well. For example, JSON, CSV, text files, a local HF Dataset folder, or even from a Python `dict` or `list`:
https://datadreamer.dev/docs/latest/datadreamer.steps.html#t...
We'll definitely keep improving documentation, guides, and examples. We have a lot of it already, and more to come! This has only recently become a public project :)
If anyone has any questions on using it, feel free to email me directly (email on the site and HN bio) for help in the meantime.
I would not have guessed that the base input data processing would have been filed under 'steps'. But now I kinda see how you are working, but I admit I'm not the target audience.
If you want this to really take off for people outside of a very, very specific class of researchers... set up an example on your landing page that reads a local JSON of user prompts/answers/rejects and finetunes a llama model, with your datadreamer.steps.JSONDataSource feeding the loader. Or a txt file with the system/user/assistant prompts tagged and examples given. Yes, the 'lines of code' for your frontpage example may grow a bit!
Maybe there are a lot of 'ML researchers' who are used to the super-abstract OOP API, load-it-from-huggingface scheme you are targeting, but know that there are also a ton who aren't.
You can look at the samples. Mostly it's questions and accepted/rejected answers.
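To make "questions and accepted/rejected answers" concrete, here is a minimal sketch of loading such records from local JSON using only the standard library. The field names `prompt`, `chosen`, and `rejected`, the sample record, and the helper `load_preference_pairs` are all assumptions for illustration; they are not a schema required by DataDreamer or any other library.

```python
import json

# Hypothetical local data; in practice this text would come from a file on disk.
RAW = """
[
  {"prompt": "How do I reset my password?",
   "chosen": "Open Settings, choose Security, then select Reset password.",
   "rejected": "I don't know."}
]
"""

# Assumed field names for a preference record (illustrative only).
REQUIRED_KEYS = {"prompt", "chosen", "rejected"}


def load_preference_pairs(text):
    """Parse a JSON array of preference records and check each has all fields."""
    records = json.loads(text)
    for i, record in enumerate(records):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"record {i} is missing keys: {sorted(missing)}")
    return records


pairs = load_preference_pairs(RAW)
```

Validating the keys up front is a design choice: a malformed record fails loudly at load time rather than somewhere deep inside a training loop.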
Title should instead be “Library for low-code RLHF in Python”
That's like saying, I can solve any problem in 2 lines of code. I'll publish a library for it first, then:
import foo; foo.do_the_thing()
Magic!
DataDreamer is an open source Python package with a nice API from the University of Pennsylvania that does all this that we’re actively developing. Will be here to answer questions.
However, we also tried to simplify the API and have sensible defaults to make it usable for anyone / make ML research code cleaner :)
Aligned models are dumber, treat everyone like stupid, immature idiots who can't handle words, and act like a wannabe moral authority.
Say, for simple conversation use cases (e.g. customer support for a specific product, interactive fiction, things like that without deep technical knowledge).
I was also wondering if it’s possible to do such RLHF for SD running locally.
https://datadreamer.dev/docs/latest/pages/get_started/quick_...
It would be nice. But I’ve seen too many nice ideas completely fall apart in practice to accept this without some justification. Even if there are papers on the topic, and those papers show that the models rank highly according to some eval metrics, the only metric that truly matters is "the user likes the model and it solves their problems."
By the way, on a separate topic, the 90/10 dataset split that you do in all of your examples turns out to be fraught with peril in practice. The issue is that the validation dataset quality turns out to be crucial, and randomly yeeting 10% of your data into the validation dataset without manual review is a recipe for problems.
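As a rough illustration of that point, here is a sketch of a seeded 90/10 split that also builds a queue of validation examples to inspect by hand. The function name `split_with_review_queue`, the `prompt` field, and the two checks (empty prompts, prompts that also appear in training) are assumptions chosen for illustration. They are a starting point for manual review, not a substitute for it.

```python
import random


def split_with_review_queue(examples, val_fraction=0.1, seed=0):
    """Randomly split examples, and flag suspicious validation items for review."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)

    n_val = max(1, int(len(shuffled) * val_fraction))
    val, train = shuffled[:n_val], shuffled[n_val:]

    # Flag validation items that are empty or whose prompt leaks from training.
    train_prompts = {ex["prompt"] for ex in train}
    flagged = [
        ex for ex in val
        if not ex["prompt"].strip() or ex["prompt"] in train_prompts
    ]
    return train, val, flagged


# Hypothetical data: 20 records with unique prompts.
examples = [{"prompt": f"question {i}"} for i in range(20)]
train, val, flagged = split_with_review_queue(examples)
```

Anything in `flagged` deserves a look first, but the whole validation set is small enough at 10% that reading all of it is often feasible.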
To actually do something from scratch or use the author's code requires adopting something esoteric just for this purpose. For these scenarios it is nice to appreciate HF and their abstractions. But the reinventing-the-wheel situation is very frustrating to work with.
If you want to go beyond the demo, you have to deal with this painful reality. I hope there is more progress on this rather than more stacks of APIs.
Theoretically the hard part is collecting the examples with rejections etc.
Unless your research hypothesis is specifically around improving or changing RLHF, it's unlikely you should be implementing it from scratch. Abstractions are useful for a reason. The library is quite configurable to let you tune any knobs you would want.
As far as I understand, what the training loop is supposed to be doing is pretty static and you don't need to understand most of it in order to "do ML", but at the same time it's full of complicated things to get right (which would be much easier to understand when controlled through well defined parameters instead of mixing boilerplate and config).
They're asking why it matters whether it's 50, 60, or even 100 lines. It's a wrapper, which should take fewer lines. That's the whole point: abstracting things even further and making assumptions.
Of course you can use them. Of course you can remove them after and use the underlying code. But the LOC shouldn't be the important part of it.
Kind of like everybody knows the pop-science around e = mc^2 but most are completely oblivious that it takes a bunch of whiteboards to derive it and what all that actually means.
No pithy formula, no way for the actual ideas to spread to the mainstream so that you somehow hear about them.
First, I’m an ML researcher. I don’t go around saying so because appeal to authority is bogus, but since every one of your comments seems to do this, it’s unavoidable.
You say the code is for ML researchers, then flat out say that it’s not a working production example, nor is it a faithful reproduction of a paper. So what is it?
Whether you want it to be or not, your audience is the hobbyist ML community, because without benchmarks to back up your code examples, no one from the research community will trust your examples without actual proof that they work. That’s the hard part of research, and it’s most of the effort.
My advice is, write something that can train useful models. Implement a production grade workflow, and show some reasons why it works. If you’re trying to get the wider ML research community to buy in to this, there’s not much other way to do it. No one will want to take easy code that does the wrong thing, and most of your examples show the wrong thing to do, like the 90/10 split.
You’re also a bit defensive about accepting feedback. Trust me that it’s better to accept that your code sucks and does the wrong thing, and then try to make it suck less and do the right thing. That’s how the majority of good software is written, unless you’re cperciva. But he’d also publish a paper explaining why his code is correct.
Anyway, the whole point of posting this to HN is to get feedback on it. (If you were hoping that a bunch of people would suddenly use it, then you need to appeal to the hobbyist community. They’ve told you a bunch of things that you’ve straight up said are out of scope.) And it sounds like you were hoping for feedback from ML researchers. Maybe others will chime in, but for now, that’s the best I’ve got.
I'm not a fan of the RL/SL dichotomy, because the line gets so foggy. If you squint, every loss is a negative reward, and every policy improvement a supervised target.
Still, what the code does isn't what is described in the paper that the page links to.
Isn't this just because reinforcement learning and supervised learning are both optimization problems?
Nowadays, many datasets have different forms or are synthetic. DPO uses datasets with both positive and negative examples (instead of just a target output as with traditional SL); RLHF uses synthetic rewards.
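For concreteness, the DPO objective for a single positive/negative pair can be sketched in plain Python. This is a minimal sketch, assuming the summed log-probabilities of each response under the policy and under a frozen reference model have already been computed elsewhere. The function names and the default `beta=0.1` are illustrative choices, not taken from the linked page or from DataDreamer.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # How far the policy has moved from the reference on each response.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(m)) equals softplus(-m).
    return softplus(-margin)


# Policy identical to the reference: the loss is log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy raises the chosen response and lowers the rejected one: loss drops.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

Note that nothing here samples from the model or queries a reward model during training; the loss is computed directly from a fixed dataset of pairs, which is why it looks so much like a supervised objective.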
Given that, shouldn't the first sentence on the linked page end with "...in a process known as DPO (...)" ? Ditto for the title.
It sounds like you're saying that the terms RL and RLHF should subsume DPO because they both solve the same problem, with similar results. But they're different techniques, and there are established terms for both of them.
Out of genuine curiosity, do you have any pointers/evidence to support this? I know that some of the industry-leading research labs haven't switched over to DPO yet, in spite of the fact that DPO is significantly faster than RLHF. It might just be organizational inertia, but I do not know. I would be very happy if simpler alternatives like DPO were as good as RLHF or better, but I haven't seen that proof yet.
Speaking as an ML researcher: infrastructure libraries need to show how to train a production grade model, or else they’re useless for research. This is why research is hard. You keep handwaving this in various ways, but if you want ML researchers to take this seriously, you need a serious example.
"Production grade" doesn’t mean that it needs to have a deployable API. It means the model needs to not suck. And until your training code can train a model that doesn’t suck, every ML researcher will view this and think "this code is guaranteed to produce a model that sucks," since there’s no evidence to the contrary. It’s incredibly hard to get the details right, and I can’t count the number of times I’ve had to track down some obscure bug buried deep within abstraction layers.
I’m trying to help you here. Ask yourself: who are my users? Are your users ML researchers? I already explained the problems we have, and why your library doesn’t meet those needs. Are your users ML hobbyists? You’ve already said no to this, and I think that’s a mistake. Most ML researchers behave as hobbyists, in the sense that they’re always looking for simple, understandable examples. Your library gives that, but without any of the rigor necessary to show that it can be trusted. Are your users ML devops, since it’s infrastructure? No, because it’s training models.
So you’re excluding every possible user, whether you realize it or not. But we’ll see; in a few months, if your library has significant traction, I’m empirically wrong. But I’m trying to help you avoid the default outcome, where nobody uses your code because you’re not designing it for any particular user.
But I hear you that it would be useful to also have some examples that show a proper, reliable model being trained with the library vs. just example models. The project is pretty early, and we'll work on adding more examples.