Would you say the following understanding is correct?
- You can fine-tune a model, regardless of whether it has been quantized (as in the 4-bit versions of models made to fit in consumer grade RAM sizes) or not.
- You can fine-tune any model on any hardware, provided it fits into RAM. That means that the 30B LLaMA-derived models, in their 4-bit quantized version with a 19.5 GB VRAM requirement, can be fine-tuned on consumer-grade GPUs with 24 GB of VRAM (like the RTX 3090 and 4090).
On the second point, I'm not sure the memory requirements for training are the same as for inference, because you also have to keep training state (gradients, optimizer state, activations), which takes extra memory.
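A rough back-of-the-envelope sketch of that extra state, in case it helps. The 16 bytes per trainable parameter assumes fp32 weights, fp32 gradients, and two fp32 Adam moments; the 50M adapter size is a made-up number for illustration, and activations and quantization overhead are ignored entirely, so treat the results as lower bounds rather than real VRAM figures.

```python
def finetune_memory_gb(n_frozen_params, weight_bits, trainable_params,
                       train_bytes_per_param=16):
    """Rough lower bound on fine-tuning memory in GB, ignoring activations.

    Frozen base weights are stored at `weight_bits` bits per parameter.
    Each trainable parameter is assumed to cost 16 bytes:
    fp32 weight (4) + fp32 gradient (4) + two fp32 Adam moments (8).
    """
    frozen_bytes = n_frozen_params * weight_bits / 8
    trainable_bytes = trainable_params * train_bytes_per_param
    return (frozen_bytes + trainable_bytes) / 1e9


# Hypothetical: 30B frozen parameters in 4-bit plus 50M trainable adapter parameters.
adapter_gb = finetune_memory_gb(30e9, 4, 50e6)
# Same model with every parameter trainable in fp32 (nothing frozen).
full_gb = finetune_memory_gb(0, 4, 30e9)

print(f"4-bit frozen base + adapters: {adapter_gb:.1f} GB")
print(f"full fp32 fine-tune:          {full_gb:.1f} GB")
```

Under these assumptions, the training state only stays small if almost all parameters are frozen, which is why the quantized-base-plus-adapter setup is the one that plausibly fits on a 24 GB card.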
But nonetheless, training time improvements look interesting.
Edit: Oh I see, the training time improvement is compared to a grid search over the LoRA rank, not to a single run.
I am not convinced that you shouldn't just train on the highest possible rank that you can with your compute budget. If you can train a DynLoRA with rank 8, why not just train a LoRA with that rank?
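For a sense of what "highest rank within budget" costs, here is a small sketch of how LoRA's trainable parameter count scales with rank for a single weight matrix. It uses the standard LoRA factorization (a rank x d_in matrix and a d_out x rank matrix); the 4096 x 4096 layer size is a hypothetical value chosen for illustration.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix:
    A is rank x d_in and B is d_out x rank."""
    return rank * (d_in + d_out)


# Hypothetical layer size, for illustration only.
d = 4096
for rank in (1, 2, 4, 8, 16):
    added = lora_params(d, d, rank)
    share = added / (d * d)
    print(f"rank {rank:>2}: {added:>7} params ({share:.3%} of the full matrix)")
```

The count grows linearly in the rank, so doubling the rank doubles the adapter's weights, gradients, and optimizer state.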
Maybe if the "optimal rank" of LoRA applies to any adaptation and you are interested in training multiple adaptations for different use cases?
I am not convinced that the "best rank" is not just the highest possible with your compute budget, personally.
It seems like they use a fixed-distribution controller for training. It’d be nice to see why it’s worth deviating from the original RL paradigm.
But if you have some capacity constraint (e.g., memory, I guess?) then you can imagine dynamic rank allocation helping in the case where the maximum rank across all layers isn't within budget.
It's a bit of a stretch though, I agree
Seems complicated but I could see it being useful potentially.
I have yet to understand the difference between fine-tuning and training, and therefore whether a distributed, decentralized, eventually consistent training approach is a possibility or simply not realistic.
It becomes an empirical engineering question how many parallel nodes you can train on for how long before averaging them back together. It's an expensive question to answer, since you have to train many variations to get the data.
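To make the "train in parallel, then average back together" idea concrete, here is a minimal toy sketch of periodic weight averaging (often called local SGD). Everything in it is an assumption for illustration: a one-parameter least-squares problem, four workers, and made-up step counts. The problem is convex, which is the benign case where averaging is harmless; real loss landscapes need not behave this way.

```python
import random


def local_sgd(num_workers=4, local_steps=8, rounds=50, lr=0.1, seed=0):
    """Toy local SGD: each worker takes `local_steps` SGD steps on its own
    data shard, then all workers average their weights and continue."""
    rng = random.Random(seed)
    true_w = 3.0

    # Each worker gets its own shard of noisy (x, y) pairs from y = true_w * x.
    shards = []
    for _ in range(num_workers):
        xs = [rng.uniform(-1, 1) for _ in range(32)]
        shards.append([(x, true_w * x + rng.gauss(0, 0.1)) for x in xs])

    w = 0.0  # shared starting point
    for _ in range(rounds):
        local_ws = []
        for shard in shards:
            w_local = w
            for _ in range(local_steps):
                x, y = rng.choice(shard)
                grad = 2 * (w_local * x - y) * x  # d/dw of (w*x - y)^2
                w_local -= lr * grad
            local_ws.append(w_local)
        # Synchronization point: average the diverged copies back together.
        w = sum(local_ws) / num_workers
    return w


w = local_sgd()
print(f"recovered w = {w:.3f} (data generated with w = 3.0)")
```

The knobs the comment above is asking about are `num_workers` and `local_steps`: how far the copies can drift between synchronization points before the average stops being a useful model.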
It's basically not possible to do what you are trying to do in an async manner. With advancements in large batch gradients, it might be possible to do some sort of synchronous P2P gradient averaging.
What about with some fairly frequent and periodic synchronization?
Is there potentially some balance where small enough subsets can be chosen and disparate workers broadcast the small changes at small enough intervals that the net gain in learnings is still larger than the loss in fit due to de-cohesion? I was thinking maybe this algorithm would be 10x less energy efficient but have the benefit of decentralization. Something along those lines.
I’m guessing the current training algorithms already do something like this, but since faster synchronization always improves efficiency (in the extreme, that giant single-wafer chip), OpenAI and others use systems with high interconnect bandwidth.
> where small enough subsets can be chosen and disparate workers broadcast the small changes at small enough intervals that the net gain in learnings is still larger than the loss in fit due to de-cohesion
I think this probably depends on the terrain of your loss landscape. My intuition is that many are too spiky: if you take a step or two in each of your subsets and then average them, you will end up on a steep hill rather than in a valley between your two points.
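A deliberately exaggerated toy version of that intuition, as a sketch: a one-dimensional double-well loss (my own choice, not taken from any real model) where two workers each settle into a different valley, and their average sits on the ridge between them.

```python
def loss(x):
    """Toy non-convex loss with two valleys, at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2


def grad(x):
    """Derivative of the toy loss."""
    return 4.0 * x * (x * x - 1.0)


def descend(x, steps=200, lr=0.05):
    """Plain gradient descent from starting point x."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x


# Two workers start on opposite sides and each settles into its own valley.
a = descend(-0.3)
b = descend(0.3)
avg = (a + b) / 2

print(f"worker A: x={a:.3f}, loss={loss(a):.4f}")
print(f"worker B: x={b:.3f}, loss={loss(b):.4f}")
print(f"average:  x={avg:.3f}, loss={loss(avg):.4f}")
```

Both workers end up with near-zero loss on their own, while the averaged point has a much higher loss than either. Whether real networks look more like this or more like a single wide basin is exactly the empirical question.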
But this is an active area of research for sure.