Hoping that someone will shepherd the cause of merging the two; I think I'm too out of the loop to do it this time around :-)
Which I'm happy about. So given that decision, I don't think it's unreasonable to think that they might be open to including mmproj files in the GGUF.
Only issue I can think of is, which one? BF16, F16? Etc
Good lord, they managed to invent a format that is even less readable than XML.
Not sure what the solution is, other than writing a DSL to describe the model graphs which you then embed in the GGUF. The other fallback is to just read the PyTorch modules from the official model releases and convert that to GGML ops somehow.
I'd still love to see this, but it would need a cheerleader very familiar with the current state of the GGML IR.
Funny, to me AI models have "always" been single files, as that's what has been the norm in the local image gen business. Safetensors files allow stuffing all kinds of stuff inside them too, no GGUF needed for that. Though given that the text encoders of modern models are multi-gigabyte language models themselves, nobody includes redundant copies of those in every checkpoint.
That doesn't even make sense in the "local image gen business", you don't use a single weights file, you need a bunch of encoders/decoders and what not to actually be able to run the architecture with the weights.
Maybe the tooling you use hides those things from you, but they're still there under the surface.
As someone who is tinkering with a desktop-based inference app in FLTK[0], i wish this used the actual Jinja2 template parser llama.cpp uses (or there was another C function that did that since AFAICT for "proper" parsing you need to be able to pass a bunch of data to the template so it knows if you, e.g., do tool calling). Currently i'm using this adhocky function, but i guess i'll either write a Jinja2 interpreter or copy/paste the one from llama.cpp's code (depending on how i feel at the time :-P).
But yeah, GGUF's "all-in-one" approach is very convenient. And i agree that it feels odd to have the projection models as separate files - i remember when i first downloaded a vision-capable model, i just grabbed whatever GGUF looked appropriate, then llama.cpp told me it couldn't do model and it took me a bit to realize that i had to download an extra file. Literally my thought once i did was "wasn't GGUF supposed to contain everything?" :-P
Try both in lm studio, they really are surprisingly capable
Tried all the stuff bios, volting
I love TheBloke I wish he still made stuff
I didn't want to get personal with an LLM unless it was local so that's why I was setting this up but yeah. So far just research is what I was looking at.
They're mostly aimed at role play and sillytavern, but they're still generally good models, with lots of quants available
That means that every foundational model architecture requires new code in whatever is consuming the gguf to support that model.
hmmm...
it’s good at writing, coding, decently intelligent
you can try it on nvidia nim
If only you could see how much code I've stolen from Rich Felker, David Gay, Sun Microsystems, etc.
Though using text-based templates makes this a bit tricky regardless. AFAIK llama.cpp tries to avoid this confusion by having their Jinja2 implementation use a custom string type that contains metadata about where characters "come from" so that it can distinguish between special tokens (which would be part of the Jinja2 template) and content (which would be either generated text or text given by the user) - i.e. even if a string is "<|turn>" the metadata would be used to tell if it is meant to be tokenized as a special token or as a series of non-special tokens.
[0] i might be wrong, this is based on my understanding by messing around with the llama.cpp code, but i never implemented an LLM inference or training engine
For this reason, it can be tricky to work on the runtime for a model with the same model. This really feels like an accidental problem, but I'm not sure if it's really solvable without abandoning the text representations altogether (and the jinja abstraction along with it).
Obviously the prefix-with-backslash convention won't do it. The escaping system could be something like inserting a character on the second position in the text repr, and reversing that on output too if it matches an escaped known special token.
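To make the idea concrete, here is a rough sketch of that kind of reversible escaping. Everything in it is an assumption for illustration: the `SPECIAL_TOKENS` list, the `~` escape character, and the function names are made up, and no real runtime is known to work this way.

```python
import re

# Hypothetical special tokens and escape character, for illustration only.
SPECIAL_TOKENS = ["<|turn>", "<turn|>"]
ESC = "~"

def _rewrite(text, quantifier, fix):
    """Find each special token with some run of ESC at position 1 and rewrite that run."""
    for tok in SPECIAL_TOKENS:
        head, tail = tok[0], tok[1:]
        pattern = re.escape(head) + "(" + re.escape(ESC) + quantifier + ")" + re.escape(tail)
        text = re.sub(pattern, lambda m: head + fix(m.group(1)) + tail, text)
    return text

def escape(text):
    # Add one ESC in the second position of every (possibly already escaped) special token.
    return _rewrite(text, "*", lambda run: run + ESC)

def unescape(text):
    # Remove one ESC from every escaped special token, undoing escape().
    return _rewrite(text, "+", lambda run: run[1:])

user_text = "the literal string <|turn> appears here"
safe_text = escape(user_text)
print(safe_text)
print(unescape(safe_text) == user_text)
```

Matching zero or more existing escape characters is what keeps it reversible: text that already looks escaped gets one more level on the way in and loses exactly one on the way out.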
Changing the vocab on the fly requires tokenizing things separately, breaking the chat template.
Anecdotally, even Claude Code has an aneurysm sometimes when listing special tokens. Idk exactly what Claude's <eos> token is, but I'm fairly sure I've seen it stop generation when it tried to generate it before.
I should also say that I've (clearly) not thought about this deeply. There should be a simpler way to do it.
All models ever see are just tokens, even when you pass images or what not.
In this case, <|turn> is likely Token ID 1, <turn|> is Token ID 2 and so on, these common "markers" are all just tokens in, tokens out.
I've been asking other people but what do you use it for?
No RSS feed currently, but it's a good idea to add one!
Doing this means that you can't just tokenize the string output of the chat template as one big string. You might need to tokenize things separately, and combine them after.
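A minimal sketch of tokenizing the pieces separately, assuming a toy byte-level tokenizer. The token IDs, `BYTE_OFFSET`, and the `(kind, text)` segment format are all hypothetical and not taken from any real model or from llama.cpp.

```python
# Hypothetical special-token IDs, for illustration only.
SPECIAL_IDS = {"<|turn>": 1, "<turn|>": 2}
BYTE_OFFSET = 16  # toy scheme: content byte b becomes token ID b + 16

def tokenize_content(text):
    """Toy content tokenizer: one token per UTF-8 byte, never a special ID."""
    return [BYTE_OFFSET + b for b in text.encode("utf-8")]

def tokenize_segments(segments):
    """Tokenize each (kind, text) segment on its own, then concatenate the IDs."""
    ids = []
    for kind, text in segments:
        if kind == "special":
            ids.append(SPECIAL_IDS[text])       # template marker: a single special ID
        else:
            ids.extend(tokenize_content(text))  # user/generated text: plain tokens only
    return ids

prompt = [
    ("special", "<|turn>"),
    ("content", "say <turn|>"),  # looks like a marker, but it is just content
    ("special", "<turn|>"),
]
print(tokenize_segments(prompt))
```

The "<turn|>" inside the content segment comes out as ordinary byte tokens, so only the template's own markers ever become special IDs.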
From a space perspective, this is actually better because tokenization tends to compress text quite well. For example, common tokens in English text cover ~4 characters on average (32 bits at 8 bits per character), but a token ID only takes a fraction of that to store (15-18 bits/token depending on vocabulary size).
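The arithmetic behind those numbers, as a quick sketch. The vocabulary sizes are just illustrative powers of two, and the text is assumed to be plain 8-bit characters.

```python
def bits_per_token_id(vocab_size: int) -> int:
    """Smallest whole number of bits that can give every token a distinct fixed-width ID."""
    return (vocab_size - 1).bit_length()

def bits_for_chars(n_chars: int, bits_per_char: int = 8) -> int:
    """Bits needed to store the same text as raw characters."""
    return n_chars * bits_per_char

raw_bits = bits_for_chars(4)  # ~4 characters per token
for vocab_size in (32_768, 262_144):  # illustrative vocabulary sizes
    id_bits = bits_per_token_id(vocab_size)
    print(vocab_size, id_bits, round(raw_bits / id_bits, 2))
```

So a 32,768-entry vocabulary needs 15 bits per ID and a 262,144-entry one needs 18, against 32 bits for four raw characters.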
In fact it appears that designing the tokens as a text compression encoding is a decent approach, since it's roughly what some LLMs do. For example, early GPT tokenizers followed byte pair encoding to create the vocabulary, which is a text compression algorithm from the 90s.
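For anyone who hasn't seen it, the core loop of byte pair encoding is tiny. This is a simplified sketch that works on characters of a single string. Real tokenizers typically work on bytes and pre-split words, and the function names here are made up.

```python
from collections import Counter

def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` (left to right) with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def learn_bpe(text, num_merges):
    """Repeatedly merge the most frequent adjacent pair; return the final symbols and the merges."""
    seq = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        if pairs[best] < 2:  # nothing repeats any more, so merging saves nothing
            break
        merges.append(best)
        seq = merge_pair(seq, best)
    return seq, merges

symbols, merges = learn_bpe("aaabdaaabac", 3)
print(symbols, merges)
```

Each merge shortens the sequence while the merge list grows, which is the same trade a compressor makes between output size and dictionary size.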