Why is GPT-3 15.77x more expensive for certain languages? (denyslinkov.medium.com)
For diacritics in French or Spanish, diacritics are logically characters. I can't think of an example where it's actually useful to split the letter into a different token, but I could see it happening and not being harmful. I do think it's possible French is just weird and just needs more tokens. When I think about how I process French, I probably do treat e.g. "Je l'ai aimé" (a pathological example) as 3 tokens when I speak it out loud. But I can also see why you would tokenize it as 6 tokens; I'm not sure that's Anglocentrism so much as it's recognizing a complexity difference between French and English writing.
But all this is in contrast to how non-Roman characters are tokenized at the byte level. That just seems bad, and it's definitely going to make things worse for non-Roman languages. There's no point in having tokens that split characters.
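To make the byte-level point concrete, here is a minimal sketch in standard-library Python. The sample characters are picked here for illustration and are not taken from the article. It counts the UTF-8 bytes behind one character in several scripts. A tokenizer that falls back to raw bytes can spend up to one token per byte, and a cut inside a character leaves bytes that do not decode on their own.

```python
# One sample character per script (illustrative choices, written as escapes
# so the exact code point is unambiguous).
samples = {
    "Latin a": "\u0061",
    "Latin e-acute": "\u00e9",   # é, precomposed
    "Cyrillic de": "\u0434",     # д
    "Katakana tsu": "\u30c4",    # ツ
    "Hangul dog": "\ub3c5",      # 독
    "Malayalam ma": "\u0d2e",    # മ
}


def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


byte_counts = {label: utf8_len(ch) for label, ch in samples.items()}
for label, n in byte_counts.items():
    print(f"{label}: {n} byte(s)")

# Cutting a multi-byte character in the middle leaves an undecodable prefix.
partial = "\ub3c5".encode("utf-8")[:2]
try:
    partial.decode("utf-8")
    partial_decodes = True
except UnicodeDecodeError:
    partial_decodes = False
print("first 2 bytes of the Hangul sample decode on their own:", partial_decodes)
```

Whether a given model really splits a character this way depends on which byte sequences made it into its vocabulary, so treat this as the worst case rather than a description of any specific tokenizer.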
I'm no linguist, so I apologize if I'm misinterpreting this statement. My impression has always been that Spanish is less dense than English, only because in almost all cases, the Spanish version of product instructions is wordier. Look at the back of a shampoo bottle[0] and notice that the Spanish version is either longer, or a smaller font, to fit it all.
[0] https://i.postimg.cc/xd2X5WJN/Ghub-Fo-N11u8jz-Pjj-RDt-W-CGA9...
One area where Spanish is more dense is verb forms, because it retains most of the inflected verbs of Latin, whereas English has lost or merged together a lot of the historical Indo-European inflections. Speaking intuitively, I think it, like most Latin languages, tends to be a bit more verbose with noun phrases.
> One of the models listed above called NLLB (No Language Left Behind) has been open sourced by Facebook allowing for translation for 200 languages.
It was not. The model's weights are under CC-BY-NC, which certainly motivates commercial entities to not leave those languages behind. /s
[0]: https://en.m.wikipedia.org/wiki/The_Open_Source_Definition
I wouldn't release a chatbot based on LLaMA 65B, because of the legal issues, I'm not sure others are using the same restraint.
By the time a copyright dispute makes it to trial these companies will be able to hide behind actual killer robots.
I sometimes wonder what it takes to unseat a lingua franca, but it looks like we won't see that soon. English is set to dominate for a long time.
I think even humans have to spend extra energy to speak a language they were not born with, no matter how fluent they are in it. I don't know about natural multilinguals.
You can test it offline using tiktoken: https://github.com/openai/tiktoken
The tokenization speedups in that repo are very impressive. It was the most annoying part about processing 190,000 books. I think it took a few days on a server with 96 cores.
Surprisingly hard to figure out the vocab size from that repo.
For example, what would the real-world performance of ChatGPT be if we had trained it predominantly on German or Korean text?
Is English actually the best language/structure for this system?
The author compares different encoders: Facebook's NLLB and GPT-2. Where did the title come from?
Another point is that OpenAI changed encoders for chat models. Link: https://github.com/openai/openai-cookbook/blob/main/examples...
Now English is less optimized for token usage and other languages are much more balanced. E.g. Ukrainian takes only twice as many tokens; before, it took 6 times as many.
It's not just broken grammar, it's a surprising lack of creativity that English doesn't suffer from. Running ChatGPT's English output through DeepL and fixing the auto-translation gives vastly better results than prompting ChatGPT to respond in an Asian language.
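As a rough illustration of what those multipliers mean for a bill, assuming a flat per-token price: the price and the English token count below are hypothetical placeholders. Only the 6x and 2x ratios come from the comment above.

```python
def cost_usd(n_tokens: int, usd_per_1k_tokens: float) -> float:
    """Cost of a request billed per 1,000 tokens."""
    return n_tokens / 1000 * usd_per_1k_tokens


def cost_multiplier(tokens_other: int, tokens_english: int) -> float:
    """How many times more tokens (and therefore money) the other language needs."""
    return tokens_other / tokens_english


PRICE_PER_1K = 0.02      # hypothetical price in USD, not a real rate
ENGLISH_TOKENS = 1000    # hypothetical size of an English prompt

old_ukrainian = 6 * ENGLISH_TOKENS   # older encoder, per the comment
new_ukrainian = 2 * ENGLISH_TOKENS   # newer encoder, per the comment

for label, n in [("English", ENGLISH_TOKENS),
                 ("Ukrainian (old)", old_ukrainian),
                 ("Ukrainian (new)", new_ukrainian)]:
    ratio = cost_multiplier(n, ENGLISH_TOKENS)
    print(f"{label}: {n} tokens, ${cost_usd(n, PRICE_PER_1K):.2f}, {ratio:.1f}x")
```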
Of course you'd end up with a lot more tokens. Just tokenize by word regardless of language.
But there are tons of languages (not just CJK languages) that use either compounding or combos of root + prefix/suffix/infix to express what would be multiple words in English. E.g. German 'Schadenfreude'. It's actually way more useful to tokenize this as separate parts because e.g. 'Freude' might be part of a lot of other "words" as well. So you can share that token across a lot of words, thereby keeping the vocab compact.
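A minimal sketch of that sharing idea, using a greedy longest-match split. The piece list and the word list are made up for the example, and real tokenizers learn their pieces from data rather than from a hand-written set.

```python
from typing import List, Set


def segment(word: str, pieces: Set[str]) -> List[str]:
    """Greedy longest-match split; unknown characters become single-character pieces."""
    word = word.lower()
    out: List[str] = []
    i = 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in pieces:
                out.append(word[i:j])
                i = j
                break
        else:
            out.append(word[i])
            i += 1
    return out


# Hypothetical vocabulary of reusable parts.
PIECES = {"schaden", "freude", "vor", "lebens", "ersatz"}
words = ["Schadenfreude", "Vorfreude", "Lebensfreude", "Schadenersatz"]

for w in words:
    print(w, "->", segment(w, PIECES))

# Number of words that reuse the single 'freude' piece.
shared = sum("freude" in segment(w, PIECES) for w in words)
print("words sharing 'freude':", shared)
```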
CJK languages do not really have that, they don't even have conjugation. They have simple suffixes at best to mark a verb as being interrogative.
Words do exist in all CJK languages and are meaningful. Tokenizing by Hangul radicals doesn't really make sense if you're not going to tokenize by decomposed letter in romance languages too (e.g. hôtel would be h, o, ^, t, e, l).
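The "h, o, ^, t, e, l" split corresponds to Unicode canonical decomposition (NFD). A small sketch with Python's standard `unicodedata` module, where the word is written with an escape so that the precomposed ô is guaranteed:

```python
import unicodedata

word = "h\u00f4tel"  # "hôtel" with a precomposed ô (U+00F4)

# NFD separates the base letter from its combining accent.
decomposed = unicodedata.normalize("NFD", word)
names = [unicodedata.name(ch) for ch in decomposed]

print(len(word), "code points before,", len(decomposed), "after")
for name in names:
    print(" ", name)

# NFC puts the pieces back together.
recomposed = unicodedata.normalize("NFC", decomposed)
print("round trip ok:", recomposed == word)
```

The same normalization applied to a precomposed Hangul syllable yields its individual jamo, which is the analogy the comment is drawing.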
The only language that doesn't is Thai, but there are still well-documented algorithms for it.
Yes, there is overhead from localization. So what, this overhead was always there for software.
- “I want a pizza” = 4 tokens
- “Je voudrais une pizza” = 7 tokens
Why is “want” only 1 token in English, but “voudrais” 4 tokens? Following the French example, would “wants” and “wanted” map to 1 or 2 tokens?
For example, if you were to go from French, you'd have 33 characters to work with rather than 26 (accents and such). And you'd have chemisier and chemisière being two different genders of the same word that are used in different contexts.
English tends to not have this difference.
Likewise, French has more verb conjugation forms than English does.
If you were to go to Japanese, you'd have the hiragana, katakana and kanji.
While my Anglocentrism may be showing, I'm not sure there is another language that tokenizes as well when it comes to novel character combinations.
Make up a new word. Use it in a sentence. Give a definition for it.
My new word is 'diflubble'. It is the feeling one gets when they are both excited and nervous in anticipation of an upcoming event.
For example, I felt diflubble on the morning of my graduation ceremony.
vs: Make up a new word in Japanese. Use it in a sentence and give a translation for it. Give a definition for it.
My new Japanese word is "keigarou", which means "being full of energy".
例えば、私は今日、keigarouな気持ちでいます。
Translation: For example, I am feeling keigarou today.
The thing there is that you can't just make up new kanji. And it wouldn't be hiragana either.
Take "lampara" or "pantalones" in Spanish for example. English speakers were clever enough to shorten those words to "lamp" and "pants" respectively. And they have done this with many words.
Translate text into Spanish and you will see text gets longer and there is more meaning encoded into words.
"La mesa" refers to a female table, although tables are not lifeforms and have no sex.
To me some languages impose a communication tax. It is taboo because people conflate language and culture and such.
GPT-3 is fluent in many languages despite English taking up 93% of the corpus by word count. French is next with 1.8%
https://github.com/openai/gpt-3/blob/master/dataset_statisti...
Dunno the statistics of language presence with GPT-4 but it takes it up another level in terms of its multilingual capabilities.
Going one step further, I asked it to "translate" into both "Riksmål", an artificial conservative variant of Bokmål that basically rejects most of the last few decades worth of language reforms, as well as Romeriksdialect (dialect from the Eastern part of Norway)... For the latter it gave me a lecture about how it varies internally in the region (which is correct) and presented a "translation" of a test sentence that is recognisably one of the variants from the Northern part of the region.
Of course, for these, competency definitely bleeds over. They share an almost identical grammar and most of their orthography, but I'm impressed enough that it can handle Norwegian that well at all, much less that it knows the distinctions between the variants.
Is it?
Are you sure about that? Most of the media we see, sure, but there has been, and still is a lot of media being produced in other languages.
GPT-4 has much more than a 32k token vocabulary (GPT-3 seems to have had up to 175k, GPT-2 in the neighborhood of 50k, based on the max value reported for their tokenizers). What it has is a 32k token context window (that is, the maximum size of prompt + response), not a 32k vocab.
But tokens are generally semantically significant parts of words (often whole words), not just letters or the equivalent. So, while you might get most alphabets in less than a thousand, you need a lot more than an alphabet to handle a language.
While that is true (14 consonants, 10 vowels [0]), there are encodings for Korean that encode at the syllable level (where each syllable contains one or two consonants and one vowel) and the combinations for syllables are over 10000 (e.g. 11172 code points listed in Unicode, see [1]).
[0] In practice, more, both to cater for modern and obsolete forms and to distinguish the forms based on their position, i.e. with separate encodings for leading vs trailing consonants etc.
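The 11172 figure can be checked with a little arithmetic. This sketch assumes the standard Unicode layout for precomposed Hangul syllables: 19 initial consonants, 21 vowels, and 27 final consonants plus the "no final" case. Those counts are higher than the 14 + 10 basic letters because doubled and compound jamo are included.

```python
# Jamo counts used by the Unicode Hangul Syllables block.
INITIALS = 19   # leading consonants
MEDIALS = 21    # vowels
FINALS = 28     # 27 trailing consonants + 1 for "no trailing consonant"

combinations = INITIALS * MEDIALS * FINALS

# The block runs from U+AC00 to U+D7A3 inclusive.
block_size = 0xD7A3 - 0xAC00 + 1

print("combinations:", combinations)
print("code points in block:", block_size)
```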
And tossing ドイツ into the tokenizer shows that it is 3 tokens.
Consider also the question "is it useful to just tokenize hiragana or katakana and not all of the kanji characters?"
The glyph-by-glyph approach to tokenization of non-English text is already present the way that you are describing it, and because it is glyph by glyph, it gets expanded out and consumes more tokens.
Korean gets rather interesting because 독일 is not one character but several - multiple sounds are combined into one glyph and each glyph is one syllable. That word is 'dog-il' according to google translate. On the first glyph, ㄷ is 'd' and ㅗ is 'o' and ㄱ is a trailing 'g'. On the second glyph ㅣ is 'i' and ㄹ is a trailing 'l'.
Likewise, its GPT tokenization is 5 tokens.
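The breakdown of 독일 described above can be reproduced with the standard Unicode arithmetic for Hangul syllables. This is a sketch that works on code points only and says nothing definite about what the GPT tokenizer does internally. Note that the second syllable also carries a silent initial ㅇ, so each syllable decomposes into three jamo.

```python
from typing import List

# Constants from the Unicode Hangul composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28


def decompose(syllable: str) -> List[str]:
    """Split one precomposed Hangul syllable into its conjoining jamo."""
    index = ord(syllable) - S_BASE
    lead = index // (V_COUNT * T_COUNT)
    vowel = (index % (V_COUNT * T_COUNT)) // T_COUNT
    tail = index % T_COUNT
    jamo = [chr(L_BASE + lead), chr(V_BASE + vowel)]
    if tail:  # index 0 means "no trailing consonant"
        jamo.append(chr(T_BASE + tail))
    return jamo


word = "\ub3c5\uc77c"  # 독일
jamo_per_syllable = [decompose(s) for s in word]
utf8_bytes = len(word.encode("utf-8"))

for s, jamo in zip(word, jamo_per_syllable):
    print(s, "->", [f"U+{ord(j):04X}" for j in jamo])
print("UTF-8 bytes:", utf8_bytes)
```

The word is 2 syllables, 6 jamo, and 6 UTF-8 bytes, so if the tokenizer is working from bytes, a count of 5 tokens would mean that very few of those bytes were merged into larger tokens.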
That's the idea of byte pair encoding based tokenizers: reduce the average sentence's number of tokens to an optimal (short) size to reduce the computational cost. In this case, most of its training data is in English, so it's going to have shorter sentences (in number of tokens) in English vs other languages.
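A toy version of that effect, as a sketch only: this is a bare-bones byte-level BPE trainer, not OpenAI's implementation, and the tiny corpus (two English words repeated, one Ukrainian word once) is invented to mimic an English-heavy dataset.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges, always taking the most frequent pair."""
    words = Counter(tuple(bytes([b]) for b in w.encode("utf-8")) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[tuple(apply_merge(symbols, best))] += freq
        words = merged_words
    return merges


def encode(word, merges):
    """Start from single bytes and replay the learned merges in order."""
    symbols = [bytes([b]) for b in word.encode("utf-8")]
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Invented corpus: English dominates, Ukrainian appears once.
corpus = ["pizza"] * 9 + ["want"] * 9 + ["піца"]
merges = train_bpe(corpus, num_merges=7)

for w in ("pizza", "want", "піца"):
    print(w, "->", len(encode(w, merges)), "token(s)")
```

With a merge budget this small, every merge is spent on the frequent English byte pairs, and the Ukrainian word stays at one token per byte even though it was in the training data.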
I am curious how the token set affects quality of responses, ignoring the factors related to token count mentioned in the post (cost, prompt expressivity, latency, etc)
Is it always better for the token set to be "native" to the majority of the training dataset and prompts/completions, or is it possible there's some "intermediate representation" (in compiler terms) that would be better?
The tokenizer used to train GPT-3 was old, inefficient and targeted at tokenizing English. That's pretty much all there is to it. It's possible to train a tokenizer that is more efficient and more inclusive of other languages.
GPT-4's tokenizer is already far more efficient though still weighted to English.
You can test it here https://tiktokenizer.vercel.app/
BTW, English might have shorter words than many languages, but the sentences get wordier. For example, English "die" is shorter than Czech "umřít", but the sentence "We are going to die." is much longer than "Umřeme." in Czech.
The linked article shows one such language, Malayalam, costing 15.7 times more. Try again.
But when it comes to Chinese...something weird is going on.
The reason Spanish might encode longer is that the tokenization scheme compacts tokens based on popularity in the training data, and most training data was English. No more, no less.
See also (great read): https://pubmed.ncbi.nlm.nih.gov/31006626/
wrt your Spanish example: grammatical gender adds information redundancy to make it easier to process spoken language (e.g. helps with reference resolution). This redundancy enables Spanish speakers to speak at a relatively fast rate without incurring perception errors. English has fewer words but a slower speech rate. It's an optimization problem.
The speech rate issue isn't as obvious if you're only looking at text, but I'd argue/speculate that lossless speech as a language evolutionary constraint has implications for learnability.
tl;dr there is no communication tax, languages are basically equivalent wrt information rate, they just solved the optimization problem of compactness vs speech rate differently
(But I guess I also won't be surprised if the OpenAI guys can't write algorithms worth spit if it's not a large matrix multiplication.)
It *IS* accurate to say that the program and source must be free to use for all purposes the recipient wants including commercial.
This was an involved conversation that was had during the 90's and yes, "no commercial reuse allowed" licenses are not, in fact, free licenses. I might be wrong, but I have the impression they are not allowed on Debian CDs/DVDs for that reason.
> If I use a piece of software that has been obtained under the GNU GPL, am I allowed to modify the original code into a new program, then distribute and sell that new program commercially? (#GPLCommercially)
> You are allowed to sell copies of the modified program commercially, but only under the terms of the GNU GPL. Thus, for instance, you must make the source code available to the users of the program as described in the GPL, and they must be allowed to redistribute and modify it as described in the GPL.
> These requirements are the condition for including the GPL-covered code you received in a program of your own.
https://www.gnu.org/licenses/gpl-faq.en.html#GPLCommercially
The BSD and MIT licenses
>Source code: The program must include source code, and must allow distribution in source code as well as compiled form. Where some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source code for no more than a reasonable reproduction cost preferably, downloading via the Internet without charge. The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.
Does this apply to Linux? Check.
>Derived works: The license must allow modifications and derived works, and must allow them to be distributed under the same terms as the license of the original software.
Does this apply to Linux? Check.
Please visit https://opensource.org/licenses/ to see all of the licenses that are generally agreed upon to be Open Source licenses.
Tokens for non-English languages that are groups of characters just suggest that common groups of 2-3 characters from the training set became tokens, which feels unsurprising. The fallback behavior would be 1 UTF-8 byte = 1 token.
I mean, OpenAI is a US company, so it is unsurprisingly going to mostly communicate in English.
Are we counting all societies? If so, should our software cater to all their demands, linguistic and/or cultural?
The vocab size itself is doubled. (~50k for GPT-2/3, ~100k for ChatGPT)
It certainly makes training more expensive. One clever trick to get some memory savings is to freeze the vocab embedding layer when fine tuning. It makes a noticeable improvement, both in speed and in mem required.
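A back-of-the-envelope sketch of why freezing the embedding helps with memory. The assumptions are all mine: fp32 weights, an Adam-style optimizer that keeps two extra buffers per trainable parameter, and a hypothetical hidden size. Only the rough 50k and 100k vocab sizes come from the thread.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of weights in a vocab_size x d_model embedding table."""
    return vocab_size * d_model


def training_bytes(n_params: int, trainable: bool) -> int:
    """fp32 weights, plus a gradient and two Adam moment buffers if trainable."""
    per_param = 4                 # the weight itself
    if trainable:
        per_param += 4 + 8        # gradient + two fp32 moment estimates
    return n_params * per_param


D_MODEL = 4096  # hypothetical hidden size, chosen only for illustration

for vocab in (50_000, 100_000):
    n = embedding_params(vocab, D_MODEL)
    saved = training_bytes(n, True) - training_bytes(n, False)
    print(f"vocab {vocab}: {n / 1e6:.1f}M embedding weights, "
          f"about {saved / 1e9:.2f} GB saved by freezing")
```

Mixed precision, tied input/output embeddings, or a different optimizer would change the numbers, so this only shows the direction of the effect.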
Surprised they went the larger vocab route. LLaMA is only 30k. I wonder what the reason is...
Thanks!
Remember, OpenAI's tokenizer was created in an era when 125MB was considered large for a language model. It's hard to fault them for making something that lasted four or five years.
GPT-2 and GPT-3 have different vocabularies and maximum token #s, which (even if the tokenizer architecture is the same) implies a different tokenizer model. GPT-3.5 might share the GPT-3 tokenizer, but even then I’d expect GPT-4 to have its own.
But even if they are using the tokenizer from GPT-3, it's not from “an era when 125MB was considered large for a language model”.
You had me questioning myself for a minute.
(The vocab size is still 50257. Even rounded up to a multiple of 128 for better sharding across the vocab embedding, only the first 50257 are used.)
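The rounding mentioned here is plain ceiling arithmetic. A small sketch, where the helper name is made up for the example:

```python
def pad_vocab(vocab_size: int, multiple: int = 128) -> int:
    """Round vocab_size up to the next multiple (ceiling division)."""
    return -(-vocab_size // multiple) * multiple


VOCAB = 50257
padded = pad_vocab(VOCAB)
unused_rows = padded - VOCAB

print("padded size:", padded)
print("embedding rows never used:", unused_rows)
```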
Believe it or not, 125M was large at the start of the GPT-2 era. No one knew LLMs could do anything interesting, let alone that they'd change the world.
I tested it by giving it some news articles from NRK Sápmi, and comparing it with the Norwegian translation they have.
Edit: Seems I may have gotten lucky that time, it's being a lot more, um, creative in its translation now. Or for all I know it could be changes in the model.
Depends what you're doing. I haven't managed to make it continue after it stopped in the middle of a sentence in Japanese, but giving it the instruction to do so in English does. In some other cases, prompting in English (and asking for an answer in Japanese) can produce better results than giving the same prompt in Japanese.
Generating Japanese is slower than English (it's annoying on GPT-4), and that's my reason to prefer English sometimes (especially for tech topics). ChatGPT web users don't pay for each token, but API users do, so they would make a different decision.
Right. It's a general question. Should they be allowed to take the kinds of optimizations they can with tokenization when it's a function of how much data they can use, even if that means some languages get more optimization than others? Or should users of those languages that could be optimized effectively pay a tax out of some sense of fairness?
Everyone has to choose what data to train on, you can't train against The Entire Internet, it's a limitless amount of data. But it becomes an intentional choice with consequences, like the 15.77x seen here.
Isn't that exactly how OpenAI managed to 10x GPT 3.5 with GPT 4.0?
Or is it? The original source content was in a mix of languages: English, Japanese, Chinese, Korean, etc. It only then got translated into English and tagged in English. So if you had trained on the original source content, you would have been training on a mix of languages, but that got erased for the convenience of the people training the network.
But anyways the danbooru tags consist of things like: short hair, blue eyes, portrait. Things that are much easier to translate (or "understand") in several languages than the entire phrases GPT works with.
Though I feel that's answering a slightly different question. Data used to train currently popular models is mostly English, and the majority of data in sources popular in the anglosphere is English. Neither of these shows whether the majority of available media is English.
https://www.reddit.com/r/libgen/comments/r3lzg2/top_15_langu...
https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia#Co...
Comparative analysis of language isn't taboo [1]. It's just vastly more complicated than you make out, and the specific examples you chose aren't representative enough to support any point.
You're likely getting downvoted for misunderstanding basic socio-linguistic concepts that belie the confidence of your arguing: conflating biological and grammatical gender, implying that English was created by a committee of clever language designers, a focus on letters and words over concepts and comprehension.
As a lowly beginner I find the lack of word boundaries in Thai frustrating but I think it’s just that I have not yet learned to think in syllables, I’m still always sounding them out in my head until I have a word I recognize, there’s no flow.
This seems like something the LLMs should be very good at. Google Translate does OK-ish while Apple just throws up its hands in frustration and refuses to translate Thai texts.
This is the software licensing world's version of "a hotdog is not a sandwich"
For the last 25 years, in the software field, the common usage has been the one from the OSI.
Of course contrarians will disagree. Contrarians will always be contrary and may be safely ignored.
1. Intelligence community people, who have long understood the term “open source” to mean a source of intelligence which is not itself secret.
2. People who, without having ever looked it up, assume it means that the source code is available for reading. These people are simply ignorant, and should be using the term “source available” instead, since it means exactly that.
3. People who want to be able to use the “open source” term for their software to gain goodwill, but don’t want to actually give all of the freedoms it should guarantee. These people are dishonest shills who try to confuse the debate in order to get away with fraudulent labeling.
(Repost of https://news.ycombinator.com/item?id=29332056)
Or is "open source" just a term for "free" as in beer software that doesn't actually give people all the freedoms it should guarantee? Because that's what the FSF thinks.
Different people have different ideas about what freedoms people "should" have. Nobody was being dishonest about software freedoms when the BSD-4-clause or CC0 was written, or when they write licenses with 'no evil' or 'no nuclear proliferation' clauses.
It is also not data collected by an intelligence agency or law enforcement from public sources.
It's a model, not a program. So that definition can't possibly apply to it, no matter the license.
No it isn’t. The OSI invented the term "Open Source” as applied to software, and they get to define its meaning as what they intended.
> Because that's what the FSF thinks.
No they don’t. The FSF completely accepts the OSI definition of the term “Open Source”: https://www.gnu.org/philosophy/open-source-misses-the-point....
The freedoms that a license "should" convey is not a fact, it is an opinion. And there are more than a few valid and honest opinions that exist, even beyond the opinions of FSF/OSI/CC/UCB/USG/Apache/FAANG/whoever
This debate is beyond silly. It’s like arguing about what the rules of, say, Settlers of Catan are. The rules are the official rules which come with the box; anything else is house rules or custom rules, and cannot be used in something like an official tournament. When people say that “Settlers of Catan does this thing X”, and the official rules expressly say it does not do X, they are (knowingly or not) being misleading.
All of these claims are untrue. Here is an example of open source being used to describe software in 1996. OSI was founded in 1998.
https://web.archive.org/web/20180402143912/http://www.xent.c...
> This debate is beyond silly. It’s like arguing about what the rules of, say, Settlers of Catan
Commercial board games typically use trademark law to prevent others from changing their rules. Popular games which do not have legally protected names often do have multiple sets of rules defined by different people, e.g. poker.
Interesting. The attendees of the meeting on February 3rd, 1998 certainly all seem to think that they at least independently re-invented the term, so the term can’t have been very common. The meeting was held two weeks after the announcement of the release of the Netscape source code, and the announcement did not use the term.
I am saying the exact opposite: that more than one definition may be valid, because the English lexicon is both descriptive and additive.
The model is CC. Because models aren't code.