How well does ChatGPT speak Japanese? (passaglia.jp)
> tokenizes text in a linguistically equitable way
> ensures that linguistic diversity is prioritized throughout the training process
You’d get very different results if you used a different language, and my guess is that if you applied BPE to a huge corpus that was balanced (same amount of Japanese, Korean, French, …), you’d get something mediocre for all of them.
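To make the mechanism behind that guess concrete, here is a toy sketch of BPE merge learning. It works on characters rather than bytes and is not how any production tokenizer is implemented; `learn_bpe_merges` and the tiny corpus are made up for illustration. It only shows that a fixed merge budget goes to whichever pairs are most frequent in the corpus, it does not prove anything about quality on a balanced corpus.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy character-level BPE: repeatedly merge the most frequent adjacent pair."""
    # Each word starts as a tuple of single characters, with a frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# Hypothetical English-heavy corpus: both merges go to English pairs.
skewed = ["the"] * 10 + ["ねこ"]
print(learn_bpe_merges(skewed, 2))
```

With only two merges available, the Japanese word stays split into single characters. Change the corpus mix and the same budget gets spent differently.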
On the other hand, this article seems to show that GPT-4 does a good job with a terrible tokenization, which leaves me thinking: could we just give up on word parts entirely and fall back on character (or UTF-8 byte) modeling?
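For a rough sense of what that fallback would cost, here is a small sketch comparing sequence lengths under character-level and UTF-8 byte-level modeling. `sequence_lengths` is a hypothetical helper, and the two strings are just examples I picked.

```python
def sequence_lengths(text):
    """Return (number of code points, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))

for sample in ["Hello", "こんにちは"]:
    chars, utf8_bytes = sequence_lengths(sample)
    print(f"{sample}: {chars} chars, {utf8_bytes} bytes")
```

Both strings are 5 characters, but hiragana takes 3 bytes each in UTF-8, so the Japanese one is 15 bytes against 5 for the English one. Byte-level modeling would still make Japanese sequences longer than character-level modeling would.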