Ask HN: I want to train an LM on my home country's dialect, how can I do it?

I'm from Algeria. The language almost everybody speaks day to day is a weird mix of different languages: French, Arabic, English, etc. I was thinking of grabbing tweets to use as fine-tuning data. I may be able to figure out other sources, but they're not going to be much better than that: short-form text for the most part. I was thinking of using one of the smaller models I came across recently (nanoGPT, for example) or something similar. I'm tech-savvy enough to make this work, but I'd like some feedback from people more knowledgeable than me before I put time and effort into it. Thanks!
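
For concreteness, here is a minimal sketch of what the tweet-preparation step could look like for a character-level model, assuming the tweets are already collected as a list of strings. The function names (`clean_tweet`, `build_dataset`), the cleaning rules (dropping URLs and @mentions), and the 10% validation split are all illustrative assumptions, not anything nanoGPT prescribes. A character-level vocabulary is used here only because it sidesteps the question of which tokenizer handles mixed French/Arabic/English text well.

```python
import random
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
SPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Drop URLs and @mentions, then collapse whitespace (assumed cleaning rules)."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    return SPACE_RE.sub(" ", text).strip()


def build_dataset(tweets, val_fraction=0.1, seed=0):
    """Clean, de-duplicate, split, and encode tweets as character ids."""
    seen = set()
    cleaned = []
    for tweet in tweets:
        c = clean_tweet(tweet)
        if c and c not in seen:  # skip empty results and exact duplicates
            seen.add(c)
            cleaned.append(c)

    # Shuffle with a fixed seed so the train/val split is reproducible.
    random.Random(seed).shuffle(cleaned)
    n_val = max(1, int(len(cleaned) * val_fraction))
    val, train = cleaned[:n_val], cleaned[n_val:]

    # Character-level vocabulary; newline separates tweets within a split.
    chars = sorted(set("\n".join(cleaned) + "\n"))
    stoi = {ch: i for i, ch in enumerate(chars)}

    def encode(s):
        return [stoi[ch] for ch in s]

    return encode("\n".join(train)), encode("\n".join(val)), stoi


if __name__ == "__main__":
    sample = [
        "wesh rak @ali? https://t.co/x",
        "wesh rak ?",
        "ça va bien hamdoullah",
    ]
    train_ids, val_ids, stoi = build_dataset(sample)
    print(len(train_ids), len(val_ids), len(stoi))
```

The resulting id lists would still need to be written out in whatever format the training script expects. Filtering for the dialect itself (as opposed to plain French or Standard Arabic tweets) is a separate problem that this sketch does not address.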