Great article. This is so interesting. Would you mind sharing a complete list of the anomalous tokens?
both runs, semi-cleaned but still with some false positives:
1st) https://pastebin.com/XUV4u93d
2nd) https://pastebin.com/iUzC8s1V
Thank you
Hi, very interesting, thanks!
There's also another tokenization problem, for texts in medieval French/Latin/Italian etc. I got the texts from (imperfect) OCRs. But in medieval times, s was written ſ (long s), words weren't spelled the same way, and the grammar was different, so LLMs are completely lost and can't reproduce the text at all. Example medieval French text (see the tokenizer sketch after the excerpt):
Preftres & moynes. Ils fuiuoient en cela la cruauté de
leur Capitane Zifca, lequel non contant d'auoir faict
affez du fol en fa vie, ordonna par fon teftament, qu'il
fut efcorché, & qu'on feit de fa peau vn tabourin, af-
feurant que au fon d'iceluy leurs ennemis fenfuyroiét
tous effrayez.Coribut donc feftoit ioinct auec ces gés.
On traicta auec eux des poincts de la Religion: & leur
furet baillez des Docteurs de l'vniuerfité de Cracouie,
pour difputer auec eux & refuterleurs erreurs. Le Roy
fans entrer en ces difputes leur remonftra combien de
troubles il y auoir eu entr'eux. à caufe du changement
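You can see the fragmentation directly by tokenizing the OCR text. A minimal sketch, assuming the tiktoken library and the OpenAI cl100k_base vocabulary (any byte-level BPE tokenizer shows the same effect, just with different counts):

import tiktoken

# cl100k_base is the GPT-4 / GPT-3.5 vocabulary; exact counts differ
# per tokenizer, but the long-s spellings always fragment badly.
enc = tiktoken.get_encoding("cl100k_base")

for text in ("Prestres & moynes",   # normalized modern spelling
             "Preftres & moynes",   # OCR rendering of the long s as 'f'
             "Preſtres & moynes"):  # the actual long s character U+017F
    tokens = enc.encode(text)
    pieces = [enc.decode([t]) for t in tokens]
    print(f"{text!r}: {len(tokens)} tokens -> {pieces}")

The normalized spelling hits common vocabulary entries, while the long-s variants splinter into rare byte-level pieces the model has barely seen in training.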
Nice work! Have you seen our paper in this area? It may be of interest for looking at embeddings; quite simple approaches are pretty effective: https://arxiv.org/abs/2405.05417
I haven't run DeepSeek V3 myself as it's so big, but V2 is here: https://github.com/cohere-ai/magikarp/blob/main/results/reports_mini/deepseek_ai_DeepSeek_V2_Lite.md
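For anyone curious what "quite simple" looks like in practice: under-trained tokens tend to sit near the mean of the embedding matrix (they barely move from initialization, and weight decay pulls them inward), so distance-to-mean makes a cheap first-pass detector. A rough sketch, not the paper's exact method; gpt2 here is just a stand-in model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # illustrative; the same check works for any HF causal LM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

emb = model.get_input_embeddings().weight.detach()  # (vocab_size, dim)

# A small distance to the mean embedding flags a token that was rarely
# (or never) updated during training, i.e. a candidate glitch token.
dist = (emb - emb.mean(dim=0)).norm(dim=1)
for i in torch.argsort(dist)[:20].tolist():
    print(f"{dist[i].item():.4f}  {tok.convert_ids_to_tokens(i)!r}")

The shortlist still needs verification (prompting the model to repeat each candidate), since some low-norm tokens are just benign unused slots.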
It is pretty hard to make it write "Nameeee" exactly as it is, but somehow it was able to do that after 5 messages in a chat with R1. It's also funny to see the model trying to justify itself.
The Cebuano language has the second-biggest Wikipedia; it consists almost entirely of bot-generated articles about geographical landmarks, locations, and biological species. Another Philippine language, Waray, has the 8th-largest Wikipedia, created by the same bot owner. These models were trained on these Wikipedias.