7 Comments

Great article. This is so interesting. Would you mind sharing a complete list of the anomalous tokens?

Both runs, semi-cleaned but still with some false positives:

1st) https://pastebin.com/XUV4u93d

2nd) https://pastebin.com/iUzC8s1V

Thank you

Hi, very interesting, thanks!

Another tokenization problem: I have texts in medieval French, Latin, Italian, etc. that come from (imperfect) OCR.

In medieval times, s was written ſ, words weren't spelled exactly the same, and the grammar was different, so LLMs are completely lost and can't reproduce the text at all. Example medieval French text:

Preftres & moynes. Ils fuiuoient en cela la cruauté de

leur Capitane Zifca, lequel non contant d'auoir faict

affez du fol en fa vie, ordonna par fon teftament, qu'il

fut efcorché, & qu'on feit de fa peau vn tabourin, af-

feurant que au fon d'iceluy leurs ennemis fenfuyroiét

tous effrayez.Coribut donc feftoit ioinct auec ces gés.

On traicta auec eux des poincts de la Religion: & leur

furet baillez des Docteurs de l'vniuerfité de Cracouie,

pour difputer auec eux & refuterleurs erreurs. Le Roy

fans entrer en ces difputes leur remonftra combien de

troubles il y auoir eu entr'eux. à caufe du changement

vue 330/493
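In the sample above the OCR has rendered ſ as a plain "f" ("difputer", "fans", "teftament"), so one possible pre-processing step before tokenization is to try reading each "f" as "s" and keep the result only when it matches a known word. The following is a minimal sketch of that idea, not a tested pipeline: the function names and the tiny `lexicon` are hypothetical, and a real word list for the period would have to come from elsewhere.

```python
from itertools import product


def long_s_candidates(word):
    """Yield every spelling obtained by reading each 'f' as either 'f' or 's'."""
    slots = [("f", "s") if ch == "f" else (ch,) for ch in word]
    for combo in product(*slots):
        yield "".join(combo)


def restore_long_s(word, lexicon):
    """Return the unique lexicon word reachable by f -> s swaps, else the word unchanged.

    `lexicon` is a hypothetical set of lowercase known spellings (an assumption).
    """
    if word in lexicon:
        return word  # already a known word, leave it alone
    matches = [c for c in long_s_candidates(word) if c in lexicon]
    # Only rewrite when the correction is unambiguous.
    return matches[0] if len(matches) == 1 else word


# Hypothetical mini-lexicon, for illustration only.
lexicon = {"disputer", "sans", "fol", "testament"}
fixed = [restore_long_s(w, lexicon) for w in ["difputer", "fans", "fol", "teftament"]]
```

The number of candidates doubles with every "f" in a word, which is fine for ordinary words but would need a cap for long strings. This only addresses the ſ/f confusion; the u/v spelling ("auoir", "vn") and the abbreviation marks would need separate handling.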

Nice work! Have you seen our paper in this area? It may be of interest for looking at embeddings; quite simple things are pretty effective: https://arxiv.org/abs/2405.05417

I haven't run DeepSeek V3 myself as it's so big, but V2 is here: https://github.com/cohere-ai/magikarp/blob/main/results/reports_mini/deepseek_ai_DeepSeek_V2_Lite.md

It is pretty hard to make it write "Nameeee" exactly as it is, but somehow it was able to do that after 5 messages in a chat with R1. Also funny to see the model trying to justify itself.

The Cebuano language has the second-biggest Wikipedia; it consists almost entirely of bot-generated articles about geographical landmarks, locations and biological species. Another Philippine language, Waray, has the 8th-largest Wikipedia, created by the same bot owner. This model was trained on these Wikipedias.
