Great article. This is so interesting. Would you mind sharing a complete list of the anomalous tokens?
both runs, semi-cleaned but still with some false positives:
1st) https://pastebin.com/XUV4u93d
2nd) https://pastebin.com/iUzC8s1V
Thank you
Hi, very interesting, thanks!
There's also another tokenization problem, for texts in medieval French/Latin/Italian etc. I got the texts from (imperfect) OCRs. But in medieval times, s was written ſ (long s), words weren't spelled the same way, and the grammar was different, so LLMs are completely lost and can't reproduce the text at all. Example medieval French text (see the tokenizer sketch after the excerpt):
Preftres & moynes. Ils fuiuoient en cela la cruauté de
leur Capitane Zifca, lequel non contant d'auoir faict
affez du fol en fa vie, ordonna par fon teftament, qu'il
fut efcorché, & qu'on feit de fa peau vn tabourin, af-
feurant que au fon d'iceluy leurs ennemis fenfuyroiét
tous effrayez.Coribut donc feftoit ioinct auec ces gés.
On traicta auec eux des poincts de la Religion: & leur
furet baillez des Docteurs de l'vniuerfité de Cracouie,
pour difputer auec eux & refuterleurs erreurs. Le Roy
fans entrer en ces difputes leur remonftra combien de
troubles il y auoir eu entr'eux. à caufe du changement
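You can see the fragmentation directly by tokenizing the OCR text. A minimal sketch, assuming the tiktoken library and the OpenAI cl100k_base vocabulary (any byte-level BPE tokenizer shows the same effect, just with different counts):

import tiktoken

# cl100k_base is the GPT-4 / GPT-3.5 vocabulary; exact counts differ
# per tokenizer, but the long-s spellings always fragment badly.
enc = tiktoken.get_encoding("cl100k_base")

for text in ("Prestres & moynes",   # normalized modern spelling
             "Preftres & moynes",   # OCR rendering of the long s as 'f'
             "Preſtres & moynes"):  # the actual long s character U+017F
    tokens = enc.encode(text)
    pieces = [enc.decode([t]) for t in tokens]
    print(f"{text!r}: {len(tokens)} tokens -> {pieces}")

The normalized spelling hits common vocabulary entries, while the long-s variants splinter into rare byte-level pieces the model has barely seen in training.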
Nice work! Have you seen our paper in this area? It may be of interest for looking at embeddings; quite simple approaches are pretty effective: https://arxiv.org/abs/2405.05417
I haven't run DeepSeek V3 myself as it's so big, but V2 is here: https://github.com/cohere-ai/magikarp/blob/main/results/reports_mini/deepseek_ai_DeepSeek_V2_Lite.md
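For anyone curious what "quite simple" looks like in practice: under-trained tokens tend to sit near the mean of the embedding matrix (they barely move from initialization, and weight decay pulls them inward), so distance-to-mean makes a cheap first-pass detector. A rough sketch, not the paper's exact method; gpt2 here is just a stand-in model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # illustrative; the same check works for any HF causal LM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

emb = model.get_input_embeddings().weight.detach()  # (vocab_size, dim)

# A small distance to the mean embedding flags a token that was rarely
# (or never) updated during training, i.e. a candidate glitch token.
dist = (emb - emb.mean(dim=0)).norm(dim=1)
for i in torch.argsort(dist)[:20].tolist():
    print(f"{dist[i].item():.4f}  {tok.convert_ids_to_tokens(i)!r}")

The shortlist still needs verification (prompting the model to repeat each candidate), since some low-norm tokens are just benign unused slots.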
It is pretty hard to make it write "Nameeee" exactly as it is, but somehow it was able to do that after 5 messages in a chat with R1. It's also funny to see the model trying to justify itself.
The Cebuano language has the second-biggest Wikipedia; it consists almost entirely of bot-generated articles about geographical landmarks, locations, and biological species. Another Philippine language, Waray, has the 8th-largest Wikipedia, created by the same bot owner. These models were trained on these Wikipedias.