Tokenizer
Workflow
- Tokenization/Encode - text -> Token IDs
  - typically implemented with the BPE algorithm
- Decode - Token IDs -> text
- Algorithms
  - BPE - Byte Pair Encoding - frequency-based merging, bottom-up - derived from the byte-pair data compression algorithm
    - aka digram coding / byte-pair encoding / BPE tokeniser
    - Byte Level
      - can handle arbitrary input; never produces [UNK]
      - e.g. gpt2, llama
    - Character Level
    - Word Level
    - a toy sketch of the merge loop follows this algorithm list
  - WordPiece - probability-based merging, bottom-up
    - e.g. bert
    - https://arxiv.org/pdf/1609.08144.pdf
  - Unigram - probability-based pruning, top-down
    - e.g. T5, ALBERT
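To make "frequency-based merging, bottom-up" concrete, here is a toy sketch of the BPE training loop over a word-frequency table (illustrative only, character level rather than byte level):

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Learn BPE merges bottom-up: repeatedly merge the most frequent pair."""
    # Each word starts as a sequence of single-character symbols.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # frequency-based selection
        merges.append(best)
        # Apply the merge everywhere: (a, b) -> "ab".
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# Toy corpus in the style of the subword-nmt paper.
print(train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 5))
```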
- SentencePiece
  - Apache-2.0, C++
  - supports BPE, unigram, char, word
- openai/tiktoken
  - pypi tiktoken
  - model definitions: https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py
  - see the encode/decode example after this list
- huggingface/tokenizers
  - Apache-2.0, Rust
  - Go binding: daulet/tokenizers
  - Proposal: Add Golang Bindings for tokenizers huggingface/tokenizers#1751
- https://github.com/rsennrich/subword-nmt
  - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
  - Subword Neural Machine Translation
- dqbd/tiktoken
  - JS port and JS/WASM bindings for openai/tiktoken
  - https://github.com/dqbd/tiktoken/blob/main/tiktoken/model_to_encoding.json
  - https://github.com/dqbd/tiktoken/blob/main/tiktoken/registry.json
- Online https://tiktokenizer.vercel.app/
  - dqbd/tiktokenizer - online playground
  - https://tiktokenizer.vercel.app/hf/Qwen/Qwen2.5-72B/tokenizer.json
- zurawiki/tiktoken-rs
  - MIT, Rust
  - https://crates.io/crates/tiktoken-rs
- Typescript/JS/Wasm
  - dqbd/tiktoken
- Golang
  - tiktoken-go/tokenizer
    - MIT, Go
  - pkoukk/tiktoken-go
    - MIT, Go
- https://platform.openai.com/tokenizer
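The encode/decode round trip from the Workflow list above, using openai/tiktoken:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")        # load an encoding by name
# enc = tiktoken.encoding_for_model("gpt-4o")    # or resolve it from a model name

ids = enc.encode("hello world")  # text -> token IDs
print(ids)
print(enc.decode(ids))           # token IDs -> text
```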
ChatGPT special tokens
<|endoftext|>
<|endofprompt|>
<|eos|>
<|pad|>
<|bos|>
<|eol|>
<|math|>
<|doc|>
<|im_start|>
<|im_end|>
<|im_sep|>
<|fim_prefix|>
<|fim_middle|>
<|fim_suffix|>
- add_special_tokens
- sequence markers - e.g. <|bos|>, <|eos|>, <|pad|>, <|endoftext|>
- conversation markers - e.g. <|im_start|>, <|im_end|>, <|im_sep|>
- function markers - e.g. <|fim_prefix|>, <|fim_middle|>, <|fim_suffix|>
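Special tokens must be opted into explicitly when encoding; a small check with tiktoken's cl100k_base, which registers <|endoftext|>, <|endofprompt|>, and the <|fim_*|> tokens:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# By default encode() raises ValueError on special-token text,
# which keeps user input from injecting control tokens.
ids = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
print(ids)  # [100257] - encoded as a single token

# Or treat the same text as plain data instead:
print(enc.encode("<|endoftext|>", disallowed_special=()))  # several ordinary tokens
```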
model | vocab size | tokenizer | notes |
---|---|---|---|
openai/gpt2 | 50257 | byte-level-bpe | |
openai/r50k_base | 50257 | byte-level-bpe | |
openai/p50k_base | 50281 | byte-level-bpe | |
openai/p50k_edit | 50283 | byte-level-bpe | |
openai/cl100k_base | 100276 | byte-level-bpe | |
openai/o200k_base | 200018 | byte-level-bpe | |
openai/o200k_harmony | 201088 | byte-level-bpe | |
- o200k_harmony
  - vocab 201088
  - adds extra special tokens on top of o200k_base
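A quick sanity check of the vocab size, assuming your installed tiktoken is recent enough to register o200k_harmony:

```python
import tiktoken

# "o200k_harmony" is only present in recent tiktoken releases.
enc = tiktoken.get_encoding("o200k_harmony")
print(enc.n_vocab)  # 201088
```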
abbr. | stands for | meaning |
---|---|---|
BPE | Byte Pair Encoding | frequency-based subword merging |
unigram | Unigram | probability-based subword pruning |
NFKC | Unicode Normalization Form KC | Unicode compatibility normalization |
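NFKC is worth a concrete example, since SentencePiece's default normalizer (nmt_nfkc) is NFKC-based; Python's standard library demonstrates it:

```python
import unicodedata

# NFKC folds compatibility characters into canonical forms,
# e.g. full-width letters, ligatures, and circled digits.
print(unicodedata.normalize("NFKC", "ＡＢＣ"))  # ABC
print(unicodedata.normalize("NFKC", "ﬁ"))       # fi
print(unicodedata.normalize("NFKC", "①"))       # 1
```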
SentencePiece
- spm_encode - text -> pieces / ids
- spm_decode - pieces / ids -> text
- spm_normalize - apply normalization only
- spm_train - train a model from raw text
- spm_export_vocab - dump the learned vocabulary
pip install sentencepiece
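The Python package mirrors the CLI tools; a minimal train-then-encode sketch (corpus.txt is a placeholder path):

```python
import sentencepiece as spm

# Train a small model; equivalent to the spm_train CLI.
spm.SentencePieceTrainer.train(
    input="corpus.txt",    # placeholder: one sentence per line
    model_prefix="m",      # writes m.model and m.vocab
    vocab_size=8000,
    model_type="unigram",  # or "bpe", "char", "word"
)

sp = spm.SentencePieceProcessor(model_file="m.model")
print(sp.encode("hello world", out_type=str))  # pieces
print(sp.encode("hello world"))                # ids
print(sp.decode(sp.encode("hello world")))     # round-trip
```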
FAQ
tokenizer.json vs vocab.json
- tokenizer.json - the single-file serialization used by huggingface/tokenizers (vocab, merges, normalizer, and pre-tokenizer together)
- vocab.json + merges.txt - the legacy GPT-2-style BPE pair: vocab.json maps tokens to ids, merges.txt lists the learned merge rules
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
# saves as merges.txt + vocab.json in the given directory
tokenizer.model.save("tokenizer")
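Going the other way - rebuilding a tokenizer from the legacy pair - a sketch using tokenizers' BPE.from_file (paths assume the save above; normalizer/decoder settings are not stored in the pair and must be re-specified):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel

# Rebuild the model from the legacy GPT-2-style pair.
model = BPE.from_file("tokenizer/vocab.json", "tokenizer/merges.txt")
tokenizer = Tokenizer(model)
tokenizer.pre_tokenizer = ByteLevel()  # must match the original pre-tokenizer

# Re-serialize everything into the single-file format.
tokenizer.save("byte-level-bpe.tokenizer.json")
```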