Tokenizer

bit_axon.tokenizer.QwenTokenizerWrapper

QwenTokenizerWrapper(path_or_name: str | Path)

Lightweight Qwen2.5 tokenizer wrapper using the tokenizers library.

Loads a tokenizer.json file (Qwen2.5 format) and provides:

  • encode/decode
  • Qwen2.5 chat template rendering (pure Python, no Jinja)
  • special token properties

Load tokenizer from local file path or HuggingFace Hub repo name.

  • If path is a local file that exists: HFTokenizer.from_file()
  • If path looks like a HuggingFace ID (contains '/'): download tokenizer.json via huggingface_hub.hf_hub_download, then load with HFTokenizer.from_file()
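The file-vs-Hub dispatch described above can be sketched as follows. This is a minimal illustration, not the library's implementation; `resolve_tokenizer_source` is a hypothetical helper introduced here for clarity.

```python
from pathlib import Path


def resolve_tokenizer_source(path_or_name: str) -> str:
    """Decide how a tokenizer argument should be loaded.

    Returns "file" for an existing local path (-> HFTokenizer.from_file),
    "hub" for a HuggingFace repo ID containing '/' (-> hf_hub_download
    of tokenizer.json, then HFTokenizer.from_file on the result).
    """
    if Path(path_or_name).is_file():
        return "file"
    if "/" in path_or_name:
        return "hub"
    raise FileNotFoundError(f"Cannot resolve tokenizer source: {path_or_name!r}")
```

Note that the local-file check runs first, so a relative path that happens to contain `/` is still loaded from disk when it exists.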

Attributes

pad_token_id property
pad_token_id: int

Return the pad token ID (`<|endoftext|>`, 151643 for Qwen2.5).

eos_token_id property
eos_token_id: int

Return the end-of-sequence token ID (`<|im_end|>`, 151645 for Qwen2.5).

vocab_size property
vocab_size: int

Return the vocabulary size including added tokens.

Functions

encode
encode(text: str) -> list[int]

Encode text to list of token IDs.

decode
decode(token_ids: list[int] | mx.array, skip_special_tokens: bool = True) -> str

Decode token IDs to text. Accepts list or mx.array.

apply_chat_template
apply_chat_template(messages: list[dict[str, str]], add_generation_prompt: bool = False) -> list[int]

Apply Qwen2.5 chat template to messages.

Each message is rendered as `<|im_start|>{role}\n{content}<|im_end|>\n`. If add_generation_prompt=True, `<|im_start|>assistant\n` is appended.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `messages` | `list[dict[str, str]]` | `[{"role": "system"\|"user"\|"assistant", "content": "..."}]` | *required* |
| `add_generation_prompt` | `bool` | Whether to append the assistant prompt | `False` |

Returns: list of token IDs
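Since the template is pure string formatting (no Jinja), the rendering step can be sketched directly; token IDs would then come from `encode` on the result. A minimal sketch assuming the template shown above; `render_qwen_chat` is a hypothetical helper, not part of the library:

```python
def render_qwen_chat(messages: list[dict[str, str]],
                     add_generation_prompt: bool = False) -> str:
    # Each message becomes: <|im_start|>{role}\n{content}<|im_end|>\n
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```

For example, a single user message with `add_generation_prompt=True` renders the user turn followed by an open `<|im_start|>assistant\n` block, which is the point at which generation begins.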