xtuner.v1.train.toy_tokenizer.UTF8ByteTokenizer

class xtuner.v1.train.toy_tokenizer.UTF8ByteTokenizer(bos_token_id: int | None = 256, eos_token_id: int | None = 257, pad_token_id: int | None = 258, image_start_id_map: dict[str, int] | None = None, image_context_id_map: dict[str, int] | None = None, image_end_id_map: dict[str, int] | None = None)

A simple byte-level tokenizer that encodes text as UTF-8 bytes with special token handling.

This tokenizer converts text into a sequence of byte values (0-255) and supports special tokens for text boundaries and image-related placeholders. It provides basic encoding/decoding functionality compatible with transformers’ tokenizer interface.

Parameters:

bos_token_id (int, optional) – Beginning of sequence token ID. Defaults to 256.
eos_token_id (int, optional) – End of sequence token ID. Defaults to 257.
pad_token_id (int, optional) – Padding token ID. Defaults to 258.
image_start_id_map (dict[str, int], optional) – Mapping from image start string to token ID. Defaults to {"<img>": 259}.
image_context_id_map (dict[str, int], optional) – Mapping from image context string to token ID. Defaults to {"<IMG_CONTEXT>": 260}.
image_end_id_map (dict[str, int], optional) – Mapping from image end string to token ID. Defaults to {"</img>": 261}.
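The documented defaults imply an ID layout in which raw byte values occupy 0-255 and the special tokens follow from 256 upward. The snippet below is only a sketch of that layout; the names `BYTE_VOCAB_SIZE`, `DEFAULT_SPECIAL_IDS`, and `is_special` are hypothetical and are not part of xtuner.

```python
# Hypothetical illustration of the ID layout implied by the documented defaults.
# This is not the xtuner implementation.
BYTE_VOCAB_SIZE = 256  # raw UTF-8 byte values occupy IDs 0-255

DEFAULT_SPECIAL_IDS = {
    "bos": 256,
    "eos": 257,
    "pad": 258,
    "<img>": 259,
    "<IMG_CONTEXT>": 260,
    "</img>": 261,
}


def is_special(token_id: int) -> bool:
    """Return True when the ID falls outside the raw byte range."""
    return token_id >= BYTE_VOCAB_SIZE


# A single non-ASCII character can expand to several byte-level IDs.
text_ids = list("é".encode("utf-8"))
```

Because every ID below 256 is a plain byte, the special IDs never collide with encoded text as long as they stay at 256 or above.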

Methods

`convert_tokens_to_ids`(token)
`decode`(ids[, skip_special_tokens])	Decode a sequence of token IDs back to text.
`encode`(text[, add_special_tokens])	Encode text into a sequence of token IDs.
`save_pretrained`(save_directory)

convert_tokens_to_ids(token: str) → int | list[int]

decode(ids: list[int], skip_special_tokens: bool = True) → str

Decode a sequence of token IDs back to text.

Converts byte-level token IDs back to UTF-8 text. Special tokens are either skipped or kept as-is, depending on the skip_special_tokens parameter.

Parameters:

ids (list[int]) – List of token IDs to decode.
skip_special_tokens (bool, optional) – Whether to skip special tokens during decoding. Defaults to True.

Returns:

The decoded text.

Return type:

str

Raises:

ValueError – If any token ID is negative.
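The decoding behavior described above can be approximated with a short standalone function. This is a minimal sketch under stated assumptions, not the xtuner source: `sketch_decode` and `SPECIAL_STRINGS` are hypothetical names, the `<bos>`, `<eos>`, and `<pad>` strings are invented placeholders, and replacing invalid byte sequences instead of raising is a choice made only for this example.

```python
# Hypothetical rendering of special IDs; "<bos>", "<eos>" and "<pad>" are
# placeholder strings made up for this sketch.
SPECIAL_STRINGS = {
    256: "<bos>",
    257: "<eos>",
    258: "<pad>",
    259: "<img>",
    260: "<IMG_CONTEXT>",
    261: "</img>",
}


def sketch_decode(ids: list[int], skip_special_tokens: bool = True) -> str:
    parts: list[str] = []
    buffer = bytearray()  # pending raw bytes, decoded together as UTF-8
    for token_id in ids:
        if token_id < 0:
            raise ValueError(f"Negative token ID: {token_id}")
        if token_id < 256:
            buffer.append(token_id)
            continue
        # A special token ends the current run of bytes, so flush it first.
        if buffer:
            parts.append(buffer.decode("utf-8", errors="replace"))
            buffer.clear()
        if not skip_special_tokens:
            parts.append(SPECIAL_STRINGS.get(token_id, ""))
    if buffer:
        parts.append(buffer.decode("utf-8", errors="replace"))
    return "".join(parts)
```

Buffering consecutive bytes before decoding matters because a multi-byte character such as "é" spans two IDs and cannot be decoded one ID at a time.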

encode(text: str, add_special_tokens: bool = False) → list[int]

Encode text into a sequence of token IDs.

Converts text to UTF-8 bytes and replaces special image tokens with their corresponding IDs. Optionally adds BOS/EOS tokens.

Parameters:

text (str) – The text to encode.
add_special_tokens (bool, optional) – Whether to add BOS/EOS tokens. Defaults to False.

Returns:

List of token IDs representing the encoded text.

Return type:

list[int]
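As a rough illustration of the encoding steps described above, the following standalone sketch splits the text on the default image placeholder strings, maps each placeholder to its ID, and turns everything else into UTF-8 byte values. The function name `sketch_encode` and the constants are hypothetical, and the BOS-first, EOS-last placement is an assumption; the actual xtuner implementation may differ.

```python
import re

# Default image placeholder IDs and BOS/EOS IDs, taken from the documented defaults.
IMAGE_TOKEN_IDS = {"<img>": 259, "<IMG_CONTEXT>": 260, "</img>": 261}
BOS_ID, EOS_ID = 256, 257


def sketch_encode(text: str, add_special_tokens: bool = False) -> list[int]:
    # The capturing group makes re.split keep the placeholders in its output.
    pattern = "(" + "|".join(re.escape(tok) for tok in IMAGE_TOKEN_IDS) + ")"
    ids: list[int] = []
    for piece in re.split(pattern, text):
        if piece in IMAGE_TOKEN_IDS:
            ids.append(IMAGE_TOKEN_IDS[piece])
        else:
            # Ordinary text becomes one ID per UTF-8 byte (0-255).
            ids.extend(piece.encode("utf-8"))
    if add_special_tokens:
        # Assumption for this sketch: BOS is prepended and EOS is appended.
        ids = [BOS_ID, *ids, EOS_ID]
    return ids
```

For example, `sketch_encode("a<img>b")` yields the byte for "a", the image start ID, and the byte for "b".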

save_pretrained(save_directory: str | Path)
