xtuner.v1.train.toy_tokenizer.UTF8ByteTokenizer
- class xtuner.v1.train.toy_tokenizer.UTF8ByteTokenizer(bos_token_id: int | None = 256, eos_token_id: int | None = 257, pad_token_id: int | None = 258, image_start_id_map: dict[str, int] | None = None, image_context_id_map: dict[str, int] | None = None, image_end_id_map: dict[str, int] | None = None)
A simple byte-level tokenizer that encodes text as UTF-8 bytes with special token handling.
This tokenizer converts text into a sequence of byte values (0-255) and supports special tokens for text boundaries and image-related placeholders. It provides basic encoding/decoding functionality compatible with transformers’ tokenizer interface.
- Parameters:
bos_token_id (int, optional) – Beginning of sequence token ID. Defaults to 256.
eos_token_id (int, optional) – End of sequence token ID. Defaults to 257.
pad_token_id (int, optional) – Padding token ID. Defaults to 258.
image_start_id_map (dict[str, int], optional) – Mapping from image start string to token ID. Defaults to {"<img>": 259}.
image_context_id_map (dict[str, int], optional) – Mapping from image context string to token ID. Defaults to {"<IMG_CONTEXT>": 260}.
image_end_id_map (dict[str, int], optional) – Mapping from image end string to token ID. Defaults to {"</img>": 261}.
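The defaults above imply a compact ID layout: byte values occupy 0-255 and the six special tokens follow at 256-261. The sketch below only restates those documented defaults in a plain dictionary; `DEFAULT_SPECIAL_IDS` and `vocab_size` are hypothetical names for illustration, not part of xtuner.

```python
# Hypothetical stand-in that mirrors the documented defaults; not the xtuner class.
DEFAULT_SPECIAL_IDS = {
    "bos": 256,
    "eos": 257,
    "pad": 258,
    "<img>": 259,
    "<IMG_CONTEXT>": 260,
    "</img>": 261,
}


def vocab_size(special_ids: dict[str, int]) -> int:
    """Smallest vocabulary size covering every byte value and every special ID."""
    return max(255, *special_ids.values()) + 1


print(vocab_size(DEFAULT_SPECIAL_IDS))  # 262
```

If any of the ID arguments are overridden, a model's embedding table presumably needs to cover the largest ID in use, which is what `vocab_size` computes here.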
Methods
convert_tokens_to_ids(token)
decode(ids[, skip_special_tokens]) – Decode a sequence of token IDs back to text.
encode(text[, add_special_tokens]) – Encode text into a sequence of token IDs.
save_pretrained(save_directory)
- decode(ids: list[int], skip_special_tokens: bool = True) → str
Decode a sequence of token IDs back to text.
Converts byte-level token IDs back to UTF-8 text. Special tokens are either skipped or kept as-is, depending on the skip_special_tokens parameter.
- Parameters:
ids (list[int]) – List of token IDs to decode.
skip_special_tokens (bool, optional) – Whether to skip special tokens during decoding. Defaults to True.
- Returns:
The decoded text.
- Return type:
str
- Raises:
ValueError – If any token ID is negative.
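The decoding behavior described above can be approximated with a short standalone function. This is a minimal sketch under stated assumptions, not the xtuner implementation: `sketch_decode` and `SPECIAL_NAMES` are hypothetical, the `<bos>`/`<eos>`/`<pad>` strings are invented placeholders for the "kept as-is" case, and replacing invalid byte sequences with U+FFFD is an assumed choice.

```python
# Hypothetical rendering of special IDs when they are not skipped.
# "<bos>", "<eos>" and "<pad>" are placeholder names, not documented strings.
SPECIAL_NAMES = {
    256: "<bos>",
    257: "<eos>",
    258: "<pad>",
    259: "<img>",
    260: "<IMG_CONTEXT>",
    261: "</img>",
}


def sketch_decode(ids: list, skip_special_tokens: bool = True) -> str:
    """Approximate the documented decode(): bytes become text, specials are skipped or kept."""
    pieces = []
    buf = bytearray()
    for token_id in ids:
        if token_id < 0:
            raise ValueError(f"negative token ID: {token_id}")
        if token_id < 256:
            buf.append(token_id)  # plain byte value
            continue
        if skip_special_tokens:
            continue  # drop the special token entirely
        # Flush pending bytes, then emit a readable form of the special token.
        pieces.append(buf.decode("utf-8", errors="replace"))
        buf.clear()
        pieces.append(SPECIAL_NAMES.get(token_id, f"<{token_id}>"))
    pieces.append(buf.decode("utf-8", errors="replace"))
    return "".join(pieces)


print(sketch_decode([104, 105, 257]))  # hi
```

Buffering the bytes before decoding matters because a single character may span several IDs; decoding each ID on its own would break multi-byte UTF-8 sequences.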
- encode(text: str, add_special_tokens: bool = False) → list[int]
Encode text into a sequence of token IDs.
Converts text to UTF-8 bytes and replaces special image tokens with their corresponding IDs. Optionally adds BOS/EOS tokens.
- Parameters:
text (str) – The text to encode.
add_special_tokens (bool, optional) – Whether to add BOS/EOS tokens. Defaults to False.
- Returns:
List of token IDs representing the encoded text.
- Return type:
list[int]
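To make the encoding steps concrete, here is a minimal standalone sketch of the behavior described for encode(). It is not the xtuner source: `sketch_encode` and `IMAGE_TOKENS` are hypothetical names, the image strings and IDs are the documented defaults, and placing BOS at the start and EOS at the end is an assumption about how the optional special tokens are added.

```python
import re
from typing import Optional

# Documented default image placeholder strings and their IDs.
IMAGE_TOKENS = {"<img>": 259, "<IMG_CONTEXT>": 260, "</img>": 261}


def sketch_encode(
    text: str,
    add_special_tokens: bool = False,
    bos_token_id: Optional[int] = 256,
    eos_token_id: Optional[int] = 257,
) -> list:
    """Approximate the documented encode(): UTF-8 bytes plus image-token IDs."""
    # Try longer placeholders first so no shorter one shadows a longer match.
    names = sorted(IMAGE_TOKENS, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))

    ids = []
    pos = 0
    for match in pattern.finditer(text):
        ids.extend(text[pos:match.start()].encode("utf-8"))  # bytes before the placeholder
        ids.append(IMAGE_TOKENS[match.group()])              # the placeholder's single ID
        pos = match.end()
    ids.extend(text[pos:].encode("utf-8"))                   # remaining bytes

    if add_special_tokens:
        # Assumed placement: BOS first, EOS last; a None ID is simply omitted.
        if bos_token_id is not None:
            ids.insert(0, bos_token_id)
        if eos_token_id is not None:
            ids.append(eos_token_id)
    return ids


print(sketch_encode("hi"))  # [104, 105]
print(sketch_encode("<img>a</img>", add_special_tokens=True))  # [256, 259, 97, 261, 257]
```

Each image placeholder collapses to one ID instead of being spelled out byte by byte, which is why the placeholder strings have to be located before the surrounding text is converted to bytes.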