xtuner.v1.train.toy_tokenizer.UTF8ByteTokenizer

class xtuner.v1.train.toy_tokenizer.UTF8ByteTokenizer(bos_token_id: int | None = 256, eos_token_id: int | None = 257, pad_token_id: int | None = 258, image_start_id_map: dict[str, int] | None = None, image_context_id_map: dict[str, int] | None = None, image_end_id_map: dict[str, int] | None = None)

A simple byte-level tokenizer that encodes text as UTF-8 bytes with special token handling.

This tokenizer converts text into a sequence of byte values (0-255) and supports special tokens for text boundaries and image-related placeholders. It provides basic encoding/decoding functionality compatible with transformers’ tokenizer interface.

Parameters:
  • bos_token_id (int, optional) – Beginning of sequence token ID. Defaults to 256.

  • eos_token_id (int, optional) – End of sequence token ID. Defaults to 257.

  • pad_token_id (int, optional) – Padding token ID. Defaults to 258.

  • image_start_id_map (dict[str, int], optional) – Mapping from image start string to token ID. Defaults to {"<img>": 259}.

  • image_context_id_map (dict[str, int], optional) – Mapping from image context string to token ID. Defaults to {"<IMG_CONTEXT>": 260}.

  • image_end_id_map (dict[str, int], optional) – Mapping from image end string to token ID. Defaults to {"</img>": 261}.
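
The defaults listed above place every special token directly above the byte range 0-255. The following standalone sketch only illustrates that ID layout. It does not import xtuner, and the dictionary keys "bos", "eos", "pad" as well as the helper default_vocab_size are hypothetical names chosen for this example.

```python
# Standalone illustration of the default ID layout; it does not import xtuner.
BYTE_VOCAB_SIZE = 256  # raw UTF-8 byte values occupy IDs 0..255

# Default IDs as listed in the parameter descriptions.
# The keys "bos", "eos" and "pad" are hypothetical labels for this sketch.
DEFAULT_SPECIAL_IDS = {
    "bos": 256,
    "eos": 257,
    "pad": 258,
    "<img>": 259,
    "<IMG_CONTEXT>": 260,
    "</img>": 261,
}


def default_vocab_size() -> int:
    """Hypothetical helper: number of distinct IDs under the defaults."""
    return max(DEFAULT_SPECIAL_IDS.values()) + 1


def collides_with_bytes(token_id: int) -> bool:
    """Return True if a special ID would overlap a raw byte value."""
    return 0 <= token_id < BYTE_VOCAB_SIZE
```

Under these defaults no special ID overlaps a byte value. If custom IDs are passed to the constructor, a check like collides_with_bytes may be worth running first, since an ID below 256 would be indistinguishable from a raw byte.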

Methods

convert_tokens_to_ids(token)

decode(ids[, skip_special_tokens])

Decode a sequence of token IDs back to text.

encode(text[, add_special_tokens])

Encode text into a sequence of token IDs.

save_pretrained(save_directory)

convert_tokens_to_ids(token: str) → int | list[int]

decode(ids: list[int], skip_special_tokens: bool = True) → str

Decode a sequence of token IDs back to text.

Converts byte-level token IDs back to UTF-8 text. Special tokens are either skipped or kept as-is, depending on the skip_special_tokens parameter.

Parameters:
  • ids (list[int]) – List of token IDs to decode.

  • skip_special_tokens (bool, optional) – Whether to skip special tokens during decoding. Defaults to True.

Returns:

The decoded text.

Return type:

str

Raises:

ValueError – If any token ID is negative.
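
The decoding behavior described above can be approximated with a standalone sketch. This is not the XTuner source: sketch_decode and its special_strings parameter are hypothetical, the use of errors="replace" for invalid byte sequences is an assumption, and "kept as-is" is interpreted here as writing the image token strings back into the text.

```python
def sketch_decode(ids, skip_special_tokens=True, special_strings=None):
    """Hypothetical stand-in for decode(); not the XTuner implementation."""
    if special_strings is None:
        # Assumed reverse mapping of the default image token IDs.
        special_strings = {259: "<img>", 260: "<IMG_CONTEXT>", 261: "</img>"}
    if any(i < 0 for i in ids):
        raise ValueError("token IDs must be non-negative")

    pieces = []
    buf = bytearray()  # pending raw bytes awaiting UTF-8 decoding
    for i in ids:
        if i < 256:
            buf.append(i)
            continue
        # A special token ends the current byte run, so flush it first.
        if buf:
            pieces.append(buf.decode("utf-8", errors="replace"))
            buf.clear()
        # Assumption: only special tokens with a known string are rendered.
        if not skip_special_tokens and i in special_strings:
            pieces.append(special_strings[i])
    if buf:
        pieces.append(buf.decode("utf-8", errors="replace"))
    return "".join(pieces)
```

Bytes are buffered and decoded in runs because a single character may span several byte tokens, so decoding ID by ID would corrupt multi-byte characters.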

encode(text: str, add_special_tokens: bool = False) → list[int]

Encode text into a sequence of token IDs.

Converts text to UTF-8 bytes and replaces special image tokens with their corresponding IDs. Optionally adds BOS/EOS tokens.

Parameters:
  • text (str) – The text to encode.

  • add_special_tokens (bool, optional) – Whether to add BOS/EOS tokens. Defaults to False.

Returns:

List of token IDs representing the encoded text.

Return type:

list[int]
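
The encoding steps described above (split out image token strings, turn the remaining text into UTF-8 bytes, optionally wrap with BOS/EOS) can be sketched as follows. This is a minimal standalone approximation, not the XTuner implementation: sketch_encode and its image_ids parameter are hypothetical, and splitting with a regular expression is an assumption about how the special strings might be located.

```python
import re


def sketch_encode(text, add_special_tokens=False,
                  bos_token_id=256, eos_token_id=257, image_ids=None):
    """Hypothetical stand-in for encode(); not the XTuner implementation."""
    if image_ids is None:
        # Default image token strings and IDs from the constructor docs.
        image_ids = {"<img>": 259, "<IMG_CONTEXT>": 260, "</img>": 261}

    if image_ids:
        # The capture group makes re.split keep the matched special strings.
        pattern = "(" + "|".join(re.escape(s) for s in image_ids) + ")"
        pieces = re.split(pattern, text)
    else:
        pieces = [text]

    ids = []
    for piece in pieces:
        if piece in image_ids:
            ids.append(image_ids[piece])      # one ID per special string
        else:
            ids.extend(piece.encode("utf-8"))  # one ID per UTF-8 byte

    if add_special_tokens:
        # Assumption: a None ID means the corresponding token is not added.
        if bos_token_id is not None:
            ids.insert(0, bos_token_id)
        if eos_token_id is not None:
            ids.append(eos_token_id)
    return ids
```

For example, sketch_encode("a<img>b") yields [97, 259, 98]: the bytes for "a" and "b" surround the single ID that replaces the image start string.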

save_pretrained(save_directory: str | Path)