xtuner.v1.train.trainer.Trainer

class xtuner.v1.train.trainer.Trainer(*, load_from: str | Path | None = None, model_cfg: XTunerBaseModelConfig, optim_cfg: OptimConfig, fsdp_cfg: FSDPConfig | None = pydantic.BaseModel, dataset_cfg: list[xtuner.v1.datasets.config.DatasetCombine] | None = None, dataloader_cfg: DataloaderConfig, loss_cfg: CELossConfig | None = pydantic.BaseModel, lr_cfg: LRConfig, tokenizer_path: str | Path | None = None, global_batch_size: int | None, work_dir: Path | str | None = None, log_dir: Path | str | None = None, sp_size: int = 1, total_step: int | None = None, total_epoch: int | None = None, resume_cfg: ResumeConfig | None = pydantic.BaseModel, auto_resume: bool = False, load_checkpoint_cfg: LoadCheckpointConfig = pydantic.BaseModel, strict_load: bool = True, checkpoint_interval: int | None = -1, checkpoint_maxkeep: int | None = -1, async_hf_export: bool = False, skip_checkpoint_validation: bool = False, patch_for_dcp_finish: bool = False, async_checkpoint: bool = False, snapshot_interval: int | None = None, check_health_interval: int | None = None, hf_interval: int | None = None, hf_max_keep: int | None = None, exp_tracker: Literal['tensorboard', 'jsonl'] = 'jsonl', profile_step: list[int] | int | None = None, profile_time: bool = True, profile_memory: bool = False, intra_layer_micro_batch: int = 1, seed: int = 42, debug: bool = False, backend: str | None = None, debug_skip_save: bool = False, prober_list: list[str] = [], do_clip: bool = True, grad_norm_dtype: torch.dtype = torch.float32, trainer_cfg: TrainerConfig | None = None, hooks_config: HooksConfig = pydantic.BaseModel, internal_metrics_cfg: InternalMetricsConfig | None = None)

Trainer class for fine-tuning transformer models with FSDP support.

This class provides a high-level interface for training transformer models with configurable distributed training, optimization, and checkpointing. It supports various training configurations including sequence parallelism, tensor parallelism, and data parallelism.

Parameters:
  • load_from (str | Path | None) – Path to a Hugging Face model or a saved trainer checkpoint.

  • model_cfg (TransformerConfig | InternS1BaseConfig) – Configuration for the transformer model architecture.

  • optim_cfg (OptimConfig) – Configuration for the optimizer.

  • fsdp_cfg (FSDPConfig | None) – Configuration for Fully Sharded Data Parallel (FSDP).

  • dataset_cfg (DatasetConfigList) – Configuration for training datasets.

  • dataloader_cfg (DataloaderConfig) – Configuration for the data loader.

  • loss_cfg (CELossConfig | None) – Config for the cross-entropy loss function.

  • lr_cfg (LRConfig) – Configuration for the learning rate scheduler.

  • tokenizer_path (str | Path | None) – Path to the tokenizer.

  • global_batch_size (int | None) – Global batch size for training.

  • work_dir (Path | str | None) – Directory for saving experiment outputs.

  • log_dir (Path | str | None) – Directory for log files.

  • sp_size (int) – Sequence parallel size.

  • total_step (int | None) – Total training steps.

  • total_epoch (int | None) – Number of training epochs.

  • resume_cfg (ResumeConfig | None) – Configuration for resuming training.

  • auto_resume (bool) – Whether to automatically resume training. Defaults to False.

  • load_checkpoint_cfg (LoadCheckpointConfig) – Configuration for loading checkpoints.

  • strict_load (bool) – Whether to strictly load model weights.

  • checkpoint_interval (int | None) – Interval for saving checkpoints.

  • checkpoint_maxkeep (int | None) – Maximum number of checkpoints to keep.

  • patch_for_dcp_finish (bool) – If True, skip returning finish_checkpoint result.

  • hf_interval (int | None) – Interval for saving Hugging Face format checkpoints.

  • hf_max_keep (int | None) – Maximum number of Hugging Face checkpoints to keep.

  • profile_step (list[int] | int | None) – Step or list of steps at which to perform profiling.

  • profile_time (bool) – Whether to profile training time.

  • profile_memory (bool) – Whether to profile memory usage.

  • intra_layer_micro_batch (int) – Intra-layer micro batch size.

  • seed (int) – Random seed for reproducibility.

  • debug (bool) – Whether to enable debug mode.

  • backend (str | None) – Backend for distributed training.
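
The interaction between checkpoint_interval and checkpoint_maxkeep can be pictured with a small stand-alone simulation. This is only a sketch of a common "save every N steps, keep the newest K" policy, not the Trainer's actual implementation. The helper names should_save and simulate_retention are hypothetical, and treating None or a non-positive value as "disabled" (for the interval) or "keep everything" (for maxkeep) is an assumption made for illustration.

```python
from __future__ import annotations

from collections import deque


def should_save(step: int, interval: int | None) -> bool:
    """Return True when `step` falls on a save boundary.

    Assumption: None or a non-positive interval disables periodic saving.
    """
    if interval is None or interval <= 0:
        return False
    return step % interval == 0


def simulate_retention(total_step: int, interval: int | None, maxkeep: int | None) -> list[int]:
    """Return the steps whose checkpoints would still exist after training.

    Assumption: None or a non-positive maxkeep means "keep everything".
    """
    kept: deque[int] = deque()
    for step in range(1, total_step + 1):
        if not should_save(step, interval):
            continue
        kept.append(step)
        # Once the limit is exceeded, drop the oldest checkpoint first.
        if maxkeep is not None and maxkeep > 0 and len(kept) > maxkeep:
            kept.popleft()
    return list(kept)


remaining = simulate_retention(total_step=10, interval=2, maxkeep=3)
print(remaining)
```

Under these assumptions, a 10-step run that saves every 2 steps and keeps 3 checkpoints ends with only the three most recent saves on disk.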

Methods

  • build_engine(model_path, model_config, ...) – Build the training engine for the transformer model.

  • build_lr_scheduler(lr_cfg, scheduler_step) – Build the learning rate scheduler.

  • fit() – Run the training loop.

  • from_config(config) – Create a Trainer instance from a TrainerConfig.

build_engine(model_path: Path | None, model_config: XTunerBaseModelConfig, optim_config: OptimConfig, fsdp_config: FSDPConfig, load_checkpoint_path: str | Path | None, intra_layer_micro_batch: int = 1, strict: bool = True)

Build the training engine for the transformer model.

Parameters:
  • model_path (Path | None) – Path to the model checkpoint or None for new initialization.

  • model_config (TransformerConfig | BaseComposeConfig) – Model configuration.

  • optim_config (OptimConfig) – Optimizer configuration.

  • fsdp_config (FSDPConfig) – FSDP configuration for distributed training.

  • load_checkpoint_path (str | Path | None) – Path to a checkpoint to load when continuing training, or None.

  • intra_layer_micro_batch (int) – Intra-layer micro batch size for gradient accumulation.

  • strict (bool) – Whether to strictly load model weights.

Returns:

Initialized training engine.

Return type:

TrainEngine

build_lr_scheduler(lr_cfg: LRConfig, scheduler_step: int) → torch.optim.lr_scheduler.LRScheduler

Build the learning rate scheduler.

Parameters:

lr_cfg (LRConfig) – Configuration for the learning rate scheduler.

Returns:

Configured learning rate scheduler.

Return type:

torch.optim.lr_scheduler.LRScheduler
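
To make the idea of a learning rate scheduler concrete, the following pure-Python sketch computes a warmup-then-cosine multiplier of the kind such a scheduler often applies to the base learning rate. The fields of LRConfig are not listed on this page, so warmup_steps and min_ratio are hypothetical names, and using total_step in the role that scheduler_step might play is likewise an assumption.

```python
from __future__ import annotations

import math


def lr_multiplier(step: int, total_step: int, warmup_steps: int, min_ratio: float = 0.0) -> float:
    """Return the factor applied to the base learning rate at `step`.

    Linear warmup over `warmup_steps`, then cosine decay down to `min_ratio`.
    All parameter names are illustrative, not taken from LRConfig.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp linearly from 1/warmup_steps up to 1.0.
        return (step + 1) / warmup_steps
    decay_steps = max(1, total_step - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_ratio + (1.0 - min_ratio) * cosine


schedule = [lr_multiplier(s, total_step=20, warmup_steps=4, min_ratio=0.1) for s in range(21)]
print([round(m, 3) for m in schedule])
```

A function with this shape could be passed as lr_lambda to torch.optim.lr_scheduler.LambdaLR, which returns an LRScheduler; whether build_lr_scheduler does so internally is not stated here.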

fit()

Run the training loop.

This method executes the main training loop, iterating through the dataset and performing training steps. It handles data loading, forward pass, backward pass, optimization, logging, and checkpointing.
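
The stages listed above (data loading, forward pass, backward pass, optimization, logging, checkpointing) can be sketched on a toy problem. This is not the body of fit(); it is a minimal stand-in that fits a single weight w to y = 2x with plain SGD, and every name in it (toy_fit, log_interval, and so on) is hypothetical.

```python
from __future__ import annotations


def toy_fit(data, total_step, lr=0.1, log_interval=5, checkpoint_interval=10):
    """Run a tiny training loop and return (weight, logs, checkpoints)."""
    w = 0.0
    logs = []
    checkpoints = {}
    for step in range(1, total_step + 1):
        # Data loading: cycle through the dataset.
        x, y = data[(step - 1) % len(data)]
        # Forward pass and squared-error loss.
        pred = w * x
        loss = (pred - y) ** 2
        # Backward pass: analytic gradient of the loss with respect to w.
        grad = 2.0 * (pred - y) * x
        # Optimization: one SGD update.
        w -= lr * grad
        # Logging at a fixed interval.
        if step % log_interval == 0:
            logs.append((step, loss))
        # Checkpointing: store a copy of the state at a fixed interval.
        if step % checkpoint_interval == 0:
            checkpoints[step] = w
    return w, logs, checkpoints


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, logs, checkpoints = toy_fit(data, total_step=50)
print(f"w={w:.6f}, logged {len(logs)} times, checkpoints at {sorted(checkpoints)}")
```

In a real distributed run each of these stages is considerably more involved, but the ordering within a step is the same.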

classmethod from_config(config: TrainerConfig) → Self

Create a Trainer instance from a TrainerConfig.

Parameters:

config (TrainerConfig) – TrainerConfig instance containing all configuration parameters.

Returns:

Trainer instance initialized with the provided config.

Return type:

Self
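
The from_config pattern, an alternate constructor that unpacks a config object into keyword-only arguments, can be illustrated with stand-in classes. ToyTrainerConfig, ToyTrainer, and VerboseTrainer are hypothetical and carry only a few of the fields shown in the signature above; the real TrainerConfig is a much larger model and may be unpacked differently.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass
class ToyTrainerConfig:
    """Hypothetical stand-in for a trainer config with three fields."""
    total_step: int = 100
    seed: int = 42
    debug: bool = False


class ToyTrainer:
    """Hypothetical stand-in that accepts keyword-only arguments."""

    def __init__(self, *, total_step: int, seed: int = 42, debug: bool = False):
        self.total_step = total_step
        self.seed = seed
        self.debug = debug

    @classmethod
    def from_config(cls, config: ToyTrainerConfig) -> "ToyTrainer":
        # Using `cls` rather than the class name means subclasses get
        # instances of themselves, which is what a Self return type expresses.
        return cls(**asdict(config))


class VerboseTrainer(ToyTrainer):
    """Subclass used to show that from_config preserves the calling class."""


trainer = VerboseTrainer.from_config(ToyTrainerConfig(total_step=10))
print(type(trainer).__name__, trainer.total_step, trainer.seed)
```

Keeping construction behind a classmethod lets a serialized config be the single source of truth for a run while the constructor stays usable directly.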