xtuner.v1.train.trainer.Trainer
- class xtuner.v1.train.trainer.Trainer(*, load_from: str | Path | None = None, model_cfg: XTunerBaseModelConfig, optim_cfg: OptimConfig, fsdp_cfg: FSDPConfig | None = pydantic.BaseModel, dataset_cfg: list[xtuner.v1.datasets.config.DatasetCombine] | None = None, dataloader_cfg: DataloaderConfig, loss_cfg: CELossConfig | None = pydantic.BaseModel, lr_cfg: LRConfig, tokenizer_path: str | Path | None = None, global_batch_size: int | None, work_dir: Path | str | None = None, log_dir: Path | str | None = None, sp_size: int = 1, total_step: int | None = None, total_epoch: int | None = None, resume_cfg: ResumeConfig | None = pydantic.BaseModel, auto_resume: bool = False, load_checkpoint_cfg: LoadCheckpointConfig = pydantic.BaseModel, strict_load: bool = True, checkpoint_interval: int | None = -1, checkpoint_maxkeep: int | None = -1, async_hf_export: bool = False, skip_checkpoint_validation: bool = False, patch_for_dcp_finish: bool = False, async_checkpoint: bool = False, snapshot_interval: int | None = None, check_health_interval: int | None = None, hf_interval: int | None = None, hf_max_keep: int | None = None, exp_tracker: Literal['tensorboard', 'jsonl'] = 'jsonl', profile_step: list[int] | int | None = None, profile_time: bool = True, profile_memory: bool = False, intra_layer_micro_batch: int = 1, seed: int = 42, debug: bool = False, backend: str | None = None, debug_skip_save: bool = False, prober_list: list[str] = [], do_clip: bool = True, grad_norm_dtype: torch.dtype = torch.float32, trainer_cfg: TrainerConfig | None = None, hooks_config: HooksConfig = pydantic.BaseModel, internal_metrics_cfg: InternalMetricsConfig | None = None)
Trainer class for fine-tuning transformer models with FSDP support.
This class provides a high-level interface for training transformer models with configurable distributed training, optimization, and checkpointing. It supports various training configurations including sequence parallelism, tensor parallelism, and data parallelism.
- Parameters:
load_from (str | Path | None) – Path to Huggingface model or saved trainer checkpoint.
model_cfg (TransformerConfig | InternS1BaseConfig) – Configuration for the transformer model architecture.
optim_cfg (OptimConfig) – Configuration for the optimizer.
fsdp_cfg (FSDPConfig | None) – Configuration for Fully Sharded Data Parallel (FSDP).
dataset_cfg (DatasetConfigList) – Configuration for training datasets.
dataloader_cfg (DataloaderConfig) – Configuration for the data loader.
loss_cfg (CELossConfig | None) – Config for the cross-entropy loss function.
lr_cfg (LRConfig) – Configuration for the learning rate scheduler.
tokenizer_path (str | Path | None) – Path to the tokenizer.
global_batch_size (int | None) – Global batch size for training.
work_dir (Path | str | None) – Directory for saving experiment outputs.
log_dir (Path | str | None) – Directory for log files.
sp_size (int) – Sequence parallel size.
total_step (int | None) – Total training steps.
total_epoch (int | None) – Number of training epochs.
resume_cfg (ResumeConfig | None) – Configuration for resuming training.
auto_resume (bool) – Whether to automatically resume training. Defaults to False.
load_checkpoint_cfg (LoadCheckpointConfig) – Configuration for loading checkpoints.
strict_load (bool) – Whether to strictly load model weights.
checkpoint_interval (int | None) – Interval for saving checkpoints.
checkpoint_maxkeep (int | None) – Maximum number of checkpoints to keep.
patch_for_dcp_finish (bool) – If True, skip returning finish_checkpoint result.
hf_interval (int | None) – Interval for saving Huggingface format checkpoints.
hf_max_keep (int | None) – Maximum number of Huggingface checkpoints to keep.
profile_step (list[int] | int | None) – Step or list of steps at which to run profiling.
profile_time (bool) – Whether to profile training time.
profile_memory (bool) – Whether to profile memory usage.
intra_layer_micro_batch (int) – Intra-layer micro batch size.
seed (int) – Random seed for reproducibility.
debug (bool) – Whether to enable debug mode.
backend (str | None) – Backend for distributed training.
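The relationship between global_batch_size and sp_size can be illustrated with a small arithmetic sketch. It assumes the common convention that ranks inside one sequence-parallel group share the same samples, so the data-parallel size is the world size divided by sp_size. The helper name per_rank_batch and the world_size argument are hypothetical and are not part of the Trainer API; how Trainer actually splits the batch is not stated in this reference.

```python
def per_rank_batch(global_batch_size: int, world_size: int, sp_size: int = 1) -> int:
    """Samples each data-parallel group handles per optimizer step.

    Assumption (not confirmed by the Trainer docs): ranks in the same
    sequence-parallel group work on the same samples, so the number of
    data-parallel groups is world_size // sp_size.
    """
    if world_size % sp_size != 0:
        raise ValueError("world_size must be divisible by sp_size")
    dp_size = world_size // sp_size
    if global_batch_size % dp_size != 0:
        raise ValueError("global_batch_size must be divisible by the data-parallel size")
    return global_batch_size // dp_size


if __name__ == "__main__":
    # 16 ranks with sp_size=2 form 8 data-parallel groups.
    print(per_rank_batch(global_batch_size=64, world_size=16, sp_size=2))
```

Under this assumption, raising sp_size while keeping the world size fixed increases the number of samples each data-parallel group must process per step.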
Methods
build_engine(model_path, model_config, ...) – Build the training engine for the transformer model.
build_lr_scheduler(lr_cfg, scheduler_step) – Build the learning rate scheduler.
fit() – Run the training loop.
from_config(config) – Create a Trainer instance from a TrainerConfig.
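A minimal usage sketch, built only from the constructor signature and the fit() method listed above. The helper names make_trainer_kwargs and run_training are hypothetical, and the config objects are taken as arguments because their fields are not documented in this section. Whether a distributed process group must be initialized beforehand is also not stated here.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any


def make_trainer_kwargs(
    model_cfg: Any,
    optim_cfg: Any,
    dataloader_cfg: Any,
    lr_cfg: Any,
    *,
    global_batch_size: int | None,
    work_dir: str | Path | None = None,
    total_step: int | None = None,
) -> dict[str, Any]:
    """Collect keyword arguments using only names from the Trainer signature.

    The five arguments without defaults in the signature (model_cfg,
    optim_cfg, dataloader_cfg, lr_cfg, global_batch_size) are always included.
    """
    return {
        "model_cfg": model_cfg,
        "optim_cfg": optim_cfg,
        "dataloader_cfg": dataloader_cfg,
        "lr_cfg": lr_cfg,
        "global_batch_size": global_batch_size,
        "work_dir": work_dir,
        "total_step": total_step,
    }


def run_training(trainer_kwargs: dict[str, Any]) -> None:
    """Construct the Trainer and run the training loop."""
    # Imported lazily so the sketch can be read without XTuner installed.
    from xtuner.v1.train.trainer import Trainer

    # The constructor is keyword-only, so the dict is unpacked with **.
    trainer = Trainer(**trainer_kwargs)
    trainer.fit()
```

Because every constructor argument is keyword-only, passing a dict with ** keeps the call site readable when many options are set.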
- build_engine(model_path: Path | None, model_config: XTunerBaseModelConfig, optim_config: OptimConfig, fsdp_config: FSDPConfig, load_checkpoint_path: str | Path | None, intra_layer_micro_batch: int = 1, strict: bool = True)
Build the training engine for the transformer model.
- Parameters:
model_path (Path | None) – Path to the model checkpoint or None for new initialization.
model_config (TransformerConfig | BaseComposeConfig) – Model configuration.
optim_config (OptimConfig) – Optimizer configuration.
fsdp_config (FSDPConfig) – FSDP configuration for distributed training.
load_checkpoint_path (str | Path | None) – Path to a checkpoint to load, or None to skip checkpoint loading.
intra_layer_micro_batch (int) – Intra-layer micro batch size for gradient accumulation.
strict (bool) – Whether to strictly load model weights.
- Returns:
Initialized training engine.
- Return type:
TrainEngine
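The strict flag can be pictured with a key-comparison sketch. It assumes strict follows the convention of torch.nn.Module.load_state_dict, where any missing or unexpected key is an error in strict mode. The helper check_state_dict_keys is hypothetical and is not part of XTuner; the engine's actual loading logic is not described in this section.

```python
from typing import Iterable


def check_state_dict_keys(
    model_keys: Iterable[str],
    checkpoint_keys: Iterable[str],
    strict: bool = True,
) -> tuple[list[str], list[str]]:
    """Compare parameter names expected by the model with those in a checkpoint.

    Returns (missing, unexpected). In strict mode any mismatch raises,
    mirroring the assumed load_state_dict-style behaviour.
    """
    model_set, ckpt_set = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_set - ckpt_set)      # in the model, absent from the checkpoint
    unexpected = sorted(ckpt_set - model_set)   # in the checkpoint, unknown to the model
    if strict and (missing or unexpected):
        raise RuntimeError(f"missing keys: {missing}; unexpected keys: {unexpected}")
    return missing, unexpected
```

With strict=False the mismatches are returned instead of raised, which is the usual choice when loading a partially matching checkpoint.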
- build_lr_scheduler(lr_cfg: LRConfig, scheduler_step: int) → torch.optim.lr_scheduler.LRScheduler
Build the learning rate scheduler.
- Parameters:
lr_cfg (LRConfig) – Configuration for the learning rate scheduler.
- Returns:
Configured learning rate scheduler.
- Return type:
torch.optim.lr_scheduler.LRScheduler
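As an illustration of what a step-based scheduler computes, the sketch below gives a learning-rate multiplier with linear warmup followed by cosine decay. This is an assumed schedule chosen for the example: the schedule that LRConfig actually selects is not documented in this section, and the function name and its parameters (total_steps, warmup_steps, min_ratio) are hypothetical. Here total_steps plays the role that scheduler_step is assumed to play.

```python
import math


def warmup_cosine_factor(
    step: int,
    total_steps: int,
    warmup_steps: int,
    min_ratio: float = 0.0,
) -> float:
    """Multiplier applied to the base learning rate at a given step.

    Linear warmup over the first warmup_steps steps, then cosine decay
    from 1.0 down to min_ratio at total_steps.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if step < warmup_steps:
        # Ramp from 1/warmup_steps up to 1.0 on the last warmup step.
        return (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_ratio + (1.0 - min_ratio) * cosine
```

A function of this shape, with the step counts bound via functools.partial, can be passed as lr_lambda to torch.optim.lr_scheduler.LambdaLR, which returns an LRScheduler instance.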