```{important}
XTuner's RL (Reinforcement Learning) functionality is currently in Beta. RL features are still being improved; you are welcome to try them out and share feedback.
```


# [Beta] Customizing GRPO Training with Python Code


The previous [tutorial](../../get_started/grpo.md) showed how to quickly launch GRPO reinforcement learning training from the command line. This tutorial shows how to customize the GRPO training configuration in Python code, giving you more flexible control over training parameters.

GRPO training involves two main configuration modules: **Generation Config** (generation configuration) and **Trainer Config** (training configuration).

## 1. Generation Config (Generation Configuration)

In reinforcement learning training, data generation is a key stage that usually consists of three steps: **sampling → inference → filtering**. In the inference step, an efficient inference engine (such as LMDeploy) generates the model responses. This section introduces the configurations related to data generation so that you can control the entire generation process.

### 1.1 DataFlowConfig

`DataFlow` is the core controller for training data generation, responsible for coordinating the entire generation process.

For the GRPO algorithm, you need to set the following key parameters in `DataFlowConfig`:
- `prompt_repeat_k`: The number of responses sampled for each prompt
- `global_batch_size`: The global batch size for each Rollout round

```{tip}
:class: margin

For more configuration parameters, please refer to the API documentation: {class}`~xtuner.v1.ray.dataflow.DataFlowConfig`
```

```{code-block} python
:caption: Configure Data Flow
from xtuner.v1.ray.dataflow import DataFlowConfig

dataflow_config = DataFlowConfig(
    prompt_repeat_k=5,
    global_batch_size=1024
)
```
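
To see why `prompt_repeat_k` matters for GRPO: the algorithm scores each response relative to the other responses sampled for the same prompt. The following is a minimal sketch of that group-relative normalization in plain Python. It is not XTuner's implementation; the function name `group_relative_advantages`, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's responses against their group.

    `rewards` holds one scalar reward per sampled response, so its length
    corresponds to `prompt_repeat_k`. `eps` (hypothetical) avoids division
    by zero when every response receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; real implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt sampled 5 times (prompt_repeat_k=5); 1.0 = correct, 0.0 = wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.0])
print(advantages)
```

Correct responses end up with positive advantages and incorrect ones with negative advantages. If all responses in a group receive the same reward, every advantage is zero and the group contributes no learning signal, which is one reason to sample several responses per prompt.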


### 1.2 ReplayBufferConfig

The replay buffer works like a "data warehouse" with a simple job: **sample data, store data, and provide data according to certain rules**. In reinforcement learning, samples generated by the model are first stored in this "warehouse" and later retrieved from it for training.

**Most users only need to set the four values that feed into `ReplayBufferConfig`**:
- `model_path`: Model path
- `train_data_path`: Training data path
- `max_prompt_length`: Maximum length of input text
- `pack_max_length`: Maximum length of a packed training sequence

```{code-block} python
:caption: Configure Experience Replay Pool
from transformers import AutoTokenizer
from xtuner.v1.config import DatasetConfig, DataloaderConfig
from xtuner.v1.ray.dataflow import ReplayBufferConfig
from xtuner.v1.datasets import RLTextTokenizeFnConfig

train_data_path = "./gsm8k/train.jsonl"    # Training data path
model_path = "/path/to/qwen3-8B"           # Model path
max_prompt_length = 512                    # Maximum input length
pack_max_length = 32768                    # Maximum packaging length

replay_buffer_cfg = ReplayBufferConfig(
    dataset_cfg=[{
        "dataset": DatasetConfig(name="gsm8k", anno_path=train_data_path),
        "tokenize_fn": RLTextTokenizeFnConfig(max_length=max_prompt_length),
    }],
    dataloader_cfg=DataloaderConfig(
        pack_max_length=pack_max_length,
        collator='fake_collator',
        pack_level='none',
    ),
    tokenizer=AutoTokenizer.from_pretrained(model_path, trust_remote_code=True),
)
```
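
The three roles of the "warehouse" (sample, store, provide) can be pictured with a toy in-memory class. This is only a conceptual sketch: `ToyReplayBuffer` and its methods are hypothetical names invented for illustration and do not reflect the interface or behavior of XTuner's replay buffer.

```python
import random
from collections import deque


class ToyReplayBuffer:
    """Hypothetical in-memory stand-in for the roles of a replay buffer."""

    def __init__(self, prompts, seed=0):
        self._prompts = list(prompts)
        self._rng = random.Random(seed)
        self._samples = deque()

    def sample_prompts(self, n):
        """Sample: draw n distinct prompts to send to the inference engine."""
        return self._rng.sample(self._prompts, n)

    def add(self, prompt, response, reward):
        """Store: keep a generated sample until training asks for it."""
        self._samples.append(
            {"prompt": prompt, "response": response, "reward": reward}
        )

    def get(self, n):
        """Provide: hand out up to n stored samples, oldest first."""
        count = min(n, len(self._samples))
        return [self._samples.popleft() for _ in range(count)]


buffer = ToyReplayBuffer(["q1", "q2", "q3", "q4"])
for prompt in buffer.sample_prompts(2):
    buffer.add(prompt, response=f"answer to {prompt}", reward=1.0)

batch = buffer.get(8)  # fewer than 8 samples are stored, so all are returned
print(len(batch))
```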

### 1.3 RolloutConfig

`RolloutConfig` configures the model inference environment and determines how the model is used to generate the sample data needed for training. You can think of it as the "configuration file for the inference engine".

In this example, you only need to specify the model path; all other settings use their defaults.

```{tip}
:class: margin

If you need more fine-grained control (such as distributed inference, inference optimization options, etc.), you can refer to the API documentation: {class}`~xtuner.v1.ray.config.worker.RolloutConfig`
```

```{code-block} python
:caption: Configure Inference Environment
from xtuner.v1.rl.rollout.worker import RolloutConfig

model_path = "/path/to/qwen3-8B"  # Replace with your model path

rollout_config = RolloutConfig(
    model_path=model_path,           # Inference model path
    model_name="qwen3-8B",           # Model name
    tokenizer_path=model_path,       # Tokenizer path
)
```


### 1.4 JudgerConfig

XTuner provides a ready-made judge for GSM8K. You can use the example code directly.

```{code-block} python
:caption: Configure Reward Model
from xtuner.v1.rl.judger import GSM8KJudgerConfig
from xtuner.v1.rl.utils import CPUResourcesConfig

judger_config = GSM8KJudgerConfig(
    judger_name="openai/gsm8k",
    cpu_resources=CPUResourcesConfig(
        num_workers=1,
        num_cpus_per_worker=1,
    ),
)
```

**Usage Instructions**:
- `"openai/gsm8k"`: Logical judge name. With a single `JudgerConfig`, samples are sent directly to this judge. With `ComposedJudgerConfig`, `RolloutState.data_source` routes samples to branches, and string values or dict keys must match `branches`.
- `GSM8KJudgerConfig`: Judge designed for GSM8K math problems; it checks whether the numerical answer is correct
- `cpu_resources`: Runs the judge in Ray CPU actor(s) outside the placement group (PG). If omitted, the judge runs locally.

💡 **Extended Functionality**: XTuner supports functional reward handlers, API-service reward handlers, custom Judgers, and composed Judgers for routing or multi-judge scoring.
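
To make "checks whether the numerical answer is correct" concrete, here is a minimal sketch of a rule-based reward in plain Python. It is not the logic of `GSM8KJudgerConfig`: the helper names `last_number` and `gsm8k_style_reward`, and the rule of comparing the last number in the response, are assumptions for illustration. The only dataset fact relied on is that GSM8K reference solutions end with `#### <answer>`.

```python
import re

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def last_number(text):
    """Return the last number in `text` as a float, or None if there is none."""
    matches = _NUMBER.findall(text.replace(",", ""))
    return float(matches[-1]) if matches else None


def gsm8k_style_reward(response, reference):
    """Hypothetical rule-based reward: 1.0 if the final numbers agree, else 0.0.

    `reference` follows the GSM8K convention of ending with `#### <answer>`.
    """
    target = last_number(reference.split("####")[-1])
    predicted = last_number(response)
    return 1.0 if predicted is not None and predicted == target else 0.0


reference = "She sells 16 - 3 - 4 = 9 eggs. 9 * 2 = 18 dollars. #### 18"
reward_correct = gsm8k_style_reward("So she makes 18 dollars per day.", reference)
reward_wrong = gsm8k_style_reward("The answer is 20.", reference)
print(reward_correct, reward_wrong)
```

A binary reward like this pairs naturally with repeated sampling: within one prompt's group of responses, the correct ones are rewarded relative to the incorrect ones.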

## 2. Trainer Config (Training Configuration)

### 2.1 WorkerConfig

`WorkerConfig` is the core of the training phase: it controls how the model learns and optimizes, and covers all core training-related configurations such as model structure, optimizer, and loss function.

For the Qwen3-8B model, we have prepared best-practice configurations for you. In most cases, you only need to specify basic parameters such as the model path, the number of optimizer steps, and the packed training sequence length:

```{tip}
:class: margin

For more configuration parameters, please refer to the API documentation: {class}`~xtuner.v1.rl.base.worker.WorkerConfig`
```

```{code-block} python
:caption: Configure Training Strategy
from xtuner.v1.config import AdamWConfig, FSDPConfig, LRConfig
from xtuner.v1.model.dense.qwen3 import Qwen3Dense8BConfig
from xtuner.v1.rl.trainer import WorkerConfig
from xtuner.v1.rl.loss import GRPOLossConfig

model_path = "/path/to/qwen3-8B"        # Fill in your model path
train_optimizer_steps = 4               # Training optimization steps
pack_max_length = 32768                 # Maximum data packaging length

train_worker_cfg = WorkerConfig(
    model_cfg=Qwen3Dense8BConfig(),                    # Use preset Qwen3-8B configuration
    optim_cfg=AdamWConfig(lr=1e-6, foreach=False),    # Optimizer: learning rate 1e-6
    loss_cfg=GRPOLossConfig(                          # GRPO loss function configuration
        policy_loss_cfg=dict(
            cliprange_high=0.2,     # Policy gradient clipping upper limit
            cliprange_low=0.2,      # Policy gradient clipping lower limit
            loss_type="vanilla",    # Loss type
        ),
        ignore_idx=-100,            # Ignored token index
        use_kl_loss=True,           # Enable KL divergence loss
        kl_loss_coef=0.001,         # KL loss coefficient
        kl_loss_type="low_var_kl",  # KL loss type
        mode="chunk",               # Calculation mode
        chunk_size=512              # Chunk size
    ),
    lr_cfg=LRConfig(warmup_ratio=0),       # Learning rate strategy: no warmup
    fsdp_cfg=FSDPConfig(),                 # Distributed training configuration
    load_from=model_path,                  # Load model path
    optimizer_steps=train_optimizer_steps, # Optimization steps
    pack_max_length=pack_max_length,       # Maximum sequence length
)
```
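
The loss parameters above can be read against the following per-token sketch in plain Python. It rests on two assumptions that the configuration itself does not confirm: that `loss_type="vanilla"` denotes the PPO-style clipped surrogate objective, and that `low_var_kl` denotes the commonly used "k3" low-variance KL estimator. The function names are hypothetical, and the sketch ignores token masking (`ignore_idx`) and chunked computation (`mode`, `chunk_size`).

```python
import math


def clipped_policy_loss(logp_new, logp_old, advantage,
                        cliprange_low=0.2, cliprange_high=0.2):
    """PPO-style clipped surrogate loss for a single token (assumed form)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - cliprange_low), 1.0 + cliprange_high)
    # Pessimistic bound: keep the larger (worse) of the two candidate losses.
    return max(-advantage * ratio, -advantage * clipped)


def low_var_kl(logp_new, logp_ref):
    """k3 estimator of the KL penalty toward the reference policy (assumed form)."""
    log_ratio = logp_ref - logp_new
    return math.exp(log_ratio) - log_ratio - 1.0


def token_loss(logp_new, logp_old, logp_ref, advantage, kl_loss_coef=0.001):
    """Policy loss plus the weighted KL term, mirroring use_kl_loss=True."""
    policy = clipped_policy_loss(logp_new, logp_old, advantage)
    return policy + kl_loss_coef * low_var_kl(logp_new, logp_ref)


# Probability ratio of 1.5 with a positive advantage: clipping caps it at 1.2.
loss_clipped = clipped_policy_loss(math.log(1.5), 0.0, advantage=1.0)
print(loss_clipped)
```

With `cliprange_low` and `cliprange_high` both at 0.2, the probability ratio is effectively confined to the range 0.8 to 1.2 when it would otherwise improve the objective, which limits how far a single update can move the policy.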


### 2.2 EvaluatorConfig [Optional]

If you need to run validation during training, configure `EvaluatorConfig`. It defines the validation dataset, validation frequency, and so on.
In this example, you only need to modify `eval_data_path` and the `evaluate_step` interval.

```{code-block} python
:caption: Configure Validation Process
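from transformers import AutoTokenizer
from xtuner.v1.config import DatasetConfig
from xtuner.v1.ray.evaluator import EvaluatorConfig

model_path = "/path/to/qwen3-8B"  # Same model path as in the sections above
eval_data_path = "./gsm8k/test.jsonl"
# Reuse the tokenizer of the model being trained
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
eval_dataset_cfg = [{"dataset": DatasetConfig(name="gsm8k", anno_path=eval_data_path)}]
evaluator_cfg = EvaluatorConfig(
    dataset_cfg=eval_dataset_cfg,
    tokenizer=tokenizer,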
    evaluate_step=10, # Validate once every 10 training epochs
)
```

## 3. Build and Launch RLTrainer

### 3.1 AcceleratorResourcesConfig

Besides the generation and training configurations above, we also need to configure the required system resources (such as GPU, CPU, and memory). Here we use the default resource configuration, as shown below.

```{code-block} python
from xtuner.v1.ray.accelerator import AcceleratorResourcesConfig
resources = AcceleratorResourcesConfig(
    accelerator="GPU",
    num_accelerators_per_worker=1,
    num_cpus_per_worker=12,
    num_workers=8,
    cpu_memory_per_worker=16 * 1024**3,
)
```
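
As a quick sanity check before launching, the totals implied by these values can be worked out by hand. The sketch below assumes that the per-worker values are simply multiplied by `num_workers`; that reading of the fields is an assumption and is not confirmed by the API documentation.

```python
GIB = 1024**3

# Values copied from the AcceleratorResourcesConfig example above.
num_workers = 8
num_accelerators_per_worker = 1
num_cpus_per_worker = 12
cpu_memory_per_worker = 16 * GIB

# Assumed interpretation: cluster totals = per-worker value * num_workers.
total_accelerators = num_workers * num_accelerators_per_worker
total_cpus = num_workers * num_cpus_per_worker
total_cpu_memory_gib = num_workers * cpu_memory_per_worker // GIB

print(total_accelerators, total_cpus, total_cpu_memory_gib)  # 8 96 128
```

Under this assumption, the example asks for 8 GPUs, 96 CPU cores, and 128 GiB of host memory, so the machine or cluster running Ray needs at least that much available.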

### 3.2 Assemble RLTrainer
After completing the configuration of all components, we can assemble them into `RLTrainer` and launch the training process.

```{code-block} python
:caption: Build and Launch RLTrainer
import ray
from xtuner.v1.train.rl_trainer import RLTrainer

# Initialize Ray
ray.init(num_cpus=128, ignore_reinit_error=True)

# Modify paths
model_path = "/path/to/qwen3-8B"
train_data_path = "./gsm8k/train.jsonl"
eval_data_path = "./gsm8k/test.jsonl"
work_dir = "work_dirs/grpo_py_train"

# Configure parameters
prompt_repeat_k = 5
global_batch_size = 1024
max_prompt_length = 512
pack_max_length = 32768
train_optimizer_steps = 4

# Declare all above configs
# ...

# Assemble RLTrainer
trainer = RLTrainer(
    resources=resources,
    rollout_config=rollout_config,
    dataflow_config=dataflow_config,
    judger_config=judger_config,
    replay_buffer_config=replay_buffer_cfg,
    evaluator_config=evaluator_cfg,
    train_worker_cfg=train_worker_cfg,
    tokenizer_path=model_path,
    work_dir=work_dir,
    total_epochs=15,
    enable_evaluate=False
)
# Start training
trainer.fit()
```

## 4. Conclusion

Combine and save all the above configurations as a Python file (e.g., `train_grpo.py`), and you can launch training with the following command:

```bash
XTUNER_USE_FA3=1 XTUNER_USE_LMDEPLOY=1 python train_grpo.py
```

Congratulations! You now know how to customize `RLTrainer` through Python code and can run reinforcement learning experiments more flexibly.