# Advanced Usage of RL Trainer

This document introduces the user-facing usage of `AgentLoopManager`, `ProduceStrategy`,
`RLColocateTrainer`, and the disaggregated `RLDisaggregatedTrainer`. It focuses on what each
component is responsible for, how the training flow is connected, and how to choose common
configurations.

If you only want to run single-turn GRPO, start with the [basic training tutorial](../tutorial/rl_grpo_trainer.md).
Read this document when you need multi-task training, asynchronous rollout, partial rollout, or when you want to
place training and rollout on different groups of GPUs.

## Overall Relationship

An RL training pipeline can be roughly understood as:

```text
Sampler
  -> AgentLoop generates responses
  -> Judger writes rewards
  -> ProduceStrategy controls the production pace
  -> AgentLoopManager writes to / reads from ReplayBuffer
  -> RLTrainer trains, evaluates, saves, and synchronizes weights
```

The responsibility boundaries of the main modules are:

| Module                    | Responsibility                                                                                             |
| ------------------------- | ---------------------------------------------------------------------------------------------------------- |
| `AgentLoopManager`        | Assembles the sampler, agent loop, judger, and production strategy into one or more rollout tasks, and provides training batches to the trainer. |
| `ProduceStrategy`         | Decides how rollout data is produced: synchronously produce a batch, or keep producing in the background with oversampling and continuation. |
| `RLColocateTrainer`       | Training workers and rollout workers use the same group of GPUs; rollout and training switch resources by step. |
| `RLDisaggregatedTrainer`  | Training workers and rollout workers use different groups of GPUs; rollout produces in the background while training consumes in the foreground. |

## AgentLoopManager

`AgentLoopManagerConfig` is the main entry point on the generation side. It does not define how to generate a
single sample by itself. Instead, it binds the following configurations into training tasks:

- `sampler_config`: samples prompts from the dataset and groups them by `prompt_repeat_k`.
- `agent_loop_config`: defines how a rollout is executed, such as single-turn QA or tool calling.
- `judger_config`: scores rollout results.
- `produce_strategy_config`: controls the rollout production pace.
- `weight`: controls batch allocation weights during multi-task training.

Single-task configuration example:

```{code-block} python
:caption: Configure a single rollout task
from xtuner.v1.rl.agent_loop_manager import (
    AgentLoopManagerConfig,
    SyncProduceStrategyConfig,
    TaskSpecConfig,
)

agent_loop_manager_cfg = AgentLoopManagerConfig(
    tasks=TaskSpecConfig(
        task_name="train_task",
        agent_loop_config=train_agent_loop_config,
        judger_config=judger_config,
        produce_strategy_config=SyncProduceStrategyConfig(),
        sampler_config=train_sampler_config,
    )
)
```

For multi-task training, pass a list of `TaskSpecConfig`. `weight` controls how `train_batch_size` is allocated
to different tasks at each training step:

```{code-block} python
:caption: Configure multi-task rollout
agent_loop_manager_cfg = AgentLoopManagerConfig(
    tasks=[
        TaskSpecConfig(
            task_name="math",
            weight=2.0,
            agent_loop_config=math_agent_loop_config,
            judger_config=math_judger_config,
            produce_strategy_config=SyncProduceStrategyConfig(),
            sampler_config=math_sampler_config,
        ),
        TaskSpecConfig(
            task_name="code",
            weight=1.0,
            agent_loop_config=code_agent_loop_config,
            judger_config=code_judger_config,
            produce_strategy_config=SyncProduceStrategyConfig(),
            sampler_config=code_sampler_config,
        ),
    ]
)
```

In the example above, if `train_batch_size=96`, each step's batch is split approximately according to
`math:code = 2:1` by default, that is, about 64 rollout groups for `math` and 32 for `code`.
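
The 2:1 arithmetic can be checked with a short sketch. This is only an illustration of proportional allocation:
`allocate_batch` is a hypothetical helper, not part of xtuner, and the largest-remainder rounding used for batch
sizes that do not divide evenly is an assumption, since the rounding rule is not specified here.

```python
def allocate_batch(train_batch_size, weights):
    """Split train_batch_size across tasks in proportion to their weights.

    Hypothetical helper for illustration only. The largest-remainder
    rounding below is an assumption, not the library's documented rule.
    """
    total_weight = sum(weights.values())
    # Exact (possibly fractional) share of the batch for each task.
    raw = {name: train_batch_size * w / total_weight for name, w in weights.items()}
    # Start from the integer part of each share.
    alloc = {name: int(share) for name, share in raw.items()}
    leftover = train_batch_size - sum(alloc.values())
    # Hand the remaining slots to the tasks with the largest fractional parts.
    by_remainder = sorted(weights, key=lambda name: raw[name] - alloc[name], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


print(allocate_batch(96, {"math": 2.0, "code": 1.0}))  # {'math': 64, 'code': 32}
```

With `train_batch_size=96` the split is exact. For sizes such as 100 the result depends on the rounding rule, which
is why the text says "approximately".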

## ProduceStrategy

`ProduceStrategy` decides how data is produced. Users usually only need to choose the configuration class.

### SyncProduceStrategyConfig

`SyncProduceStrategyConfig` is the simplest and most on-policy-like option: when the current training step needs
data, it first generates enough rollout groups and then passes this batch to the trainer.

Applicable scenarios:

- The default choice for colocated training.
- You want each step to use data generated by the current weights as much as possible.
- You do not need partial rollout, oversampling, or stale sample reuse.

Configuration:

```python
produce_strategy_config = SyncProduceStrategyConfig()
```

### AsyncProduceStrategyConfig

`AsyncProduceStrategyConfig` is used to improve rollout throughput. It allows the producer to prepare future
samples beyond the current batch, and supports partial rollout, stale sample reuse, and retrying expired samples.

Common configuration:

```{code-block} python
:caption: Configure an asynchronous production strategy
from xtuner.v1.rl.agent_loop_manager import AsyncProduceStrategyConfig

produce_strategy_config = AsyncProduceStrategyConfig(
    over_sample_threshold=0.2,
    enable_partial_rollout=True,
    max_staleness=1,
    tail_batch_trigger_size=64,
)
```

Parameter meanings:

| Parameter | Description |
| --- | --- |
| `over_sample_threshold` | The ratio of extra samples that may be generated. A larger value makes it easier to keep the rollout side fully loaded, but may produce more samples that are not from the current step. |
| `enable_partial_rollout` | Whether rollouts paused before weight synchronization may continue after synchronization. Before using this for tool calling or multi-turn tasks, confirm that the AgentLoop supports continuation. |
| `max_staleness` | The number of synchronization cycles by which samples may lag behind the current training progress. A larger value gives more throughput flexibility but weakens the on-policy property. |
| `tail_batch_trigger_size` | When expired samples accumulate to this number, tail batch mode is entered and these samples are retried first. |

`max_staleness` is counted in "weight synchronization cycles". The actual expiration threshold used in code is:

```text
stale_threshold = (max_staleness + 1) * sync_weights_interval
```

The `+1` represents the natural lag allowed within the current synchronization cycle: `model_step` records the
training step from which the rollout model's weights were synchronized, so while training step `model_step + 1`
runs, these samples still come from the current weights. `max_staleness=0` means only samples within the current
synchronization cycle are allowed. `max_staleness=1` additionally allows samples to cross one extra synchronization
cycle.

Both oversampling and partial rollout are affected by `max_staleness`:

- `over_sample_threshold>0` generates samples ahead of future steps. If these samples cross the next weight
  synchronization point, they are retained as trainable samples only when allowed by `max_staleness`.
- `enable_partial_rollout=True` lets paused responses continue after synchronization. Sample staleness is calculated
  by the earliest model version in the response, so continuation across synchronization cycles also needs room from
  `max_staleness`.
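
The effect of "earliest model version" on a continued response can be sketched as follows. This is a simplified
model, not the library's code: `sample_staleness` and `is_expired` are hypothetical names, staleness is assumed to
be the current training step minus the earliest `model_step` in the response, and a sample is assumed to expire
only when its staleness is strictly greater than the threshold.

```python
def sample_staleness(current_train_step, model_steps):
    """Staleness of one response, judged by its earliest model version.

    model_steps lists the model_step of every segment of the response;
    a response continued after a weight sync has more than one entry.
    Assumption: staleness = current training step - earliest model_step.
    """
    return current_train_step - min(model_steps)


def is_expired(current_train_step, model_steps, max_staleness, sync_weights_interval):
    """Assumed rule: expired when staleness is strictly above the threshold."""
    threshold = (max_staleness + 1) * sync_weights_interval
    return sample_staleness(current_train_step, model_steps) > threshold


# Generated entirely by the weights from step 0, consumed at step 1: natural lag only.
print(is_expired(1, [0], max_staleness=0, sync_weights_interval=1))     # False
# Started under step-0 weights, continued under step-1 weights, consumed at step 2.
print(is_expired(2, [0, 1], max_staleness=0, sync_weights_interval=1))  # True
print(is_expired(2, [0, 1], max_staleness=1, sync_weights_interval=1))  # False
```

Under these assumptions, a response continued across one synchronization is usable only when `max_staleness` is at
least 1, which matches the two bullets above.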

Tail batch handles samples that have expired during asynchronous production. When the number of `expired`
samples reaches `tail_batch_trigger_size`, `AsyncProduceStrategy` enters tail batch mode: this round no longer
oversamples according to `over_sample_threshold`, only fills the required target, and retries samples from the
expired sample pool first. In effect, the round behaves like synchronous production without oversampling. Its goal
is not to improve throughput, but to collect long-tail expired samples again so that they do not stay in the buffer
for too long.
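
The switch into tail batch mode can be pictured with the sketch below. `plan_round` is a hypothetical function, and
the way `over_sample_threshold` is turned into an integer target (rounding down) is an assumption; only the trigger
condition and the "no oversampling, expired samples first" behavior come from the description above.

```python
def plan_round(required, expired_count, over_sample_threshold, tail_batch_trigger_size):
    """Decide the production target for one round (illustrative only).

    required: number of rollout groups the trainer needs for this round.
    expired_count: current size of the expired sample pool.
    """
    if expired_count >= tail_batch_trigger_size:
        # Tail batch mode: fill exactly the required target, expired samples first.
        return {"mode": "tail_batch", "target": required, "retry_expired_first": True}
    # Normal mode: allow extra samples according to the oversampling ratio.
    # Rounding down is an assumption made for this sketch.
    extra = int(required * over_sample_threshold)
    return {"mode": "normal", "target": required + extra, "retry_expired_first": False}


print(plan_round(128, expired_count=10, over_sample_threshold=0.2, tail_batch_trigger_size=64))
print(plan_round(128, expired_count=64, over_sample_threshold=0.2, tail_batch_trigger_size=64))
```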

Note: setting `max_staleness>0` together with `enable_partial_rollout=False` is not recommended. With this
combination, long-tail oversampled samples that are still running at weight synchronization may be reset, because
they cannot be continued without partial rollout. Sample reset in `RolloutWorker` currently keeps only the prompt
field, so the expiration information returns to 0 on every reset. These samples therefore never expire, tail batch
does not take over in time, and they may again fail to finish within the next synchronization window and be retried
repeatedly. There is not yet a `tail_batch_max_tries` mechanism to trigger tail batch by retry count. Therefore, when
`max_staleness>0`, prefer setting `enable_partial_rollout=True`.

In disaggregated training, do not customize `should_continue_fn` for early stopping. The current
`RLDisaggregatedTrainer` requires it to keep the default behavior, otherwise background production and foreground
training consumption may not match.

For `ReplayBufferConfig`, the example configurations are usually sufficient: use `SyncReplayBufferConfig()` for
the introductory synchronous setup. For colocated asynchronous production, refer to `rl_grpo_gsm8k_async.py` and use
`AsyncReplayBufferConfig()` so that the consumer side prioritizes completed samples with higher staleness.

### Two Interfaces of AsyncProduceStrategy

The core interfaces of `AsyncProduceStrategy` are split into two categories to support colocated and disaggregated
training:

- In the colocated path, `AgentLoopManager.produce_batch()` uses local progress to connect "produce to buffer ->
  pause and finish -> take training batch from buffer".
- In the disaggregated path, `AgentLoopManager.produce_loop()` continuously calls the production interface in the
  background. The foreground trainer consumes through `get_batch()`, calls `pause_produce()` at synchronization
  points, and resumes with `continue_produce()` after synchronizing weights.
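
The pause / synchronize / continue ordering of the disaggregated path can be illustrated with a toy,
single-threaded simulation. Everything here is a stand-in: `ToyProducer` and `run_toy_training` are hypothetical,
the method names merely mirror the ones mentioned above, and the real producer runs concurrently in the background
rather than being stepped by the training loop.

```python
from collections import deque


class ToyProducer:
    """Toy stand-in for a background producer; not the xtuner implementation."""

    def __init__(self):
        self.buffer = deque()  # plays the role of the replay buffer
        self.paused = False
        self.model_step = 0    # weights version currently used for rollout

    def produce_once(self):
        """Write one sample tagged with the current model_step, unless paused."""
        if self.paused:
            return False
        self.buffer.append(self.model_step)
        return True

    def pause_produce(self):
        self.paused = True

    def continue_produce(self, model_step):
        """Resume production with the newly synchronized weights version."""
        self.model_step = model_step
        self.paused = False

    def get_batch(self, batch_size):
        return [self.buffer.popleft() for _ in range(batch_size)]


def run_toy_training(total_steps, batch_size, sync_weights_interval):
    producer = ToyProducer()
    log = []
    for step in range(1, total_steps + 1):
        # Stand-in for background production: top up the buffer on demand.
        while len(producer.buffer) < batch_size:
            producer.produce_once()
        log.append(("train", step, producer.get_batch(batch_size)))
        if step % sync_weights_interval == 0:
            producer.pause_produce()                    # stop producing with old weights
            log.append(("sync", step))                  # weights would be synchronized here
            producer.continue_produce(model_step=step)  # resume with the new version
    return log


for event in run_toy_training(total_steps=4, batch_size=2, sync_weights_interval=2):
    print(event)
```

In this toy run, steps 1 and 2 train on samples tagged with `model_step=0`, and steps 3 and 4 train on samples
tagged with `model_step=2`, because production resumes only after the synchronization at step 2.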

## RLColocateTrainer

`RLColocateTrainer` corresponds to the configuration class
{class}`~xtuner.v1.train.rl_trainer.RLColocateTrainerConfig`. It lets training workers and rollout workers use the
same group of resources.

The flow can be understood as:

```text
rollout generates one batch of data
  -> pause / release rollout-side resources
  -> training workers consume this batch
  -> synchronize training weights to rollout workers at synchronization points
  -> enter the next rollout step
```

Minimal structure:

```{code-block} python
:caption: Configure a colocated Trainer
from xtuner.v1.rl.replay_buffer import SyncReplayBufferConfig
from xtuner.v1.rl.utils import AcceleratorResourcesConfig
from xtuner.v1.train.rl_trainer import RLColocateTrainerConfig

resources = AcceleratorResourcesConfig(
    accelerator="GPU",
    num_workers=8,
)

trainer = RLColocateTrainerConfig(
    resources=resources,
    train_worker_cfg=train_worker_cfg,
    rollout_config=rollout_config,
    tokenizer_path=model_path,
    replay_buffer_config=SyncReplayBufferConfig(),
    agent_loop_manager_cfg=agent_loop_manager_cfg,
    eval_agent_loop_manager_cfg=eval_agent_loop_manager_cfg,
    evaluator_config=evaluator_config,
    load_from=model_path,
    total_train_steps=1000,
    train_batch_size=128,
    sync_weights_interval=1,
    enable_evaluate=True,
    evaluate_step=50,
    work_dir=work_dir,
)
```

Common fields:

| Field | Description |
| ----- | ----------- |
| `resources` | A shared group of training / rollout resources used in colocated mode. |
| `train_batch_size` | How many rollout groups are consumed at each training step. |
| `sync_weights_interval` | How many training steps between synchronizing weights to rollout workers. |
| `checkpoint_interval` / `hf_interval` | Saving intervals. When enabled, they must be multiples of `sync_weights_interval`. |
| `enable_evaluate` / `evaluate_step` | Whether to evaluate and the evaluation interval. Evaluation only runs at weight synchronization points. |
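
The multiple-of-`sync_weights_interval` constraint is easy to check before launching a run. The helper below is a
hypothetical pre-flight check, not an xtuner API, and it assumes that a disabled interval is represented as `None`.

```python
def check_intervals(sync_weights_interval, **intervals):
    """Return the names of enabled intervals that are not multiples of
    sync_weights_interval.

    Hypothetical pre-flight check. Assumption: None means "disabled".
    """
    bad = []
    for name, value in intervals.items():
        if value is None:
            continue  # disabled intervals are not constrained
        if value % sync_weights_interval != 0:
            bad.append(name)
    return bad


# checkpoint_interval=10 is not a multiple of 4, so it is reported.
print(check_intervals(4, checkpoint_interval=10, hf_interval=None))  # ['checkpoint_interval']
print(check_intervals(4, checkpoint_interval=20, hf_interval=8))     # []
```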

Common colocated modes:

| Mode | Key configuration | Description |
| --- | --- | --- |
| Strict on-policy | `SyncProduceStrategyConfig()`, `sync_weights_interval=1` | Each step performs rollout, then training, then weight synchronization. |
| Low-frequency synchronization | `SyncProduceStrategyConfig()`, `sync_weights_interval>1` | Reduces weight synchronization overhead. Multiple steps in the same synchronization cycle use the same rollout weights. |
| Colocated stale oversampling | `AsyncProduceStrategyConfig(over_sample_threshold>0, max_staleness>0)` | Allows oversampled data to cross additional synchronization cycles and continue training, improving throughput while weakening the on-policy property. |
| Colocated partial rollout | `AsyncProduceStrategyConfig(over_sample_threshold>0, max_staleness>0, enable_partial_rollout=True)` | Suitable for long responses or tool-chain tasks. |

In colocated asynchronous mode, future samples generated by `over_sample_threshold` and samples produced by
continuing partial rollouts both rely on `max_staleness>0` to relax the expiration threshold if they need to cross
weight synchronization points and continue being used for training.

Colocated mode is suitable when resources are limited and you want a simpler configuration.

## RLDisaggregatedTrainer

Disaggregated training corresponds to the configuration class
{class}`~xtuner.v1.train.rl_trainer.RLDisaggregatedTrainerConfig` and the runtime class
{class}`~xtuner.v1.train.rl_trainer.RLDisaggregatedTrainer`. In logs, `RLDisaggTrainer` sometimes refers to this
disaggregated trainer.

It splits resources into two groups:

- `train_resources`: used by training workers.
- `rollout_resources`: used by rollout workers.

At runtime, the rollout producer continuously writes samples to the replay buffer in the background, while the
trainer consumes batches and trains in the foreground. At weight synchronization points, the trainer first pauses
background production, then saves, synchronizes weights, evaluates, and finally resumes production.

```text
Background: rollout producer -> replay buffer -> rollout producer -> ...
Foreground: get batch -> train -> sync point -> pause producer -> save -> sync weights -> evaluate -> continue producer
```

Training sample production in disaggregated training only uses `AsyncProduceStrategyConfig`. This is because the
training producer and trainer run concurrently, so the strategy needs to retain pending rollouts, respond to pause
signals, and explicitly finish at synchronization points under trainer control. Disaggregated evaluation still uses
`SyncProduceStrategyConfig`: evaluation runs after weight synchronization and before the background producer resumes.
It only needs to generate a fixed eval batch and does not need background oversampling, staleness reuse, or partial
rollout.

```{code-block} python
:caption: Produce strategies for disaggregated training and evaluation
from xtuner.v1.rl.agent_loop_manager import (
    AsyncProduceStrategyConfig,
    SyncProduceStrategyConfig,
    TaskSpecConfig,
)

train_task = TaskSpecConfig(
    task_name="train_task",
    agent_loop_config=train_agent_loop_config,
    judger_config=judger_config,
    produce_strategy_config=AsyncProduceStrategyConfig(
        over_sample_threshold=0.2,
        max_staleness=1,
    ),
    sampler_config=train_sampler_config,
)

eval_task = TaskSpecConfig(
    task_name="eval_task",
    agent_loop_config=eval_agent_loop_config,
    judger_config=judger_config,
    produce_strategy_config=SyncProduceStrategyConfig(),
    sampler_config=eval_sampler_config,
)
```

Configuration example:

```{code-block} python
:caption: Configure a disaggregated Trainer
from xtuner.v1.rl.replay_buffer import AsyncReplayBufferConfig
from xtuner.v1.rl.utils import AcceleratorResourcesConfig
from xtuner.v1.train.rl_trainer import RLDisaggregatedTrainerConfig

train_resources = AcceleratorResourcesConfig(
    accelerator="GPU",
    num_workers=4,
)
rollout_resources = AcceleratorResourcesConfig(
    accelerator="GPU",
    num_workers=4,
)

trainer = RLDisaggregatedTrainerConfig(
    train_resources=train_resources,
    rollout_resources=rollout_resources,
    train_worker_cfg=train_worker_cfg,
    rollout_config=rollout_config,
    tokenizer_path=model_path,
    replay_buffer_config=AsyncReplayBufferConfig(),
    agent_loop_manager_cfg=agent_loop_manager_cfg,
    eval_agent_loop_manager_cfg=eval_agent_loop_manager_cfg,
    evaluator_config=evaluator_config,
    load_from=model_path,
    total_train_steps=1000,
    train_batch_size=128,
    sync_weights_interval=4,
    enable_evaluate=True,
    evaluate_step=20,
    work_dir=work_dir,
)
```

Common modes:

| Mode | Key configuration | Description |
| --- | --- | --- |
| Disaggregated on-policy | `AsyncProduceStrategyConfig(over_sample_threshold=0, max_staleness=0)`, `sync_weights_interval=1` | Synchronizes weights every step. Training and rollout resources are separated, but each side waits for the other at every step, which loses the benefit of disaggregation. Currently not supported. |
| Stream off-policy | `AsyncProduceStrategyConfig(over_sample_threshold=0, max_staleness=0)`, `sync_weights_interval>1` | Reduces synchronization frequency. `max_staleness=0` still allows the natural lag within the current synchronization cycle. |
| Async stale | `AsyncProduceStrategyConfig(over_sample_threshold>0, max_staleness>0)` | Allows the background producer to run moderately ahead and keeps oversampled data usable across additional synchronization cycles. |
| Async partial rollout | `AsyncProduceStrategyConfig(over_sample_threshold>0, enable_partial_rollout=True, max_staleness>0)` | Suitable for long responses or long tool-chain tasks. It uses partial rollout to reduce rerun cost after weight synchronization interruptions. |

When using disaggregated training, note that:

- `train_resources` and `rollout_resources` should be different resource pools.
- The training task's `TaskSpecConfig.produce_strategy_config` should use `AsyncProduceStrategyConfig`; evaluation
  tasks should use `SyncProduceStrategyConfig`.
- When enabled, `evaluate_step`, `checkpoint_interval`, and `hf_interval` all need to be multiples of
  `sync_weights_interval`.
- Evaluation runs before the background producer resumes, avoiding competition between evaluation and background
  rollout for the same rollout resources.
- The current disaggregated weight synchronization entry point is encapsulated by the trainer. Users usually only
  need to configure `sync_weights_interval`.

## How to Choose

| Scenario | Recommendation |
| -------- | -------------- |
| Configuring RL training for the first time | `RLColocateTrainerConfig` + `SyncProduceStrategyConfig()`. |
| Limited GPU count and you want to first get the task running | Colocated mode. |
| Rollout is clearly slower than training and you have extra rollout resources | Disaggregated mode. |
| Need higher rollout throughput | `AsyncProduceStrategyConfig(over_sample_threshold>0)`. |
| Long responses are often interrupted by synchronization | Enable `enable_partial_rollout=True`. |
| Strong on-policy requirement | Keep `sync_weights_interval=1` and use little or no staleness or oversampling. |

Complete configurations can be found in:

- `examples/v1/config/rl_grpo_gsm8k_judge.py`: colocated synchronous GRPO.
- `examples/v1/config/rl_grpo_gsm8k_async.py`: colocated asynchronous production.
- `examples/v1/config/rl_disagg_single.py`: disaggregated single-task training.
- `examples/v1/config/rl_disagg_multi.py`: disaggregated multi-task training.
