使用开源数据集训练#

HuggingFace Hub 中有众多优秀的开源数据，本节将以 timdettmers/openassistant-guanaco 开源指令微调数据集为例，讲解如何开始训练。为便于介绍，本节以 internlm2_chat_7b_qlora_oasst1_e3 配置文件为基础进行讲解。

适配开源数据集#

不同的开源数据集有不同的数据「载入方式」和「字段格式」，因此我们需要针对所使用的开源数据集进行一些适配。

针对载入方式，XTuner 使用上游库 datasets 的统一载入接口 load_dataset。

data_path = 'timdettmers/openassistant-guanaco'
train_dataset = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path=data_path),
    ...)

一般来说，若想要使用不同的开源数据集，用户只需修改 dataset=dict(type=load_dataset, path=data_path) 中的 path 参数即可。

针对字段格式，XTuner 基于内部的 map_fn 逻辑确保数据格式的一致。

from xtuner.dataset.map_fns import oasst1_map_fn
train_dataset = dict(
    type=process_hf_dataset,
    ...
    dataset_map_fn=oasst1_map_fn,
    ...)

XTuner 内置了众多 map_fn （这里），可以满足大多数开源数据集的需要。此处我们罗列一些常用 map_fn 及其对应的原始字段和参考数据集：

map_fn	Columns	Reference Datasets
alpaca_map_fn	[‘instruction’, ‘input’, ‘output’, …]	tatsu-lab/alpaca
alpaca_zh_map_fn	[‘instruction_zh’, ‘input_zh’, ‘output_zh’, …]	silk-road/alpaca-data-gpt4-chinese
oasst1_map_fn	[‘text’, …]	timdettmers/openassistant-guanaco
openai_map_fn	[‘messages’, …]	DavidLanz/fine_tuning_datraset_4_openai
code_alpaca_map_fn	[‘prompt’, ‘completion’, …]	HuggingFaceH4/CodeAlpaca_20K
medical_map_fn	[‘instruction’, ‘input’, ‘output’, …]	shibing624/medical
tiny_codes_map_fn	[‘prompt’, ‘response’, …]	nampdn-ai/tiny-codes
default_map_fn	[‘input’, ‘output’, …]	/

例如，针对 timdettmers/openassistant-guanaco 数据集，XTuner 内置了 oasst1_map_fn，以对其进行字段格式统一。具体实现如下：

def oasst1_map_fn(example):
    r"""Example before preprocessing:
        example['text'] = ('### Human: Can you explain xxx'
                           '### Assistant: Sure! xxx'
                           '### Human: I didn't understand how xxx'
                           '### Assistant: It has to do with a process xxx.')

    Example after preprocessing:
        example['conversation'] = [
            {
                'input': 'Can you explain xxx',
                'output': 'Sure! xxx'
            },
            {
                'input': 'I didn't understand how xxx',
                'output': 'It has to do with a process xxx.'
            }
        ]
    """
    data = []
    for sentence in example['text'].strip().split('###'):
        sentence = sentence.strip()
        if sentence[:6] == 'Human:':
            data.append(sentence[6:].strip())
        elif sentence[:10] == 'Assistant:':
            data.append(sentence[10:].strip())
    if len(data) % 2:
        # The last round of conversation solely consists of input
        # without any output.
        # Discard the input part of the last round, as this part is ignored in
        # the loss calculation.
        data.pop()
    conversation = []
    for i in range(0, len(data), 2):
        single_turn_conversation = {'input': data[i], 'output': data[i + 1]}
        conversation.append(single_turn_conversation)
    return {'conversation': conversation}

通过代码可以看到，oasst1_map_fn 对原数据中的 text 字段进行处理，进而构造了一个 conversation 字段，以此确保了后续数据处理流程的统一。

值得注意的是，如果部分开源数据集依赖特殊的 map_fn，则需要用户自行参照以提供的 map_fn 进行自定义开发，实现字段格式的对齐。

训练#

用户可以使用 xtuner train 启动训练。假设所使用的配置文件路径为 ./config.py，并使用 DeepSpeed ZeRO-2 优化。

单机单卡

xtuner train ./config.py --deepspeed deepspeed_zero2

单机多卡

NPROC_PER_NODE=${GPU_NUM} xtuner train ./config.py --deepspeed deepspeed_zero2

多机多卡（以 2 * 8 GPUs 为例）

方法 1：torchrun

# excuete on node 0
NPROC_PER_NODE=8 NNODES=2 PORT=$PORT ADDR=$NODE_0_ADDR NODE_RANK=0 xtuner train mixtral_8x7b_instruct_full_oasst1_e3 --deepspeed deepspeed_zero2

# excuete on node 1
NPROC_PER_NODE=8 NNODES=2 PORT=$PORT ADDR=$NODE_0_ADDR NODE_RANK=1 xtuner train mixtral_8x7b_instruct_full_oasst1_e3 --deepspeed deepspeed_zero2

注意：$PORT 表示通信端口、$NODE_0_ADDR 表示 node 0 的 IP 地址。

方法 2：slurm

srun -p $PARTITION --nodes=2 --gres=gpu:8 --ntasks-per-node=8 xtuner train internlm2_chat_7b_qlora_oasst1_e3 --launcher slurm --deepspeed deepspeed_zero2

模型转换#

模型训练后会自动保存成 PTH 模型（例如 iter_500.pth），我们需要利用 xtuner convert pth_to_hf 将其转换为 HuggingFace 模型，以便于后续使用。具体命令为：

xtuner convert pth_to_hf ${CONFIG_NAME_OR_PATH} ${PTH} ${SAVE_PATH}
# 例如：xtuner convert pth_to_hf ./config.py ./iter_500.pth ./iter_500_hf

模型合并（可选）#

如果您使用了 LoRA / QLoRA 微调，则模型转换后将得到 adapter 参数，而并不包含原 LLM 参数。如果您期望获得合并后的模型权重，那么可以利用 xtuner convert merge ：

xtuner convert merge ${LLM} ${ADAPTER_PATH} ${SAVE_PATH}
# 例如：xtuner convert merge internlm/internlm2-chat-7b ./iter_500_hf ./iter_500_merged_llm

对话#

用户可以利用 xtuner chat 实现与微调后的模型对话：

xtuner chat ${NAME_OR_PATH_TO_LLM} --adapter ${NAME_OR_PATH_TO_ADAPTER} --prompt-template ${PROMPT_TEMPLATE} [optional arguments]
# 例如：
# xtuner chat internlm2/internlm2-chat-7b --adapter ./iter_500_hf --prompt-template internlm2_chat
# xtuner chat ./iter_500_merged_llm --prompt-template internlm2_chat

使用开源数据集训练

目录

使用开源数据集训练#

适配开源数据集#

训练#

模型转换#

模型合并（可选）#

对话#