PatrickStar分布式深度学习训练工具
PatrickStar 是一款腾讯开发的分布式深度学习训练工具,它的设计目标是支持以 GPT、Bert 为代表的超大预训练模型训练。
用法
PatrickStar 基于 PyTorch,这使得迁移 pytorch 项目变得容易。以下是 PatrickStar 的示例:
from patrickstar.runtime import initialize_engine
config = {
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.001,
"betas": (0.9, 0.999),
"eps": 1e-6,
"weight_decay": 0,
"use_hybrid_adam": True,
},
},
"fp16": { # loss scaler params
"enabled": True,
"loss_scale": 0,
"initial_scale_power": 2 ** 3,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1,
},
"default_chunk_size": 64 * 1024 * 1024,
"release_after_init": True,
"use_cpu_embedding": False,
}
def model_func():
# MyModel is a derived class for torch.nn.Module
return MyModel(...)
model, optimizer = initialize_engine(model_func=model_func, local_rank=0, config=config)
...
for data in dataloader:
optimizer.zero_grad()
loss = model(data)
model.backward(loss)
optimizer.step()
使用与 DeepSpeed 配置 JSON 相同的config格式,主要包括优化器、损失缩放器和一些 PatrickStar 特定配置的参数。
引用我们
@article{fang2021patrickstar,
title={PatrickStar: Parallel Training of Pre-trained Models via a Chunk-based Memory Management},
author={Fang, Jiarui and Yu, Yang and Zhu, Zilin and Li, Shenggui and You, Yang and Zhou, Jie},
journal={arXiv preprint arXiv:2108.05818},
year={2021}
}评论
