
2 posts tagged with "LLM"


Distributed SFT Part 2: Scaling Locally

· 13 min read
Junlin Zhou
Fullstack Engineer @ ZJU ICI

Introduction

In the first part of this series, we covered the basics of setting up a local SFT experiment using trl. We learned how to format datasets for trl's SFTTrainer and preprocess them to fit the required structure.
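As a quick refresher, the core of that setup was only a few lines of Python. The sketch below follows the trl quickstart; the dataset and model names are placeholders, and the exact constructor arguments may differ slightly depending on your trl version.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset and model names -- substitute your own.
dataset = load_dataset("trl-lib/Capybara", split="train")

training_args = SFTConfig(output_dir="./sft-output")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # any causal LM from the Hugging Face Hub
    train_dataset=dataset,
    args=training_args,
)
trainer.train()
```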

Now, it's time to take the next step. In this post, we'll focus on scaling the SFT setup to handle larger tasks. Specifically, we'll explore how to fine-tune an LLM in a single-node, multi-GPU environment. Along the way, we'll discuss optimization techniques to reduce memory usage, speed up training, and enable fine-tuning of even larger models. Let's get started!
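To give a flavour of the knobs we'll be turning, here is a hedged sketch of an SFTConfig with a few common memory-saving options. The specific values are illustrative only, not recommendations, and should be tuned to your hardware and trl version.

```python
from trl import SFTConfig

# Illustrative settings only -- tune for your hardware and trl version.
training_args = SFTConfig(
    output_dir="./sft-output",
    per_device_train_batch_size=1,   # keep the per-GPU batch small to fit memory
    gradient_accumulation_steps=8,   # recover a larger effective batch size
    gradient_checkpointing=True,     # recompute activations to save memory
    bf16=True,                       # mixed precision on GPUs that support bfloat16
)
```

The same training script can then be launched across all local GPUs with, for example, `accelerate launch --num_processes 4 train.py` (the script name and process count here are placeholders).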

Distributed SFT Part 1: Starting Locally

· 8 min read
Junlin Zhou
Fullstack Engineer @ ZJU ICI

Introduction

Welcome to this series of articles documenting the lessons I learned during my first attempt at running distributed supervised fine-tuning (SFT) tasks using trl and DeepSpeed.

This series will walk you through my journey, starting with a simple local experiment and progressively scaling up to a distributed environment. The three parts of this series are:

  • Part 1: The Local Experiment -- I will show you how I ran my very first local SFT experiment, following the official trl documentation.

  • Part 2: Multi-GPU -- We will leverage single-machine, multi-GPU parallel training to complete a full SFT task in our local environment.

  • Part 3: Multi-Machine -- We'll take things a step further by submitting the same training task to a Kubernetes cluster, utilizing multi-machine, multi-GPU training with Kubeflow's Training Operator.

A quick note about me: I'm a software development engineer who is fairly new to the field of deep learning. If these articles seem too basic for you, I appreciate your patience as I navigate this learning journey.