Blog | Edwardzjl

Distributed SFT Part 2: Scaling Locally

February 7, 2025 · 13 min read

Fullstack Engineer @ ZJU ICI

Introduction

In the first part of this series, we covered the basics of setting up a local SFT experiment using trl. We learned how to format datasets for trl's SFTTrainer and preprocess them to fit the required structure.

Now, it's time to take the next step. In this post, we'll focus on scaling the SFT setup to handle larger tasks. Specifically, we'll explore how to fine-tune an LLM in a single-node, multi-GPU environment. Along the way, we'll discuss optimization techniques to reduce memory usage, speed up training, and enable fine-tuning of even larger models. Let's get started!

Distributed SFT Part 1: Starting Locally

January 23, 2025 · 8 min read

Junlin Zhou

Fullstack Engineer @ ZJU ICI

Introduction

Welcome to this series of articles documenting the lessons I learned during my first attempt at running distributed supervised fine-tuning (SFT) tasks using trl and DeepSpeed.

This series will walk you through my journey, starting with a simple local experiment and progressively scaling up to a distributed environment. The three parts of this series are:

Part 1: The Local Experiment -- I will show you how I ran my very first local SFT experiment, following the official trl documentation.
Part 2: Multi GPU -- We will leverage single-machine, multi-GPU parallel training to complete a full SFT task in our local environment.
Part 3: Multi Machine -- We'll take things a step further by submitting the same training task to a Kubernetes cluster, utilizing multi-machine, multi-GPU training with Kubeflow's Training Operator.

A quick note about myself: I'm a software development engineer who is fairly new to the field of deep learning. If these articles seem too basic for you, I appreciate your patience as I navigate this learning journey.

[译] JSON格式作为配置文件的缺点

August 9, 2019 · 5 min read

Junlin Zhou

Fullstack Engineer @ ZJU ICI

翻译自[这篇文章][1]

我最近接触到许多项目将 JSON 用作配置文件。我认为这不是一个好主意。

JSON 从设计之初就不是用于做配置文件的，这也不是它擅长的领域。JSON 的目标是 "轻量级数据交换格式", 同时具有 "易于人类读写", "易于代码解析和生成" 的特点。它在对 "人类而言的便利性" 和 "对机器而言的便利性" 之间取得了较好的平衡, 在许多应用场景下都是比 XML 更好的替代方案。

然而，将 JSON 用于其他目的有点类似于说 "嘿，这把锤子非常适合钉钉子！我喜欢它！为什么不用它来拧螺丝！" 当然它不是完全不能用，只是不合适做这样的工作。

系统中状态为 static 的服务

July 4, 2019 · One min read

Junlin Zhou

Fullstack Engineer @ ZJU ICI

最近开始接触 Linux 运维的工作，第一件事情就是看看系统中跑了多少服务。

[译] javax.persistence.Id 和 org.springframework.data.annotation.Id 的区别

June 27, 2019 · One min read

Junlin Zhou

Fullstack Engineer @ ZJU ICI

org.springframework.data.annotation.Id

org.springframework.data.annotation.Id 是 Spring 定义的 annotation，用来支持 "没有像 JPA 那样的持久化 API" 的非关系型数据库或是框架的持久化，因此它常被用于其它 spring-data 项目，例如 spring-data-mongodb 和 spring-data-solr 等。

javax.persistence.Id

javax.persistence.Id 是由 JPA 定义的 annotation，JPA 仅适用于关系数据的管理。

Install postgres on OSX

April 13, 2019 · One min read

Junlin Zhou

Fullstack Engineer @ ZJU ICI

If you installed Postgres from homebrew, the default user postgres isn't automatically created, you need to run following command in your terminal:

Ubuntu / CentOS 编译安装带剪贴板支持的 Vim

March 14, 2019 · 2 min read

Junlin Zhou

Fullstack Engineer @ ZJU ICI

通过 Ubuntu 或 CentOS 系统自带的软件源安装 Vim，往往只能得到较旧的版本（通常是 7.4.x）。而从 Vim 8.0 开始，官网推荐的安装方式是通过 Git 克隆源码自行编译。

不过需要注意，默认编译出来的 Vim 并不包含剪贴板支持（clipboard support），因此无法与系统剪贴板交互（例如复制粘贴到其他程序）。

Introduction​

Introduction​

org.springframework.data.annotation.Id​

javax.persistence.Id​

Introduction

Introduction

org.springframework.data.annotation.Id

javax.persistence.Id