Developers Guide

Code Organization

In this section, we describe the structure of the codebase.

Representations

In saturn/core, we define the following templates and structures. In executors/Technique.py, we provide the abstract skeleton for implementing parallelism techniques. In representations/Strategy.py, we define a data class that aggregates per-model decisions (e.g., how many GPUs to use and which technique to apply). In representations/Task.py, we provide a data class that collects data on user-submitted models.
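To make the relationship between these three pieces concrete, here is a minimal sketch of how an abstract technique, a strategy record, and a task record could fit together. Every class, method, and field name below (BaseTechnique, search, execute, Strategy, Task, record) is an assumption made for illustration; consult the files named above for the actual definitions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


class BaseTechnique(ABC):
    """Hypothetical abstract skeleton for a parallelism technique."""

    @abstractmethod
    def search(self, task: "Task", gpus: List[int]) -> Optional[float]:
        """Return an estimated runtime in seconds, or None if infeasible."""

    @abstractmethod
    def execute(self, task: "Task", gpus: List[int]) -> None:
        """Train the task's model on the given GPUs."""


@dataclass
class Strategy:
    """Hypothetical record of one per-model decision."""
    technique_name: str
    gpu_count: int
    estimated_runtime_s: float


@dataclass
class Task:
    """Hypothetical record of a user-submitted model."""
    name: str
    build_model: Callable[[], Any]
    # Best known strategy for each GPU count.
    strategies: Dict[int, Strategy] = field(default_factory=dict)

    def record(self, strategy: Strategy) -> None:
        """Keep only the fastest strategy seen for each GPU count."""
        best = self.strategies.get(strategy.gpu_count)
        if best is None or strategy.estimated_runtime_s < best.estimated_runtime_s:
            self.strategies[strategy.gpu_count] = strategy
```

Keeping Strategy separate from Task means a solver can compare candidate decisions without touching the model itself.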

Library

In saturn/library/library.py, we define the basic functions for registering and accessing parallelism techniques. These functions are exposed to the user as part of our API. We welcome contributions toward a “default” library that implements popular parallelisms (e.g., FSDP, GPipe, etc.).
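As an illustration of what registering and accessing techniques involves, the following is a minimal in-memory registry sketch. The function names (register, retrieve, list_registered) and their signatures are assumptions, not the functions exported by saturn/library/library.py.

```python
from typing import Dict, List

# Hypothetical in-memory store mapping a technique name to its class.
_REGISTRY: Dict[str, type] = {}


def register(name: str, technique_cls: type, overwrite: bool = False) -> None:
    """Add a technique under `name`; refuse silent replacement by default."""
    if name in _REGISTRY and not overwrite:
        raise KeyError(f"technique {name!r} is already registered")
    _REGISTRY[name] = technique_cls


def retrieve(name: str) -> type:
    """Look up a previously registered technique."""
    try:
        return _REGISTRY[name]
    except KeyError:
        raise KeyError(f"technique {name!r} is not registered") from None


def list_registered() -> List[str]:
    """Return the registered technique names in sorted order."""
    return sorted(_REGISTRY)
```

Rejecting duplicate names by default is one possible design choice; it guards against a contributed technique silently shadowing an existing one.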

Solver

In saturn/solver/milp.py, we define the MILP for our joint optimization problem of parallelism selection, resource apportioning, and scheduling. Alternative solvers (e.g., an RL-based one) must follow the same specification (i.e., accept the same inputs and produce the same outputs).
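To show the shape of the joint problem, the sketch below swaps the MILP for a toy stand-in: it enumerates every per-task choice of technique and GPU count, schedules each combination greedily, and keeps the plan with the smallest makespan. It is not the MILP formulation, it does not scale beyond a handful of tasks, and the Option, Assignment, and solve names are hypothetical rather than the specification used by saturn/solver/milp.py.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List


@dataclass(frozen=True)
class Option:
    """Hypothetical candidate: a technique, a GPU count, a runtime estimate."""
    technique: str
    gpus: int
    runtime: float


@dataclass(frozen=True)
class Assignment:
    """Hypothetical output row: the chosen option and its start time."""
    task: str
    option: Option
    start: float


def _schedule(choice: Dict[str, Option], total_gpus: int) -> List[Assignment]:
    """Greedy list scheduling, longest task first."""
    free_at = [0.0] * total_gpus  # time at which each GPU becomes free
    plan = []
    for task, opt in sorted(choice.items(), key=lambda kv: -kv[1].runtime):
        free_at.sort()
        start = free_at[opt.gpus - 1]  # earliest time opt.gpus GPUs are free
        for i in range(opt.gpus):
            free_at[i] = start + opt.runtime
        plan.append(Assignment(task, opt, start))
    return plan


def solve(options: Dict[str, List[Option]], total_gpus: int) -> List[Assignment]:
    """Enumerate per-task choices and return the lowest-makespan plan found."""
    tasks = list(options)
    best_plan, best_makespan = None, float("inf")
    for combo in product(*(options[t] for t in tasks)):
        if any(o.gpus < 1 or o.gpus > total_gpus for o in combo):
            continue  # this combination cannot fit on the cluster
        plan = _schedule(dict(zip(tasks, combo)), total_gpus)
        makespan = max((a.start + a.option.runtime for a in plan), default=0.0)
        if makespan < best_makespan:
            best_plan, best_makespan = plan, makespan
    if best_plan is None:
        raise ValueError("no feasible plan for the given GPU budget")
    return best_plan
```

The point of the sketch is the contract: per-task candidates and a GPU budget go in, and a per-task choice plus a start time comes out. Any replacement solver would need to honor whatever the real contract is.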

Trial Runner

In saturn/trial_runner/PerformanceEvaluator.py, we define the empirical profiler system.
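Empirical profiling generally means timing a small number of real training steps and extrapolating to the full job. The sketch below illustrates that idea only; the estimate_runtime function, its parameters, and the warmup-then-measure procedure are assumptions and do not describe how PerformanceEvaluator.py is implemented.

```python
import time
from statistics import mean
from typing import Callable


def estimate_runtime(
    step: Callable[[], None],
    total_batches: int,
    warmup: int = 2,
    measured: int = 5,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Estimate total runtime in seconds from a few timed steps.

    `step` is a hypothetical callable that runs one training batch.
    """
    if measured < 1:
        raise ValueError("measured must be at least 1")
    # Warmup steps absorb one-time costs (allocation, compilation, caching).
    for _ in range(warmup):
        step()
    samples = []
    for _ in range(measured):
        t0 = clock()
        step()
        samples.append(clock() - t0)
    # Extrapolate the mean per-batch time to the whole job.
    return mean(samples) * total_batches
```

Passing the clock as a parameter keeps the function easy to test with a fake timer. Note that when timing GPU work, the step callable would need to synchronize the device before returning, or the measured times will understate the real cost.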

Executor

In saturn/executor/executor.py, we define the execution engine and introspection scheme.

Tests & Examples

In the examples folder, we provide example training pipelines built with Saturn. If you contribute new parallelisms, you must register them with the library before use. You can model your tests after our example WikiText job.