
Domino Blueprints

On-demand distributed clusters: API-driven scaling

Author

Sameer Wadkar
Principal Solution Architect

Article topics

FinOps, MLOps, Spark, Ray, Dask

Intended audience

Data scientists

Overview and goals

When a Domino workspace starts, you can attach a Spark, Ray, or Dask cluster and configure autoscaling with minimum and maximum node values. This works well for many scenarios, but certain workloads surface challenges that are difficult to capture with generic heuristics.

For example:

  • CPU-only signals: The Horizontal Pod Autoscaler (HPA) reacts to CPU utilization. For data science and ML workloads that are often I/O-bound or GPU-heavy, CPU is not always the best leading indicator of resource pressure.
  • Threshold sensitivity: The default trigger of ~80% CPU may be too high for interactive or latency-sensitive tasks, where waiting for utilization to climb can slow down workflows.
  • Stepwise scaling: Clusters typically grow or shrink one node at a time per minute. For bursty or large-scale jobs, this gradual approach can delay the resources actually needed.

These behaviors reflect the trade-offs of heuristic-based scaling. The challenge is that AI/ML workloads often benefit from more direct, intentional control over cluster elasticity.

This blueprint introduces an API-driven scaling model. Instead of relying solely on heuristics, you can call an API to right-size a cluster at the exact moment your workflow demands it. The goals are:

  • Minimize idle resources by scaling only when needed.
  • Speed up responsiveness by bypassing slow, utilization-based triggers.
  • Align scaling to workflow context, not just system heuristics.

By the end, you’ll see how API-driven scaling complements Domino’s autoscaling capabilities, giving teams the precision and flexibility needed to handle ML/AI workloads efficiently.

When should you consider using this approach?

This API-driven scaling model is most useful when workloads don’t continuously need large clusters, but still benefit from having them on demand. A few common cases:

  • Long-running workspaces with intermittent heavy compute

You may have a workspace that runs for hours or days, where GPU workers or large CPU clusters are only needed for certain phases. With API-based scaling, you decide exactly when to grow or shrink the cluster instead of keeping resources idle.

  • Jobs with mixed input sizes

In a single workspace session you may need to run multiple jobs with varying loads (data volume or number of tasks). Rather than provisioning a cluster at its maximum size upfront and letting it sit underutilized between heavy jobs, you can dynamically resize the cluster to match what you need at the moment.

  • Jobs with mixed compute patterns

Many Domino jobs combine stages that need distributed compute (Spark, Ray, Dask) with stages that don’t. Rather than provisioning a cluster upfront and letting it sit unused, you can:

  1. Start with a zero-node cluster.
  2. Run your non-cluster work.
  3. Scale the cluster up when distributed compute is required.
  4. Scale it back down once the cluster phase is complete.

In all of these scenarios, the benefit is precision: you match cluster size to workload demand at each step, minimizing idle time and reducing cost without compromising performance.

Why API-based scaling works

Imagine you’re a data scientist working on one of these tasks:

  • Hyper-parameter tuning on GPU-backed models
  • Fine-tuning an LLM with DeepSpeed managed by Ray
  • Running SQL queries on massive datasets in Spark
  • Launching a large multi-worker operation on a Ray cluster

You start your day by spinning up a Domino workspace with a cluster attached. Most of the time, you’re not running a full-scale job. You’re debugging, iterating on smaller datasets, or exploring a narrower parameter space. The big, expensive cluster is only required when you’re ready to validate or run end-to-end tests.

In this scenario, the ideal behavior is clear:

  • Zero workers when the cluster isn’t needed
  • Right-sized scaling when moving from development to full test or production runs

With API-based scaling, you get that control. Instead of relying on utilization thresholds, you decide exactly when to grow or shrink the cluster. The result is the best of both worlds: access to Domino’s distributed compute capabilities and the ability to manage costs with precision.

Using API-based distributed compute scaling in Domino

To get started, you’ll need to install the ddl-cluster-scaler service in your Domino Kubernetes cluster. This is done via a simple Helm install; follow the installation instructions provided in this document.

Once installed, you’re ready to scale clusters programmatically. Domino is a code-first platform, so this capability is exposed as native API endpoints. For convenience, we also provide a lightweight Python client that wraps those endpoints and handles the repetitive details.

We recommend copying the client file into your Domino project (for example, under a client package) so you can import and use it directly. In the examples that follow, we’ll illustrate the workflow through a Jupyter notebook.

We will now illustrate the workflow for a data scientist using distributed compute clusters from a workspace.

1. The first step is to start a cluster-attached workspace. To minimize costs during development:

  • Set the minimum number of workers to 1 (enough to keep the cluster active without overprovisioning).
  • Set the maximum number of workers to the largest cluster size you anticipate needing during the session.
  • Choose the smallest available hardware tier for the workers.

For example, in my Domino installation the smallest tier is named “Small” (1 core, 1 worker). Starting with this baseline keeps the cluster lightweight and inexpensive until you decide to scale it up through the API.

We assume that the Python client described earlier is included in your codebase under the package name client. If you’re working in another language, you can call the REST API directly or build a lightweight client in the language of your choice.

Start your cluster-attached workspace

2. Now you are ready to use API-based cluster scaling. The diagram below illustrates the flow.

Cluster scaling

The process for scaling the cluster is illustrated in the notebook.

a. Scale the cluster up — The user invokes the scaling endpoint (via the Python client) and provides:

  • The desired number of worker replicas (≤ the maximum configured for the cluster)
  • The hardware tier for those replicas
  • The hardware tier of the head node

This is the key advantage of API-based scaling:

  • You avoid the high cost of starting an entire cluster with expensive hardware (for example, GPU tiers).
  • You also avoid the waste of running even a single idle worker on an expensive tier.
  • Instead, you start with a minimal, inexpensive cluster and only scale into costly tiers when your workload truly needs it.
  • You can also resize the head node to match the scale of the cluster. A larger cluster may need a more capable head node to coordinate effectively, but when the cluster is idle or reduced, the head node can be scaled back to the smallest tier.

Example code:

from client import ddl_cluster_scaling_client

# One of the three cluster kinds: "rayclusters", "sparkclusters", or "daskclusters"
cluster_kind = "rayclusters"

# Request three workers on the "Medium" hardware tier
j = ddl_cluster_scaling_client.scale_cluster(cluster_kind=cluster_kind,
                                             worker_hw_tier_name="Medium",
                                             replicas=3)
scale_start_ts = j['started_at']

# Block until the requested workers are ready
ddl_cluster_scaling_client.wait_until_scaling_complete(cluster_kind=cluster_kind,
                                                       scale_start_ts=scale_start_ts)

b. Restarting the head node (optional) — After scaling the cluster with the API, the cluster is fully functional, but occasionally the cluster admin UI may not update correctly to show the new number of workers.

As a hygiene step, you can restart the head node to refresh the UI state. This ensures the cluster dashboard reflects the correct number of active workers, even though the underlying cluster is already operating as expected.

Example code:

from client import ddl_cluster_scaling_client

# One of the three cluster kinds: "rayclusters", "sparkclusters", or "daskclusters"
cluster_kind = "rayclusters"

# Restart the head node on the "Medium" hardware tier to refresh the UI state
j = ddl_cluster_scaling_client.restart_head_node(cluster_kind=cluster_kind,
                                                 head_hw_tier_name="Medium")
restarted_at = j['started_at']

# Block until the head node restart completes
ddl_cluster_scaling_client.wait_until_head_restart_complete(cluster_kind=cluster_kind,
                                                            restart_ts=restarted_at)

c. Using the cluster — Once the cluster is scaled to the desired size and hardware tier, the data scientist can proceed with their workload. Typical tasks include:

  • Running hyperparameter tuning experiments on GPU-backed models
  • Performing LLM fine-tuning with DeepSpeed managed by Ray
  • Executing large-scale SQL queries on a Spark cluster

At this point, the cluster behaves exactly like any other Domino-managed distributed compute environment — the difference is that you’ve sized it precisely to the task at hand.
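
For example, here is a minimal sketch of connecting to the attached Ray cluster from workspace code. It assumes Domino’s usual RAY_HEAD_SERVICE_HOST and RAY_HEAD_SERVICE_PORT environment variables are present in the workspace; check your environment if the variable names differ.

import os
import ray

# Connect to the attached Ray cluster (assumes Domino exposes the head node
# via these environment variables in cluster-attached workspaces).
ray_host = os.environ["RAY_HEAD_SERVICE_HOST"]
ray_port = os.environ["RAY_HEAD_SERVICE_PORT"]
ray.init(f"ray://{ray_host}:{ray_port}")

@ray.remote
def score(batch_id: int) -> int:
    # Placeholder task; replace with your tuning, fine-tuning, or query logic
    return batch_id * batch_id

print(ray.get([score.remote(i) for i in range(8)]))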

d. Scale the cluster down — When the heavy workload completes, you can reduce the cluster back to its baseline configuration. This avoids leaving expensive hardware tiers or multiple workers running idle.

A recommended pattern is to scale the cluster down to 1 replica on the “Small” hardware tier. This keeps the cluster alive while minimizing cost.
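
A minimal sketch of the scale-down call, reusing the same scale_cluster and wait_until_scaling_complete client functions shown earlier; “Small” is the example tier name from this installation, so substitute the smallest tier available in yours.

from client import ddl_cluster_scaling_client

cluster_kind = "rayclusters"  # or "sparkclusters" / "daskclusters"

# Return to the baseline: a single worker on the smallest hardware tier
j = ddl_cluster_scaling_client.scale_cluster(cluster_kind=cluster_kind,
                                             worker_hw_tier_name="Small",
                                             replicas=1)
ddl_cluster_scaling_client.wait_until_scaling_complete(cluster_kind=cluster_kind,
                                                       scale_start_ts=j['started_at'])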

e. Restarting the head node again (optional) — Just as with scaling up, the cluster admin UI may occasionally show an incorrect worker count after scaling down. The cluster itself is fully functional, but the UI can lag behind the actual state.

Restarting the head node at this point is a good hygiene step to ensure the UI reflects the reduced number of workers. This keeps the admin dashboard consistent with the actual cluster configuration.
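
As before, this is a sketch built on the restart_head_node call from step b; moving the head node back to the “Small” tier follows the baseline recommendation above, and the tier name is installation-specific.

from client import ddl_cluster_scaling_client

cluster_kind = "rayclusters"  # or "sparkclusters" / "daskclusters"

# Restart the head node on the smallest tier and refresh the UI state
j = ddl_cluster_scaling_client.restart_head_node(cluster_kind=cluster_kind,
                                                 head_hw_tier_name="Small")
ddl_cluster_scaling_client.wait_until_head_restart_complete(cluster_kind=cluster_kind,
                                                            restart_ts=j['started_at'])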

f. Perform non-cluster work in your workspace — After scaling the cluster down, you can continue working in your Domino workspace without relying on distributed compute. Typical tasks include:

  • Exploratory data analysis on smaller datasets
  • Preprocessing or feature engineering that fits on a single machine
  • Debugging, unit testing, or developing code before scaling up again

Running these steps on the baseline workspace (with a single, low-cost worker) keeps costs minimal while still giving you a seamless environment. When you’re ready to run the next distributed phase, you can scale the cluster back up on demand.

Domino Jobs often run under service accounts and are triggered by external automation tools. In these cases, the lifecycle of the attached cluster (Spark, Ray, or Dask) matches the lifecycle of the job itself: when the job ends, the cluster is automatically terminated.

For many workloads, this behavior is sufficient: if the job uses the cluster for its entire runtime, there’s no need for additional control.

However, some jobs only require distributed compute for a fraction of their lifecycle. In those cases, API-based scaling provides a way to optimize cost and performance (see the sketch after this list):

  • Start the job with a one-node cluster, using the smallest available hardware tier for both the head node and the worker node.
  • Scale up the cluster (with the required hardware tier) only when distributed compute is needed.
  • Scale it back down when returning to single-node work.
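
A minimal sketch of this pattern inside a single job script, reusing the scale_cluster and wait_until_scaling_complete calls shown earlier; the hardware tier names, replica count, and placeholder stages are illustrative.

from client import ddl_cluster_scaling_client

cluster_kind = "rayclusters"  # or "sparkclusters" / "daskclusters"

def scale(worker_hw_tier_name, replicas):
    # Illustrative helper: request a resize and wait until it completes
    j = ddl_cluster_scaling_client.scale_cluster(cluster_kind=cluster_kind,
                                                 worker_hw_tier_name=worker_hw_tier_name,
                                                 replicas=replicas)
    ddl_cluster_scaling_client.wait_until_scaling_complete(cluster_kind=cluster_kind,
                                                           scale_start_ts=j['started_at'])

# Phase 1: single-node work runs on the baseline one-node cluster.
# ... preprocessing code here ...

# Phase 2: scale up only for the distributed stage.
scale(worker_hw_tier_name="Medium", replicas=8)
# ... distributed training / Spark SQL / Ray tasks here ...

# Phase 3: scale back down before the remaining single-node work.
scale(worker_hw_tier_name="Small", replicas=1)
# ... post-processing and reporting here ...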

The alternative approach is to restructure the workload using Domino Flows. In this approach you break the larger job into smaller sub-jobs, with the cluster-dependent stages isolated. API-based scaling offers a simpler option: fine-grained control within a single job, without refactoring into multiple flows.

Choosing between API-based scaling and Domino Flows depends on the shape of your workload and how much refactoring you’re willing to do. The table below highlights the trade-offs so you can quickly decide which approach best fits your use case.

Scenario | API-based scaling | Domino Flows
Job uses the cluster for its entire lifecycle | Not needed | Not applicable
Job uses the cluster only for part of its lifecycle | Scale cluster up/down within the same job | Requires restructuring into multiple jobs
You want minimal code changes | Call scaling APIs inline with your job logic | Requires restructuring into multiple jobs
Workload already fits a pipeline pattern | Possible but not necessary | Natural fit: each stage runs as its own Flow step
Primary goal is cost optimization without job refactoring | Best fit | Adds complexity

Check out the GitHub repo

Sameer Wadkar

Principal solution architect


I work closely with enterprise customers to deeply understand their environments and enable successful adoption of the Domino platform. I've designed and delivered solutions that address real-world challenges, with several becoming part of the core product. My focus is on scalable infrastructure for LLM inference, distributed training, and secure cloud-to-edge deployments, bridging advanced machine learning with operational needs.