Releases · dstackai/dstack-enterprise

07 Aug 19:36

jvstme

0.19.22-v1

89fee21

0.19.22-v1 Latest


Services

Probes

You can now configure HTTP probes to check the health of your service.

type: service
name: my-service
port: 80
image: my-app:latest
probes:
- type: http
  url: /health
  interval: 15s

Probe statuses are displayed in dstack ps --verbose and are considered during rolling deployments. This enables you to deploy new versions of your service with zero downtime.

> dstack ps --verbose

 NAME                            BACKEND          STATUS   PROBES  SUBMITTED
 my-service deployment=1                          running          11 mins ago
   replica=0 job=0 deployment=0  aws (us-west-2)  running  ✓       11 mins ago
   replica=1 job=0 deployment=1  aws (us-west-2)  running  ×       1 min ago

Learn more about probes in the docs.
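
To illustrate how probe statuses could gate a rolling deployment, here is a minimal Python sketch. It is not dstack's actual scheduling logic: the `Replica` class and the `replicas_safe_to_stop` function are hypothetical names, and the rule used (an old replica may be stopped only when a new-deployment replica with a passing probe can take over) is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical view of one service replica (not a dstack type)."""
    replica: int
    deployment: int
    probe_ok: bool


def replicas_safe_to_stop(replicas: list[Replica], target_deployment: int) -> list[int]:
    """Return old replicas that can be stopped without dropping capacity.

    Assumed rule: each replica of the target deployment whose probe passes
    allows one replica of an older deployment to be stopped.
    """
    ready_new = sum(
        1 for r in replicas if r.deployment == target_deployment and r.probe_ok
    )
    old = [r.replica for r in replicas if r.deployment < target_deployment]
    return old[:ready_new]


# State from the `dstack ps --verbose` output above: replica 1 is not ready yet.
state = [Replica(0, 0, True), Replica(1, 1, False)]
print(replicas_safe_to_stop(state, target_deployment=1))  # []

# Once the probe of replica 1 succeeds, replica 0 can be retired.
state[1].probe_ok = True
print(replicas_safe_to_stop(state, target_deployment=1))  # [0]
```

Under this rule the service keeps at least one ready replica at every step, which is the property that makes zero-downtime deployments possible.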

Accelerators

NVIDIA GPU health checks

dstack now monitors NVIDIA GPU health using DCGM background health checks:

> dstack fleet

 FLEET     INSTANCE  BACKEND          RESOURCES  PRICE   STATUS          CREATED
 my-fleet  0         aws (us-east-1)  T4:16GB:1  $0.526  idle            11 mins ago
           1         aws (us-east-1)  T4:16GB:1  $0.526  idle (warning)  11 mins ago
           2         aws (us-east-1)  T4:16GB:1  $0.526  idle (failure)  11 mins ago

In this example, the first instance is healthy, the second has a non-fatal issue and can still be used, and the last has a fatal error that makes it inoperable.

Note

GPU health checks are supported on AWS (except with custom os_images), Azure (except for A10 GPUs), GCP, and OCI, as well as SSH fleet instances with DCGM installed and configured for background health checks. To use GPU health checks, re-create the fleets that were created before 0.19.22.

Tenstorrent Galaxy

dstack now supports Tenstorrent Galaxy cards via SSH fleets.

Backends

Hot Aisle

This release features an integration with Hot Aisle, a cloud provider that offers on-demand access to AMD MI300x GPUs at competitive prices.

> dstack offer -b hotaisle                   

 #  BACKEND                   RESOURCES                                     INSTANCE TYPE                     PRICE   
 1  hotaisle (us-michigan-1)  cpu=13 mem=224GB disk=12288GB MI300X:192GB:1  1x MI300X 13x Xeon Platinum 8470  $1.99
 2  hotaisle (us-michigan-1)  cpu=8 mem=224GB disk=12288GB MI300X:192GB:1   1x MI300X 8x Xeon Platinum 8470   $1.99

Refer to the docs for instructions on configuring the hotaisle backend in your dstack project.

CLI

Reading configurations from stdin

dstack apply can now read configurations from stdin using the -y -f - flags. This allows configuration files to be parameterized in arbitrary ways:

> cat .dstack/volume.dstack.yml
type: volume
name: my-vol

backend: aws
region: us-east-1
size: $VOL_SIZE

> export VOL_SIZE=50
> envsubst '$VOL_SIZE' < .dstack/volume.dstack.yml | dstack apply -y -f -
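
The same parameterization can be scripted when `envsubst` is not available. The sketch below uses Python's standard `string.Template`; the helper names `render_config` and `apply_from_stdin` are hypothetical, and the call to `dstack apply -y -f -` is left commented out because it assumes a configured dstack CLI.

```python
import subprocess
from string import Template


def render_config(template_text: str, variables: dict[str, str]) -> str:
    """Substitute $NAME placeholders, leaving unknown ones untouched."""
    return Template(template_text).safe_substitute(variables)


def apply_from_stdin(config_text: str) -> None:
    """Pipe a rendered configuration to `dstack apply -y -f -`."""
    subprocess.run(
        ["dstack", "apply", "-y", "-f", "-"],
        input=config_text,
        text=True,
        check=True,
    )


volume_template = """\
type: volume
name: my-vol

backend: aws
region: us-east-1
size: $VOL_SIZE
"""

rendered = render_config(volume_template, {"VOL_SIZE": "50"})
print(rendered)
# apply_from_stdin(rendered)  # uncomment to submit; requires a configured dstack CLI
```

`safe_substitute` mirrors the `envsubst '$VOL_SIZE'` call above in that placeholders without a supplied value are left as they are rather than replaced.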

Debug logs

The dstack CLI now saves debug logs to the ~/.dstack/logs/cli/ directory. These logs can be useful for troubleshooting failed commands or submitting bug reports.
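
When preparing a bug report, it can help to locate the most recent log file. The following sketch makes no assumption about file names inside the directory and simply picks the most recently modified file; `latest_cli_log` is a hypothetical helper, not part of dstack.

```python
from pathlib import Path
from typing import Optional


def latest_cli_log(
    log_dir: Path = Path.home() / ".dstack" / "logs" / "cli",
) -> Optional[Path]:
    """Return the most recently modified file in the CLI log directory, if any."""
    if not log_dir.is_dir():
        return None
    files = [p for p in log_dir.iterdir() if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


print(latest_cli_log())
```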

UI

Secrets

The project settings page now has a section to manage secrets.

Logs improvements

The UI can now optionally display timestamps in front of each message in run logs. This can be a lifesaver when debugging runs that write log messages without built-in timestamps.

Additionally, if the dstack server is configured to use external log storage, such as AWS CloudWatch or GCP Logging, a button will appear in the UI to view the logs in that storage system.

What's changed

[Feature]: Add UI for managing Secrets #2882 by @olgenn in dstackai/dstack#2911
[Blog]: Benchmarking AMD GPUs: bare-metal, VMs by @peterschmidt85 in dstackai/dstack#2924
[Feature]: Implement reading apply configuration from stdin by @r4victor in dstackai/dstack#2938
Fix precommit by @olgenn in dstackai/dstack#2936
Fix gateway docs URL by @jspablo in dstackai/dstack#2941
[Feature]: Service probes by @jvstme in dstackai/dstack#2927
Return logs external_url for AWS and GCP by @r4victor in dstackai/dstack#2944
[Feature]: Default CLI log level is DEBUG; WARNING and above go to STDOUT, DEBUG logs to a file by @peterschmidt85 in dstackai/dstack#2940
[Feature]: Support for Tenstorrent Galaxy by @peterschmidt85 in dstackai/dstack#2943
Disallow duplicate project members by @r4victor in dstackai/dstack#2945
[Feature]: If GCP logging or AWS Cloudwatch logging is configured, show link in the UI to the log stream by @olgenn in dstackai/dstack#2948
Specify sentry-sdk[fastapi]>=2.27.0 to fix missing SamplingContext by @r4victor in dstackai/dstack#2950
[Feature]: Showing timestamp for logs by @olgenn in dstackai/dstack#2937
[Landing]: Highlight dstack Sky + CTA improvements by @peterschmidt85 in dstackai/dstack#2947
Fix Lambda backend instance unreachable after dstack server restart by @Bihan in dstackai/dstack#2946
Fix configuring CLI logging on Python 3.9/3.10 by @jvstme in dstackai/dstack#2953
[Feature]: Add NVIDIA GPU passive health checks by @un-def in dstackai/dstack#2952
Fix _check_instance log spam by @un-def in dstackai/dstack#2956
Add more probe request configuration options by @jvstme in dstackai/dstack#2955
[Feature]: Add Hot Aisle backend by @Bihan in dstackai/dstack#2935
[Internal]: Fix release workflow by @jvstme in dstackai/dstack#2959

New Contributors

@jspablo made their first contribution in dstackai/dstack#2941

Full Changelog: dstackai/dstack@0.19.21...0.19.22

Contributors

un-def, olgenn, and 5 other contributors

Assets 2

02 Jul 12:21

un-def

0.19.17-v1

89fee21

0.19.17-v1

Single Sign-On via Google

dstack Enterprise now supports Single Sign-On via Google. When Google integration is configured, the dstack login page will display the Sign in with Google button. See the Google integration guide for more information.

Secrets

dstack now supports secrets, which allow centralized management of sensitive values such as API keys and credentials. Secrets are project-scoped, managed by project admins, and can be referenced in run configurations to pass sensitive values to runs securely. Example:

$ dstack secret set my_secret some_secret_value
OK

type: task
nodes: 1
name: test-secrets
env:
  - MY_SECRET=${{ secrets.my_secret }}
commands:
  - echo $MY_SECRET

$ dstack apply -f .dstack/confs/task.dstack.yaml

Submit the run test-secrets? [y/n]: y
 NAME            BACKEND         RESOURCES              PRICE   STATUS   SUBMITTED
 test-secrets    aws             cpu=2 mem=8GB          $0.107  running  10:48
                 (eu-west-1)     disk=100GB

test-secrets provisioning completed (running)
some_secret_value
Exited (0)

For more details on secrets, check out the docs.
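
To make the `${{ secrets.my_secret }}` reference concrete, here is a minimal sketch of how such placeholders could be resolved. It illustrates the syntax only and is not dstack's implementation; `SECRET_REF` and `interpolate_secrets` are hypothetical names, and the set of characters allowed in secret names is an assumption.

```python
import re

# Matches `${{ secrets.<name> }}` with optional whitespace inside the braces.
SECRET_REF = re.compile(r"\$\{\{\s*secrets\.([A-Za-z0-9_]+)\s*\}\}")


def interpolate_secrets(value: str, secrets: dict[str, str]) -> str:
    """Replace each secret reference in `value`; raise KeyError if one is undefined."""
    return SECRET_REF.sub(lambda m: secrets[m.group(1)], value)


project_secrets = {"my_secret": "some_secret_value"}
print(interpolate_secrets("MY_SECRET=${{ secrets.my_secret }}", project_secrets))
# MY_SECRET=some_secret_value
```

Failing loudly on an undefined secret, as this sketch does, avoids silently starting a run with an empty credential.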

Files

By default, dstack automatically mounts the repo directory where you ran dstack init to any run configuration.

However, in some cases, you may not want to mount the entire directory (e.g., if it’s too large), or you might want to mount files outside of it. In such cases, you can use the files property.

type: task
name: trl-sft

files:
  - .:examples  # Maps the directory containing `.dstack.yml` to `/workflow/examples`
  - ~/.ssh/id_rsa  # Maps `~/.ssh/id_rsa` to `/root/.ssh/id_rsa`

python: 3.12

env:
  - HF_TOKEN
  - HF_HUB_ENABLE_HF_TRANSFER=1
  - MODEL=Qwen/Qwen2.5-0.5B
  - DATASET=stanfordnlp/imdb

commands:
  - uv pip install trl
  - |
    trl sft \
      --model_name_or_path $MODEL --dataset_name $DATASET \
      --num_processes $DSTACK_GPUS_PER_NODE

resources:
  gpu: H100:1

Warning

If you have existing fleets, it's recommended to re-create them after upgrading to version 0.19.17. Otherwise, there is a risk that these instances won't be able to execute jobs if a run uses files.

Services

Rolling deployment

Rolling deployments introduced in 0.19.15 are now supported when deploying new commits or branches from a Git repo, or when changes are made to the repo contents or files listed in the files section.

Additionally, dstack apply now displays a full list of detected changes:

$ dstack apply -f my-service.dstack.yml

Active run my-service already exists. Detected changes that can be updated in-place:
- Repo state (branch, commit, or other)
- File archives
- Configuration properties:
  - env
  - files

Update the run? [y/n]:

Even when a rolling deployment isn't possible, the list of changes is still shown — making it easier to identify which changes are preventing the deployment from proceeding in-place.

What's changed

[Bug]: Docker In Docker does not work with AMD by @peterschmidt85 in dstackai/dstack#2849
[Feature] Add files property to run configurations by @un-def in dstackai/dstack#2848
[Feature] Implement project secrets by @r4victor in dstackai/dstack#2854
[Internal] Support fleet configurations for the local backend by @jvstme in dstackai/dstack#2856
[Services] Rolling deployments for repo updates by @jvstme in dstackai/dstack#2853
[Internal] Fix package dependency direction by @jvstme in dstackai/dstack#2859
[Internal] Rolling deployments for files by @jvstme in dstackai/dstack#2862
[Internal] Support the local backend with the in-server proxy by @jvstme in dstackai/dstack#2858
[Docs] Added Files documentation by @peterschmidt85 in dstackai/dstack#2866
[Bug] Fix ~ expansion in files by @un-def in dstackai/dstack#2865
[Feature] Allow in-place update for more run properties by @jvstme in dstackai/dstack#2867

Full changelog: dstackai/dstack@0.19.16...0.19.17

Contributors

un-def, r4victor, and 2 other contributors


29 Jul 08:31

r4victor

0.19.21-v1

f087935

0.19.21-v1

Runs

Scheduled runs

Runs get a new schedule property that allows starting runs periodically by specifying a cron expression:

type: task
nodes: 1
schedule:
  cron: "*/15 * * * *"
commands:
  - ...

dstack will start a scheduled run at cron times unless the run is already running. It can then be stopped manually to prevent it from starting again. Learn more about scheduled runs in the docs.
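
For readers less familiar with cron syntax, the sketch below computes when `*/15 * * * *` would fire. It is a deliberately simplified matcher written for this example (it supports only `*`, `*/n`, and single integers, and requires all five fields to match), not the parser dstack uses; `field_matches`, `matches`, and `next_starts` are hypothetical names.

```python
from datetime import datetime, timedelta

# (minimum value, extractor) for the five cron fields, in order.
FIELDS = [
    (0, lambda t: t.minute),
    (0, lambda t: t.hour),
    (1, lambda t: t.day),
    (1, lambda t: t.month),
    (0, lambda t: (t.weekday() + 1) % 7),  # cron counts Sunday as 0
]


def field_matches(spec: str, value: int, minimum: int) -> bool:
    """Match one field; supports `*`, `*/n`, and a single integer only."""
    if spec == "*":
        return True
    if spec.startswith("*/"):
        return (value - minimum) % int(spec[2:]) == 0
    return value == int(spec)


def matches(expr: str, t: datetime) -> bool:
    """True if every field of the five-field expression matches time `t`."""
    specs = expr.split()
    return all(field_matches(s, get(t), lo) for s, (lo, get) in zip(specs, FIELDS))


def next_starts(expr: str, after: datetime, count: int) -> list[datetime]:
    """Return the next `count` minutes after `after` that match `expr`."""
    t = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    found = []
    while len(found) < count:
        if matches(expr, t):
            found.append(t)
        t += timedelta(minutes=1)
    return found


for start in next_starts("*/15 * * * *", datetime(2025, 7, 29, 8, 31), 3):
    print(start.isoformat())
# 2025-07-29T08:45:00
# 2025-07-29T09:00:00
# 2025-07-29T09:15:00
```

In other words, the example configuration starts the task at minutes 0, 15, 30, and 45 of every hour, skipping a start if the previous run is still running.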

CLI

Startup time

The CLI now starts up to 4 times faster thanks to optimized Python imports.

Server

Optimized DB queries

We optimized the DB queries issued by the dstack server. This improves API response times and decreases the load on the DB, which was previously noticeable on small Postgres instances.

What's Changed

Support scheduled runs by @r4victor in dstackai/dstack#2914
Autoset UTC timezone for datetimes loaded from the db by @r4victor in dstackai/dstack#2922
Refactor backends module to avoid importing deps on models import by @r4victor in dstackai/dstack#2923
Optimize db queries by @r4victor in dstackai/dstack#2928
Optimize db queries (part 2) by @r4victor in dstackai/dstack#2929
[UI] Add justfile to build frontend by @peterschmidt85 in dstackai/dstack#2897
Fix project loading in _check_instance() by @r4victor in dstackai/dstack#2931
Set up background tasks Sentry tracing by @r4victor in dstackai/dstack#2932

Full Changelog: dstackai/dstack@0.19.20...0.19.21

Contributors

r4victor and peterschmidt85


21 Jul 12:36

r4victor

0.19.20-v1

f087935

0.19.20-v1

User interface

Logs

This is a hotfix release addressing three major issues related to the UI:

The UI didn’t display newer AWS CloudWatch logs if there was a long gap between old and new logs.
Logs received before the 19th appeared as base64-encoded in the UI. The UI now includes a button to decode them automatically.
Logs were loaded from start to end, which made viewing very slow for long runs.

Note

The dstack logs CLI command may still be affected by the issues above. However, it’s less critical and will be addressed separately.

What's changed

[chore]: Drop duplicate utility split_chunks by @jvstme in dstackai/dstack#2912
[backends/CloudRift] Fixed issue with terminating inactive instance by @6erun in dstackai/dstack#2918
Expose GPU metrics collected by runner as Prometheus metrics by @un-def in dstackai/dstack#2916
[UI] Query logs using descending by @peterschmidt85 in dstackai/dstack#2915
[UI] Fix logs loading #2892 by @olgenn in dstackai/dstack#2920

Full changelog: dstackai/dstack@0.19.19...0.19.20

Contributors

un-def, olgenn, and 3 other contributors


17 Jul 05:59

r4victor

0.19.19-v1

f087935

0.19.19-v1

Fleets

SSH fleets in-place updates

You can now add and remove instances in SSH fleets without recreating the entire fleet.

type: fleet
name: ssh-fleet
ssh_config:
  user: dstack
  identity_file: ~/.ssh/dstack
  hosts:
    - 10.0.0.1
    - 10.0.0.2

$ dstack apply -f fleet.dstack.yml
...
Fleet ssh-fleet does not exist yet.
Create the fleet? [y/n]: y
...
 FLEET      INSTANCE  BACKEND       RESOURCES                PRICE  STATUS  CREATED
 ssh-fleet  0         ssh (remote)  cpu=4 mem=4GB disk=30GB  $0     idle    09:08
            1         ssh (remote)  cpu=2 mem=4GB disk=30GB  $0     idle    09:08

Then, if you update the hosts configuration property to

  hosts:
    #- 10.0.0.1  # removed
    - 10.0.0.2
    - 10.0.0.3  # added

and apply the same configuration again, the fleet will be updated in-place. This means you don't need to stop runs on fleet instances that are not affected by the changes. In this example, you can still apply the configuration even if instance 1 is currently busy.

$ dstack apply -f fleet.dstack.yml
...
Found fleet ssh-fleet. Configuration changes detected.
Update the fleet in-place? [y/n]: y
...
 FLEET      INSTANCE  BACKEND       RESOURCES                PRICE  STATUS  CREATED
 ssh-fleet  1         ssh (remote)  cpu=2 mem=4GB disk=30GB  $0     idle    09:08
            2         ssh (remote)  cpu=8 mem=4GB disk=30GB  $0     idle    09:12

Note

In-place updates only allow adding and/or removing instances. The root configuration and the configurations of hosts that are neither added nor removed must stay unchanged; otherwise, the fleet is fully recreated, as before. This restriction may be lifted in the future.
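
The effect of editing the hosts list can be pictured as a simple set difference. The sketch below is illustrative only and assumes hosts are given as plain address strings; `diff_hosts` is a hypothetical helper, not a dstack function.

```python
def diff_hosts(current: list[str], desired: list[str]) -> dict[str, list[str]]:
    """Classify hosts as added, removed, or unchanged, preserving input order."""
    current_set, desired_set = set(current), set(desired)
    return {
        "added": [h for h in desired if h not in current_set],
        "removed": [h for h in current if h not in desired_set],
        "unchanged": [h for h in current if h in desired_set],
    }


changes = diff_hosts(["10.0.0.1", "10.0.0.2"], ["10.0.0.2", "10.0.0.3"])
print(changes)
# {'added': ['10.0.0.3'], 'removed': ['10.0.0.1'], 'unchanged': ['10.0.0.2']}
```

Only the hosts in `added` and `removed` are touched by the update; runs on `unchanged` hosts keep going.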

Volumes

Automatic cleanup of unused volumes

The volume configuration gets a new auto_cleanup_duration property:

type: volume
name: my-volume
backend: aws
region: eu-west-1
availability_zone: eu-west-1a
auto_cleanup_duration: 1h

The volume will be automatically deleted once it has been unused for the specified duration.
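
As a rough model of the cleanup rule, the sketch below checks whether a volume has been idle long enough. The unit suffixes accepted by `parse_duration` are an assumption made for this example, and both function names are hypothetical; this is not dstack's code.

```python
from datetime import datetime, timedelta

# Assumed unit suffixes; check the dstack docs for the formats actually accepted.
UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as `90s`, `15m`, or `1h` into a timedelta."""
    number, unit = text[:-1], text[-1]
    return timedelta(**{UNITS[unit]: int(number)})


def is_cleanup_due(last_used: datetime, now: datetime, auto_cleanup_duration: str) -> bool:
    """True once the volume has been idle for at least the configured duration."""
    return now - last_used >= parse_duration(auto_cleanup_duration)


last_used = datetime(2025, 7, 17, 9, 0)
print(is_cleanup_due(last_used, datetime(2025, 7, 17, 9, 30), "1h"))  # False
print(is_cleanup_due(last_used, datetime(2025, 7, 17, 10, 0), "1h"))  # True
```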

Logs

Browsable, queryable, and searchable logs

dstack now stores run logs in plaintext; previously, they were base64-encoded. This allows you to use the configured log storage, be it AWS CloudWatch or GCP Logging, to browse and query dstack run logs.

Note

Logs generated before this release will be shown as base64-encoded in the UI and CLI after the update.
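
If you need to read such older entries outside of dstack, they can be decoded with any base64 tool. A minimal Python sketch follows; `decode_legacy_log` is a hypothetical helper and the sample message is made up.

```python
import base64


def decode_legacy_log(message: str) -> str:
    """Decode one base64-encoded log message into text."""
    return base64.b64decode(message).decode("utf-8", errors="replace")


# Made-up legacy entry, encoded here only to have something to decode.
legacy = base64.b64encode("Epoch 1/3 done\n".encode()).decode()
print(decode_legacy_log(legacy), end="")  # Epoch 1/3 done
```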

Server

Faster API response times

The dstack server API has been optimized to serialize JSON responses faster. The API endpoints are up to 2x faster than before.

Benchmarks

Benchmarking AMD GPUs: bare-metal, containers, partitions

Our new benchmark explores two important areas for optimizing AI workloads on AMD GPUs: First, do containers introduce a performance penalty for network-intensive tasks compared to a bare-metal setup? Second, how does partitioning a powerful GPU like the MI300X affect its real-world performance for different types of AI workloads?

What's Changed

[Internal] Some runner tests fail on macOS by @peterschmidt85 in dstackai/dstack#2879
Introduce job_submissions_limit for /api/runs/list by @r4victor in dstackai/dstack#2883
Speed up json serialization with orjson and custom FastAPI responses by @r4victor in dstackai/dstack#2880
[Docs]: Service rolling deployments by @jvstme in dstackai/dstack#2870
Do not lose provisioning gateways on restart by @jvstme in dstackai/dstack#2887
Add/remove SSH instances via in-place update by @un-def in dstackai/dstack#2884
[Docs]: Add example of setting a PostgreSQL URL by @jvstme in dstackai/dstack#2888
[Blog] Added new changelog by @peterschmidt85 in dstackai/dstack#2891
Fix job_submissions_limit backward compatibility by @r4victor in dstackai/dstack#2894
Fix run and job status_message calculation by @r4victor in dstackai/dstack#2889
Fix 500 errors when requesting file logs by @r4victor in dstackai/dstack#2896
Rolling deployments for port by @jvstme in dstackai/dstack#2893
[Feature] Strip ANSI codes from run logs and store them as plain text instead of bytes by @peterschmidt85 in dstackai/dstack#2876
[Feature]: Add ability to disable background processing and only run Web UI and API server #2901 by @james-boydell in dstackai/dstack#2902
[shim] Don't check image downloaded size by @un-def in dstackai/dstack#2903
Fix rolling deployment migration locking by @r4victor in dstackai/dstack#2904
feat: add volume idle duration cleanup feature (#2497) by @haydnli-shopify in dstackai/dstack#2842
[Blog] Benchmarking AMD GPUs: bare-metal, containers, partitions by @peterschmidt85 in dstackai/dstack#2905
Fix /users/list by @r4victor in dstackai/dstack#2908
Return logs in base64 for backward compatibility by @r4victor in dstackai/dstack#2910

Full Changelog: dstackai/dstack@0.19.18...0.19.19

Contributors

un-def, r4victor, and 4 other contributors


09 Jul 09:59

r4victor

0.19.18-v1

f087935

0.19.18-v1

Server

Optimized resources processing

This release includes major improvements that allow the dstack server to process more resources faster. It also allows scaling the processing rate of a single server replica to take advantage of big Postgres instances by setting the DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR environment variable.

The result is:

Faster processing rates: provisioning 100 runs on SQLite with default settings went from ~5m to ~2m.
Better scaling: provisioning an additional 100 runs is even quicker thanks to a warm cache. Before, it was slower than the first 100 runs.
Ability to process more runs per server replica: provisioning 300 runs on Postgres with DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR=4 takes ~4m.

For more details on scaling background processing rates, see the Server deployment guide.
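
Expressed as throughput, the figures above work out as follows. This is plain arithmetic on the approximate timings quoted in the list, so the results are rough; `runs_per_minute` is just a helper for this example.

```python
def runs_per_minute(runs: int, minutes: float) -> float:
    """Average provisioning throughput."""
    return runs / minutes


before = runs_per_minute(100, 5)  # SQLite, before this release
after = runs_per_minute(100, 2)   # SQLite, default settings
scaled = runs_per_minute(300, 4)  # Postgres, processing factor 4

print(f"SQLite, default settings: {before:.0f} -> {after:.0f} runs/min ({after / before:.1f}x)")
print(f"Postgres, factor 4: {scaled:.0f} runs/min")
# SQLite, default settings: 20 -> 50 runs/min (2.5x)
# Postgres, factor 4: 75 runs/min
```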

Backends

Private GCP gateways

It's now possible to create GCP gateways without public IPs:

type: gateway
name: example
domain: gateway.example.com
backend: gcp
region: europe-west9
public_ip: false
certificate: null

Note that configuring HTTPS certificates for private GCP gateways is not yet supported, so you need to specify certificate: null.

What's Changed

Ignore SSH keys when calculating fleet conf diff by @un-def in dstackai/dstack#2869
[Blog] Refactoring by @peterschmidt85 in dstackai/dstack#2873
Implemented fronted precommit linting by @olgenn in dstackai/dstack#2868
Support processing more resources per replica by @r4victor in dstackai/dstack#2871
Use uvloop by default by @r4victor in dstackai/dstack#2874
Add server profiling by @r4victor in dstackai/dstack#2875
Fix NVIDIA container toolkit bug in all backends by @jvstme in dstackai/dstack#2877
Private GCP gateways by @jvstme in dstackai/dstack#2881
Switch to e2-medium for GCP gateways by @jvstme in dstackai/dstack#2886

Full Changelog: dstackai/dstack@0.19.17...0.19.18

Contributors

un-def, olgenn, and 3 other contributors


26 Jun 11:24

r4victor

0.19.16-v1

f087935

0.19.16-v1

Docker

Docker in Docker

Using Docker in a run configuration is now much easier. Just set docker to true:

type: task
name: docker-nvidia-smi

docker: true

commands:
  - docker run --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi

resources:
  gpu: 1

This works with all run configuration types and supports both AMD and NVIDIA GPUs. It’s especially useful if you want to use the docker CLI in your commands—for example, to build Docker images.

The docker property is supported on all backends except vastai, runpod, and kubernetes, and is fully supported on SSH fleets as well.

Backends

CloudRift

The CloudRift team has added support for their GPU cloud, which can now be used with dstack.

To configure it, use a CloudRift API key in the backend configuration:

projects:
  - name: main
    backends:
      - type: cloudrift
        creds:
          type: api_key
          api_key: rift_2prgY1d0laOrf2BblTwx2B2d1zcf1zIp4tZYpj5j88qmNgz38pxNlpX3vAo

CloudRift offers competitive on-demand GPU pricing, with more GPUs and regions coming soon.

dstack apply -f examples/.dstack.yml -b cloudrift

 #  BACKEND                      RESOURCES                                    INSTANCE TYPE   PRICE
 1  cloudrift (us-east-nc-nr-1)  cpu=16 mem=100GB disk=1000GB RTX5090:32GB:1  rtx59-16c-nr.1  $0.65

If you encounter any issues with this backend, please report them.

Server

Public projects

You can now create public projects that any user on the server can join or leave without approval. Previously, all projects were private, and adding new members required manual action by an admin or manager—a step that’s redundant in high-trust environments.

Admins can change a project’s visibility at any time in the project settings.

Metrics

The server exports new Prometheus metrics:

dstack_submit_to_provision_duration_seconds: Time from run submission to first job provisioning
dstack_pending_runs_total: Total number of pending runs
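
For a quick check without a full Prometheus setup, a metric value can be read directly from the text exposition format. The sketch below handles unlabeled samples only; `read_metric` is a hypothetical helper, and the sample scrape output (including the metric type and value) is made up for illustration.

```python
def read_metric(exposition_text: str, name: str) -> float:
    """Return the value of an unlabeled sample from Prometheus text-format output."""
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if parts[0] == name:
            return float(parts[1])
    raise KeyError(name)


# Hypothetical scrape output; the type and value are made up.
sample = """\
# HELP dstack_pending_runs_total Total number of pending runs
# TYPE dstack_pending_runs_total gauge
dstack_pending_runs_total 3
"""
print(read_metric(sample, "dstack_pending_runs_total"))  # 3.0
```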

What's changed

[Feature]: Property filter on Fleets, Models, Volumes pages by @olgenn in dstackai/dstack#2824
[Bug]: Run/job status in UI/CLI is shown as provisioning instead of pulling by @peterschmidt85 in dstackai/dstack#2834
[chore]: Fix annotation in update_service_desired_replica_count by @jvstme in dstackai/dstack#2840
Add CloudRift backend by @6erun in dstackai/dstack#2771
Fix Postgres deadlocks by @r4victor in dstackai/dstack#2843
[UX] Simplify the use of Docker inside containers #2468 by @peterschmidt85 in dstackai/dstack#2828
[Docs] Update docs and examples to reflect the docker property by @peterschmidt85 in dstackai/dstack#2831
Add support for Tenstorrent n300 GPUs by @peterschmidt85 in dstackai/dstack#2827
[Feature]: Property filter on Instances page by @olgenn in dstackai/dstack#2826
[UI] Allow to hide the Tour panel by @olgenn in dstackai/dstack#2816
Pr3 add join leave UI buttons by @haydnli-shopify in dstackai/dstack#2795
Health metrics (Part 2) by @Nadine-H in dstackai/dstack#2796
[Bug]: Use a unique token for log pagination instead of a timestamp by @peterschmidt85 in dstackai/dstack#2845
Fix update project required permissions by @r4victor in dstackai/dstack#2846

New contributors

@6erun made their first contribution in dstackai/dstack#2771

Full changelog: dstackai/dstack@0.19.15...0.19.16

Contributors

olgenn, Nadine-H, and 5 other contributors


19 Jun 20:50

jvstme

0.19.15-v1

f087935

0.19.15-v1

Services

Rolling deployments

This update introduces rolling deployments, which help avoid downtime when deploying new versions of your services.

When you apply an updated service configuration, dstack will gradually replace old service replicas with new ones. You can track the progress in the dstack apply output — the deployment number will be lower for old replicas and higher for new ones.

> dstack apply -f my-service.dstack.yml

Active run my-service already exists. Detected configuration changes that can be updated in-place: ['image', 'env', 'commands']
Update the run? [y/n]: y

⠋ Launching my-service...
 NAME                            BACKEND          RESOURCES                        PRICE    STATUS       SUBMITTED
 my-service deployment=1                                                                    running      11 mins ago
   replica=0 job=0 deployment=0  aws (us-west-2)  cpu=2 mem=1GB disk=100GB (spot)  $0.0026  terminating  11 mins ago
   replica=1 job=0 deployment=1  aws (us-west-2)  cpu=2 mem=1GB disk=100GB (spot)  $0.0026  running      1 min ago

Currently, the following service configuration properties can be updated using rolling deployments: resources, volumes, image, user, privileged, entrypoint, python, nvcc, single_branch, env, shell, and commands.

Future releases will allow updating more properties and deploying new git repo commits.
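
The rule above can be read as a subset check: an update is eligible for a rolling deployment when every changed property is in the supported list. The sketch below encodes the list as of 0.19.15; the function names are hypothetical, and dstack's real comparison logic may differ.

```python
# Properties listed above as updatable via rolling deployments in 0.19.15.
ROLLING_PROPERTIES = {
    "resources", "volumes", "image", "user", "privileged", "entrypoint",
    "python", "nvcc", "single_branch", "env", "shell", "commands",
}


def changed_properties(old: dict, new: dict) -> set[str]:
    """Names of top-level properties whose values differ between two configs."""
    return {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}


def can_update_in_place(old: dict, new: dict) -> bool:
    """True if every changed property is in the rolling-deployment list."""
    return changed_properties(old, new) <= ROLLING_PROPERTIES


old = {"type": "service", "port": 80, "image": "my-app:1", "env": ["A=1"]}
print(can_update_in_place(old, {**old, "image": "my-app:2"}))  # True
print(can_update_in_place(old, {**old, "port": 8080}))         # False
```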

Clusters

Updated default Docker images

If you don't specify a custom image in the run configuration, dstack uses its default images. These images have been improved for cluster environments and now include mpirun and NCCL tests. Additionally, if you are running on AWS EFA-capable instances, dstack will now automatically select an image with the appropriate EFA drivers. See our new AWS EFA guide for more details.

Server

Health metrics

The dstack server now exports operational Prometheus metrics that let you monitor its health. If you are running your own production-grade dstack server installation, refer to the metrics docs for details.

What's changed

Set logsWaitDuration to 5m by @r4victor in dstackai/dstack#2794
Add health metrics (Part 1) by @Nadine-H in dstackai/dstack#2760
Add public projects by @haydnli-shopify in dstackai/dstack#2759
Fix is_public allowing null by @r4victor in dstackai/dstack#2798
Retry on VOLUME_ERROR and INSTANCE_UNREACHABLE by @jvstme in dstackai/dstack#2805
Rework default Docker images by @peterschmidt85 in dstackai/dstack#2799
Fix volume error status message by @jvstme in dstackai/dstack#2806
[Docs] Added EFA example by @peterschmidt85 in dstackai/dstack#2820
[Bug]: Empty spaces on User Details page by @olgenn in dstackai/dstack#2815
Rolling deployment for services by @jvstme in dstackai/dstack#2821
Fix building dstack package by @jvstme in dstackai/dstack#2823

New Contributors

@haydnli-shopify made their first contribution in dstackai/dstack#2759

Full Changelog: dstackai/dstack@0.19.13...0.19.15

Contributors

olgenn, Nadine-H, and 4 other contributors


11 Jun 10:28

r4victor

0.19.13-v1

f087935

0.19.13-v1

Clusters

Built-in InfiniBand support in `dstack` Docker images

The dstack default Docker images now come with built-in InfiniBand support, which includes the necessary libibverbs library and InfiniBand utilities from rdma-core. This means you can run torch distributed and other workloads utilizing NCCL, and they'll take full advantage of InfiniBand without custom Docker images.

You can try InfiniBand clusters with dstack on Nebius.

Built-in EFA support in `dstack` VM images

dstack now uses DLAMI as the default AWS GPU VM image instead of a custom one. DLAMI supports EFA out-of-the-box, so you no longer need to use a custom VM image to take advantage of EFA.

Server

GCS support for code uploads

It's now possible to configure the dstack server to use GCP Cloud Storage for code uploads. Previously, only DB and S3 storages were supported. Learn more in the Server deployment guide.

What's Changed

Support file upload to gcs bucket by @colinjc in dstackai/dstack#2737
Document File storage by @r4victor in dstackai/dstack#2755
[Docs] Minor update of Clusters and Distributed tasks sections by @peterschmidt85 in dstackai/dstack#2741
Fix CLI exiting while master starting by @r4victor in dstackai/dstack#2757
[UI] Implement property filter on Run list page by @olgenn in dstackai/dstack#2762
[Bug]: Text is unavailable for selection on run logs page by @olgenn in dstackai/dstack#2763
Preinstall rdma-core packages into dstack Docker image by @r4victor in dstackai/dstack#2764
[UX] Show status message as retrying in case a run or job is being retired by @peterschmidt85 in dstackai/dstack#2758
[Docs] Minor improvements by @peterschmidt85 in dstackai/dstack#2766
[Feature]: Include priority to the list of runs and sort runs by priority by @olgenn in dstackai/dstack#2768
[Feature]: The Run details page should display the same fields as the Run list page by @olgenn in dstackai/dstack#2769
[Feature]: Show Quickstart button if user don't have any runs by @olgenn in dstackai/dstack#2770
[Feature]: Implement links for elements that have details page by @olgenn in dstackai/dstack#2772
[Feature]: Add Refresh button on Run details page by @olgenn in dstackai/dstack#2773
[Bug]: Tab Billing changes to Settings after top up balance by @olgenn in dstackai/dstack#2774
Exclude backward incompatible fields from rest plugin calls by @colinjc in dstackai/dstack#2767
[UI] Minor fixes by @peterschmidt85 in dstackai/dstack#2775
Pin dkms by @r4victor in dstackai/dstack#2776
Use DLAMI on AWS by @r4victor in dstackai/dstack#2782
2674 prop filter by @olgenn in dstackai/dstack#2778
Fixed defect #2752 by @olgenn in dstackai/dstack#2784
Update base image to 0.9 by @r4victor in dstackai/dstack#2786
Fix status_message with missing on_events by @r4victor in dstackai/dstack#2788
[Bug]: UI doesn't show Resources for instances of SSH fleets by @peterschmidt85 in dstackai/dstack#2785
Ignore AWS quotas when hitting rate limits by @r4victor in dstackai/dstack#2791

Full Changelog: dstackai/dstack@0.19.12...0.19.13

Contributors

olgenn, colinjc, and 2 other contributors


04 Jun 11:18

r4victor

0.19.12-v1

f087935

0.19.12-v1

Clusters

Simplified use of MPI

`startup_order` and `stop_criteria`

New run configuration properties are introduced:

startup_order: any/master-first/workers-first specifies the order in which master and worker jobs are started.
stop_criteria: all-done/master-done specifies the criteria when a multi-node run should be considered finished.

These properties simplify running certain multi-node workloads. For example, MPI requires that workers are up and running when the master runs mpirun, so you'd use startup_order: workers-first. An MPI workload can be considered done when the master is done, so you'd use stop_criteria: master-done and dstack won't wait for workers to exit.

`DSTACK_MPI_HOSTFILE`

dstack now automatically creates an MPI hostfile and exposes the DSTACK_MPI_HOSTFILE environment variable with the hostfile path. It can be used directly as mpirun --hostfile $DSTACK_MPI_HOSTFILE.
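
If you launch `mpirun` from a Python entrypoint rather than a shell command, the same variables can be used to assemble the command line. This is a sketch under assumptions: `build_mpirun_command` is a hypothetical helper, and the values in `fake_env` merely stand in for what dstack would set inside a job.

```python
import os
import shlex


def build_mpirun_command(program: list[str], env=os.environ) -> list[str]:
    """Assemble an mpirun argv from dstack-provided environment variables."""
    return [
        "mpirun",
        "--hostfile", env["DSTACK_MPI_HOSTFILE"],
        "-n", env["DSTACK_GPUS_NUM"],
        "-N", env["DSTACK_GPUS_PER_NODE"],
        *program,
    ]


# Hypothetical values standing in for what dstack would set inside a job.
fake_env = {
    "DSTACK_MPI_HOSTFILE": "/tmp/hostfile",
    "DSTACK_GPUS_NUM": "8",
    "DSTACK_GPUS_PER_NODE": "4",
}
print(shlex.join(build_mpirun_command(["./all_reduce_perf", "-b", "8"], fake_env)))
# mpirun --hostfile /tmp/hostfile -n 8 -N 4 ./all_reduce_perf -b 8
```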

CLI

We've also updated how the CLI displays run and job status. Previously, the CLI displayed the internal status code, which was hard to interpret. Now, the STATUS column in dstack ps and dstack apply displays a status that makes it easy to understand why a run or job was terminated.

dstack ps -n 10
 NAME               BACKEND             RESOURCES                            PRICE    STATUS        SUBMITTED
 oom-task                                                                             no offers     yesterday
 oom-task           nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  exited (127)  yesterday
 oom-task           nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  exited (127)  yesterday
 heavy-wolverine-1                                                                    done          yesterday
   replica=0 job=0  aws (us-east-1)     cpu=4 mem=16GB disk=100GB T4:16GB:1  $0.526   exited (0)    yesterday
   replica=0 job=1  aws (us-east-1)     cpu=4 mem=16GB disk=100GB T4:16GB:1  $0.526   exited (0)    yesterday
 cursor             nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  stopped       yesterday
 cursor             nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  error         yesterday
 cursor             nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  interrupted   yesterday
 cursor             nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  aborted       yesterday

Examples

Simplified NCCL tests

With this release's improvements, it is now much easier to run MPI workloads with dstack. This includes NCCL tests, which can now be run using the following configuration:

type: task
name: nccl-tests

nodes: 2
startup_order: workers-first
stop_criteria: master-done

image: dstackai/efa
env:
  - NCCL_DEBUG=INFO
commands:
  - cd /root/nccl-tests/build
  - |
    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
      mpirun \
        --allow-run-as-root --hostfile $DSTACK_MPI_HOSTFILE \
        -n ${DSTACK_GPUS_NUM} \
        -N ${DSTACK_GPUS_PER_NODE} \
        --mca btl_tcp_if_exclude lo,docker0 \
        --bind-to none \
        ./all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi

resources:
  gpu: nvidia:4:16GB
  shm_size: 16GB

See the updated NCCL tests example for more details.

Distributed training

TRL

The new TRL example walks you through how to run distributed fine-tuning using TRL, Accelerate, and DeepSpeed.

Axolotl

The new Axolotl example walks you through how to run distributed fine-tuning using Axolotl with dstack.

What's changed

[Feature] Update .gitignore logic to catch more cases by @colinjc in dstackai/dstack#2695
[Bug] Increase upload_code client timeout by @r4victor in dstackai/dstack#2709
[Bug] Fix missing apt-get update by @r4victor in dstackai/dstack#2710
[Internal]: Update git hooks and package.json by @olgenn in dstackai/dstack#2706
[Examples] Add distributed Axolotl and TRL example by @Bihan in dstackai/dstack#2703
[Docs] Update dstack-proxy contributing guide by @jvstme in dstackai/dstack#2683
[Feature] Implement DSTACK_MPI_HOSTFILE by @r4victor in dstackai/dstack#2718
[Feature] Implement startup_order and stop_criteria by @r4victor in dstackai/dstack#2714
[Bug] Fix CLI exiting while master starting by @r4victor in dstackai/dstack#2720
[Examples] Simplify NCCL tests example by @r4victor in dstackai/dstack#2723
[Examples] Update TRL Single Node example to uv by @Bihan in dstackai/dstack#2715
[Bug] Fix backward compatibility when creating fleets by @jvstme in dstackai/dstack#2727
[UX]: Make run status in UI and CLI easier to understand by @peterschmidt85 in dstackai/dstack#2716
[Bug] Fix relative paths in dstack apply --repo by @jvstme in dstackai/dstack#2733
[Internal]: Drop hardcoded regions from the backend template by @jvstme in dstackai/dstack#2734
[Internal]: Update backend template to match ruff formatting by @jvstme in dstackai/dstack#2735

Full changelog: dstackai/dstack@0.19.11...0.19.12

Contributors

olgenn, Bihan, and 4 other contributors
