From 0145e60a6eb85d26a649296b4fbd0c4a2932e5cd Mon Sep 17 00:00:00 2001 From: Anda Date: Fri, 22 Nov 2024 12:37:36 -0800 Subject: [PATCH 1/4] docs: add release notes for 0.38.0 --- docs/release-notes.rst | 358 ++++++++++++++++++ docs/release-notes/9966-fix-grid.rst | 7 - .../add-host-port-scheme-to-helm.rst | 9 - docs/release-notes/api-cli-access-token.rst | 28 -- docs/release-notes/config-policies.rst | 15 - docs/release-notes/helm-db-snapshot.rst | 6 - docs/release-notes/log-signal.rst | 10 - .../pytorch-tensorboard-plugin.rst | 10 - .../rbac-new-tokenCreator-role.rst | 7 - docs/release-notes/remove-custom-searcher.rst | 7 - .../searcher-context-removal.rst | 72 ---- docs/release-notes/ssh-crypto-system.rst | 8 - .../unsupport-aurora-postgres-reminder.rst | 19 - 13 files changed, 358 insertions(+), 198 deletions(-) delete mode 100644 docs/release-notes/9966-fix-grid.rst delete mode 100644 docs/release-notes/add-host-port-scheme-to-helm.rst delete mode 100644 docs/release-notes/api-cli-access-token.rst delete mode 100644 docs/release-notes/config-policies.rst delete mode 100644 docs/release-notes/helm-db-snapshot.rst delete mode 100644 docs/release-notes/log-signal.rst delete mode 100644 docs/release-notes/pytorch-tensorboard-plugin.rst delete mode 100644 docs/release-notes/rbac-new-tokenCreator-role.rst delete mode 100644 docs/release-notes/remove-custom-searcher.rst delete mode 100644 docs/release-notes/searcher-context-removal.rst delete mode 100644 docs/release-notes/ssh-crypto-system.rst delete mode 100644 docs/release-notes/unsupport-aurora-postgres-reminder.rst diff --git a/docs/release-notes.rst b/docs/release-notes.rst index 98bf8843186..fdf884afa8c 100644 --- a/docs/release-notes.rst +++ b/docs/release-notes.rst @@ -6,6 +6,364 @@ Release Notes ############### +************** + Version 0.38 +************** + +Version 0.38.0 +============== + +**Release Date:** November 22, 2024 + +**Breaking Changes** + +- ASHA: All experiments using ASHA hyperparameter 
search must now configure ``max_time`` and + ``time_metric`` in the experiment config, instead of ``max_length``. Additionally, training code + must report the configured ``time_metric`` in validation metrics. As a convenience, Determined + training loops now automatically report ``batches`` and ``epochs`` with metrics, which you can + use as your ``time_metric``. ASHA experiments without this modification will no longer run. + +- Custom Searchers: all custom searchers including DeepSpeed Autotune were deprecated in ``0.36.0`` + and are now being removed. Users are encouraged to use a preset searcher, which can be easily + :ref:`configured ` for any experiment. + +- API: Custom Searcher (including DeepSpeed AutoTune) was deprecated in 0.36.0 and is now removed. + We will maintain first-class support for a variety of preset searchers, which can be easily + configured for any experiment. Visit :ref:`search-methods` for details. + +**New Features** + +- API/CLI: Add support for access tokens. Add the ability to create and administer access tokens + for users to authenticate in automated workflows. Users can define the lifespan of these tokens, + making it easier to securely authenticate and run processes. Users can set global defaults and + limits for the validity of access tokens by configuring ``default_lifespan_days`` and + ``max_lifespan_days`` in the master configuration. Setting ``max_lifespan_days`` to ``-1`` + indicates an **infinite** lifespan for the access token. This feature enhances automation while + maintaining strong security protocols by allowing tighter control over token usage and + expiration. This feature requires Determined Enterprise Edition. + + - CLI: + + - ``det token create``: Create a new access token. + - ``det token login``: Sign in with an access token. + - ``det token edit``: Update an access token's description. + - ``det token list``: List all active access tokens, with options for displaying revoked + tokens. 
+ - ``det token describe``: Show details of specific access tokens. + - ``det token revoke``: Revoke an access token. + + - API: + + - ``POST /api/v1/tokens``: Create a new access token. + - ``GET /api/v1/tokens``: Retrieve a list of access tokens. + - ``PATCH /api/v1/tokens/{token_id}``: Edit an existing access token. + +- API: introduce ``keras.DeterminedCallback``, a new high-level training API for TF Keras that + integrates Keras training code with Determined through a single :ref:`Keras Callback + `. + +- API: introduce ``deepspeed.Trainer``, a new high-level training API for DeepSpeedTrial that + allows for Python-side training loop configurations and includes support for local training. + +- Cluster: In the enterprise edition of Determined, add :ref:`config policies ` to + enable administrators to set limits on how users can define workloads (e.g., experiments, + notebooks, TensorBoards, shells, and commands). Administrators can define two types of + configurations: + + - **Invariant Configs for Experiments**: Settings applied to all experiments within a specific + scope (global or workspace). Invariant configs for other tasks (e.g., notebooks, TensorBoards, + shells, and commands) are not yet supported. + + - **Constraints**: Restrictions that prevent users from exceeding resource limits within a + scope. Constraints can be set independently for experiments and tasks. + +- Helm: Support configuring ``determined_master_host``, ``determined_master_port``, and + ``determined_master_scheme``. These control how tasks address the Determined API server and are + useful when installations span multiple Kubernetes clusters or there are proxies in between tasks + and the master. Also, ``determined_master_host`` now defaults to the service host, + ``..svc.cluster.local``, instead of the service IP. + +- Helm: Add support for capturing and restoring snapshots of the database persistent volume. Visit + :ref:`helm-config-reference` for more details. 
+ +- New RBAC role: In the enterprise edition of Determined, add a ``TokenCreator`` RBAC role, which + allows users to create, view, and revoke their own :ref:`access tokens `. This + role can only be assigned globally. + +- Experiments: Add a ``name`` field to ``log_policies``. When a log policy matches, its name shows + as a label in the WebUI, making it easy to spot specific issues during a run. Labels appear in + both the run table and run detail views. + + In addition, there is a new format: ``name`` is required, and ``action`` is now a plain string. + For more details, refer to :ref:`log_policies `. + +**Improvements** + +- Master Configuration: Add support for crypto system configuration for SSH connections. + ``security.key_type`` now accepts ``RSA``, ``ECDSA`` or ``ED25519``. The default key type has changed + from ``1024-bit RSA`` to ``ED25519``, since ``ED25519`` keys are faster and more secure than the + old default, and ``ED25519`` is also the default key type for ``ssh-keygen``. + +**Removed Features** + +- WebUI: "Continue Training" no longer supports a configurable number of batches in the Web UI and + will simply resume the trial from the last checkpoint. + +**Known Issues** + +- PyTorch has `deprecated + ` + their Profiler TensorBoard Plugin (``tb_plugin``), so some features may not be compatible with + PyTorch 2.0 and above. Our current default environment image comes with PyTorch 2.3. If users are + experiencing issues with this plugin, we suggest using an image with a PyTorch version earlier + than 2.0. + +**Bug Fixes** + +- Previously, during a grid search, if a hyperparameter contained an empty nested hyperparameter + (that is, just an empty map), that hyperparameter would not appear in the hparams passed to the + trial. + +**Deprecations** + +- Experiment Config: the ``max_length`` field of the searcher configuration section has been + deprecated for all experiments and searchers. 
Users are expected to configure the desired + training length directly in training code. + +- Experiment Config: the ``optimizations`` config has been deprecated. Please see :ref:`Training + APIs ` to configure supported optimizations through training code directly. + +- Experiment Config: the ``scheduling_unit``, ``min_checkpoint_period``, and + ``min_validation_period`` config fields have been deprecated. Instead, these configuration + options should be specified in training code. + +- Experiment Config: the ``entrypoint`` field no longer accepts ``model_def:TrialClass`` as trial + definitions. Please invoke your training script directly (``python3 train.py``). + +- Core API: the ``SearcherContext`` (``core.searcher``) has been deprecated. Training code no + longer requires ``core.searcher.operations`` to run, and progress should be reported through + ``core.train.report_progress``. + +- DeepSpeed: the ``num_micro_batches_per_slot`` and ``train_micro_batch_size_per_gpu`` attributes + on ``DeepSpeedContext`` have been replaced with ``get_train_micro_batch_size_per_gpu()`` and + ``get_num_micro_batches_per_slot()``. + +- Horovod: the horovod distributed training backend has been deprecated. Users are encouraged to + migrate to the native distributed backend of their training framework (``torch.distributed`` or + ``tf.distribute``). + +- Trial APIs: ``TFKerasTrial`` has been deprecated. Users are encouraged to migrate to the new + :ref:`Keras Callback `. + +- Launchers: the ``--trial`` argument in Determined launchers has been deprecated. Please invoke + your training script directly. + +- ASHA: the ``stop_once`` field of the ``searcher`` config for ASHA searchers has been deprecated. + All ASHA searches are now early-stopping based (``stop_once: true``) instead of promotion based. + +- CLI: The ``--test`` and ``--local`` flags for ``det experiment create`` have been deprecated. All + training APIs now support local execution (``python3 train.py``). 
Please see ``training apis`` + for details specific to your framework. + +- Web UI: previously, trials that reported an ``epoch`` metric enabled an epoch X-axis in the Web + UI metrics tab. This metric name has been changed to ``epochs``, with ``epoch`` as a fallback + option. + +- Cluster: A reminder that Amazon Aurora V1 will reach End of Life at the end of 2024. It is no + longer supported as the default persistent storage for AWS Determined deployments. We recommend + that users migrate to Amazon RDS for PostgreSQL. For more information, visit the `migration + instructions `_. + +- Cluster: After Amazon Aurora V1 reaches End of Life, support for Amazon Aurora V1 in ``det deploy + aws`` will be removed. Future deployments will default to the ``simple-rds`` type, which uses + Amazon RDS for PostgreSQL. Changes to the deployment code will ensure this transition to the new + default. + +- Database: As a follow-up to the earlier notice, PostgreSQL 12 will reach End of Life on November + 14, 2024. Instances still using PostgreSQL 12 or earlier should upgrade to PostgreSQL 13 or later + to maintain compatibility. The application will log a warning if it detects a connection to any + PostgreSQL version older than 12, and this warning will be updated to include PostgreSQL 12 once + it is End of Life. + +************** + Version 0.38 +************** + +Version 0.38.0 +============== + +**Release Date:** November 22, 2024 + +**Breaking Changes** + +- ASHA: All experiments using ASHA hyperparameter search must now configure ``max_time`` and + ``time_metric`` in the experiment config, instead of ``max_length``. Additionally, training code + must report the configured ``time_metric`` in validation metrics. As a convenience, Determined + training loops now automatically report ``batches`` and ``epochs`` with metrics, which you can + use as your ``time_metric``. ASHA experiments without this modification will no longer run. 
+ +- Custom Searchers: all custom searchers including DeepSpeed Autotune were deprecated in ``0.36.0`` + and are now being removed. Users are encouraged to use a preset searcher, which can be easily + :ref:`configured ` for any experiment. + +- API: Custom Searcher (including DeepSpeed AutoTune) was deprecated in 0.36.0 and is now removed. + We will maintain first-class support for a variety of preset searchers, which can be easily + configured for any experiment. Visit :ref:`search-methods` for details. + +**New Features** + +- API/CLI: Add support for access tokens. Add the ability to create and administer access tokens + for users to authenticate in automated workflows. Users can define the lifespan of these tokens, + making it easier to securely authenticate and run processes. Users can set global defaults and + limits for the validity of access tokens by configuring ``default_lifespan_days`` and + ``max_lifespan_days`` in the master configuration. Setting ``max_lifespan_days`` to ``-1`` + indicates an **infinite** lifespan for the access token. This feature enhances automation while + maintaining strong security protocols by allowing tighter control over token usage and + expiration. This feature requires Determined Enterprise Edition. + + - CLI: + + - ``det token create``: Create a new access token. + - ``det token login``: Sign in with an access token. + - ``det token edit``: Update an access token's description. + - ``det token list``: List all active access tokens, with options for displaying revoked + tokens. + - ``det token describe``: Show details of specific access tokens. + - ``det token revoke``: Revoke an access token. + + - API: + + - ``POST /api/v1/tokens``: Create a new access token. + - ``GET /api/v1/tokens``: Retrieve a list of access tokens. + - ``PATCH /api/v1/tokens/{token_id}``: Edit an existing access token. 
+ +- API: introduce ``keras.DeterminedCallback``, a new high-level training API for TF Keras that + integrates Keras training code with Determined through a single :ref:`Keras Callback + `. + +- API: introduce ``deepspeed.Trainer``, a new high-level training API for DeepSpeedTrial that + allows for Python-side training loop configurations and includes support for local training. + +- Cluster: In the enterprise edition of Determined, add :ref:`config policies ` to + enable administrators to set limits on how users can define workloads (e.g., experiments, + notebooks, TensorBoards, shells, and commands). Administrators can define two types of + configurations: + + - **Invariant Configs for Experiments**: Settings applied to all experiments within a specific + scope (global or workspace). Invariant configs for other tasks (e.g. notebooks, TensorBoards, + shells, and commands) is not yet supported. + + - **Constraints**: Restrictions that prevent users from exceeding resource limits within a + scope. Constraints can be set independently for experiments and tasks. + +- Helm: Support configuring ``determined_master_host``, ``determined_master_port``, and + ``determined_master_scheme``. These control how tasks address the Determined API server and are + useful when installations span multiple Kubernetes clusters or there are proxies in between tasks + and the master. Also, ``determined_master_host`` now defaults to the service host, + ``..svc.cluster.local``, instead of the service IP. + +- Helm: Add support for capturing and restoring snapshots of the database persistent volume. Visit + :ref:`helm-config-reference` for more details. + +- New RBAC role: In the enterprise edition of Determined, add a ``TokenCreator`` RBAC role, which + allows users to create, view, and revoke their own :ref:`access tokens `. This + role can only be assigned globally. + +- Experiments: Add a ``name`` field to ``log_policies``. 
When a log policy matches, its name shows + as a label in the WebUI, making it easy to spot specific issues during a run. Labels appear in + both the run table and run detail views. + + In addition, there is a new format: ``name`` is required, and ``action`` is now a plain string. + For more details, refer to :ref:`log_policies `. + +**Improvements** + +- Master Configuration: Add support for crypto system configuration for ssh connection. + ``security.key_type`` now accepts ``RSA``, ``ECDSA`` or ``ED25519``. Default key type is changed + from ``1024-bit RSA`` to ``ED25519``, since ``ED25519`` keys are faster and more secure than the + old default, and ``ED25519`` is also the default key type for ``ssh-keygen``. + +**Removed Features** + +- WebUI: "Continue Training" no longer supports configurable number of batches in the Web UI and + will simply resume the trial from the last checkpoint. + +**Known Issues** + +- PyTorch has `deprecated + ` + their Profiler TensorBoard Plugin (``tb_plugin``), so some features may not be compatible with + PyTorch 2.0 and above. Our current default environment image comes with PyTorch 2.3. If users are + experiencing issues with this plugin, we suggest using an image with a PyTorch version earlier + than 2.0. + +**Bug Fixes** + +- Previously, during a grid search, if a hyperparameter contained an empty nested hyperparameter + (that is, just an empty map), that hyperparameter would not appear in the hparams passed to the + trial. + +**Deprecations** + +- Experiment Config: the ``max_length`` field of the searcher configuration section has been + deprecated for all experiments and searchers. Users are expected to configure the desired + training length directly in training code. + +- Experiment Config: the ``optimizations`` config has been deprecated. Please see :ref:`Training + APIs ` to configure supported optimizations through training code directly. 
+ +- Experiment Config: the ``scheduling_unit``, ``min_checkpoint_period``, and + ``min_validation_period`` config fields have been deprecated. Instead, these configuration + options should be specified in training code. + +- Experiment Config: the ``entrypoint`` field no longer accepts ``model_def:TrialClass`` as trial + definitions. Please invoke your training script directly (``python3 train.py``). + +- Core API: the ``SearcherContext`` (``core.searcher``) has been deprecated. Training code no + longer requires ``core.searcher.operations`` to run, and progress should be reported through + ``core.train.report_progress``. + +- DeepSpeed: the ``num_micro_batches_per_slot`` and ``train_micro_batch_size_per_gpu`` attributes + on ``DeepSpeedContext`` have been replaced with ``get_train_micro_batch_size_per_gpu()`` and + ``get_num_micro_batches_per_slot()``. + +- Horovod: the horovod distributed training backend has been deprecated. Users are encouraged to + migrate to the native distributed backend of their training framework (``torch.distributed`` or + ``tf.distribute``). + +- Trial APIs: ``TFKerasTrial`` has been deprecated. Users are encouraged to migrate to the new + :ref:`Keras Callback `. + +- Launchers: the ``--trial`` argument in Determined launchers has been deprecated. Please invoke + your training script directly. + +- ASHA: the ``stop_once`` field of the ``searcher`` config for ASHA searchers has been deprecated. + All ASHA searches are now early-stopping based (``stop_once: true``) instead of promotion based. + +- CLI: The ``--test`` and ``--local`` flags for ``det experiment create`` have been deprecated. All + training APIs now support local execution (``python3 train.py``). Please see ``training apis`` + for details specific to your framework. + +- Web UI: previously, trials that reported an ``epoch`` metric enabled an epoch X-axis in the Web + UI metrics tab. This metric name has been changed to ``epochs``, with ``epoch`` as a fallback + option. 
+ +- Cluster: A reminder that Amazon Aurora V1 will reach End of Life at the end of 2024. It is no + longer supported as the default persistent storage for AWS Determined deployments. We recommend + that users migrate to Amazon RDS for PostgreSQL. For more information, visit the `migration + instructions `_. + +- Cluster: After Amazon Aurora V1 reaches End of Life, support for Amazon Aurora V1 in ``det deploy + aws`` will be removed. Future deployments will default to the ``simple-rds`` type, which uses + Amazon RDS for PostgreSQL. Changes to the deployment code will ensure this transition to the new + default. + +- Database: As a follow-up to the earlier notice, PostgreSQL 12 will reach End of Life on November + 14, 2024. Instances still using PostgreSQL 12 or earlier should upgrade to PostgreSQL 13 or later + to maintain compatibility. The application will log a warning if it detects a connection to any + PostgreSQL version older than 12, and this warning will be updated to include PostgreSQL 12 once + it is End of Life. + ************** Version 0.37 ************** diff --git a/docs/release-notes/9966-fix-grid.rst b/docs/release-notes/9966-fix-grid.rst deleted file mode 100644 index f36dc1b8dc6..00000000000 --- a/docs/release-notes/9966-fix-grid.rst +++ /dev/null @@ -1,7 +0,0 @@ -:orphan: - -**Fixes** - -- Previously, during a grid search, if a hyperparameter contained an empty nested hyperparameter - (that is, just an empty map), that hyperparameter would not appear in the hparams passed to the - trial. diff --git a/docs/release-notes/add-host-port-scheme-to-helm.rst b/docs/release-notes/add-host-port-scheme-to-helm.rst deleted file mode 100644 index d0f49a72c86..00000000000 --- a/docs/release-notes/add-host-port-scheme-to-helm.rst +++ /dev/null @@ -1,9 +0,0 @@ -:orphan: - -**New Features** - -- Helm: Support configuring ``determined_master_host``, ``determined_master_port``, and - ``determined_master_scheme``. 
These control how tasks address the Determined API server and are - useful when installations span multiple Kubernetes clusters or there are proxies in between tasks - and the master. Also, ``determined_master_host`` now defaults to the service host, - ``..svc.cluster.local``, instead of the service IP. diff --git a/docs/release-notes/api-cli-access-token.rst b/docs/release-notes/api-cli-access-token.rst deleted file mode 100644 index 67fb614c350..00000000000 --- a/docs/release-notes/api-cli-access-token.rst +++ /dev/null @@ -1,28 +0,0 @@ -:orphan: - -**New Features** - -- API/CLI: Add support for access tokens. Add the ability to create and administer access tokens - for users to authenticate in automated workflows. Users can define the lifespan of these tokens, - making it easier to securely authenticate and run processes. Users can set global defaults and - limits for the validity of access tokens by configuring ``default_lifespan_days`` and - ``max_lifespan_days`` in the master configuration. Setting ``max_lifespan_days`` to ``-1`` - indicates an **infinite** lifespan for the access token. This feature enhances automation while - maintaining strong security protocols by allowing tighter control over token usage and - expiration. This feature requires Determined Enterprise Edition. - - - CLI: - - - ``det token create``: Create a new access token. - - ``det token login``: Sign in with an access token. - - ``det token edit``: Update an access token's description. - - ``det token list``: List all active access tokens, with options for displaying revoked - tokens. - - ``det token describe``: Show details of specific access tokens. - - ``det token revoke``: Revoke an access token. - - - API: - - - ``POST /api/v1/tokens``: Create a new access token. - - ``GET /api/v1/tokens``: Retrieve a list of access tokens. - - ``PATCH /api/v1/tokens/{token_id}``: Edit an existing access token. 
diff --git a/docs/release-notes/config-policies.rst b/docs/release-notes/config-policies.rst deleted file mode 100644 index 66e768b62d8..00000000000 --- a/docs/release-notes/config-policies.rst +++ /dev/null @@ -1,15 +0,0 @@ -:orphan: - -**New Features** - -- Cluster: In the enterprise edition of Determined, add :ref:`config policies ` to - enable administrators to set limits on how users can define workloads (e.g., experiments, - notebooks, TensorBoards, shells, and commands). Administrators can define two types of - configurations: - - - **Invariant Configs for Experiments**: Settings applied to all experiments within a specific - scope (global or workspace). Invariant configs for other tasks (e.g. notebooks, TensorBoards, - shells, and commands) is not yet supported. - - - **Constraints**: Restrictions that prevent users from exceeding resource limits within a - scope. Constraints can be set independently for experiments and tasks. diff --git a/docs/release-notes/helm-db-snapshot.rst b/docs/release-notes/helm-db-snapshot.rst deleted file mode 100644 index c9e276d68e1..00000000000 --- a/docs/release-notes/helm-db-snapshot.rst +++ /dev/null @@ -1,6 +0,0 @@ -:orphan: - -**New Features** - -- Helm: Add support for capturing and restoring snapshots of the database persistent volume. Visit - :ref:`helm-config-reference` for more details. diff --git a/docs/release-notes/log-signal.rst b/docs/release-notes/log-signal.rst deleted file mode 100644 index 743b0c6c56b..00000000000 --- a/docs/release-notes/log-signal.rst +++ /dev/null @@ -1,10 +0,0 @@ -:orphan: - -**New Features** - -- Experiments: Add a ``name`` field to ``log_policies``. When a log policy matches, its name shows - as a label in the WebUI, making it easy to spot specific issues during a run. Labels appear in - both the run table and run detail views. - - In addition, there is a new format: ``name`` is required, and ``action`` is now a plain string. - For more details, refer to :ref:`log_policies `. 
diff --git a/docs/release-notes/pytorch-tensorboard-plugin.rst b/docs/release-notes/pytorch-tensorboard-plugin.rst deleted file mode 100644 index c0f06b2118c..00000000000 --- a/docs/release-notes/pytorch-tensorboard-plugin.rst +++ /dev/null @@ -1,10 +0,0 @@ -:orphan: - -**Known Issue** - -- PyTorch has `deprecated - ` - their Profiler TensorBoard Plugin (``tb_plugin``), so some features may not be compatible with - PyTorch 2.0 and above. Our current default environment image comes with PyTorch 2.3. If users are - experiencing issues with this plugin, we suggest using an image with a PyTorch version earlier - than 2.0. diff --git a/docs/release-notes/rbac-new-tokenCreator-role.rst b/docs/release-notes/rbac-new-tokenCreator-role.rst deleted file mode 100644 index 0813232b0e9..00000000000 --- a/docs/release-notes/rbac-new-tokenCreator-role.rst +++ /dev/null @@ -1,7 +0,0 @@ -:orphan: - -**New Features** - -- New RBAC role: In the enterprise edition of Determined, add a ``TokenCreator`` RBAC role, which - allows users to create, view, and revoke their own :ref:`access tokens `. This - role can only be assigned globally. diff --git a/docs/release-notes/remove-custom-searcher.rst b/docs/release-notes/remove-custom-searcher.rst deleted file mode 100644 index 3e6c1a642e5..00000000000 --- a/docs/release-notes/remove-custom-searcher.rst +++ /dev/null @@ -1,7 +0,0 @@ -:orphan: - -**Breaking Changes** - -- API: Custom Searcher (including DeepSpeed AutoTune) was deprecated in 0.36.0 and is now removed. - We will maintain first-class support for a variety of preset searchers, which can be easily - configured for any experiment. Visit :ref:`search-methods` for details. 
diff --git a/docs/release-notes/searcher-context-removal.rst b/docs/release-notes/searcher-context-removal.rst deleted file mode 100644 index 74c81a746b2..00000000000 --- a/docs/release-notes/searcher-context-removal.rst +++ /dev/null @@ -1,72 +0,0 @@ -:orphan: - -**Breaking Changes** - -- ASHA: All experiments using ASHA hyperparameter search must now configure ``max_time`` and - ``time_metric`` in the experiment config, instead of ``max_length``. Additionally, training code - must report the configured ``time_metric`` in validation metrics. As a convenience, Determined - training loops now automatically report ``batches`` and ``epochs`` with metrics, which you can - use as your ``time_metric``. ASHA experiments without this modification will no longer run. - -- Custom Searchers: all custom searchers including DeepSpeed Autotune were deprecated in ``0.36.0`` - and are now being removed. Users are encouraged to use a preset searcher, which can be easily - :ref:`configured ` for any experiment. - -**New Features** - -- API: introduce ``keras.DeterminedCallback``, a new high-level training API for TF Keras that - integrates Keras training code with Determined through a single :ref:`Keras Callback - `. - -- API: introduce ``deepspeed.Trainer``, a new high-level training API for DeepSpeedTrial that - allows for Python-side training loop configurations and includes support for local training. - -**Deprecations** - -- Experiment Config: the ``max_length`` field of the searcher configuration section has been - deprecated for all experiments and searchers. Users are expected to configure the desired - training length directly in training code. - -- Experiment Config: the ``optimizations`` config has been deprecated. Please see :ref:`Training - APIs ` to configure supported optimizations through training code directly. - -- Experiment Config: the ``scheduling_unit``, ``min_checkpoint_period``, and - ``min_validation_period`` config fields have been deprecated. 
Instead, these configuration - options should be specified in training code. - -- Experiment Config: the ``entrypoint`` field no longer accepts ``model_def:TrialClass`` as trial - definitions. Please invoke your training script directly (``python3 train.py``). - -- Core API: the ``SearcherContext`` (``core.searcher``) has been deprecated. Training code no - longer requires ``core.searcher.operations`` to run, and progress should be reported through - ``core.train.report_progress``. - -- DeepSpeed: the ``num_micro_batches_per_slot`` and ``train_micro_batch_size_per_gpu`` attributes - on ``DeepSpeedContext`` have been replaced with ``get_train_micro_batch_size_per_gpu()`` and - ``get_num_micro_batches_per_slot()``. - -- Horovod: the horovod distributed training backend has been deprecated. Users are encouraged to - migrate to the native distributed backend of their training framework (``torch.distributed`` or - ``tf.distribute``). - -- Trial APIs: ``TFKerasTrial`` has been deprecated. Users are encouraged to migrate to the new - :ref:`Keras Callback `. - -- Launchers: the ``--trial`` argument in Determined launchers has been deprecated. Please invoke - your training script directly. - -- ASHA: the ``stop_once`` field of the ``searcher`` config for ASHA searchers has been deprecated. - All ASHA searches are now early-stopping based (``stop_once: true``) instead of promotion based. - -- CLI: The ``--test`` and ``--local`` flags for ``det experiment create`` have been deprecated. All - training APIs now support local execution (``python3 train.py``). Please see ``training apis`` - for details specific to your framework. - -- Web UI: previously, trials that reported an ``epoch`` metric enabled an epoch X-axis in the Web - UI metrics tab. This metric name has been changed to ``epochs``, with ``epoch`` as a fallback - option. 
- -**Removed Features** - -- WebUI: "Continue Training" no longer supports configurable number of batches in the Web UI and - will simply resume the trial from the last checkpoint. diff --git a/docs/release-notes/ssh-crypto-system.rst b/docs/release-notes/ssh-crypto-system.rst deleted file mode 100644 index acd54812832..00000000000 --- a/docs/release-notes/ssh-crypto-system.rst +++ /dev/null @@ -1,8 +0,0 @@ -:orphan: - -**Improvements** - -- Master Configuration: Add support for crypto system configuration for ssh connection. - ``security.key_type`` now accepts ``RSA``, ``ECDSA`` or ``ED25519``. Default key type is changed - from ``1024-bit RSA`` to ``ED25519``, since ``ED25519`` keys are faster and more secure than the - old default, and ``ED25519`` is also the default key type for ``ssh-keygen``. diff --git a/docs/release-notes/unsupport-aurora-postgres-reminder.rst b/docs/release-notes/unsupport-aurora-postgres-reminder.rst deleted file mode 100644 index b82c739d064..00000000000 --- a/docs/release-notes/unsupport-aurora-postgres-reminder.rst +++ /dev/null @@ -1,19 +0,0 @@ -:orphan: - -**Deprecations** - -- Cluster: A reminder that Amazon Aurora V1 will reach End of Life at the end of 2024. It is no - longer supported as the default persistent storage for AWS Determined deployments. We recommend - that users migrate to Amazon RDS for PostgreSQL. For more information, visit the `migration - instructions `_. - -- Cluster: After Amazon Aurora V1 reaches End of Life, support for Amazon Aurora V1 in ``det deploy - aws`` will be removed. Future deployments will default to the ``simple-rds`` type, which uses - Amazon RDS for PostgreSQL. Changes to the deployment code will ensure this transition to the new - default. - -- Database: As a follow-up to the earlier notice, PostgreSQL 12 will reach End of Life on November - 14, 2024. Instances still using PostgreSQL 12 or earlier should upgrade to PostgreSQL 13 or later - to maintain compatibility. 
The application will log a warning if it detects a connection to any - PostgreSQL version older than 12, and this warning will be updated to include PostgreSQL 12 once - it is End of Life. From b5b303191c1e00e0f52ea50a44cf3c0c64008e37 Mon Sep 17 00:00:00 2001 From: Anda Date: Fri, 22 Nov 2024 12:41:46 -0800 Subject: [PATCH 2/4] fix dupe --- docs/release-notes.rst | 179 ----------------------------------------- 1 file changed, 179 deletions(-) diff --git a/docs/release-notes.rst b/docs/release-notes.rst index fdf884afa8c..0eea065ba26 100644 --- a/docs/release-notes.rst +++ b/docs/release-notes.rst @@ -126,185 +126,6 @@ Version 0.38.0 **Deprecations** -- Experiment Config: the ``max_length`` field of the searcher configuration section has been - deprecated for all experiments and searchers. Users are expected to configure the desired - training length directly in training code. - -- Experiment Config: the ``optimizations`` config has been deprecated. Please see :ref:`Training - APIs ` to configure supported optimizations through training code directly. - -- Experiment Config: the ``scheduling_unit``, ``min_checkpoint_period``, and - ``min_validation_period`` config fields have been deprecated. Instead, these configuration - options should be specified in training code. - -- Experiment Config: the ``entrypoint`` field no longer accepts ``model_def:TrialClass`` as trial - definitions. Please invoke your training script directly (``python3 train.py``). - -- Core API: the ``SearcherContext`` (``core.searcher``) has been deprecated. Training code no - longer requires ``core.searcher.operations`` to run, and progress should be reported through - ``core.train.report_progress``. - -- DeepSpeed: the ``num_micro_batches_per_slot`` and ``train_micro_batch_size_per_gpu`` attributes - on ``DeepSpeedContext`` have been replaced with ``get_train_micro_batch_size_per_gpu()`` and - ``get_num_micro_batches_per_slot()``. 
- -- Horovod: the horovod distributed training backend has been deprecated. Users are encouraged to - migrate to the native distributed backend of their training framework (``torch.distributed`` or - ``tf.distribute``). - -- Trial APIs: ``TFKerasTrial`` has been deprecated. Users are encouraged to migrate to the new - :ref:`Keras Callback `. - -- Launchers: the ``--trial`` argument in Determined launchers has been deprecated. Please invoke - your training script directly. - -- ASHA: the ``stop_once`` field of the ``searcher`` config for ASHA searchers has been deprecated. - All ASHA searches are now early-stopping based (``stop_once: true``) instead of promotion based. - -- CLI: The ``--test`` and ``--local`` flags for ``det experiment create`` have been deprecated. All - training APIs now support local execution (``python3 train.py``). Please see ``training apis`` - for details specific to your framework. - -- Web UI: previously, trials that reported an ``epoch`` metric enabled an epoch X-axis in the Web - UI metrics tab. This metric name has been changed to ``epochs``, with ``epoch`` as a fallback - option. - -- Cluster: A reminder that Amazon Aurora V1 will reach End of Life at the end of 2024. It is no - longer supported as the default persistent storage for AWS Determined deployments. We recommend - that users migrate to Amazon RDS for PostgreSQL. For more information, visit the `migration - instructions `_. - -- Cluster: After Amazon Aurora V1 reaches End of Life, support for Amazon Aurora V1 in ``det deploy - aws`` will be removed. Future deployments will default to the ``simple-rds`` type, which uses - Amazon RDS for PostgreSQL. Changes to the deployment code will ensure this transition to the new - default. - -- Database: As a follow-up to the earlier notice, PostgreSQL 12 will reach End of Life on November - 14, 2024. Instances still using PostgreSQL 12 or earlier should upgrade to PostgreSQL 13 or later - to maintain compatibility. 
The application will log a warning if it detects a connection to any - PostgreSQL version older than 12, and this warning will be updated to include PostgreSQL 12 once - it is End of Life. - -************** - Version 0.38 -************** - -Version 0.38.0 -============== - -**Release Date:** November 22, 2024 - -**Breaking Changes** - -- ASHA: All experiments using ASHA hyperparameter search must now configure ``max_time`` and - ``time_metric`` in the experiment config, instead of ``max_length``. Additionally, training code - must report the configured ``time_metric`` in validation metrics. As a convenience, Determined - training loops now automatically report ``batches`` and ``epochs`` with metrics, which you can - use as your ``time_metric``. ASHA experiments without this modification will no longer run. - -- Custom Searchers: all custom searchers including DeepSpeed Autotune were deprecated in ``0.36.0`` - and are now being removed. Users are encouraged to use a preset searcher, which can be easily - :ref:`configured ` for any experiment. - -- API: Custom Searcher (including DeepSpeed AutoTune) was deprecated in 0.36.0 and is now removed. - We will maintain first-class support for a variety of preset searchers, which can be easily - configured for any experiment. Visit :ref:`search-methods` for details. - -**New Features** - -- API/CLI: Add support for access tokens. Add the ability to create and administer access tokens - for users to authenticate in automated workflows. Users can define the lifespan of these tokens, - making it easier to securely authenticate and run processes. Users can set global defaults and - limits for the validity of access tokens by configuring ``default_lifespan_days`` and - ``max_lifespan_days`` in the master configuration. Setting ``max_lifespan_days`` to ``-1`` - indicates an **infinite** lifespan for the access token. 
This feature enhances automation while - maintaining strong security protocols by allowing tighter control over token usage and - expiration. This feature requires Determined Enterprise Edition. - - - CLI: - - - ``det token create``: Create a new access token. - - ``det token login``: Sign in with an access token. - - ``det token edit``: Update an access token's description. - - ``det token list``: List all active access tokens, with options for displaying revoked - tokens. - - ``det token describe``: Show details of specific access tokens. - - ``det token revoke``: Revoke an access token. - - - API: - - - ``POST /api/v1/tokens``: Create a new access token. - - ``GET /api/v1/tokens``: Retrieve a list of access tokens. - - ``PATCH /api/v1/tokens/{token_id}``: Edit an existing access token. - -- API: introduce ``keras.DeterminedCallback``, a new high-level training API for TF Keras that - integrates Keras training code with Determined through a single :ref:`Keras Callback - `. - -- API: introduce ``deepspeed.Trainer``, a new high-level training API for DeepSpeedTrial that - allows for Python-side training loop configurations and includes support for local training. - -- Cluster: In the enterprise edition of Determined, add :ref:`config policies ` to - enable administrators to set limits on how users can define workloads (e.g., experiments, - notebooks, TensorBoards, shells, and commands). Administrators can define two types of - configurations: - - - **Invariant Configs for Experiments**: Settings applied to all experiments within a specific - scope (global or workspace). Invariant configs for other tasks (e.g. notebooks, TensorBoards, - shells, and commands) is not yet supported. - - - **Constraints**: Restrictions that prevent users from exceeding resource limits within a - scope. Constraints can be set independently for experiments and tasks. - -- Helm: Support configuring ``determined_master_host``, ``determined_master_port``, and - ``determined_master_scheme``. 
These control how tasks address the Determined API server and are - useful when installations span multiple Kubernetes clusters or there are proxies in between tasks - and the master. Also, ``determined_master_host`` now defaults to the service host, - ``..svc.cluster.local``, instead of the service IP. - -- Helm: Add support for capturing and restoring snapshots of the database persistent volume. Visit - :ref:`helm-config-reference` for more details. - -- New RBAC role: In the enterprise edition of Determined, add a ``TokenCreator`` RBAC role, which - allows users to create, view, and revoke their own :ref:`access tokens `. This - role can only be assigned globally. - -- Experiments: Add a ``name`` field to ``log_policies``. When a log policy matches, its name shows - as a label in the WebUI, making it easy to spot specific issues during a run. Labels appear in - both the run table and run detail views. - - In addition, there is a new format: ``name`` is required, and ``action`` is now a plain string. - For more details, refer to :ref:`log_policies `. - -**Improvements** - -- Master Configuration: Add support for crypto system configuration for ssh connection. - ``security.key_type`` now accepts ``RSA``, ``ECDSA`` or ``ED25519``. Default key type is changed - from ``1024-bit RSA`` to ``ED25519``, since ``ED25519`` keys are faster and more secure than the - old default, and ``ED25519`` is also the default key type for ``ssh-keygen``. - -**Removed Features** - -- WebUI: "Continue Training" no longer supports configurable number of batches in the Web UI and - will simply resume the trial from the last checkpoint. - -**Known Issues** - -- PyTorch has `deprecated - ` - their Profiler TensorBoard Plugin (``tb_plugin``), so some features may not be compatible with - PyTorch 2.0 and above. Our current default environment image comes with PyTorch 2.3. If users are - experiencing issues with this plugin, we suggest using an image with a PyTorch version earlier - than 2.0. 
- -**Bug Fixes** - -- Previously, during a grid search, if a hyperparameter contained an empty nested hyperparameter - (that is, just an empty map), that hyperparameter would not appear in the hparams passed to the - trial. - -**Deprecations** - - Experiment Config: the ``max_length`` field of the searcher configuration section has been deprecated for all experiments and searchers. Users are expected to configure the desired training length directly in training code. From 549b3194fb9a8f8358a0330891bd7c8c9ef92347 Mon Sep 17 00:00:00 2001 From: Anda Date: Fri, 22 Nov 2024 12:45:12 -0800 Subject: [PATCH 3/4] fixes --- docs/release-notes.rst | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-) diff --git a/docs/release-notes.rst b/docs/release-notes.rst index 0eea065ba26..c58fe9c1c4f 100644 --- a/docs/release-notes.rst +++ b/docs/release-notes.rst @@ -148,7 +148,7 @@ Version 0.38.0 on ``DeepSpeedContext`` have been replaced with ``get_train_micro_batch_size_per_gpu()`` and ``get_num_micro_batches_per_slot()``. -- Horovod: the horovod distributed training backend has been deprecated. Users are encouraged to +- Horovod: the Horovod distributed training backend has been deprecated. Users are encouraged to migrate to the native distributed backend of their training framework (``torch.distributed`` or ``tf.distribute``). @@ -169,15 +169,11 @@ Version 0.38.0 UI metrics tab. This metric name has been changed to ``epochs``, with ``epoch`` as a fallback option. -- Cluster: A reminder that Amazon Aurora V1 will reach End of Life at the end of 2024. It is no - longer supported as the default persistent storage for AWS Determined deployments. We recommend - that users migrate to Amazon RDS for PostgreSQL. For more information, visit the `migration - instructions `_. - -- Cluster: After Amazon Aurora V1 reaches End of Life, support for Amazon Aurora V1 in ``det deploy - aws`` will be removed. 
Future deployments will default to the ``simple-rds`` type, which uses - Amazon RDS for PostgreSQL. Changes to the deployment code will ensure this transition to the new - default. +- Database: After Amazon Aurora V1 reaches End of Life, support for Amazon Aurora V1 in ``det + deploy aws`` will be removed. Future deployments will default to the ``simple-rds`` type, which + uses Amazon RDS for PostgreSQL. We recommend that users migrate to Amazon RDS for PostgreSQL. For + more information, visit the `migration instructions + `_. - Database: As a follow-up to the earlier notice, PostgreSQL 12 will reach End of Life on November 14, 2024. Instances still using PostgreSQL 12 or earlier should upgrade to PostgreSQL 13 or later From 1ab80c593794005d7e3004a85644001958ad4c29 Mon Sep 17 00:00:00 2001 From: Anda Date: Fri, 22 Nov 2024 12:48:59 -0800 Subject: [PATCH 4/4] capitalize first letter --- docs/release-notes.rst | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/docs/release-notes.rst b/docs/release-notes.rst index c58fe9c1c4f..c86a80c14a2 100644 --- a/docs/release-notes.rst +++ b/docs/release-notes.rst @@ -23,7 +23,7 @@ Version 0.38.0 training loops now automatically report ``batches`` and ``epochs`` with metrics, which you can use as your ``time_metric``. ASHA experiments without this modification will no longer run. -- Custom Searchers: all custom searchers including DeepSpeed Autotune were deprecated in ``0.36.0`` +- Custom Searchers: All custom searchers including DeepSpeed Autotune were deprecated in ``0.36.0`` and are now being removed. Users are encouraged to use a preset searcher, which can be easily :ref:`configured ` for any experiment. @@ -58,11 +58,11 @@ Version 0.38.0 - ``GET /api/v1/tokens``: Retrieve a list of access tokens. - ``PATCH /api/v1/tokens/{token_id}``: Edit an existing access token. 
-- API: introduce ``keras.DeterminedCallback``, a new high-level training API for TF Keras that +- API: Introduce ``keras.DeterminedCallback``, a new high-level training API for TF Keras that integrates Keras training code with Determined through a single :ref:`Keras Callback `. -- API: introduce ``deepspeed.Trainer``, a new high-level training API for DeepSpeedTrial that +- API: Introduce ``deepspeed.Trainer``, a new high-level training API for DeepSpeedTrial that allows for Python-side training loop configurations and includes support for local training. - Cluster: In the enterprise edition of Determined, add :ref:`config policies ` to @@ -126,46 +126,46 @@ Version 0.38.0 **Deprecations** -- Experiment Config: the ``max_length`` field of the searcher configuration section has been +- Experiment Config: The ``max_length`` field of the searcher configuration section has been deprecated for all experiments and searchers. Users are expected to configure the desired training length directly in training code. -- Experiment Config: the ``optimizations`` config has been deprecated. Please see :ref:`Training +- Experiment Config: The ``optimizations`` config has been deprecated. Please see :ref:`Training APIs ` to configure supported optimizations through training code directly. -- Experiment Config: the ``scheduling_unit``, ``min_checkpoint_period``, and +- Experiment Config: The ``scheduling_unit``, ``min_checkpoint_period``, and ``min_validation_period`` config fields have been deprecated. Instead, these configuration options should be specified in training code. -- Experiment Config: the ``entrypoint`` field no longer accepts ``model_def:TrialClass`` as trial +- Experiment Config: The ``entrypoint`` field no longer accepts ``model_def:TrialClass`` as trial definitions. Please invoke your training script directly (``python3 train.py``). -- Core API: the ``SearcherContext`` (``core.searcher``) has been deprecated. 
Training code no +- Core API: The ``SearcherContext`` (``core.searcher``) has been deprecated. Training code no longer requires ``core.searcher.operations`` to run, and progress should be reported through ``core.train.report_progress``. -- DeepSpeed: the ``num_micro_batches_per_slot`` and ``train_micro_batch_size_per_gpu`` attributes +- DeepSpeed: The ``num_micro_batches_per_slot`` and ``train_micro_batch_size_per_gpu`` attributes on ``DeepSpeedContext`` have been replaced with ``get_train_micro_batch_size_per_gpu()`` and ``get_num_micro_batches_per_slot()``. -- Horovod: the Horovod distributed training backend has been deprecated. Users are encouraged to +- Horovod: The Horovod distributed training backend has been deprecated. Users are encouraged to migrate to the native distributed backend of their training framework (``torch.distributed`` or ``tf.distribute``). - Trial APIs: ``TFKerasTrial`` has been deprecated. Users are encouraged to migrate to the new :ref:`Keras Callback `. -- Launchers: the ``--trial`` argument in Determined launchers has been deprecated. Please invoke +- Launchers: The ``--trial`` argument in Determined launchers has been deprecated. Please invoke your training script directly. -- ASHA: the ``stop_once`` field of the ``searcher`` config for ASHA searchers has been deprecated. +- ASHA: The ``stop_once`` field of the ``searcher`` config for ASHA searchers has been deprecated. All ASHA searches are now early-stopping based (``stop_once: true``) instead of promotion based. - CLI: The ``--test`` and ``--local`` flags for ``det experiment create`` have been deprecated. All training APIs now support local execution (``python3 train.py``). Please see ``training apis`` for details specific to your framework. -- Web UI: previously, trials that reported an ``epoch`` metric enabled an epoch X-axis in the Web +- Web UI: Previously, trials that reported an ``epoch`` metric enabled an epoch X-axis in the Web UI metrics tab. 
This metric name has been changed to ``epochs``, with ``epoch`` as a fallback option.