Skip to content

Commit 369cec0

Browse files
authored
Merge branch 'master' into llm-ner-test
2 parents eae6937 + 7d81830 commit 369cec0

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

59 files changed

+1803
-302
lines changed

cpp/BUILD.bazel

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -361,7 +361,7 @@ cc_test(
361361
"metric_example.so",
362362
],
363363
linkstatic = True,
364-
tags = ["team:serverless"],
364+
tags = ["team:core"],
365365
deps = [
366366
"ray_cpp_lib",
367367
"@boost//:callable_traits",

doc/source/cluster/kubernetes/user-guides/kuberay-gcs-persistent-ft.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -110,7 +110,7 @@ to set `appendfsync` to `always` so Redis stores all writes immediately.
110110
## Putting it together
111111

112112
Edit
113-
[the full YAML](https://github.com/ray-project/kuberay/blob/master/config/samples/ray-cluster.persistent-redis.yaml)
113+
[the full YAML](https://github.com/ray-project/kuberay/blob/release-1.3/ray-operator/config/samples/ray-cluster.persistent-redis.yaml)
114114
to your satisfaction and apply it:
115115

116116
```

doc/source/data/working-with-llms.rst

Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -243,3 +243,39 @@ Data for the following features and attributes is collected to improve Ray Data
243243

244244
If you would like to opt-out from usage data collection, you can follow :ref:`Ray usage stats <ref-usage-stats>`
245245
to turn it off.
246+
247+
.. _production_guide:
248+
249+
Production guide
250+
--------------------------------------------------
251+
252+
.. _model_cache:
253+
254+
Caching model weight to remote object storage
255+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
256+
257+
While deploying Ray Data LLM to large scale clusters, model loading may be rate
258+
limited by HuggingFace. In this case, you can cache the model to remote object
259+
storage (AWS S3 or Google Cloud Storage) for more stable model loading.
260+
261+
Ray Data LLM provides the following utility to help uploading models to remote object storage.
262+
263+
.. testcode::
264+
265+
# Download model from HuggingFace, and upload to GCS
266+
python -m ray.llm.utils.upload_model \
267+
--model-source facebook/opt-350m \
268+
--bucket-uri gs://my-bucket/path/to/facebook-opt-350m
269+
# Or upload a local custom model to S3
270+
python -m ray.llm.utils.upload_model \
271+
--model-source local/path/to/model \
272+
--bucket-uri s3://my-bucket/path/to/model_name
273+
274+
And later you can use remote object store URI as `model_source` in the config.
275+
276+
.. testcode::
277+
278+
config = vLLMEngineProcessorConfig(
279+
model_source="gs://my-bucket/path/to/facebook-opt-350m", # or s3://my-bucket/path/to/model_name
280+
...
281+
)

doc/source/workflows/comparison.rst

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -21,12 +21,12 @@ Specifying the workflow id allows for resuming of the workflow by its id in case
2121
Other Workflow Engines
2222
----------------------
2323

24-
Note: these comparisons are inspired by the `Serverless workflows comparisons repo <https://github.com/serverlessworkflow/specification/tree/main/comparisons>`__.
24+
Note: these comparisons are inspired by the `Serverless workflows comparisons repo <https://github.com/serverlessworkflow/specification/tree/0.9.x/comparisons>`__.
2525

2626
Argo API Comparison
2727
~~~~~~~~~~~~~~~~~~~
2828

29-
The original source of these comparisons can be `found here <https://github.com/serverlessworkflow/specification/blob/main/comparisons/comparison-argo.md>`__.
29+
The original source of these comparisons can be `found here <https://github.com/serverlessworkflow/specification/blob/0.9.x/comparisons/comparison-argo.md>`__.
3030

3131
Conditionals
3232
^^^^^^^^^^^^
@@ -162,7 +162,7 @@ Trip Booking
162162
Google Cloud Workflows API Comparison
163163
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
164164

165-
The original source of these comparisons can be `found here <https://github.com/serverlessworkflow/specification/blob/main/comparisons/comparison-google-cloud-workflows.md>`__.
165+
The original source of these comparisons can be `found here <https://github.com/serverlessworkflow/specification/blob/0.9.x/comparisons/comparison-google-cloud-workflows.md>`__.
166166

167167
Data Conditional
168168
^^^^^^^^^^^^^^^^

python/ray/_private/label_utils.py

Lines changed: 128 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,128 @@
1+
import json
2+
import re
3+
import yaml
4+
from typing import Dict
5+
6+
import ray._private.ray_constants as ray_constants
7+
8+
# Regex patterns used to validate that labels conform to Kubernetes label syntax rules.
9+
# https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#syntax-and-character-set
10+
11+
# Regex for mandatory name (DNS label) or value
12+
# Examples:
13+
# Valid matches: "a", "label-name", "a-._b", "123", "this_is_a_valid_label"
14+
# Invalid matches: "-abc", "abc-", "my@label", "a" * 64
15+
LABEL_REGEX = re.compile(r"[a-zA-Z0-9]([a-zA-Z0-9_.-]*[a-zA-Z0-9]){0,62}")
16+
17+
# Regex for optional prefix (DNS subdomain)
18+
# Examples:
19+
# Valid matches: "abc", "sub.domain.example", "my-label", "123.456.789"
20+
# Invalid matches: "-abc", "prefix_", "sub..domain", sub.$$.example
21+
LABEL_PREFIX_REGEX = rf"^({LABEL_REGEX.pattern}?(\.{LABEL_REGEX.pattern}?)*)$"
22+
23+
24+
def parse_node_labels_json(labels_json: str) -> Dict[str, str]:
25+
labels = json.loads(labels_json)
26+
if not isinstance(labels, dict):
27+
raise ValueError("The format after deserialization is not a key-value pair map")
28+
for key, value in labels.items():
29+
if not isinstance(key, str):
30+
raise ValueError("The key is not string type.")
31+
if not isinstance(value, str):
32+
raise ValueError(f'The value of the "{key}" is not string type')
33+
34+
return labels
35+
36+
37+
def parse_node_labels_string(labels_str: str) -> Dict[str, str]:
38+
labels = {}
39+
40+
# Remove surrounding quotes if they exist
41+
if len(labels_str) > 1 and labels_str.startswith('"') and labels_str.endswith('"'):
42+
labels_str = labels_str[1:-1]
43+
44+
if labels_str == "":
45+
return labels
46+
47+
# Labels argument should consist of a string of key=value pairs
48+
# separated by commas. Labels follow Kubernetes label syntax.
49+
label_pairs = labels_str.split(",")
50+
for pair in label_pairs:
51+
# Split each pair by `=`
52+
key_value = pair.split("=")
53+
if len(key_value) != 2:
54+
raise ValueError("Label string is not a key-value pair.")
55+
key = key_value[0].strip()
56+
value = key_value[1].strip()
57+
labels[key] = value
58+
59+
# Validate parsed node labels follow expected Kubernetes label syntax
60+
validate_node_label_syntax(labels)
61+
62+
return labels
63+
64+
65+
def parse_node_labels_from_yaml_file(path: str) -> Dict[str, str]:
66+
if path == "":
67+
return {}
68+
with open(path, "r") as file:
69+
# Expects valid YAML content
70+
labels = yaml.safe_load(file)
71+
if not isinstance(labels, dict):
72+
raise ValueError(
73+
"The format after deserialization is not a key-value pair map."
74+
)
75+
for key, value in labels.items():
76+
if not isinstance(key, str):
77+
raise ValueError("The key is not string type.")
78+
if not isinstance(value, str):
79+
raise ValueError(f'The value of "{key}" is not string type.')
80+
81+
# Validate parsed node labels follow expected Kubernetes label syntax
82+
validate_node_label_syntax(labels)
83+
84+
return labels
85+
86+
87+
def validate_node_labels(labels: Dict[str, str]):
88+
if labels is None:
89+
return
90+
for key in labels.keys():
91+
if key.startswith(ray_constants.RAY_DEFAULT_LABEL_KEYS_PREFIX):
92+
raise ValueError(
93+
f"Custom label keys `{key}` cannot start with the prefix "
94+
f"`{ray_constants.RAY_DEFAULT_LABEL_KEYS_PREFIX}`. "
95+
f"This is reserved for Ray defined labels."
96+
)
97+
98+
99+
# TODO (ryanaoleary@): This function will replace `validate_node_labels` after
100+
# the migration from NodeLabelSchedulingPolicy to the Label Selector API is complete.
101+
def validate_node_label_syntax(labels: Dict[str, str]):
102+
if labels is None:
103+
return
104+
for key in labels.keys():
105+
if "/" in key:
106+
prefix, name = key.rsplit("/")
107+
if len(prefix) > 253 or not re.match(LABEL_PREFIX_REGEX, prefix):
108+
raise ValueError(
109+
f"Invalid label key prefix `{prefix}`. Prefix must be a series of DNS labels "
110+
f"separated by dots (.),not longer than 253 characters in total."
111+
)
112+
else:
113+
name = key
114+
if len(name) > 63 or not re.match(LABEL_REGEX, name):
115+
raise ValueError(
116+
f"Invalid label key name `{name}`. Name must be 63 chars or less beginning and ending "
117+
f"with an alphanumeric character ([a-z0-9A-Z]) with dashes (-), underscores (_),"
118+
f"dots (.), and alphanumerics between."
119+
)
120+
value = labels.get(key)
121+
if value is None or value == "":
122+
return
123+
if len(value) > 63 or not re.match(LABEL_REGEX, value):
124+
raise ValueError(
125+
f"Invalid label key value `{value}`. Value must be 63 chars or less beginning and ending "
126+
f"with an alphanumeric character ([a-z0-9A-Z]) with dashes (-), underscores (_),"
127+
f"dots (.), and alphanumerics between."
128+
)

python/ray/_private/node.py

Lines changed: 11 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -26,6 +26,7 @@
2626
from ray._raylet import GcsClient, get_session_key_from_storage
2727
from ray._private.resource_spec import ResourceSpec
2828
from ray._private.services import serialize_config, get_address
29+
from ray._private.resource_isolation_config import ResourceIsolationConfig
2930
from ray._private.utils import (
3031
open_log,
3132
try_to_create_directory,
@@ -89,6 +90,9 @@ def __init__(
8990
self.kernel_fate_share = bool(
9091
spawn_reaper and ray._private.utils.detect_fate_sharing_support()
9192
)
93+
self.resource_isolation_config: ResourceIsolationConfig = (
94+
ray_params.resource_isolation_config
95+
)
9296
self.all_processes: dict = {}
9397
self.removal_lock = threading.Lock()
9498

@@ -1236,7 +1240,6 @@ def start_raylet(
12361240
object_store_memory: int,
12371241
use_valgrind: bool = False,
12381242
use_profiler: bool = False,
1239-
enable_physical_mode: bool = False,
12401243
):
12411244
"""Start the raylet.
12421245
@@ -1320,7 +1323,7 @@ def start_raylet(
13201323
node_name=self._ray_params.node_name,
13211324
webui=self._webui_url,
13221325
labels=self._get_node_labels(),
1323-
enable_physical_mode=enable_physical_mode,
1326+
resource_isolation_config=self.resource_isolation_config,
13241327
)
13251328
assert ray_constants.PROCESS_TYPE_RAYLET not in self.all_processes
13261329
self.all_processes[ray_constants.PROCESS_TYPE_RAYLET] = [process_info]
@@ -1489,6 +1492,7 @@ def start_ray_processes(self):
14891492
# Make sure we don't call `determine_plasma_store_config` multiple
14901493
# times to avoid printing multiple warnings.
14911494
resource_spec = self.get_resource_spec()
1495+
14921496
(
14931497
plasma_directory,
14941498
fallback_directory,
@@ -1500,6 +1504,11 @@ def start_ray_processes(self):
15001504
fallback_directory=self._fallback_directory,
15011505
huge_pages=self._ray_params.huge_pages,
15021506
)
1507+
1508+
# add plasma store memory to the total system reserved memory
1509+
if self.resource_isolation_config.is_enabled():
1510+
self.resource_isolation_config.add_object_store_memory(object_store_memory)
1511+
15031512
self.start_raylet(plasma_directory, fallback_directory, object_store_memory)
15041513
if self._ray_params.include_log_monitor:
15051514
self.start_log_monitor()

python/ray/_private/parameter.py

Lines changed: 13 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -3,10 +3,10 @@
33
from typing import Dict, List, Optional
44

55
import ray._private.ray_constants as ray_constants
6-
from ray._private.utils import (
7-
validate_node_labels,
8-
check_ray_client_dependencies_installed,
9-
)
6+
7+
from ray._private.resource_isolation_config import ResourceIsolationConfig
8+
from ray._private.label_utils import validate_node_labels
9+
from ray._private.utils import check_ray_client_dependencies_installed
1010

1111

1212
logger = logging.getLogger(__name__)
@@ -127,9 +127,8 @@ class RayParams:
127127
session_name: The name of the session of the ray cluster.
128128
webui: The url of the UI.
129129
cluster_id: The cluster ID in hex string.
130-
enable_physical_mode: Whether physical mode is enabled, which applies
131-
constraint to tasks' resource consumption. As of now, only memory resource
132-
is supported.
130+
resource_isolation_config: settings for cgroupv2 based isolation of ray
131+
system processes (defaults to no isolation if config not provided)
133132
"""
134133

135134
def __init__(
@@ -193,7 +192,7 @@ def __init__(
193192
webui: Optional[str] = None,
194193
cluster_id: Optional[str] = None,
195194
node_id: Optional[str] = None,
196-
enable_physical_mode: bool = False,
195+
resource_isolation_config: Optional[ResourceIsolationConfig] = None,
197196
):
198197
self.redis_address = redis_address
199198
self.gcs_address = gcs_address
@@ -257,7 +256,12 @@ def __init__(
257256
self._check_usage()
258257
self.cluster_id = cluster_id
259258
self.node_id = node_id
260-
self.enable_physical_mode = enable_physical_mode
259+
260+
self.resource_isolation_config = resource_isolation_config
261+
if not self.resource_isolation_config:
262+
self.resource_isolation_config = ResourceIsolationConfig(
263+
enable_resource_isolation=False
264+
)
261265

262266
# Set the internal config options for object reconstruction.
263267
if enable_object_reconstruction:

python/ray/_private/ray_constants.py

Lines changed: 27 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -70,13 +70,38 @@ def env_set_by_user(key):
7070
# The default maximum number of bytes to allocate to the object store unless
7171
# overridden by the user.
7272
DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES = env_integer(
73-
"RAY_DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES", 200 * 10**9 # 200 GB
73+
"RAY_DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES", (200) * (10**9) # 200 GB
7474
)
7575
# The default proportion of available memory allocated to the object store
7676
DEFAULT_OBJECT_STORE_MEMORY_PROPORTION = env_float(
7777
"RAY_DEFAULT_OBJECT_STORE_MEMORY_PROPORTION",
7878
0.3,
7979
)
80+
81+
# The following values are only used when resource isolation is enabled
82+
# ===== The default number of bytes to reserve for ray system processes
83+
DEFAULT_SYSTEM_RESERVED_MEMORY_BYTES = env_integer(
84+
"RAY_DEFAULT_DEFAULT_SYSTEM_RESERVED_MEMORY_BYTES", (25) * (10**9)
85+
)
86+
# The default proportion available memory to reserve for ray system processes
87+
DEFAULT_SYSTEM_RESERVED_MEMORY_PROPORTION = env_integer(
88+
"RAY_DEFAULT_SYSTEM_RESERVED_MEMORY_PROPORTION", 0.10
89+
)
90+
# The default number of cpu cores to reserve for ray system processes
91+
DEFAULT_SYSTEM_RESERVED_CPU_CORES = env_float(
92+
"RAY_DEFAULT_SYSTEM_RESERVED_CPU_CORES", 1.0
93+
)
94+
# The default proportion of cpu cores to reserve for ray system processes
95+
DEFAULT_SYSTEM_RESERVED_CPU_PROPORTION = env_float(
96+
"RAY_DEFAULT_SYSTEM_RESERVED_CPU_PROPORTION", 0.05
97+
)
98+
# The smallest number of cores that ray system processes can be guaranteed
99+
MINIMUM_SYSTEM_RESERVED_CPU_CORES = 0.5
100+
# The smallest number of bytes that ray system processes can be guaranteed
101+
MINIMUM_SYSTEM_RESERVED_MEMORY_BYTES = (100) * (10**6)
102+
# The default path for cgroupv2
103+
DEFAULT_CGROUP_PATH = "/sys/fs/cgroup"
104+
80105
# The smallest cap on the memory used by the object store that we allow.
81106
# This must be greater than MEMORY_RESOURCE_UNIT_BYTES
82107
OBJECT_STORE_MINIMUM_MEMORY_BYTES = 75 * 1024 * 1024
@@ -95,7 +120,7 @@ def env_set_by_user(key):
95120
# (see https://github.com/ray-project/ray/issues/20388 for details)
96121
# The workaround here is to limit capacity to 2GB for Mac by default,
97122
# and raise error if the capacity is overwritten by user.
98-
MAC_DEGRADED_PERF_MMAP_SIZE_LIMIT = 2 * 2**30
123+
MAC_DEGRADED_PERF_MMAP_SIZE_LIMIT = (2) * (2**30)
99124
# If a user does not specify a port for the primary Ray service,
100125
# we attempt to start the service running at this port.
101126
DEFAULT_PORT = 6379

0 commit comments

Comments
 (0)