[WIP] Add support for kserve #877

@@ -60,8 +60,9 @@ Generate specified configuration format for running the AI Model as a service

| Key          | Description                                                                |
| ------------ | -------------------------------------------------------------------------|
| quadlet      | Podman supported container definition for running AI Model under systemd |
| kserve       | Kserve YAML definition for running the AI Model as a kserve service in Kubernetes |
| kube         | Kubernetes YAML definition for running the AI Model as a service |
| quadlet/kube | Kubernetes YAML definition for running the AI Model as a service and Podman supported container definition for running the Kube YAML specified pod under systemd|

#### **--help**, **-h**

@@ -119,7 +120,7 @@ llama.cpp explains this as:

The higher the number is, the more creative the response, but it is more likely to hallucinate when set too high.

Usage: Lower numbers are good for virtual assistants where we need deterministic responses. Higher numbers are good for roleplay or creative tasks like editing stories.
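
For example (hypothetical invocations; the model name and values are only illustrative), a low temperature suits an assistant-style service while a higher one suits creative output:
```
$ ramalama serve --temp 0.1 granite   # near-deterministic responses
$ ramalama serve --temp 0.9 granite   # more varied, creative responses
```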

#### **--tls-verify**=*true*
require HTTPS and verify certificates when contacting OCI registries

@@ -140,6 +141,73 @@ CONTAINER ID IMAGE COMMAND CREATED
```
3f64927f11a5  quay.io/ramalama/ramalama:latest  /usr/bin/ramalama...  17 seconds ago  Up 17 seconds  0.0.0.0:8082->8082/tcp  ramalama_YMPQvJxN97
```

### Generate kserve service off of OCI Model car quay.io/ramalama/granite:1.0
```
./bin/ramalama serve --port 8081 --generate kserve oci://quay.io/ramalama/granite:1.0
Generating kserve runtime file: granite-1.0-kserve-runtime.yaml
Generating kserve file: granite-1.0-kserve.yaml
$ cat granite-1.0-kserve-runtime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: llama.cpp-runtime
  annotations:
    openshift.io/display-name: KServe ServingRuntime for quay.io/ramalama/granite:1.0
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '8081'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: quay.io/ramalama/ramalama:latest
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--port=8081"
        - "--model=/mnt/models"
        - "--served-model-name={.Name}"
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: 8081
          protocol: TCP
$ cat granite-1.0-kserve.yaml
# RamaLama quay.io/ramalama/granite:1.0 AI Model Service
# kubectl create -f to import this kserve file into Kubernetes.
#
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-quay.io/ramalama/granite:1.0
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      storageUri: "oci://quay.io/ramalama/granite:1.0"
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
```
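
Assuming KServe is installed in the target cluster, the generated files can then be imported with kubectl, as the header comment in the generated file suggests (file names taken from the example above):
```
$ kubectl create -f granite-1.0-kserve-runtime.yaml
$ kubectl create -f granite-1.0-kserve.yaml
$ kubectl get inferenceservice
```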

### Generate quadlet service off of HuggingFace granite Model
```
$ ramalama serve --name MyGraniteServer --generate=quadlet granite
```

@@ -0,0 +1,105 @@ (new file)

```python
import os

from ramalama.common import get_env_vars


class Kserve:
    def __init__(self, model, image, args, exec_args):
        self.ai_image = model
        if hasattr(args, "MODEL"):
            self.ai_image = args.MODEL
        self.ai_image = self.ai_image.removeprefix("oci://")
        if args.name:
            self.name = args.name
        else:
            self.name = os.path.basename(self.ai_image)

        self.model = model.removeprefix("oci://")
        self.args = args
        self.exec_args = exec_args
        self.image = image
        self.runtime = args.runtime

    def generate(self):
        env_var_string = ""
        for k, v in get_env_vars().items():
            env_var_string += f"Environment={k}={v}\n"

        _gpu = ""
        if os.getenv("CUDA_VISIBLE_DEVICES") != "":
```

Review comment (bug_risk): GPU env var check may be flawed. The condition `os.getenv("CUDA_VISIBLE_DEVICES") != ""` will return True even if the variable is not set (i.e. returns None). Consider using a check such as `if os.getenv("CUDA_VISIBLE_DEVICES")` to better capture whether the variable is defined and non-empty.
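
A minimal sketch of the check the reviewer suggests (the helper name is hypothetical, not part of the PR):
```python
import os


def detect_gpu_resource() -> str:
    # Truthy checks cover both the unset (None) and empty-string cases.
    if os.getenv("CUDA_VISIBLE_DEVICES"):
        return "nvidia.com/gpu"
    if os.getenv("HIP_VISIBLE_DEVICES"):
        return "amd.com/gpu"
    return ""
```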

```python
            _gpu = 'nvidia.com/gpu'
        elif os.getenv("HIP_VISIBLE_DEVICES") != "":
            _gpu = 'amd.com/gpu'
        if _gpu != "":
            gpu = f'\n          {_gpu}: "1"'

        outfile = self.name + "-kserve-runtime.yaml"
        outfile = outfile.replace(":", "-")
        print(f"Generating kserve runtime file: {outfile}")
        with open(outfile, 'w') as c:
```

Review comment (complexity): Consider using a templating engine like Jinja2 to generate the YAML files, which will reduce code duplication and improve readability. Abstracting YAML creation into a dedicated templating helper (for example Jinja2 templates, or PyYAML with dictionaries) consolidates and reuses the YAML structure. Here is a concise example using Jinja2:

```python
from jinja2 import Template


def create_yaml(template_str, **params):
    return Template(template_str).render(**params)


# Define your runtime YAML template once.
KSERVE_RUNTIME_TMPL = """
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: {{ runtime }}-runtime
  annotations:
    openshift.io/display-name: "KServe ServingRuntime for {{ model }}"
    opendatahub.io/recommended-accelerators: '["{{ gpu }}"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '{{ port }}'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: {{ image }}
      command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
      args: ["--port={{ port }}", "--model=/mnt/models", "--served-model-name={{ name }}"]
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: {{ port }}
          protocol: TCP
"""

# In your generate() method:
yaml_content = create_yaml(
    KSERVE_RUNTIME_TMPL,
    runtime=self.runtime,
    model=self.model,
    gpu=_gpu if _gpu else "",
    port=self.args.port,
    image=self.image,
    name=self.name,
)
with open((self.name + "-kserve-runtime.yaml").replace(":", "-"), 'w') as c:
    c.write(yaml_content)
```

Repeat a similar approach for the second YAML. This not only reduces repetition but also improves readability and maintainability.

```python
            c.write(
                f"""\
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: {self.runtime}-runtime
```

Review comment: I think this is misleading, it produces a `llama.cpp-runtime` name.

```python
  annotations:
    openshift.io/display-name: KServe ServingRuntime for {self.model}
    opendatahub.io/recommended-accelerators: '["{_gpu}"]'
  labels:
    opendatahub.io/dashboard: 'true'
```

Review comment on lines +46 to +50 (the annotations and labels above): Let's remove them for now, they are openshift/openshift ai specific.

```python
spec:
  annotations:
    prometheus.io/port: '{self.args.port}'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: {self.image}
```

Review comment: This code looks wrong; as far as I understand from checking the example code, it will use ramalama as the image and not vLLM.

```python
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--port={self.args.port}"
        - "--model=/mnt/models"
        - "--served-model-name={{.Name}}"
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: {self.args.port}
          protocol: TCP
""")

        outfile = self.name + "-kserve.yaml"
        outfile = outfile.replace(":", "-")
        print(f"Generating kserve file: {outfile}")
        with open(outfile, 'w') as c:
            c.write(
                f"""\
# RamaLama {self.model} AI Model Service
# kubectl create -f to import this kserve file into Kubernetes.
#
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-{self.model}
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      storageUri: "oci://{self.model}"
      resources:
        limits:
          cpu: "6"
          memory: 24Gi{gpu}
```

Review comment (bug_risk): Potential undefined variable `gpu`. If neither CUDA_VISIBLE_DEVICES nor HIP_VISIBLE_DEVICES is set, the variable `gpu` will not be defined before it is used in the f-string. Initializing `gpu` to an empty string by default would prevent a potential NameError.
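
A minimal sketch of the initialization the reviewer describes (the helper name is hypothetical, not part of the PR):
```python
def gpu_limit_snippet(_gpu: str) -> str:
    # Defaulting to "" means the caller never hits a NameError and the
    # generated YAML simply omits the GPU limit when no GPU was detected.
    if _gpu:
        return f'\n          {_gpu}: "1"'
    return ""
```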

```python
        requests:
          cpu: "6"
          memory: 24Gi{gpu}
"""
            )
```

Review comment (typo): "Kserve" should be "KServe".
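
For orientation, a hypothetical standalone driver for the class in this diff; the import path and the `args` attributes are assumptions inferred from the code above, not part of the PR:
```python
from argparse import Namespace

# Assumed module path for the new file in this PR.
from ramalama.kserve import Kserve

# Only the attributes the class reads are provided: MODEL, name, runtime, port.
args = Namespace(
    MODEL="oci://quay.io/ramalama/granite:1.0",
    name=None,            # falls back to the basename of the model reference
    runtime="llama.cpp",  # matches the "llama.cpp-runtime" name in the example
    port=8081,
)

# Writes granite-1.0-kserve-runtime.yaml and granite-1.0-kserve.yaml into the
# current directory, mirroring the documented example output.
Kserve(
    model="oci://quay.io/ramalama/granite:1.0",
    image="quay.io/ramalama/ramalama:latest",
    args=args,
    exec_args=[],
).generate()
```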