[WIP] Add support for kserve #877

Open · wants to merge 1 commit into base: main
72 changes: 70 additions & 2 deletions docs/ramalama-serve.1.md
@@ -60,8 +60,9 @@ Generate specified configuration format for running the AI Model as a service

| Key | Description |
| ------------ | -------------------------------------------------------------------------|
| quadlet | Podman supported container definition for running AI Model under systemd |
| kserve | Kserve YAML definition for running the AI Model as a kserve service in Kubernetes |
Contributor:

issue (typo): Typo: "Kserve" should be "KServe".

Suggested change
| kserve | Kserve YAML definition for running the AI Model as a kserve service in Kubernetes |
| kserve | KServe YAML definition for running the AI Model as a KServe service in Kubernetes |

Review comment:

Suggested change
| kserve | Kserve YAML definition for running the AI Model as a kserve service in Kubernetes |
| kserve | Kserve YAML definition for running the AI Model as a kserve service in Kubernetes using vLLM |

| kube | Kubernetes YAML definition for running the AI Model as a service |
| quadlet | Podman supported container definition for running AI Model under systemd |
| quadlet/kube | Kubernetes YAML definition for running the AI Model as a service and Podman supported container definition for running the Kube YAML specified pod under systemd|

#### **--help**, **-h**
@@ -119,7 +120,7 @@ llama.cpp explains this as:

The higher the number is the more creative the response is, but more likely to hallucinate when set too high.

Usage: Lower numbers are good for virtual assistants where we need deterministic responses. Higher numbers are good for roleplay or creative tasks like editing stories
Usage: Lower numbers are good for virtual assistants where we need deterministic responses. Higher numbers are good for roleplay or creative tasks like editing stories

#### **--tls-verify**=*true*
require HTTPS and verify certificates when contacting OCI registries
@@ -140,6 +141,73 @@ CONTAINER ID IMAGE COMMAND CREATED
3f64927f11a5 quay.io/ramalama/ramalama:latest /usr/bin/ramalama... 17 seconds ago Up 17 seconds 0.0.0.0:8082->8082/tcp ramalama_YMPQvJxN97
```

### Generate kserve service off of OCI Model car quay.io/ramalama/granite:1.0
```
./bin/ramalama serve --port 8081 --generate kserve oci://quay.io/ramalama/granite:1.0
Generating kserve runtime file: granite-1.0-kserve-runtime.yaml
Generating kserve file: granite-1.0-kserve.yaml
$ cat granite-1.0-kserve-runtime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: llama.cpp-runtime
  annotations:
    openshift.io/display-name: KServe ServingRuntime for quay.io/ramalama/granite:1.0
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '8081'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: quay.io/ramalama/ramalama:latest
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--port=8081"
        - "--model=/mnt/models"
        - "--served-model-name={.Name}"
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: 8081
          protocol: TCP
$ cat granite-1.0-kserve.yaml
# RamaLama quay.io/ramalama/granite:1.0 AI Model Service
# kubectl create -f to import this kserve file into Kubernetes.
#
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-quay.io/ramalama/granite:1.0
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      storageUri: "oci://quay.io/ramalama/granite:1.0"
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
```

### Generate quadlet service off of HuggingFace granite Model
```
$ ramalama serve --name MyGraniteServer --generate=quadlet granite
2 changes: 1 addition & 1 deletion ramalama/cli.py
@@ -859,7 +859,7 @@ def serve_parser(subparsers):
parser.add_argument("--host", default=config.get('host', "0.0.0.0"), help="IP address to listen")
parser.add_argument(
"--generate",
choices=["quadlet", "kube", "quadlet/kube"],
choices=["kserve", "kube", "quadlet", "quadlet/kube", ],
help="generate specified configuration format for running the AI Model as a service",
)
parser.add_argument(
105 changes: 105 additions & 0 deletions ramalama/kserve.py
@@ -0,0 +1,105 @@
import os

from ramalama.common import get_env_vars


class Kserve:
    def __init__(self, model, image, args, exec_args):
        self.ai_image = model
        if hasattr(args, "MODEL"):
            self.ai_image = args.MODEL
        self.ai_image = self.ai_image.removeprefix("oci://")
        if args.name:
            self.name = args.name
        else:
            self.name = os.path.basename(self.ai_image)

        self.model = model.removeprefix("oci://")
        self.args = args
        self.exec_args = exec_args
        self.image = image
        self.runtime = args.runtime

    def generate(self):
        env_var_string = ""
        for k, v in get_env_vars().items():
            env_var_string += f"Environment={k}={v}\n"

        _gpu = ""
        if os.getenv("CUDA_VISIBLE_DEVICES") != "":
Contributor:

issue (bug_risk): GPU env var check may be flawed.

The condition os.getenv("CUDA_VISIBLE_DEVICES") != "" will return True even if the variable is not set (i.e. returns None). Consider using a check such as if os.getenv("CUDA_VISIBLE_DEVICES") to better capture whether the variable is defined and non-empty.
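
A minimal sketch of the check being suggested (illustrative only, not part of this diff):

```python
import os

# os.getenv() returns None when the variable is unset, and None != "" evaluates to
# True, so the original condition always takes the NVIDIA branch. Truthiness covers
# both the unset and the empty-string cases.
_gpu = ""
if os.getenv("CUDA_VISIBLE_DEVICES"):
    _gpu = "nvidia.com/gpu"
elif os.getenv("HIP_VISIBLE_DEVICES"):
    _gpu = "amd.com/gpu"
```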

            _gpu = 'nvidia.com/gpu'
        elif os.getenv("HIP_VISIBLE_DEVICES") != "":
            _gpu = 'amd.com/gpu'
        if _gpu != "":
            gpu = f'\n          {_gpu}: "1"'

        outfile = self.name + "-kserve-runtime.yaml"
        outfile = outfile.replace(":", "-")
        print(f"Generating kserve runtime file: {outfile}")
        with open(outfile, 'w') as c:
Contributor:

issue (complexity): Consider using a templating engine like Jinja2 to generate the YAML files, which will reduce code duplication and improve readability.

Consider abstracting YAML creation into a dedicated templating helper to reduce the inline formatting repetition. For example, you might use Jinja2 templates (or PyYAML with dictionaries) to consolidate and reuse the YAML structure. Here’s a concise example using Jinja2:

from jinja2 import Template

def create_yaml(template_str, **params):
    return Template(template_str).render(**params)

# Define your runtime YAML template once.
KSERVE_RUNTIME_TMPL = """
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: {{ runtime }}-runtime
  annotations:
    openshift.io/display-name: "KServe ServingRuntime for {{ model }}"
    opendatahub.io/recommended-accelerators: '["{{ gpu }}"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '{{ port }}'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: {{ image }}
      command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
      args: ["--port={{ port }}", "--model=/mnt/models", "--served-model-name={{ name }}"]
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: {{ port }}
          protocol: TCP
"""

# In your generate() method:
yaml_content = create_yaml(KSERVE_RUNTIME_TMPL,
                           runtime=self.runtime,
                           model=self.model,
                           gpu=_gpu if _gpu else "",
                           port=self.args.port,
                           image=self.image,
                           name=self.name)
with open((self.name + "-kserve-runtime.yaml").replace(":", "-"), 'w') as c:
    c.write(yaml_content)

Repeat a similar approach for the second YAML. This not only reduces repetition but also improves readability and maintainability.
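
For illustration, a hypothetical call for the second file could mirror the first; `KSERVE_INFERENCE_TMPL` and `gpu_limit` below are assumed names, not anything defined in the PR:

```python
# Hypothetical sketch: reuse the same helper for the InferenceService YAML.
inference_yaml = create_yaml(
    KSERVE_INFERENCE_TMPL,          # assumed second template, analogous to the runtime one
    model=self.model,
    name=self.name,
    gpu_limit=f'\n          {_gpu}: "1"' if _gpu else "",
)
with open((self.name + "-kserve.yaml").replace(":", "-"), "w") as c:
    c.write(inference_yaml)
```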

            c.write(
                f"""\
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: {self.runtime}-runtime

Review comment:

I think this is misleading: it produces a llama.cpp-runtime value, but this is supposed to be vLLM.
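
One way to address this, sketched under the assumption that the KServe path always serves through vLLM (the names below are illustrative, not from the PR):

```python
# Hypothetical: name the runtime after the engine the container actually runs,
# rather than after args.runtime (which defaults to llama.cpp in RamaLama).
serving_engine = "vllm"
runtime_name = f"{serving_engine}-runtime"  # "vllm-runtime" instead of "llama.cpp-runtime"
```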

  annotations:
    openshift.io/display-name: KServe ServingRuntime for {self.model}
    opendatahub.io/recommended-accelerators: '["{_gpu}"]'
  labels:
    opendatahub.io/dashboard: 'true'
Comment on lines +46 to +50

Suggested change
  annotations:
    openshift.io/display-name: KServe ServingRuntime for {self.model}
    opendatahub.io/recommended-accelerators: '["{_gpu}"]'
  labels:
    opendatahub.io/dashboard: 'true'

Let's remove them for now; they are OpenShift / OpenShift AI specific.
Sorry, I should have excluded them from the example I shared.

spec:
  annotations:
    prometheus.io/port: '{self.args.port}'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: {self.image}

Review comment:

This code looks wrong; as far as I understand from checking the example code, it will use ramalama as the image and not vLLM.
Moreover, vLLM is accelerator-specific, so the same if/else condition used to produce the GPU requirements should be applied here to select the right image.
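
A rough sketch of what this comment describes; the image references are placeholders, not images the PR defines:

```python
# Hypothetical: choose a vLLM image matching the detected accelerator instead of
# reusing self.image (the RamaLama image). Names and tags below are placeholders.
if _gpu == "nvidia.com/gpu":
    kserve_image = "quay.io/example/vllm:cuda"
elif _gpu == "amd.com/gpu":
    kserve_image = "quay.io/example/vllm:rocm"
else:
    kserve_image = self.image  # fall back to the default image
```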

      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--port={self.args.port}"
        - "--model=/mnt/models"
        - "--served-model-name={{.Name}}"
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: {self.args.port}
          protocol: TCP
""")

        outfile = self.name + "-kserve.yaml"
        outfile = outfile.replace(":", "-")
        print(f"Generating kserve file: {outfile}")
        with open(outfile, 'w') as c:
            c.write(
                f"""\
# RamaLama {self.model} AI Model Service
# kubectl create -f to import this kserve file into Kubernetes.
#
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-{self.model}
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      storageUri: "oci://{self.model}"
      resources:
        limits:
          cpu: "6"
          memory: 24Gi{gpu}
Contributor:

issue (bug_risk): Potential undefined variable 'gpu'.

If neither CUDA_VISIBLE_DEVICES nor HIP_VISIBLE_DEVICES is set, the variable 'gpu' will not be defined before it's used in the f-string. Initializing 'gpu' to an empty string by default would prevent a potential NameError.
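
A minimal sketch of the suggested fix:

```python
# Give gpu a default so the f-string below never raises a NameError when neither
# CUDA_VISIBLE_DEVICES nor HIP_VISIBLE_DEVICES selects an accelerator.
gpu = ""
if _gpu != "":
    gpu = f'\n          {_gpu}: "1"'
```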

        requests:
          cpu: "6"
          memory: 24Gi{gpu}
"""
            )
27 changes: 17 additions & 10 deletions ramalama/model.py
@@ -17,6 +17,7 @@
from ramalama.kube import Kube
from ramalama.model_inspect import GGUFModelInfo, ModelInfoBase
from ramalama.quadlet import Quadlet
from ramalama.kserve import Kserve
from ramalama.version import version

MODEL_TYPES = ["file", "https", "http", "oci", "huggingface", "hf", "ollama"]
@@ -360,7 +361,6 @@ def get_model_path(self, args):

        if args.dryrun:
            return "/path/to/model"

        model_path = self.pull(args)

        return model_path
@@ -486,16 +486,15 @@ def handle_runtime(self, args, exec_args, exec_model_path):

    def generate_container_config(self, model_path, args, exec_args):
        self.image = self._image(args)
        if args.generate == "kserve":
            return self.kserve(model_path, args, exec_args)
        if args.generate == "quadlet":
            self.quadlet(model_path, args, exec_args)
        elif args.generate == "kube":
            self.kube(model_path, args, exec_args)
        elif args.generate == "quadlet/kube":
            self.quadlet_kube(model_path, args, exec_args)
        else:
            return False

        return True
            return self.quadlet(model_path, args, exec_args)
        if args.generate == "kube":
            return self.kube(model_path, args, exec_args)
        if args.generate == "quadlet/kube":
            return self.quadlet_kube(model_path, args, exec_args)
        return False

    def execute_command(self, model_path, exec_args, args):
        try:
@@ -526,19 +525,27 @@ def serve(self, args):

        self.execute_command(model_path, exec_args, args)

    def kserve(self, model, args, exec_args):
        kserve = Kserve(model, self.image, args, exec_args)
        kserve.generate()
        return True

    def quadlet(self, model, args, exec_args):
        quadlet = Quadlet(model, self.image, args, exec_args)
        quadlet.generate()
        return True

    def quadlet_kube(self, model, args, exec_args):
        kube = Kube(model, self.image, args, exec_args)
        kube.generate()
        quadlet = Quadlet(model, self.image, args, exec_args)
        quadlet.kube()
        return True

    def kube(self, model, args, exec_args):
        kube = Kube(model, self.image, args, exec_args)
        kube.generate()
        return True

    def path(self, args):
        return self.model_path(args)