
[WIP] Add support for kserve #877

Open · wants to merge 1 commit into main
Conversation

@rhatdan (Member) commented on Feb 24, 2025

Summary by Sourcery

This pull request introduces support for generating KServe configurations, allowing users to deploy AI models on Kubernetes using the KServe framework. It adds a new kserve option to the ramalama serve --generate command, which generates the necessary YAML files for deploying a model as a KServe service.

New Features:

  • Adds support for generating KServe YAML definitions for running AI models as a service in Kubernetes, enabling deployment and management of models using the KServe framework.

Enhancements:

  • The ramalama serve command now accepts a --generate kserve option to generate KServe YAML files.
  • The generated KServe YAML files include resource requests and limits for CPU, memory, and GPU (if available).
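
For example, a hypothetical invocation of the new option (the model reference is a placeholder; the generated file names depend on the model name):

    ramalama serve --generate kserve <model>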

sourcery-ai bot (Contributor) commented Feb 24, 2025

Reviewer's Guide by Sourcery

This pull request adds support for generating KServe YAML definitions, enabling users to deploy AI models as KServe services within a Kubernetes environment. It introduces a new Kserve class responsible for generating the necessary YAML files and integrates it into the existing ramalama serve command-line interface.

Sequence diagram for generating KServe configuration

sequenceDiagram
    participant CLI
    participant Model
    participant Kserve

    CLI->>Model: serve(args)
    Model->>Model: execute_command(model_path, exec_args, args)
    alt args.generate == "kserve"
        Model->>Model: kserve(model_path, args, exec_args)
        Model->>Kserve: __init__(model_path, image, args, exec_args)
        Kserve->>Kserve: generate()
        Kserve-->>Model: True
        Model-->>CLI: None
    else
        Model-->>CLI: None
    end
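
To make the diagrammed flow concrete, here is a minimal sketch of the dispatch it describes. Names follow the diagram; the actual structure in ramalama/model.py may differ, and the image attribute is an assumption:

def execute_command(self, model_path, exec_args, args):
    if args.generate == "kserve":
        # Generate KServe YAML instead of starting the service.
        return self.kserve(model_path, args, exec_args)
    # ... otherwise run the model service as before

def kserve(self, model_path, args, exec_args):
    ks = Kserve(model_path, self.image, args, exec_args)  # image attribute assumed
    return ks.generate()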

File-Level Changes

Change: Introduces support for generating KServe YAML definitions for deploying AI models as KServe services in Kubernetes.

Details:
  • Adds 'kserve' as a choice to the --generate argument in the serve command.
  • Creates a new Kserve class to handle the generation of KServe YAML files.
  • Implements the generate method in the Kserve class to create the necessary YAML files for deploying a model with KServe.
  • Adds logic to the generate_container_config method to call the new KServe functionality.
  • Adds documentation for generating a KServe service.

Files:
  • docs/ramalama-serve.1.md
  • ramalama/model.py
  • ramalama/cli.py
  • ramalama/kserve.py


@sourcery-ai bot left a comment

Hey @rhatdan - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Consider generating the KServe runtime file only when necessary, instead of every time.
  • The KServe code duplicates some logic from other modules; consider refactoring to share code.

Here's what I looked at during the review:
  • 🟡 General issues: 3 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good


env_var_string += f"Environment={k}={v}\n"

_gpu = ""
if os.getenv("CUDA_VISIBLE_DEVICES") != "":

issue (bug_risk): GPU env var check may be flawed.

The condition os.getenv("CUDA_VISIBLE_DEVICES") != "" will return True even if the variable is not set (i.e. returns None). Consider using a check such as if os.getenv("CUDA_VISIBLE_DEVICES") to better capture whether the variable is defined and non-empty.
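
A minimal sketch of the suggested check (the accelerator value is illustrative, not necessarily what the PR writes):

import os

# Truthy only when CUDA_VISIBLE_DEVICES is set AND non-empty;
# os.getenv() returns None when the variable is not set at all.
if os.getenv("CUDA_VISIBLE_DEVICES"):
    _gpu = "nvidia.com/gpu"  # illustrative value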

resources:
  limits:
    cpu: "6"
    memory: 24Gi{gpu}

issue (bug_risk): Potential undefined variable 'gpu'.

If neither CUDA_VISIBLE_DEVICES nor HIP_VISIBLE_DEVICES is set, the variable 'gpu' will not be defined before it's used in the f-string. Initializing 'gpu' to an empty string by default would prevent a potential NameError.
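
A minimal sketch of that fix (resource names and indentation are illustrative):

gpu = ""  # default, so the f-string below never raises NameError
if os.getenv("CUDA_VISIBLE_DEVICES"):
    gpu = '\n        nvidia.com/gpu: "1"'
elif os.getenv("HIP_VISIBLE_DEVICES"):
    gpu = '\n        amd.com/gpu: "1"'
# ... later interpolated, e.g. f"memory: 24Gi{gpu}"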

@@ -60,8 +60,9 @@ Generate specified configuration format for running the AI Model as a service

| Key | Description |
| ------------ | -------------------------------------------------------------------------|
| quadlet | Podman supported container definition for running AI Model under systemd |
| kserve | Kserve YAML definition for running the AI Model as a kserve service in Kubernetes |

issue (typo): Typo: "Kserve" should be "KServe".

Suggested change
| kserve | Kserve YAML definition for running the AI Model as a kserve service in Kubernetes |
| kserve | KServe YAML definition for running the AI Model as a KServe service in Kubernetes |

outfile = self.name + "-kserve-runtime.yaml"
outfile = outfile.replace(":", "-")
print(f"Generating kserve runtime file: {outfile}")
with open(outfile, 'w') as c:

issue (complexity): Consider using a templating engine like Jinja2 to generate the YAML files, which will reduce code duplication and improve readability.

Consider abstracting YAML creation into a dedicated templating helper to reduce the inline formatting repetition. For example, you might use Jinja2 templates (or PyYAML with dictionaries) to consolidate and reuse the YAML structure. Here’s a concise example using Jinja2:

from jinja2 import Template

def create_yaml(template_str, **params):
    return Template(template_str).render(**params)

# Define your runtime YAML template once.
KSERVE_RUNTIME_TMPL = """
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: {{ runtime }}-runtime
  annotations:
    openshift.io/display-name: "KServe ServingRuntime for {{ model }}"
    opendatahub.io/recommended-accelerators: '["{{ gpu }}"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '{{ port }}'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: {{ image }}
      command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
      args: ["--port={{ port }}", "--model=/mnt/models", "--served-model-name={{ name }}"]
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: {{ port }}
          protocol: TCP
"""

# In your generate() method:
yaml_content = create_yaml(KSERVE_RUNTIME_TMPL,
                           runtime=self.runtime,
                           model=self.model,
                           gpu=_gpu if _gpu else "",
                           port=self.args.port,
                           image=self.image,
                           name=self.name)
with open((self.name + "-kserve-runtime.yaml").replace(":", "-"), 'w') as c:
    c.write(yaml_content)

Repeat a similar approach for the second YAML. This not only reduces repetition but also improves readability and maintainability.
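
The PyYAML alternative mentioned above would look roughly like this; a sketch only, with the document structure copied from the template and yaml.safe_dump doing the formatting:

import yaml

def write_runtime_yaml(outfile, name, image, port):
    doc = {
        "apiVersion": "serving.kserve.io/v1alpha1",
        "kind": "ServingRuntime",
        "metadata": {"name": f"{name}-runtime"},
        "spec": {
            "multiModel": False,
            "supportedModelFormats": [{"autoSelect": True, "name": "vLLM"}],
            "containers": [{
                "name": "kserve-container",
                "image": image,
                "command": ["python", "-m", "vllm.entrypoints.openai.api_server"],
                "args": [f"--port={port}", "--model=/mnt/models", f"--served-model-name={name}"],
                "ports": [{"containerPort": port, "protocol": "TCP"}],
            }],
        },
    }
    with open(outfile, "w") as f:
        yaml.safe_dump(doc, f, sort_keys=False)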

Signed-off-by: Daniel J Walsh <[email protected]>
Comment on lines +46 to +50
  annotations:
    openshift.io/display-name: KServe ServingRuntime for {self.model}
    opendatahub.io/recommended-accelerators: '["{_gpu}"]'
  labels:
    opendatahub.io/dashboard: 'true'


Suggested change
  annotations:
    openshift.io/display-name: KServe ServingRuntime for {self.model}
    opendatahub.io/recommended-accelerators: '["{_gpu}"]'
  labels:
    opendatahub.io/dashboard: 'true'

Let's remove them for now; they are OpenShift/OpenShift AI specific.
Sorry, I should have excluded them from the example I shared.

      name: vLLM
  containers:
    - name: kserve-container
      image: {self.image}


This code looks wrong: as far as I understand from the example code, it will use the ramalama image rather than a vLLM image.
Moreover, the vLLM image is accelerator-specific, so the same if/else logic used to produce the GPU requirements should be applied here to select the right image.
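
A hedged sketch of that suggestion (all image references below are placeholders, not images the PR actually ships):

import os

# Select a vLLM image matching the detected accelerator; fall back to a CPU build.
if os.getenv("CUDA_VISIBLE_DEVICES"):
    image = "<vllm-cuda-image>"
elif os.getenv("HIP_VISIBLE_DEVICES"):
    image = "<vllm-rocm-image>"
else:
    image = "<vllm-cpu-image>"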

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: {self.runtime}-runtime


I think this is misleading: it produces a llama.cpp-runtime value, but this runtime is supposed to be vLLM.

@@ -60,8 +60,9 @@ Generate specified configuration format for running the AI Model as a service

| Key | Description |
| ------------ | -------------------------------------------------------------------------|
| quadlet | Podman supported container definition for running AI Model under systemd |
| kserve | Kserve YAML definition for running the AI Model as a kserve service in Kubernetes |


Suggested change
| kserve | Kserve YAML definition for running the AI Model as a kserve service in Kubernetes |
| kserve | Kserve YAML definition for running the AI Model as a kserve service in Kubernetes using vLLM |
