
Conversation

@dkennetzoracle commented Oct 21, 2025

What does this PR do?

Adds OCI GenAI PaaS models for the OpenAI chat completions endpoint.

Test Plan

In an OCI tenancy with access to GenAI PaaS, perform the following steps:

  1. Ensure you have IAM policies in place to use the service (check the docs included in this PR)
  2. For local development, set up the OCI CLI and configure it with your region, tenancy, and auth
  3. Once configured, go through llama-stack setup and run llama-stack (uses config-based auth) like:
OCI_AUTH_TYPE=config_file \
OCI_CLI_PROFILE=CHICAGO \
OCI_REGION=us-chicago-1 \
OCI_COMPARTMENT_OCID=ocid1.compartment.oc1..aaaaaaaa5...5a \
llama stack run oci
  4. Hit the models endpoint to list models after the server is running:
curl http://localhost:8321/v1/models | jq
...
{
      "identifier": "meta.llama-4-scout-17b-16e-instruct",
      "provider_resource_id": "ocid1.generativeaimodel.oc1.us-chicago-1.am...q",
      "provider_id": "oci",
      "type": "model",
      "metadata": {
        "display_name": "meta.llama-4-scout-17b-16e-instruct",
        "capabilities": [
          "CHAT"
        ],
        "oci_model_id": "ocid1.generativeaimodel.oc1.us-chicago-1.a...q"
      },
      "model_type": "llm"
},
   ...
  1. Use the "display_name" field to use the model in a /chat/completions request:
# Streaming result
curl -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta.llama-4-scout-17b-16e-instruct",
    "stream": true,
    "temperature": 0.9,
    "messages": [
      {
        "role": "system",
        "content": "You are a funny comedian. You can be crass."
      },
      {
        "role": "user",
        "content": "Tell me a funny joke about programming."
      }
    ]
}'

# Non-streaming result
curl -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta.llama-4-scout-17b-16e-instruct",
    "stream": false,
    "temperature": 0.9,
    "messages": [
      {
        "role": "system",
        "content": "You are a funny comedian. You can be crass."
      },
      {
        "role": "user",
        "content": "Tell me a funny joke about programming."
      }
    ]
}'
  6. Try out other models from the /models endpoint.
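
The curl requests in step 5 can also be issued with the OpenAI Python SDK pointed at the local llama-stack server (a minimal sketch; the placeholder API key is required by the SDK but unused here, since auth is handled server-side):

from openai import OpenAI

# Point the standard OpenAI client at the llama-stack server.
client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

response = client.chat.completions.create(
    model="meta.llama-4-scout-17b-16e-instruct",
    temperature=0.9,
    messages=[
        {"role": "system", "content": "You are a funny comedian. You can be crass."},
        {"role": "user", "content": "Tell me a funny joke about programming."},
    ],
)
print(response.choices[0].message.content)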

@meta-cla bot added the CLA Signed label · Oct 21, 2025
@dkennetzoracle changed the title from "Oci inference provider" to "feat: add oci genai service as chat inference provider" · Oct 21, 2025
@ashwinb (Contributor) commented Oct 21, 2025

@github-actions run precommit

@github-actions (Contributor)

⏳ Running pre-commit hooks on PR #3876...

@github-actions (Contributor)

✅ Pre-commit hooks completed successfully!

🔧 Changes have been committed and pushed to the PR branch.

@dkennetzoracle (Author)

Removing docs additions at the request of @raghotham.

@ashwinb (Contributor) commented Oct 22, 2025

cc @mattf for a review since this touches the inference system

@dkennetzoracle (Author)

Any updates here?

@mattf (Collaborator) left a comment

@dkennetzoracle -

  1. Does OCI provide an OpenAI-compatible endpoint?
  2. Please include output of the inference tests against the remote::oci provider.

@ashwinb (Contributor) commented Oct 27, 2025

I think it would be much preferable if we could work against an OpenAI-compatible endpoint. Otherwise, at the very least we need a set of recorded tests against the provider. But before recordings, let's make sure the tests at least pass "live". Here's a command to run (roughly):

pytest -sv tests/integration/inference/  \
   --stack-config <your_distro> \
   --text-model <oci/...>   \
   --embedding-model sentence-transformers/nomic-ai/... \
   --inference-mode live

@dkennetzoracle (Author)

@leseb @ashwinb @mattf thanks for reviewing. If it would be strongly preferable for me to use an OpenAI-compatible endpoint, I can make those changes. I'll refactor and re-request when this is done.

Sorry also, I started the PR a few weeks ago before a conference, and when I got back the inference providers had changed significantly, though it seems for the better. I'll align with the changes and re-request!

@dkennetzoracle force-pushed the oci_inference_provider branch 2 times, most recently from 32c0f3b to 62eed3b · November 5, 2025 09:45
@dkennetzoracle (Author)

Hi @leseb @ashwinb @raghotham, I've updated the OCI inference provider to use OpenAI-compatible endpoints. It isn't quite using the "mixin"; instead, I extend OpenAI's base client to use our auth, so the calls are made directly from the OpenAI base client. I still needed the ModelRegistryHelper to list models, since our API endpoint doesn't provide /models. Additionally, embeddings aren't supported: the service URL I've used to support the OpenAI-compatible endpoints shadows an internal API that calls a "chat" API, which leaves embeddings unsupported.

However, it does work for chat_completions, and I've got it skipping tests that should not pass.
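
For context, the shape of that approach is roughly the following. This is a minimal sketch, not the PR's actual code: OCISigningAuth and sign_oci_request are illustrative names, and the real signing logic lives in auth.py.

import httpx
from openai import OpenAI

class OCISigningAuth(httpx.Auth):
    # httpx auth hook that applies OCI request signing in place of an API key.
    def __init__(self, signer):
        self.signer = signer  # e.g. built from config-file or instance-principal auth

    def auth_flow(self, request: httpx.Request):
        sign_oci_request(self.signer, request)  # placeholder for the actual signing step
        yield request

class OciOpenAIClient(OpenAI):
    # OpenAI base client pointed at the OCI GenAI endpoint, with signing
    # injected through a custom http_client rather than an API key.
    def __init__(self, base_url: str, signer):
        super().__init__(
            base_url=base_url,
            api_key="unused",  # required by the SDK; auth happens via request signing
            http_client=httpx.Client(auth=OCISigningAuth(signer)),
        )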

To test, I've run:

OCI_COMPARTMENT_OCID="ocid1.compartment.oc1..aaaaaaaa5rwhi5.......5a" OCI_REGION="us-chicago-1" OCI_AUTH_TYPE=instance_principal pytest -sv tests/integration/inference/ --stack-config oci --text-model meta.llama-3.3-70b-instruct --inference-mode live

Which results in (the warning is that the OTEL collector is not set):

===== 18 passed, 68 skipped, 1 warning in 16.92s =====

Additionally, a functional run (this returns many more models than shown):

curl localhost:8321/v1/models | jq
...
{
      "id": "meta.llama-3.3-70b-instruct",
      "object": "model",
      "created": 1762335531,
      "owned_by": "llama_stack",
      "custom_metadata": {
        "model_type": "llm",
        "provider_id": "oci",
        "provider_resource_id": "ocid1.generativeaimodel.oc1.us-chicago-1.amaaaaaask77ujdnqfjq6hzo2loq",
        "display_name": "meta.llama-3.3-70b-instruct",
        "capabilities": [
          "CHAT"
        ],
        "oci_model_id": "ocid1.generativeaimodel.oc1.us-chicago-1.amaaaaaask7dceyajqi2pl7ujdnqfjq6hzo2loq"
      }
},
{
      "id": "openai.gpt-4o",
      "object": "model",
      "created": 1762335531,
      "owned_by": "llama_stack",
      "custom_metadata": {
        "model_type": "llm",
        "provider_id": "oci",
        "provider_resource_id": "ocid1.generativeaimodel.oc1.us-chicago-1.amaaaaaask6n7caii5lnvcpjlwr2s2q",
        "display_name": "openai.gpt-4o",
        "capabilities": [
          "CHAT"
        ],
        "oci_model_id": "ocid1.generativeaimodel.oc1.us-chicago-1.amaaaaaask7dceya663lnvcpjlwr2s2q"
      }
},
...

And using each of these models:

curl -X POST http://localhost:8321/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "meta.llama-3.3-70b-instruct", "messages": [{"role": "user", "content": "Write a haiku about gpus on the cloud."}]}' | jq

{
  "id": "1c653419cc3146448e5c6d3771ff1ba9",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "Virtual cores rise\nCloud GPUs softly humming\nRemote power unfurls",
        "refusal": null,
        "role": "assistant",
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": null
      },
      "matched_stop": 128009
    }
  ],
  "created": 1762335708,
  "model": "meta.llama-3.3-70b-instruct",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 14,
    "prompt_tokens": 46,
    "total_tokens": 60,
    "completion_tokens_details": null,
    "prompt_tokens_details": null
  },
  "metrics": [
    {
      "trace_id": "52ce0c3ea7c416e3744f722509107e5a",
      "span_id": "28bc5e4e963c8f9d",
      "timestamp": "2025-11-05T09:41:48.294940Z",
      "attributes": {
        "model_id": "meta.llama-3.3-70b-instruct",
        "provider_id": "oci"
      },
      "type": "metric",
      "metric": "prompt_tokens",
      "value": 46,
      "unit": "tokens"
    },
    {
      "trace_id": "52ce0c3ea7c416e3744f722509107e5a",
      "span_id": "28bc5e4e963c8f9d",
      "timestamp": "2025-11-05T09:41:48.294952Z",
      "attributes": {
        "model_id": "meta.llama-3.3-70b-instruct",
        "provider_id": "oci"
      },
      "type": "metric",
      "metric": "completion_tokens",
      "value": 14,
      "unit": "tokens"
    },
    {
      "trace_id": "52ce0c3ea7c416e3744f722509107e5a",
      "span_id": "28bc5e4e963c8f9d",
      "timestamp": "2025-11-05T09:41:48.294955Z",
      "attributes": {
        "model_id": "meta.llama-3.3-70b-instruct",
        "provider_id": "oci"
      },
      "type": "metric",
      "metric": "total_tokens",
      "value": 60,
      "unit": "tokens"
    }
  ]
}

And:

curl -X POST http://localhost:8321/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "openai.gpt-4o", "messages": [{"role": "user", "content": "Write a haiku about gpus on the cloud."}]}' | jq

{
  "id": "chatcmpl-CYUKuk9bRrbwovof1PAMhiZgAEVCV",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "Silicon whispers,  \nIn cloud's embrace, power soars—  \nDreams rendered in light.",
        "refusal": null,
        "role": "assistant",
        "annotations": [],
        "audio": null,
        "function_call": null,
        "tool_calls": null
      }
    }
  ],
  "created": 1762335800,
  "model": "openai.gpt-4o",
  "object": "chat.completion",
  "service_tier": "default",
  "system_fingerprint": "fp_cbf1785567",
  "usage": {
    "completion_tokens": 21,
    "prompt_tokens": 18,
    "total_tokens": 39,
    "completion_tokens_details": {
      "accepted_prediction_tokens": 0,
      "audio_tokens": 0,
      "reasoning_tokens": 0,
      "rejected_prediction_tokens": 0
    },
    "prompt_tokens_details": {
      "audio_tokens": 0,
      "cached_tokens": 0
    }
  },
  "metrics": [
    {
      "trace_id": "0c14652cb02eb4e408cfbc8bf7db4a57",
      "span_id": "caba87321dc9a687",
      "timestamp": "2025-11-05T09:43:21.490856Z",
      "attributes": {
        "model_id": "openai.gpt-4o",
        "provider_id": "oci"
      },
      "type": "metric",
      "metric": "prompt_tokens",
      "value": 18,
      "unit": "tokens"
    },
    {
      "trace_id": "0c14652cb02eb4e408cfbc8bf7db4a57",
      "span_id": "caba87321dc9a687",
      "timestamp": "2025-11-05T09:43:21.490868Z",
      "attributes": {
        "model_id": "openai.gpt-4o",
        "provider_id": "oci"
      },
      "type": "metric",
      "metric": "completion_tokens",
      "value": 21,
      "unit": "tokens"
    },
    {
      "trace_id": "0c14652cb02eb4e408cfbc8bf7db4a57",
      "span_id": "caba87321dc9a687",
      "timestamp": "2025-11-05T09:43:21.490871Z",
      "attributes": {
        "model_id": "openai.gpt-4o",
        "provider_id": "oci"
      },
      "type": "metric",
      "metric": "total_tokens",
      "value": 39,
      "unit": "tokens"
    }
  ]
}

@dkennetzoracle requested review from leseb and mattf · November 5, 2025 09:47
@ashwinb (Contributor) left a comment

This looks clean to me. Not using OpenAIMixin directly here also seems reasonable given the extensive customizations, but others may disagree. I have a couple of small inline comments. Also, please make pre-commit green.

@mattf (Collaborator) left a comment

  • build_oci_model_entries can be OpenAIMixin.list_provider_model_ids
  • OciOpenAIClient can be set up via OpenAIMixin.get_extra_client_params

@dkennetzoracle (Author) commented Nov 7, 2025

  • build_oci_model_entries can be OpenAIMixin.list_provider_model_ids
  • OciOpenAIClient can be set up via OpenAIMixin.get_extra_client_params

I'll get these in, thanks.
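
A rough sketch of where those hooks could land (the method names come from the review above; the import path, signatures, and the _fetch_oci_models / _signing_http_client helpers are placeholders, not the final code):

from typing import Any

from llama_stack.providers.utils.inference.openai_mixin import OpenAIMixin

class OCIInferenceAdapter(OpenAIMixin):
    async def list_provider_model_ids(self) -> list[str]:
        # Replaces build_oci_model_entries: list model identifiers from the
        # OCI GenAI control plane, since the endpoint has no /models route.
        return [m.display_name for m in await self._fetch_oci_models()]

    def get_extra_client_params(self) -> dict[str, Any]:
        # Replaces the custom OciOpenAIClient subclass: hand the mixin a
        # request-signing http client when it constructs its OpenAI client.
        return {"http_client": self._signing_http_client()}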

@dkennetzoracle (Author)

@mattf - made the requested changes, thanks for pointing me in the right direction!

@mattf (Collaborator) left a comment

looking good. please clean up auth.py and models.py.

fyi, you're including rerank models, but aren't implementing a rerank interface.

@dkennetzoracle
Copy link
Author

looking good. please clean up auth.py and models.py.

fyi, you're including rerank models, but aren't implementing a rerank interface.

@mattf Good deal. I actually use auth.py for client authentication: I need it for OCI request signing, since we don't use API keys as the auth in the request, and I don't think it makes sense to smoosh it into the other files.

If you are aligned on auth.py: I've removed models.py and removed rerank from the models list. I can follow up with embeddings and rerank in a future PR.

@dkennetzoracle requested a review from mattf · November 8, 2025 22:02