An external provider for Llama Stack allowing for the use of RamaLama for inference.
You can install ramalama-stack from PyPI via:
pip install ramalama-stack
This will also install Llama Stack and RamaLama if they are not already installed.
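For example, a minimal install sketch, assuming a recent Python 3 is available (the virtual environment is optional and the directory name is arbitrary):

```bash
# Optional: isolate the install in a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install ramalama-stack (pulls in Llama Stack and RamaLama if missing)
pip install ramalama-stack
```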
Warning
The following workaround is currently needed to run this provider - see #53 for more details:
curl --create-dirs --output ~/.llama/providers.d/remote/inference/ramalama.yaml https://raw.githubusercontent.com/containers/ramalama-stack/refs/tags/v0.2.5/src/ramalama_stack/providers.d/remote/inference/ramalama.yaml
curl --create-dirs --output ~/.llama/distributions/ramalama/ramalama-run.yaml https://raw.githubusercontent.com/containers/ramalama-stack/refs/tags/v0.2.5/src/ramalama_stack/ramalama-run.yaml
- First, you will need a RamaLama server running - see the RamaLama project docs for more information.
- Ensure you set the INFERENCE_MODEL environment variable to the name of the model you have running via RamaLama.
- You can then run the RamaLama external provider (see the combined sketch below) via
  llama stack run ~/.llama/distributions/ramalama/ramalama-run.yaml
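Taken together, these steps might look like the following sketch. The model name is illustrative; substitute whatever model you are serving, and see the RamaLama docs for the exact serve options on your setup:

```bash
# Serve a model with RamaLama (listens on port 8080 by default)
ramalama serve llama3.2 &

# Point the provider at the model RamaLama is serving
export INFERENCE_MODEL=llama3.2

# Start Llama Stack with the RamaLama external provider
llama stack run ~/.llama/distributions/ramalama/ramalama-run.yaml
```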
Note
You can also run the RamaLama external provider inside a container via Podman:
podman run \
--net=host \
--env RAMALAMA_URL=http://0.0.0.0:8080 \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
quay.io/ramalama/llama-stack
This will start a Llama Stack server, which will use port 8321 by default. You can test that it works by configuring the Llama Stack Client to run against this server and sending a test request.
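Before configuring the client, you can optionally confirm the server is reachable with a plain HTTP request. A minimal sketch, assuming the default port 8321 (the health route shown here may vary between Llama Stack versions):

```bash
# Basic liveness check against the Llama Stack server
curl http://localhost:8321/v1/health
```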
- If your client is running on the same machine as the server, you can run
llama-stack-client configure --endpoint http://0.0.0.0:8321 --api-key none
- If your client is running on a different machine, you can run
llama-stack-client configure --endpoint http://<hostname>:8321 --api-key none
- The client should give you a message similar to
Done! You can now use the Llama Stack Client CLI with endpoint <endpoint>
- You can then test the server by running
llama-stack-client inference chat-completion --message "tell me a joke"
which should return something like
ChatCompletionResponse(
    completion_message=CompletionMessage(
        content='A man walked into a library and asked the librarian, "Do you have any books on Pavlov\'s dogs
        and Schrödinger\'s cat?" The librarian replied, "It rings a bell, but I\'m not sure if it\'s here or not."',
        role='assistant',
        stop_reason='end_of_turn',
        tool_calls=[]
    ),
    logprobs=None,
    metrics=[
        Metric(metric='prompt_tokens', value=14.0, unit=None),
        Metric(metric='completion_tokens', value=63.0, unit=None),
        Metric(metric='total_tokens', value=77.0, unit=None)
    ]
)
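If the request fails, a quick way to troubleshoot is to ask the server which models it has registered; the model you set in INFERENCE_MODEL should appear in the output:

```bash
# List the models known to the Llama Stack server
llama-stack-client models list
```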
Llama Stack includes an experimental user interface; check it out here.
To deploy the UI, run this:
podman run -d --rm --network=container:ramalama --name=streamlit quay.io/redhat-et/streamlit_client:0.1.0
Note
If running on macOS (not Linux), --network=host doesn't work. You'll need to publish the additional ports 8321:8321 and 8501:8501 with the ramalama serve command, then run with --network=container:ramalama. If running on Linux, use --network=host or -p 8501:8501 instead. The Streamlit container will be able to access the ramalama endpoint with either.
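For reference, a minimal sketch of the Linux variant described above, reusing the image and container name from the deploy command (the UI is served on port 8501):

```bash
# Linux: host networking lets the UI reach the Llama Stack server on localhost:8321
podman run -d --rm --network=host --name=streamlit quay.io/redhat-et/streamlit_client:0.1.0

# The Streamlit UI should now respond on port 8501
curl -I http://localhost:8501
```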