add docs for using phi models directly #407

`examples/python/phi-3-tutorial.md` (57 additions, 0 deletions)

@@ -160,3 +160,60 @@ Are you on a Windows machine with GPU?

This joke plays on the double meaning of "creativity" in the context of AI. Generative AI is often associated with its ability to produce creative content, but in this joke, it's humorously suggested that the AI is going to school to enhance its creative skills, as if it were a human student.
```


## Run in your code or REPL

If you want to run the model directly, either as a script or from a Jupyter notebook, the following code snippets will help you. First, load the model and tokenizer:

```py
import onnxruntime_genai as og

# Phi-3 chat prompt template
chat_template = '<|user|>\n{input} <|end|>\n<|assistant|>'

# GPU (CUDA)
# model_name = "Phi-3-mini-4k-instruct-onnx/cuda/cuda-int4-rtn-block-32"

# CPU
model_name = "Phi-3-mini-4k-instruct-onnx/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4"

# DirectML
# model_name = "Phi-3-mini-4k-instruct-onnx/directml/directml-int4-awq-block-128"

# Load the model and create a tokenizer for it
model = og.Model(model_name)
tokenizer = og.Tokenizer(model)
```
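
These paths assume the ONNX model files are already on disk, laid out the way the Hugging Face repo publishes them. If you still need to fetch them, one option (an assumption here, not something the snippet above requires — any download method works) is a minimal sketch with the `huggingface_hub` package:

```py
# A minimal sketch, assuming the huggingface_hub package is installed.
# This fetches only the CPU int4 variant used above; adjust allow_patterns
# for the CUDA or DirectML variants.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="microsoft/Phi-3-mini-4k-instruct-onnx",
    allow_patterns=["cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/*"],
    local_dir="Phi-3-mini-4k-instruct-onnx",
)
```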
Then define your `predict` function:
```py
def predict(text: str) -> str:
    # e.g. text = "How are you doing today?"
    tokenizer_stream = tokenizer.create_stream()
    search_options = {
        "do_sample": True,
        "temperature": 0.6,
        "top_p": 1,
    }
    prompt = chat_template.format(input=text)

    input_tokens = tokenizer.encode(prompt)

    params = og.GeneratorParams(model)
    params.try_use_cuda_graph_with_max_batch_size(1)  # CUDA graph capture hint, for the CUDA model
    params.set_search_options(**search_options)
    params.input_ids = input_tokens

    # Stream tokens as they are generated, printing and collecting each one
    generator = og.Generator(model, params)
    output = []
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        decoded = tokenizer_stream.decode(new_token)
        print(decoded, end='', flush=True)
        output.append(decoded)
    return "".join(output)
```
Now you can call it and save the output:
```py
model_prediction = predict("Can you tell me a joke?")
```
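
Since `predict` streams its output to the terminal as it generates, it also works well in a small interactive loop. A minimal sketch (the loop itself is just one way to use the function, not part of the tutorial above):

```py
# A simple REPL-style loop built on predict(); enter an empty line to exit.
while True:
    text = input("\nPrompt: ")
    if not text:
        break
    predict(text)
```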