
Commit 527485d

update docs
1 parent 3fdd248 commit 527485d

File tree

4 files changed (+17 −11 lines changed)

README.md

Lines changed: 7 additions & 3 deletions
```diff
@@ -19,17 +19,21 @@ An embedding is a vector, or a list, of floating-point numbers. The distance bet
 
 This embedding API is created for [Magda](https://github.com/magda-io/magda)'s vector / hybrid search solution. The API interface is compatible with OpenAI's `embeddings` API to make it easier to reuse existing tools & libraries.
 
+> the diagram below shows Magda's use case
+
+![architecture](docs/architecture.png)
+
 ### Model Selection & Resources requirements
 
 > Only the default mode, will be included in the docker image to speed up the starting up.
-> If you want to use a different model (via `appConfig.modelList`), besides the resources requirements consideration here, you might also want to increase `pluginTimeout` and adjsut `startupProbe` to allow the longer starting up time introduced by the model downloading.
+> If you want to use a different model (via `appConfig.modelList`), besides the resources requirements consideration here, you might also want to increase `pluginTimeout` and adjust `startupProbe` to allow the longer starting up time introduced by the model downloading.
 
 Due to [this issue of ONNX runtime](https://github.com/microsoft/onnxruntime/issues/15080), the peak memory usage of the service is much higher than the model file size (2 times higher).
 e.g. For the default 500MB model file, the peak memory usage could up to 1.8GB - 2GB.
-However, the memory usage will drop back to much lower (for default model, it's aroudn 800MB-900MB) after the model is loaded.
+However, the memory usage will drop back to much lower (for default model, it's around 800MB-900MB) after the model is loaded.
 Please make sure your Kubernetes cluster has enough resources to run the service.
 
-When specify `appConfig.modelList`, you can set `quantized` to `false` to use a quantized model. Please refer to helm chart document below for more informaiton. You can also find an example from this config file [here](./deploy/test-deploy.yaml).
+When specify `appConfig.modelList`, you can set the value of `dtype` field to select precision (quantizations) of the model. The possible value are: full-precision ("fp32"), half-precision ("fp16"), 8-bit ("q8", "int8", "uint8"), and 4-bit ("q4", "bnb4", "q4f16"). Please refer to helm chart document below for more information. You can also find an example from this config file [here](./deploy/test-deploy.yaml).
 
 > Memory consumption test result for a few selected models can be found: https://github.com/magda-io/magda-embedding-api/issues/2
 
```
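To make the combination of settings the updated README mentions concrete, a deployment that swaps in a non-default model might use a helm values sketch along the lines below. Only `appConfig.modelList`, `dtype`, `pluginTimeout`, and `startupProbe` come from the docs above; the probe fields, `resources` nesting, and all numbers are illustrative assumptions derived from the memory notes, not tested values:

```yaml
# Hypothetical helm values sketch -- field names other than appConfig.modelList,
# pluginTimeout and startupProbe are assumptions; tune the numbers for your cluster.
pluginTimeout: 600000        # allow extra time for model downloading at start-up
startupProbe:
  failureThreshold: 30       # probe longer before declaring start-up failed
  periodSeconds: 10
resources:
  requests:
    memory: "1Gi"            # steady-state usage for the default model is ~800MB-900MB
  limits:
    memory: "2Gi"            # peak usage while loading can reach ~1.8GB-2GB
appConfig:
  modelList:
    - name: Xenova/bge-small-en-v1.5
      dtype: "q8"            # 8-bit quantized; see the precision options above
```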

README.md.gotmpl

Lines changed: 8 additions & 6 deletions
```diff
@@ -19,22 +19,24 @@ An embedding is a vector, or a list, of floating-point numbers. The distance bet
 
 This embedding API is created for [Magda](https://github.com/magda-io/magda)'s vector / hybrid search solution. The API interface is compatible with OpenAI's `embeddings` API to make it easier to reuse existing tools & libraries.
 
+> the diagram below shows Magda's use case
+
+![architecture](docs/architecture.png)
+
 ### Model Selection & Resources requirements
 
-> Only the default mode, will be included in the docker image to speed up the starting up.
-> If you want to use a different model (via `appConfig.modelList`), besides the resources requirements consideration here, you might also want to increase `pluginTimeout` and adjsut `startupProbe` to allow the longer starting up time introduced by the model downloading.
+> Only the default mode, will be included in the docker image to speed up the starting up.
+> If you want to use a different model (via `appConfig.modelList`), besides the resources requirements consideration here, you might also want to increase `pluginTimeout` and adjust `startupProbe` to allow the longer starting up time introduced by the model downloading.
 
 Due to [this issue of ONNX runtime](https://github.com/microsoft/onnxruntime/issues/15080), the peak memory usage of the service is much higher than the model file size (2 times higher).
 e.g. For the default 500MB model file, the peak memory usage could up to 1.8GB - 2GB.
-However, the memory usage will drop back to much lower (for default model, it's aroudn 800MB-900MB) after the model is loaded.
+However, the memory usage will drop back to much lower (for default model, it's around 800MB-900MB) after the model is loaded.
 Please make sure your Kubernetes cluster has enough resources to run the service.
 
-When specify `appConfig.modelList`, you can set `quantized` to `false` to use a quantized model. Please refer to helm chart document below for more informaiton. You can also find an example from this config file [here](./deploy/test-deploy.yaml).
+When specify `appConfig.modelList`, you can set the value of `dtype` field to select precision (quantizations) of the model. The possible value are: full-precision ("fp32"), half-precision ("fp16"), 8-bit ("q8", "int8", "uint8"), and 4-bit ("q4", "bnb4", "q4f16"). Please refer to helm chart document below for more information. You can also find an example from this config file [here](./deploy/test-deploy.yaml).
 
 > Memory consumption test result for a few selected models can be found: https://github.com/magda-io/magda-embedding-api/issues/2
 
-
-
 
 {{ template "chart.maintainersSection" . }}
 
 {{ template "chart.requirementsSection" . }}
```

deploy/test-deploy.yaml

Lines changed: 2 additions & 2 deletions
```diff
@@ -17,8 +17,8 @@ appConfig:
   modelList:
     # use alternative model for embedding
     - name: Xenova/bge-small-en-v1.5
-      # set quantized to false to use the non-quantized version of the model
-      # by default, the quantized version of the model will be used
+      # You can set `dtype` to select the precision of the model
+      # Available values: "fp32" | "fp16" | "q8" | "int8" | "uint8" | "q4" | "bnb4" | "q4f16"
       dtype: "q8"
       # optional set max length of the input text
       # if not set, the value in model config will be used
```
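To illustrate the `dtype` options the new comments enumerate, a `modelList` can mix precisions per model. This is a sketch only: apart from `Xenova/bge-small-en-v1.5`, the model name below is a hypothetical placeholder, and whether a given model repository actually ships each precision variant must be checked per model:

```yaml
appConfig:
  modelList:
    - name: Xenova/bge-small-en-v1.5
      dtype: "fp16"                     # half-precision: smaller than fp32
    - name: your-org/some-other-model   # hypothetical placeholder name
      dtype: "q4"                       # 4-bit quantized: smallest of the listed options
```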

docs/architecture.png

46.6 KB
