**README.md** (7 additions, 3 deletions)
```diff
@@ -19,17 +19,21 @@ An embedding is a vector, or a list, of floating-point numbers. The distance bet
 
 This embedding API is created for [Magda](https://github.com/magda-io/magda)'s vector / hybrid search solution. The API interface is compatible with OpenAI's `embeddings` API to make it easier to reuse existing tools & libraries.
 
+> The diagram below shows Magda's use case
+
+[image: Magda's use case diagram]
+
 ### Model Selection & Resources requirements
 
 > Only the default model will be included in the docker image to speed up the starting up.
-> If you want to use a different model (via `appConfig.modelList`), besides the resources requirements consideration here, you might also want to increase `pluginTimeout` and adjsut`startupProbe` to allow the longer starting up time introduced by the model downloading.
+> If you want to use a different model (via `appConfig.modelList`), besides the resource requirements discussed here, you might also want to increase `pluginTimeout` and adjust `startupProbe` to allow for the longer startup time introduced by model downloading.
 
 Due to [this issue of ONNX runtime](https://github.com/microsoft/onnxruntime/issues/15080), the peak memory usage of the service is much higher than the model file size (2 times higher).
 e.g. For the default 500MB model file, the peak memory usage could be up to 1.8GB - 2GB.
-However, the memory usage will drop back to much lower (for default model, it's aroudn 800MB-900MB) after the model is loaded.
+However, the memory usage will drop back to a much lower level (around 800MB-900MB for the default model) after the model is loaded.
 Please make sure your Kubernetes cluster has enough resources to run the service.
 
-When specify `appConfig.modelList`, you can set `quantized` to `false`to use a quantized model. Please refer to helm chart document below for more informaiton. You can also find an example from this config file [here](./deploy/test-deploy.yaml).
+When specifying `appConfig.modelList`, you can set the `dtype` field to select the precision (quantization) of the model. The possible values are: full-precision ("fp32"), half-precision ("fp16"), 8-bit ("q8", "int8", "uint8") and 4-bit ("q4", "bnb4", "q4f16"). Please refer to the helm chart document below for more information. You can also find an example in this config file [here](./deploy/test-deploy.yaml).
 
 > Memory consumption test results for a few selected models can be found at: https://github.com/magda-io/magda-embedding-api/issues/2
```
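Since the README advertises compatibility with OpenAI's `embeddings` API, a client can call the service with the standard OpenAI request and response shapes. Below is a minimal TypeScript sketch; the base URL, port, `/v1/embeddings` path, and `"default"` model name are assumptions for illustration (the actual route and model names depend on the deployment), while the `model`/`input` request fields and `data[].embedding` response field follow OpenAI's embeddings API convention.

```typescript
// Minimal sketch of calling an OpenAI-compatible embeddings endpoint.
// Assumptions: the service listens on localhost:3000 and exposes the
// OpenAI-style POST /v1/embeddings route; "default" stands in for
// whatever model name the deployment actually configures.
async function embed(texts: string[]): Promise<number[][]> {
    const res = await fetch("http://localhost:3000/v1/embeddings", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model: "default", input: texts })
    });
    if (!res.ok) {
        throw new Error(`Embedding request failed: ${res.status}`);
    }
    // OpenAI-style response body: { data: [{ embedding: number[], index: number }, ...] }
    const body = await res.json();
    return body.data.map((item: { embedding: number[] }) => item.embedding);
}

// Usage: fetch embeddings for two short documents.
embed(["hello world", "vector search"]).then((vectors) => {
    console.log(vectors.length, vectors[0].length); // count and dimensionality
});
```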
**README.md.gotmpl** (8 additions, 6 deletions)
```diff
@@ -19,22 +19,24 @@ An embedding is a vector, or a list, of floating-point numbers. The distance bet
 
 This embedding API is created for [Magda](https://github.com/magda-io/magda)'s vector / hybrid search solution. The API interface is compatible with OpenAI's `embeddings` API to make it easier to reuse existing tools & libraries.
 
+> The diagram below shows Magda's use case
+
+[image: Magda's use case diagram]
+
 ### Model Selection & Resources requirements
 
-> Only the default mode, will be included in the docker image to speed up the starting up.
-> If you want to use a different model (via `appConfig.modelList`), besides the resources requirements consideration here, you might also want to increase `pluginTimeout` and adjsut `startupProbe` to allow the longer starting up time introduced by the model downloading.
+> Only the default model will be included in the docker image to speed up the starting up.
+> If you want to use a different model (via `appConfig.modelList`), besides the resource requirements discussed here, you might also want to increase `pluginTimeout` and adjust `startupProbe` to allow for the longer startup time introduced by model downloading.
 
 Due to [this issue of ONNX runtime](https://github.com/microsoft/onnxruntime/issues/15080), the peak memory usage of the service is much higher than the model file size (2 times higher).
 e.g. For the default 500MB model file, the peak memory usage could be up to 1.8GB - 2GB.
-However, the memory usage will drop back to much lower (for default model, it's aroudn 800MB-900MB) after the model is loaded.
+However, the memory usage will drop back to a much lower level (around 800MB-900MB for the default model) after the model is loaded.
 Please make sure your Kubernetes cluster has enough resources to run the service.
 
-When specify `appConfig.modelList`, you can set `quantized` to `false` to use a quantized model. Please refer to helm chart document below for more informaiton. You can also find an example from this config file [here](./deploy/test-deploy.yaml).
+When specifying `appConfig.modelList`, you can set the `dtype` field to select the precision (quantization) of the model. The possible values are: full-precision ("fp32"), half-precision ("fp16"), 8-bit ("q8", "int8", "uint8") and 4-bit ("q4", "bnb4", "q4f16"). Please refer to the helm chart document below for more information. You can also find an example in this config file [here](./deploy/test-deploy.yaml).
 
 > Memory consumption test results for a few selected models can be found at: https://github.com/magda-io/magda-embedding-api/issues/2
```
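Putting the two documented knobs together, a values file might look roughly like the sketch below. This is a hypothetical shape, not the chart's authoritative schema: the model id, probe numbers, and memory figure are illustrative assumptions, and only the `appConfig.modelList`, `dtype`, `pluginTimeout`, and `startupProbe` names come from the text above (see `deploy/test-deploy.yaml` in the repo for a real example).

```yaml
# Hypothetical helm values sketch, assuming modelList entries take
# `name` and `dtype` fields; check the chart docs for the real schema.
appConfig:
  modelList:
    - name: Xenova/bge-small-en-v1.5   # illustrative model id, not the default
      dtype: q8                        # fp32 | fp16 | q8 | int8 | uint8 | q4 | bnb4 | q4f16

# A non-default model is downloaded at startup, so the plugin timeout and
# startup probe may need more headroom (all numbers here are illustrative).
pluginTimeout: 600
startupProbe:
  periodSeconds: 10
  failureThreshold: 60   # allow up to ~10 minutes for the model download

# Peak memory while loading can be ~2x the model file size (see the ONNX
# runtime issue linked above), so size the limit off the peak, not the
# post-load steady state.
resources:
  limits:
    memory: "2Gi"
```

Sizing the memory limit against the ~2x loading peak rather than the 800MB-900MB steady state avoids the container being OOM-killed during startup.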