**README.md** (7 additions, 3 deletions)
```diff
@@ -19,17 +19,21 @@ An embedding is a vector, or a list, of floating-point numbers. The distance bet
 
 This embedding API is created for [Magda](https://github.com/magda-io/magda)'s vector / hybrid search solution. The API interface is compatible with OpenAI's `embeddings` API to make it easier to reuse existing tools & libraries.
 
+> The diagram below shows Magda's use case
+
+[image: Magda's use case diagram]
+
 ### Model Selection & Resources requirements
 
 > Only the default model will be included in the docker image to speed up the starting up.
-> If you want to use a different model (via `appConfig.modelList`), besides the resources requirements consideration here, you might also want to increase `pluginTimeout` and adjsut`startupProbe` to allow the longer starting up time introduced by the model downloading.
+> If you want to use a different model (via `appConfig.modelList`), besides the resource requirements discussed here, you might also want to increase `pluginTimeout` and adjust `startupProbe` to allow for the longer startup time introduced by model downloading.
 
 Due to [this issue of ONNX runtime](https://github.com/microsoft/onnxruntime/issues/15080), the peak memory usage of the service is much higher than the model file size (2 times higher).
 e.g. For the default 500MB model file, the peak memory usage could be up to 1.8GB - 2GB.
-However, the memory usage will drop back to much lower (for default model, it's aroudn 800MB-900MB) after the model is loaded.
+However, the memory usage will drop back to a much lower level (around 800MB-900MB for the default model) after the model is loaded.
 Please make sure your Kubernetes cluster has enough resources to run the service.
 
-When specify `appConfig.modelList`, you can set `quantized` to `false`to use a quantized model. Please refer to helm chart document below for more informaiton. You can also find an example from this config file [here](./deploy/test-deploy.yaml).
+When specifying `appConfig.modelList`, you can set the `dtype` field to select the precision (quantization) of the model. The possible values are: full-precision ("fp32"), half-precision ("fp16"), 8-bit ("q8", "int8", "uint8") and 4-bit ("q4", "bnb4", "q4f16"). Please refer to the helm chart document below for more information. You can also find an example in this config file [here](./deploy/test-deploy.yaml).
 
 > Memory consumption test results for a few selected models can be found at: https://github.com/magda-io/magda-embedding-api/issues/2
```
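Since the README advertises compatibility with OpenAI's `embeddings` API, a client can call the service with the standard OpenAI request and response shapes. Below is a minimal TypeScript sketch; the base URL, port, `/v1/embeddings` path, and `"default"` model name are assumptions for illustration (the actual route and model names depend on the deployment), while the `model`/`input` request fields and `data[].embedding` response field follow OpenAI's embeddings API convention.

```typescript
// Minimal sketch of calling an OpenAI-compatible embeddings endpoint.
// Assumptions: the service listens on localhost:3000 and exposes the
// OpenAI-style POST /v1/embeddings route; "default" stands in for
// whatever model name the deployment actually configures.
async function embed(texts: string[]): Promise<number[][]> {
    const res = await fetch("http://localhost:3000/v1/embeddings", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model: "default", input: texts })
    });
    if (!res.ok) {
        throw new Error(`Embedding request failed: ${res.status}`);
    }
    // OpenAI-style response body: { data: [{ embedding: number[], index: number }, ...] }
    const body = await res.json();
    return body.data.map((item: { embedding: number[] }) => item.embedding);
}

// Usage: fetch embeddings for two short documents.
embed(["hello world", "vector search"]).then((vectors) => {
    console.log(vectors.length, vectors[0].length); // count and dimensionality
});
```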
**README.md.gotmpl** (8 additions, 6 deletions)
```diff
@@ -19,22 +19,24 @@ An embedding is a vector, or a list, of floating-point numbers. The distance bet
 
 This embedding API is created for [Magda](https://github.com/magda-io/magda)'s vector / hybrid search solution. The API interface is compatible with OpenAI's `embeddings` API to make it easier to reuse existing tools & libraries.
 
+> The diagram below shows Magda's use case
+
+[image: Magda's use case diagram]
+
 ### Model Selection & Resources requirements
 
-> Only the default mode, will be included in the docker image to speed up the starting up.
-> If you want to use a different model (via `appConfig.modelList`), besides the resources requirements consideration here, you might also want to increase `pluginTimeout` and adjsut `startupProbe` to allow the longer starting up time introduced by the model downloading.
+> Only the default model will be included in the docker image to speed up the starting up.
+> If you want to use a different model (via `appConfig.modelList`), besides the resource requirements discussed here, you might also want to increase `pluginTimeout` and adjust `startupProbe` to allow for the longer startup time introduced by model downloading.
 
 Due to [this issue of ONNX runtime](https://github.com/microsoft/onnxruntime/issues/15080), the peak memory usage of the service is much higher than the model file size (2 times higher).
 e.g. For the default 500MB model file, the peak memory usage could be up to 1.8GB - 2GB.
-However, the memory usage will drop back to much lower (for default model, it's aroudn 800MB-900MB) after the model is loaded.
+However, the memory usage will drop back to a much lower level (around 800MB-900MB for the default model) after the model is loaded.
 Please make sure your Kubernetes cluster has enough resources to run the service.
 
-When specify `appConfig.modelList`, you can set `quantized` to `false` to use a quantized model. Please refer to helm chart document below for more informaiton. You can also find an example from this config file [here](./deploy/test-deploy.yaml).
+When specifying `appConfig.modelList`, you can set the `dtype` field to select the precision (quantization) of the model. The possible values are: full-precision ("fp32"), half-precision ("fp16"), 8-bit ("q8", "int8", "uint8") and 4-bit ("q4", "bnb4", "q4f16"). Please refer to the helm chart document below for more information. You can also find an example in this config file [here](./deploy/test-deploy.yaml).
 
 > Memory consumption test results for a few selected models can be found at: https://github.com/magda-io/magda-embedding-api/issues/2
```
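Putting the two documented knobs together, a values file might look roughly like the sketch below. This is a hypothetical shape, not the chart's authoritative schema: the model id, probe numbers, and memory figure are illustrative assumptions, and only the `appConfig.modelList`, `dtype`, `pluginTimeout`, and `startupProbe` names come from the text above (see `deploy/test-deploy.yaml` in the repo for a real example).

```yaml
# Hypothetical helm values sketch, assuming modelList entries take
# `name` and `dtype` fields; check the chart docs for the real schema.
appConfig:
  modelList:
    - name: Xenova/bge-small-en-v1.5   # illustrative model id, not the default
      dtype: q8                        # fp32 | fp16 | q8 | int8 | uint8 | q4 | bnb4 | q4f16

# A non-default model is downloaded at startup, so the plugin timeout and
# startup probe may need more headroom (all numbers here are illustrative).
pluginTimeout: 600
startupProbe:
  periodSeconds: 10
  failureThreshold: 60   # allow up to ~10 minutes for the model download

# Peak memory while loading can be ~2x the model file size (see the ONNX
# runtime issue linked above), so size the limit off the peak, not the
# post-load steady state.
resources:
  limits:
    memory: "2Gi"
```

Sizing the memory limit against the ~2x loading peak rather than the 800MB-900MB steady state avoids the container being OOM-killed during startup.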