Commit 73db9c4

feat: add support for custom embedding providers
Add flexible embedding provider configuration to support both OpenAI and custom embedding endpoints (e.g., LiteLLM, local models). This enables users to use alternative embedding services while maintaining backward compatibility with OpenAI.
1 parent 38c48ef commit 73db9c4

10 files changed, with 580 additions and 254 deletions

README.md

Lines changed: 37 additions & 8 deletions
@@ -68,9 +68,21 @@ Configuration is managed through two files:
 ```dotenv
 # .env
 
-# Required: Your OpenAI API Key
+# Required: Your OpenAI API Key (used for both OpenAI and custom providers)
 OPENAI_API_KEY="sk-..."
 
+# Optional: Embedding provider (defaults to "openai")
+PROVIDER="openai" # or "custom"
+
+# Optional: Custom embedding model (defaults based on provider)
+EMBEDDING_MODEL="text-embedding-3-large" # or your preferred model
+
+# Required if using custom provider: Custom endpoint URL
+CUSTOM_ENDPOINT="http://localhost:8000/v1/embeddings"
+
+# Optional: Vector size of custom embedding model
+EMBEDDING_VECTOR_SIZE=1024
+
 # Required for GitHub sources
 GITHUB_PERSONAL_ACCESS_TOKEN="ghp_..."
 
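
Reviewer sketch for the hunk above: one way the new variables could be resolved into a single settings object, with the documented defaults and the "required if custom" rule enforced up front. This is an illustration only, not the code in this commit. `EmbeddingSettings` and `load_embedding_settings` are hypothetical names, the 3072 default comes from the README text elsewhere in this commit, and using a single default model for both providers is an assumption (the `.env` comment says the default depends on the provider).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Defaults as stated in the README text of this commit.
DEFAULT_PROVIDER = "openai"
DEFAULT_MODEL = "text-embedding-3-large"  # assumption: same default for both providers
DEFAULT_VECTOR_SIZE = 3072


@dataclass(frozen=True)
class EmbeddingSettings:
    """Hypothetical container for the resolved embedding configuration."""
    provider: str
    model: str
    vector_size: int
    api_key: str
    custom_endpoint: Optional[str]


def load_embedding_settings(env: Mapping[str, str] = os.environ) -> EmbeddingSettings:
    """Hypothetical helper: read the documented variables and validate them."""
    provider = env.get("PROVIDER", DEFAULT_PROVIDER).strip().lower()
    if provider not in ("openai", "custom"):
        raise ValueError(f'PROVIDER must be "openai" or "custom", got {provider!r}')

    # The README says the same key is used for both providers.
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        raise ValueError("OPENAI_API_KEY is required for both providers")

    # CUSTOM_ENDPOINT is only mandatory when the custom provider is selected.
    endpoint = env.get("CUSTOM_ENDPOINT") or None
    if provider == "custom" and endpoint is None:
        raise ValueError("CUSTOM_ENDPOINT is required when PROVIDER is custom")

    return EmbeddingSettings(
        provider=provider,
        model=env.get("EMBEDDING_MODEL", DEFAULT_MODEL),
        vector_size=int(env.get("EMBEDDING_VECTOR_SIZE", DEFAULT_VECTOR_SIZE)),
        api_key=api_key,
        custom_endpoint=endpoint,
    )
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test, and failing at load time surfaces a missing `CUSTOM_ENDPOINT` before any documents are processed.
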
@@ -84,27 +96,32 @@ Configuration is managed through two files:
 2. **`config.yaml` file:**
 This file defines the sources to process and how to handle them. Create a `config.yaml` file (or use a different name and pass it as an argument).
 
+**Embedding Provider Configuration:**
+
+Embedding providers are now configured via environment variables:
+- `OPENAI_API_KEY`: API key used for both providers
+- `PROVIDER`: Set to "openai" (default) or "custom"
+- `EMBEDDING_MODEL`: Model to use (default: "text-embedding-3-large")
+- `EMBEDDING_VECTOR_SIZE`: Vector size of the custom embedding model (default: 3072)
+- `CUSTOM_ENDPOINT`: Required when using custom provider (e.g., "http://localhost:8000/v1/embeddings")
+
 **Structure:**
 
 * `sources`: An array of source configurations.
 * `type`: Either `'website'`, `'github'`, `'local_directory'`, or `'zendesk'`
-
 For websites (`type: 'website'`):
 * `url`: The starting URL for crawling the documentation site.
 * `sitemap_url`: (Optional) URL to the site's XML sitemap for discovering additional pages not linked in navigation.
-
 For GitHub repositories (`type: 'github'`):
 * `repo`: Repository name in the format `'owner/repo'` (e.g., `'istio/istio'`).
 * `start_date`: (Optional) Starting date to fetch issues from (e.g., `'2025-01-01'`).
-
 For local directories (`type: 'local_directory'`):
 * `path`: Path to the local directory to process.
 * `include_extensions`: (Optional) Array of file extensions to include (e.g., `['.md', '.txt', '.pdf']`). Defaults to `['.md', '.txt', '.html', '.htm', '.pdf']`.
 * `exclude_extensions`: (Optional) Array of file extensions to exclude.
 * `recursive`: (Optional) Whether to traverse subdirectories (defaults to `true`).
 * `url_rewrite_prefix` (Optional) URL prefix to rewrite `file://` URLs (e.g., `https://mydomain.com`)
 * `encoding`: (Optional) File encoding to use (defaults to `'utf8'`). Note: PDF files are processed as binary and this setting doesn't apply to them.
-
 For Zendesk (`type: 'zendesk'`):
 * `zendesk_subdomain`: Your Zendesk subdomain (e.g., `'mycompany'` for mycompany.zendesk.com).
 * `email`: Your Zendesk admin email address.
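
Reviewer sketch for the embedding provider bullets in this hunk: the example `CUSTOM_ENDPOINT` ends in `/v1/embeddings`, which suggests an OpenAI-compatible embeddings API. Assuming that request and response shape (a JSON body with `model` and `input`, and a `data` array of `{index, embedding}` objects in the reply), a client could look roughly like this. `build_embedding_request` and `parse_embeddings` are hypothetical names, and the commit's actual implementation may differ.

```python
import json
import urllib.request
from typing import List


def build_embedding_request(endpoint: str, api_key: str, model: str,
                            texts: List[str]) -> urllib.request.Request:
    """Hypothetical helper: build a POST request in the OpenAI embeddings format."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def parse_embeddings(response_body: bytes, expected_size: int) -> List[List[float]]:
    """Hypothetical helper: extract vectors in input order and check their size."""
    body = json.loads(response_body)
    # Sort by "index" so the vectors line up with the order of the input texts.
    items = sorted(body["data"], key=lambda item: item["index"])
    vectors = [item["embedding"] for item in items]
    for vector in vectors:
        # Guard against a model whose output does not match EMBEDDING_VECTOR_SIZE.
        if len(vector) != expected_size:
            raise ValueError(
                f"expected vectors of size {expected_size}, got {len(vector)}"
            )
    return vectors
```

The request can be sent with `urllib.request.urlopen(req, timeout=30)` and the raw response bytes passed to `parse_embeddings`. Checking the vector length against `EMBEDDING_VECTOR_SIZE` catches a mismatched model before anything is written to the vector store.
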
@@ -131,6 +148,21 @@ Configuration is managed through two files:
 
 **Example (`config.yaml`):**
 ```yaml
+# Example with OpenAI embedding provider (default)
+embedding_config:
+  provider: "openai"
+  openai:
+    api_key_env: "OPENAI_API_KEY"
+
+# Example with custom embedding provider (LiteLLM)
+# embedding_config:
+#   provider: "custom"
+#   custom:
+#     endpoint: "http://localhost:8000/v1/embeddings"
+#     model: "text-embedding-ada-002"
+#     api_key_env: "LITELLM_API_KEY"
+#     timeout: 30000
+
 sources:
   # Website source example
   - type: 'website'
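
Reviewer sketch for the `embedding_config` example in this hunk: `api_key_env` names an environment variable rather than holding the key itself, so the secret stays out of `config.yaml`. Assuming the YAML has already been parsed into a dictionary, the indirection could be resolved as below. `resolve_provider_section` is a hypothetical name. The README text in this commit describes plain environment variables, while this example shows a YAML block, and the diff shown here does not reveal which of the two the code actually reads.

```python
import os
from typing import Any, Dict, Mapping


def resolve_provider_section(embedding_config: Mapping[str, Any],
                             env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Hypothetical helper: pick the active provider block and look up its API key."""
    provider = embedding_config.get("provider", "openai")
    # Copy the provider-specific block so the parsed config is left untouched.
    section = dict(embedding_config.get(provider) or {})

    # "api_key_env" holds the NAME of the variable, not the secret itself.
    key_var = section.pop("api_key_env", "OPENAI_API_KEY")
    api_key = env.get(key_var)
    if not api_key:
        raise KeyError(f"environment variable {key_var} is not set")

    section["provider"] = provider
    section["api_key"] = api_key
    return section
```

For the commented-out LiteLLM example, this returns the `endpoint`, `model`, and `timeout` values together with the key read from `LITELLM_API_KEY`.
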
@@ -155,7 +187,6 @@ Configuration is managed through two files:
       type: 'sqlite'
       params:
         db_path: './istio-issues.db'
-
   # Local directory source example
   - type: 'local_directory'
     product_name: 'project-docs'
@@ -168,7 +199,6 @@ Configuration is managed through two files:
       type: 'sqlite'
       params:
         db_path: './project-docs.db'
-
   # Zendesk example
   - type: 'zendesk'
     product_name: 'MyCompany'
@@ -186,7 +216,6 @@ Configuration is managed through two files:
       type: 'sqlite'
       params:
         db_path: './zendesk-kb.db'
-
   # Qdrant example
   - type: 'website'
     product_name: 'Istio'