
Commit 5b7f3e4

docs: reorganize and add new embedding provider documentation (#171)
* docs: reorganize AI integrations in mkdocs.yml and add new embedding provider documentation

  - Updated navigation structure in `mkdocs.yml` to categorize AI integrations under "AI Frameworks" and "Embeddings".
  - Added new documentation files for various embedding providers, including Cohere, Google Gemini, Hugging Face, Jina AI, NVIDIA NIM, and OpenAI-compatible APIs.
  - Enhanced existing guides with examples for using TiDB Cloud hosted embeddings and auto-embedding features.

* docs: update embedding tutorials for clarity and consistency

  - Revised introductory texts in embedding tutorials for Cohere, Google Gemini, Hugging Face, Jina AI, NVIDIA NIM, and OpenAI to emphasize semantic search and embedding generation.
  - Enhanced usage examples to focus on creating vector tables, inserting documents, and performing similarity searches.
  - Improved overall readability and consistency across all embedding integration documentation.
1 parent 16c5720 commit 5b7f3e4

12 files changed, +1581 −67 lines changed

mkdocs.yml

Lines changed: 27 additions & 5 deletions
@@ -117,8 +117,20 @@ nav:
    - IDE & Tool Integration:
      - Cursor: ai/integrations/tidb-mcp-cursor.md
      - Claude Desktop: ai/integrations/tidb-mcp-claude-desktop.md
-   - LlamaIndex: ai/integrations/llamaindex.md
-   - LangChain: ai/integrations/langchain.md
+   - AI Frameworks:
+     - LlamaIndex: ai/integrations/llamaindex.md
+     - LangChain: ai/integrations/langchain.md
+   - Embeddings:
+     - Overview: ai/integrations/embedding-overview.md
+     - TiDB Cloud Hosted: ai/integrations/embedding-tidb-cloud-hosted.md
+     - OpenAI: ai/integrations/embedding-openai.md
+     - OpenAI Compatible: ai/integrations/embedding-openai-compatible.md
+     - Cohere: ai/integrations/embedding-cohere.md
+     - Jina AI: ai/integrations/embedding-jinaai.md
+     - Google Gemini: ai/integrations/embedding-gemini.md
+     - Hugging Face: ai/integrations/embedding-huggingface.md
+     - NVIDIA NIM: ai/integrations/embedding-nvidia-nim.md
+
  - Concepts:
    - Vector Search: ai/concepts/vector-search.md
  - Guides:
@@ -151,9 +163,19 @@ nav:
    - IDE & Tool Integration:
      - Cursor: ai/integrations/tidb-mcp-cursor.md
      - Claude Desktop: ai/integrations/tidb-mcp-claude-desktop.md
-   - LlamaIndex: ai/integrations/llamaindex.md
-   - LangChain: ai/integrations/langchain.md
-
+   - AI Frameworks:
+     - LlamaIndex: ai/integrations/llamaindex.md
+     - LangChain: ai/integrations/langchain.md
+   - Embeddings:
+     - Overview: ai/integrations/embedding-overview.md
+     - TiDB Cloud Hosted: ai/integrations/embedding-tidb-cloud-hosted.md
+     - OpenAI: ai/integrations/embedding-openai.md
+     - OpenAI Compatible: ai/integrations/embedding-openai-compatible.md
+     - Cohere: ai/integrations/embedding-cohere.md
+     - Jina AI: ai/integrations/embedding-jinaai.md
+     - Google Gemini: ai/integrations/embedding-gemini.md
+     - Hugging Face: ai/integrations/embedding-huggingface.md
+     - NVIDIA NIM: ai/integrations/embedding-nvidia-nim.md

extra:
  social:

src/ai/guides/auto-embedding.md

Lines changed: 3 additions & 61 deletions
@@ -8,20 +8,19 @@ Auto embedding is a feature that allows you to automatically generate vector emb

## Basic Usage

+In this example, we use TiDB Cloud hosted embedding models for demonstration. For other providers, check the [Supported Providers](../integrations/embedding-overview.md#supported-providers) list.
+
### Step 1. Define an embedding function

=== "Python"

    Define an embedding function to generate vector embeddings for text data.
-
-    In this example, we use OpenAI as the embedding provider for demonstration. For other providers, check the [Supported Providers](#supported-providers) list.

    ```python
    from pytidb.embeddings import EmbeddingFunction

    embed_func = EmbeddingFunction(
-        model_name="openai/{model_name}",  # openai/text-embedding-3-small
-        api_key="{your-openai-api-key}",
+        model_name="tidbcloud_free/amazon/titan-embed-text-v2",
    )
    ```

@@ -74,60 +73,3 @@ Auto embedding is a feature that allows you to automatically generate vector emb
    ```python
    table.search("HTAP database").limit(3).to_list()
    ```
-
-## Embedding Function
-
-`EmbeddingFunction` provides a unified interface in `pytidb` for accessing external embedding model services.
-
-#### Constructor Parameters
-
-- `model_name` *(required)*:
-  Specifies the embedding model to use, in the format `{provider_name}/{model_name}`.
-
-- `dimensions` *(optional)*:
-  The dimensionality of the output vector embeddings. If not provided and the selected model does not include a default dimension, a test string will be embedded during initialization to automatically determine the actual dimension.
-
-- `api_key` *(optional)*:
-  The API key used to access the embedding service. If not explicitly set, the key will be retrieved from the default environment variable associated with the provider.
-
-- `api_base` *(optional)*:
-  The base URL of the embedding API service.
-
-### Supported Providers
-
-Below is a list of supported embedding model providers. You can follow the corresponding example to create an EmbeddingFunction instance for the provider you are using.
-
-#### OpenAI
-
-For OpenAI users, you can go to [OpenAI API Platform](https://platform.openai.com/api-keys) to create your own API key.
-
-```python
-embed_func = EmbeddingFunction(
-    model_name="openai/{model_name}",  # openai/text-embedding-3-small
-    api_key="{your-openai-api-key}",
-)
-```
-
-#### OpenAI Like
-
-If you're using a platform or tool that is compatible with the OpenAI API format, you can indicate this by adding the `openai/` prefix to the `model_name` parameter. Then, use the `api_base` parameter to specify the base URL of the API provided by your platform or tool.
-
-```python
-embed_func = EmbeddingFunction(
-    model_name="openai/{model_name}",  # text-embedding-3-small
-    api_key="{your-server-api-key}",
-    api_base="{your-api-server-base-url}"  # http://localhost:11434/
-)
-```
-
-#### Jina AI
-
-For Jina AI users, you can go to [Jina AI website](https://jina.ai/embeddings/) to create your own API key.
-
-```python
-embed_func = EmbeddingFunction(
-    model_name="jina_ai/{model_name}",  # jina_ai/jina-embeddings-v3
-    api_key="{your-jina-api-key}"
-)
-```
-
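Taken together with the table definition, insert, and search steps, the auto-embedding flow this guide describes looks roughly like the minimal sketch below. The `Chunk` model, table name, and connection placeholders are illustrative only; the calls mirror the `pytidb` APIs shown in this commit (`EmbeddingFunction`, `VectorField(source_field=...)`, `create_table`, `insert`, and `search`).

```python
from pytidb import TiDBClient
from pytidb.schema import TableModel, Field
from pytidb.embeddings import EmbeddingFunction
from pytidb.datatype import TEXT

# Connect to TiDB (placeholder values; use your own connection details).
tidb_client = TiDBClient.connect(
    host="{host}",
    port=4000,
    username="{username}",
    password="{password}",
    database="{database}",
    ensure_db=True,
)

# Auto embedding: the vector column is generated from `content` automatically.
embed_func = EmbeddingFunction(
    model_name="tidbcloud_free/amazon/titan-embed-text-v2",
)

class Chunk(TableModel):  # hypothetical example table
    __tablename__ = "chunks"
    id: int = Field(primary_key=True)
    content: str = Field(sa_type=TEXT)
    embedding: list[float] = embed_func.VectorField(source_field="content")

table = tidb_client.create_table(schema=Chunk, if_exists="overwrite")
table.insert(Chunk(id=1, content="TiDB is an HTAP database."))
table.search("HTAP database").limit(3).to_list()
```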
src/ai/guides/image-search.md

Lines changed: 2 additions & 1 deletion
@@ -18,13 +18,14 @@ For demonstration, you can use Jina AI's multimodal embedding model to generate

Go to [Jina AI](https://jina.ai/embeddings) to create an API key, then initialize the embedding function as follows:

-```python
+```python hl_lines="7"
from pytidb.embeddings import EmbeddingFunction

image_embed = EmbeddingFunction(
    # Or another provider/model that supports multimodal input
    model_name="jina_ai/jina-embedding-v4",
    api_key="{your-jina-api-key}",
+    multimodal=True,
)
```

src/ai/integrations/embedding-cohere.md

Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@
---
title: "Integrate TiDB Vector Search with Cohere Embeddings API"
description: "Learn how to integrate TiDB Vector Search with Cohere Embeddings API to store embeddings and perform semantic search."
keywords: "TiDB, Cohere, Vector search, text embeddings, multilingual embeddings"
---

# Integrate TiDB Vector Search with Cohere Embeddings API

This tutorial demonstrates how to use [Cohere](https://cohere.com/embed) to generate text embeddings, store them in TiDB vector storage, and perform semantic search.

!!! info

    Currently, only the following product and regions support native SQL functions for integrating the Cohere Embeddings API:

    - [TiDB Cloud Starter](https://tidbcloud.com/?utm_source=github&utm_medium=referral&utm_campaign=pytidb_readme) on AWS: `Frankfurt (eu-central-1)` and `Singapore (ap-southeast-1)`

## Cohere Embeddings

Cohere offers multilingual embedding models for search, RAG, and classification. The latest `embed-v4.0` model supports text, images, and mixed content. You can use the Cohere Embeddings API with TiDB through the AI SDK or native SQL functions for automatic embedding generation.

### Supported Models

| Model Name | Dimensions | Max Input Tokens | Description |
|------------|------------|------------------|-------------|
| `cohere/embed-v4.0` | 256, 512, 1024, 1536 (default) | 128k | Latest multimodal model supporting text, images, and mixed content (PDFs) |
| `cohere/embed-english-v3.0` | 1024 | 512 | High-performance English embedding model optimized for search and classification |
| `cohere/embed-multilingual-v3.0` | 1024 | 512 | Multilingual model supporting 100+ languages |
| `cohere/embed-english-light-v3.0` | 384 | 512 | Lightweight English model for faster processing with similar performance |
| `cohere/embed-multilingual-light-v3.0` | 384 | 512 | Lightweight multilingual model for faster processing with similar performance |

For a complete list of supported models and detailed specifications, see the [Cohere Embeddings Documentation](https://docs.cohere.com/docs/cohere-embed).
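As a quick illustration of selecting one of these models in `pytidb`, the minimal sketch below passes a Cohere model name and an optional `dimensions` value to an `EmbeddingFunction`. The `dimensions=1024` value is illustrative, chosen from the sizes listed in the table above (`embed-v4.0` defaults to 1536); the API key itself is supplied separately, as in Step 2 of the usage example.

```python
from pytidb.embeddings import EmbeddingFunction

# Illustrative sketch: pick a Cohere model from the table above.
# `dimensions` is optional; embed-v4.0 defaults to 1536-dimensional vectors.
embed_func = EmbeddingFunction(
    model_name="cohere/embed-v4.0",
    dimensions=1024,
)
```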
## Usage example

This example demonstrates creating a vector table, inserting documents, and performing similarity search using Cohere embedding models.

### Step 1: Connect to the database

=== "Python"

    ```python
    from pytidb import TiDBClient

    tidb_client = TiDBClient.connect(
        host="{gateway-region}.prod.aws.tidbcloud.com",
        port=4000,
        username="{prefix}.root",
        password="{password}",
        database="{database}",
        ensure_db=True,
    )
    ```

=== "SQL"

    ```bash
    mysql -h {gateway-region}.prod.aws.tidbcloud.com \
        -P 4000 \
        -u {prefix}.root \
        -p{password} \
        -D {database}
    ```

### Step 2: Configure the API key

Create your API key from the [Cohere Dashboard](https://dashboard.cohere.com/api-keys) and bring your own key (BYOK) to use the embedding service.

=== "Python"

    Configure the API key for the Cohere embedding provider using the TiDB Client:

    ```python
    tidb_client.configure_embedding_provider(
        provider="cohere",
        api_key="{your-cohere-api-key}",
    )
    ```

=== "SQL"

    Set the API key for the Cohere embedding provider using SQL:

    ```sql
    SET @@GLOBAL.TIDB_EXP_EMBED_COHERE_API_KEY = "{your-cohere-api-key}";
    ```

### Step 3: Create a vector table

Create a table with a vector field that uses the `cohere/embed-v4.0` model to generate 1536-dimensional vectors (the default dimension):

=== "Python"

    ```python
    from pytidb.schema import TableModel, Field
    from pytidb.embeddings import EmbeddingFunction
    from pytidb.datatype import TEXT

    class Document(TableModel):
        __tablename__ = "sample_documents"
        id: int = Field(primary_key=True)
        content: str = Field(sa_type=TEXT)
        embedding: list[float] = EmbeddingFunction(
            model_name="cohere/embed-v4.0"
        ).VectorField(source_field="content")

    table = tidb_client.create_table(schema=Document, if_exists="overwrite")
    ```

=== "SQL"

    ```sql
    CREATE TABLE sample_documents (
        `id` INT PRIMARY KEY,
        `content` TEXT,
        `embedding` VECTOR(1536) GENERATED ALWAYS AS (EMBED_TEXT(
            "cohere/embed-v4.0",
            `content`
        )) STORED
    );
    ```

### Step 4: Insert data into the table

=== "Python"

    Use the `table.insert()` or `table.bulk_insert()` API to add data:

    ```python
    documents = [
        Document(id=1, content="Python: High-level programming language for data science and web development."),
        Document(id=2, content="Python snake: Non-venomous constrictor found in tropical regions."),
        Document(id=3, content="Python framework: Django and Flask are popular web frameworks."),
        Document(id=4, content="Python libraries: NumPy and Pandas for data analysis."),
        Document(id=5, content="Python ecosystem: Rich collection of packages and tools."),
    ]
    table.bulk_insert(documents)
    ```

=== "SQL"

    Insert data using the `INSERT INTO` statement:

    ```sql
    INSERT INTO sample_documents (id, content)
    VALUES
        (1, "Python: High-level programming language for data science and web development."),
        (2, "Python snake: Non-venomous constrictor found in tropical regions."),
        (3, "Python framework: Django and Flask are popular web frameworks."),
        (4, "Python libraries: NumPy and Pandas for data analysis."),
        (5, "Python ecosystem: Rich collection of packages and tools.");
    ```

### Step 5: Search for similar documents

=== "Python"

    Use the `table.search()` API to perform vector search:

    ```python
    results = table.search("How to learn Python programming?") \
        .limit(2) \
        .to_list()
    print(results)
    ```

=== "SQL"

    Use the `VEC_EMBED_COSINE_DISTANCE` function to perform vector search based on the cosine distance metric:

    ```sql
    SELECT
        `id`,
        `content`,
        VEC_EMBED_COSINE_DISTANCE(embedding, "How to learn Python programming?") AS _distance
    FROM sample_documents
    ORDER BY _distance ASC
    LIMIT 2;
    ```
