From f07608a21adc68ed154c4cd3f7d94a5a1338d3ad Mon Sep 17 00:00:00 2001
From: CaroFG
Date: Mon, 15 Sep 2025 15:02:20 +0200
Subject: [PATCH 1/2] Add guide on multilingual datasets

---
 guides/multilingual-datasets.mdx | 148 +++++++++++++++++++++++++++++++
 1 file changed, 148 insertions(+)
 create mode 100644 guides/multilingual-datasets.mdx

diff --git a/guides/multilingual-datasets.mdx b/guides/multilingual-datasets.mdx
new file mode 100644
index 000000000..1aeeff8ab
--- /dev/null
+++ b/guides/multilingual-datasets.mdx
@@ -0,0 +1,148 @@
+---
+title: Handling multilingual datasets
+description: This guide covers indexing strategies, language-specific tokenizers, and best practices for aligning document and query tokenization.
+---
+
+When working with datasets that include content in multiple languages, it’s important to ensure that both documents and queries are processed correctly. This guide explains how to index and search multilingual datasets in Meilisearch, highlighting best practices, useful features, and what to avoid.
+
+## Tokenizers and language differences
+
+Search quality in Meilisearch depends heavily on how text is broken down into tokens. Since languages differ in their writing systems and rules, they require different tokenization strategies:
+
+- **Space-separated languages** (English, French, Spanish):
+
+Words are clearly separated by spaces, making them straightforward to tokenize.
+
+- **Non-space-separated languages** (Chinese, Japanese):
+
+Words are written continuously without spaces. These languages require specialized tokenizers to correctly split text into searchable units.
+
+- **Languages with compound words** (German, Swedish):
+
+Words can be combined to form long terms, such as _Donaudampfschifffahrtsgesellschaft_ (German for Danube Steamship Company). Meilisearch provides specialized tokenizers to process them correctly.
+
+### Normalization differences
+
+Normalization ensures that different spellings or character variations (like accents or case differences) are treated consistently during search.
+
+- **Accents and diacritics**:
+
+In many languages, accents can be ignored without losing meaning (e.g., éléphant vs elephant).
+
+In other languages like Swedish, diacritics may represent entirely different letters, so they must be preserved.
+
+## Recommended indexing strategy
+
+### Create a separate index for each language (recommended)
+
+If you have a multilingual dataset, the best practice is to create one index per language.
+
+#### Benefits
+
+- Provides natural sharding of your data by language, making it easier to maintain and scale.
+
+- Lets you apply language-specific settings, such as [stop words](/reference/api/settings#stop-words) and [separators](/reference/api/settings#separator-tokens).
+
+- Simplifies the handling of complex languages like Chinese or Japanese, which require specialized tokenizers.
+
+#### Searching across languages
+
+If you want to allow users to search in more than one language at once, you can:
+
+- Run a [multi-search](/reference/api/multi_search), querying several indexes in parallel.
+
+- Use [federated search](/reference/api/multi_search#federated-multi-search-requests), aggregating results from multiple language indexes into a single response.
+
+### Create a single index for multiple languages
+
+In some cases, you may prefer to keep multiple languages in a **single index**. This approach is generally acceptable for proofs of concept or datasets with fewer than ~1M documents.
+
+#### When it works well
+
+- Suitable for languages that use spaces to separate words and share similar tokenization behavior (e.g., English, French, Italian, Spanish, Portuguese).
+
+- Useful when you want a simple setup without maintaining multiple indexes.
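To illustrate the per-language-index option described above: a single search box can still cover every language by fanning the user's query out into one sub-query per index with multi-search. A minimal sketch in JavaScript, where the index names `movies-en` and `movies-de` and the helper `buildMultiSearchQueries` are illustrative assumptions rather than part of Meilisearch's API:

```javascript
// Build the payload for a multi-search request:
// one sub-query per language index, all sharing the same search terms.
const buildMultiSearchQueries = (query, indexUids) =>
  indexUids.map((indexUid) => ({ indexUid, q: query }));

const queries = buildMultiSearchQueries('ship', ['movies-en', 'movies-de']);

// With the JavaScript SDK, the payload is then sent in a single request,
// e.g. `client.multiSearch({ queries })`, and results come back per index.
```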
+
+#### Limitations
+
+- Languages with compound words (like German) or diacritics that change meaning (like Swedish), as well as non-space-separated writing systems (like Chinese or Japanese), work better in their own index, since they require specialized tokenizers.
+
+- Chinese and Japanese documents should not be mixed in the same field, since distinguishing between them automatically is very difficult. Each of these languages works best in its own dedicated index. However, if fields are strictly separated by language (e.g., title_zh always Chinese, title_ja always Japanese), it is possible to store them in the same index.
+
+- As the number of documents and languages grows, performance and relevancy can decrease, since queries must run across a larger, mixed dataset.
+
+#### Best practices for the single index approach
+
+- Use language-specific field names with a prefix or suffix (e.g., title_fr, title_en, or fr_title).
+
+- Declare these fields as [localized attributes](/reference/api/settings#localized-attributes) so Meilisearch can apply the correct tokenizer to each one.
+
+- Declaring locales this way also lets you filter and search by language, even when multiple languages are stored in the same index.
+
+## Language detection and configuration
+
+Accurate language detection is essential for applying the right tokenizer and normalization rules, which directly impact search quality.
+
+By default, Meilisearch automatically detects the language of your documents and queries.
+
+This automatic detection works well in most cases, especially with longer texts. However, results can vary depending on the type of input:
+
+- **Documents**: detection is generally reliable for longer content, but short snippets may produce less accurate results.
+- **Queries**: short or partial inputs (such as search-as-you-type) are harder to identify correctly, making explicit configuration more important.
+
+When you explicitly set `localizedAttributes` for documents and `locales` for queries, you **restrict the detection to the languages you’ve declared**.
+
+**Benefits**:
+
+- Meilisearch only chooses between the specified languages (e.g., English vs German).
+
+- Detection is more **reliable and consistent**, reducing mismatches.
+
+For search to work effectively, **queries must be tokenized and normalized in the same way as documents**. If strategies are not aligned, queries may fail to match even when the correct terms exist in the index.
+
+### Aligning document and query tokenization
+
+To keep queries and documents consistent, Meilisearch uses the same `locales` configuration concept on both sides:
+
+- In **documents**, `locales` are declared through `localizedAttributes`.
+- In **queries**, `locales` are passed as a [search parameter](/reference/api/search#query-locales).
+
+#### Declaring locales for documents
+
+The [`localizedAttributes` setting](/reference/api/settings#localized-attributes) allows you to explicitly define which languages are present in your dataset, and in which fields.
+
+For example, if your dataset contains multilingual titles, you can declare which attribute belongs to which language:
+
+```json
+{
+  "id": 1,
+  "title_en": "Danube Steamship Company",
+  "title_de": "Donaudampfschifffahrtsgesellschaft",
+  "title_fr": "Compagnie de navigation à vapeur du Danube"
+}
+```
+
+```javascript
+client.index('INDEX_NAME').updateLocalizedAttributes([
+  { attributePatterns: ['*_en'], locales: ['eng'] },
+  { attributePatterns: ['*_de'], locales: ['deu'] },
+  { attributePatterns: ['*_fr'], locales: ['fra'] }
+])
+```
+
+#### Specifying locales for queries
+
+When performing searches, you can specify [query locales](/reference/api/search#query-locales) to ensure queries are tokenized with the correct rules.
+
+```javascript
+client.index('INDEX_NAME').search('schiff', { locales: ['deu'] })
+```
+
+This ensures queries are interpreted with the correct tokenizer and normalization rules, avoiding false mismatches.
+
+## Conclusion
+
+Handling multilingual datasets in Meilisearch requires careful planning of both indexing and querying.
+By choosing the right indexing strategy and explicitly configuring languages with `localizedAttributes` and `locales`, you ensure that documents and queries are processed consistently.
+
+This alignment leads to more accurate results, better relevancy across languages, and a smoother search experience for users, no matter what language they search in.

From c5327ce5d4a12a515d2dcec1507f6fb829debe02 Mon Sep 17 00:00:00 2001
From: "github-actions[bot]"
Date: Mon, 15 Sep 2025 13:03:06 +0000
Subject: [PATCH 2/2] Update code samples [skip ci]

---
 ...ield_guide_update_document_primary_key.mdx |  4 ++-
 .../code_samples_rename_an_index_1.mdx        |  9 +++++
 ...les_search_parameter_reference_media_1.mdx | 34 +++++++++++++++++++
 .../samples/code_samples_swap_indexes_2.mdx   |  9 +++++
 .../code_samples_typo_tolerance_guide_5.mdx   |  4 +++
 .../code_samples_update_an_index_1.mdx        |  4 ++-
 .../code_samples_update_an_index_2.mdx        |  8 +++++
 7 files changed, 70 insertions(+), 2 deletions(-)
 create mode 100644 snippets/samples/code_samples_rename_an_index_1.mdx
 create mode 100644 snippets/samples/code_samples_swap_indexes_2.mdx
 create mode 100644 snippets/samples/code_samples_update_an_index_2.mdx

diff --git a/snippets/samples/code_samples_primary_field_guide_update_document_primary_key.mdx b/snippets/samples/code_samples_primary_field_guide_update_document_primary_key.mdx
index 9581144fe..5b10f2f6d 100644
--- a/snippets/samples/code_samples_primary_field_guide_update_document_primary_key.mdx
+++ b/snippets/samples/code_samples_primary_field_guide_update_document_primary_key.mdx
@@ -30,7 +30,9 @@ client.index('books').update(primary_key: 'title')
 ```
 
 ```go Go
-client.Index("books").UpdateIndex("title")
+client.Index("books").UpdateIndex(&meilisearch.UpdateIndexRequestParams{
+  PrimaryKey: "title",
+})
 ```
 
 ```csharp C#
diff --git a/snippets/samples/code_samples_rename_an_index_1.mdx b/snippets/samples/code_samples_rename_an_index_1.mdx
new file mode 100644
index 000000000..c47acef51
--- /dev/null
+++ b/snippets/samples/code_samples_rename_an_index_1.mdx
@@ -0,0 +1,9 @@
+
+
+```bash cURL
+curl \
+  -X PATCH 'MEILISEARCH_URL/indexes/INDEX_A' \
+  -H 'Content-Type: application/json' \
+  --data-binary '{ "uid": "INDEX_B" }'
+```
+
\ No newline at end of file
diff --git a/snippets/samples/code_samples_search_parameter_reference_media_1.mdx b/snippets/samples/code_samples_search_parameter_reference_media_1.mdx
index 980280dd8..0e6f82187 100644
--- a/snippets/samples/code_samples_search_parameter_reference_media_1.mdx
+++ b/snippets/samples/code_samples_search_parameter_reference_media_1.mdx
@@ -18,6 +18,40 @@ curl \
   }'
 ```
 
+```javascript JS
+client.index('INDEX_NAME').search('a futuristic movie', {
+  hybrid: {
+    embedder: 'EMBEDDER_NAME'
+  },
+  media: {
+    textAndPoster: {
+      text: 'a futuristic movie',
+      image: {
+        mime: 'image/jpeg',
+        data: 'base64EncodedImageData'
+      }
+    }
+  }
+})
+```
+
+```php PHP
+$client->index('INDEX_NAME')->search('a futuristic movie', [
+  'hybrid' => [
+    'embedder' => 'EMBEDDER_NAME'
+  ],
+  'media' => [
+    'textAndPoster' => [
+      'text' => 'a futuristic movie',
+      'image' => [
+        'mime' => 'image/jpeg',
+        'data' => 'base64EncodedImageData'
+      ]
+    ]
+  ]
+]);
+```
+
 ```go Go
 client.Index("INDEX_NAME").Search("", &meilisearch.SearchRequest{
   Hybrid: &meilisearch.SearchRequestHybrid{
diff --git a/snippets/samples/code_samples_swap_indexes_2.mdx b/snippets/samples/code_samples_swap_indexes_2.mdx
new file mode 100644
index 000000000..402cc310f
--- /dev/null
+++ b/snippets/samples/code_samples_swap_indexes_2.mdx
@@ -0,0 +1,9 @@
+
+
+```go Go
+client.SwapIndexes([]meilisearch.SwapIndexesParams{
+  {Indexes: []string{"indexA", "indexB"}, Rename: true},
+  {Indexes: []string{"indexX", "indexY"}, Rename: true},
+})
+```
+
\ No newline at end of file
diff --git a/snippets/samples/code_samples_typo_tolerance_guide_5.mdx b/snippets/samples/code_samples_typo_tolerance_guide_5.mdx
index 55582c22b..51bf8e70e 100644
--- a/snippets/samples/code_samples_typo_tolerance_guide_5.mdx
+++ b/snippets/samples/code_samples_typo_tolerance_guide_5.mdx
@@ -27,6 +27,10 @@ $client->index('movies')->updateTypoTolerance([
 ]);
 ```
 
+```ruby Ruby
+client.index('movies').update_typo_tolerance({ disable_on_numbers: true })
+```
+
 ```go Go
 client.Index("movies").UpdateTypoTolerance(&meilisearch.TypoTolerance{
   DisableOnNumbers: true
diff --git a/snippets/samples/code_samples_update_an_index_1.mdx b/snippets/samples/code_samples_update_an_index_1.mdx
index d962419a7..54a2eccf4 100644
--- a/snippets/samples/code_samples_update_an_index_1.mdx
+++ b/snippets/samples/code_samples_update_an_index_1.mdx
@@ -28,7 +28,9 @@ client.index('movies').update(primary_key: 'movie_id')
 ```
 
 ```go Go
-client.Index("movies").UpdateIndex("id")
+client.Index("movies").UpdateIndex(&meilisearch.UpdateIndexRequestParams{
+  PrimaryKey: "id",
+})
 ```
 
 ```csharp C#
diff --git a/snippets/samples/code_samples_update_an_index_2.mdx b/snippets/samples/code_samples_update_an_index_2.mdx
new file mode 100644
index 000000000..af5eddaf0
--- /dev/null
+++ b/snippets/samples/code_samples_update_an_index_2.mdx
@@ -0,0 +1,8 @@
+
+
+```go Go
+client.Index("movies").UpdateIndex(&meilisearch.UpdateIndexRequestParams{
+  UID: "movies_index_rename",
+})
+```
+
\ No newline at end of file