Add guide on multilingual datasets #3359
Since the "Guides" section atm is very focussed on integrating Meilisearch with third-party libs/tools (e.g. use meilisearch with mistral, deploy on AWS, etc) I'm thinking that adding it to /learn/indexing
might actually make more sense. What do you think?
Other than that, I made a couple of suggestions to reduce the size of the article a bit.
Handling multilingual datasets in Meilisearch requires careful planning of both indexing and querying. By choosing the right indexing strategy and explicitly configuring languages with `localizedAttributes` and `locales`, you ensure that documents and queries are processed consistently.

This alignment leads to more accurate results, better relevancy across languages, and a smoother search experience for users, no matter what language they search in.
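To make this concrete, here is a minimal sketch using the JavaScript client. The `movies` index, the language-suffixed field names (`title_en`, `title_ja`), the host, and the API key are all hypothetical, and the example assumes a Meilisearch version (v1.10 or later) and a client release that support the `localizedAttributes` setting and the `locales` search parameter:

```ts
import { MeiliSearch } from 'meilisearch'

const client = new MeiliSearch({ host: 'http://localhost:7700', apiKey: 'aSampleMasterKey' })
const index = client.index('movies')

// Declare which locales each attribute contains so that documents
// are tokenized with the right pipeline at indexing time.
// (Settings updates are queued as tasks; in real code, wait for the
// task to finish before relying on the new settings.)
await index.updateSettings({
  localizedAttributes: [
    { attributePatterns: ['*_en'], locales: ['eng'] },
    { attributePatterns: ['*_ja'], locales: ['jpn'] },
  ],
})

// Declare the query language at search time so the query is
// tokenized the same way as the documents it should match.
const results = await index.search('進撃の巨人', { locales: ['jpn'] })
console.log(results.hits)
```

Keeping the indexing-side `locales` and the search-time `locales` aligned is what makes the two pipelines agree.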
## Tokenizers and language differences

Search quality in Meilisearch depends heavily on how text is broken down into tokens. Because each language has its own writing system and rules, different languages call for different tokenization strategies:

- **Space-separated languages** (English, French, Spanish): Words are clearly separated by spaces, making them straightforward to tokenize.
- **Non-space-separated languages** (Chinese, Japanese): Words are written continuously without spaces. These languages require specialized tokenizers to correctly split text into searchable units (see the sketch after this list).
- **Languages with compound words** (German, Swedish): Words can be combined to form long terms, such as _Donaudampfschifffahrtsgesellschaft_ (German for Danube Steamship Company). Meilisearch provides specialized tokenizers to process them correctly.
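For non-space-separated languages, automatic language detection can struggle with short queries, so it often helps to state the locale explicitly. A minimal sketch, again with a hypothetical `movies` index, host, and API key; the ISO-639-3 locale codes (`jpn`, `cmn`) assume a Meilisearch version that accepts them:

```ts
import { MeiliSearch } from 'meilisearch'

const client = new MeiliSearch({ host: 'http://localhost:7700', apiKey: 'aSampleMasterKey' })
const index = client.index('movies')

// The same CJK characters can plausibly be Japanese or Chinese; an explicit
// locale tells Meilisearch which tokenizer pipeline to apply to the query.
const japanese = await index.search('東京', { locales: ['jpn'] })
const chinese = await index.search('东京', { locales: ['cmn'] })
```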
### Normalization differences

Normalization ensures that different spellings or character variations (like accents or case differences) are treated consistently during search.

- **Accents and diacritics**: In many languages, accents can often be ignored without losing meaning (e.g., _éléphant_ vs _elephant_). In other languages, such as Swedish, diacritics may represent entirely different letters, so they must be preserved.
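To illustrate the first case, here is a minimal sketch, assuming a hypothetical `articles` index where a document such as `{ "id": 1, "title": "Un éléphant dans la pièce" }` has already been indexed:

```ts
import { MeiliSearch } from 'meilisearch'

const client = new MeiliSearch({ host: 'http://localhost:7700', apiKey: 'aSampleMasterKey' })

// The unaccented query still matches the accented document, because
// default normalization maps 'éléphant' and 'elephant' to the same tokens.
const { hits } = await client.index('articles').search('elephant')
console.log(hits)
```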
This content is already available in the tokenization article and is to some extent theoretical knowledge. Given that we generally want guides to be more oriented towards solving problems, and that this article is pretty long, I think we could skip these two sections and perhaps add links to the tokenization article in the middle of the body copy.
Pull Request

## Related issue

Fixes #<issue_number>

## What does this PR do?

## PR checklist

Please check if your PR fulfills the following requirements:

Thank you so much for contributing to Meilisearch!