
Conversation

@CaroFG (Contributor) commented on Sep 15, 2025

Pull Request

Related issue

Fixes #<issue_number>

What does this PR do?

  • Adds a guide on handling multilingual datasets

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


@CaroFG closed this on Sep 15, 2025
@CaroFG reopened this on Sep 15, 2025
@guimachiavelli (Member) left a comment


Since the "Guides" section atm is very focussed on integrating Meilisearch with third-party libs/tools (e.g. use meilisearch with mistral, deploy on AWS, etc) I'm thinking that adding it to /learn/indexing might actually make more sense. What do you think?

Other than that, I made a couple of suggestions to reduce the size of the article a bit.

Handling multilingual datasets in Meilisearch requires careful planning of both indexing and querying.
By choosing the right indexing strategy and explicitly configuring languages with `localizedAttributes` and `locales`, you ensure that documents and queries are processed consistently.
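
For example, here is a minimal sketch of that alignment, assuming Meilisearch ≥ v1.10 and the official `meilisearch` JS client; the index and attribute names are hypothetical:

```ts
import { MeiliSearch } from 'meilisearch'

const client = new MeiliSearch({ host: 'http://localhost:7700' })
const index = client.index('articles') // hypothetical index

// Index side: declare which locale each attribute holds, so documents
// are tokenized with the right pipeline (settings updates run as async tasks).
await index.updateSettings({
  localizedAttributes: [
    { attributePatterns: ['title_en', 'overview_en'], locales: ['eng'] },
    { attributePatterns: ['title_ja', 'overview_ja'], locales: ['jpn'] },
  ],
})

// Query side: pin the same locale so the query text is tokenized
// consistently with the indexed documents.
const results = await index.search('進撃の巨人', { locales: ['jpn'] })
```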

This alignment leads to more accurate results, better relevancy across languages, and a smoother search experience for users—no matter what language they search in.
A project member suggested removing this sentence:
This alignment leads to more accurate results, better relevancy across languages, and a smoother search experience for users—no matter what language they search in.

Comment on lines +7 to +32

## Tokenizers and language differences

Search quality in Meilisearch depends heavily on how text is broken down into tokens. Because each language has its own writing system and rules, different languages call for different tokenization strategies:

- **Space-separated languages** (English, French, Spanish):

Words are clearly separated by spaces, making them straightforward to tokenize.

- **Non-space-separated languages** (Chinese, Japanese):

Words are written continuously without spaces. These languages require specialized tokenizers to correctly split text into searchable units (see the sketch after this list).

- **Languages with compound words** (German, Swedish):

Words can be combined to form long terms, such as _Donaudampfschifffahrtsgesellschaft_ (German for Danube Steamship Company). Meilisearch provides specialized tokenizers to process them correctly.
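
To make the non-space-separated case concrete, here is a short sketch, assuming Meilisearch ≥ v1.10 and the `meilisearch` JS client (the `movies` index is hypothetical). A kanji-only query such as 東京 is valid in both Japanese and Chinese, so language detection can guess wrong; an explicit `locales` hint picks the segmenter:

```ts
import { MeiliSearch } from 'meilisearch'

const client = new MeiliSearch({ host: 'http://localhost:7700' })

// Without a hint, '東京' could be segmented as Japanese or Chinese;
// pinning the locale forces the Japanese tokenization pipeline.
const { hits } = await client.index('movies').search('東京', {
  locales: ['jpn'],
})
```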

### Normalization differences

Normalization ensures that different spellings or character variations (like accents or case differences) are treated consistently during search.

- **Accents and diacritics**:

In many languages, accents can often be ignored without losing meaning (e.g., éléphant vs elephant).

In other languages like Swedish, diacritics may represent entirely different letters, so they must be preserved.
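
A quick way to see this at work, as a sketch assuming the `meilisearch` JS client and its `waitForTask` helper (the index name and document are hypothetical):

```ts
import { MeiliSearch } from 'meilisearch'

const client = new MeiliSearch({ host: 'http://localhost:7700' })
const index = client.index('animals') // hypothetical index

// Document additions are asynchronous tasks; wait before searching.
const task = await index.addDocuments([{ id: 1, name: 'éléphant' }])
await client.waitForTask(task.taskUid)

// For most Latin-script content, accents are normalized away by default,
// so the unaccented query still matches the accented word.
const { hits } = await index.search('elephant')
console.log(hits[0].name) // 'éléphant'
```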
A project member commented:

This content is already available in the tokenization article and is, to some extent, theoretical knowledge. Given that we generally want guides to be oriented towards solving problems, and that this article is pretty long, I think we could skip these two sections and perhaps add links to the tokenization article in the middle of the body copy.

Suggested change: delete the "Tokenizers and language differences" and "Normalization differences" sections quoted above.
